MFI-Net: A multi-resolution fusion input network for retinal vessel segmentation

Segmentation of retinal vessels is important for doctors to diagnose some diseases. The segmentation accuracy of retinal vessels can be effectively improved by using deep learning methods. However, most existing methods extract shallow features incompletely and lose some superficial features, which leads to blurred vessel boundaries and inaccurate capillary segmentation. At the same time, the “layer-by-layer” information fusion between encoder and decoder prevents the feature information extracted in the shallow layers of the network from being transferred smoothly to the deep layers, which introduces noise into the segmentation features. In this paper, we propose MFI-Net (multi-resolution fusion input network) to alleviate these problems to a certain extent. The multi-resolution input module in MFI-Net avoids the loss of coarse-grained feature information in the shallow layers by extracting local and global feature information at different resolutions. We also reconsider the information fusion method between the encoder and the decoder and use information aggregation to alleviate the information isolation between the shallow and deep layers of the network. MFI-Net is verified on three datasets: DRIVE, CHASE_DB1 and STARE. The experimental results show that our network performs at a high level on several metrics, with F1 higher than U-Net by 2.42%, 2.46% and 1.61% and higher than R2U-Net by 1.47%, 2.22% and 0.08%, respectively. Finally, experiments and discussion on the stability and generalization ability of MFI-Net demonstrate its robustness.


Introduction
The retina contains a large number of blood vessels and is the only vascular system in the body that can be observed in depth by non-invasive means. Common diseases such as retinal arterial and venous occlusion, high blood pressure and diabetes produce symptoms in the retinal blood vessels, so timely detection of changes in the length, width, curvature, branching pattern, and transparency of retinal vessels [1] offers a high chance of avoiding blindness caused by these diseases [2]. Segmentation of retinal vessel images has become an important task in modern medically assisted treatment and diagnosis. Traditional segmentation approaches are labor-intensive and time-consuming, their results vary greatly, and accurate segmentation is difficult to achieve [3]. With the development of computer hardware and software and the increasing maturity of related technologies, deep learning methods for automatic segmentation of medical images have both reduced the burden on the workers involved and significantly improved segmentation accuracy. Given the growing significance of deep learning for medical image segmentation, more and more professionals are studying and using deep learning methods for automatic segmentation of retinal vascular images.
Retinal vessels are difficult to segment from the background, pathological regions, and other noise due to illumination imbalance problems. Therefore, a large amount of work has been devoted to solving these problems while improving the segmentation accuracy of the network. The following major research results have been presented in recent years: Zhang et al. [4] proposed a matched filter segmentation algorithm to detect blood vessels by thresholding the response of the retinal image to the matched filter, which significantly reduced the false detection generated by the original matched filter. Wang et al. [5] transformed the retinal image using 2D Gabor wavelets of different scales and applied morphological reconstruction. Oliveira et al. [6] proposed an unsupervised segmentation method using Frangi filters and Gabor wavelet filters to enhance and segment retinal vessel images. Liskowski et al. [7] proposed a supervised segmentation algorithm, which performed contrast normalization and zero-phase whitening on retinal images with a strong ability to combat noise, but the segmentation accuracy was low and could not segment some of the capillaries. Aslani et al. [8] used a mixed feature vector to train a random forest classifier for a supervised retinal vessel segmentation task, and the accuracy on DRIVE and STARE reached 95.13% and 96.05%, but with low sensitivity. Marin et al. [9] proposed a novel supervised method for vessel detection in digital retinal images. This method uses neural networks for pixel classification and computes a 7-dimensional vector consisting of gray levels and moment invariance-based features for pixel representation, obtaining 94.52% accuracy on the DRIVE dataset. Dharmawan et al. [10] proposed a novel hybrid algorithm based on U-Net [11] for retinal vessel segmentation of fundus images. Alom et al. [12] designed a recursive convolutional neural network RU-Net based on U-Net and R2U-Net using recursive residual convolution. Zhang et al. [13] took full advantage of low-level features and high-level features to design a novel network and utilized infinite convolution to obtain multi-scale features. Li et al. [14] improved the U-Net network by using an attention mechanism that can improve the segmentation at capillaries and improve the segmentation accuracy at vessel boundaries.
At present, the research on the segmentation of retinal blood vessels has also made great progress, but the accuracy and performance of the segmentation results of most methods still have a lot of room for improvement. At the same time, the problems of blurred blood vessel boundaries, obvious noise and inaccurate segmentation of capillaries have not been effectively solved. In order to solve the problem of vascular edge blurring, we designed a novel multi-resolution and multi-scale fusion input module, aiming to enhance the ability of accurate segmentation of boundaries by extracting more and richer feature information in shallow layers. The fusion of encoder and decoder information in the network is also redesigned, and the fusion of shallow and deep features is reconsidered so that the segmentation map recovered from the upsampling process can introduce less noise and segment a more complete blood vessel.
In summary, we make the following four contributions:
1. […] ability to resist noise. At the same time, we combine channel attention to strengthen the contribution of the main features to the network and suppress the influence of the diseased area on the network.
2. We designed a multi-resolution fusion input module to extract shallower coarse-grained feature information and enhance capillary segmentation, and use spatial attention and channel attention to enhance vessel boundaries to obtain a more accurate and clear retinal vessel segmentation map.
3. We conducted ablation experiments and comparative experiments on three datasets: DRIVE, CHASE_DB1 and STARE. Experimental results show that the segmentation performance of the network model proposed in this paper is better than that of currently popular network models.
4. We experimented with and analyzed the stability and generalization ability of the network, and discussed the robustness of the network model. We also give the experimental data and visualization results of the cross-test to provide a reference for subsequent researchers.

Related work
With the proposal of various semantic segmentation network architectures, image segmentation, a major research topic in computer vision, has developed tremendously in recent years. Among the currently popular segmentation network models, networks with skip connections and encoder-decoder structures perform well in segmentation tasks, so a lot of work builds on this type of network model. Multi-scale input can improve the feature extraction ability of network models, so it is widely used in many studies. This section introduces this related work, mainly covering encoder-decoder segmentation network models and multi-scale input network models. Finally, it explains the main content of our work and how it differs from existing work.

Segmentation network of encoder-decoder structure
As early as 1985, David et al. [15] made the first attempt at an auto-encoder algorithm with an encoder-decoder structure on the Boltzmann machine. In 2006, Hinton et al. [16] used and systematically introduced the auto-encoder in their research. The auto-encoder is composed of two parts: an encoder and a decoder. The encoder part extracts image features through down-sampling, and the decoder part gradually restores the image through up-sampling, which ultimately serves the purposes of image compression and image denoising. Because the auto-encoder has end-to-end characteristics, Long et al. [17] merged the encoder-decoder structure into the convolutional neural network, proposed the FCN network model, and applied it to the segmentation task. FCN is an early successful network model that uses convolutional networks for segmentation tasks and has been widely used in various semantic segmentation tasks. In the task of medical image segmentation, datasets are often difficult to obtain, and the number of samples in the obtained dataset is usually small. Since medical images contain different types of physiological structures of the human body, most medical images are very complex, but the sample images in the same dataset are highly similar, so across the entire sample space, medical image samples have a single structure and simple semantic information. Because of the limitations of its network structure and the particularity of medical images, FCN performs only moderately in most medical image segmentation tasks. In 2015, the U-Net network model proposed by Ronneberger et al. [11] solved this problem to a certain extent. U-Net deepens the network on the basis of FCN, uses more down-sampling and convolution operations in the encoder part to fully extract the feature information in the image, and uses four consecutive up-sampling operations in the decoder stage to restore the image, so that the edges of the resulting image are finer.
U-Net uses skip connections in the corresponding stages of the encoder and decoder to promote the transmission and fusion of semantic information, ensuring that more low-level feature information can be used when performing image restoration in the decoder part. Compared with the FCN network, U-Net can extract more feature information from a dataset with a small number of samples, and accurately segment complex medical images. The encoder-decoder type segmentation network structure has since been further developed. Because U-Net performs well in the segmentation task of medical images, a lot of related work has begun to use it as the basic network. Zhou et al. [18] proposed UNet++, which rethinks and redesigns the skip connection part of the U-Net network, and discusses the optimal number of layers of the U-Net network. The R2U-Net proposed by Alom et al. [12] combines U-Net with the residual network and the recurrent neural network and improves the network results by strengthening the extraction of low-level feature information. Oktay et al. [19] integrated the attention mechanism into U-Net, and improved the network by suppressing irrelevant areas in the picture and strengthening more useful features.

Multi-scale input network
The purpose of multi-scale fusion is to extract different features by processing feature maps of different scales. Low-resolution feature maps can extract more complete global feature information, and high-resolution feature maps contain more detailed information. Cai et al. [20] proposed a multi-scale convolutional neural network model MS-CNN for object detection, which performs different processing on features of multiple scales, and can obtain and use different feature information. The atrous spatial pyramid pooling (ASPP) proposed by Chen et al. [21] can capture objects and their context information at multiple scales. Huang et al. [22] designed a unique multi-scale network model to extract shallow and deep feature information, which is conducive to extracting more complete shallow feature information. Zhao et al. [23] extracted multi-scale feature information through pyramid pooling and used local and global context information to make pixel prediction more reliable. Because the structure of multi-scale fusion can extract richer features from the network model, the multi-scale input module is gradually applied to the segmentation network. Zhao et al. [24] used multi-scale input in their design of ICNet segmentation network model and achieved good results in the task of scene segmentation, proving that using multi-scale and multi-resolution input structure can improve the performance of the segmentation network model. Liu et al. [25] also used a multi-scale input structure in the optical coherence tomography (OCT) segmentation task, which effectively enhanced the segmentation ability. Jiang et al. [26] used multi-scale input in the network model to ensure the transmission of original image features and achieved good results in the task of retinal blood vessel segmentation.

Methods of this paper
Compared with the existing methods, this paper rethinks multi-scale input and skip connections, designs a multi-resolution fusion input (MR, Multi-Resolution) module and a fully aggregated skip connection (FAS, Fully Aggregated Skip connection), and proposes the MFI-Net segmentation network model. The multi-resolution fusion input structure can extract and transfer more complete superficial features, further optimize the boundaries of retinal vessel segmentation results, and alleviate the problem of missing capillaries in the segmentation. The fully aggregated skip connection part can aggregate and fully integrate multiple levels of semantic and feature information, which greatly reduces the noise in the segmentation result.

MFI-Net segmentation network
The MFI-Net proposed in this paper is shown in Fig 1. The 48×48 pixel retinal blood vessel image is input to the MR module, which processes the original image and, through gradual fusion, generates feature maps at five resolutions. These five feature maps are then input to the encoder part of the segmentation network for further feature extraction. The decoder part gradually restores the segmentation map through upsampling, and the output of each upsampling step is concatenated with the shallow encoder information processed by the fully aggregated skip connection (FAS) part. Finally, the segmentation result is obtained after a 1×1 convolution and a Softmax operation. Next, we introduce the important parts of the model in detail.

Residual unit
As the network structure continues to deepen, many networks will have the problem of gradient disappearance or gradient explosion, which leads to network degradation. In order to solve the above problems, He et al. [27] proposed a residual network. The residual unit is the basic structural unit of the residual network. Its structure is shown in Fig 2. The process can be expressed by Eqs (1) and (2).
where x_l and x_{l+1} are the input and output of the l-th residual unit, l represents the number of jumps, F is the residual function, which represents the learned residual relationship, h(x_l) = x_l represents the identity mapping, W_l represents the weights, and f is the ReLU activation. Therefore, the features learned from a shallow unit l to a deep unit L can be represented by Eq (3).
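Eqs (1) to (3) are not reproduced in this text. Assuming they follow the standard residual-unit formulation of He et al. [27], which matches the symbols defined above, they would read as follows. The intermediate variable y_l is introduced here only for illustration, and Eq (3) is obtained by unrolling the recurrence under the assumption that both h and f act as identity mappings.

```latex
\begin{align}
y_l &= h(x_l) + F(x_l, W_l) \tag{1} \\
x_{l+1} &= f(y_l) \tag{2} \\
x_L &= x_l + \sum_{i=l}^{L-1} F(x_i, W_i) \tag{3}
\end{align}
```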
Inspired by He et al. [27], this paper uses residual units in place of ordinary convolutions. Using the residual module instead of ordinary convolution has the following advantages:
1. The residual network can be regarded as composed of multiple shallow networks, and there will be no network degradation inside each shallow network.
2. Because the residual module introduces a skip connection, the information in the network can flow into each residual unit without hindrance, which improves the information circulation capacity in the network.
3. In forward propagation, the feature information learned by the next layer must be equal to or more than the current layer.
4. In backpropagation, after introducing the residual module, each layer is more sensitive to changes in output, and the weight adjustment is more subtle.

Multi-resolution fusion input module
The multi-resolution fusion input module relies on the high-resolution image to gradually update the existing resolution feature image and generate the low-resolution feature image through repeated information exchange and information fusion in multiple stages. The process of updating the existing resolution feature image and the process of generating the new resolution can be represented by Eqs (4) to (6), where Eq (5) is the feature image updating process and Eq (6) is the new feature image generation process.
where r_i denotes the feature image of the i-th resolution and r_1 denotes the image of the first resolution, i.e., the original input image; r_m denotes the current resolution, and r_{n+1} denotes the feature image of the new resolution. When i < m, the resolution of the feature image of the current operation is higher than the resolution of the target feature image that needs to be fused and updated, and the feature image is downsampled using a maximum pooling (Max-Pool) with a stride of 2^{m−1} and a filter size of 2^{m−1} × 2^{m−1} and a 3×3 convolution (Conv). When i = m, the resolution of the feature map of the current operation is the same as the resolution of the target feature map, and the feature map copy operation is performed. When i > m, the resolution of the feature map of the current operation is lower than the resolution of the target feature image, and the feature image of the current operation is upsampled using a transposed convolution (TranConv) with an upsampling rate of 2^{i−m}.
In order to ensure that the features added in the feature map fusion update process do not have too much noise, this module uses a spatial attention mechanism [28] in the fusion stage to suppress areas of non-main features such as noise, focusing on the retinal blood vessel area. At the same time, channel attention [28] is used for the feature image after fusion to increase the contribution of the main feature to the network. Spatial attention and channel attention are defined by Eqs (7) and (8).
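Eqs (7) and (8) are not reproduced in this text. As a hedged sketch, assuming the attention mechanisms of [28] take the widely used form in which spatial attention pools along the channel axis and channel attention pools over the spatial dimensions, they might read as follows. The names SA and CA, the pooling operators, and the unspecified kernel size of Conv are assumptions for illustration, not taken from the source.

```latex
\begin{align}
\mathrm{SA}(r) &= \sigma\bigl(\mathrm{Conv}\bigl([\mathrm{AvgPool}(r);\ \mathrm{MaxPool}(r)]\bigr)\bigr) \tag{7} \\
\mathrm{CA}(r) &= \sigma\bigl(\mathrm{MLP}(\mathrm{AvgPool}(r)) + \mathrm{MLP}(\mathrm{MaxPool}(r))\bigr) \tag{8}
\end{align}
```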
Here r represents the feature map, σ is the Sigmoid activation, and MLP is the multilayer perceptron. Therefore, the final mathematical expression for updating and generating feature maps in the multi-resolution fusion input module is Eq (9), where the resampled r_i denotes the result of processing the feature map of the i-th resolution to the size of the current resolution feature map (the m-th resolution feature map).
In simple terms, all feature maps above the current resolution are downsampled, all feature maps below the current resolution are upsampled, and the results are overlaid with the feature maps of the current resolution to update the current feature map. Therefore, the subsequent parallel feature map contains all the resolution feature information of the previous stage. In addition to updating the feature information of the existing resolution, a new feature map of lower resolution is generated, which contains all the important feature information of the existing resolution.
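The fusion rule summarized above can be sketched as a small planning function. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, each resolution index k is assumed to have side length base / 2^(k−1), and the resampling factor is taken as 2^|m−i|, the value needed to match sizes (for downsampling this coincides with the 2^(m−1) stated in the text when the source is the original image, i = 1).

```python
def resample_plan(num_existing):
    """Plan how each target resolution m is built from existing resolutions 1..n.

    Returns {m: [(source_index, operation, factor), ...]} for every existing
    resolution (update) plus one new, lower resolution (generation).
    """
    plan = {}
    for m in range(1, num_existing + 2):
        steps = []
        for i in range(1, num_existing + 1):
            factor = 2 ** abs(m - i)  # assumed scale change between index i and m
            if i < m:
                # Source has higher resolution than the target: max-pool + conv.
                steps.append((i, "downsample", factor))
            elif i == m:
                # Same resolution: the feature map is copied.
                steps.append((i, "copy", 1))
            else:
                # Source has lower resolution than the target: transposed conv.
                steps.append((i, "upsample", factor))
        plan[m] = steps
    return plan


def side_length(base, k):
    """Side length of the k-th resolution, assuming halving at each level."""
    return base // 2 ** (k - 1)


# Second stage of the MR module: two existing resolutions (1x and x/2).
plan = resample_plan(2)
for m, steps in plan.items():
    print(m, side_length(48, m), steps)
```

Under these assumptions, a 48×48 input yields side lengths 48, 24, 12, 6 and 3 for the five resolutions mentioned in the text.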
For a better understanding, we explain the second stage of the MR module in Fig 1 separately. The previous stage generates two feature maps of different resolutions and passes them to the second stage. The 1x resolution feature map is updated by concatenating the upsampled x/2 resolution feature map with the 1x resolution feature map itself. The x/2 resolution feature map is updated by concatenating the x/2 resolution feature map itself with the downsampled 1x resolution feature map. The new x/4 resolution feature map is generated by concatenating the downsampled 1x resolution feature map with the downsampled x/2 resolution feature map. Each result is then processed by the attention mechanism and either input to the network or passed to the next stage.

Fully aggregated skip connections
As shown in the network structure in Fig 1, the skip-connection part FAS contains two parts: the information aggregation operation and the attention mechanism. The aggregation operation superimposes the feature maps of each resolution after up-sampling or down-sampling, similar to the information fusion part in the multi-resolution fusion input module. In this way, the FAS uses the information of the feature maps at different levels, and channel attention then increases the weight of the key information in the aggregated features. The output of each level of the FAS can be expressed by Eq (10), where S_m denotes the output information of the m-th level of the skip connection part.

Dataset
We conducted experiments on three mainstream retinal vascular image datasets, DRIVE [3], CHASE_DB1 [29] and STARE [30]. Fig 3 shows the sample images from the three datasets, including the original retinal vascular medical images, the resultant maps manually segmented by the physicians, and the masked maps.
The DRIVE dataset consists of 40 retinal images containing 7 pathology images. Twenty were taken as training samples and the other 20 as test samples. The size of the images was 565×584 pixels and these images were taken by a Canon camera at 45 degree field of view (FOV) [3]. Each image was manually segmented by two experts, and the manual segmentation result of the first expert was selected as the label.
The CHASE_DB1 dataset consists of 28 retinal images of 14 affected children. Twenty images were selected for training and another 8 images for testing. The size of the images was 960×999 pixels and these images were taken by the Nidek camera at a 30 degree field of view (FOV) [29]. Again, the segmentation result of the first of the two experts was chosen to be used as a label. Unlike the DRIVE images, the sample images in the CHASE_DB1 dataset have less visible blood vessels in the images due to uneven illumination.
The STARE dataset consists of 20 retinal images, including 10 pathological images. Due to the small number of samples in the data set, we use the leave-one-out method [31] for training and testing. The size of each picture is 700×605 pixels, and these pictures were taken by a Top-Con camera in a 35 degree field of view (FOV) [30]. Similarly, the segmentation result of the first expert is selected as the label from the manual segmentation results of the two experts.

Pre-processing
In this paper, we adopted the preprocessing method proposed by Jiang et al. [26] for retinal images, and performed channel separation, grayscale processing, normalization processing, CLAHE processing, and gamma nonlinearization processing on the images. Fig 4 shows the results of the retinal images after the above mentioned treatments were applied to each of the retinal images, and it is clear that after the above preprocessing operations the contrast between the blood vessels and other parts of the retinal images is more obvious and the vascular parts become clearer, thus allowing our network to learn the data distribution of the images better when performing training [26].

PLOS ONE

It is necessary to perform channel separation before grayscale processing. Owing to the characteristics of fundus vascular images, the G channel after channel separation has less noise and clearer vessels, which is more conducive to image segmentation. Fig 4(a)-4(d) show the original image and the separated single-channel grayscale images of red, green and blue, respectively. It has been experimentally demonstrated [26] that the blood vessels are clearest in the grayscale images obtained by fusing the red, green and blue channels in the ratio of 29.9%, 58.7% and 11.4%. The results are shown in Fig 4(e). Fig 4(f) shows the image after grayscale processing, in which there is a sharper contrast between the image area and the background. To further enhance the contrast, contrast-limited adaptive histogram equalization (CLAHE) is used in the preprocessing to enhance the contrast between the effective regions and the background across the whole dataset, as shown in Fig 4(g). Finally, gamma correction with a gamma value of 1.2 is used to further improve the image quality, and the results are shown in Fig 4(h).
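The per-pixel arithmetic of the grayscale fusion and gamma steps can be sketched in plain Python. This is an illustration under stated assumptions rather than the preprocessing code of [26]: the function names are hypothetical, normalization is assumed to be min-max scaling to [0, 1], gamma correction is assumed to take the form out = in^(1/gamma), and the CLAHE step is omitted because it needs an image library.

```python
def to_gray(r, g, b):
    """Fuse the three channels with the 29.9% / 58.7% / 11.4% weights from the text."""
    return 0.299 * r + 0.587 * g + 0.114 * b


def normalize(values):
    """Min-max scale intensities to [0, 1] (assumed form of the normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def gamma_correct(value, gamma=1.2):
    """Gamma correction of a [0, 1] intensity, assuming out = in ** (1 / gamma)."""
    return value ** (1.0 / gamma)


# Three example RGB pixels (illustrative values only).
pixels = [(200, 120, 60), (30, 20, 10), (255, 255, 255)]
gray = [to_gray(*p) for p in pixels]
scaled = normalize(gray)
corrected = [gamma_correct(v) for v in scaled]
print(corrected)
```

With this convention a gamma of 1.2 brightens mid-range intensities while leaving 0 and 1 unchanged; the opposite convention (out = in^gamma) would darken them, so the direction should be checked against [26].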

Evaluation indicators
To quantitatively evaluate the results of the proposed MFI-Net network in the fundus vascular segmentation task, we used several commonly used evaluation metrics to assess the overall performance of our method, including Sensitivity, Specificity, Accuracy, and F1. In Eqs (11) to (16), TP denotes the number of vessel pixels correctly labeled as vessel, TN denotes the number of background pixels correctly labeled as background, FP is the number of background pixels incorrectly labeled as vessel, and FN is the number of vessel pixels incorrectly labeled as background.
Sensitivity is defined by Eq (11) and indicates the proportion of vessel pixels in the image that are correctly segmented. It reflects how many vessel pixels are missed in the segmentation result: the closer the sensitivity is to 1.0, the fewer vessel pixels are missed and the better the segmentation.
Specificity is defined by Eq (12) and represents the proportion of background pixels that are correctly identified as background. It reflects the proportion of pixels mis-segmented as vessel: the higher the specificity (the closer to 1.0), the fewer background pixels are incorrectly segmented as vessel and the better the segmentation.
Accuracy (ACC) is defined as shown in Eq (13), which indicates the percentage of correctly segmented pixels in the segmentation result map (including vascular pixels and background pixels) in the whole segmentation map, reflecting the overall segmentation accuracy. However, since the pixel points in the black area of the segmentation result map account for the majority of the whole map, this index can reflect the segmentation result to a certain extent, but it cannot accurately evaluate the performance of the segmentation method.
F1 is a very common measure for binary classification models. It takes into account both the Precision and the Recall of the model and is their harmonic mean. Eqs (14) to (16) are the definitions of Precision, Recall and F1, respectively. The higher the F1 (the closer to 1.0), the better the segmentation.
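The metrics described above can be computed from the four pixel counts as in the sketch below. It uses the standard definitions of these metrics and assumes binary masks flattened to lists of 0 (background) and 1 (vessel); the function names and the toy masks are illustrative.

```python
def confusion_counts(pred, label):
    """Count TP, TN, FP, FN for binary masks given as flat lists of 0/1."""
    tp = tn = fp = fn = 0
    for p, y in zip(pred, label):
        if p == 1 and y == 1:
            tp += 1  # vessel pixel labeled as vessel
        elif p == 0 and y == 0:
            tn += 1  # background pixel labeled as background
        elif p == 1 and y == 0:
            fp += 1  # background pixel labeled as vessel
        else:
            fn += 1  # vessel pixel labeled as background
    return tp, tn, fp, fn


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, precision, recall and F1."""
    sensitivity = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = sensitivity  # recall and sensitivity share the same definition
    f1 = safe_div(2 * precision * recall, precision + recall)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


pred = [1, 1, 0, 0, 1, 0, 0, 0]
label = [1, 0, 0, 0, 1, 1, 0, 0]
counts = confusion_counts(pred, label)
result = metrics(*counts)
print(counts, result)
```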

Training
The machine used in the experiments has an Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz, 187 GB of memory at runtime, and a Quadro RTX 6000 GPU with 24 GB of memory. The operating system is Linux, the network model is built in Python 3.7, and the main libraries used are PyTorch 1.4, OpenCV 4.1.2 and NumPy 1.18.1. All experiments are done in this environment. The setting of hyperparameters has a great influence on the performance and results of the experiment and is introduced below. For the DRIVE dataset with 40 images, 20 samples containing lesion images are selected as the training set, and the other 20 samples containing lesion images are reserved as the test set. Similarly, 20 images in the CHASE_DB1 dataset are used as the training set, and the other 8 images are used as the test set. Since there are only 20 images in the STARE dataset, the leave-one-out method was used for training in order to make the training effect as good as possible.
Dividing the dataset images into small patches is conducive to data expansion, can effectively reduce the risk of overfitting, and improves the performance of the model, so the dataset is divided into patches. Experimental verification shows that the model reaches a better level of performance when the images are divided into patches of 48×48 pixels. Both the training set and the test set are divided in this way. Because image sizes differ between datasets, the images are padded with black pixels to ensure that each image can be cut into a whole number of patches.
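The padding arithmetic can be sketched as follows. This is a minimal illustration with hypothetical function names: it assumes padding is added until the sliding window of a given stride lands exactly on the image border, with a stride equal to the patch size giving non-overlapping patches.

```python
def padded_length(length, patch=48, stride=48):
    """Smallest length >= `length` that a patch window of the given stride tiles exactly."""
    if length <= patch:
        return patch
    steps = -(-(length - patch) // stride)  # ceiling division
    return patch + steps * stride


def patches_per_axis(length, patch=48, stride=48):
    """Number of patch positions along one axis after padding."""
    return (padded_length(length, patch, stride) - patch) // stride + 1


# DRIVE image size from the text: 565 x 584 pixels.
width, height = 565, 584
padded = (padded_length(width), padded_length(height))
count = patches_per_axis(width) * patches_per_axis(height)
print(padded, count)  # non-overlapping 48x48 patches
```

Passing `stride=5` instead models overlapping extraction with a sampling step of 5, in which case far less padding is needed.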
For the DRIVE and CHASE_DB1 datasets, we set the training batch size to 128, the patch size to 48 pixels, the overlap sampling step to 5, the number of dynamically extracted patches to 10,000, and the threshold to 0.47. For the STARE dataset, which is trained with the leave-one-out method, the training batch size was set to 512, the patch size to 48 pixels, the overlap sampling step to 5, the number of dynamically extracted patches to 38,000, and the threshold to 0.52. The experiments on each dataset were run for 200 cycles, and the global random seed of PyTorch was set to 1234. The default initialization method of the convolutional layer in PyTorch is used to assign the weights and biases in the network. The learning rate used in training was 0.001, and Adam was used as the optimizer of the network, with β1 = 0.9, β2 = 0.999, and ε = 1e-8. In addition to using geometric transformations for data augmentation, we also used the random local replacement algorithm used by Jiang et al. [26] in their work. To avoid overfitting while speeding up the training, we also used learning rate decay in our experiments [32,33].
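For reference, the hyperparameters listed above can be collected into a small configuration structure. The key names and the helper function are hypothetical; "200 cycles" is read here as 200 epochs and "ε = e-8" as 1e-8, both of which are assumptions, and the learning rate decay schedule is left out because the text does not specify it.

```python
# Settings shared by all three datasets (values from the text).
COMMON = {
    "patch_size": 48,
    "sampling_stride": 5,
    "epochs": 200,            # assumption: "cycles" means epochs
    "seed": 1234,
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "betas": (0.9, 0.999),
    "eps": 1e-8,              # assumption: "e-8" means 1e-8
}

# Dataset-specific settings (values from the text).
PER_DATASET = {
    "DRIVE": {"batch_size": 128, "num_patches": 10_000, "threshold": 0.47},
    "CHASE_DB1": {"batch_size": 128, "num_patches": 10_000, "threshold": 0.47},
    "STARE": {
        "batch_size": 512,
        "num_patches": 38_000,
        "threshold": 0.52,
        "leave_one_out": True,
    },
}


def training_config(dataset):
    """Merge the shared settings with those of one dataset."""
    return {**COMMON, **PER_DATASET[dataset]}


print(training_config("STARE"))
```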

Structural ablation.
The MFI-Net proposed in this paper takes U-Net as the backbone network structure and designs a new multi-scale and multi-resolution input structure to ensure that the coarse-grained feature information can be delivered and used to the maximum extent, while improving the skip-connection part of the network to fuse semantic information more fully. We designed ablation experiments to verify the contribution of each component by adding the MR module and the FAS module separately to the U-Net network. To show that the MR module improves the network, we also conducted experiments on the U-Net network with the traditional multi-scale input module MI (Multi-input Block) added. Table 1 shows the experimental results of the structural ablation of each module, with the best results in bold. On all three datasets, the U-Net network with the MR module shows a significant improvement in all metrics compared to the U-Net network with the MI module, with an improvement of 1.04% in F1 on the DRIVE dataset and significant improvements in F1 on both the STARE and CHASE_DB1 datasets. Analysis of the experimental results on the DRIVE, CHASE_DB1 and STARE datasets showed that the FAS module greatly improved the U-Net network, with F1 increasing by 1.60%, 2.42% and 1.54% on the three datasets, respectively.
Comparing the effects of MI and MR on the U-Net network in Table 1 shows that the MR module brings a greater improvement in the segmentation ability of the underlying model. MFI-Net (using the MR module and the FAS module jointly) achieves higher sensitivity and specificity than the other ablation structures, with the best F1 and accuracy. Compared with the benchmark network U-Net, MFI-Net showed 2.42%, 2.46% and 1.81% improvement in F1 on the DRIVE, CHASE_DB1 and STARE datasets, respectively, and 0.45%, 0.22% and 0.23% improvement in ACC, respectively, further demonstrating the effectiveness of our proposed modules. The results show that the ROC values and PR values obtained from the U-Net network using both the FAS and MR modules are the largest on the DRIVE, STARE and CHASE_DB1 datasets, indicating that this model obtains the best performance: it conveys shallow feature information better while extracting more deep feature information and segmenting the blood vessels better. To demonstrate that the results obtained from our ablation experiments are statistically significant, we carried out a P-value analysis of the F1 metric in the segmentation results of each network model in the ablation experiments. We proposed the hypothesis that the former of the two given models has a better F1 than the latter on the given dataset. The results of the P-value analysis for this hypothesis are shown in Table 2.
In Table 2, the P-value for each comparison between ablation structures (that is, the probability of obtaining the observed sample results, or more extreme ones, if the null hypothesis were true) is less than the significance level of 0.05. According to statistical principles, we are therefore justified in rejecting the null hypothesis that the former is no better than the latter and in concluding that, for the network structures under comparison, the F1 of the former is better than that of the latter. In other words, adding the MI, FAS and MR modules to U-Net is effective, and using the FAS and MR modules together gives the best results.

Visualized images show the segmentation results more intuitively, so we visualize and compare the segmentation results of each ablation structure on each of the three datasets. Fig 6 shows the segmentation results of U-Net, U-Net with the MI module, U-Net with the FAS module, U-Net with the MR module, and U-Net with both the FAS and MR modules. In the fundus vessel segmentation task, arterioles (the thicker vessels in the figure) are easier to segment, while the segmentation of capillaries is often not accurate enough. We therefore focus the comparison on the capillaries in the segmentation results. The first column of the figure shows the original retinal image, the second column shows the manually segmented label map, and the third to seventh columns show the segmentation results of U-Net, U-Net with the MI module, U-Net with the FAS module, U-Net with the MR module, and U-Net with both the FAS and MR modules (i.e., MFI-Net), respectively. From top to bottom, the rows show images and segmentation results from the DRIVE, CHASE_DB1 and STARE datasets, respectively. For ease of comparison, we have enlarged some areas in the retinal images, label images and segmentation results, and the areas of interest are marked with red boxes.
As can be seen in Fig 6, all of the above network models segment the thicker arterial vessels clearly, but their results differ at the capillaries. Compared with the backbone U-Net, adding the common multi-input module MI does improve the segmentation results to a certain extent. Compared with the results of the network with the MR module, however, many capillaries are still not segmented. This indicates that the multiple-resolution input images provided by MR give the network more feature information, sharpen vessel boundaries, and improve the accuracy of capillary segmentation, which benefits overall segmentation accuracy. Comparing the segmentation results of the base U-Net with those of U-Net using the FAS module, the latter clearly has higher segmentation accuracy at fine details and less noise, which indicates that the information fusion of the FAS module is effective. Overall, over the course of the structural ablation, the accuracy of the segmentation results gradually improved, the noise was noticeably reduced, and both the boundaries and the capillary parts of the vessels were segmented more precisely and clearly; the contrast is especially obvious in the areas marked in the figure. These observations show that the network structure proposed in this paper is feasible and effective in a real segmentation task, and that the improved network model obtains better segmentation results.

Model comparison.
We also compare the segmentation results of MFI-Net with those of currently popular fundus blood vessel segmentation networks, such as UNet++, M2-UNet and AA-UNet, on the DRIVE, CHASE_DB1 and STARE datasets. Tables 3 to 5 list the segmentation results of the different methods on the three datasets in terms of four evaluation indicators: F1, accuracy, sensitivity and specificity. The tables show that MFI-Net performs relatively well on all indicators compared with the other methods. On the DRIVE dataset, our method increases F1 by 0.16% to 1.47%, with a high level of accuracy, sensitivity and specificity. On the CHASE_DB1 dataset, our method increases F1 by 0.11% to 2.67%, accuracy by up to 1.54%, and sensitivity by up to 5.53%. On the STARE dataset, both F1 and accuracy improve: F1 increases by at least 0.08% and accuracy by at least 0.13%. To compare and evaluate the performance of MFI-Net more visually, we implemented the UNet++ and AA-UNet network models, which are still very popular, and applied them to the retinal vessel segmentation task. Fig 7 shows the visualization of the UNet++, AA-UNet and MFI-Net segmentation results.
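For reference, the four evaluation indicators can be computed from a predicted vessel mask and its manual label. The sketch below assumes the standard confusion-matrix definitions with vessel pixels as the positive class; the function name `vessel_metrics` is hypothetical.

```python
def vessel_metrics(pred, truth):
    """Compute F1, accuracy, sensitivity and specificity.

    pred and truth are flat sequences of 0/1 pixel labels of equal
    length, where 1 marks a vessel pixel and 0 marks background.
    """
    tp = fp = tn = fn = 0
    for p, t in zip(pred, truth):
        if p and t:
            tp += 1      # vessel pixel correctly detected
        elif p and not t:
            fp += 1      # background pixel labeled as vessel
        elif not p and t:
            fn += 1      # vessel pixel missed
        else:
            tn += 1      # background pixel correctly rejected

    def ratio(num, den):
        # Return 0.0 instead of dividing by zero on degenerate masks.
        return num / den if den else 0.0

    return {
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
    }
```

Because vessel pixels are a small fraction of a fundus image, accuracy and specificity tend to be dominated by the background class, which is one reason F1 and sensitivity are the more discriminating numbers in comparisons like Tables 3 to 5.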
On the DRIVE and CHASE_DB1 datasets, some blood vessels are missing from the segmentation results of UNet++, while the segmented images of AA-UNet contain obvious noise and are relatively blurred at the capillaries. Because MFI-Net fully fuses the feature information between the encoder and decoder, and uses multi-resolution input to reduce the loss of coarse-grained information at the lower layers of the network during feature extraction, it extracts more comprehensive features and segments more complete blood vessels. The low sensitivity of UNet++ and the relatively high sensitivity and specificity of MFI-Net are reflected in Table 3, where the best results are shown in bold. UNet++ uses a large number of convolution and upsampling operations in its skip connections, which greatly enlarges the receptive field of the network and enables it to extract more feature information, so that relatively clearer segmentation results can be obtained in lesion regions. The segmented images of AA-UNet appear inaccurately segmented and noisy. In contrast, our method uses fewer computational resources than UNet++ but achieves better segmentation results.
Combining Tables 3 to 5 and Fig 7, we conclude that the MFI-Net model performs well on all evaluation indexes. While focusing on the extraction of shallow features, MFI-Net also amplifies the influence of important information, which both alleviates the loss of shallow feature information and keeps the focus on key features. In addition, we use a more reasonable information fusion method to pass useful information from the encoder to the decoder layers and to minimize the noise in the segmentation results. The analysis of the segmentation results indicates that our proposed method performs better and is more advantageous in practical segmentation applications.

Model parameter quantity and computation time analysis.
To evaluate the space and time costs of our network models, we calculated the number of parameters of each ablation network structure, as well as of UNet++ and AA-UNet, and recorded the training time of these models on the different datasets and their segmentation time on a single image under the same experimental setting. The number of parameters and the time overhead of each model are detailed in Table 6. According to Table 6, U-Net has a low time overhead and a small number of parameters because of its simple structure. In contrast, UNet++, AA-UNet and MFI-Net use more modules and a denser structure, which makes their parameter counts relatively large and their time overhead higher. Combining Tables 3 to 5, we conclude that although UNet++ achieves relatively good segmentation results compared with other network models, its large number of parameters and higher time overhead often make it hard to accept. MFI-Net has fewer parameters and a lower time overhead than UNet++ and AA-UNet, and its results are better.
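As a rough illustration of how such parameter counts and per-image timings are obtained, the sketch below counts the learnable parameters of a 2D convolution layer and averages the wall-clock time of a callable. The helper names `conv_params` and `time_call` are hypothetical, and the layer sizes in the usage note are those of a generic U-Net style first block, not values taken from MFI-Net or Table 6.

```python
import time


def conv_params(c_in, c_out, kernel=3, bias=True):
    """Learnable parameters of one 2D convolution layer.

    Each of the c_out filters has kernel * kernel * c_in weights,
    plus one bias term when bias is enabled.
    """
    per_filter = kernel * kernel * c_in + (1 if bias else 0)
    return per_filter * c_out


def time_call(fn, *args, repeats=5):
    """Average wall-clock seconds of fn(*args) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(*args)
    return (time.perf_counter() - start) / repeats
```

For example, a block of two 3x3 convolutions mapping 3 to 64 and then 64 to 64 channels has `conv_params(3, 64) + conv_params(64, 64)` parameters, that is 1792 + 36928 = 38720. Summing such terms over all layers gives the model size, and `time_call` applied to a single-image inference function gives a per-image segmentation time comparable in spirit to the measurements described above.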

Robustness testing
Testing on a single image.
Fig 8 shows the F1 performance evaluation curves of MFI-Net, U-Net, UNet++ and AA-UNet for each image in the three datasets, so that the performance of the networks on individual images can be observed. As the figure shows, on all three datasets our method extracts feature information stably and maintains a relatively stable segmentation quality, with little fluctuation, on both healthy and diseased retinal images, which indicates that the model generalizes well.

Cross-dataset testing.
Retinal images are acquired in different ways, and each of the three datasets used in the experiments was collected with its own acquisition method. Retinal images used in clinical practice therefore often differ in illumination, noise and other properties. To verify the generalization ability of MFI-Net across datasets, we designed a cross-test experiment between multiple datasets. In cross-testing, the model is trained on one dataset and then tested on another. Table 7 shows the results of the cross-test, where the horizontal table header is the dataset used for training and the vertical table header is the dataset used for testing. The corresponding segmentation results are visualized in Fig 9. The F1 and accuracy of the model trained on the DRIVE dataset decreased by 5.52% and 0.4% when tested on the STARE dataset, and by 42.52% and 2.46% when tested on CHASE_DB1. The F1 and accuracy of the model trained on the STARE dataset decreased by 5.30% and 0.96% when tested on the DRIVE dataset, and by 24.11% and 2.13% when tested on CHASE_DB1. The F1 and accuracy of the model trained on the CHASE_DB1 dataset decreased by 15.33% and 2.33% when tested on the DRIVE dataset, and by 2.98% and 0.68% when tested on STARE.
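The cross-test protocol described above can be sketched as a simple loop over training and testing datasets. This is only an illustration: `cross_test`, `drops_vs_in_domain`, `train_fn` and `eval_fn` are hypothetical names, and the sketch assumes that each reported decrease is measured against the same model's score on the test split of its own training dataset, which the text does not state explicitly.

```python
def cross_test(datasets, train_fn, eval_fn):
    """Train on each dataset and evaluate on every dataset.

    datasets maps a name to a dict with "train" and "test" splits.
    train_fn(train_split) returns a model; eval_fn(model, test_split)
    returns a score such as F1. Returns {(train_name, test_name): score}.
    """
    results = {}
    for train_name, train_data in datasets.items():
        model = train_fn(train_data["train"])
        for test_name, test_data in datasets.items():
            results[(train_name, test_name)] = eval_fn(model, test_data["test"])
    return results


def drops_vs_in_domain(results):
    """Score lost when a model is tested away from its training dataset."""
    return {
        (train_name, test_name): results[(train_name, train_name)] - score
        for (train_name, test_name), score in results.items()
        if train_name != test_name
    }
```

Passing the three datasets with their own training and evaluation routines would yield a grid with the same layout as Table 7, and `drops_vs_in_domain` would give the kind of per-pair decreases quoted in the text.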
The CHASE_DB1 dataset contains more feature information because its sample images are affected by illumination and the variability between samples is larger. STARE also yields a lot of useful feature information because, although the overall variability between its sample images is relatively small, the lesion areas differ significantly. The DRIVE dataset, on the other hand, yields fewer general features, partly because of its smaller image size and partly because of the smaller differences between individual samples.

To evaluate the test results, we compared them with the work of others in this area. Since most of that work was cross-tested only on STARE and DRIVE, only these two datasets are compared. The results are shown in Table 8, where the best results are shown in bold. Because MFI-Net focuses on extracting coarse-grained feature information at shallow layers, the results of our method when trained on the STARE dataset, where the original feature information is more adequate, and tested on the DRIVE dataset are very good, with a 0.88% improvement in F1 over the method of Jiang et al. [26]. In contrast, the M3FCN [26] network model proposed by Jiang et al. is a multi-path network model that can extract more fine-grained feature information, and it therefore achieves good results on the DRIVE dataset, which has relatively little shallow information, with a 1.10% improvement in F1 over our method. Taking all evaluation metrics together, our method remains strongly competitive with most other methods. The cross-test results show that our method still performs well on retinal images acquired by different image acquisition devices.

Shortcomings of current work and future research directions.
Our work is dedicated to improving the segmentation accuracy of retinal vessel segmentation networks while discussing the performance of the network model from several aspects. However, it still has the following shortcomings: (1) To improve the segmentation accuracy of the network, we trade additional time and storage, which makes our network model demanding on hardware. (2) The datasets used in our work are limited to three public datasets, DRIVE, CHASE_DB1 and STARE, and we did not experiment on more and larger datasets, so we cannot present more convincing evidence of the superior performance of our network model. (3) Although we conducted experiments on different data and cross-tests, and the results outperformed many other current methods, models trained on datasets with less noise and fewer lesion regions did not segment strongly noisy datasets as well as expected.
Based on these shortcomings, we have identified directions and priorities for future research. First, we will continue to optimize our deep learning approach by optimizing the network structure, adding blocks that speed up network segmentation, and improving the generalization ability of the network model, so that it becomes suitable for production and clinical applications. Second, the retinal vascular images used in clinical diagnosis often differ because of different acquisition devices. Using new pre-processing methods to resolve the differences caused by illumination and similar factors, and building a network model with stronger generalization ability, will be of great significance in clinical applications.

Conclusion
The MFI-Net retinal vessel segmentation network model proposed in this paper improves performance by enhancing the fusion of semantic information. The multi-resolution input module strengthens the extraction and transmission of feature information in the shallow layers, which alleviates the problems of blurred boundaries and inaccurate capillary segmentation. The redesigned skip connections make information transfer between the deep and shallow layers of the network smoother, and richer feature information is used in the information fusion process of each layer, which greatly reduces the noise in the segmented images. We tested MFI-Net on the DRIVE, CHASE_DB1 and STARE datasets, where the accuracy reached 97.05%, 97.62% and 97.66%, respectively, and the F1 reached 83.18%, 81.56% and 84.83%, respectively. Experiments were also designed to analyze and demonstrate the stability and generalization ability of MFI-Net. The analysis and comparison of the segmentation results, together with the discussion of the robustness of the model, indicate that MFI-Net has advantages over the other methods considered.

Author Contributions
Methodology: Chao Wu, Ge Wang.