HDC-Net: A hierarchical dilation convolutional network for retinal vessel segmentation

The cardinal symptoms of some ophthalmic diseases, such as retinal vein occlusion and diabetic retinopathy, are observed through abnormal retinal blood vessels. Advanced deep learning models that automatically obtain morphological and structural information of blood vessels are conducive to the early treatment and active prevention of ophthalmic diseases. In our work, we propose a hierarchical dilation convolutional network (HDC-Net) to extract retinal vessels in a pixel-to-pixel manner. It utilizes the hierarchical dilation convolution (HDC) module to capture the fragile retinal blood vessels usually neglected by other methods. An improved residual dual efficient channel attention (RDECA) module infers more delicate channel information to reinforce the discriminative capability of the model. The structured Dropblock helps HDC-Net to alleviate network overfitting effectively. Overall, the segmentation results obtained by HDC-Net are superior to those of other deep learning methods on three acknowledged datasets (DRIVE, CHASE-DB1, STARE): the sensitivity, specificity, accuracy, F1-score and AUC score are {0.8252, 0.9829, 0.9692, 0.8239, 0.9871}, {0.8227, 0.9853, 0.9745, 0.8113, 0.9884}, and {0.8369, 0.9866, 0.9751, 0.8385, 0.9913}, respectively. It surpasses most other advanced retinal vessel segmentation models. Qualitative and quantitative analysis demonstrates that HDC-Net can fulfill the task of retinal vessel segmentation efficiently and accurately.


Introduction
Studies have found that the number of patients with retinopathy increases as the population ages. Retinopathy has many causes; diabetes, nephritis, anemia, and influenza may all lead to fundus diseases. The clinical symptoms of retinopathy mainly manifest as changes in the length, width, curvature, and angle of the retinal blood vessels [1]. For instance, diabetic retinopathy [2] is associated with swelling of the blood vessels, and hypertensive retinopathy [3] is accompanied by increased retinal vessel curvature and narrowing of blood vessels. Although retinopathy can be observed in many ways, the most critical characteristic is the variation of retinal blood vessels.
To enable sufferers to receive reasonable treatment, ophthalmologists usually diagnose related diseases by observing the morphological features of the abnormal blood vessels. Therefore, to observe exceptional blood vessels more intuitively, it is most crucial to analyze blood have highly praised the supervised methods. Supervised learning requires manually labeled data to establish an optimal predictive model. Researchers input the processed image into a trained prediction model to obtain the corresponding probability prediction map. Fundus image datasets are susceptible to quality degradation due to noise and illumination during acquisition, so dataset pre-processing is a key step in image analysis. Datasets are augmented in various ways, such as random rotation, random flipping, and color jittering [20], to increase the number of images. As the target vessels and background are not easily distinguishable in fundus images, it is common to use contrast limited adaptive histogram equalization (CLAHE) to improve image contrast. Some scholars have continued to innovate on this basis: for example, Li et al. proposed combining CLAHE with the discrete wavelet transform [21] to preserve image detail and suppress noise, and Aurangzeb et al. proposed tuning the CLAHE parameters with a particle swarm optimization algorithm [22] to improve the contrast of green-channel images.
U-Net holds an important position in the field of medical image analysis. As shown in Fig (1), the architecture of U-Net is mainly composed of a convolutional encoding unit and a decoding unit. In both units, each basic convolution operation is followed by ReLU activation. A 2×2 max-pooling operation is used for down-sampling in the encoding unit, and a transposed convolution is used for up-sampling in the decoding unit. The original U-Net crops and copies feature maps from the encoding unit and fuses them into the decoding unit. U-Net has the following advantages. First, its encoding and decoding units can simultaneously capture overall location and context information. Second, since most medical imaging datasets are typical small-sample datasets, U-Net can work with fewer training samples and still achieve superior performance.
At present, many excellent medical image segmentation models are improvements built on U-Net. For instance, Tarek M et al. proposed R2U-Net [23], which improves U-Net   [25], which adopts structured Dropblock instead of Dropout in the conventional convolutional layer to prevent overfitting. Although it can overcome overfitting, it does not adequately detect tiny blood vessels in fundus images. Wang et al. proposed DEU-Net [26], which significantly heightens the network's performance through pixel-level prediction, but it tends to ignore tiny blood vessels during training. Guo et al. proposed spatial attention U-Net (SA-UNet) [27], which applies a spatial attention mechanism to concentrate on more valuable pixels and suppress background pixels, heightening the expressive capacity of the model; however, its segmentation at the intersection of thick and thin blood vessels is not good.
To enhance an algorithm's performance, researchers have mainly focused on three elements of the network: depth, width, and cardinality. Besides these factors, attention also has a powerful effect on the network's performance. Woo et al. proposed the convolutional block attention module (CBAM) [28], which connects different attention modules in series to learn what to emphasize or suppress; CBAM performed well in classification tasks. Fu et al. proposed the dual attention network (DANet) [29], which adaptively integrates local and global features to overcome the difficulty of capturing context information in computer vision tasks. Although these attention mechanisms enhance the network's performance, they make the model more complex and increase the number of parameters. Wang et al. proposed efficient channel attention (ECA-Net) [30] to achieve a trade-off between performance and model complexity. It realizes local cross-channel information exchange without dimensionality reduction, which diminishes the complexity of the model while maintaining performance. Fundus images are affected by uneven illumination and other factors during imaging, and some small blood vessels are discontinuous, so the model may fail to detect vessel pixels sufficiently. Therefore, we propose a network that combines attention mechanisms with a U-shaped structure, which can better locate and extract the tiny blood vessels in fundus images.

Methodology
This paper proposes a deep learning model to obtain a clear fundus blood vessel structure. The vessel segmentation model classifies each pixel of a fundus image as a vessel (1) or background (0) pixel, so existing retinal vessel segmentation models are typical binary classification models.
This section describes the structure of HDC-Net in detail. The HDC-Net architecture is shown in Fig (2). We adopt SD-UNet as the backbone network because it can better overcome the problems caused by the small number of samples in the training set. In HDC-Net, basic convolution operations are carried out in the encoding and decoding units, followed by the HDC module, to adequately detect multi-scale vascular information in fundus images. The operation flow of each layer in the encoding and decoding units is shown in Fig (3). The skip connections with the RDECA module realize local cross-channel information exchange to improve the network's ability to segment blood vessels.

The Dropblock regularization method
Marking retinal blood vessels is laborious work, and the number of images in most existing fundus datasets is insufficient. Although the datasets are augmented before being fed to the network, the network still overfits during training. As shown in Fig (4) (left), when training reaches 80 epochs, the accuracy on the training set keeps improving significantly while the accuracy on the validation set improves very slowly, which indicates overfitting. Dropblock is a structured form of Dropout that successfully avoids overfitting issues in our network. The distinction between the two is that Dropout randomly discards individual pixels, while Dropblock randomly discards small contiguous pixel patches in the feature map. In addition, batch normalization (BN) and ReLU in the basic convolution unit with Dropblock significantly reduce the time required for network convergence. As shown in Fig (4) (right), with Dropblock the difference in accuracy between the training and validation sets remains relatively stable over the whole training process, which indicates that Dropblock effectively alleviates overfitting in HDC-Net.
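To make the difference between discarding single pixels and discarding patches concrete, the following is a minimal NumPy sketch of the Dropblock idea. It is not the authors' implementation: the function names, the choice to share one mask across all channels, and the seed-rate formula (taken from the usual Dropblock formulation) are assumptions made for illustration. The defaults of 7 and 0.15 mirror the block size and drop rate reported in this paper.

```python
import numpy as np


def dropblock_mask(height, width, block_size=7, drop_prob=0.15, rng=None):
    """Return a 0/1 keep-mask in which block_size x block_size patches are zeroed."""
    rng = np.random.default_rng() if rng is None else rng
    # Seeds are top-left corners restricted to positions where a full block fits.
    valid_h = height - block_size + 1
    valid_w = width - block_size + 1
    # Seed rate chosen so that roughly drop_prob of all activations end up dropped.
    gamma = drop_prob * (height * width) / (block_size ** 2) / (valid_h * valid_w)
    seeds = rng.random((valid_h, valid_w)) < gamma
    mask = np.ones((height, width), dtype=np.float32)
    for r, c in zip(*np.nonzero(seeds)):
        mask[r:r + block_size, c:c + block_size] = 0.0
    return mask


def dropblock(x, block_size=7, drop_prob=0.15, training=True, rng=None):
    """Apply the mask to a feature map x of shape (C, H, W); identity at test time."""
    if not training or drop_prob == 0.0:
        return x
    mask = dropblock_mask(x.shape[1], x.shape[2], block_size, drop_prob, rng)
    kept = mask.sum()
    if kept == 0:
        return np.zeros_like(x)
    # Rescale so the expected activation magnitude is unchanged (assumed convention).
    return x * mask * (mask.size / kept)
```

Setting block_size to 1 reduces the sketch to ordinary Dropout on individual positions, which is the contrast drawn in the text.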

HDC module
Recent medical studies have shown the importance of high-quality segmentation of vascular structures for the early treatment of ophthalmic diseases. However, fundus images have many fragile vessels that are difficult to visualize with the naked eye and often overlooked by researchers. This section introduces the HDC module that allows for adequate detection and segmentation of retinal vessels.
The HDC module is a hierarchical structure, and it divides the input feature map into two parts along the channel axis. The feature conversion process takes place in these two parallel branches [31]. The feature maps generated by the two parallel branches are concatenated into a new feature map along the channel axis. In this case, each filter is responsible for a particular function in the HDC module. From the HDC module diagram in Fig (5), we can see that the

PLOS ONE
channel number and resolution of the feature maps are unchanged between the input and output features, so the HDC module can be used as a general module for fundus image segmentation tasks.
The input feature map (F) is divided evenly along the channel axis into two parts, denoted by X1 and X2. To effectively collect the context information of each spatial position within the image, the convolutional feature transformation is carried out in two spaces of different scales. Kernels with different receptive fields detect information at different scales, and fusing these multi-scale structures enables comprehensive detection of blood vessels. Dilated convolutions [32] with dilation rates of 1 and 2 are used to extract the edge structure information of the retinal vessels, and Y1 and Y2 denote the respective transformed feature maps. Dilated convolution enlarges the receptive field of the kernel so that the structural and edge information of the vessels is extracted more fully. Y1 and Y2 are concatenated along the channel axis to form a new feature map (Y3), and then the spatial attention module (SAM) is applied for adaptive feature refinement. In this way, the module can detect fragile blood vessels that would otherwise be neglected.
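The split, transform and concatenate steps can be sketched in NumPy as follows; the SAM refinement and the residual connection are left out. The 3×3 kernel size, the assignment of dilation rate 1 to X1 and rate 2 to X2, and the names conv2d_same, effective_kernel and hdc_branches are illustrative assumptions rather than details stated in the paper.

```python
import numpy as np


def conv2d_same(x, w, dilation=1):
    """Cross-correlate x (C_in, H, W) with w (C_out, C_in, k, k), k odd.

    Zero padding of dilation * (k - 1) / 2 keeps the H x W resolution unchanged.
    """
    _, h, wd = x.shape
    k = w.shape[-1]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((w.shape[0], h, wd), dtype=np.float64)
    for i in range(k):
        for j in range(k):
            # Kernel taps are spaced `dilation` pixels apart in the input.
            patch = xp[:, i * dilation:i * dilation + h, j * dilation:j * dilation + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out


def effective_kernel(k, dilation):
    """Side length of the input region covered by a dilated k x k kernel."""
    return k + (k - 1) * (dilation - 1)


def hdc_branches(f, w1, w2):
    """Split F along channels, transform both halves, and concatenate into Y3."""
    x1, x2 = np.split(f, 2, axis=0)
    y1 = conv2d_same(x1, w1, dilation=1)  # finer scale
    y2 = conv2d_same(x2, w2, dilation=2)  # wider receptive field, same parameter count
    return np.concatenate([y1, y2], axis=0)
```

Under these assumptions a 3×3 kernel covers a 3×3 window at rate 1 and a 5×5 window at rate 2, and the concatenated output keeps the channel count and resolution of the input, as the text requires.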
In the SAM module, average-pooling aggregates spatial information, while max-pooling highlights distinctive object features in an image. A SAM that contains both pooling methods can infer more refined information, enhancing the network's multi-scale perception capabilities and capturing global key details. The operation flow of SAM is shown in Fig (6). In the expression for the output F_S of the SAM module, f^{7×7} denotes a convolution operation with a kernel size of 7, σ(·) denotes the Sigmoid function, and cat[·] denotes the concatenation operation. In addition, the residual connection (RC) between the input and output feature maps is utilized to prevent overfitting and to compensate for the loss of feature information during feature transformation.
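The expression for F_S is not reproduced in this text. One form consistent with the symbols defined above, and with the spatial attention used in CBAM, would be F_S = σ(f^{7×7}(cat[AvgPool(F), MaxPool(F)])), with both poolings taken along the channel axis. The NumPy sketch below implements that assumed form; multiplying the attention map back onto the features and adding the input as a residual are further assumptions, as are all function names.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def spatial_attention(f, kernel):
    """Return an (H, W) attention map for features f of shape (C, H, W).

    kernel has shape (2, k, k): one k x k filter per pooled map (k = 7 in the text).
    """
    avg = f.mean(axis=0)                    # average-pooling over channels
    mx = f.max(axis=0)                      # max-pooling over channels
    pooled = np.stack([avg, mx], axis=0)    # cat[...] -> (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, w = avg.shape
    logits = np.zeros((h, w))
    for i in range(k):
        for j in range(k):
            logits += np.einsum("c,chw->hw", kernel[:, i, j], padded[:, i:i + h, j:j + w])
    return sigmoid(logits)


def sam_with_residual(f, kernel):
    """Reweight every spatial position, then add the input back (assumed RC)."""
    attn = spatial_attention(f, kernel)
    return f * attn[None, :, :] + f
```

Because the attention map has a single channel, every channel at a given position is scaled by the same weight, which is what makes this attention spatial rather than channel-wise.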

RDECA module
According to recent studies, it is common to apply attention mechanisms to deep learning models to heighten performance. However, most strategies are devoted to creating more complex attention modules to obtain superior performance, which unavoidably increases the difficulty of implementation. Wang et al. proposed ECA-Net, which adopts a 1D convolution to realize information exchange between adjacent channels, significantly reducing the model's parameters while maintaining good performance. ECA-Net only utilizes average-pooling to aggregate spatial information in feature maps, but max-pooling can gather more prominent information. The RDECA module utilizes max-pooling and average-pooling simultaneously to gather more abundant feature information, which helps it achieve more accurate segmentation. The complete structure of the RDECA module is shown in Fig (7). The RDECA module utilizes the different pooling operations to generate different channel attention descriptors. These descriptors are concatenated along the channel axis, which retains more useful information than element-wise summation. A 2D convolution with a kernel size of 1 is adopted to reduce the number of channels, followed by ReLU activation. A 1D convolution is then utilized to realize local cross-channel information exchange without dimensionality reduction, and the sigmoid function is adopted to generate the final channel attention descriptor.
Last but not least, the RC [33] is applied between the input and the final output to effectively prevent the overfitting caused by the network being too complex, and it also plays a role in supplementing information. In our experiment, the kernel size of 1D convolution is 3. In addition, Fig (8) shows the structure when the RDECA module is applied to SD-UNet only.
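The RDECA steps described above (two pooled descriptors, concatenation, 1×1 convolution with ReLU, 1D convolution across channels, sigmoid, residual connection) can be sketched in NumPy as below. This is an illustration under stated assumptions, not the authors' code: the reduction from 2C to C channels, the zero padding of the 1D convolution, the multiplication of the attention weights onto the input, and the names rdeca, w_reduce and w_1d are all hypothetical.

```python
import numpy as np


def rdeca(f, w_reduce, w_1d):
    """Channel attention with a residual connection for features f of shape (C, H, W).

    w_reduce: (C, 2C) weights of the 1x1 convolution (assumed to map 2C -> C).
    w_1d:     (k,) kernel of the 1D convolution across channels (k = 3 in the text).
    """
    avg = f.mean(axis=(1, 2))                    # global average-pooling -> (C,)
    mx = f.max(axis=(1, 2))                      # global max-pooling     -> (C,)
    desc = np.concatenate([avg, mx])             # concatenated descriptor -> (2C,)
    # On a 1x1 spatial map a 1x1 convolution is a matrix product; ReLU follows.
    reduced = np.maximum(w_reduce @ desc, 0.0)   # (C,)
    # 1D convolution: each channel mixes only with its k nearest neighbours.
    k = w_1d.shape[0]
    padded = np.pad(reduced, k // 2)
    mixed = np.array([padded[c:c + k] @ w_1d for c in range(reduced.shape[0])])
    attn = 1.0 / (1.0 + np.exp(-mixed))          # sigmoid -> channel weights in (0, 1)
    return f * attn[:, None, None] + f           # reweight channels, then add the input
```

The 1D convolution contributes only k weights regardless of the channel count, which is the source of the parameter saving attributed to ECA-Net in the text.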

The datasets
Although deep learning networks can effectively capture feature information from data that has not been pre-processed, they tend to perform better on pre-processed images. In addition, DRIVE [34], CHASE-DB1 [35] and STARE [36] are typical small-sample datasets, so it is essential to pre-process the data before training. The DRIVE dataset consists of 40 color fundus images, of which training and test images each account for half. The resolutions of the DRIVE, CHASE-DB1, and STARE images are 565×584, 999×960, and 700×605, respectively. As these resolutions do not match the network, we pad the images with zero-valued pixels, adjusting the resolutions to 592×592, 1008×1008, and 704×704, respectively. During evaluation, the image resolutions are restored so that they are consistent with the original images in the three datasets. In addition, we utilized four data augmentation methods: (1) rotation by a random angle (0-360 degrees); (2) adding Gaussian noise; (3) adjusting the hue, contrast, and brightness; (4) horizontal, vertical and diagonal flips. The images after each pre-processing step are shown in Fig (9). Furthermore, the CHASE-DB1 images are too large for the network to train on, so each image is cropped into four images with a resolution of 512×512. The images after the cropping step are shown in Fig (10).
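The three padded sizes quoted above equal the longer image side rounded up to the next multiple of 16, which would be consistent with four 2×2 down-sampling steps. The sketch below pads under that reading and crops back for evaluation. The multiple of 16, the centred placement of the image, and the function names are assumptions; the paper only states that zero pixels are padded around the image.

```python
import math

import numpy as np


def padded_side(height, width, multiple=16):
    """Smallest square side >= both dimensions that is divisible by `multiple`."""
    return math.ceil(max(height, width) / multiple) * multiple


def pad_to_square(image, multiple=16):
    """Zero-pad an (H, W) or (H, W, C) image to a square; return it with its offset."""
    h, w = image.shape[:2]
    side = padded_side(h, w, multiple)
    top = (side - h) // 2
    left = (side - w) // 2
    pad = [(top, side - h - top), (left, side - w - left)]
    pad += [(0, 0)] * (image.ndim - 2)   # leave any channel axis untouched
    return np.pad(image, pad), (top, left)


def crop_back(padded, offset, height, width):
    """Undo pad_to_square so predictions align with the original ground truth."""
    top, left = offset
    return padded[top:top + height, left:left + width]
```

With the image heights and widths listed in the text, padded_side gives 592 for DRIVE, 1008 for CHASE-DB1 and 704 for STARE, matching the reported sizes.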

The metrics
The output of HDC-Net is a probability prediction map, which describes the probability that each pixel belongs to a blood vessel. In this paper, the threshold is set to 0.5. If the predicted value of a pixel in the probability map is greater than the threshold, it is considered a blood vessel pixel; otherwise, it is considered a background pixel. The thresholded probability maps are compared with the corresponding ground truths, and each pixel of the output image is classified as True Positive (TP), False Positive (FP), True Negative (TN), or False Negative (FN). Sensitivity (SE) measures the proportion of vessel pixels (label 1) that are correctly predicted as blood vessels. Specificity (SP) measures the proportion of background pixels (label 0) that are correctly predicted as background. Accuracy (ACC) measures the proportion of all pixels that are correctly predicted. In addition, we calculated the F1-score (F1) because it accounts for precision and recall at the same time. We also utilized the area under the curve (AUC) to further evaluate the network's performance. AUC is usually used to measure the performance of a binary classification model; the closer the AUC value is to 1, the better the model's performance.
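The metric formulas themselves are not shown in this text, so the sketch below uses the standard definitions: SE = TP/(TP+FN), SP = TN/(TN+FP), ACC = (TP+TN)/(TP+FP+TN+FN), and F1 as the harmonic mean of precision and recall. Treating AUC as the area under the ROC curve is an assumption, and the function names are illustrative.

```python
import numpy as np


def confusion_counts(prob, truth, threshold=0.5):
    """Count TP, FP, TN, FN after thresholding the probability map."""
    pred = prob > threshold
    truth = truth.astype(bool)
    tp = int(np.sum(pred & truth))
    fp = int(np.sum(pred & ~truth))
    tn = int(np.sum(~pred & ~truth))
    fn = int(np.sum(~pred & truth))
    return tp, fp, tn, fn


def segmentation_metrics(prob, truth, threshold=0.5):
    """SE, SP, ACC and F1; assumes both classes and at least one positive prediction."""
    tp, fp, tn, fn = confusion_counts(prob, truth, threshold)
    se = tp / (tp + fn)                  # recall on vessel pixels
    sp = tn / (tn + fp)                  # recall on background pixels
    acc = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * se / (precision + se)
    return {"SE": se, "SP": sp, "ACC": acc, "F1": f1}


def roc_auc(prob, truth):
    """ROC AUC as the chance that a vessel pixel outscores a background pixel.

    Ties count one half. The pairwise comparison is quadratic in memory, so this
    form only suits small examples; full images need a rank-based computation.
    """
    truth = truth.astype(bool)
    pos = prob[truth]
    neg = prob[~truth]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

Unlike SE, SP, ACC and F1, the AUC is computed from the raw probabilities and does not depend on the 0.5 threshold.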

Implementation details
HDC-Net was evaluated on the DRIVE, CHASE-DB1, and STARE datasets. All models were trained from scratch on the training set and evaluated on the test set. We use the Adam optimizer and a binary cross-entropy loss function to optimize our network. The learning rate and batch size are 0.008 and 2 for all three datasets, and the number of training epochs is 100 for DRIVE, 50 for CHASE-DB1, and 80 for STARE. For Dropblock, we set the size of the discarded blocks and the dropout rate to 7 and 0.15, respectively. The implementation is based on the public PyTorch framework, and all experiments run on a Tesla V100-PCIE-16GB GPU.

Ablation experiment
SD-UNet serves as our baseline. Tables 1-3 show the results of SD-UNet, SD-UNet + RDECA, SD-UNet + HDC, and HDC-Net on the three datasets (DRIVE, CHASE-DB1, STARE), respectively. In addition, to verify that the RC in the RDECA module plays an essential role in our model, we also included SD-UNet+RDECA (no RC) and HDC-Net (no RC) in the ablation experiment. The ablation experiments show that when the RDECA module is applied to the baseline, the SP and ACC increase by 0.02%/0.14%/0.09% and 0.04%/0.06%/0.02% on the three datasets, respectively. When the HDC module is applied to the baseline, the ACC, F1, and AUC of SD-UNet+HDC increase by 0/0.11%/0.07%, 0.16%/0.88%/0.21%, and 0.03%/0.15%/0.12% on the three datasets, respectively, which shows that our proposed HDC module can extract more vascular information.
Furthermore, the ablation experiments show that RDECA modules with an RC structure perform better than those without one. Therefore, the RC structure is conducive to improving the performance of the model. HDC-Net, which combines the advantages of the two modules, achieves better segmentation performance than applying either the RDECA module or the HDC module to the baseline alone.

In Fig (11), we show visualizations of a test example from the CHASE-DB1 dataset, including the segmentation results obtained by U-Net, SD-UNet, SD-UNet+RDECA, SD-UNet+HDC, SA-UNet, and HDC-Net, and the corresponding ground truth. The visualizations show that the segmentation results obtained by SD-UNet are not accurate enough for small curved blood vessels. Although the segmentation results of SD-UNet+RDECA and SD-UNet+HDC are more accurate than those of SD-UNet, the edge structure of the blood vessels is rough and unsmooth. Compared with SD-UNet+RDECA and SD-UNet+HDC, SA-UNet segments blood vessels with better edge structure, but it performs poorly at the intersections between small and thick blood vessels.

Comparative experiment
To assess the effectiveness of HDC-Net, we compared its segmentation results with those of other models applied to medical image segmentation. As shown in Table 4, HDC-Net reaches 0.8258, 0.9829, 0.9692, 0.8239, and 0.9871 for SE, SP, ACC, F1, and AUC, respectively, on the DRIVE dataset, which shows that HDC-Net outperforms most other retinal vessel segmentation methods. From Table 5, we can see that, compared with other advanced methods, HDC-Net achieves the highest SP, ACC, and AUC on the CHASE-DB1 dataset, which are 0.9853, 0.9745, and 0.9884, respectively. Although its SE and F1 are not superior to those of other methods, they are comparable. Table 6 shows the results of HDC-Net compared to other state-of-the-art methods on the STARE dataset: HDC-Net has the highest ACC, and its other metrics are better than those of most existing methods. In general, HDC-Net performs better than other existing methods on retinal vessel segmentation tasks. In the segmentation maps, the segmented vessels are not only more precise but also have better continuity. The experimental results show that HDC-Net, with its multi-scale awareness and enhanced discrimination capabilities, performs well in the retinal vessel segmentation task and can detect and extract vessels adequately and accurately, so it can be used for other retinal vessel segmentation tasks. In addition, we compared the number of parameters of HDC-Net with that of other models. As shown in Table 7, although HDC-Net does not have the fewest parameters, it has the best performance in retinal vessel segmentation, and it has significantly fewer parameters than R2U-Net.
Generalization ability is an important basis for evaluating deep learning models, and it is very important in real applications. We adopt a cross-training approach to assess the generalization ability of HDC-Net: the model is trained on the DRIVE dataset and then evaluated on the STARE dataset, and vice versa. In Table 8, we compare the generalization ability of two existing methods with that of HDC-Net. Table 8 shows that HDC-Net reaches the highest values for all indicators except SP when testing on the STARE dataset, and the highest SP, ACC, and AUC when testing on the DRIVE dataset. In general, this analysis indicates that HDC-Net has the best generalization ability among the compared methods.

Conclusion
High-quality fundus segmentation images benefit clinical diagnosis. We have developed a retinal vessel segmentation framework based on deep learning. The pre-processed retinal images were fed into the network for training, and the trained model was then evaluated. In HDC-Net, the HDC module detects vascular structure information at different scales, and the RDECA module in the skip connections facilitates the information exchange between the encoding and decoding units. The proposed model was evaluated on three publicly available datasets (DRIVE, CHASE-DB1, STARE). The experimental results show that its performance is comparable to or even better than that of most existing state-of-the-art methods. In the ablation experiments on the three datasets, HDC-Net improved significantly over the baseline: the ACC, F1, and AUC improved by {0.05%, 0.82%, 0.2%}, {0.41%, 0.74%, 0.08%}, {0.16%, 0.88%, 0.13%}, respectively, which demonstrates that the proposed HDC and RDECA modules are helpful for retinal vessel segmentation. The proposed HDC-Net is effective and feasible. In addition, most retinal lesions share some similar symptoms, such as microaneurysms, hemorrhages, exudates, and other abnormalities found in the retina, so the proposed HDC-Net can be used as a general network to perform other retinal vascular segmentation tasks.