Table 1.
Components of the DBNet and DPNet models.
Fig 1.
The overall architecture of the proposed DPNet.
First, the feature encoder serves as the backbone network of the model, combining ResNet with the proposed CESAM and SESAM to enlarge the receptive field. Second, the feature decoder processes the backbone output to extract more detailed feature information. Finally, the DB module processes the decoder output to produce the final detection result.
Table 2.
Changes in channel dimensions in DBNet and DPNet.
Fig 2.
The structural design of the dual perspective transformer high-level semantic feature extraction module.
Left: the Channel Self-Attention Module (CESAM). Right: the Spatial Enhanced Self-Attention Module (SESAM). This paper embeds CESAM and SESAM into the original ResNet backbone to strengthen the network's feature extraction from two perspectives, channel and spatial.
Fig 3.
The structural design of the feature decoder, exemplified here by C4 and C5.
Fig 4.
Traditional path (green flow) and our path (red flow).
The green arrows represent the standard binarization process. The red arrows represent differentiable binarization, the method adopted in this paper, which can adaptively predict the threshold for each position in the image.
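The contrast between the two paths can be illustrated with a minimal sketch. It assumes the approximate step function commonly used for differentiable binarization, B = 1 / (1 + exp(-k(P - T))), where P is the probability map, T is the per-pixel threshold map, and k is an amplification factor. The value k = 50, the fixed threshold 0.5, the function names, and the toy maps are illustrative assumptions, not values taken from this paper.

```python
import math

def standard_binarize(prob_map, threshold=0.5):
    """Hard step with one fixed threshold (assumed 0.5); not differentiable."""
    return [[1.0 if p >= threshold else 0.0 for p in row] for row in prob_map]

def differentiable_binarize(prob_map, thresh_map, k=50.0):
    """Smooth step B = 1 / (1 + exp(-k * (P - T))) with a per-pixel threshold T.

    k is an assumed amplification factor; larger k approaches a hard step.
    """
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]

# Toy 2x2 maps (hypothetical values, for illustration only).
prob_map = [[0.9, 0.4], [0.35, 0.1]]
thresh_map = [[0.5, 0.3], [0.3, 0.3]]

hard = standard_binarize(prob_map)
soft = differentiable_binarize(prob_map, thresh_map)
```

In this toy example the pixel with probability 0.4 is rejected by the fixed threshold but accepted by the adaptive one, because its local threshold is 0.3. Since the smooth step has a gradient everywhere, the threshold map can be learned jointly with the probability map.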
Fig 5.
Visualization of the results of our method on different types of text instances, including multi-oriented, multi-lingual, and curved text.
The second to fourth rows correspond to the probability map, the threshold map, and the binarization map for each text instance in the images, respectively.
Table 3.
Test results on the MSRA-TD500 dataset.
Table 4.
Parameter settings and performance comparison for the feature encoder.
Fig 6.
Comparison of visualization results for the baseline and DPNet.
With the combined effect of the proposed CESAM, SESAM, and feature decoder, DPNet localizes text boxes more precisely than the baseline and reduces missed and false detections.
Table 5.
Test results for the Total-Text dataset (values in parentheses indicate the height of the input image).
Table 6.
Test results for the ICDAR 2015 dataset (values in parentheses indicate the height of the input image).
Table 7.
Test results on the MSRA-TD500 dataset (values in parentheses indicate the height of the input image).
Fig 7.
Visualization results of text instances for our method DPNet and the baseline on different types of datasets.
The images are randomly selected from the three datasets to demonstrate the robustness of our model.