DPNet: Scene text detection based on dual perspective CNN-transformer

doi:10.1371/journal.pone.0309286

Table 1.

Components of the DBNet and DPNet models.

Fig 1.

The overall architecture of the proposed DPNet.

The feature encoder is the backbone of the model; it combines ResNet with the proposed CESAM and SESAM to enlarge the receptive field. The feature decoder then processes the backbone output to extract finer-grained feature information. Finally, the DB module takes the decoder output and produces the final detection result.

Table 2.

Changes in the channel dimensions of DBNet and DPNet.

Fig 2.

The structural design of the dual perspective transformer high-level semantic feature extraction module.

On the left is the design of the Channel Enhanced Self-Attention Module (CESAM), and on the right is the design of the Spatial Enhanced Self-Attention Module (SESAM). Both modules are embedded into the original ResNet backbone to strengthen the network's feature extraction from two perspectives: channel and spatial.
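The two perspectives can be illustrated with a minimal sketch of plain scaled dot-product self-attention applied along two different axes of the same feature map. This is not the paper's CESAM or SESAM design: the identity query/key/value projections, the function names, and the tiny 2-channel, 4-position feature map are all assumptions made for illustration. The only idea carried over from the caption is that attention can treat either channels or spatial positions as the tokens.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def self_attention(tokens):
    # Scaled dot-product self-attention over T tokens of length D.
    # Assumption: Q, K and V are the tokens themselves (no learned projections).
    d = len(tokens[0])
    scores = matmul(tokens, transpose(tokens))  # T x T similarity matrix
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, tokens)  # T x D, same shape as the input

def channel_attention(feat):
    # Channel perspective: each of the C channels (a row of length N) is a token.
    return self_attention(feat)

def spatial_attention(feat):
    # Spatial perspective: each of the N positions (a column of length C) is a token.
    return transpose(self_attention(transpose(feat)))

# Hypothetical feature map with C = 2 channels and N = 4 flattened positions.
feat = [[1.0, 0.0, 2.0, 1.0],
        [0.5, 1.5, 0.0, 1.0]]
ch = channel_attention(feat)   # attention weights form a C x C matrix
sp = spatial_attention(feat)   # attention weights form an N x N matrix
```

Both functions return a map with the same shape as the input, so in principle either could be dropped into a backbone stage without changing the surrounding dimensions.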

Fig 3.

The structural design of the feature decoder, exemplified here by C₄ and C₅.

Fig 4.

Traditional path (green flow) and our path (red flow).

The green arrows represent the standard binarization process. The red arrows represent differentiable binarization, the method adopted in this paper, which can adaptively predict the threshold for each position in the image.
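The difference between the two paths can be sketched in a few lines. The soft step below is the approximate step function from the original DB formulation, B = 1 / (1 + exp(-k(P - T))), where P is the probability map, T is the per-pixel threshold map, and k = 50 is the amplification factor used in the original DB work. The fixed threshold of 0.3, the function names, and the 2x2 example maps are illustrative assumptions, not values taken from this paper.

```python
import math

K = 50.0  # amplification factor from the original DB formulation

def standard_binarize(prob_map, threshold=0.3):
    # Green path: hard step with one fixed threshold (0.3 is an assumed value).
    # The step is not differentiable, so it cannot be trained end to end.
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]

def differentiable_binarize(prob_map, thresh_map, k=K):
    # Red path: steep sigmoid of (P - T) with a separate threshold per pixel.
    # It approximates the hard step while keeping gradients for P and T.
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(p_row, t_row)]
        for p_row, t_row in zip(prob_map, thresh_map)
    ]

# Hypothetical 2x2 probability and threshold maps.
prob = [[0.9, 0.4],
        [0.2, 0.6]]
thresh = [[0.3, 0.5],
          [0.3, 0.5]]

hard = standard_binarize(prob)
soft = differentiable_binarize(prob, thresh)
```

In this example the pixel with probability 0.4 passes the fixed threshold of 0.3 but is suppressed by its local threshold of 0.5, which is the kind of position-dependent decision the adaptive threshold map allows.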

Fig 5.

Visualization of the results of our method on different types of text instances, including multi-oriented, multi-lingual, and curved text.

The second to fourth rows correspond to the probability map, the threshold map, and the binarization map for each text instance in the images, respectively.

Table 3.

Test results on the MSRA-TD500 dataset.

Table 4.

Parameter settings and performance comparison for the feature encoder.

Fig 6.

Comparison of visualization results for baseline and DPNet.

With the proposed CESAM, SESAM, and feature decoder working together, DPNet localizes text boxes more precisely than the baseline and reduces both missed and false detections.

Table 5.

Test results on the Total-Text dataset (values in parentheses are the heights of the input images).

Table 6.

Test results on the ICDAR 2015 dataset (values in parentheses are the heights of the input images).

Table 7.

Test results on the MSRA-TD500 dataset (values in parentheses are the heights of the input images).

Fig 7.

Visualization results of text instances for our method, DPNet, and the baseline on different types of datasets.

The images are randomly selected from the three datasets to demonstrate the robustness of our model.
