SVTRv2X: Enhanced scene text recognition via self-distilled mixture-of-experts

Jian Guo; Hanxin Cui; Wengang Tang; Xuehai Zhou; Xing Xu; Qianqian Cheng

doi:10.1371/journal.pone.0349085

Abstract

Scene Text Recognition (STR) is a fundamental component of intelligent perception systems and plays a crucial role in a wide range of real-world applications such as autonomous driving, document understanding, and human–computer interaction. STR still faces several challenges in practical applications, including high sensitivity to spatial perturbations, limited representational capacity of lightweight Connectionist Temporal Classification (CTC)-based models, and the difficulty of handling diverse text styles within a single unified architecture. Although SVTRv2 enhances the recognition ability of CTC models through a combination of local and global mixing mechanisms, its robustness and generalization capability remain insufficient when dealing with geometric distortions, complex backgrounds, or text with large stylistic variations. To address these issues, we propose SVTRv2X, an enhanced STR framework built upon SVTRv2 that integrates three complementary improvement modules. The Jumble Module strategically rearranges input patches before the patch embedding stage, reducing the model’s reliance on fixed spatial structures and improving robustness to rotated, misaligned, and irregular text. The Self-Distillation Module transfers deep-layer knowledge to shallow features, strengthening early-stage representations while maintaining lightweight inference. The Mixture-of-Experts (MoE) Module expands model capacity through sparsely activated expert networks, allowing specialized processing of different text styles without introducing substantial computational overhead. Extensive experiments demonstrate that SVTRv2X achieves state-of-the-art performance on multiple STR benchmarks, substantially advancing the model’s recognition capability in real-world scene text scenarios.

Citation: Guo J, Cui H, Tang W, Zhou X, Xu X, Cheng Q (2026) SVTRv2X: Enhanced scene text recognition via self-distilled mixture-of-experts. PLoS One 21(6): e0349085. https://doi.org/10.1371/journal.pone.0349085

Editor: Hikmat Ullah Khan, University of Sargodha, PAKISTAN

Received: January 20, 2026; Accepted: April 25, 2026; Published: June 1, 2026

Copyright: © 2026 Guo et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the manuscript and its Supporting information files.

Funding: The author(s) received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

1. Introduction

Scene text recognition (STR), a branch of optical character recognition (OCR) that targets text in natural images, plays a crucial role in computer vision and has drawn considerable attention from researchers because of its wide range of applications. In automated document processing, it improves efficiency by extracting and digitizing text [1], and in assistive tools for visually impaired users, it enables independent access to written information by converting printed or handwritten text into speech or Braille [2]. It also plays a key role in intelligent transportation systems for license plate and traffic sign recognition [1], and contributes to autonomous driving, human-computer interaction, and translation services [2].

Deep learning has greatly advanced scene text recognition, significantly improving the accuracy of recognizing different fonts and languages. Existing deep models fall broadly into two categories. The first consists of CTC-based models (Fig 1(a)), which offer fast inference but limited robustness [3,4]; the second comprises attention-based Encoder-Decoder models (Fig 1(b)), which achieve high accuracy at the cost of greater computational overhead. SVTR [5] (Fig 1(c)) introduces a three-stage backbone network that employs local and global mixing operations to capture stroke-level features and long-range dependencies, enabling efficient linear prediction. SVTRv2 [6] (Fig 1(d)) further enhances performance with Multi-Scale Resizing (MSR) and a Feature Rearrangement Module (FRM), strengthening CTC-based recognition while maintaining a simple structure and efficient inference.

Fig 1. (a) CNN-RNN based models. (b) Encoder-Decoder models. MHSA and MHA denote multi-head self-attention and multi-head attention, respectively. (c) SVTR. (d) SVTRv2.

https://doi.org/10.1371/journal.pone.0349085.g001

Recent works [7–9] on scene understanding and non-Latin text recognition have achieved remarkable results. Despite the rapid development of scene text recognition (STR) models, several practical challenges remain unresolved. Current methods are highly sensitive to spatial perturbations, limiting their robustness when text appears in rotated, misaligned, or irregular layouts. Lightweight CTC-based models, while efficient, often suffer from insufficient representational capacity, making it difficult to capture complex character patterns. Moreover, handling diverse text styles within a single unified architecture remains challenging, as existing models struggle to generalize across geometric distortions, complex backgrounds, or large stylistic variations. Even advanced architectures like SVTRv2, which leverage local and global mixing mechanisms to enhance recognition, still exhibit noticeable limitations in robustness and generalization under real-world conditions.

To address these issues, we present SVTRv2X, an enhanced framework built upon SVTRv2 that integrates three synergistic modules to boost model performance. First, we introduce a Jumble Module before the patch embedding stage that partially shuffles input patches during training, thereby bolstering the model’s robustness against spatial perturbations. Second, we incorporate a Self-Distillation Module to facilitate knowledge transfer from deeper to shallower layers, which enriches the model’s feature representation capabilities. Finally, a Mixture of Experts (MoE) module is employed to enable specialized processing of textual features from diverse scenes, achieving a “divide and conquer” recognition strategy. Experimental results demonstrate that SVTRv2X improves recognition accuracy on benchmark datasets compared to SVTRv2 while maintaining its highly competitive inference speed.

Our contributions are summarized as follows:

We propose a Jumble Module placed before the patch embedding stage, which strategically rearranges local image patches during training. This encourages the model to learn position-invariant and robust features, significantly improving resilience to rotated, misaligned, and irregular text.
We integrate a Self-Distillation Module that transfers knowledge from deeper layers to shallower ones during training, enhancing early-stage feature representations while maintaining a lightweight architecture and efficient inference.
We employ a Mixture-of-Experts (MoE) Module, where a sparsely activated set of expert subnetworks dynamically processes each input. This enables specialized modeling of diverse text styles without introducing substantial computational overhead.
Extensive experiments on multiple open-source STR benchmarks demonstrate that SVTRv2X achieves state-of-the-art (SOTA) performance, improving recognition accuracy while maintaining highly efficient inference.

2. Related work

2.1 Scene text recognition

Scene text recognition has attracted wide attention in the computer vision community and has been studied for decades. Before the deep learning era, the task was typically divided into two parts: text detection and text recognition. For text detection, Chen et al. [5] use a texture-based method, which exploits the texture features of text to locate text regions in an image. Lee et al. [6] use a sliding-window approach, in which windows of different sizes slide over the image and each window is classified as containing text or not. For text recognition, traditional methods classify text using character features, such as character key-points (Phan et al.) [10] and bottom-up and top-down cues (Mishra et al.) [11]. Compared with deep learning methods, these methods are less accurate.

Following the rapid development and widespread adoption of deep neural networks, the performance of scene text recognition has improved greatly. Huang et al. [12] use convolutional neural networks (CNNs) to detect text in images. Zhan et al. [13] and Liao et al. [14] use techniques inspired by object detection to achieve scene text recognition. He et al. [15] use RNN-based models for text recognition. Lee et al. [16] proposed an Encoder-Decoder model that uses recursive convolutional layers as the encoder and an RNN as the decoder. Aberdam et al. [17] propose CLIP Text Recognition, which uses a vision-language transformer model to provide rich scene-level information to the crop-based recognizer, thereby drawing on whole-image information when the text is of poor quality. Xue et al. [18] introduce I2C2W, which decomposes the task into image-to-character and character-to-word stages and detects characters in a non-sequential way, performing well in noisy scenes. Yu et al. [19] propose an architecture consisting of a CCR-CLIP (Chinese character recognition) pre-training stage and a CTR (Chinese text recognition) stage specialized for Chinese scene text.

To address the recognition challenges posed by the diversity of text instances, numerous recent works have enhanced CTC-based methods. One prevailing strategy involves introducing rectification modules, inspired by prior work. These modules aim to transform irregular text into a more regular, recognizable format through geometric or perspective corrections. Another widely adopted approach is to integrate an attention-based decoder. Such decoders leverage attention mechanisms to sequentially decode visual features, allowing them to dynamically attend to and identify each character, thus offering a more flexible alignment mechanism than conventional CTC. However, existing methods still have weaknesses and unsolved problems.

2.2 Patch-level spatial augmentation

Data augmentation plays a crucial role in enhancing the generalization ability and robustness of vision models. Existing techniques span from global image-level transformations to fine-grained patch manipulation and mixing-based sample synthesis. This section reviews two representative lines of work—traditional augmentation methods and mixup-based augmentation—with emphasis on how they manipulate global or local patches to improve model robustness and representation diversity.

Traditional data augmentation typically targets individual images, performing basic geometric transformations and color conversions [20]. Image-level methods such as random flipping and random cropping are among the most widely used augmentations and consistently improve the generalization performance of neural networks on clean data. Other appearance-based techniques—including adjustments of sharpness, brightness, and Gaussian noise—have also been utilized in various works [21]. With the increasing interest in combining multiple augmentation strategies, automatic augmentation frameworks such as AutoAugment [22], RandAugment [23], and TrivialAugment [24] have been proposed to search or generate diverse combinations of augmentations. Beyond global transformations, patch-level augmentation focuses on perturbing local spatial structures. Cutout [25] randomly masks a portion of the image, Patch Gaussian [26] injects Gaussian noise into a selected patch to balance accuracy and robustness, while Random Erasing [27] introduces varying levels of occlusion without parameter learning. Although effective, these traditional augmentation methods generally preserve the overall structure of the image and thus provide limited diversification of feature representations.

Another important direction is mixup-based augmentation, which synthesizes new samples by combining multiple images or patches. Linear mixing, represented by Mixup [28], linearly interpolates two images and their labels in a random ratio, acting as a powerful regularizer that improves model accuracy. Manifold Mixup [29] extends this idea to hidden-layer features to preserve the manifold structure among samples, while Subgroup Mixup [30] performs pairwise mixup to encourage fair and accurate decision boundaries across different subgroups. In terms of nonlinear mixing, CutMix [31] replaces a randomly cropped patch of one image with a patch from another image, with labels mixed proportionally to patch areas. PuzzleMix [32] and SaliencyMix [33] incorporate saliency maps to guide patch selection, ensuring that discriminative regions remain intact. AutoMix [34] further generates mixed samples adaptively based on learned feature maps and mixing ratios in an end-to-end manner. Despite producing rich and reliable sample variations, mixup-based methods typically incur non-negligible computational overhead and introduce challenges in label assignment due to the involvement of multiple images.

3. Materials and methods

3.1 SVTRv2

This section details the enhanced SVTRv2X architecture, whose overall structure is shown in Fig 2. We begin with an overview of the baseline SVTRv2 model and then introduce the three enhancement modules proposed in this work: the Jumble Module, the Self-Distillation Module, and the Mixture-of-Experts Module. SVTRv2X evolves from the highly efficient SVTRv2 model. The visual backbone of SVTRv2 is composed of three stages, with each stage containing six Mixing Blocks designed to progressively extract hierarchical visual features. To extract discriminative representations, SVTRv2 uses two types of mixing blocks: Local Mixing and Global Mixing. Notably, to accommodate multi-sized text instances, SVTRv2 replaces computationally intensive windowed attention [14] with two consecutive group convolutions for its local mixing operation. This design efficiently models positional information and captures local character features such as edges, textures, and strokes. Building upon this strong baseline, we integrate three enhancement modules to construct the SVTRv2X framework.

Fig 2. The architecture of SVTRv2X.

https://doi.org/10.1371/journal.pone.0349085.g002

3.2 Jumble Module

We introduce a patch-level shuffling module before the patch embedding layer, which is applied exclusively during the training phase. Unlike conventional data augmentation strategies that rely on cross-image mixing, the proposed module operates within each individual image and performs controlled spatial perturbations to enhance representation robustness.

Given an input image, it is first divided into N non-overlapping patches of size p × p:

(1)

Rather than shuffling all N patches globally, we randomly select a subset of them according to a predefined shuffle ratio P (set to 0.1 by default in our experiments) and permute only the patches within this subset:

(2)

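where the permutation is a random permutation defined over the selected subset. The permuted patches, together with the unselected patches left in their original positions, form the augmented image.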

This design introduces a partial patch shuffling strategy, where only a portion of patches are rearranged while the remaining patches preserve their original spatial positions. Such a mechanism maintains a balance between structural consistency and spatial disruption. Importantly, each patch remains intact, meaning that local visual patterns are preserved while inter-patch spatial relationships are selectively perturbed.

By disrupting local spatial continuity in a controlled manner, the model is encouraged to rely less on fixed spatial adjacency and instead learn more robust and position-invariant representations. This is particularly beneficial for scene text recognition, where text may appear with irregular layouts, distortions, or perspective variations.

In practice, the Jumble Module (JM) is only activated during training and introduces negligible computational overhead. During inference, the input is processed without any modification, ensuring full compatibility with standard patch embedding pipelines. Fig 3 illustrates examples of augmented images generated by the proposed module.

Fig 3. Illustration of the Jumble Module.

The input image is transformed into a partially shuffled version.

https://doi.org/10.1371/journal.pone.0349085.g003

The Jumble Module can be viewed as a lightweight structural regularization technique. By exposing the model to diverse spatial configurations within each image, it effectively improves generalization ability and alleviates overfitting, without introducing label inconsistency or cross-image artifacts.
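The partial shuffling described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `jumble_patches`, the 2-D list image layout, the `training` flag, and the rule that at least two patches are selected whenever the ratio is positive are all hypothetical choices made for the example.

```python
import random

def jumble_patches(image, patch, ratio=0.1, training=True, rng=None):
    """Partially shuffle non-overlapping patch x patch tiles of a 2-D image.

    image: list of rows (H x W) with H and W divisible by `patch`.
    ratio: fraction of patches selected for permutation (P in the text).
    """
    if not training:
        # Inference path: the input is passed through unchanged.
        return [row[:] for row in image]
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    rows, cols = height // patch, width // patch

    # Cut the image into N = rows * cols patches, kept in raster order.
    patches = [
        [row[c * patch:(c + 1) * patch] for row in image[r * patch:(r + 1) * patch]]
        for r in range(rows) for c in range(cols)
    ]

    # Assumption: select at least two patches so a permutation can move something.
    n = len(patches)
    k = min(n, max(2, round(ratio * n))) if ratio > 0 and n > 1 else 0
    selected = rng.sample(range(n), k)
    sources = selected[:]
    rng.shuffle(sources)

    # Permute only the selected slots; every patch stays intact internally.
    out = patches[:]
    for dst, src in zip(selected, sources):
        out[dst] = patches[src]

    # Stitch the patches back into an image of the original size.
    result = []
    for r in range(rows):
        for y in range(patch):
            line = []
            for c in range(cols):
                line.extend(out[r * cols + c][y])
            result.append(line)
    return result
```

Because whole patches are moved rather than mixed, the output contains exactly the same pixels and the same set of patches as the input, which is why no label adjustment is needed.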

3.3 Self-distillation module (SDM)

Recent works on knowledge distillation and multi-scale attention mechanisms [36,37] inspired the design of our self-distillation module (SDM). To facilitate feature learning, we integrate the SDM into the baseline SVTRv2 architecture after stage 1 and stage 2, following the design paradigm of SVTRv2. Let F₁ and F₂ denote the intermediate features extracted from stage 1 and stage 2, respectively.

To enable effective feature alignment, both features are first projected into a shared latent space via a 1 × 1 convolution:

(3)

A subsequent non-overlapping 3 × 3 convolution is applied to enhance local spatial interactions within each feature map:

(4)

Finally, another 1 × 1 convolution is used to match the feature dimensions:

(5)

Unlike traditional distillation methods that rely on an external teacher model, our SDM adopts an internal self-distillation strategy, where deeper features serve as implicit teachers for shallower ones. Specifically, we use the transformed stage-2 feature as the teacher to supervise the transformed stage-1 feature. The self-distillation loss is defined as:

(6)

where the stop-gradient operation blocks gradient flow through the teacher feature. In this formulation, the number of supervised feature levels is fixed as N = 2, corresponding to stage 1 and stage 2.

To stabilize training, we adopt a two-phase training strategy. In the first phase, the model is trained without the self-distillation loss to learn reliable feature representations. In the second phase, we enable L_sd and fine-tune the model for an additional 20 epochs. This design is motivated by the observation that applying self-distillation at early stages may introduce noisy supervision due to immature feature representations. Empirically, this two-phase strategy leads to more stable convergence and improved performance compared to joint training from scratch, as demonstrated in our ablation study.
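The two-phase schedule can be pictured as a weight on L_sd that stays at zero during the first phase and is switched on for the final 20 epochs. The sketch below is illustrative only: `sd_weight`, `phase1_epochs`, and the mean-squared-error distance in `feature_mse` are assumptions (the length of the first phase and the exact distance used in L_sd are not given here), and the weight 0.5 is the self-distillation weight reported in Section 3.5.

```python
def sd_weight(epoch, phase1_epochs, phase2_epochs=20, weight=0.5):
    """Weight applied to the self-distillation loss at a 0-based epoch.

    phase1_epochs is a hypothetical parameter: training without L_sd.
    phase2_epochs: fine-tuning epochs with L_sd enabled (20 in the text).
    """
    if not 0 <= epoch < phase1_epochs + phase2_epochs:
        raise ValueError("epoch is outside the training schedule")
    return 0.0 if epoch < phase1_epochs else weight

def feature_mse(student, teacher):
    """Assumed distance between flattened student and teacher features.

    The teacher values are treated as constants, which is what the
    stop-gradient operation achieves in an autograd framework.
    """
    if len(student) != len(teacher):
        raise ValueError("features must be projected to the same size first")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)
```

In an autograd framework the teacher feature would be detached before computing the distance, so that only the shallower stage receives gradients from this term.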

3.4 Mixture of experts module

Within the Mixing blocks, we replace the MLP with a Sparse Mixture-of-Experts (MoE) layer, as illustrated in Fig 4. The MoE layer consists of a lightweight gating network and N parallel expert sub-networks. Unlike a standard FFN that applies the same transformation to all input tokens, the MoE layer dynamically routes each token to a small subset of the experts. This sparse activation mechanism significantly increases the model capacity while keeping the computational cost nearly unchanged.

Fig 4. MoE structure.

https://doi.org/10.1371/journal.pone.0349085.g004

Formally, given an input token representation, the gating network computes a score for each expert:

(7)

where W_g and b_g denote the gating parameters. To enforce sparsity, only the top-K experts (typically K = 1 or 2) are selected for each token. Taking the indices of the top-K values in g as the selected set, the routed output is computed as:

(8)

where the i-th expert network is typically a two-layer feed-forward module:

(9)

In our experiments, the MoE module uses 4 experts. We adopt a top-1 routing strategy for activation sparsity. A load balancing loss is applied during training to encourage uniform expert utilization, but it is not applied during inference.

By activating only a few experts per token, the MoE layer adopts a “divide-and-conquer” strategy that allows distinct experts to specialize in different textual patterns such as font variations, stroke styles, distortions, or noise types. This specialization not only improves the expressive power of the Transformer blocks but also significantly enhances recognition accuracy with only marginal additional computational cost (FLOPs).
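A small sketch may help make the routing concrete. It assumes softmax gating over a linear projection with parameters W_g and b_g and weights each selected expert's output by its gate value without renormalizing over the top-K set; both choices are common conventions rather than details confirmed by the text, and the names `moe_forward`, `matvec`, and `softmax` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(weight, x, bias):
    """Compute weight @ x + bias for list-based matrices and vectors."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def moe_forward(x, gate_w, gate_b, experts, k=1):
    """Route one token x to its top-k experts and mix their outputs.

    experts: list of callables, each mapping a token vector to a vector.
    Returns the mixed output and the indices of the selected experts.
    """
    g = softmax(matvec(gate_w, x, gate_b))  # one gate score per expert
    top = sorted(range(len(g)), key=lambda i: g[i], reverse=True)[:k]
    out = None
    for i in top:
        y = experts[i](x)  # only the selected experts are evaluated
        scaled = [g[i] * v for v in y]
        out = scaled if out is None else [a + b for a, b in zip(out, scaled)]
    return out, top
```

With four experts and k = 1, as in the configuration above, each token triggers a single expert evaluation, so the per-token cost stays close to that of one dense feed-forward layer plus the small gating projection.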

3.5 Loss function

During training, the SVTRv2X model is optimized by minimizing a combined loss function that integrates multiple supervision signals, including the CTC loss, semantic guidance module (SGM) loss, and self-distillation loss. The Connectionist Temporal Classification (CTC) loss [38] is applied to handle the misalignment between the input feature sequence and the target character sequence. Given the input feature sequence and the target label sequence Y, the CTC loss is defined as:

(10)

where each term corresponds to a valid alignment path, drawn from the set of all paths that collapse to the target sequence Y after merging consecutive repeated characters and removing blank tokens.
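The collapse rule and the path sum can be checked on a toy example. The sketch below follows the standard CTC definition (negative log of the total probability of all paths that collapse to the target) and enumerates paths by brute force, which is only feasible for very short sequences; real implementations use dynamic programming. The names `collapse` and `ctc_loss_bruteforce` and the use of "-" as the blank token are choices made for this illustration.

```python
import itertools
import math

BLANK = "-"

def collapse(path, blank=BLANK):
    """Merge consecutive repeated symbols, then drop blank tokens."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return "".join(symbol for symbol in merged if symbol != blank)

def ctc_loss_bruteforce(probs, alphabet, target, blank=BLANK):
    """Negative log-likelihood of `target` under per-step distributions.

    probs[t][j] is the probability of alphabet[j] at time step t.
    """
    total = 0.0
    for indices in itertools.product(range(len(alphabet)), repeat=len(probs)):
        path = "".join(alphabet[j] for j in indices)
        if collapse(path, blank) == target:
            total += math.prod(probs[t][j] for t, j in enumerate(indices))
    if total == 0.0:
        raise ValueError("target cannot be produced by any alignment path")
    return -math.log(total)
```

For two time steps, the alphabet "-a", and uniform probabilities, the paths "aa", "-a", and "a-" all collapse to "a", so the summed probability is 3 × 0.25 = 0.75.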

To incorporate linguistic context during training, the semantic guidance module applies cross-entropy supervision to both left-to-right and right-to-left predictions of the character sequence. The SGM loss is formulated as:

(11)

Here the per-position loss is the cross-entropy, L is the sequence length, the two prediction terms are the left-to-right and right-to-left predictions, and c_i is the ground-truth character at position i.
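As a rough illustration of bidirectional supervision, the sketch below sums the cross-entropy of the two directional predictions at each position and averages over the sequence length. The averaging convention and the dictionary-based probability layout are assumptions made for the example, and `sgm_loss` is a hypothetical name.

```python
import math

def sgm_loss(l2r_probs, r2l_probs, targets):
    """Bidirectional cross-entropy averaged over the sequence length.

    l2r_probs[i] and r2l_probs[i] map each character to its predicted
    probability at position i; targets[i] is the ground-truth character.
    """
    length = len(targets)
    if not (len(l2r_probs) == len(r2l_probs) == length) or length == 0:
        raise ValueError("predictions and targets must share a non-zero length")
    total = 0.0
    for i, char in enumerate(targets):
        total += -math.log(l2r_probs[i][char])  # left-to-right term
        total += -math.log(r2l_probs[i][char])  # right-to-left term
    return total / length
```

A prediction that assigns probability 1 to every ground-truth character in both directions yields a loss of zero, and the loss grows as either direction becomes less certain.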

To further enhance feature representation, the self-distillation module encourages knowledge transfer across different stages. Specifically, deeper features are used to guide shallower ones. Using the transformed features from stage 1 and stage 2, the self-distillation loss is defined as:

(12)

where the stop-gradient operation blocks gradient flow through the teacher feature. In this work, the number of supervised feature levels is fixed as N = 2, corresponding to stage 1 and stage 2.

To ensure balanced expert utilization and avoid expert collapse, we follow standard practices and introduce a load-balancing auxiliary loss:

(13)

where f_i is the fraction of tokens assigned to expert i, and p_i is the average gate probability for that expert.
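Given f_i and p_i as defined above, one widely used form of the auxiliary loss multiplies the number of experts by the sum of the products f_i · p_i. The sketch below uses that form as an assumption; the exact scaling used in this work is not stated here, and `load_balance_loss` is a hypothetical name.

```python
def load_balance_loss(assignments, gate_probs, num_experts):
    """Auxiliary loss encouraging uniform expert utilization.

    assignments[t]: index of the expert chosen for token t (top-1 routing).
    gate_probs[t][i]: gate probability of expert i for token t.
    """
    tokens = len(assignments)
    if tokens == 0 or len(gate_probs) != tokens:
        raise ValueError("need one assignment and one gate row per token")
    # f_i: fraction of tokens routed to expert i.
    f = [assignments.count(i) / tokens for i in range(num_experts)]
    # p_i: average gate probability assigned to expert i.
    p = [sum(row[i] for row in gate_probs) / tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))
```

Under this form, perfectly uniform routing over four experts gives a value of 1, while routing every token to a single expert with full confidence gives 4, so minimizing the term pushes the router away from expert collapse.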

The total training loss combines these components with three weighting factors, formulated as:

(14)

In our experiments, the weights on the CTC, SGM, and self-distillation losses are set to 0.1, 1.0, and 0.5, respectively, which balances the contributions of alignment supervision, linguistic guidance, and self-distillation.
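The weighted combination can be written out as a short function. The weights 0.1, 1.0, and 0.5 are the reported values; the weight on the load-balancing term is not stated, so `lb_weight` below is a hypothetical parameter that defaults to zero, and `total_loss` is an illustrative name.

```python
def total_loss(l_ctc, l_sgm, l_sd, l_lb=0.0,
               weights=(0.1, 1.0, 0.5), lb_weight=0.0):
    """Weighted sum of the training losses.

    weights: factors for the CTC, SGM, and self-distillation losses.
    lb_weight: assumed factor for the load-balancing loss (not reported).
    """
    w_ctc, w_sgm, w_sd = weights
    return w_ctc * l_ctc + w_sgm * l_sgm + w_sd * l_sd + lb_weight * l_lb
```

For example, loss values of 2.0, 1.0, and 4.0 for the CTC, SGM, and self-distillation terms combine to 0.2 + 1.0 + 2.0 = 3.2.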

4. Experiments

To evaluate the effectiveness of the enhanced SVTRv2X, we conducted scene text recognition experiments on several benchmark datasets. We compared our proposed approach to state-of-the-art models, including baseline SVTR and other lightweight architectures.

4.1 Experimental setup

4.1.1 Dataset.

We evaluated our approach on the IIIT 5K-Word Dataset [39], ICDAR 2013 Dataset [40] and the Chinese Text Recognition Scene Dataset [35].

The IIIT 5K-Word Dataset (IIIT5K) is a widely used benchmark for scene text recognition, collected from Google image search. It contains 5,000 cropped word images. The dataset covers a large variety of fonts, backgrounds, distortions, and noise, making it suitable for evaluating robustness in natural scene text recognition.

The ICDAR 2013 Dataset (ICDAR2013) is another standard benchmark introduced in the ICDAR Robust Reading Competition. It contains high-quality word images captured from real-world scenes. Compared to IIIT5K, ICDAR2013 is relatively cleaner and is often used to evaluate recognition performance under less noisy conditions.

Fig 5 and Table 1 illustrate the four subsets of the Chinese Text Recognition Scene Dataset: Scene, Web, Document, and Handwriting. A unified processing pipeline was applied across all subsets to ensure consistency. Below we provide a concise description of their sources and construction, while detailed statistics are summarized in Table 1.

Table 1. Statistics of the Chinese Text Recognition Scene Dataset.

https://doi.org/10.1371/journal.pone.0349085.t001

Fig 5. Examples of the Chinese Text Recognition Scene Dataset.

https://doi.org/10.1371/journal.pone.0349085.g005

The Scene dataset is constructed by combining several publicly available Chinese scene text benchmarks, including RCTW [41], ReCTS [42], LSVT [43], ArT [44], and CTW [45]. All images were cropped into text-line samples using the official annotations. For RCTW, only the training split was used since the test set does not provide ground-truth labels. For LSVT, we selected only the fully annotated subset. The cropped samples from all five datasets were merged and randomly divided into training, validation, and testing subsets with an 8:1:1 ratio.

The Web dataset is derived from MTWI, which contains web text images spanning 17 categories from Taobao. We cropped text-line samples from the training split and then manually partitioned them into three subsets using an 8:1:1 ratio to account for the dataset’s diverse styles and typography.

The Document dataset consists of synthetic text images generated using the Text Render engine. Text sequences of lengths 1–15 were uniformly sampled, and the corpus was compiled from open-domain sources such as Wikipedia, movie subtitles, Amazon reviews, and encyclopedic entries. The generated images were randomly divided with an 8:1:1 split.

The Handwriting dataset is based on SCUT-HCCDoc, which contains naturally captured handwritten Chinese text. Following common practice, the original training portion was further divided into training and validation subsets with a 4:1 ratio, while the official test set was retained unchanged.
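The random 8:1:1 and 4:1 partitions described above follow a simple shuffle-and-slice pattern. The sketch below is a generic illustration rather than the authors' pipeline; the function name `split_samples`, the fixed seed, and the floor-based rounding of subset sizes are assumptions.

```python
import random

def split_samples(samples, ratios=(8, 1, 1), seed=0):
    """Shuffle samples and split them into train/validation/test subsets.

    ratios: relative sizes, e.g. (8, 1, 1) or (4, 1, 0) for a 4:1 split
    in which the test set is supplied separately.
    """
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder, so no sample is dropped
    return train, val, test
```

Calling `split_samples(range(1000))` yields subsets of 800, 100, and 100 samples, and `split_samples(range(1000), ratios=(4, 1, 0))` yields 800 training and 200 validation samples with an empty test subset.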

4.1.2 Implementation details.

We implemented our model using the PyTorch framework and trained it with the Adam optimizer. The initial learning rate was set to 0.00065 and decayed using a cosine annealing schedule to ensure stable convergence. All models were trained with a batch size of 8.
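For reference, the standard closed form of cosine annealing is sketched below with the reported initial learning rate of 0.00065. The minimum learning rate of zero, the absence of warm-up, and the step-based granularity are assumptions, since these details are not given; `cosine_lr` is a hypothetical helper rather than a framework API.

```python
import math

def cosine_lr(step, total_steps, base_lr=0.00065, min_lr=0.0):
    """Learning rate at `step` under cosine annealing without restarts."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at the base value, falls to half of it at the midpoint of training, and reaches the minimum at the final step.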

All experiments were conducted on a workstation equipped with 64 GB RAM and an NVIDIA RTX3090 GPU. The training environment was built on Ubuntu 20.04, CUDA 12.1, and PyTorch 2.1, ensuring full support for mixed-precision acceleration and efficient large-scale training.

4.2 Ablation study

To quantify the contribution of the Jumble Module (JM), we conducted an ablation study by integrating it into the SVTRv2 baseline. As detailed in Table 2, this addition raises the average accuracy from 79.45% to 79.72% (+0.27 percentage points). A per-category analysis reveals the module’s core strength: enhancing spatial robustness. The largest gains are concentrated on datasets known for their geometric and stylistic irregularities. Specifically, we observe a 0.5-point absolute accuracy improvement on the Scene dataset (from 77.8% to 78.3%) and a 0.4-point improvement on the Handwriting dataset (from 62.0% to 62.4%). In contrast, its impact on the highly structured Document dataset is negligible, which is expected. To further assess its contribution within the full framework, we compare the configurations with and without JM while keeping the other modules fixed. Adding JM on top of the SDM + MoE configuration improves the average accuracy from 81.03% to 81.64% (+0.61 points), with consistent gains observed across all subsets (e.g., Scene: +1.1, Web: +0.6, Handwriting: +0.8). These results indicate that JM serves as a complementary module that enhances robustness by introducing controlled spatial perturbations. While its standalone improvement is smaller than that of SDM, its contribution becomes more pronounced when combined with other modules, leading to a more generalized and stable model.

Table 2. Ablation study results on different Text Recognition Scenarios.

https://doi.org/10.1371/journal.pone.0349085.t002

The effectiveness of the Self-Distillation Module (SDM) is demonstrated by a clear performance gain across the board. Integrating SDM raises the average accuracy by 1.36 percentage points, from 79.45% to 80.81%, making it the most impactful single component of our framework. This module acts as a universal feature enhancer, with its benefits being particularly pronounced on the more challenging data categories. The largest improvement is seen on the Handwriting dataset, where accuracy rises by 2.3 points (from 62.0% to 64.3%). Similarly, the Scene dataset gains 2.0 points (from 77.8% to 79.8%). These results suggest that the internal knowledge transfer from deeper “teacher” layers to shallower “student” layers enriches the feature representations at every stage. This process not only improves final prediction accuracy but also acts as an implicit regularizer, leading to a more generalized model without incurring any additional inference overhead.

We investigate the impact of replacing the standard dense Feed-Forward Networks (FFNs) with our proposed Mixture of Experts (MoE) layers. The MoE module raises the average accuracy from 79.45% to 80.05% (+0.60 percentage points). The strength of the MoE architecture lies in its ability to increase model capacity efficiently, allowing for specialization. This is reflected in the balanced improvements on the two most diverse and challenging datasets: Scene and Handwriting accuracy each increase by 1.0 point (to 78.8% and 63.0%, respectively). This outcome supports our assertion that the “divide-and-conquer” approach, where different experts learn to handle distinct data patterns (e.g., varied fonts, noise, or styles), provides an advantage over a monolithic FFN. The MoE module enhances the model’s expressive power and its ability to generalize to complex text instances, all while maintaining computational efficiency.
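The gains quoted in this ablation can be recomputed directly from the reported average accuracies. The snippet below uses only numbers stated in the text; the variable names are illustrative.

```python
# Average accuracies (%) reported for the ablation configurations.
baseline = 79.45
single_module_avg = {"JM": 79.72, "SDM": 80.81, "MoE": 80.05}

# Gain of each module when added alone to the SVTRv2 baseline.
gains = {name: round(avg - baseline, 2) for name, avg in single_module_avg.items()}

# Gain from adding JM on top of the SDM + MoE configuration.
jm_on_top = round(81.64 - 81.03, 2)

# Gain of the full model over the baseline.
full_gain = round(81.64 - baseline, 2)

print(gains, jm_on_top, full_gain)
```

The three standalone gains are 0.27, 1.36, and 0.60 points, and their sum (2.23) is close to the 2.19-point gain of the full model, which suggests the modules contribute largely independently.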

4.3 Comparison with state-of-the-art methods

In this study, we perform extensive comparative experiments with existing methods to validate the effectiveness of the proposed approach. Furthermore, the model size settings are aligned with those of SVTRv2, ensuring fairness and consistency in the experimental comparisons.

According to the performance results summarized in Table 3, the SVTRv2X series achieves leading accuracy across different Chinese text recognition scenarios. The smallest variant, SVTRv2X-T, reaches an average accuracy of 81.64%, surpassing strong baseline models such as DPTR (80.73%), CDistNet (80.58%), and CPPD (78.55%), and coming within 0.10 points of IGTR (81.74%). Scaling up to SVTRv2X-S and SVTRv2X-B, the average accuracy further increases to 83.25% and 84.68%, respectively, establishing a new state of the art on the benchmark. These results demonstrate that the proposed enhancement modules (Jumble, MoE, and self-distillation) consistently strengthen feature representations and improve performance across diverse text categories.

Table 3. Performance comparison on different Chinese Text Recognition Scenarios (%).

https://doi.org/10.1371/journal.pone.0349085.t003

In the Scene category, which is widely considered the most challenging due to complex backgrounds, occlusion, illumination variation, and geometric distortions, SVTRv2X-T achieves 81.1% accuracy, already higher than CDistNet (80.0%), CPPD (78.4%), and DPTR (80.0%). Scaling to SVTRv2X-S and SVTRv2X-B, the accuracy increases to 83.4% and 84.9%, respectively. This indicates that the Jumble module effectively enhances robustness against spatial perturbations, while the MoE module enables specialization for complex visual patterns, collectively boosting generalization in natural scenes.

For Web images, SVTRv2X achieves 81.2%, 83.5%, and 85.0% for the T, S, and B variants. All three variants surpass CDistNet (79.5%) and CPPD (79.3%), and the S and B variants also surpass IGTR (81.7%). This highlights the importance of the self-distillation module, which improves consistency within the latent feature space and enhances robustness across heterogeneous text styles.

In the Document category, all models perform relatively well due to the clean and structured nature of document text. SVTRv2X-T and SVTRv2X-S achieve 99.4%, while SVTRv2X-B slightly improves to 99.5%. All three variants exceed CPPD (98.9%) and CDistNet (99.1%), and SVTRv2X-B matches the highest score, that of IGTR (99.5%). This suggests that the enhancement modules maintain stable feature representations even in less challenging scenarios.

In the Handwriting category, characterized by diverse writing styles and irregular strokes, SVTRv2X-T, S, and B achieve 64.9%, 66.7%, and 69.3%, respectively, outperforming DPTR (64.4%), CDistNet (63.7%), CPPD (57.6%), and IGTR (63.8%). The MoE module’s expert specialization effectively handles stroke-level variability, while the Jumble module contributes spatial robustness, together enhancing the model’s capacity to generalize to irregular handwritten text.
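As a quick consistency check on the figures quoted above, the sketch below recomputes each variant's average from its four per-category accuracies. It assumes that the reported average is the unweighted mean of Scene, Web, Document, and Handwriting, which the text does not state explicitly; the variable names are illustrative.

```python
# Per-category accuracies (%) as quoted in the text: Scene, Web, Document, Handwriting.
per_category = {
    "T": [81.1, 81.2, 99.4, 64.9],
    "S": [83.4, 83.5, 99.4, 66.7],
    "B": [84.9, 85.0, 99.5, 69.3],
}
# Average accuracies (%) as quoted in the text.
reported_average = {"T": 81.64, "S": 83.25, "B": 84.68}


def unweighted_mean(values):
    """Plain arithmetic mean (assumed averaging scheme)."""
    return sum(values) / len(values)


for variant, scores in per_category.items():
    recomputed = unweighted_mean(scores)
    gap = abs(recomputed - reported_average[variant])
    print(f"SVTRv2X-{variant}: recomputed {recomputed:.3f}, "
          f"reported {reported_average[variant]:.2f}, gap {gap:.3f}")
```

Under this assumption the recomputed means agree with the reported averages to within about 0.01 percentage points, a gap that would be consistent with rounding of the per-category values.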

Table 4 presents the performance comparison of our proposed SVTRv2X models against several state-of-the-art STR methods on standard Latin-based datasets, including IIIT5K and ICDAR2013. From the results, it can be observed that the SVTRv2X series consistently achieves competitive or superior performance across different model scales.

Table 4. Performance comparison on common datasets (%).

https://doi.org/10.1371/journal.pone.0349085.t004

Specifically, SVTRv2X-T achieves 99.0% accuracy on IIIT5K and 98.2% on ICDAR2013, comparable to previous top-performing models such as CPPD and IGTR-AR. Scaling up to SVTRv2X-S, the model reaches 99.4% on IIIT5K and 98.8% on ICDAR2013, surpassing most existing methods. The largest variant, SVTRv2X-B, achieves 99.6% on IIIT5K and 98.9% on ICDAR2013, leading to an average accuracy of 99.2%, which is on par with or slightly better than SVTRv2-B and other recent SOTA approaches such as LISTER and CPPD.

These results demonstrate that our proposed SVTRv2X framework not only maintains strong performance on Chinese STR datasets but also generalizes effectively to Latin-based text recognition, highlighting the robustness and versatility of the model across different languages and scripts.

5. Discussion

5.1 Visual comparison of Chinese Text Recognition

To further demonstrate the effectiveness of SVTRv2X, we compare its recognition results with two representative models, ABINet and TransOCR, on challenging Chinese text samples, as shown in Fig 6. In the successful cases, SVTRv2X consistently recognizes fine-grained characters with similar strokes or radicals, which are often misclassified by ABINet and TransOCR. For example, characters with subtle bifurcated strokes or nested radicals are accurately distinguished by SVTRv2X, whereas ABINet may confuse these characters due to less discriminative feature representations, and TransOCR often struggles under complex backgrounds or non-standard fonts.

Fig 6. Visual comparison of Chinese Text Recognition: SVTRv2X vs. ABINet and TransOCR on success and failure cases.

https://doi.org/10.1371/journal.pone.0349085.g006

In the failure cases, all three models struggle with extremely low-resolution images, severe motion blur, or handwritten text. However, SVTRv2X exhibits higher robustness, producing fewer errors while maintaining clear separability between characters. These visual comparisons indicate that the integration of the Jumble Module, Self-Distillation Module, and MoE Module in SVTRv2X enhances intra-class compactness and inter-class separability, resulting in more reliable recognition performance than existing state-of-the-art models.

Overall, the visual analysis demonstrates that SVTRv2X not only outperforms ABINet and TransOCR in recognizing visually similar or complex characters but also maintains more stable performance in diverse and challenging real-world scenarios.

5.2 Visualization-based analysis of similar Chinese characters

To further validate the model’s capability in fine-grained character discrimination, we conduct a dedicated analysis on two groups of visually confusing Chinese characters. The first group includes characters with highly similar bifurcated stroke structures, as shown in Fig 7(a). The second group includes characters that contain nested radicals and exhibit similar spatial layouts, as shown in Fig 7(b). For each group, we visualize the feature embeddings generated by SVTRv2 and SVTRv2X using t-SNE.

Fig 7. t-SNE visualization of fine-grained character clusters for baseline and SVTRv2X.

https://doi.org/10.1371/journal.pone.0349085.g007

As shown in the baseline visualizations, SVTRv2 produces scattered and overlapping clusters, indicating that similar characters are not well separated in the feature space. In contrast, SVTRv2X yields significantly more compact and clearly separated clusters, even for characters with extremely subtle structural differences. This improvement arises from our three modules: the Jumble Module enhances robustness to local stroke perturbations, the Self-Distillation Module strengthens early feature representation, and the MoE Module provides specialized processing for different character patterns.

These visual improvements align with the quantitative results, confirming that SVTRv2X substantially enhances discriminability in similar-character scenarios. The model not only reduces feature confusion within clusters but also enlarges inter-character boundaries, thereby improving recognition accuracy for fine-grained Chinese text.
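The notions of intra-class compactness and inter-class separability used in this analysis can be made measurable with a simple ratio. The sketch below is one possible way to do so and is not a metric reported in the paper: the function `separability_ratio` and the toy 2-D points are hypothetical and stand in for t-SNE embeddings of two character classes.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def separability_ratio(clusters):
    """Mean distance to own centroid divided by the smallest centroid gap.

    Lower values mean tighter, better separated clusters.
    Requires at least two clusters.
    """
    cents = {label: centroid(pts) for label, pts in clusters.items()}
    intra = [math.dist(p, cents[label]) for label, pts in clusters.items() for p in pts]
    mean_intra = sum(intra) / len(intra)
    labels = list(cents)
    min_inter = min(
        math.dist(cents[a], cents[b])
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    )
    return mean_intra / min_inter


# Toy embeddings (hypothetical values, not taken from Fig 7).
scattered = {
    "char_a": [(0, 0), (2, 0), (0, 2), (2, 2)],
    "char_b": [(2, 1), (4, 1), (2, 3), (4, 3)],
}
compact = {
    "char_a": [(0.9, 1.0), (1.1, 1.0), (1.0, 0.9), (1.0, 1.1)],
    "char_b": [(2.9, 2.0), (3.1, 2.0), (3.0, 1.9), (3.0, 2.1)],
}
print("scattered:", round(separability_ratio(scattered), 3))
print("compact:  ", round(separability_ratio(compact), 3))
```

A lower ratio for one set of embeddings than for another, computed on the same characters, would correspond to the tighter and better separated clusters described above.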

5.3 Statistical significance analysis

To further validate the reliability and stability of SVTRv2X, we conducted multiple runs with different random seeds on the Latin-based IIIT5K and ICDAR2013 datasets and on the Chinese CTR benchmark. Table 5 reports the mean accuracy and standard deviation over five runs for each variant of SVTRv2X.

Table 5. Statistical significance testing on Latin-based STR datasets (% mean ± std over 5 runs).

https://doi.org/10.1371/journal.pone.0349085.t005

The results indicate that SVTRv2X achieves highly consistent performance across all datasets. The small standard deviations confirm that the observed gains are stable and reproducible, rather than being influenced by random initialization or stochastic training variations.
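For readers who want to reproduce this kind of summary, the sketch below shows how a mean ± standard deviation entry can be computed from per-seed accuracies. The five accuracy values are placeholders chosen for illustration and are not results from Table 5; the use of the sample standard deviation (n − 1 denominator) is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    return mean(runs), stdev(runs)


def format_mean_std(runs, digits=2):
    """Format accuracies as 'mean ± std' with the given number of decimals."""
    m, s = summarize(runs)
    return f"{m:.{digits}f} ± {s:.{digits}f}"


# Placeholder accuracies (%) for five seeds; hypothetical, NOT from the paper.
example_runs = [99.0, 99.1, 98.9, 99.0, 99.0]
print(format_mean_std(example_runs))
```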

Specifically, SVTRv2X-T already demonstrates strong performance with mean accuracies of 99.0% on IIIT5K, 98.2% on ICDAR2013, and 81.64% on CTR. Scaling to SVTRv2X-S and SVTRv2X-B leads to further improvements, reaching 99.4% / 98.8% / 83.25% and 99.6% / 98.9% / 84.68%, respectively. These trends highlight the consistent benefit of the proposed enhancement modules—Jumble, Mixture-of-Experts, and self-distillation—in improving both accuracy and robustness across different text recognition scenarios.

This statistical significance analysis confirms that the performance improvements of SVTRv2X are not only substantial but also statistically reliable, reinforcing the robustness and generalizability of our framework on multilingual and Latin-based STR datasets.

5.4 Model complexity analysis

To verify the claim of efficiency preservation, we evaluated both the number of trainable parameters and the computational cost (FLOPs) of SVTRv2X variants compared to the baseline SVTRv2 models. The analysis focuses on the impact of the MoE module, which enhances model capacity while aiming to maintain efficiency.

Table 6 summarizes the comparison of model complexity between the baseline SVTRv2-B and the proposed SVTRv2X-B. SVTRv2-B has 19.2M parameters, whereas SVTRv2X-B contains 55.2M parameters, roughly a 2.9× increase. The increase is mainly due to the addition of the MoE modules, which introduce multiple expert layers and auxiliary networks to enhance feature representation. Despite this substantial increase in parameters, these modules enable richer feature extraction and model specialization, which is critical for handling diverse text patterns and challenging scenarios.

Table 6. Parameter and FLOPs comparison between SVTRv2 and SVTRv2X variants.

https://doi.org/10.1371/journal.pone.0349085.t006

In terms of computational cost, the floating-point operations only slightly increase from 8.22G for SVTRv2-B to 8.33G for SVTRv2X-B, a marginal 1.3% overhead. This indicates that although the model size is significantly larger, the MoE design effectively limits additional computation by selectively activating expert layers during forward propagation. As a result, SVTRv2X-B maintains inference efficiency comparable to SVTRv2-B while benefiting from increased capacity.
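The gap between parameter growth and FLOPs growth follows from how sparse routing works: parameters scale with the number of experts, while per-token computation scales only with the number of experts that are activated. The sketch below illustrates this with a back-of-the-envelope count for a single FFN layer and then recomputes the two ratios from the figures quoted above. The layer sizes, the number of experts, and top-1 routing are hypothetical values, since the text does not give them.

```python
def ffn_params(d_model, d_hidden):
    """Weights of a two-layer FFN (biases ignored)."""
    return 2 * d_model * d_hidden


def moe_params(d_model, d_hidden, num_experts):
    """All experts plus a linear router with one output per expert."""
    return num_experts * ffn_params(d_model, d_hidden) + d_model * num_experts


def ffn_macs_per_token(d_model, d_hidden):
    """Multiply-accumulates for one token through a dense FFN."""
    return 2 * d_model * d_hidden


def moe_macs_per_token(d_model, d_hidden, num_experts, top_k):
    """Only top_k experts run per token; the router always runs."""
    return top_k * ffn_macs_per_token(d_model, d_hidden) + d_model * num_experts


# Hypothetical layer configuration (not taken from the paper).
d_model, d_hidden, num_experts, top_k = 256, 1024, 8, 1
param_growth = moe_params(d_model, d_hidden, num_experts) / ffn_params(d_model, d_hidden)
mac_growth = (moe_macs_per_token(d_model, d_hidden, num_experts, top_k)
              / ffn_macs_per_token(d_model, d_hidden))
print(f"single layer: params x{param_growth:.2f}, MACs x{mac_growth:.4f}")

# Whole-model ratios recomputed from the numbers quoted in the text.
reported_param_ratio = 55.2 / 19.2
reported_flops_overhead = (8.33 - 8.22) / 8.22
print(f"reported: params x{reported_param_ratio:.2f}, "
      f"FLOPs +{100 * reported_flops_overhead:.1f}%")
```

In this toy configuration the layer's parameters grow roughly eightfold while the per-token cost grows by well under 1%, which mirrors the qualitative pattern reported for SVTRv2X-B.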

5.5 Design analysis of SDM

We further analyze two key design choices in the proposed SDM, including the use of 3×3 convolution and the two-stage training strategy. The 3×3 convolution is adopted to enhance local spatial interactions among neighboring patches, providing a better trade-off between representation capability and computational efficiency compared with 1×1 (no spatial modeling) and larger kernels such as 5×5 (higher cost and potential noise). For training, we employ a two-stage strategy where the model is first trained without the self-distillation loss and then fine-tuned with it. This is because early-stage features are often unstable and may introduce noisy supervision if distillation is applied too early. Delaying the distillation allows the model to learn more reliable representations, leading to improved performance and more stable convergence.
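The cost side of the kernel-size trade-off can be quantified directly, because the weight count and multiply-accumulate count of a standard convolution grow with the square of the kernel size. The sketch below compares 1×1, 3×3, and 5×5 kernels; the channel width and feature-map size are hypothetical values chosen for illustration, not the SDM's actual configuration.

```python
def conv2d_params(kernel, c_in, c_out, bias=True):
    """Weights (and optional biases) of a standard 2-D convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def conv2d_macs(kernel, c_in, c_out, out_h, out_w):
    """Multiply-accumulates for one forward pass over an out_h x out_w map."""
    return kernel * kernel * c_in * c_out * out_h * out_w


# Hypothetical layer: 128 channels in and out, 8 x 32 feature map.
channels, out_h, out_w = 128, 8, 32
base_macs = conv2d_macs(1, channels, channels, out_h, out_w)
for k in (1, 3, 5):
    params = conv2d_params(k, channels, channels)
    macs = conv2d_macs(k, channels, channels, out_h, out_w)
    print(f"{k}x{k}: {params:,} params, {macs / base_macs:.0f}x the MACs of 1x1")
```

A 3×3 kernel therefore costs 9 times as much as a 1×1 kernel and a 5×5 kernel costs 25 times as much, which gives a sense of why 3×3 sits in the middle of the trade-off described above.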

The effectiveness of the proposed designs can be further validated through visualization results. As shown in Fig 8, compared with alternative structures, the use of 3×3 convolution leads to higher recognition accuracy, demonstrating its superiority in modeling local spatial information and enhancing feature representation.

Fig 8. Comparison of different convolutional designs.

https://doi.org/10.1371/journal.pone.0349085.g008

As illustrated in Fig 9, the training curves show that after introducing the self-distillation loss in the second stage, the overall loss is further reduced. This indicates that the proposed two-stage training strategy provides more reliable supervision and facilitates model optimization.

Fig 9. Effect of two-stage training strategy.

https://doi.org/10.1371/journal.pone.0349085.g009

These results demonstrate that the proposed designs not only improve recognition performance but also lead to lower training loss and better convergence behavior during optimization.
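The two-stage strategy analyzed in this subsection can be sketched as a simple loss schedule: the distillation term is switched off during the first stage and switched on for fine-tuning. This is a minimal illustration under assumptions, not the training code of SVTRv2X: the additive combination of a CTC loss and a distillation loss, the weight `lam`, and the epoch boundary `stage1_epochs` are all hypothetical.

```python
def distill_weight(epoch, stage1_epochs, lam=1.0):
    """Weight of the self-distillation loss: 0 in stage 1, lam in stage 2."""
    return 0.0 if epoch < stage1_epochs else lam


def total_loss(ctc_loss, distill_loss, epoch, stage1_epochs, lam=1.0):
    """Recognition loss plus the scheduled distillation term (assumed additive)."""
    return ctc_loss + distill_weight(epoch, stage1_epochs, lam) * distill_loss


# Example with placeholder loss values: stage 1 covers epochs 0-9.
for epoch in (0, 9, 10, 15):
    loss = total_loss(ctc_loss=2.0, distill_loss=0.5, epoch=epoch,
                      stage1_epochs=10, lam=0.5)
    print(f"epoch {epoch:2d}: total loss {loss:.2f}")
```

Keeping the schedule as a separate function makes it straightforward to swap the hard switch for a gradual warm-up if that turns out to be more stable.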

6. Conclusion

In this paper, we proposed SVTRv2X, a novel and enhanced framework for scene text recognition, built upon the strong SVTRv2 baseline. Our primary contribution is the integration of three synergistic modules: the Jumble Module (JM), the Self-Distillation Module (SDM), and the Mixture of Experts (MoE) module. The Jumble Module bolsters the model’s resilience to spatial distortions, the Self-Distillation Module enriches feature representations via internal knowledge transfer, and the MoE module efficiently scales model capacity to handle diverse text patterns with a “divide-and-conquer” approach. As demonstrated through extensive experiments, SVTRv2X sets a new state-of-the-art on benchmark datasets, significantly outperforming previous methods while preserving the high inference speed characteristic of CTC-based models. Future research directions could include exploring more sophisticated MoE routing algorithms (e.g., with load balancing) and applying this modular enhancement paradigm to other visual recognition tasks. We believe SVTRv2X provides a powerful and effective blueprint for building high-performance yet efficient recognition models, and we hope our work will inspire further research in the scene text recognition field.

Supporting information

S1 File. Original Code.

https://doi.org/10.1371/journal.pone.0349085.s001

(ZIP)

S1 Appendix.

https://doi.org/10.1371/journal.pone.0349085.s002

(DOCX)

References

1. Long S, He X, Yao C. Scene text detection and recognition: the deep learning era. Int J Comput Vis. 2020;129(1):161–84.
2. Chen X, Jin L, Zhu Y, Luo C, Wang T. Text recognition in the wild. ACM Comput Surv. 2021;54(2):1–35.
3. Zhai C, Chen Z, Li J, Xu B. Chinese image text recognition with BLSTM-CTC: a segmentation-free method. Proceedings of the Chinese Conference on Pattern Recognition; 2016. p. 525–36.
4. Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2017;39(11):2298–304. pmid:28055850
5. Du Y, Chen Z, Jia C, Yin X, Zheng T, Li C, et al. SVTR: scene text recognition with a single visual model. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI); 2022. p. 884–90.
6. Du Y, Chen Z, Xie H, Jia C, Jiang Y-G. SVTRv2: CTC beats encoder-decoder models in scene text recognition. 2025 IEEE/CVF International Conference on Computer Vision (ICCV); 2025. p. 20147–56.
7. Li L, Cherouat A, Snoussi H, Wang T. Grasping with occlusion-aware ally method in complex scenes. IEEE Trans Automat Sci Eng. 2025;22:5944–54.
8. Eli E, Wang D, Xu W, Mamat H, Aysa A, Ubul K. A comprehensive review of non-Latin natural scene text detection and recognition techniques. Eng Appl Artif Intell. 2025;156:111107.
9. Song W, Ye Z, Sun M, Hou X, Li S, Hao A. AttriDiffuser: adversarially enhanced diffusion model for text-to-facial attribute image synthesis. Pattern Recognit. 2025;163:111447.
10. Phan TQ, Shivakumara P, Tian S, Tan CL. Recognizing text with perspective distortion in natural scenes. Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2013. p. 569–76.
11. Mishra A, Alahari K, Jawahar CV. Top-down and bottom-up cues for scene text recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2012. p. 2687–94.
12. Huang W, Qiao Y, Tang X. Robust scene text detection with convolutional neural networks induced MSER trees. Proceedings of the European Conference on Computer Vision (ECCV); 2014.
13. Zhan F, Lu S, Xue C. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 249–66.
14. Liao M, Shi B, Bai X, Wang X, Liu W. TextBoxes: a fast text detector with a single deep neural network. AAAI. 2017;31(1).
15. He P, Huang W, Qiao Y, Loy C, Tang X. Reading scene text in deep convolutional sequences. AAAI. 2016;30(1).
16. Lee C-Y, Osindero S. Recursive recurrent nets with attention modeling for OCR in the Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 2231–9.
17. Aberdam A, Bensaïd D, Golts A, Ganz R, Nuriel O, Tichauer R, et al. CLIPTER: looking at the bigger picture in scene text recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 21649–60.
18. Xue C, Huang J, Zhang W, Lu S, Wang C, Bai S. Image-to-character-to-word transformers for accurate scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2023;45(11):12908–21. pmid:37022831
19. Yu H, Wang X, Li B, Xue X. Chinese text recognition with a pre-trained CLIP-like model through image-IDS aligning. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 11909–18.
20. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):1–48.
21. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [Preprint]. 2014. http://arxiv.org/abs/1409.1556
22. Cubuk ED, Zoph B, Mané D, Vasudevan V, Le QV. AutoAugment: learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019. p. 113–23.
23. Cubuk ED, Zoph B, Shlens J, Le QV. Randaugment: practical automated data augmentation with a reduced search space. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2020. p. 3008–17.
24. Muller SG, Hutter F. TrivialAugment: tuning-free yet state-of-the-art data augmentation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021. p. 754–62.
25. DeVries T, Taylor GW. Improved regularization of convolutional neural networks with Cutout. arXiv:1708.04552 [Preprint]. 2017. http://arxiv.org/abs/1708.04552
26. Lopes RG, Yin D, Poole B, Gilmer J, Cubuk ED. Improving robustness without sacrificing accuracy with Patch Gaussian augmentation. arXiv:1906.02611 [Preprint]. 2019. http://arxiv.org/abs/1906.02611
27. Zhong Z, Zheng L, Kang G, Li S, Yang Y. Random erasing data augmentation. AAAI. 2020;34(07):13001–8.
28. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. Mixup: beyond empirical risk minimization. arXiv:1710.09412 [Preprint]. 2017. http://arxiv.org/abs/1710.09412
29. Verma V, Lamb A, Beckham C, Najafi A, Mitliagkas I, Lopez-Paz D, et al. Manifold Mixup: better representations by interpolating hidden states. Proceedings of the International Conference on Machine Learning (ICML); 2019. p. 6438–47.
30. Navarro M, Little C, Allen GI, Segarra S. Data augmentation via subgroup Mixup for improving fairness. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2024. p. 7350–4.
31. Yun S, Han D, Oh SJ, Chun S, Choe J, Yoo Y. CutMix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019. p. 6023–32.
32. Kim JH, Choo W, Song HO. Puzzle Mix: exploiting saliency and local statistics for optimal Mixup. Proceedings of the International Conference on Machine Learning (ICML); 2020. p. 5275–85.
33. Uddin AFM, Monira M, Shin W, Chung T, Bae SH. SaliencyMix: a saliency guided data augmentation strategy for better regularization. arXiv:2006.01791 [Preprint]. 2020. http://arxiv.org/abs/2006.01791
34. Liu Z, Li S, Wu D, Liu Z, Chen Z, Wu L, et al. AutoMix: unveiling the power of Mixup for stronger classifiers. Proceedings of the European Conference on Computer Vision (ECCV); 2022. p. 441–58.
35. Yu H, Chen J, Li B, Ma J, Guan M, Xu X, et al. Benchmarking Chinese text recognition: datasets, baselines, and an empirical study. arXiv:2112.15093 [Preprint]. 2021. http://arxiv.org/abs/2112.15093
36. Zhang D, Li B, Wang F, Zhao Z, Gao J, Li X. Cross-attention multi-scale state space model for remaining useful life prediction of aircraft engines. Adv Eng Inform. 2026;69:103817.
37. Zhang D, Wang F, Li B, Zhao Z, Gao J, Li X. KAID: knowledge-aware interactive distillation for vision-language models. Proceedings of the 33rd ACM International Conference on Multimedia; 2025. p. 3212–21.
38. Graves A. Connectionist temporal classification. In: Supervised sequence labelling with recurrent neural networks; 2012. p. 61–93.
39. Mishra A, Alahari K, Jawahar CV. Scene text recognition using higher order language priors. BMVC; 2012.
40. Lucas SM, Panaretos A, Sosa L, Tang A, Wong S, Young R. ICDAR 2003 robust reading competitions: entries, results and future directions. Int J Doc Anal Recognit. 2003;7:105–22.
41. Shi B, Yao C, Liao M, Yang M, Xu P, Cui L, et al. ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), vol. 1; 2017. p. 1429–34.
42. Zhang R, Yang M, Bai X, Shi B, Karatzas D, Lu S, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); 2019. p. 1577–81.
43. Sun Y, Karatzas D, Chan CS, Jin L, Ni Z, Chng C-K, et al. ICDAR 2019 competition on large-scale street view text with partial labeling - RRC-LSVT. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); 2019. p. 1557–62.
44. Chng CK, Ding E, Liu J, Karatzas D, Chan CS, Jin L, et al. ICDAR2019 robust reading challenge on arbitrary-shaped text - RRC-ArT. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); 2019. p. 1571–6.
45. Yuan T-L, Zhu Z, Xu K, Li C-J, Mu T-J, Hu S-M. A large Chinese text dataset in the wild. J Comput Sci Technol. 2019;34(3):509–21.
46. Shi B, Yang M, Wang X, Lyu P, Yao C, Bai X. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans Pattern Anal Mach Intell. 2019;41(9):2035–48. pmid:29994467
47. Luo C, Jin L, Sun Z. MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognit. 2019;90:109–18.
48. Li H, Wang P, Shen C, Zhang G. Show, attend and read: a simple and strong baseline for irregular text recognition. AAAI. 2019;33(01):8610–7.
49. Qiao Z, Zhou Y, Yang D, Zhou Y, Wang W. SEED: semantics enhanced encoder-decoder framework for scene text recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020. p. 13525–34.
50. Lu N, Yu W, Qi X, Chen Y, Gong P, Xiao R, et al. MASTER: multi-aspect non-local network for scene text recognition. Pattern Recognit. 2021;117:107980.
51. Fang S, Xie H, Wang Y, Mao Z, Zhang Y. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 7094–103.
52. Chen J, Li B, Xue X. Scene text telescope: text-focused scene image super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 12021–30.
53. Zhang Z, Lu N, Liao M, Huang Y, Li C, Wang M, et al. Self-distillation regularized connectionist temporal classification loss for text recognition: a simple yet effective approach. AAAI. 2024;38(7):7441–9.
54. Yang M, Yang B, Liao M, Zhu Y, Bai X. Class-aware mask-guided feature refinement for scene text recognition. Pattern Recognit. 2024;149:110244.
55. Cheng C, Wang P, Da C, Zheng Q, Yao C. LISTER: neighbor decoding for length-insensitive scene text recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 19484–94.
56. Zhao S, Du Y, Chen Z, Jiang Y-G. Decoder pre-training with only text for scene text recognition. Proceedings of the 32nd ACM International Conference on Multimedia; 2024. p. 5191–200.
57. Zheng T, Chen Z, Fang S, Xie H, Jiang Y-G. CDistNet: perceiving multi-domain character distance for robust text recognition. Int J Comput Vis. 2023;132(2):300–18.
58. Du Y, Chen Z, Jia C, Yin X, Li C, Du Y, et al. Context perception parallel decoder for scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2025;47(6):4668–83. pmid:40031666
59. Du Y, Chen Z, Su Y, Jia C, Jiang Y-G. Instruction-guided scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2025;47(4):2723–38. pmid:40030880

[ref1] 1. Long S, He X, Yao C. Scene text detection and recognition: the deep learning era. Int J Comput Vis. 2020;129(1):161–84.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Chen X, Jin L, Zhu Y, Luo C, Wang T. Text recognition in the wild. ACM Comput Surv. 2021;54(2):1–35.
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref3] 3. Zhai C, Chen Z, Li J, Xu B. Chinese image text recognition with BLSTM-CTC: a segmentation-free method. Proceedings of the Chinese Conference on Pattern Recognition; 2016. p. 525–36.

[ref4] 4. Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2017;39(11):2298–304. pmid:28055850
View Article
PubMed/NCBI
Google Scholar

[9] View Article

[10] PubMed/NCBI

[11] Google Scholar

[ref5] 5. Du Y, Chen Z, Jia C, Yin X, Zheng T, Li C, et al. SVTR: scene text recognition with a single visual model. Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI); 2022. p. 884–90.

[ref6] 6. Du Y, Chen Z, Xie H, Jia C, Jiang Y-G. SVTRv2: CTC beats encoder-decoder models in scene text recognition. 2025 IEEE/CVF International Conference on Computer Vision (ICCV); 2025. p. 20147–56.

[ref7] 7. Li L, Cherouat A, Snoussi H, Wang T. Grasping with occlusion-aware ally method in complex scenes. IEEE Trans Automat Sci Eng. 2025;22:5944–54.
View Article
Google Scholar

[15] View Article

[16] Google Scholar

[ref8] 8. Eli E, Wang D, Xu W, Mamat H, Aysa A, Ubul K. A comprehensive review of non-Latin natural scene text detection and recognition techniques. Eng Appl Artif Intell. 2025;156:111107.
View Article
Google Scholar

[18] View Article

[19] Google Scholar

[ref9] 9. Song W, Ye Z, Sun M, Hou X, Li S, Hao A. AttriDiffuser: adversarially enhanced diffusion model for text-to-facial attribute image synthesis. Pattern Recognit. 2025;163:111447.
View Article
Google Scholar

[21] View Article

[22] Google Scholar

[ref10] 10. Phan TQ, Shivakumara P, Tian S, Tan CL. Recognizing text with perspective distortion in natural scenes. Proceedings of the IEEE International Conference on Computer Vision (ICCV); 2013. p. 569–76.

[ref11] 11. Mishra A, Alahari K, Jawahar CV. Top-down and bottom-up cues for scene text recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2012. p. 2687–94.

[ref12] 12. Huang W, Qiao Y, Tang X. Robust scene text detection with convolutional neural networks induced MSER trees. Proceedings of the European Conference on Computer Vision (ECCV); 2014.

[ref13] 13. Zhan F, Lu S, Xue C. Verisimilar image synthesis for accurate detection and recognition of texts in scenes. Proceedings of the European Conference on Computer Vision (ECCV); 2018. p. 249–66.

[ref14] 14. Liao M, Shi B, Bai X, Wang X, Liu W. TextBoxes: a fast text detector with a single deep neural network. AAAI. 2017;31(1).
View Article
Google Scholar

[28] View Article

[29] Google Scholar

[ref15] 15. He P, Huang W, Qiao Y, Loy C, Tang X. Reading scene text in deep convolutional sequences. AAAI. 2016;30(1).
View Article
Google Scholar

[31] View Article

[32] Google Scholar

[ref16] 16. Lee C-Y, Osindero S. Recursive recurrent nets with attention modeling for OCR in the Wild. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); 2016. p. 2231–9.

[ref17] 17. Aberdam A, Bensaïd D, Golts A, Ganz R, Nuriel O, Tichauer R, et al. CLIPTER: looking at the bigger picture in scene text recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 21649–60.

[ref18] 18. Xue C, Huang J, Zhang W, Lu S, Wang C, Bai S. Image-to-character-to-word transformers for accurate scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2023;45(11):12908–21. pmid:37022831
View Article
PubMed/NCBI
Google Scholar

[36] View Article

[37] PubMed/NCBI

[38] Google Scholar

[ref19] 19. Yu H, Wang X, Li B, Xue X. Chinese text recognition with a pre-trained CLIP-like model through image-IDS aligning. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 11909–18.

[ref20] 20. Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):1–48.
View Article
Google Scholar

[41] View Article

[42] Google Scholar

[ref21] 21. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [Preprint]. 2014. http://arxiv.org/abs/1409.1556

[ref22] 22. Cubuk ED, Zoph B, Mané D, Vasudevan V, Le QV. AutoAugment: learning augmentation strategies from data. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2019. p. 113–23.

[ref23] 23. Cubuk ED, Zoph B, Shlens J, Le QV. Randaugment: practical automated data augmentation with a reduced search space. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW); 2020. p. 3008–17.

[ref24] 24. Muller SG, Hutter F. TrivialAugment: tuning-free yet state-of-the-art data augmentation. 2021 IEEE/CVF International Conference on Computer Vision (ICCV); 2021. p. 754–62.

[ref25] 25. DeVries T, Taylor GW. Improved regularization of convolutional neural networks with Cutout. arXiv:1708.04552 [Preprint]. 2017. http://arxiv.org/abs/1708.04552

[ref26] 26. Lopes RG, Yin D, Poole B, Gilmer J, Cubuk ED. Improving robustness without sacrificing accuracy with Patch Gaussian augmentation. arXiv:1906.02611 [Preprint]. 2019. http://arxiv.org/abs/1906.02611

[ref27] 27. Zhong Z, Zheng L, Kang G, Li S, Yang Y. Random erasing data augmentation. AAAI. 2020;34(07):13001–8.
View Article
Google Scholar

[50] View Article

[51] Google Scholar

[ref28] 28. Zhang H, Cisse M, Dauphin YN, Lopez-Paz D. Mixup: beyond empirical risk minimization. arXiv:1710.09412 [Preprint]. 2017. http://arxiv.org/abs/1710.09412

[ref29] 29. Verma V, Lamb A, Beckham C, Najafi A, Mitliagkas I, Lopez-Paz D, et al. Manifold Mixup: better representations by interpolating hidden states. Proceedings of the International Conference on Machine Learning (ICML); 2019. p. 6438-47.

[ref30] 30. Navarro M, Little C, Allen GI, Segarra S. Data augmentation via subgroup Mixup for improving fairness. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2024. p. 7350–4.

[ref31] 31. Yun S, Han D, Oh SJ, Chun S, Choe J, Yoo Y. CutMix: Regularization strategy to train strong classifiers with localizable features. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2019. p. 6023–32.

[ref32] 32. Kim JH, Choo W, Song HO. Puzzle Mix: exploiting saliency and local statistics for optimal Mixup. Proceedings of the International Conference on Machine Learning (ICML); 2020. p. 5275–85.

[ref33] 33. Uddin AFM, Monira M, Shin W, Chung T, Bae SH. SaliencyMix: a saliency guided data augmentation strategy for better regularization. arXiv:2006.01791 [Preprint]. 2020. http://arxiv.org/abs/2006.01791

[ref34] 34. Liu Z, Li S, Wu D, Liu Z, Chen Z, Wu L, et al. AutoMix: unveiling the power of Mixup for stronger classifiers. Proceedings of the European Conference on Computer Vision (ECCV); 2022. p. 441–58.

[ref35] 35. Yu H, Chen J, Li B, Ma J, Guan M, Xu X, et al. Benchmarking Chinese text recognition: datasets, baselines, and an empirical study. arXiv:2112.15093 [Preprint]. 2021. http://arxiv.org/abs/2112.15093

[ref36] 36. Zhang D, Li B, Wang F, Zhao Z, Gao J, Li X. Cross-attention multi-scale state space model for remaining useful life prediction of aircraft engines. Adv Eng Inform. 2026;69:103817.
View Article
Google Scholar

[61] View Article

[62] Google Scholar

[ref37] 37. Zhang D, Wang F, Li B, Zhao Z, Gao J, Li X. KAID: knowledge-aware interactive distillation for vision-language models. Proceedings of the 33rd ACM International Conference on Multimedia; 2025. p. 3212–21.

[ref38] 38. Graves A. Connectionist temporal classification. In: Supervised sequence labelling with recurrent neural networks; 2012. p. 61–93.

[ref39] 39. Mishra A, Alahari K, Jawahar CV. Scene text recognition using higher order language priors. BMVC; 2012.

[ref40] 40. Lucas SM, Panaretos A, Sosa L, Tang A, Wong S, Young R. ICDAR 2003 robust reading competitions: entries, results and future directions. Int J Doc Anal Recognit. 2003;7:105–22.

[ref41] 41. Shi B, Yao C, Liao M, Yang M, Xu P, Cui L, et al. ICDAR2017 competition on reading Chinese text in the wild (RCTW-17). Proceedings of the International Conference on Document Analysis and Recognition (ICDAR), vol. 1; 2017. p. 1429–34.

[ref42] 42. Zhang R, Yang M, Bai X, Shi B, Karatzas D, Lu S, et al. ICDAR 2019 robust reading challenge on reading Chinese text on signboard. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); 2019. p. 1577–81.

[ref43] 43. Sun Y, Karatzas D, Chan CS, Jin L, Ni Z, Chng C-K, et al. ICDAR 2019 competition on large-scale street view text with partial labeling - RRC-LSVT. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); 2019. p. 1557–62.

[ref44] 44. Chng CK, Ding E, Liu J, Karatzas D, Chan CS, Jin L, et al. ICDAR2019 robust reading challenge on arbitrary-shaped text - RRC-ArT. Proceedings of the International Conference on Document Analysis and Recognition (ICDAR); 2019. p. 1571–6.

[ref45] 45. Yuan T-L, Zhu Z, Xu K, Li C-J, Mu T-J, Hu S-M. A large Chinese text dataset in the wild. J Comput Sci Technol. 2019;34(3):509–21.

[ref46] 46. Shi B, Yang M, Wang X, Lyu P, Yao C, Bai X. ASTER: an attentional scene text recognizer with flexible rectification. IEEE Trans Pattern Anal Mach Intell. 2019;41(9):2035–48. pmid:29994467

[ref47] 47. Luo C, Jin L, Sun Z. MORAN: a multi-object rectified attention network for scene text recognition. Pattern Recognit. 2019;90:109–18.

[ref48] 48. Li H, Wang P, Shen C, Zhang G. Show, attend and read: a simple and strong baseline for irregular text recognition. AAAI. 2019;33(01):8610–7.

[ref49] 49. Qiao Z, Zhou Y, Yang D, Zhou Y, Wang W. SEED: semantics enhanced encoder-decoder framework for scene text recognition. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2020. p. 13525–34.

[ref50] 50. Lu N, Yu W, Qi X, Chen Y, Gong P, Xiao R, et al. MASTER: multi-aspect non-local network for scene text recognition. Pattern Recognit. 2021;117:107980.

[ref51] 51. Fang S, Xie H, Wang Y, Mao Z, Zhang Y. Read like humans: autonomous, bidirectional and iterative language modeling for scene text recognition. 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 7094–103.

[ref52] 52. Chen J, Li B, Xue X. Scene text telescope: text-focused scene image super-resolution. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 2021. p. 12021–30.

[ref53] 53. Zhang Z, Lu N, Liao M, Huang Y, Li C, Wang M, et al. Self-distillation regularized connectionist temporal classification loss for text recognition: a simple yet effective approach. AAAI. 2024;38(7):7441–9.

[ref54] 54. Yang M, Yang B, Liao M, Zhu Y, Bai X. Class-aware mask-guided feature refinement for scene text recognition. Pattern Recognit. 2024;149:110244.

[ref55] 55. Cheng C, Wang P, Da C, Zheng Q, Yao C. LISTER: neighbor decoding for length-insensitive scene text recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); 2023. p. 19484–94.

[ref56] 56. Zhao S, Du Y, Chen Z, Jiang Y-G. Decoder pre-training with only text for scene text recognition. Proceedings of the 32nd ACM International Conference on Multimedia; 2024. p. 5191–200.

[ref57] 57. Zheng T, Chen Z, Fang S, Xie H, Jiang Y-G. CDistNet: perceiving multi-domain character distance for robust text recognition. Int J Comput Vis. 2023;132(2):300–18.

[ref58] 58. Du Y, Chen Z, Jia C, Yin X, Li C, Du Y, et al. Context perception parallel decoder for scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2025;47(6):4668–83. pmid:40031666

[ref59] 59. Du Y, Chen Z, Su Y, Jia C, Jiang Y-G. Instruction-guided scene text recognition. IEEE Trans Pattern Anal Mach Intell. 2025;47(4):2723–38. pmid:40030880
