Fig 1.
Comprehensive workflow of the proposed solution, illustrating the iterative pipeline from dataset acquisition and preprocessing to model fine-tuning, validation, and the continuous feedback loop for deployment.
Table 1.
Structure of raw Binhvq News Corpus.
Table 2.
Structure of raw vi-error-correction-2.0 dataset.
Table 3.
Structure of raw OPUS Tatoeba dataset.
Fig 2.
Data acquisition flowchart detailing the automated crawling mechanism, including link validation, content extraction, HTML cleaning, and deduplication logic for constructing the raw corpus.
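The caption for Fig 2 names the crawler's core steps: link validation, content extraction, HTML cleaning, and deduplication. The minimal Python sketch below illustrates one plausible realization of that loop; it is an assumption, not the authors' implementation. The function names (is_valid_link, clean_html, crawl) are hypothetical, regex-based tag stripping stands in for whatever extraction the pipeline actually uses, and deduplication is approximated by hashing the cleaned text.

import hashlib
import re
import requests
from urllib.parse import urlparse


def is_valid_link(url: str) -> bool:
    """Link validation: accept only well-formed http(s) URLs."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_html(html: str) -> str:
    """HTML cleaning: drop scripts/styles, strip tags, collapse whitespace."""
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def crawl(urls):
    """Fetch each valid link, extract text, and deduplicate by content hash."""
    seen_hashes = set()
    corpus = []
    for url in urls:
        if not is_valid_link(url):
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        text = clean_html(html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # deduplication: drop exact duplicate pages
            continue
        seen_hashes.add(digest)
        corpus.append({"url": url, "text": text})
    return corpus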
Table 4.
Structure of raw Vietnamese crawled dataset.
Table 5.
Structure of raw English crawled dataset.
Table 6.
Structure of Binhvq News Corpus after processing.
Table 7.
Structure of vi-error-correction-2.0 dataset after processing.
Table 8.
Structure of OPUS Tatoeba dataset after processing.
Table 9.
Structure of Vietnamese crawled dataset after processing.
Table 10.
Structure of English crawled dataset after processing.
Fig 3.
Class imbalance in the Vietnamese dataset.
Table 11.
Comparison of evaluation metrics for the XLM-RoBERTa, BARTpho, and E5 models.
Fig 4.
Comparative model performance.
(a) Evaluation loss scaled relative to the maximum observed loss (E5 = 0.0906, set as 100%). Lower percentages indicate better convergence, with BARTpho achieving only 11.62% of the maximum loss. (b) Performance metrics (Accuracy, Precision, Recall, F1) scaled relative to the maximum score achieved across the models.
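The relative scaling used in Fig 4(a) is straightforward to reproduce: each model's evaluation loss is divided by the maximum observed loss (E5 = 0.0906, defined as 100%). The short Python sketch below shows the computation; the BARTpho loss is back-calculated from its stated 11.62%, and the XLM-RoBERTa value is a placeholder, since the caption does not report it.

# Relative scaling from Fig 4(a): each loss as a percentage of the
# maximum observed evaluation loss (E5 = 0.0906, set as 100%).
eval_losses = {
    "XLM-RoBERTa": 0.0500,  # placeholder; not reported in the caption
    "BARTpho": 0.01053,     # back-calculated as 11.62% of 0.0906
    "E5": 0.0906,           # maximum observed loss, defined as 100%
}

max_loss = max(eval_losses.values())
for model, loss in eval_losses.items():
    print(f"{model}: {loss / max_loss:.2%}")  # BARTpho -> 11.62%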
Table 12.
Example used for comparison across the models.