Fig 1.
Comprehensive workflow of the proposed solution, illustrating the iterative pipeline from dataset acquisition and preprocessing to model fine-tuning, validation, and the continuous feedback loop for deployment.
Table 1.
Structure of raw Binhvq News Corpus.
Table 2.
Structure of raw vi-error-correction-2.0 dataset.
Table 3.
Structure of raw OPUS Tatoeba dataset.
Fig 2.
Data acquisition flowchart detailing the automated crawling mechanism, including link validation, content extraction, HTML cleaning, and deduplication logic for constructing the raw corpus.
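The caption for Fig 2 names the crawler's core steps: link validation, content extraction, HTML cleaning, and deduplication. The minimal Python sketch below illustrates one plausible realization of that loop; it is an assumption, not the authors' implementation. The function names (is_valid_link, clean_html, crawl) are hypothetical, regex-based tag stripping stands in for whatever extraction the pipeline actually uses, and deduplication is approximated by hashing the cleaned text.

import hashlib
import re
import requests
from urllib.parse import urlparse


def is_valid_link(url: str) -> bool:
    """Link validation: accept only well-formed http(s) URLs."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def clean_html(html: str) -> str:
    """HTML cleaning: drop scripts/styles, strip tags, collapse whitespace."""
    text = re.sub(r"<script.*?</script>|<style.*?</style>", " ", html, flags=re.S)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def crawl(urls):
    """Fetch each valid link, extract text, and deduplicate by content hash."""
    seen_hashes = set()
    corpus = []
    for url in urls:
        if not is_valid_link(url):
            continue
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue  # skip unreachable pages
        text = clean_html(html)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # deduplication: drop exact duplicate pages
            continue
        seen_hashes.add(digest)
        corpus.append({"url": url, "text": text})
    return corpus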
Table 4.
Structure of raw Vietnamese crawled dataset.
Table 5.
Structure of raw English crawled dataset.
Table 6.
Structure of Binhvq News Corpus after processing.
Table 7.
Structure of vi-error-correction-2.0 dataset after processing.
Table 8.
Structure of OPUS Tatoeba dataset after processing.
Table 9.
Structure of Vietnamese crawled dataset after processing.
Table 10.
Structure of English crawled dataset after processing.
Fig 3.
Class imbalance in the Vietnamese dataset.
Table 11.
Comparison of evaluation metrics for the XLM-RoBERTa, BARTpho, and E5 models.
Fig 4.
Comparative model performance.
(a) Evaluation loss scaled relative to the maximum observed loss (E5 = 0.0906, set as 100%). Lower percentages indicate better convergence, with BARTpho achieving only 11.62% of the maximum loss. (b) Performance metrics (Accuracy, Precision, Recall, F1) scaled relative to the maximum score achieved across the models.
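The relative scaling used in Fig 4(a) is straightforward to reproduce: each model's evaluation loss is divided by the maximum observed loss (E5 = 0.0906, defined as 100%). The short Python sketch below shows the computation; the BARTpho loss is back-calculated from its stated 11.62%, and the XLM-RoBERTa value is a placeholder, since the caption does not report it.

# Relative scaling from Fig 4(a): each loss as a percentage of the
# maximum observed evaluation loss (E5 = 0.0906, set as 100%).
eval_losses = {
    "XLM-RoBERTa": 0.0500,  # placeholder; not reported in the caption
    "BARTpho": 0.01053,     # back-calculated as 11.62% of 0.0906
    "E5": 0.0906,           # maximum observed loss, defined as 100%
}

max_loss = max(eval_losses.values())
for model, loss in eval_losses.items():
    print(f"{model}: {loss / max_loss:.2%}")  # BARTpho -> 11.62%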
Table 12.
Example used for comparison across the models.