Fig 1.
Workflow of the offensive language detection methodology for the Persian language.
Table 1.
Shared tasks on abusive language identification across different types and languages.
Fig 2.
Paper structure diagram.
Table 2.
Distribution of annotated data in three levels of annotation schema.
A set of 6,000 tweets out of the 520,000 sampled instances is randomly selected for the annotation process.
Fig 3.
Tweet samples (original and translated) from the annotated data with their categories for each level of the annotation schema.
Table 3.
Baseline ML models.
Table 4.
Baseline DL models.
Table 5.
Description of the transformer-based neural network models used for identifying offensive language in Persian.
Fig 4.
Diagram of stacking with K-fold cross-validation.
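The stacking scheme in the diagram can be sketched as follows; this is a minimal illustration using scikit-learn with placeholder base models (GaussianNB, a decision tree) and a logistic-regression meta-classifier, not the paper's exact classifiers or data. Each base model produces out-of-fold probabilities, which become the features of the meta-classifier.

```python
# Minimal sketch of stacking with K-fold cross-validation.
# Base models and data are illustrative, not the paper's setup.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
base_models = [GaussianNB(), DecisionTreeClassifier(random_state=0)]
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Out-of-fold predicted probabilities from each base model
# become the meta-features for the stacking classifier.
meta_features = np.zeros((len(y), len(base_models)))
for train_idx, test_idx in kf.split(X, y):
    for j, model in enumerate(base_models):
        model.fit(X[train_idx], y[train_idx])
        meta_features[test_idx, j] = model.predict_proba(X[test_idx])[:, 1]

# The meta-classifier is trained on the stacked out-of-fold probabilities,
# so it never sees probabilities produced on a model's own training fold.
meta_clf = LogisticRegression().fit(meta_features, y)
```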
Fig 5.
Preprocessing steps of the dataset.
Table 6.
Results for offensive language identification (first level).
The bold and underlined numbers represent the best and second-best scores, respectively, in each category: classical ML, DL, and transformer-based neural networks.
Table 7.
Results for targeted offensive language identification (second level).
The bold and underlined numbers represent the best and second-best scores, respectively, in each category: classical ML, DL, and transformer-based neural networks.
Table 8.
Results for target type of offensive language identification (third level).
The bold and underlined numbers represent the best and second-best scores, respectively, in each category: classical ML, DL, and transformer-based neural networks.
Fig 6.
Pairwise Pearson correlation coefficients between the predicted probabilities of different single classifiers on the out-of-fold test set.
First level (a) shows the correlation between the output predictions of classifiers trained on offensive vs. non-offensive annotated data. Second level (b) shows the correlation between the output predictions of classifiers trained on targeted vs. untargeted samples. Third level (c) shows the correlation between the output predictions of classifiers trained on targeted offensive samples directed at an individual or a group.
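Such a pairwise correlation matrix can be computed directly from the classifiers' out-of-fold probability vectors; the sketch below uses NumPy's `corrcoef` with randomly generated toy probabilities (the classifier names are placeholders, not the paper's models).

```python
# Minimal sketch: pairwise Pearson correlation between classifiers'
# out-of-fold predicted probabilities. Values here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
probs = {
    "clf_a": rng.random(100),
    "clf_b": rng.random(100),
}
# A third classifier made deliberately similar to clf_a.
probs["clf_c"] = 0.9 * probs["clf_a"] + 0.1 * rng.random(100)

names = list(probs)
# np.corrcoef treats each row as one variable and returns the
# full pairwise Pearson correlation matrix.
corr = np.corrcoef([probs[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:.2f}")
```

Highly correlated classifiers add little diversity to an ensemble, which is why such a matrix is a common diagnostic when selecting base models for stacking.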
Fig 7.
Offensive language identification performance among all models in three levels of annotation.
First level (a), second level (b), and third level (c) show the performance of the selected base-level classifiers together with the stacking ensemble classifier in identifying offensive vs. non-offensive content, targeted vs. untargeted offensive content, and whether the target of the offensive language is an individual or a group, respectively.