Fig 1.
Regional NER examples alongside their Standard Bangla and English equivalents.
Fig 2.
Development of ANCHOLIK-NER: A systematic pipeline for dataset creation.
Table 1.
Distribution of sentences across different data sources for Bangla regional dialects in the ANCHOLIK-NER dataset.
Table 2.
Sentence structure conversion by separating punctuation.
Table 3.
Dataset structure for the Sylhet region after the pre-processing and tokenization phase (the same structure is followed for all five regions).
Table 4.
Comprehensive overview of annotators’ background and expertise.
Table 5.
BIO tagging scheme with examples for named entity recognition in Bangla regional dialects.
Fig 3.
Inter-annotator agreement (Cohen’s Kappa) across different regions.
Fig 4.
Average tagging time by region (minutes per 1,000 tokens).
Table 6.
The dataset consists of three columns for each region: the first two are generated by a Python script, and the third (BIO tags) is verified by Bangla regional language experts.
Table 7.
Overview of our proposed dataset.
Fig 5.
Chittagong.
Fig 6.
Sylhet.
Fig 7.
Barishal.
Fig 8.
Noakhali.
Fig 9.
Mymensingh.
Table 8.
Total instances of each named entity type in the five regions.
Fig 10.
Frequency of named entities in the Chittagong dialect.
Fig 11.
Frequency of named entities in the Barishal dialect.
Fig 12.
Frequency of named entities in the Mymensingh dialect.
Fig 13.
Frequency of named entities in the Sylhet dialect.
Fig 14.
Frequency of named entities in the Noakhali dialect.
Fig 15.
Methodology.
Table 9.
Performance of Bangla BERT.
Table 10.
Performance of Bangla BERT Base.
Table 11.
Performance of BERT Base Multilingual Cased.
Fig 16.
Confusion matrices for the best-performing model on the Barishal regional dialect.
Fig 17.
Confusion matrices for the best-performing model on the Mymensingh regional dialect.
Fig 18.
Confusion matrices for the best-performing model on the Chittagong regional dialect.
Fig 19.
Confusion matrices for the best-performing model on the Noakhali regional dialect.
Fig 20.
Confusion matrices for the best-performing model on the Sylhet regional dialect.
Table 12.
Entity-wise F1-scores across the five dialect regions for the Bangla BERT model trained with weighted loss.