ViSQA: A benchmark dataset and baseline models for Vietnamese spoken question answering

doi:10.1371/journal.pone.0340771

Fig 1.

Data construction pipeline for the ViSQA dataset.

Table 1.

Transcription accuracy on clean audio using Google Speech-to-Text.

Table 2.

Word Error Rate (WER) under noisy conditions.

Table 3.

A qualitative sample from the ViSQA dataset.

The example shows the passage, question, gold answer, ASR transcripts from clean and noisy audio (with ASR errors highlighted in bold), and whether the gold span was successfully re-aligned.

Table 4.

Comparison of dataset characteristics across UIT-ViQuAD, ViNewsQA, VlogQA, and ViSQA.

Fig 2.

Overview diagram of the SQA framework.

Table 5.

Comprehensive evaluation of state-of-the-art models demonstrating performance degradation across spoken data conditions.

All models were trained on the complete UIT-ViQuAD training set. UIT-ViQuAD-dev and ViSQA-test denote the evaluation sets of UIT-ViQuAD and ViSQA, respectively.
