Fig 1.
Data construction pipeline for the ViSQA dataset.
Table 1.
Transcription accuracy on clean audio using Google Speech-to-Text.
Table 2.
Word Error Rate (WER) under noisy conditions.
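For readers unfamiliar with the metric, WER is conventionally defined as the word-level edit distance (substitutions + deletions + insertions) between the ASR hypothesis and the reference, divided by the number of reference words. The sketch below illustrates that standard definition only; the function name `word_error_rate` is hypothetical, and whitespace tokenization with no text normalization is an assumption, not necessarily the setup used to produce the table.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumption: tokens are whitespace-separated; no casing or punctuation normalization.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution in a three-word reference.
    print(word_error_rate("tôi đi học", "tôi đi họp"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.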
Table 3.
A qualitative sample from the ViSQA dataset.
The example shows the passage, question, gold answer, ASR transcripts from clean and noisy audio (with ASR errors highlighted in bold), and whether the gold span was successfully re-aligned.
Table 4.
Comparison of dataset characteristics across UIT-ViQuAD, ViNewsQA, VlogQA, and ViSQA.
Fig 2.
Overview diagram of the SQA framework.
Table 5.
Comprehensive evaluation of state-of-the-art models demonstrating performance degradation across spoken data conditions.
All models were trained on the full UIT-ViQuAD training set. UIT-ViQuAD-dev and ViSQA-test denote the evaluation sets of UIT-ViQuAD and ViSQA, respectively.
Table 6.
Performance comparison of models on the ViSQA-dev set.
“Text” indicates models trained on clean text documents (UIT-ViQuAD), while “Spoken” refers to models trained on ASR transcriptions (ViSQA).
Table 7.
Evaluation results of pre-trained language models on the ViSQA test set under clean and noisy conditions.
All models were trained on the same ViSQA training set.
Table 8.
WER of ViSQA contexts transcribed by Google STT and AssemblyAI.
Table 9.
Performance comparison of machine comprehension models trained on Google STT vs. AssemblyAI transcripts, evaluated under matched ASR conditions.
Table 10.
Mean and standard deviation of model performance on the ViSQA-test set, averaged over 5 re-training runs with different random seeds.
Table 11.
Pairwise p-values from paired t-tests comparing models on EM and F1 scores (scaled by 10⁻⁴ and rounded to one decimal place).
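As a reminder of what a paired t-test computes, the sketch below derives the t statistic from the per-pair score differences of two models. It is an illustration under assumptions: the function name `paired_t_statistic` is hypothetical, and the caption does not state what forms a pair (for example, matching random seeds or matching test examples). The p-value would then be read from a Student t distribution with the returned degrees of freedom, a step omitted here to keep the example dependency-free.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two score lists."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have the same length")
    # Assumption: scores_a[i] and scores_b[i] come from the same pairing unit.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")

    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("all differences are identical; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the paper.
    t, dof = paired_t_statistic([1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 1.0, 1.0])
    print(t, dof)
```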
Table 12.
The count and accuracy rate of correct answers on the ViSQA test set, categorized by type, measured using the EM metric.
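The EM and F1 metrics referenced in these captions are commonly computed in the SQuAD style: EM checks whether the normalized prediction equals the normalized gold answer, and F1 measures token overlap between them. The following is a minimal sketch of that common convention; the helper names and the normalization steps (lowercasing, stripping ASCII punctuation, collapsing whitespace) are assumptions and may differ from the evaluation script behind the tables.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> bool:
    """True when prediction and gold answer are identical after normalization."""
    return normalize(prediction) == normalize(gold)


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Hà Nội.", "hà nội"))
    print(token_f1("Hà Nội", "thành phố Hà Nội"))
```

Under this convention a prediction that is a strict sub-span of the gold answer scores zero on EM but receives partial credit on F1, which is why the two metrics can diverge on noisy ASR transcripts.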