Abstract
How can we build accurate transcription models for both ordinary speech and characterized speech in a semi-supervised setting? ASR (Automatic Speech Recognition) systems are widely used in various real-world applications, including translation systems and transcription services. ASR models are tailored to serve one of two types of speeches: 1) ordinary speech (e.g., speeches from the general population) and 2) characterized speech (e.g., speeches from speakers with special traits, such as certain nationalities or speech disorders). Recently, the limited availability of labeled speech data and the high cost of manual labeling have drawn significant attention to the development of semi-supervised ASR systems. Previous semi-supervised ASR models employ a pseudo-labeling scheme to incorporate unlabeled examples during training. However, these methods rely heavily on pseudo labels during training and are therefore highly sensitive to the quality of pseudo labels. The issue of low-quality pseudo labels is particularly pronounced for characterized speech, due to the limited availability of data specific to a certain trait. This scarcity hinders the initial ASR model’s ability to effectively capture the unique characteristics of characterized speech, resulting in inaccurate pseudo labels. In this paper, we propose a framework for training accurate ASR models for both ordinary and characterized speeches in a semi-supervised setting. Specifically, we propose MOCA (Multi-hypotheses-based Curriculum learning for semi-supervised Asr) for ordinary speech and MOCA-S for characterized speech. MOCA and MOCA-S generate multiple hypotheses for each speech instance to reduce the heavy reliance on potentially inaccurate pseudo labels. Moreover, MOCA-S for characterized speech effectively supplements the limited trait-specific speech data by exploiting speeches of the other traits. Specifically, MOCA-S adjusts the number of pseudo labels based on the relevance to the target trait. Extensive experiments on real-world speech datasets show that MOCA and MOCA-S significantly improve the accuracy of previous ASR models.
Citation: Hyun Park K, Kim J, Kang U (2025) Accurate semi-supervised automatic speech recognition for ordinary and characterized speeches via multi-hypotheses-based curriculum learning. PLoS One 20(10): e0333915. https://doi.org/10.1371/journal.pone.0333915
Editor: Anirban Bhowmick, VIT Bhopal University, INDIA
Received: March 12, 2025; Accepted: September 19, 2025; Published: October 21, 2025
Copyright: © 2025 Hyun Park et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are publicly available from the GitHub repository (https://github.com/snudatalab/MOCA).
Funding: This work was supported by Institute of Information & communications Technology Planning & Evaluation(IITP) grant funded by the Korea government(MSIT) [No.RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], [No.RS-2021-II212068, Artificial Intelligence InnovationHub (Artificial Intelligence Institute, Seoul National University)], [No.2022-0-00641, XVoice: Multi-Modal Voice Meta Learning], [No.RS-2024-00509257, Global AI Frontier Lab], and [No.RS-2025-25442338, AI star Fellowship Support Program (Seoul National Univ.)]. The Institute of Engineering Research at Seoul National University and the ICT at Seoul National University provided research facilities for this work. The recipient of the funding awards listed above is U Kang. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
How can we train an accurate transcription model for speech of the general population and those of individuals with specific traits in a semi-supervised setting? Speech is a primary and essential means of human communication. Developing accurate automatic speech recognition (ASR) systems is essential to efficiently leverage the growing volume of speech data for various applications, including voice search [1,2], speech command recognition [3], automatic transcription of spoken content [4,5], information extraction [6,7], and machine translation [8,9].
Previous ASR models are categorized to address two types of speech: 1) ordinary speech from the general population, and 2) characterized speech from speakers with specific traits such as accents or speech disorders. ASR models for ordinary speech generate transcriptions without accounting for specific traits including gender or age [10]. On the other hand, ASR models for characterized speech are designed to transcribe speech from individuals with specific characteristics, such as regional accents or speech impairments [11–13].
The limited availability of labeled speech data and the high cost of manual annotation have recently driven increased attention toward advancing semi-supervised ASR approaches [14–16]. For instance, only a fraction of interactions in online banking are manually transcribed due to the high cost. Existing semi-supervised ASR methods generate pseudo labels from an initial model trained on a small set of labeled data, which are then used to further refine the model [17]. However, these pseudo-labeling schemes are constrained by their reliance on a single 1-best hypothesis as a fixed pseudo label, limiting the model’s ability to consider other potentially correct alternatives.
To illustrate this limitation, we present the distribution of ground-truth labels in Fig 1 using the LJSpeech dataset [18]. The number j on the x-axis denotes the j-best hypothesis among the top-10 hypotheses, while the y-axis represents the count of ground-truth labels appearing in each j-best hypothesis. To generate these hypotheses, we use a pre-trained wav2vec 2.0 [19] finetuned on the labeled instances of LJSpeech. It is noteworthy that about 35% of ground-truth labels are overlooked in 1-best-hypothesis-based approaches. To prevent such information loss, there arises the need for approaches incorporating alternative hypotheses alongside the 1-best one.
Note that ground-truth labels often appear in alternative hypotheses rather than being restricted to the 1-best prediction.
For characterized speech, the problem of heavy reliance on inaccurate pseudo labels is exacerbated, as the scarcity of trait-specific data (e.g., Yorkshire English) often results in lower-quality pseudo labels. To supplement this scarcity, it is intuitive to incorporate unlabeled speech data from generic speakers (e.g., General American) since such data are readily available with the exponentially growing volume of speech resources today [20]. However, a key challenge lies in the significant phonetic and acoustic variation across speech traits.
To further explore this challenge, we train an ASR model on US-accented English data and compare the confidence levels of predictions across various accents in Fig 2. Specifically, we compare the likelihoods of non-US-accented speech instances and those of US-accented ones. Notice that the ASR model trained on US-accented speech achieves the highest likelihoods for US-accented speech as it is optimized to maximize the likelihoods of such data. It is also noteworthy that speech with accents similar to the US accent exhibits higher likelihoods compared to speech with distinct accents; Canadian- and Australian-accented speech yields higher likelihoods than Filipino-accented speech due to vowel sounds similar to those of US English. This highlights that incorporating unlabeled speech data from diverse speakers without addressing phonetic and acoustic variations fails to mitigate the issue of heavy reliance on inaccurate pseudo-labels, thereby limiting model robustness.
Each vertical line indicates the mean likelihood for a specific accent. The model trained on US-accented data produces higher likelihoods for phonetically similar accents (Australian and Canadian), while a dissimilar accent (Filipino) results in lower likelihoods.
In this work, we propose a semi-supervised learning framework to train accurate ASR models for both ordinary and characterized speech. Specifically, we propose MOCA (Multi-hypotheses-based Curriculum learning for semi-supervised Asr) for ordinary speech and MOCA-S for characterized speech. MOCA and MOCA-S incorporate multi-hypotheses-based pseudo labels for each unlabeled instance to reduce the heavy reliance on inaccurate 1-best pseudo labels. This increases the likelihood of correctly identifying the ground-truth labels and improves the robustness of the model by leveraging the diversity within multiple hypotheses. In addition, MOCA-S incorporates both unlabeled speech instances without the target trait and those with the target trait to address the scarcity of trait-specific speech data. The key idea is to adjust the influence of each unlabeled instance according to its relevance to the target trait. To further reduce the heavy reliance on the quality of the generated pseudo labels, MOCA and MOCA-S exploit curriculum learning with our novel difficulty scores. This allows the ASR model to gather more information before encountering uncertain and challenging examples, leading to more stable training; the model becomes less sensitive to the quality of pseudo-labels for challenging instances. Our contributions are summarized as follows:
- Method. MOCA and MOCA-S overcome the main limitation of existing methods: their heavy reliance on inaccurate 1-best pseudo labels. The key idea is to incorporate multi-hypotheses-based pseudo labels for each unlabeled instance. To minimize the model’s sensitivity to the quality of pseudo labels for difficult instances, which are often less accurate than those for easier instances, MOCA and MOCA-S exploit curriculum learning with our two novel difficulty scores.
- Theory. We theoretically analyze the loss function of MOCA and MOCA-S by comparing them with a traditional ASR training loss.
- Experiments. We conduct extensive experiments and demonstrate that MOCA and MOCA-S effectively enhance transcription performance for ordinary and characterized speech, respectively, outperforming previous methods.
The code and datasets are available at https://github.com/snudatalab/MOCA.
Related works
In this section, we formally define the problems of ASR for ordinary and characterized speech and introduce related works.
Problem definition
Semi-supervised ASR for ordinary speech.
We are given a set XL of labeled speech instances and a set XU of unlabeled ones with a pre-trained ASR model f parameterized by θ. Then the objective of semi-supervised ASR for ordinary speech is to train an accurate transcription model f* parameterized by θ* that accurately transcribes previously unseen ordinary speech instances.
Semi-supervised ASR for characterized speech.
We are given sets XL^t and XU^t of labeled and unlabeled speech instances with the target trait, respectively. We are also given a set XU^nt of unlabeled speech instances with traits different from the target trait and a pre-trained ASR model f parameterized by θ. Our objective is to train an accurate ASR model f* with learnable parameters θ* that accurately transcribes newly given instances with the target trait.
Pre-trained audio feature extractor for ASR
Recent advancements in speech recognition have been driven by the rise of unsupervised pre-training methods, which extract general features from speech [21–23]. These models learn from large amounts of unlabeled audio data, and their representations are directly applied in an end-to-end manner to enhance performance in downstream tasks such as speech emotion recognition [24], disease detection [25], and voice conversion [26]. The representations derived from audio properties are used to generate probable transcriptions, called hypotheses in ASR [27].
Among the commonly used pre-trained models, wav2vec [19] and wav2vec 2.0 [28] stand out for their strong performance [29,30]. These models utilize contrastive learning to distinguish similar audio pairs from dissimilar ones. The wav2vec and wav2vec 2.0 models take a speech signal xi as input, where each element xi,t corresponds to a sampled value of the speech signal at time t. Their encoder produces feature representations ci, which are then mapped into a sequence of states yi by a decoder. The decoder can be any model including an MLP. Each element yi,j of yi represents the prediction vector at timestamp j for a speech input xi, where C denotes the number of output categories, including alphabet characters, blank spaces, punctuation marks, and other symbols.
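As a concrete illustration of this pipeline, the sketch below loads a publicly available wav2vec 2.0 checkpoint and inspects the per-frame prediction vectors; the checkpoint name and audio path are placeholders for illustration, not the exact setup used in this paper.

```python
# Minimal sketch: extracting per-frame prediction vectors y_i from wav2vec 2.0.
# The checkpoint and audio file below are illustrative placeholders.
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sr = torchaudio.load("example.wav")                 # raw speech signal x_i
waveform = torchaudio.functional.resample(waveform, sr, 16_000)

inputs = processor(waveform.squeeze().numpy(), sampling_rate=16_000,
                   return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits                # shape: (1, frames, C)

# Each row logits[0, j] is the prediction vector at timestamp j over the
# C output categories (characters, blank, punctuation, ...).
print(logits.shape)
greedy = processor.batch_decode(torch.argmax(logits, dim=-1))
print(greedy[0])                                              # 1-best greedy transcription
```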
Semi-supervised ASR models
The challenge of insufficient labeled data has emerged as a critical issue across many machine learning domains [31,32], and speech is no exception. The limited availability of labeled speech data and the high costs of manual labeling have recently drawn significant attention to the development of semi-supervised ASR systems [33–36]. Previous semi-supervised ASR models employ the 1-best hypothesis as a definite pseudo label and are generally designed to accurately transcribe ordinary speech instead of characterized one. Higuchi et al. [16] employ online and offline models to enhance ASR representations through interaction. Park et al. [37] apply Noisy Student Training to ASR, introducing various levels of input augmentation. However, 1-best hypothesis approaches, which depend on a single model prediction, fail to reflect the range of possible outputs the model can generate.
Recently, there has been growing interest in developing speech models, including ASR, tailored for characterized speech [38]. These models focus on specific speech features and provide enhanced recognition accuracy in applications where general ASR systems often underperform [39,40]. Related efforts on handling characterized signals have also appeared in other domains, including wavelet-based approaches [41–43], and similar considerations have been explored within the speech domain. Moreover, given that speech signals can be conceptually modeled as structured, high-dimensional data, insights from related literature, including spectral analysis [44,45] and tensor factorization [46–48], hold promise for enhancing ASR and speech representation systems. Despite this interest, progress in semi-supervised ASR models for characterized speech remains limited. The main challenge lies in the scarcity of large volumes of both labeled and unlabeled data for specific traits [49]. To address this data scarcity, previous ASR models for characterized speech have focused on adapting to specific-featured data while leveraging general speech data [50–53]. However, these models require fully labeled data generated through a hand-crafted process, which further complicates the workflow. Note that the diversity of traits in target-featured instances makes manual transcription significantly more challenging than that of general speech, which in turn limits the model’s capabilities [54]. Kim et al. [55] propose a multi-hypotheses-based pseudo-labeling method for training a semi-supervised ASR model, which is our preliminary work. MOCA in this paper is identical to the one proposed in our previous work [55]. In this work, we extend MOCA by additionally proposing MOCA-S, a generalized and trait-aware variant of MOCA designed for characterized speech settings. MOCA-S reduces to MOCA in the ordinary speech case, while providing extended applicability when speech traits such as accents, gender, or other characteristics are involved.
Connectionist Temporal Classification (CTC) loss
In ASR tasks, the connectionist temporal classification (CTC) loss is widely adopted to quantify the difference between the predicted sequence of states and the corresponding ground-truth text label li for a given speech instance xi [56]. This approach allows the model to establish an alignment between input audio frames and output characters without relying on pre-segmented data.
The CTC algorithm optimizes the likelihood of correctly predicting the ground-truth transcription li given the speech input xi, as formulated below:

p(li | xi) = Σ_{π ∈ B^{-1}(li)} p(π | xi),  (1)

where B maps a prediction sequence π of yi to a transcription by removing blanks and duplicate labels. Consequently, B^{-1}(li) denotes the collection of all possible prediction sequences that are collapsed into the label li.
The CTC loss, defined as the negative log-likelihood of p(li | xi), is given by:

L_CTC = − Σ_{i=1}^{T} log p(li | xi),  (2)

where T is the number of data instances. Note that the standard CTC loss without any additional mechanism focuses solely on the 1-best pseudo label. This limits the model’s ability to apply soft labeling across alternative labels, and makes the alignment less flexible. In contrast, our modified CTC loss enables the model to incorporate multiple hypotheses during training, allowing it to leverage a broader range of plausible alignments. This approach provides a softer, more nuanced labeling process that improves the model’s adaptability to diverse data and enhances its robustness.
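For reference, the standard CTC objective described above is available as a built-in loss in PyTorch; the sketch below only illustrates the expected tensor shapes with toy values (the vocabulary size and lengths are illustrative).

```python
# Minimal sketch of the standard CTC loss over per-frame predictions.
import torch
import torch.nn as nn

C, frames, batch = 32, 50, 2                    # toy vocabulary size and lengths
log_probs = torch.randn(frames, batch, C).log_softmax(dim=-1)  # (T, N, C)

targets = torch.randint(1, C, (batch, 12))      # ground-truth label indices (blank = 0)
input_lengths = torch.full((batch,), frames, dtype=torch.long)
target_lengths = torch.full((batch,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, reduction="mean")
loss = ctc(log_probs, targets, input_lengths, target_lengths)
print(loss.item())   # negative log-likelihood summed over all valid alignments
```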
Proposed method
We propose a framework for training an accurate ASR model in a semi-supervised setting. Specifically, we propose MOCA (Multi-hypotheses-based Curriculum learning for semi-supervised Asr) for ordinary speech and MOCA-S for characterized speech. We show the overall process of MOCA and MOCA-S in Fig 3.
We begin by training an ASR model using a small set of labeled speech samples. After training, the model produces multiple hypotheses for every unlabeled instance. For an unlabeled example xi, ri,j represents the pseudo label sampled from the hypothesis set Zi. MOCA enhances the ASR model by training with both labeled and unlabeled instances, incorporating multiple hypotheses to optimize the loss function. During training, a curriculum learning strategy incorporating a novel difficulty score is employed to mitigate over-reliance on pseudo-labels.
Initially, both MOCA and MOCA-S train an ASR model with labeled speech instances. The trained model then generates multi-hypotheses-based pseudo labels for each unlabeled instance. The key difference is in how instance importance is handled with pseudo-labeling: MOCA assigns a fixed number of pseudo labels to all instances equally, while MOCA-S adjusts the number of pseudo labels per instance based on its relevance to the target trait. Through this mechanism, target-relevant instances naturally receive greater weight in training, whereas less relevant instances exert a smaller influence.
Then, MOCA and MOCA-S retrain the ASR model using both labeled and unlabeled instances with pseudo labels. This process follows a defined order accounting for the difficulty and uncertainty of each example. The detailed process of MOCA-S, reflecting its key differences from MOCA, is illustrated in Fig 4. The challenges are summarized as follows:
MOCA-S initially trains an ASR model using characterized speech instances with transcription labels. Notably, MOCA-S dynamically adjusts the number of pseudo labels assigned to each unlabeled instance based on its relevance to the target feature. The ASR model is subsequently retrained by integrating both labeled and unlabeled instances with multiple hypotheses while optimizing the loss function.
- C1. Exclusion of ground-truth labels in pseudo labels. For unlabeled instances, naive 1-best pseudo labels generated from a pre-trained ASR model inevitably carry uncertainties, especially when the model is trained on limited labeled instances. How can we address the uncertainties of pseudo labels?
- C2. Defining pseudo labels for unlabeled instances. How can we train an ASR model with multiple label hypotheses for each unlabeled instance?
- C3. Disparity among traits of speech. For training MOCA-S on characterized speech, unlabeled instances hinder the training of a trait-specific ASR model if their distinctive traits are not taken into account. How can we determine helpful unlabeled instances for improving the accuracy of a trait-specific ASR model?
- C4. Robustness on difficult examples. For difficult examples, the ground-truth label may not be included in the multiple hypotheses, which leads to inaccurate pseudo labels. How can the model remain robust even for these low-accuracy instances?
We propose the following main ideas to tackle such challenges, which are discussed in detail in the following sections.
- I1. Multiple hypotheses for each unlabeled instance. Instead of relying solely on the 1-best hypothesis, we also consider alternative hypotheses.
- I2. Sampling-based loss function. MOCA and MOCA-S perform weighted sampling from the multi-hypotheses to generate pseudo labels for each unlabeled instance. This helps the model generate more diverse pseudo labels, allowing consideration of the uncertainties inherent in the hypotheses.
- I3. Trait-based influence adjustment of unlabeled instances. MOCA-S reduces the influence of instances lacking the target trait while amplifying the impact of instances with the target trait during training. This is done by dynamically adjusting the number of pseudo labels for each unlabeled speech instance upon their relevance to the target trait.
- I4. Ordering instances by difficulty. We perform curriculum learning with our novel difficulty scores. By initially training the model on easier, high-confidence instances, the model builds a strong foundation. Thus, the model becomes increasingly robust in handling challenging instances with uncertain pseudo labels as training progresses.
Multiple hypotheses for unlabeled instances (I1)
MOCA and MOCA-S leverage a pre-trained ASR model f, which is trained with a small amount of labeled data, to generate pseudo-labels for unlabeled instances. One straightforward approach is to use the 1-best predicted label from f as the pseudo-label for an unlabeled instance xi. However, this prediction contains uncertainties, largely because the ASR model f is trained on a limited number of labeled examples. These uncertainties negatively affect the ASR performance.
Another approach is to use soft labels instead of the hard label as commonly done in other deep learning domains [57]. This approach prevents f from overfitting to potentially incorrect predictions, making it more robust than the 1-best-based method. However, applying soft labeling in ASR tasks is impractical due to the overwhelming number of possible target labels. For instance, constructing L-letter words with the English alphabet results in 26^L possible combinations.
To efficiently manage the uncertainties associated with the pseudo labels, MOCA and MOCA-S generate a set Zi of N label hypotheses for each unlabeled speech signal xi, where each zi,j represents a distinct label hypothesis generated for xi. This can be viewed as an adaptation of the soft labeling strategy tailored for ASR tasks with a constrained set of possible target labels. It provides an efficient approximation of the computationally intensive process associated with the naive soft labeling method.
We define the set Zi of label hypotheses for each xi as the top-N label candidates. These top-N candidates are obtained via beam search using f, which is the pre-trained ASR model trained on a small amount of labeled data and parameterized by θ. The probability p(zi,j | xi; θ) of selecting the j-th hypothesis zi,j from the hypotheses set Zi given a speech signal xi is defined as the normalized form of beam search scores, which represent the log-likelihoods of the candidates. The sampling probability of zi,j from Zi is given as:

p(zi,j | xi; θ) = exp(si,j) / Σ_{k=1}^{N} exp(si,k),  (3)

where si,j is the beam search score for zi,j. Exponentiating the log-likelihood si,j converts it into the likelihood exp(si,j). Therefore, p(zi,j | xi; θ) in Eq (3) represents the normalized likelihood within the hypotheses set Zi.
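A minimal sketch of this weighted sampling step is shown below, assuming the beam search scores si,j are already available as log-likelihoods; the softmax normalization corresponds to Eq (3). Sampling with replacement is an illustrative assumption so that more confident hypotheses can appear more often among the drawn pseudo labels.

```python
# Minimal sketch: turning beam search scores (log-likelihoods) into the
# sampling distribution of Eq (3) and drawing K pseudo labels per utterance.
import numpy as np

def hypothesis_distribution(beam_scores):
    """Normalize log-likelihood scores s_{i,j} into probabilities p(z_{i,j} | x_i)."""
    scores = np.asarray(beam_scores, dtype=np.float64)
    scores = scores - scores.max()          # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def sample_pseudo_labels(hypotheses, beam_scores, k, rng=np.random.default_rng(0)):
    """Draw K pseudo labels from the top-N hypotheses (with replacement here)."""
    probs = hypothesis_distribution(beam_scores)
    idx = rng.choice(len(hypotheses), size=k, replace=True, p=probs)
    return [hypotheses[i] for i in idx]

hyps = ["the cat sat", "the cat sad", "a cat sat"]        # toy top-N candidates
scores = [-1.2, -2.0, -3.5]                               # toy beam log-likelihoods
print(sample_pseudo_labels(hyps, scores, k=5))
```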
Training ASR model for ordinary speech with multiple hypotheses (I2)
Previous ASR models employ CTC loss for training, which is defined as the sum of negative log likelihoods over all speech instances. For a labeled speech instance xi with its label li, the likelihood is computed as in Eq (1). However, defining likelihoods for unlabeled instances is challenging as ground-truth labels are unavailable. Instead, we have multiple label hypotheses for each instance.
We define the likelihood for each unlabeled instance as the probability of observing pseudo-labels sampled from the hypothesis set Zi. Each element zi,j is selected based on the probability in Eq (3), where θ represents the parameters of the pre-trained ASR model f. This allows MOCA to generate pseudo-labels for unlabeled instances while reflecting their uncertainty levels. Sampling K pseudo-labels from the hypothesis set improves the model’s robustness. For example, drawing three samples from a set of 10 hypotheses provides more diverse pseudo labels compared to sampling from a set of only three hypotheses. As shown in Fig 1, a significant portion of ground-truth labels is included within the alternative hypotheses. Expanding the hypothesis set as candidates for pseudo labels enhances their selection and strengthens the model. The distribution of certainty levels among the hypotheses is also captured in the pseudo-labels, further contributing to model robustness.
Let S(Zi,K) represent the list of K pseudo-labels sampled from the hypothesis set Zi for each unlabeled instance xi. The likelihood of observing the sampled hypotheses S(Zi,K) is defined as:

p(S(Zi,K) | xi; θ*) = Π_{z ∈ S(Zi,K)} p(z | xi; θ*).  (4)

We use this likelihood to represent each unlabeled instance xi. Consequently, MOCA minimizes the negative log-likelihood for both labeled and unlabeled instances during the retraining of the ASR model f* with updated parameters θ*:

L_MOCA(θ*) = − Σ_{xi ∈ XL} log p(li | xi; θ*) − Σ_{xi ∈ XU} Σ_{z ∈ S(Zi,K)} log p(z | xi; θ*),  (5)

where XL and XU are sets of labeled and unlabeled speech, respectively.
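The sketch below illustrates how the objective in Eq (5) could be assembled from per-utterance CTC terms, assuming a hypothetical helper ctc_nll(model, x, label) that returns −log p(label | x) and a sampler such as the one sketched above; the data structures are simplifications for illustration.

```python
# Minimal sketch of the MOCA objective in Eq (5): labeled instances use their
# ground-truth labels, unlabeled instances use K sampled pseudo labels.
# `ctc_nll` and `sample_pseudo_labels` are hypothetical helpers passed in by the caller.

def moca_loss(model, labeled, unlabeled, sample_pseudo_labels, ctc_nll, k=5):
    loss = 0.0
    for x, label in labeled:                      # x_i in X_L with ground truth l_i
        loss = loss + ctc_nll(model, x, label)
    for x, hyps, scores in unlabeled:             # x_i in X_U with top-N hypotheses Z_i
        for pseudo in sample_pseudo_labels(hyps, scores, k):
            loss = loss + ctc_nll(model, x, pseudo)
    return loss
```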
Training ASR model for characterized speech with multiple hypotheses (I3)
In the problem of semi-supervised ASR for characterized speech, we are given the three sets of speech instances: a set XL^t of labeled speech instances with the target trait, a set XU^t of unlabeled speech instances with the target trait, and a set XU^nt of unlabeled speech instances without the target trait. We are also given a pre-trained ASR model f parameterized by θ. MOCA-S first finetunes the pre-trained f with XL^t to build an ASR model specialized for the target trait. A naive implementation of MOCA for unlabeled instances in XU^t and XU^nt would be to construct a hypothesis set for each unlabeled instance and sample K hypotheses from the set to generate K pseudo labels. However, treating a non-target-featured instance xi^nt ∈ XU^nt in the same manner as a target-featured instance xi^t ∈ XU^t hinders the training of the ASR model specialized for the target trait, as xi^nt contains traits distinct from the target trait.
To address this challenge, MOCA-S adjusts the weights of unlabeled instances in XU^t and XU^nt based on each instance’s relevance to the target trait. The main idea is to dynamically decrease the number of pseudo labels for each xi^nt ∈ XU^nt according to its relation to the target trait. This prioritizes target-featured unlabeled instances xi^t ∈ XU^t in training by assigning more pseudo labels compared to xi^nt.
To measure the relation of each non-target-featured instance to the target trait, we employ its confidence score si,1^nt, which is the beam search score of the 1-best hypothesis zi,1^nt computed from the ASR model. Specifically, the number of pseudo labels for each xi^nt is determined by the ratio of its likelihood exp(si,1^nt) to the mean of 1-best likelihoods of the target-featured labeled instances. Let Zi^t and Zi^nt denote the hypotheses sets for xi^t and xi^nt, respectively. Then the number M(i) of pseudo labels in the sampled hypotheses set S(Zi^nt, M(i)) for each xi^nt is formally defined as follows:

M(i) = min(K, ⌊ K · exp(si,1^nt) / ( (1/|XL|) Σ_{xj ∈ XL} exp(sj,1) ) ⌋),  (6)

where XL is the set of labeled instances used in the initial ASR model training and K is the number of pseudo labels for each target-featured unlabeled instance xi^t.
MOCA-S employs p(S(Zi^nt, M(i)) | xi^nt; θ*) as the likelihood for each xi^nt following MOCA. MOCA-S minimizes the negative log likelihood for both labeled and unlabeled instances during the retraining of ASR model f* with a new parameter θ*. The loss function L_MOCA-S of MOCA-S for retraining f* is expressed as follows:

L_MOCA-S(θ*) = − Σ_{xi ∈ XL^t} log p(li | xi; θ*) − Σ_{xi ∈ XU^t} Σ_{z ∈ S(Zi^t, K)} log p(z | xi; θ*) − Σ_{xi ∈ XU^nt} Σ_{z ∈ S(Zi^nt, M(i))} log p(z | xi; θ*),  (7)

where S(Zi^t, K) and S(Zi^nt, M(i)) are the K and M(i) pseudo labels sampled from the hypotheses sets for each unlabeled example xi^t and xi^nt, respectively. XL^t contains target-featured labeled instances; XU^t and XU^nt consist of unlabeled instances with and without the target trait, respectively. This dynamic sampling in MOCA-S adjusts the number of pseudo labels for non-target-featured unlabeled instances, enabling better control over the relative impact of target-featured and non-target-featured unlabeled instances in the loss calculation.
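The following sketch shows one way the trait-based pseudo-label budget could be computed, assuming the 1-best beam scores are log-likelihoods; the rounding and capping policy is an illustrative assumption, not the exact form of Eq (6).

```python
# Minimal sketch of trait-based pseudo-label budgeting in the spirit of Eq (6),
# assuming 1-best beam scores are log-likelihoods. The flooring/capping policy
# is an illustrative assumption.
import math

def num_pseudo_labels(nontarget_best_score, labeled_best_scores, k):
    """Scale K by the ratio of this instance's 1-best likelihood to the mean
    1-best likelihood of the labeled (target-trait) instances, capped at K."""
    likelihood = math.exp(nontarget_best_score)
    mean_ref = sum(math.exp(s) for s in labeled_best_scores) / len(labeled_best_scores)
    m = math.floor(k * likelihood / mean_ref)
    return max(0, min(k, m))

# A non-target utterance that the trait-specialized model finds unlikely
# receives fewer pseudo labels than a target-like one.
print(num_pseudo_labels(-4.0, [-1.0, -1.5, -0.8], k=5))   # small budget (0)
print(num_pseudo_labels(-1.1, [-1.0, -1.5, -0.8], k=5))   # close to K (4)
```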
Curriculum learning (I4)
If all pseudo-labels within the sampled hypothesis set S for an unlabeled instance xi are incorrect, relying on multiple pseudo-labels still inherits the uncertainty of the initial ASR model. To mitigate excessive dependence on pseudo-label quality, MOCA and MOCA-S introduce a curriculum learning strategy that incorporates a novel difficulty scoring mechanism for each speech instance. The core principle is to prioritize training on easier examples, which exhibit higher certainty, before gradually incorporating more challenging and uncertain ones. This progressive approach helps the model effectively learn complex decision boundaries.
We introduce two different difficulty scoring methods for curriculum learning. The first method considers a speech instance as more difficult if it is spoken at a faster rate. This is motivated by real-world situations, where speakers often slow down when communicating with young children, elderly listeners, or when repeating themselves to ensure better comprehension. The observation [58] that higher speech rates yield higher ASR error rates supports such motivation. The scoring method also considers longer transcriptions more difficult, as longer label sequences tend to yield higher word error rates [59]. The score is defined as:
dspeed(xi) = ( len(li) / len(xi) ) · len(li),  (8)

where li represents a (pseudo) label for a speech instance xi, and len(·) denotes a function returning the length of its input; len(xi) is the duration of the speech signal xi. This score is derived by multiplying the speech rate, len(li)/len(xi), with the length of the uttered sentence, len(li).
The second method, similar to the first, also considers longer sentences more challenging. However, it further differentiates instances of the same length based on prediction confidence, treating those with higher confidence as easier. This design naturally reflects real-world factors, since model confidence tends to decrease for speech that differs acoustically or contextually from the target domain. For instance, utterances recorded in noisy environments or with strong emotional expression are often recognized with lower confidence, making them effectively harder examples in the curriculum. Using the model’s posterior confidence adds a model-centric view of difficulty, which has been shown to improve curriculum learning in end-to-end ASR systems [60]. The corresponding difficulty score is given by:

dconf(xi) = len(li) / p(li | xi; θ),  (9)

where p(li | xi; θ) denotes the model’s confidence in the (pseudo) label li.
For all instances including unlabeled ones, difficulty scores are computed, and the ASR model is trained starting with lower-scoring examples. The appropriate difficulty score for each dataset is selected based on experimental results.
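A minimal sketch of how instances could be ordered under the two difficulty scores is given below; the duration and confidence inputs, and the exact functional forms, are simplified assumptions consistent with the reconstructions of Eqs (8) and (9) above.

```python
# Minimal sketch of curriculum ordering by the two difficulty scores.
# `duration` is the utterance length in seconds and `confidence` the 1-best
# likelihood of its (pseudo) label; both forms are simplified assumptions.

def speed_score(label, duration):
    """Speech rate (label length / duration) times label length, as in Eq (8)."""
    return (len(label) / duration) * len(label)

def confidence_score(label, confidence):
    """Longer and less confident instances are considered harder, as in Eq (9)."""
    return len(label) / confidence

instances = [
    {"label": "hello world", "duration": 1.2, "confidence": 0.9},
    {"label": "a much longer utterance spoken quickly", "duration": 2.0, "confidence": 0.6},
    {"label": "short", "duration": 0.8, "confidence": 0.4},
]

# Train on easier (lower-scoring) instances first.
curriculum = sorted(instances, key=lambda d: speed_score(d["label"], d["duration"]))
for item in curriculum:
    print(round(speed_score(item["label"], item["duration"]), 1), item["label"])
```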
Theoretical analysis
MOCA and MOCA-S use the sampling-based losses L_MOCA and L_MOCA-S in Eqs (5) and (7), respectively, to optimize the parameters θ* of the ASR model f*. We theoretically analyze the relationship between the loss functions (L_MOCA and L_MOCA-S) and the CTC loss.
Theorem 1. (Relationship between the loss of MOCA and CTC Loss) Let zi ∈ Zi denote a latent variable representing the pseudo label of an unlabeled instance xi where Zi is the set of label hypotheses for xi, θ* be the model parameter, and S(Zi,K) be the list of K-sampled pseudo labels from Zi. Then the loss L_U(xi; θ*) of MOCA for each unlabeled instance xi ∈ XU is the expectation of CTC loss in terms of zi with a balancing factor |S(Zi,K)|:

L_U(xi; θ*) = − Σ_{z ∈ S(Zi,K)} log p(z | xi; θ*) = |S(Zi,K)| · E_{zi}[ −log p(zi | xi; θ*) ].  (10)

Proof: MOCA generates the set S(Zi,K) of pseudo labels for each unlabeled instance xi by sampling K examples from Zi. The loss term of each unlabeled instance xi in Eq (10) is rewritten as follows:

− Σ_{z ∈ S(Zi,K)} log p(z | xi; θ*) = − Σ_{zi ∈ Zi} n(zi) log p(zi | xi; θ*),  (11)

where n(zi) is the number of zi in S(Zi,K). Since zi follows the distribution p(zi | xi; θ) in Eq (3), n(zi)/|S(Zi,K)| is the empirical probability of zi. Then L_U(xi; θ*) in Eq (10) is expressed as follows:

L_U(xi; θ*) = |S(Zi,K)| · Σ_{zi ∈ Zi} ( n(zi)/|S(Zi,K)| ) ( −log p(zi | xi; θ*) ) = |S(Zi,K)| · E_{zi}[ −log p(zi | xi; θ*) ],  (12)

which ends the proof. □
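The identity in Theorem 1 can be checked numerically: for any fixed sample S(Zi,K), summing −log p(z | xi) over the sampled pseudo labels equals |S(Zi,K)| times the expectation under the empirical distribution of the sample. The sketch below verifies this with toy numbers.

```python
# Toy numerical check of Theorem 1: the sampled loss equals |S| times the
# expectation of -log p(z | x) under the empirical distribution of the sample.
import math
from collections import Counter

# Toy per-hypothesis likelihoods p(z | x) and a sampled pseudo-label list S.
p = {"z1": 0.6, "z2": 0.3, "z3": 0.1}
S = ["z1", "z1", "z2", "z1", "z3"]          # K = 5 sampled pseudo labels

lhs = sum(-math.log(p[z]) for z in S)       # summed CTC-style loss over the sample

counts = Counter(S)                          # n(z): occurrences of each hypothesis in S
expectation = sum((counts[z] / len(S)) * -math.log(p[z]) for z in counts)
rhs = len(S) * expectation                   # |S(Z_i, K)| * E[-log p(z | x)]

print(abs(lhs - rhs) < 1e-12)                # True: the identity holds
```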
Theorem 2. (Relationship between the loss of MOCA-S and CTC Loss) Let zi^nt ∈ Zi^nt represent a latent variable corresponding to the pseudo label of a non-target-featured unlabeled instance xi^nt, where Zi^nt denotes the set of label hypotheses for xi^nt, θ* be the model parameter, and S(Zi^nt, M(i)) be the list of M(i)-sampled pseudo labels from Zi^nt. The loss term L_U^nt(xi^nt; θ*) of MOCA-S for each non-target-featured unlabeled instance xi^nt ∈ XU^nt is the expectation of the CTC loss over zi^nt, scaled by the balancing factor |S(Zi^nt, M(i))|:

L_U^nt(xi^nt; θ*) = − Σ_{z ∈ S(Zi^nt, M(i))} log p(z | xi^nt; θ*) = |S(Zi^nt, M(i))| · E_{zi^nt}[ −log p(zi^nt | xi^nt; θ*) ].  (13)

Proof: MOCA-S constructs the set S(Zi^nt, M(i)) of M(i)-sampled pseudo labels for each non-target-featured unlabeled instance xi^nt. The loss term for non-target-featured instances in Eq (13) is expressed as follows:

− Σ_{z ∈ S(Zi^nt, M(i))} log p(z | xi^nt; θ*) = − Σ_{zi^nt ∈ Zi^nt} n(zi^nt) log p(zi^nt | xi^nt; θ*),  (14)

where n(zi^nt) is the count of zi^nt in S(Zi^nt, M(i)). The distribution p(zi^nt | xi^nt; θ) for zi^nt follows Eq (3) and n(zi^nt)/|S(Zi^nt, M(i))| is the empirical probability of zi^nt. Reformulating L_U^nt(xi^nt; θ*) in Eq (13) yields the following expression:

L_U^nt(xi^nt; θ*) = |S(Zi^nt, M(i))| · Σ_{zi^nt ∈ Zi^nt} ( n(zi^nt)/|S(Zi^nt, M(i))| ) ( −log p(zi^nt | xi^nt; θ*) ) = |S(Zi^nt, M(i))| · E_{zi^nt}[ −log p(zi^nt | xi^nt; θ*) ],  (15)

hence the proof. □
Experiments
To explore the following key research questions, we carry out experiments across four real-world datasets, each with distinct settings.
- Q1. Transcription Performance for Ordinary Speech. How accurately does MOCA transcribe speech from the general population into text compared to the baseline models in a semi-supervised setting?
- Q2. Transcription Performance for Characterized Speech. How accurately does MOCA-S transcribe speech instances from speakers with a specific trait compared to the baselines in a semi-supervised setting?
- Q3. Training Trend under Multi-hypotheses. How does the number of hypotheses per unlabeled instance affect the training trajectory?
- Q4. Ablation Study. Does each module of MOCA and MOCA-S improve transcription performance?
Experimental settings
We present the experimental settings, including datasets, baselines, and evaluation metrics. All experiments are conducted on a single GPU machine with an RTX 3080.
Dataset.
We use four real-world speech datasets to evaluate MOCA and MOCA-S. The data statistics are summarized in Table 1. We evaluate MOCA on the LJSpeech [18] and LibriSpeech-dev-clean [61] datasets, which contain recordings from a general population. The LJSpeech dataset comprises 13,100 audio clips, totaling approximately 24 hours of clear English speech from a single female speaker. LJSpeech includes passages from seven non-fiction books, with each clip accompanied by a transcription as its label. LibriSpeech-dev-clean consists of 2,360 speech signals sourced mainly from the LibriVox project audiobooks, amounting to approximately 5.4 hours. In our experiments, we use dev-clean and test-clean for independent training to emulate a data-scarce environment, aiming to demonstrate the effectiveness of MOCA in semi-supervised learning scenarios. For MOCA-S, we evaluate the performance using CommonVoice [62] and SLR83 [63], which contain recordings from speakers with various characteristics. CommonVoice [62] includes 18,374 audio clips with specific traits of speakers such as age, gender, and accent. SLR83 [63] contains 10,627 audio speech instances of speakers from six distinct regions of England and Ireland.
Baselines.
We compare MOCA and MOCA-S with previous ASR methods. Supervised [19] is a basic ASR model that loads pre-trained wav2vec 2.0 and fine-tunes it on labeled speech data. 1-best utilizes the trained Supervised model to generate 1-best pseudo labels for unlabeled speech instances. Self-train [64] is a self-training-based semi-supervised ASR method that dynamically generates pseudo labels during training. Both 1-best and Self-train build on the Supervised backbone and leverage both labeled and pseudo-labeled data. The key difference is that 1-best relies on fixed pseudo labels generated once at the beginning, whereas Self-train continually regenerates them with the updated model during training.
Evaluation metrics.
We use Word Error Rate (WER) and Character Error Rate (CER) as evaluation metrics to assess the performance of MOCA and MOCA-S. These metrics are commonly used in speech recognition to quantify transcription accuracy. The WER metric is defined as follows:
WER = (1/T) Σ_{t=1}^{T} (St + Dt + It) / Nt,

where St, Dt, and It are the numbers of word-level substitutions, deletions, and insertions in the t-th data instance, respectively, Nt is the total number of words in the true transcription of the t-th data instance, and T is the total number of data instances. Similarly, the CER metric is calculated by:

CER = (1/T) Σ_{t=1}^{T} (St^c + Dt^c + It^c) / Nt^c,

where St^c, Dt^c, and It^c denote the numbers of character-level substitutions, deletions, and insertions for the t-th instance, respectively, and Nt^c is the total number of characters in the ground-truth transcription for the t-th instance.
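A minimal sketch computing the per-instance WER via edit distance is shown below; CER follows from applying the same routine at the character level, and averaging over the T instances yields the reported metrics.

```python
# Minimal sketch: WER via Levenshtein edit distance over word tokens;
# CER uses the same routine over characters.
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions, and insertions."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)]

def wer(reference, hypothesis):
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

print(wer("the cat sat on the mat", "the cat sit on mat"))   # 2 errors / 6 words
print(cer("speech", "spech"))                                 # 1 deletion / 6 chars
```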
Settings for MOCA.
The LJSpeech and LibriSpeech datasets are divided into training-labeled, training-unlabeled, and test sets with ratios of and
, respectively, taking into account the limited number of speech samples in LibriSpeech. Specifically, we apply the
split independently to the dev-clean and test-clean subsets of LibriSpeech, using each subset in its entirety to construct separate training and evaluation sets. The training-labeled set consists of speech instances paired with transcription labels, while the training-unlabeled set comprises instances lacking transcriptions. The hypothesis pool is set to include 10 candidates, and the number K of sampled pseudo label is selected from
. Wav2vec 2.0 is adopted as the base ASR model, with its initial parameters configured as described in [21]. We train the initial ASR model on labeled instances for 100 epochs, followed by 60 epochs of fine-tuning on all instances, including labeled and pseudo-labeled ones.
Settings for MOCA-S.
For CommonVoice and SLR83, we set Northern Irish and Southern English accent as the target trait, respectively. We split instances of the target trait into training-labeled, training-unlabeled, and test sets with ratio of for CommonVoice and
for SLR83. Note that all speech without the target trait are included in the training-unlabeled set; only a subset of speech with the target trait are labeled. We set the hypotheses pool to include 10 candidates and vary the maximum number K of sampled pseudo labels in Eq (6) among
. We employ wav2vec 2.0 as our base ASR model, initializing parameters as specified in [21]. We train the initial model for 150 epochs, followed by 20 epochs of fine-tuning.
Transcription performance of MOCA for ordinary speech (Q1)
We evaluate the transcription performance of MOCA against the baselines, as shown in Table 2. MOCA significantly reduces both WER and CER compared to the 1-best model. Notably, using the number K = 5 of sampled pseudo labels yields better transcription accuracy than K = 3, highlighting the benefit of incorporating more pseudo labels from the hypothesis pool. However, an overly large K introduces excessive variability, negatively impacting the performance.
Bold and underlined numbers represent the best and second-best results, respectively. MOCA outperforms the competitors across various settings.
Transcription performance of MOCA-S for characterized speech (Q2)
We evaluate the transcription performance of MOCA-S for the target-featured speech and the baselines in Table 3. WER and CER of MOCA-S improve compared to those of the 1-best and self-train models. Sampling more pseudo labels for target-featured speech instances yields higher transcription accuracy. Since target-featured speech instances are scarce, effectively employing more instances while considering their relation to the target trait improves the transcription performance, relieving the data scarcity problem.
Bold and underlined numbers represent the best and second-best results, respectively. MOCA-S outperforms the competitors across various settings.
Training trend under multi-hypotheses (Q3)
We analyze the training behavior of MOCA with varying numbers K of hypotheses per unlabeled utterance in Fig 5. We evaluate two curricula, MOCA-K-conf and MOCA-K-speed, which use confidence-based and speed-based difficulty scores, respectively, with K sampled pseudo labels. The number of updates per epoch increases with K, since each unlabeled utterance is associated with K pseudo labels. Each epoch denotes one sweep over labeled data and pseudo labels per unlabeled instance.
MOCA shows stable convergence behavior across varying K. This confirms the robustness of MOCA in maintaining stable learning dynamics under varying degrees of pseudo-label uncertainty.
Although a larger K increases pseudo-label diversity and may lead to performance fluctuations or divergent learning behaviors due to noisy labels, all models are observed to converge in a consistent and stable manner. For instance, MOCA-10-conf exhibits a more pronounced increase in error during the early stages, presumably due to the higher likelihood of incorporating incorrect pseudo labels. Nevertheless, it eventually converges stably without further degradation. In contrast, models with smaller K show a more stable start, followed by a similar convergence trend overall. This stability is particularly desirable in semi-supervised training. These results demonstrate that MOCA maintains robust learning dynamics even under high pseudo-label uncertainty, effectively mitigating the impact of noise.
Ablation study (Q4)
We analyze the contribution of each module in MOCA and MOCA-S through the following ablation variants, summarized in Tables 4 and 5:
The best performance is highlighted in bold. Each module plays a crucial role in enhancing the overall transcription performance.
The best performance is highlighted in bold. Each module plays a crucial role in enhancing the overall transcription performance.
- MOCA/MOCA-S-1-best: uses only the top-1 pseudo label to examine the effect of leveraging multiple hypotheses versus a single one.
- MOCA/MOCA-S-uniform-sampling: samples pseudo labels with equal weights rather than by model likelihoods to test the effect of likelihood-based weighting.
- MOCA/MOCA-S-w/o-curriculum: removes curriculum learning to test the role of progressive difficulty scheduling.
- MOCA/MOCA-S-inverse-curriculum: trains from the hardest to the easiest instances, used to evaluate the effect of reversing the proposed easy-to-hard order.
- MOCA-S-target-curriculum: prioritizes speech with the target feature before non-target-featured instances to examine whether explicit trait-based ordering provides additional benefit beyond multiple hypothesis generation.
- MOCA-S-fixed-K: uniformly samples a fixed number K of pseudo labels for non-target-featured unlabeled instances to assess the importance of dynamically adjusting M(i) according to trait relevance.
For MOCA, we set the number K of sampled hypotheses to 10 and use speed-based difficulty scores for curriculum learning. For MOCA-S, the same difficulty scores are used with K = 5 for CommonVoice and K = 10 for SLR83.
Tables 4 and 5 show that each module of MOCA and MOCA-S contributes meaningfully to transcription accuracy. Both MOCA and MOCA-S outperform the 1-best and uniform-sampling variants, confirming the benefit of weighting pseudo labels by model likelihoods. They also surpass the w/o-curriculum and inverse-curriculum variants, with inverse-curriculum yielding the lowest performance, underscoring the value of the proposed curriculum design.
In Table 5, MOCA-S achieves lower error rates than MOCA-S-target-curriculum, which performs the worst on CommonVoice. Because multiple hypothesis generation already addresses the disparity between target-featured and non-target-featured speeches, enforcing a fixed trait-based order harms performance. MOCA-S also outperforms MOCA-S-fixed-K, indicating that sampling a fixed number of pseudo labels without considering trait relevance reduces effectiveness. Overall, these ablation results demonstrate the necessity of both dynamic sampling and curriculum learning for robust semi-supervised training.
Supplementary experiments
Additional results on gender-based speech
To further validate the applicability of MOCA-S beyond accent-based domains, we conduct an additional experiment using the SLR83 dataset. While our main experiments treat accents as the defining characterization, we reinterpret SLR83 by splitting speech based on gender (male vs. female), thereby creating a different type of domain variation. Table 6 reports the results when targeting male speech. The baseline methods Supervised, 1-best, and Self-train yield WERs of 14.25, 13.90, and 14.08, respectively. In contrast, MOCA-S achieves a WER of 13.75, obtaining lower error rates than the baselines. A similar trend is observed for CER: Supervised, 1-best, and Self-train achieve CERs of 4.41, 4.34, and 4.38, respectively, while MOCA-S attains 4.33. These results demonstrate that MOCA-S is not restricted to accent-based characterization. MOCA-S effectively captures domain variations beyond accents, such as gender differences, thereby broadening its applicability to a wider range of characterized speech scenarios.
The best performance is highlighted in bold. MOCA-S consistently outperforms baselines.
Robustness of MOCA and MOCA-S
We evaluate MOCA on LJSpeech and MOCA-S on SLR83 to examine their robustness. All experiments are conducted five times with different random seeds, and the reported results correspond to the mean and standard deviation. We use K = 5 with the confidence-based difficulty score for MOCA and K = 10 with the speed-based difficulty score for MOCA-S, following the best-performing model configurations in Tables 2 and 3, respectively. As shown in Tables 7 and 8, both MOCA and MOCA-S consistently outperform the baseline methods in terms of WER and CER. Furthermore, the relatively small deviations across runs in MOCA indicate that the improvements are stable and not sensitive to random initialization, demonstrating its robustness.
The best performance is highlighted in bold. MOCA consistently outperforms baselines.
The best performance is highlighted in bold. MOCA-S consistently outperforms baselines.
Discussion
This work demonstrates that multi-hypothesis pseudo labeling combined with curriculum learning significantly enhances semi-supervised ASR for both ordinary and characterized speeches, directly tackling the challenge of limited labeled data. By sampling multiple hypotheses and guiding training through difficulty scores, MOCA and MOCA-S capture richer information than conventional 1-best methods [33,34,36]. This avoids over-reliance on a single prediction and better reflects the range of potential ground-truth labels. Interestingly, although we expected explicit target-prioritized ordering to improve the performance of MOCA-S, it did not. This outcome indicates that additional heuristics may even hinder training, and that our multi-hypothesis design already manages trait disparities effectively. While we did not extend our experiments to low-resource or multilingual ASR, this boundary of scope highlights that such scenarios require original ideas to properly address cross-lingual variation. We regard this as an exciting avenue for future research. In this sense, our framework provides a strong foundation that not only advances semi-supervised ASR but also inspires broader extensions to increasingly diverse speech scenarios.
Future works
We outline how our multi-hypotheses-based curriculum-learning framework generalizes to diverse ASR settings.
Low-resource ASR.
Low-resource ASR corresponds to scenarios where only a small amount of transcribed speech is available, typically just a few hours. In such cases, models must rely heavily on self-supervised representations and pseudo-labeling strategies. MOCA is well-suited to this context, as it is designed to operate effectively with minimal supervision. The training process of MOCA is directly applicable by generating multiple hypotheses using a pretrained model (e.g., wav2vec 2.0 [28]) with beam decoding, followed by instance-level sampling and curriculum-based scheduling. No modification to the underlying architecture or loss function is required, making MOCA a plug-in strategy for improving learning efficiency in data-scarce environments.
Multilingual ASR.
Multilingual ASR aims to build a unified model capable of transcribing speech from multiple languages. This setting introduces challenges such as language identification, cross-lingual generalization, and imbalanced data distributions across languages. While our current work does not target this setting directly, the core ideas behind MOCA-S can be extended to multilingual contexts. For example, each language can be treated as a distinct domain, with the number and training order of pseudo labels adjusted using cross-lingual similarity metrics (e.g., phonetic or embedding-based distance). However, directly realizing such an extension is difficult since substantial differences in word order, phonetic inventory, and vocabulary pools require not only technical adjustments but also novel ideas for modeling language-level uncertainty and adaptation. We therefore leave this as an important direction for future work.
To summarize, our framework is readily applicable to low-resource ASR settings. While additional mechanisms to account for language-specific variation and domain-level uncertainty are required, extensions to multilingual ASR are conceivable.
Conclusions
In this study, we propose MOCA and MOCA-S, robust semi-supervised ASR methods for ordinary and characterized speech, respectively. MOCA and MOCA-S address the critical limitations of existing pseudo-labeling based approaches, particularly in handling pseudo-label uncertainty. The main idea is to incorporate multi-hypotheses-based pseudo labels for the unlabeled instances. MOCA-S dynamically adjusts the number of pseudo labels for non-target-featured speech instances based on the target trait. This ensures that non-target-featured data meaningfully enriches training, effectively addressing data scarcity challenges of target-featured instances. Additionally, our curriculum learning strategy with a tailored difficulty score prioritizes easier examples in initial training phases, allowing the model to progressively tackle more complex cases. This structured learning process minimizes reliance on pseudo-label quality and improves the model’s robustness. Experimental results on real-world datasets demonstrate improved transcription performance and faster convergence, underscoring the efficiency of our multi-hypotheses and adaptive sampling techniques in building robust ASR models for both ordinary and characterized speeches.
References
- 1.
Shan C, Zhang J, Wang Y, Xie L. Attention-based end-to-end speech recognition on voice search. In: ICASSP. 2018.
- 2.
Joshi R, Kannan V. Attention based end to end speech recognition for voice search in Hindi and English. In: FIRE. ACM; 2021. p. 107–13.
- 3.
Cantiabela Z, Pardede HF, Zilvan V, Sulandari W, Yuwana RS, Supianto AA. Deep learning for robust speech command recognition using convolutional neural networks (CNN). In: IC3INA; 2022.
- 4. Long Y, Li Y, Wei S, Zhang Q, Yang C. Large-scale semi-supervised training in deep learning acoustic model for ASR. IEEE Access. 2019;7:133615–27.
- 5. Adedeji A, Joshi S, Doohan B. The sound of healthcare: Improving medical transcription ASR accuracy with large language models. CoRR. 2024.
- 6.
Zhao X, Liu F, Song C, Wu Z, Kang S, Tuo D. Disentangling content and fine-grained prosody information via hybrid ASR bottleneck features for voice conversion. In: ICASSP; 2022.
- 7.
Furui S. Automatic speech recognition and its application to information extraction. In: ACL; 1999. p. 11–20.
- 8.
D’Haro LF, Banchs RE. Automatic Correction of ASR Outputs by Using Machine Translation. In: INTERSPEECH; 2016.
- 9.
Alabau V, Rodríguez-Ruiz L, Sanchís A, Martínez-Gómez P, Casacuberta F. On multimodal interactive machine translation using speech recognition. In: ICMI; 2011. p. 129–36.
- 10.
Goyal A, Garera N. Building accurate low latency ASR for streaming voice search in E-commerce. In: ACL (industry). 2023. p. 276–83.
- 11.
Shor J, Emanuel D, Lang O, Tuval O, Brenner MP, Cattiau J, et al. Personalizing ASR for dysarthric and accented speech with limited data. In: INTERSPEECH. ISCA; 2019. p. 784–8.
- 12. Zheng X, Phukon B, Hasegawa-Johnson M. Fine-tuning automatic speech recognition for people with Parkinson’s: An effective strategy for enhancing speech technology accessibility. CoRR. 2024.
- 13. Takashima R, Sawa Y, Aihara R, Takiguchi T, Imai Y. Dysarthric speech recognition using pseudo-labeling, self-supervised feature learning, and a joint multi-task learning approach. IEEE Access. 2024;12:36990–9.
- 14.
Weninger F, Mana F, Gemello R, Andrés-Ferrer J, Zhan P. Semi-supervised learning with data augmentation for end-to-end ASR. In: INTERSPEECH; 2020.
- 15. Zhang Y, Park DS, Han W, Qin J, Gulati A, Shor J. BigSSL: exploring the frontier of large-scale semi-supervised learning for automatic speech recognition. CoRR. 2021.
- 16.
Higuchi Y, Karube K, Ogawa T, Kobayashi T. Hierarchical conditional end-to-end ASR with CTC and multi-granular subword units. In: ICASSP; 2022.
- 17.
Xu Q, Likhomanenko T, Kahn J, Hannun AY, Synnaeve G, Collobert R. Iterative pseudo-labeling for speech recognition. In: INTERSPEECH; 2020.
- 18.
Ito K, Johnson L. The LJ Speech Dataset. 2017. https://keithito.com/LJ-Speech-Dataset/
- 19.
Schneider S, Baevski A, Collobert R, Auli M. Wav2vec: unsupervised pre-training for speech recognition. In: INTERSPEECH; 2019.
- 20.
Vachhani B, Bhat C, Kopparapu SK. Data augmentation using healthy speech for dysarthric speech recognition. In: 19th Annual Conference of the International Speech Communication Association, Interspeech 2018, Hyderabad, India, 2018. p. 471–5.
- 21.
Radford A, Kim JW, Xu T, Brockman G, McLeavey C, Sutskever I. Robust speech recognition via large-scale weak supervision. In: ICML; 2023.
- 22.
Korkut C, Haznedaroglu A, Arslan L. Comparison of deep learning methods for spoken language identification. In: SPECOM; 2020.
- 23. Al-Zakarya MA, Al-Irhaim YF. Unsupervised and semi-supervised speech recognition system: a review. RJCSM. 2023;17(1):34–42.
- 24.
Liu M, Ke Y, Zhang Y, Shao W, Song L. Speech emotion recognition based on deep learning. In: TENCON; 2022.
- 25.
Javanmardi F, Tirronen S, Kodali M, Kadiri SR, Alku P. Wav2vec-based detection and severity level classification of dysarthria from speech. In: ICASSP; 2023.
- 26.
Nguyen TN, Pham NQ, Waibel A. Accent conversion using pre-trained model and synthesized data from voice conversion. In: Interspeech; 2022.
- 27. Kreyssig FL, Shi Y, Guo J, Sari L, Mohamed A, Woodland PC. Biased self-supervised learning for ASR. CoRR. 2022.
- 28.
Baevski A, Zhou Y, Mohamed A, Auli M. wav2vec 2.0: a framework for self-supervised learning of speech representations. In: NeurIPS; 2020.
- 29.
Baevski A, Mohamed A. Effectiveness of self-supervised pre-training for ASR. In: ICASSP; 2020.
- 30.
Rouhe A, Virkkunen A, Leinonen J, Kurimo M. Low resource comparison of attention-based and hybrid ASR exploiting wav2vec 2.0. In: Interspeech; 2022.
- 31.
Kim J, Yoon H, Park KH, Kang U. Accurate graph-based multi-positive unlabeled learning via disentangled multi-view feature propagation. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 2025. p. 1149–59. https://doi.org/10.1145/3711896.3736827
- 32. Kim J, Park KH, Yoon H, Kang U. Accurate link prediction for edge-incomplete graphs via PU learning. AAAI. 2025;39(17):17877–85.
- 33.
Drugman T, Pylkkonen J, Kneser R. Active and semi-supervised learning in ASR: benefits on the acoustic and language models. arXiv preprint 2019. https://arxiv.org/abs/1903.02852
- 34.
Wallington E, Kershenbaum B, Klejch O, Bell P. On the Learning Dynamics of Semi-Supervised Training for ASR. In: Interspeech 2021 . 2021. p. 716–20. https://doi.org/10.21437/interspeech.2021-1777
- 35.
Metze F, Gandhe A, Miao Y, Sheikh Z, Wang Y, Xu D, et al. Semi-supervised training in low-resource ASR and KWS. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2015. p. 4699–703. https://doi.org/10.1109/icassp.2015.7178862
- 36.
Peyser C, Picheny M, Cho K, Prabhavalkar R, Huang WR, Sainath TN. A comparison of semi-supervised learning techniques for streaming ASR at scale. In: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. p. 1–5. https://doi.org/10.1109/icassp49357.2023.10095838
- 37.
Park DS, Zhang Y, Jia Y, Han W, Chiu C, Li B. Improved noisy student training for automatic speech recognition. In: INTERSPEECH; 2020.
- 38.
Kim J, Park KH, Yoon H, Kang U. Domain-aware data selection for speech classification via meta-reweighting. In: Interspeech 2024 . 2024. p. 797–801. https://doi.org/10.21437/interspeech.2024-2368
- 39. Qian Y, Gong X, Huang H. Layer-wise fast adaptation for end-to-end multi-accent speech recognition. IEEE/ACM Trans Audio Speech Lang Process. 2022;30:2842–53.
- 40. Leung W, Cross M, Ragni A, Goetze S. Training data augmentation for dysarthric automatic speech recognition by text-to-dysarthric-speech synthesis. CoRR. 2024.
- 41. Khan SI, Pachori RB. Automated bundle branch block detection using multivariate fourier–bessel series expansion-based empirical wavelet transform. IEEE Trans Artif Intell. 2024;5(11):5643–54.
- 42. Khan SI, Pachori RB. Automated posterior myocardial infarction detection from vectorcardiogram and derived vectorcardiogram signals using MVFBSE-EWT method. Digital Signal Processing. 2025;163:105244.
- 43.
Khan SI, Pachori RB. Empirical wavelet transform-based framework for diagnosis of epilepsy using EEG signals. AI-enabled smart healthcare using biomedical signals. IGI Global Scientific Publishing; 2022. p. 217–39.
- 44.
Park Y, Jang J-G, Kang U. Fast and accurate partial fourier transform for time series data. In: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2021. p. 1309–18. https://doi.org/10.1145/3447548.3467293
- 45.
Park Y, Kim J, Kang U. Fast multidimensional partial fourier transform with automatic hyperparameter selection. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. p. 2328–39. https://doi.org/10.1145/3637528.3671667
- 46.
Park Y, Kim K, Kang U. PuzzleTensor: a method-agnostic data transformation for compact tensor factorization. In: Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 2025. p. 2234–44. https://doi.org/10.1145/3711896.3737095
- 47.
Lee S, Park Y-C, Kang U. Accurate coupled tensor factorization with knowledge graph. In: 2024 IEEE International Conference on Big Data (BigData). 2024. p. 1009–18. https://doi.org/10.1109/bigdata62323.2024.10825614
- 48.
Kim J, Park KH, Jang J-G, Kang U. Fast and accurate domain adaptation for irregular tensor decomposition. In: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024. 1383–94. https://doi.org/10.1145/3637528.3671670
- 49. Liu S, Geng M, Hu S, Xie X, Cui M, Yu J. Recent progress in the CUHK dysarthric speech recognition system. CoRR. 2022.
- 50.
Kleinert M, Helmke H, Siol G, Ehr H, Cerna A, Kern C, et al. Semi-supervised adaptation of assistant based speech recognition models for different approach areas. In: 2018 IEEE/AIAA 37th Digital Avionics Systems Conference (DASC). 2018. p. 1–10. https://doi.org/10.1109/dasc.2018.8569879
- 51.
Nallasamy U, Metze F, Schultz T. In: 2012. 13–7.
- 52.
Tomanek K, Zayats V, Padfield D, Vaillancourt K, Biadsy F. Residual adapters for parameter-efficient ASR adaptation to atypical and accented speech. arXiv preprint 2021. https://arxiv.org/abs/2109.06952
- 53.
Vachhani B, Bhat C, Das B, Kopparapu SK. Deep autoencoder based speech features for improved dysarthric speech recognition. In: INTERSPEECH. ISCA; 2017. p. 1854–8.
- 54.
Vakirtzian S, Tsoukala C, Bompolas S, Mouzou K, Stamou V, Paraskevopoulos G, et al. Speech recognition for greek dialects: a challenging benchmark. In: Interspeech 2024, 2024. p. 3974–8. https://doi.org/10.21437/interspeech.2024-2443
- 55.
Kim J, Park KH, Kang U. Accurate semi-supervised automatic speech recognition via multi-hypotheses-based curriculum learning. In: PAKDD; 2024.
- 56.
Vyas A, Madikeri SR, Bourlard H. Comparing CTC and LFMMI for out-of-domain adaptation of wav2vec 2.0 acoustic model. In: Interspeech; 2021.
- 57. Nguyen Q, Valizadegan H, Hauskrecht M. Learning classification models with soft-label information. J Am Med Inform Assoc. 2014;21(3):501–8. pmid:24259520
- 58.
Mirzaei MS, Meshgi K, Kawahara T. Automatic Speech Recognition Errors as a Predictor of L2 Listening Difficulties. In: CL4LC@COLING 2016 . The COLING 2016 Organizing Committee; 2016. p. 192–201.
- 59.
Li Y, Zhao Z, Klejch O, Bell P, Lai C. ASR and emotional speech: a word-level investigation of the mutual impact of speech and emotion recognition. In: INTERSPEECH. 2023. p. 1449–53.
- 60. Karakasidis G, Kurimo M, Bell P, Grósz T. Comparison and analysis of new curriculum criteria for end-to-end ASR. Speech Communication. 2024;163:103113.
- 61.
Panayotov V, Chen G, Povey D, Khudanpur S. Librispeech: an ASR corpus based on public domain audio books. In: ICASSP; 2015.
- 62.
Ardila R, Branson M, Davis K, Henretty M, Kohler M, Meyer J. Common voice: a massively-multilingual speech corpus. arXiv preprint 2019. https://arxiv.org/abs/1912.06670
- 63.
Demirsahin I, Kjartansson O, Gutkin A, Rivera C. Open-source multi-speaker corpora of the English accents in the British Isles. In: Proceedings of the 12th Language Resources and Evaluation Conference (LREC). 2020. https://www.aclweb.org/anthology/2020.lrec-1.804
- 64.
Chen Y, Wang W, Wang C. Semi-supervised ASR by end-to-end self-training. arXiv preprint 2020. https://arxiv.org/abs/2001.09128