Confusion2Vec 2.0: Enriching ambiguous spoken language representations with subwords


Abstract

Word vector representations enable machines to encode human language for spoken language understanding and processing. Confusion2vec, motivated by human speech production and perception, is a word vector representation that encodes the ambiguities present in human spoken language in addition to semantic and syntactic information. Confusion2vec provides a robust spoken language representation by considering inherent human language ambiguities. In this paper, we propose a novel word vector space estimation by unsupervised learning on lattices output by an automatic speech recognition (ASR) system. We encode each word in the Confusion2vec vector space by its constituent subword character n-grams. We show that the subword encoding helps better represent the acoustic perceptual ambiguities in human spoken language via information modeled on lattice-structured ASR output. The usefulness of the proposed Confusion2vec representation is evaluated using analogy and word similarity tasks designed for assessing semantic, syntactic and acoustic word relations. We also show the benefits of subword modeling for acoustic ambiguity representation on the task of spoken language intent detection. The results significantly outperform existing word vector representations when evaluated on erroneous ASR outputs, providing improvements of up to 13.12% relative to the previous state of the art in intent detection on the ATIS benchmark dataset. We demonstrate that Confusion2vec subword modeling eliminates the need for retraining or adapting the natural language understanding models on ASR transcripts.


Introduction

Speech is the primary and most natural mode of communication for humans. This also makes it attractive for human-computer interaction, which in turn requires decoding human language to enable spoken language understanding. Human language is a complex construct involving multiple dimensions of information, including semantics and syntax, and it often contains ambiguities that make machine inference of communication intent, emotions and similar attributes challenging. Several word vector representations have been proposed in the natural language processing community for effectively describing human language.

Contextual modeling techniques like language modeling, i.e., predicting the next word in the sentence given a window of preceding context, have been shown to model meaningful word representations [1, 2]. Bag-of-words based contextual modeling, where the current word is predicted given both its left and right (local) contexts, has been shown to capture language semantics and syntax [3]. Similarly, predicting local context from the current word, referred to as skip-gram modeling, is shown to better represent semantic and syntactic distances between words [4]. In [5], log bi-linear models combining global word co-occurrence information and local context information, termed global vectors (GloVe), are shown to produce a meaningfully structured vector space. Bi-directional language models are proposed in [6], where internal states of deep neural networks are combined to model complex characteristics of word use and its variance over linguistic contexts. The advantages of bi-directional modeling are further exploited along with self-attention using transformer networks [7] to estimate a representation, termed BERT (Bidirectional Encoder Representations from Transformers), that has shown its utility on a multitude of natural language understanding tasks [8]. Models such as BERT and ELMo estimate word representations that vary depending on the context, whereas context-free representations, including GloVe and Word2Vec, generate a single representation irrespective of the context.

However, most word vector representations infer their knowledge through contextual modeling, and many of the inherent ambiguities present in human language are often unrecognized or ignored. For instance, from the perspective of spoken language, the ambiguities can be associated with how similar words sound: the words “see” and “sea” sound acoustically identical but have different meanings. The ambiguities can also be associated with the underlying speech signal itself, owing to the wide range of acoustic environments involving noise, overlapped speech, and channel and room characteristics. These ambiguities often surface as errors in ASR output. Most of the existing word vector representations, such as word2vec [3, 4], fasttext [9], GloVe [5], BERT [8] and ELMo [6], do not account for the ambiguities present in speech signals and thus degrade when processing noisy ASR transcripts.

Confusion2vec was recently proposed to handle representation ambiguity information present in human language [10]. Confusion2vec is estimated by unsupervised skip-gram training on the ASR output lattices and confusion networks. The analysis of inherent acoustic ambiguity information of the embeddings displayed meaningful interactions between the semantic-syntactic subspace and acoustic similarity subspaces. In [11], the usefulness of the Confusion2vec was confirmed on the task of spoken language intent detection. The Confusion2vec representation significantly outperformed typical word embeddings including word2vec and GloVe when evaluated on noisy ASR transcripts by reducing the classification error rate by approximately 20% relative.

Prior attempts at leveraging information present in word lattices and word confusion networks have been successful for multiple tasks [12–17]. However, they have some limitations, as these prior works estimate the embedding in a supervised manner, trained specifically with task-specific labels. Consequently, the main downside is that the word representations estimated by such techniques are task-dependent and restricted to a particular domain and dataset. Moreover, most task-specific datasets are limited in availability, and task-specific speech data are expensive to collect. The advantage of Confusion2Vec is that it estimates a generic, task-independent word vector representation via unsupervised learning on lattices or confusion networks generated by an ASR system on any speech conversations.

In this paper, we extend the previously proposed Confusion2Vec representation framework by incorporating subwords to represent each word for modeling both the acoustic ambiguity information and the contextual information. Each word is modeled as a sum of its constituent character n-grams. Throughout this paper we refer to character n-grams as subwords. Our motivations for using subwords are the following: (i) they incorporate morphological information by encoding the internal structure of words [9], (ii) acoustically ambiguous words tend to have more similar bags of character n-grams, (iii) subwords help model under-represented words more efficiently, thereby leading to more robust estimation with limited available data [9], which matters here since training Confusion2Vec is restricted to ASR lattice outputs, and (iv) subwords enable representations for out-of-vocabulary words [18], which are commonplace with end-to-end ASR systems that output characters.

Although the use of subwords (character n-grams) may appear commonplace in the NLP domain in terms of practicality, to the best of our knowledge, our work is the first to explore encoding character n-grams to model acoustic ambiguity jointly with natural language semantics. Unlike typical applications of distributed semantics, in our case each subword is mapped to a much larger distribution of words, and in some cases we observe conflicting semantic and acoustic interactions, which render the modeling task more complex and challenging. From an alternative perspective, we model information projected from multiple modalities, i.e., acoustics and natural language, which embeds inherent information including audio channel characteristics, room impulse response and noise environments. This further adds to the modeling richness and novelty of the proposed work.

The rest of the paper is organized as follows: Confusion2vec is introduced in Section Confusion2Vec Representation Framework. The proposed subword modeling is presented in Section Confusion2Vec 2.0 subword model. Section Evaluations gives details of the evaluation techniques employed for assessing the word embedding models. The experimental setup and results of various analogy and similarity tasks are presented in section Analogy & Similarity Tasks. Section Spoken Language Intent Detection presents the application of the proposed word vector representation to the spoken language intent detection task. Finally, the paper is concluded in section Conclusion.

Confusion2Vec representation framework

In psycho-acoustics, it is established that humans also relate words with how they sound [19] in addition to semantics and syntax. Inspired by principles of human speech production and perception, we previously proposed Confusion2vec [10]. The core idea is to estimate a hyper-space that not only captures the semantics and syntax of human language, but also augments the vector space with acoustic ambiguity information, i.e., word acoustic similarity information. In other words, word2vec, GloVe can be viewed as a subspace of the Confusion2vec vector space.

Several different methodologies are proposed in [10] for capturing the ambiguity information. The methodologies are adaptations of skip-gram modeling to word confusion networks or lattice-like structures. Word lattices are directed acyclic weighted graphs of all the word sequences that are likely possible. A confusion network is a specific type of lattice with the constraint that each word sequence passes through every node of the graph. Such lattice-like structures can be derived from machine learning algorithms that output probability measures, for example, an ASR system. Fig 1 illustrates a confusion network that can possibly result from a speech recognition system. Unlike the simple sentences typically used for training word embeddings like word2vec, GloVe, BERT and ELMo, the information in the confusion network can be viewed along two dimensions: (i) the contextual dimension, and (ii) the acoustic ambiguity dimension.

Fig 1. Example confusion network output by ASR for the ground-truth phrase—“I want to sit”.

Figure adapted from P. G. Shivakumar and P. Georgiou, “Confusion2vec: Towards enriching vector space word representations with representational ambiguities,” PeerJ Computer Science, vol. 5, p. e195, 2019 [10].

More specifically, four different configurations of skip-gram modeling algorithms are proposed in our previous work [10], namely: (i) top-confusion, (ii) intra-confusion, (iii) inter-confusion, and (iv) hybrid model. The top-confusion version considers only the most-probable path of the ASR confusion network and applies the typical skip-gram model on it. The intra-confusion version applies the skip-gram modeling on the acoustic ambiguity dimension of the confusion network and ignores the contextual information, i.e., each ambiguous word alternative is predicted by the other over a pre-defined local context. The inter-confusion version applies the skip-gram modeling on the contextual dimension but over each of the acoustic ambiguous words. The hybrid model is a combination of both the intra and inter-confusion configurations. More information on the training configuration is available in [10]. The present work builds upon this basic Confusion2vec framework.
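The top-confusion configuration can be illustrated with a minimal Python sketch. The confusion network below is a hypothetical stand-in for Fig 1: the grouping of words into time-steps and the posterior values are assumptions made for illustration, and the helper names `top_confusion_path` and `skipgram_pairs` are not from [10].

```python
# Hypothetical confusion network for the phrase "I want to sit".
# Each time-step holds (word, posterior) alternatives; the posteriors
# are made-up values, not taken from Fig 1.
confusion_network = [
    [("I", 0.7), ("eye", 0.3)],
    [("want", 0.5), ("wand", 0.2), ("won't", 0.2), ("what", 0.1)],
    [("to", 0.6), ("two", 0.3), ("tees", 0.1)],
    [("sit", 0.4), ("seat", 0.3), ("seed", 0.2), ("eat", 0.1)],
]


def top_confusion_path(network):
    """Keep only the most probable word at each time-step."""
    return [max(slot, key=lambda wp: wp[1])[0] for slot in network]


def skipgram_pairs(words, window):
    """Standard skip-gram (input, target) pairs over a word sequence."""
    pairs = []
    for t, word in enumerate(words):
        for c in range(max(0, t - window), min(len(words), t + window + 1)):
            if c != t:
                pairs.append((word, words[c]))
    return pairs


best_path = top_confusion_path(confusion_network)
pairs = skipgram_pairs(best_path, window=1)
print(best_path)
print(pairs)
```

With these assumed posteriors the best path recovers the ground-truth phrase, so the acoustic alternatives play no part in the training pairs. That is the information the other three configurations are designed to use.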

Confusion2Vec 2.0 subword model

Subword encoding of words has been popular in modeling the semantics and syntax of language using word vector representations [6, 8, 9]. The use of subwords is mainly motivated by the fact that subwords incorporate morphological information, which can be helpful, for example, in relating prefixes, suffixes and the word root. In this work, we apply subword representation for encoding the word ambiguity information in human language. We believe there is a compelling case for using subwords to represent the acoustic similarities (ambiguities) between words, for the following reasons. Similar-sounding words often have similar sets of characters and thus a high degree of overlap between their sets of character n-grams. This subword-based feature encoding helps model the level of overlap and estimate the magnitude of acoustic similarity robustly. Moreover, the use of subwords should help in efficiently encoding under-represented words in the language [9]. This is crucial in the case of Confusion2vec because we are restricted to speech data and their corresponding decoded ASR lattices for training, which leads to data sparsity issues compared to typical word vector representations that can be trained on large amounts of easily available plain text. Another important aspect is the ability to represent out-of-vocabulary words [18], which are common with end-to-end ASR systems that output character sequences.

In the proposed model, each word w is represented as a sum of its constituent character n-gram subwords. This enables the model to infer the internal structure of each word. For example, the word “want” is represented by the vector sum of the following subwords: <wa, wan, ant, nt>, <wan, want, ant>, <want, want>, <want>. The symbols < and > represent the beginning and end of the word. The n-grams are generated for n = 3 up to n = 6. The choice of character n-gram lengths is language dependent and was chosen empirically for English [9]. It is apparent that the acoustically ambiguous, similar-sounding word “wand” has a high degree of overlap with this set of character n-grams.
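The n-gram extraction described above can be sketched in a few lines of Python. This is an illustrative helper (the name `char_ngrams` and the Jaccard overlap measure are assumptions introduced here, not part of the authors' implementation); it only applies the stated rule of boundary symbols < and > with n = 3 to 6.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the set of character n-grams of `word`, using < and > as
    word-boundary symbols and n ranging from n_min to n_max."""
    token = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams


want = char_ngrams("want")
wand = char_ngrams("wand")
shared = want & wand

# Jaccard overlap is one simple (assumed) way to quantify the shared subwords.
jaccard = len(shared) / len(want | wand)

print(sorted(want))
print(sorted(shared), round(jaccard, 3))
```

For a four-letter word the padded token has six characters, so the single 6-gram coincides with the whole word. The shared subwords of “want” and “wand” are the ones covering the common prefix, and these are the parameters the two words share in the embedding.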

In this paper, we consider two modeling variations: (i) inter-confusion, and (ii) intra-confusion versions of Confusion2vec with the subword encoding.

Intra-Confusion model

The goal of the intra-confusion model is to estimate the inter-word relations between the acoustically ambiguous words that appear in the ASR lattices. For this, we perform skip-gram modeling over the acoustic similarity dimension (see Fig 1) and ignore the contextual dimension of the utterance. The objective of the intra-confusion model is to maximize the following log-likelihood:

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{a\in A_t}\;\sum_{a'\in A_t\setminus\{a\}}\log p\left(w_{t,a'}\mid w_{t,a}\right) \tag{1}$$

where $T$ is the length of the utterance (confusion network) in terms of number of words, and $w_{i,j}$ is the word in the confusion network output by the ASR at time-step $i$, with $j$ the index of the word among the ambiguous alternatives. $A_t$ is the set of indices of all ambiguous words at time-step $t$, $a$ is the index of the current word along the acoustic ambiguity dimension, and $A_t\setminus\{a\}$ is the subset of ambiguous words at time-step $t$ excluding the current word. For example, in Fig 1, for the current word $w_{t,a}$ = “want”, the targets are drawn from {wand, won’t, what}. Additionally, for subword encoding, each input word is represented as:

$$w=\sum_{s\in S_w}x_s \tag{2}$$

where $S_w$ is the set of all character n-grams of the word ranging from n = 3 to n = 6, together with the word itself, and $x_s$ is the vector representation of the n-gram subword $s$. A few training samples (input, target) generated for this configuration from the confusion network in Fig 1 are (I, eye), (eye, I), (want, wand), (want, won’t), (won’t, what), (wand, what), etc.
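As a rough illustration of how intra-confusion training samples arise, the sketch below enumerates every ordered pair of distinct alternatives within each time-step. The grouping of Fig 1 words into time-steps is an assumption, the function name `intra_confusion_pairs` is hypothetical, and the sampled ambiguity window used in actual training is ignored for simplicity.

```python
from itertools import permutations

# Assumed grouping of the Fig 1 alternatives into time-steps.
slots = [
    ["I", "eye"],
    ["want", "wand", "won't", "what"],
    ["to", "two", "tees"],
    ["sit", "seat", "seed", "eat"],
]


def intra_confusion_pairs(network):
    """(input, target) pairs between distinct alternatives of the same
    time-step; the contextual (time) dimension is ignored."""
    pairs = []
    for slot in network:
        pairs.extend(permutations(slot, 2))
    return pairs


intra_pairs = intra_confusion_pairs(slots)
print(len(intra_pairs))
print(intra_pairs[:4])
```

No pair crosses a time-step boundary, so this configuration on its own carries only acoustic confusability information.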

Inter-Confusion model

The aim of the inter-confusion model is to jointly model the contextual co-occurrence information and the acoustic ambiguity co-occurrence information along both axes depicted in the confusion network. Here, the skip-gram modeling is performed over the time context and over all the possible acoustic ambiguities. The objective of the inter-confusion model is to maximize the following log-likelihood:

$$\frac{1}{T}\sum_{t=1}^{T}\;\sum_{a\in A_t}\;\sum_{c\in C_t}\;\sum_{a'\in A_c}\log p\left(w_{c,a'}\mid w_{t,a}\right) \tag{3}$$

where $C_t$ is the set of indices of the confusion network nodes, i.e., the words around the current word $t$ along the time axis, and $c$ is the current context index. $A_c$ is the set of indices of acoustically ambiguous words at context $c$, and $A_t$ and $a$ are as defined for Eq 1. For example, for the current word $w_{t,a}$ = “want” in Fig 1, $A_c$ ⊆ {I, eye, two, tees, to, seat, sit, seed, eat} and $A_t$ ⊆ {wand, won’t, what, want}. Note that each input word is subword encoded as in Eq 2. A few training samples (input, target) generated for this configuration are (want, I), (want, eye), (want, two), (want, to), (want, tees), (what, I), (what, eye), (what, to), (what, tees), (what, two), (won’t, eye), etc.
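The inter-confusion sampling can be sketched in the same style. As before, the time-step grouping of the Fig 1 words is an assumption, `inter_confusion_pairs` is a hypothetical helper, and a fixed window replaces the randomly sampled window used in training.

```python
# Assumed grouping of the Fig 1 alternatives into time-steps.
slots = [
    ["I", "eye"],
    ["want", "wand", "won't", "what"],
    ["to", "two", "tees"],
    ["sit", "seat", "seed", "eat"],
]


def inter_confusion_pairs(network, window):
    """(input, target) pairs where every alternative at time-step t is
    paired with every alternative at each neighbouring time-step c."""
    pairs = []
    for t, slot in enumerate(network):
        lo, hi = max(0, t - window), min(len(network), t + window + 1)
        for c in range(lo, hi):
            if c == t:
                continue  # same-slot pairs belong to the intra-confusion model
            for word in slot:
                for context_word in network[c]:
                    pairs.append((word, context_word))
    return pairs


inter_pairs = inter_confusion_pairs(slots, window=2)
want_targets = {target for source, target in inter_pairs if source == "want"}
print(len(inter_pairs))
print(sorted(want_targets))
```

With a window of 2, the targets generated for “want” are exactly the nine context words listed in the example above, and none of its own acoustic alternatives appear as targets.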

Training loss and objective

Negative sampling is employed for training the embedding model. Negative sampling was first introduced for training the word2vec representation [4]. It is a simplification of the Noise Contrastive Estimation objective [20]. Negative sampling for training the embedding can be posed as a set of binary classification problems operating on two classes: presence of signal or absence (noise). In the context of word embeddings, the presence of the context words is treated as the positive class, and the negative class is randomly sampled from the unigram distribution of the vocabulary. The negative sampling loss function to be optimized for the subword model can be expressed as:

$$\log\sigma\left(s(w_i,w_t)\right)+\sum_{k=1}^{K}\mathbb{E}_{w_k\sim P_n(w)}\left[\log\sigma\left(-s(w_i,w_k)\right)\right] \tag{4}$$

where $s(w_i,w_t)=\sum_{s\in S_{w_i}}x_s^{\top}v_{w_t}$, $\sigma$ is the logistic sigmoid, $w_i$ is the input word, $w_t$ is the output word, $S_{w_i}$ is the set of character n-gram subwords for the word $w_i$, $x_s$ is the vector representation of the character n-gram subword $s$, and $v_{w_t}$ is the output vector representation of the target word $w_t$. $K$ is the number of negative samples drawn from the noise distribution $P_n(w)$. The noise distribution $P_n(w)$ is chosen to be the unigram distribution of words in the vocabulary raised to the 3/4th power, as suggested in [4]. Note that for Confusion2vec the input word $w_i$ and target word $w_t$ are derived according to Eqs 1 and 3 for implementing the respective training configurations.
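A minimal NumPy sketch of this loss for a single (input, target) pair is given below, assuming the fastText-style score in which the input is the sum of its subword vectors (Eq 2) and is compared to output vectors by a dot product. The function names and the toy dimensions are illustrative assumptions; the value is returned negated so that it can be minimized.

```python
import numpy as np


def log_sigmoid(z):
    """Numerically stable log of the logistic sigmoid."""
    return -np.logaddexp(0.0, -z)


def noise_distribution(counts, power=0.75):
    """Unigram counts raised to the 3/4th power, renormalized."""
    p = np.asarray(counts, dtype=float) ** power
    return p / p.sum()


def negative_sampling_loss(subword_vecs, target_vec, negative_vecs):
    """Negated objective for one training pair.

    subword_vecs:  (num_subwords, dim) input vectors x_s of the input word
    target_vec:    (dim,) output vector of the target word
    negative_vecs: (K, dim) output vectors of the K sampled noise words
    """
    hidden = subword_vecs.sum(axis=0)                  # Eq 2: sum of subwords
    positive = log_sigmoid(hidden @ target_vec)        # observed pair
    negative = log_sigmoid(-(negative_vecs @ hidden)).sum()  # noise pairs
    return -(positive + negative)


rng = np.random.default_rng(0)
dim, num_subwords, K = 8, 10, 5
probs = noise_distribution([100, 10, 1])
loss = negative_sampling_loss(
    rng.normal(size=(num_subwords, dim)),
    rng.normal(size=dim),
    rng.normal(size=(K, dim)),
)
print(probs, float(loss))
```

Raising counts to the 3/4th power flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.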


Evaluations

We evaluate the proposed word embeddings along two aspects: first, by assessing the useful, meaningful information embedded in the word vector representation, and second, by applying it to a realistic task of spoken language intent detection. Note that all the evaluations, analyses and databases used in this work are in the English language.

Analogy and similarity tasks

For evaluating the inherent semantic and syntactic knowledge of the word embeddings, we employ two tasks: (i) the semantic-syntactic analogy task, and (ii) the word similarity task.

Semantic-Syntactic analogy task.

The word analogy task, first proposed in [3], comprises word-pair analogy questions of the form W1 is to W2 as W3 is to W4. For example, “Boy” is to “Girl” as “Son” is to “Daughter”. The analogy is answered correctly if vec(W2) − vec(W1) + vec(W3) is most similar to vec(W4). The task comprises 19,544 analogy questions as originally compiled and released by [3]. We prune the analogy question database to match the training dataset vocabulary and to obtain a setup identical to that used in [10] for comparison purposes.
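The analogy check can be sketched as follows, assuming the usual word2vec offset convention, in which the query vec(W2) − vec(W1) + vec(W3) is compared to all other words by cosine similarity and the three question words are excluded from the candidates. The toy 2-D embedding and the function name `analogy_correct` are made up for illustration; the `top_k` argument allows either top-1 or top-2 scoring.

```python
import numpy as np

# Toy 2-D embedding, invented purely for illustration.
toy_embedding = {
    "boy": np.array([1.0, 0.0]),
    "girl": np.array([1.0, 1.0]),
    "son": np.array([2.0, 0.0]),
    "daughter": np.array([2.0, 1.0]),
    "sun": np.array([2.0, 0.1]),
    "table": np.array([-1.0, 0.5]),
}


def analogy_correct(emb, w1, w2, w3, w4, top_k=1):
    """True if w4 is among the top_k nearest words (by cosine similarity)
    to vec(w2) - vec(w1) + vec(w3), excluding the three question words."""
    words = list(emb)
    matrix = np.stack([emb[w] / np.linalg.norm(emb[w]) for w in words])
    query = emb[w2] - emb[w1] + emb[w3]
    query = query / np.linalg.norm(query)
    sims = matrix @ query
    for w in (w1, w2, w3):
        sims[words.index(w)] = -np.inf
    best = np.argsort(-sims)[:top_k]
    return w4 in {words[i] for i in best}


semantic_top1 = analogy_correct(toy_embedding, "boy", "girl", "son", "daughter")
acoustic_top1 = analogy_correct(toy_embedding, "boy", "girl", "son", "sun")
acoustic_top2 = analogy_correct(toy_embedding, "boy", "girl", "son", "sun", top_k=2)
print(semantic_top1, acoustic_top1, acoustic_top2)
```

In this toy space the homophone “sun” is the second-nearest candidate, so it is missed under top-1 scoring and counted under top-2.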

Word similarity task.

Another prominent approach in the NLP community [5, 21] for evaluating word vector representations is the word similarity task. We use the WordSim-353 database [22] consisting of 353 pairs of words manually annotated over a score of 1 to 10 depending on the magnitude of word similarity as perceived by humans. The task involves computing the rank-correlation (Spearman correlation) between the human annotated scores and the cosine similarity of the corresponding word vector pairs [21]. A high correlation indicates that the word vector representation captures the word similarity order similar to that perceived by humans.
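The similarity evaluation reduces to a rank correlation between two score lists. The sketch below implements Spearman correlation with average ranks for ties using only NumPy. The scores are made-up placeholders rather than WordSim-353 data, and a library routine such as `scipy.stats.spearmanr` could be used instead.

```python
import numpy as np


def average_ranks(values):
    """1-based ranks of `values`, with ties given their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return float(np.corrcoef(average_ranks(x), average_ranks(y))[0, 1])


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Placeholder human ratings and model cosine similarities for four pairs.
human_scores = [9.0, 7.5, 3.0, 1.0]
model_scores = [0.80, 0.85, 0.20, 0.10]
rho = spearman(human_scores, model_scores)
print(round(rho, 3))
```

Because only ranks matter, the model is not penalized for using a different scale than the annotators, only for ordering the pairs differently.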

For assessing the word acoustic ambiguity (similarity) information, we conduct the Acoustic analogy task, Semantic&syntactic–acoustic analogy task and Acoustic similarity tasks, all proposed in [10].

Acoustic analogy task.

The Acoustic analogy task comprises word pair analogies compiled using homophones which answer questions of the form: W1 sounds similar to W2 as W3 sounds similar to W4. For example, “Boy” sounds similar to “Buoy” as “Sun” sounds similar to “Son”. The task comprises 2,678 analogy questions and is designed to assess the ambiguity information embedded in the word vector space [10].

Semantic&Syntactic-acoustic analogy task.

The semantic&syntactic-acoustic analogy task is designed to assess semantic, syntactic and acoustic ambiguity information simultaneously. The analogies are formed by replacing certain words with their homophone alternatives in the original semantic and syntactic analogy task [10]. For example, “Boy” is to “Girl” as “Sun” is to “Daughter”. The task comprises 3,860 analogy questions. Examples of the analogies can be found in [10].

Acoustic word similarity task.

The acoustic word similarity task is analogous to the word similarity task. It contains 943 word pairs rated on their acoustic similarity based on the normalized phone edit distance. A value of 1.0 indicates that two words sound identical, and 0.0 indicates that the pair is acoustically dissimilar. The task involves computing the rank correlation (Spearman correlation) between the normalized phone edit distance scores and the cosine similarities of the corresponding word vector pairs.
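One plausible way to compute such a score is sketched below: a Levenshtein distance over phone sequences, normalized by the longer sequence and subtracted from 1. This normalization is an assumption, since the exact definition is given in [10], and the phone sequences are hand-written ARPAbet-style examples rather than lexicon lookups.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (x != y),  # substitution or match
            ))
        previous = current
    return previous[-1]


def acoustic_similarity(phones_a, phones_b):
    """1.0 for identical phone sequences, 0.0 for maximally different ones.
    Normalizing by the longer sequence is an assumed convention."""
    longest = max(len(phones_a), len(phones_b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(phones_a, phones_b) / longest


# Hand-written phone sequences, for illustration only.
see, sea = ["S", "IY"], ["S", "IY"]
want, wand = ["W", "AA", "N", "T"], ["W", "AA", "N", "D"]
print(acoustic_similarity(see, sea), acoustic_similarity(want, wand))
```

Under this convention homophones score 1.0, and words differing in a single phone out of four score 0.75.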

More details regarding the evaluation methodologies are available in [10]. The evaluation datasets are made available online. Note that for the evaluation of Confusion2vec models on the analogy tasks, we compute accuracy over the top-2 nearest vectors, i.e., we count an analogy as answered correctly if either of the top-2 nearest vectors satisfies the analogy. This is because (i) the 3 analogy tasks are not mutually exclusive, and (ii) the nearest vector query with Confusion2vec can be either along the contextual axis (semantics/syntax) or along the acoustic similarity axis. For the baseline models, including fastText and word2vec (W2V), we conduct the typical evaluation with the nearest vector (top-1), since they model only the contextual information. For the benefit of the reader, we provide the results of both top-1 and top-2 evaluations in S1 Appendix.

Spoken language intent classification

We also evaluate the efficacy of the proposed word representation models on the task of spoken language intent classification. A recurrent neural network (RNN) based classifier is employed, with its embedding layer initialized with the proposed word vectors. Classification experiments are conducted by training the recurrent neural network on (i) clean manual transcripts, and (ii) noisy ASR transcripts, with evaluations on both manual and ASR transcripts. Classification error rates on intent detection are used to assess the word vector representations.

Analogy & similarity tasks


Database

The Fisher English Training Part 1, Speech (LDC2004S13) and Fisher English Training Part 2, Speech (LDC2005S13) corpora [23] are used for training both the ASR and the Confusion2vec 2.0 embeddings. The choice of database follows [10] for direct comparison purposes. The corpus consists of spontaneous telephonic conversations between 11,972 native English speakers. The speech data amounts to approximately 1,915 hours sampled at 8 kHz. The corpus is divided into 3 parts for training (1,905 hours, 1,871,731 utterances), development (5 hours, 5,000 utterances) and test (5 hours, 5,000 utterances). Overall, the transcripts contain approximately 20.8 million word tokens with 42,150 unique words.

Experimental setup

The experimental setup is maintained identical to [10] for direct comparison. Brief detail of the setup is as follows:

Automatic speech recognition.

A hybrid HMM-DNN based acoustic model is trained on the train subset of the speech corpus using the KALDI speech recognition toolkit [24]. 40-dimensional mel frequency cepstral coefficient (MFCC) features are extracted along with i-vector features for training the acoustic model. The i-vector features provide speaker and channel characteristics to aid acoustic modeling. The DNN acoustic model comprises 7 layers with P-norm non-linearity (p = 2), each with 350 units [25]. The DNN is trained using 5 MFCC frame splices with left and right context of 2 to classify among 7,979 Gaussian mixtures with a stochastic gradient descent optimizer. The CMU pronunciation dictionary [26] is used as the word-pronunciation lexicon. A tri-gram language model is trained on the training subset of the Fisher English Speech Corpus. The ASR yields word error rates (WER) of 16.57% and 18.12% on the development and test datasets, respectively. Lattices are derived during ASR decoding with a decoding beam size of 11 and a lattice beam size of 6. The lattices are converted to confusion networks with the minimum Bayes risk criterion [27] for training the Confusion2vec embeddings. The resulting confusion networks have a vocabulary size of 41,274 and contain 69.5 million words, with an average of 3.34 alternative (ambiguous) words for each edge in the graph.

Confusion2Vec 2.0.

To train the embedding, the most frequent words are sub-sampled as suggested in [4], with the rejection threshold set to $10^{-4}$. A minimum frequency threshold of 5 is also set, and rarely occurring words are pruned from the vocabulary. The context window sizes for both the acoustic ambiguity and contextual dimensions are uniformly sampled between 1 and 5. The dimension of the word vectors is set to 300. The number of negative samples for negative sampling is chosen to be 64. The learning rate is set to 0.01, and the model is trained for a total of 15 epochs using stochastic gradient descent. All the hyperparameters are chosen empirically for optimal performance on the development set. We implemented Confusion2vec 2.0 by modifying the source code of fastText [9, 28]. We make our source code and trained models available online. The models were trained on CPU only with the following machine configuration: dual Intel Xeon CPU E5–2670 operating at 2.6 GHz with a 64-bit architecture. The machine had a total of 32 threads, i.e., each CPU comprises 8 cores with 2 threads per core. The machine was equipped with 128 GB of DDR3 memory, i.e., 8 x 8 GB of memory per CPU. With this machine configuration and experimental setup, training the intra-confusion model took approximately 46 minutes and training the inter-confusion model took approximately 3 hours and 24 minutes.
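The sub-sampling and window settings can be illustrated with a short sketch. It uses the discard probability from [4], 1 − sqrt(t / f), where f is a word's relative frequency and t is the rejection threshold. Actual implementations may use a slightly different variant, and the function names here are hypothetical.

```python
import math
import random


def discard_probability(relative_frequency, threshold=1e-4):
    """Probability of dropping a word occurrence during training,
    following the sub-sampling heuristic of [4]."""
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))


def sample_window(rng, max_window=5):
    """Context window size drawn uniformly from 1..max_window."""
    return rng.randint(1, max_window)


rng = random.Random(0)
frequent = discard_probability(0.01)   # word making up 1% of all tokens
rare = discard_probability(1e-5)       # word below the threshold
window = sample_window(rng)
print(round(frequent, 3), rare, window)
```

Words whose relative frequency is at or below the threshold are never discarded, while very frequent words are dropped most of the time. This shortens training and reduces the dominance of function words.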


Results

Table 1 lists the results in terms of accuracies for the analogy tasks and rank correlations for the similarity tasks. The first two rows correspond to results with the original word2vec. The Google W2V model is the open source model released by Google [29], trained on the 100-billion-word Google News dataset. The fastText model employed is the open source model released by Facebook [30], trained on Wikipedia dumps with a vocabulary size of more than 2.5 million words. We also train in-domain versions of the original word2vec and fastText on the Fisher English corpus manual transcriptions for fair comparison with the Confusion2vec models, referred to as “In-domain W2V” in Table 1. C2V-1 refers to the top-confusion scheme [10], which is roughly equivalent to training a skip-gram model on the ASR transcripts of the Fisher English corpus. The middle three rows of the table correspond to Confusion2vec embeddings without subword encoding and are taken directly from [10]. The bottom three rows correspond to the results obtained with subword encoding. Note that Confusion2vec 1.0 is initialized on the Google word2vec model for better convergence. The Confusion2vec 2.0 model is initialized on the fastText model to maintain compatibility with subword encodings. We normalize the vocabulary for all the experiments, meaning the same vocabulary is used to evaluate the analogy and similarity tasks to allow for fair comparisons.

Comparing the baseline word2vec and fastText embeddings to Confusion2vec, we observe that the baseline embeddings perform well on the semantic&syntactic analogy task and provide good positive correlation on the word similarity task, as expected. However, they perform poorly on the acoustic analogy task and the semantic&syntactic-acoustic analogy task, and give a small negative correlation on the acoustic similarity task. The in-domain word2vec model undergoes a significant dip in correlation on the word similarity task (0.6893 to 0.4417). Similarly, the in-domain version of fastText sees its correlation degrade from 0.7361 to 0.3584. We believe the limited data and restricted vocabulary of the in-domain versions are responsible for the degradation. We also note that the subword encoding in the fastText models is more susceptible to this degradation. All the Confusion2vec models perform relatively well on the semantic&syntactic analogy task and the word similarity task, but more importantly, they yield high accuracies on the acoustic analogy task and the semantic&syntactic-acoustic analogy task and provide high positive correlation on the acoustic similarity task.
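The two correlation drops quoted above can be put on a common relative scale with a one-line calculation. This is a worked arithmetic illustration using only the numbers stated in the text, and `relative_change` is a helper name introduced here.

```python
def relative_change(before, after):
    """Signed change of `after` with respect to `before`, as a fraction."""
    return (after - before) / before


w2v_drop = relative_change(0.6893, 0.4417)       # in-domain word2vec
fasttext_drop = relative_change(0.7361, 0.3584)  # in-domain fastText
print(f"{w2v_drop:.1%}", f"{fasttext_drop:.1%}")
```

In relative terms the word2vec correlation falls by roughly 36% and the fastText correlation by roughly 51%, which is consistent with the observation that the subword-based baseline is hit harder by the move to limited in-domain data.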

Comparisons between Confusion2vec 1.0 and Confusion2vec 2.0 on the analogy tasks reveal that the subword encoding enhances the acoustic ambiguity modeling. For the acoustic analogy task, we find a relative improvement of up to 46.41% over the non-subword counterpart. Moreover, even for the semantic&syntactic-acoustic analogy task, we observe improvements with subword encoding. However, we find a small reduction in performance on the original semantic and syntactic analogy task. One explanation is that the different analogy tasks are fairly mutually exclusive, i.e., getting one task right compromises performance on the other. The top-2 evaluations for Confusion2Vec provide a partial solution to this. Nevertheless, there can be instances where the embedding favors information along either the acoustic ambiguity or the contextual dimension. Thus, there exists a trade-off between the different proposed analogy-based evaluation tasks, and the goal is to optimize this trade-off as well as possible. One way to judge the trade-off is to look at the average accuracy across the analogy tasks. Despite the small dip in performance, the accuracies remain acceptable in comparison to the in-domain word2vec and fastText models. Overall, taking the average accuracy over all the analogy tasks, we obtain an increase of approximately 16.62% relative over the non-subword Confusion2vec models.

Investigating the results for the similarity tasks, we find a significant correlation of 0.81 on the acoustic similarity task with the subword encoding. However, a small degradation is again observed on the word similarity task, with a correlation of 0.2929 against the 0.3584 of the in-domain baseline fastText model. In contrast to Confusion2vec 2.0, Confusion2vec 1.0 is able to improve correlation on the word similarity task against its counterpart in-domain word2vec model (from 0.4417 to 0.5798). Our investigation into the possible causes of this lower correlation on the word similarity task reveals the following. First, the same set of word pairs is scored for both the word similarity and acoustic similarity tasks, so an increase in performance on one similarity task inevitably results in a slight compromise on the other. Second, we found that for Confusion2vec 2.0 the Pearson correlation was always higher than the Spearman correlation (see Table 8 in S1 Appendix). This likely suggests that with Confusion2vec 2.0 models, while the linearity is relatively stronger, especially at the tails of the distribution, the monotonicity is negatively impacted, particularly at and around the mean. We believe this is a fair compromise since we are more concerned with words that are more similar or more dissimilar to others and less concerned with neutral words. Finally, this is also supported by the results on the original analogy task, which is concerned with the most similar word and on which the models perform fairly well. Overall, we believe that subword modeling with Confusion2vec enhances the acoustic confusability modeling considerably, and that this causes slight disruptions in semantic modeling while preserving the important and relevant semantics of the language.

Model concatenation

Further, the Confusion2vec model can be concatenated with other word embedding models to produce a new word vector space that can yield better representations, as seen in [10]. Table 2 lists the results of the concatenated models. For the previous, non-subword version of Confusion2vec, the vector models are concatenated with the word2vec model trained on the ASR output transcripts (C2V-1). The choice of C2V-1 instead of the Google W2V for concatenation was based on empirical findings. To maintain compatibility with subword encoding, the Confusion2vec 2.0 models are instead concatenated with fastText models. Note that the Confusion2vec 1.0 C2V-1 is pre-trained on Google’s W2V model for fair comparison against the concatenation of Confusion2vec 2.0 models with fastText.
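As a rough illustration of what concatenation means at the vector level, the sketch below joins two embedding spaces word by word. The function name, the toy vectors and the choice to keep only the shared vocabulary are assumptions made for this example; the text above does not specify how vocabulary mismatches are handled.

```python
def concatenate_embeddings(space_a, space_b):
    """Join two embedding spaces by concatenating each word's two vectors.

    Assumption for this sketch: only words present in both spaces are kept.
    """
    shared_vocabulary = sorted(set(space_a) & set(space_b))
    return {
        word: list(space_a[word]) + list(space_b[word])
        for word in shared_vocabulary
    }


# Hypothetical 2-dimensional "semantic" vectors and 3-dimensional
# "confusion" vectors, used purely for illustration.
semantic_space = {"prince": [0.9, 0.1], "prints": [0.1, 0.8], "boy": [0.7, 0.2]}
confusion_space = {"prince": [0.5, 0.5, 0.0], "prints": [0.5, 0.4, 0.1]}

joint_space = concatenate_embeddings(semantic_space, confusion_space)
print(joint_space["prince"])  # 5-dimensional vector: semantic part, then confusion part
```

The dimensionality of the joint space is the sum of the two input dimensionalities, so each subspace keeps its own coordinates and a downstream model can weight them independently.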

First, comparing the non-concatenated versions in Table 1 with the concatenated versions in Table 2 for the non-subword models, we observe an improvement of approximately 7.22% relative in average analogy accuracy after concatenation. We do not observe a significant improvement in average analogy accuracy for the subword-based models after concatenation. However, we observe different dynamics between the acoustic ambiguity subspace and the semantic and syntactic subspaces. Concatenation results in improved semantic and syntactic evaluations at the expense of a degradation in accuracy on the acoustic analogy task. We also note improvements (9.27% relative) on the semantic&syntactic-acoustic analogy task after concatenation, confirming the meaningful co-existence of both ambiguity and semantic-syntactic relations. Moreover, concatenation also yields a better correlation on the word similarity task.

Next, comparing Confusion2vec 1.0 (non-subword) with the subword version, we observe significant improvements on the semantic&syntactic analogy task (7.51% relative) as well as on the semantic&syntactic-acoustic analogy task (21.78% relative). Moreover, the subword models outperform the non-subword version on both similarity tasks. The subword models slightly under-perform on the acoustic analogy task but, more crucially, significantly outperform the Google W2V and fastText baselines. Overall, these changes in dynamics between the acoustic and semantic/syntactic subspaces observed for the concatenated models can be attributed to the fact that we are optimizing a different criterion than in the non-concatenated versions.

Further, the concatenated models can be fine-tuned and optimized to exploit additional gains, as found in [10]. The row corresponding to Confusion2Vec 1.0 − C2V-1 + C2V-c (UJO) is the best result obtained in [10], which involves two training passes. Confusion2Vec 2.0 with subword modeling and single-pass training gives performance comparable to the two-pass approach. We therefore skip the two-pass approach for the subword model in favor of ease of training and reproducibility.

Embedding visualization

Fig 2 illustrates the word vector spaces of the fastText embeddings and the proposed C2V-a embeddings after dimension reduction using principal component analysis. The visualizations are generated using the scikit-learn and matplotlib Python packages. We observe meaningful interactions between the semantic&syntactic subspace and the acoustic ambiguity subspace. For example, in Fig 2, the vectors “boy”-“prince”, “see”-“seeing”, “read”-“write” and “uncle”-“aunt” are similar to the acoustically ambiguous vectors “boy”-“prints”, “sea”-“seeing”, “read”-“write” and “uncle”-“ant” respectively, which is not the case in Fig 2 with the fastText embeddings. Such vector relationships can be exploited in downstream spoken language applications by providing crucial acoustic ambiguity information to recover from speech recognition errors. Also note that acoustically ambiguous words such as “prinz”, “prince” and “prints” are found clustered together. Another important observation is that the word “prinz”, which is out-of-vocabulary in English, has an orphaned representation under fastText in Fig 2. However, “prinz” finds a meaningful representation on the basis of its acoustic signature in the proposed Confusion2vec model, as seen in Fig 2: “prinz” is clustered together with the acoustically similar words “prince” and “prints”, and the vector “boy”-“prinz” is similar to the vector “boy”-“prince”. Out-of-vocabulary words such as “prinz” are commonplace with end-to-end ASR systems that output characters and are therefore prone to such errors. Note that out-of-vocabulary words such as “prinz” cannot be represented by typical word embeddings such as word2vec and GloVe, which makes those embeddings sub-optimal for use with many end-to-end ASR systems.

Fig 2. 2-D plots of selected word vectors portraying semantic, syntactic and acoustic relationships after dimension reduction using PCA.

The blue lines indicate semantic relationships, blue ellipses indicate syntactic relationships, orange lines indicate acoustic-semantic/syntactic relations and orange ellipses indicate acoustic ambiguity word relations. Plots with identical word sets corresponding to Confusion2Vec 1.0 and Google W2V can be found in [11]. Please note that the out-of-vocabulary word “prinz” cannot be represented in Google W2V and Confusion2Vec 1.0 spaces.
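The offset comparison described for Fig 2 (for example, whether “boy”-“prince” points in nearly the same direction as “boy”-“prints”) can be checked numerically with cosine similarity between difference vectors. The sketch below uses hypothetical 3-dimensional toy vectors chosen only to illustrate the computation; they are not taken from any trained model.

```python
import math


def offset(space, source, target):
    """Difference vector pointing from `source` to `target`."""
    return [t - s for s, t in zip(space[source], space[target])]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical space in which acoustically confusable words lie close together.
confusion_aware = {
    "boy": [1.0, 0.0, 0.0],
    "prince": [0.0, 1.0, 0.1],
    "prints": [0.0, 1.0, -0.1],
}
# Hypothetical text-only space in which "prints" sits far from "prince".
text_only = {
    "boy": [1.0, 0.0, 0.0],
    "prince": [0.0, 1.0, 0.1],
    "prints": [0.2, -0.5, 1.0],
}

aware_similarity = cosine_similarity(
    offset(confusion_aware, "boy", "prince"),
    offset(confusion_aware, "boy", "prints"),
)
text_similarity = cosine_similarity(
    offset(text_only, "boy", "prince"),
    offset(text_only, "boy", "prints"),
)
print(f"confusion-aware: {aware_similarity:.3f}  text-only: {text_similarity:.3f}")
```

A value near 1 means the two relations are nearly parallel, which is the property a downstream model could use to treat a misrecognized word like its acoustically similar counterpart.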

Spoken language intent detection

In this section, we apply the proposed word vector embedding to the task of spoken language intent detection. Spoken language intent detection is the process of decoding the speaker’s intent in contexts involving voice commands, call routing and other human-computer interactions. Many spoken language technologies use an ASR system to convert the speech signal to text, a process prone to errors due to varying speakers and noisy environments. The erroneous ASR outputs in turn degrade the downstream intent classification. A few efforts have focused on handling ASR errors to make the subsequent intent detection more robust. These efforts often involve training the intent classification systems on noisy ASR transcripts. The downside of training intent classifiers on ASR transcripts is that the systems are limited by the amount of speech data available. Moreover, varying speech signal conditions and the use of different ASR models make such classifiers non-optimal and less practical. In many scenarios, speech data is not available to enable adaptation on ASR transcripts.

In our previous work [11], we applied the non-subword version of Confusion2vec to the task of spoken language intent detection. We demonstrated that Confusion2vec performs as well as popular word embeddings such as word2vec and GloVe on clean manual transcripts, giving comparable classification error rates. More importantly, we illustrated the robustness of the Confusion2vec embeddings when evaluated on noisy ASR transcripts, with relative improvements of up to 20%. We showed that the results also translate to models trained on noisy ASR transcripts.

In this paper, we incorporate the Confusion2vec 2.0 embeddings and exploit the enhanced ability of subword modeling to capture acoustic ambiguity, as verified by the previous evaluations in Section Analogy & Similarity Tasks. We believe the proposed model can further improve the robustness of spoken utterance classification, and we thereby aim to eliminate the need for re-training the classifiers on ASR outputs.


We conduct experiments on the Airline Travel Information Systems (ATIS) benchmark dataset [31]. The dataset consists of humans making flight-related inquiries in the English language with an automated answering machine with audio recorded and its transcripts manually annotated. ATIS consists of 18 intent categories. The dataset is divided into train (4478 samples), development (500 samples) and test (893 samples) consistent with previous works [11, 32, 33]. For ASR evaluations, the audio recordings are down-sampled from 16kHz to 8kHz and then decoded using the ASR setup described in section Automatic speech recognition using the audio mappings provided in The ASR achieves a WER of 18.54% on the ATIS test set.

Experimental setup

For intent classification we adopt a simple RNN architecture identical to [11]. This allows for direct comparison. The architecture of the neural network is intentionally kept simple in order to assess the efficacy of the proposed embedding word features. The classifier is comprised of an embedding layer followed by a single layer of bi-directional recurrent neural network (RNN) with long short-term memory (LSTM) units. This is followed by a linear dense layer and softmax function. The softmax outputs a probability distribution across all the intent categories. The embedding layer is fixed throughout the training. However, in the case of the randomly initialized embeddings, the embedding is estimated on the in-domain data used for intent detection.

The intent classification models are trained on the 4478 samples of the training subset and the hyperparameters are tuned on the development set. We choose the set of hyperparameters yielding the best results on the development set and then apply it to the unseen held-out test subset. The results are reported on both the clean manual transcripts and the ASR transcripts. For training we treat each utterance as a single sample (batch size = 1). The hyperparameter space we explore is as follows: the hidden dimension size of the LSTM is tuned over {32, 64, 128, 256}, the learning rate over {0.0005, 0.001}, and the dropout over {0.1, 0.15, 0.2, 0.25}. The models are optimized with the Adam optimizer for a total of 50 epochs. We employ early stopping when the loss on the development set does not improve for 5 consecutive epochs.
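The search space and stopping rule described above can be sketched as follows. This is an illustrative outline only: it assumes the three value sets are combined exhaustively (the text does not say whether every combination was tried), and the function and constant names are hypothetical. The training loop itself is omitted; `epochs_before_stopping` simply replays a list of development losses.

```python
from itertools import product

HIDDEN_SIZES = [32, 64, 128, 256]
LEARNING_RATES = [0.0005, 0.001]
DROPOUTS = [0.1, 0.15, 0.2, 0.25]
MAX_EPOCHS = 50
PATIENCE = 5  # consecutive epochs without improvement on the development set


def hyperparameter_grid():
    """Enumerate every combination of the three tuned hyperparameters."""
    return [
        {"hidden_size": hidden, "learning_rate": rate, "dropout": dropout}
        for hidden, rate, dropout in product(HIDDEN_SIZES, LEARNING_RATES, DROPOUTS)
    ]


def epochs_before_stopping(dev_losses, patience=PATIENCE, max_epochs=MAX_EPOCHS):
    """Return how many epochs run before early stopping or the epoch cap."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(dev_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(dev_losses), max_epochs)


grid = hyperparameter_grid()
print(len(grid), "configurations")

# Hypothetical loss curve: three improving epochs followed by a plateau.
plateau_losses = [1.0, 0.9, 0.8] + [0.85] * 10
print(epochs_before_stopping(plateau_losses))
```

Under the exhaustive-grid assumption there are 4 × 2 × 4 = 32 configurations, each trained for at most 50 epochs and selected by development-set performance.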


We include results from several baseline systems for providing comparisons of Confusion2Vec 2.0 with the popular context-free word embeddings, contextual embeddings, established NLU systems and the current state-of-the-art.

  1. Context-Free Embeddings: GloVe [5, 34], skip-gram word2vec [4, 29] and fastText [9, 30] word representations are employed. They are referred to as context-free embeddings since the word representations are static irrespective of the context.
  2. ELMo: [6] proposed deep contextualized word representation based on character based deep bidirectional language model trained on large text corpus. ELMo effectively models syntax and semantics of the language along varying linguistic contexts. Unlike context-free embeddings, ELMo embeddings have varying representations for each word depending on the word’s context. We employ the original model trained on 1 Billion Word Benchmark with 93.6 million parameters [35]. For intent-classification we add a single bi-directional LSTM layer with attention. We experiment with two versions of the model, one with intent classification only and the other with multi-task joint intent and slot predictions.
  3. BERT: [8] introduced BERT—bidirectional contextual word representations based on self attention mechanism of Transformer models. BERT models make use of masked language modeling and next sentence prediction to model language. Similar to ELMo, the word embeddings are contextual, i.e., varying according to the context. We employ “bert-base-uncased” model [36] with 12 layers of 768 dimensions each trained on BookCorpus and English Wikipedia corpus. For intent-classification we add a single bi-directional LSTM layer with attention. We experiment with two versions of the model, one with intent classification only and the other with multi-task joint intent and slot predictions.
  4. Joint SLU-LM: [37] employed joint modeling of the next word prediction along with intent and slot labeling. The unidirectional RNN model updates intent states for each word input and uses it as context for slot labeling and language modeling.
  5. Attn. RNN Joint SLU: [38] proposed attention based encoder-decoder bidirectional RNN model in a multi-task model for joint intent and slot-filling tasks. A weighted average of the bidirectional LSTM hidden states of the encoder network provides information from parts of the input word sequence which is used together with time aligned encoder hidden state for the decoder to predict the slot labels and intent.
  6. Slot-Gated Attn: [33] introduced a slot-gated mechanism which introduces additional gate to improve slot and intent prediction performance by leveraging intent context vector for slot filling task.
  7. Self Attn. SLU: [39] proposed self-attention model with gate mechanism for joint learning of intent classification and slot filling by utilizing the semantic correlation between slots and intents. The model estimates embeddings augmented with intent information using self attention mechanism which is utilized as a gate for slot filling task.
  8. Joint BERT: [40] proposed to use BERT embeddings for joint modeling of intent and slot-filling. The pre-trained BERT embeddings are fine-tuned for (i) sentence prediction task—intent detection, and (ii) sequence prediction task—slot filling. The Joint BERT model lacks the bi-directional LSTM layer in comparison to the earlier baseline BERT based model.
  9. SF-ID Network: [41] introduced a bi-directional interrelated model for joint modeling of intent detection and slot-filling. An iteration mechanism is proposed where the SF subnet introduces the intent information to slot-filling task while the ID-subnet applies the slot information to intent detection task. For the task of slot-filling, a conditional random field layer is used to derive the final output.
  10. ASR Robust ELMo: [17] proposed ASR robust contextualized embeddings for intent detection. ELMo embeddings are fine-tuned with a novel loss function which minimizes the cosine distance between the acoustically confused words found in ASR confusion networks. Two techniques based on supervised and unsupervised extraction of word confusions are explored. The fine-tuned contextualized embeddings are then utilized for spoken language intent detection.


In this section, we conduct experiments by training models on (i) clean human annotations and (ii) noisy ASR transcriptions.

Training on clean transcripts.

Table 3 lists the results of intent detection in terms of classification error rates (CER). The “Reference” column corresponds to results on the human-transcribed ATIS audio and the “ASR” column corresponds to evaluations on the noisy speech recognition transcripts. Firstly, evaluating on the clean Reference transcripts, we observe that Confusion2vec 2.0 with subword encoding achieves the third best performance. The best-performing Confusion2vec 2.0 model achieves a CER of 1.79%. Among the different versions of the proposed subword-based Confusion2vec, we find that the concatenated versions are better. We believe this is because the concatenated models exhibit better semantic and syntactic relations (see Tables 1 and 2) than the non-concatenated ones. Among the baseline models, contextual embeddings such as BERT and ELMo give the best CER. Note that the proposed Confusion2vec embeddings are context-free and are able to outperform other context-free embedding models such as GloVe, word2vec and fastText.

Table 3. Intent Classification Error Rates (CER): Trained on clean reference transcripts, evaluated on clean reference and noisy ASR transcripts.

Secondly, evaluating the performance on the noisy ASR transcripts, we find that all the subword-based Confusion2vec 2.0 models outperform the popular word vector embeddings by a large margin. The subword Confusion2vec gives an improvement of approximately 45.78% relative to the best-performing context-free word embeddings. The proposed embeddings also improve over the contextual embeddings, including BERT and ELMo (relative improvements of 29.06%). Moreover, the results are also an improvement over the non-subword Confusion2vec word vectors (31.50% improvement). Comparisons between the different versions of the proposed Confusion2vec show that the intra-confusion configuration yields the lowest CER. The best results with the proposed model outperform the state-of-the-art (ASR Robust ELMo [17]), reducing the CER by 13.12% relative. Inspecting the degradation, Δdiff (the drop in performance between the clean and ASR evaluations), we find that all the Confusion2vec 2.0 models with subword information undergo low degradation while giving the best CER, thereby re-affirming their robustness to noise in the transcripts. This confirms that the subword encoding is better able to represent the acoustic ambiguities present in human spoken language.
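For readers who want to reproduce the arithmetic behind these comparisons, the quantities involved can be written as three small functions. This is a sketch under assumptions: the function names are hypothetical, Δdiff is taken here to be the absolute difference in percentage points between the ASR and Reference CERs, and the error count of 16 is only the count that is consistent with the reported 1.79% on the 893 test samples, not a figure stated in the text. The other numbers are illustrative.

```python
def classification_error_rate(num_errors, num_samples):
    """CER in percent: share of utterances assigned the wrong intent."""
    return 100.0 * num_errors / num_samples


def relative_reduction(baseline_cer, proposed_cer):
    """Relative CER reduction in percent with respect to a baseline."""
    return 100.0 * (baseline_cer - proposed_cer) / baseline_cer


def degradation(reference_cer, asr_cer):
    """Delta-diff: CER increase (in points) from clean to ASR transcripts."""
    return asr_cer - reference_cer


TEST_SET_SIZE = 893  # ATIS test subset size given in the dataset description

# 16 misclassified utterances out of 893 rounds to the reported 1.79% CER.
best_reference_cer = classification_error_rate(16, TEST_SET_SIZE)
print(f"{best_reference_cer:.2f}%")

# Illustrative values only: a baseline at 5.0% CER and a system at 4.0% CER.
print(f"{relative_reduction(5.0, 4.0):.1f}% relative reduction")
print(f"{degradation(2.0, 6.5):.1f} points of degradation")
```

Because relative reductions are computed against the baseline's CER, the same absolute gain looks larger when the baseline error is already small, which is worth keeping in mind when comparing the percentages quoted for different baselines.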

Further analyzing the results, Table 4 lists a few examples from the domain of intent detection, comparing the baseline fastText embedding with the proposed concatenated version of the inter-confusion model. In the first example, the ASR incorrectly recognizes “seating” as “feeding”, which leads to an error in intent classification, i.e., the intent is detected as “Meal” instead of “Flight Capacity”. However, Confusion2Vec is able to recognize the ambiguity through a better vector representation of the acoustic confusions between the two unvoiced fricatives /f/ and /s/ and between the consonants /d/ and /t/, phenomena that are well documented [42, 43], and this eventually leads to better classification. The second example is a classic instance of homophones (fare and fair) with similar implications. In the third example, both embeddings fail to recover from the error. Finally, the fourth example is a manifestation of a more complex error spanning multiple words or phrases. The proposed Confusion2Vec is able to reconcile the acoustic ambiguity information across multiple words and successfully recognize the correct underlying intent.

Table 4. Examples of intent detection: Trained on clean reference text, evaluated on ASR transcripts.

Training on noisy ASR transcripts.

Table 5 presents the results obtained by training and evaluating models on the ASR transcripts. Here we omit all the joint intent and slot-filling baseline models, since training on ASR transcripts requires a set of slot labels re-aligned to account for insertion, substitution and deletion errors, which is outside the scope of this study. We note that the Confusion2vec models give significantly lower CER. The subword-based Confusion2vec models also provide improvements over the non-subword Confusion2vec model (21.28% improvement). Comparing the results in Tables 3 and 5, we would like to highlight that the subword Confusion2vec model gives a minimum CER of 4.37% when trained on clean transcripts, which is much better than the CER obtained by popular word embeddings such as word2vec, GloVe and fastText even when they are trained on the ASR transcripts (15.15% better, relatively). These results demonstrate that the subword Confusion2vec models can eliminate the need for re-training the intent classification model on ASR transcripts for robust performance.

Table 5. Intent Classification Error Rates (CER): Trained and evaluated on noisy ASR transcripts.


In this paper, we proposed the use of subword encoding for modeling acoustic ambiguity information, augmenting word vector representations that also capture the semantics and syntax of the language. Each word in the language is represented as a sum of its constituent character n-gram subwords. The advantages of the subword encoding are confirmed by evaluating the proposed models on various word analogy and word similarity tasks designed to assess the acoustic ambiguity, semantic and syntactic knowledge inherent in the models. Finally, the proposed subword models are applied to the task of spoken language intent detection. The results of the intent classification system suggest that the proposed subword Confusion2vec models greatly enhance classification performance when evaluated on noisy ASR transcripts. The results highlight that the subword Confusion2vec models are robust and domain-independent and do not require re-training of the classifier on ASR transcripts.
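The representation of a word as a sum of its character n-gram vectors can be sketched as follows. This is a minimal illustration in the style of fastText [9], under stated assumptions: the word is wrapped in the boundary markers `<` and `>`, the n-gram lengths range from 3 to 6 (the fastText defaults, not values confirmed by this paper), and n-gram vectors are looked up in a plain dictionary rather than a hashed table. The toy vectors are hypothetical.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in boundary markers."""
    token = "<" + word + ">"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(token) - n + 1):
            grams.append(token[start:start + n])
    return grams


def word_vector(word, ngram_vectors, dim):
    """Sum the vectors of all known n-grams of `word`."""
    total = [0.0] * dim
    for gram in char_ngrams(word):
        vector = ngram_vectors.get(gram)
        if vector is not None:
            total = [t + v for t, v in zip(total, vector)]
    return total


# "prinz" was never seen as a whole word, but it shares n-grams with "prince".
shared_ngrams = set(char_ngrams("prince")) & set(char_ngrams("prinz"))
print(sorted(shared_ngrams))

# Hypothetical 2-dimensional n-gram vectors for the word "sea".
toy_ngram_vectors = {"<se": [1.0, 0.0], "sea": [0.0, 1.0], "ea>": [1.0, 1.0]}
sea_vector = word_vector("sea", toy_ngram_vectors, dim=2)
print(sea_vector)
```

Because an unseen word still decomposes into n-grams that were observed during training, it receives a non-trivial vector, which is what allows out-of-vocabulary ASR outputs such as “prinz” to be placed near acoustically similar words.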

Further, the following advantages highlight the prospects of the proposed Confusion2Vec embedding for application in a wide range of conditions: (i) the proposed Confusion2Vec embeddings provide feasible representations, both acoustically and semantically, for unseen and out-of-vocabulary words, (ii) the embedding is able to model domain-independent representations in an unsupervised manner that capture the acoustic signatures of words in conjunction with semantic information, which makes it flexible and feasible to train on easily available domain-independent speech data, and (iii) the domain-independent nature of Confusion2Vec enables cross-lingual modeling and transfer learning techniques [45, 46] for capturing ambiguity information in low-resource languages.

The proposed Confusion2Vec word embedding can benefit a range of applications involving speech (spoken language) in which acoustic ambiguity is inherent, for example ASR, error correction systems, spoken language understanding, speech translation and text-to-speech systems. Moreover, the ambiguity need not be limited to acoustics. Inherent ambiguities are present in various other settings depending on the nature of the underlying signals, for example the pictorial ambiguities associated with applications such as optical character recognition or image/video scene summarization. Some applications may also involve multiple ambiguity dimensions: in speech translation, in addition to acoustic ambiguity, there can be ambiguity associated with source and target language morphology, segmentation and linguistic expressions such as paraphrasing. More applications are discussed in detail in [10].

In the future, we plan to model ambiguity information using deep contextual modeling techniques such as BERT. We believe bidirectional information modeling with attention can further enhance ambiguity modeling. On the application side, we plan to implement and assess the effect of using Confusion2vec models for a wide range of natural language understanding and processing applications such as speech translation, dialogue state tracking etc. We also plan to understand the factors that affect the quality of the proposed embeddings by conducting further analysis of the effects of ASR performance (WER), decoding beam size, characteristics of underlying speech signal environments including type of noise, amount of noise, channel effects, transferability over different ASR systems etc. The performance implications of these factors to the end-task are also of interest.


  1. Bengio Y, Ducharme R, Vincent P, Jauvin C. A neural probabilistic language model. Journal of machine learning research. 2003;3(Feb):1137–1155.
  2. Mikolov T, Karafiát M, Burget L, Černockỳ J, Khudanpur S. Recurrent neural network based language model. In: Eleventh Annual Conference of the International Speech Communication Association; 2010.
  3. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
  4. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems; 2013. p. 3111–3119.
  5. Pennington J, Socher R, Manning CD. GloVe: Global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP); 2014. p. 1532–1543.
  6. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, et al. Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 2018.
  7. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. In: Advances in neural information processing systems; 2017. p. 5998–6008.
  8. Devlin J, Chang MW, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
  9. Bojanowski P, Grave E, Joulin A, Mikolov T. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics. 2017;5:135–146.
  10. Shivakumar PG, Georgiou P. Confusion2Vec: Towards enriching vector space word representations with representational ambiguities. PeerJ Computer Science. 2019;5:e195.
  11. Shivakumar PG, Yang M, Georgiou P. Spoken language intent detection using Confusion2Vec. In: Interspeech 2019. p. 819–823.
  12. Tai KS, Socher R, Manning CD. Improved semantic representations from tree-structured long short-term memory networks. arXiv preprint arXiv:1503.00075. 2015.
  13. Ladhak F, Gandhe A, Dreyer M, Mathias L, Rastrow A, Hoffmeister B. LatticeRNN: Recurrent Neural Networks Over Lattices. In: Interspeech; 2016. p. 695–699.
  14. Tan Z, Su J, Wang B, Chen Y, Shi X. Lattice-to-sequence attentional Neural Machine Translation models. Neurocomputing. 2018;284:138–147.
  15. Xiao F, Li J, Zhao H, Wang R, Chen K. Lattice-based transformer encoder for neural machine translation. arXiv preprint arXiv:1906.01282. 2019.
  16. Sperber M, Neubig G, Pham NQ, Waibel A. Self-attentional models for lattice inputs. arXiv preprint arXiv:1906.01617. 2019.
  17. Huang CW, Chen YN. Learning ASR-Robust Contextualized Embeddings for Spoken Language Understanding. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2020. p. 8009–8013.
  18. Bazzi I. Modelling Out-of-Vocabulary Words for Robust Speech Recognition. In: Doctoral Dissertation, Massachusetts Institute of Technology, USA; 2002.
  19. Aydelott J, Bates E. Effects of acoustic distortion and semantic context on lexical access. Language and cognitive processes. 2004;19(1):29–56.
  20. Gutmann MU, Hyvärinen A. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. The journal of machine learning research. 2012;13(1):307–361.
  21. Schnabel T, Labutov I, Mimno D, Joachims T. Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 conference on empirical methods in natural language processing; 2015. p. 298–307.
  22. Finkelstein L, Gabrilovich E, Matias Y, Rivlin E, Solan Z, Wolfman G, et al. Placing search in context: The concept revisited. In: Proceedings of the 10th international conference on World Wide Web; 2001. p. 406–414.
  23. Cieri C, Miller D, Walker K. The Fisher Corpus: a Resource for the Next Generations of Speech-to-Text. In: LREC. vol. 4; 2004. p. 69–71.
  24. Povey D, Ghoshal A, Boulianne G, Burget L, Glembek O, Goel N, et al. The Kaldi speech recognition toolkit. In: IEEE 2011 workshop on automatic speech recognition and understanding. IEEE Signal Processing Society; 2011.
  25. Zhang X, Trmal J, Povey D, Khudanpur S. Improving deep neural network acoustic models using generalized maxout networks. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE; 2014. p. 215–219.
  26. Weide R. The CMU pronunciation dictionary, release 0.6; 1998.
  27. Xu H, Povey D, Mangu L, Zhu J. Minimum bayes risk decoding and system combination based on a recursion for edit distance. Computer Speech & Language. 2011;25(4):802–828.
  28. fastText: Library for fast text representation and classification. GitHub:
  29. word2vec: Tool for computing continuous distributed representations of words. Code:
  30. Wiki Word Vectors - fastText
  31. Hemphill CT, Godfrey JJ, Doddington GR. The ATIS spoken language systems pilot corpus. In: Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990; 1990.
  32. Hakkani-Tür D, Tür G, Celikyilmaz A, Chen YN, Gao J, Deng L, et al. Multi-Domain Joint Semantic Frame Parsing Using Bi-Directional RNN-LSTM. In: Interspeech; 2016. p. 715–719.
  33. Goo CW, Gao G, Hsu YK, Huo CL, Chen TC, Hsu KW, et al. Slot-gated modeling for joint slot filling and intent prediction. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). vol. 2; 2018. p. 753–757.
  34. Pennington J, Socher R, Manning CD. GloVe: Global Vectors for Word Representation
  35. AllenNLP - ELMo
  36. BERT: Pre-training of deep bidirectional transformers for language understanding. GitHub:
  37. Liu B, Lane I. Joint online spoken language understanding and language modeling with recurrent neural networks. arXiv preprint arXiv:1609.01462. 2016.
  38. Liu B, Lane I. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. In: Interspeech 2016; 2016. p. 685–689.
  39. Li C, Li L, Qi J. A Self-Attentive Model with Gate Mechanism for Spoken Language Understanding. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; 2018. p. 3824–3833.
  40. Chen Q, Zhuo Z, Wang W. BERT for joint intent classification and slot filling. arXiv preprint arXiv:1902.10909. 2019.
  41. Haihong E, Niu P, Chen Z, Song M. A Novel Bi-directional Interrelated Model for Joint Intent Detection and Slot Filling. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics; 2019. p. 5467–5471.
  42. Phatak SA, Allen JB. Consonant and vowel confusions in speech-weighted noise. The Journal of the Acoustical Society of America. 2007;121(4):2312–2326. pmid:17471744
  43. Kong YY, Mullangi A, Kokkinakis K. Classification of fricative consonants for speech enhancement in hearing devices. PloS one. 2014;9(4):e95001. pmid:24747721
  44. Schumann R, Angkititrakul P. Incorporating ASR Errors with Attention-based, Jointly Trained RNN for Intent Detection and Slot Filling. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE; 2018. p. 6059–6063.
  45. Adams O, Makarucha A, Neubig G, Bird S, Cohn T. Cross-lingual word embeddings for low-resource language modeling. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers; 2017. p. 937–947.
  46. Zoph B, Yuret D, May J, Knight K. Transfer Learning for Low-Resource Neural Machine Translation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing; 2016. p. 1568–1575.