Improved model adaptation approach for recognition of reduced-frame-rate continuous speech

In distributed speech recognition applications, the front-end device (any handheld electronic device, such as a smartphone or personal digital assistant (PDA)) captures the speech signal, extracts the speech features, and then sends the speech-feature vector sequence to the back-end server for decoding. Since the front-end mobile device has limited computation capacity, battery power, and bandwidth, a feasible strategy is to reduce the frame rate of the speech-feature vector sequence. Previously, we proposed a method for adjusting the transition probabilities of the hidden Markov model to address the degradation of recognition accuracy caused by the frame-rate mismatch between the input and the original model. That method is referred to as the adapting-then-connecting approach: it adapts each model individually and then connects the adapted models to form a word network for speech recognition. We have found that this model adaptation approach introduces transitions that skip too many states and increases the number of insertion errors. In this study, we propose an improved model adaptation approach, denoted as the connecting-then-adapting approach, that first connects the individual models to form a word network and then adapts the connected network for speech recognition. This new approach calculates the transition matrix of a connected model, adapts that transition matrix according to the frame rate, and then creates a transition arc for each transition probability. The new approach can better align the speech feature sequence with the states in the word network and therefore reduces the number of insertion errors. We conducted experiments to investigate the effectiveness of our new approach and analyzed the results with respect to insertion, deletion, and substitution errors. The experimental results indicate that the proposed new method obtains a better recognition rate than the old method.


I. Introduction
In the era of instant wireless communication with intelligent client mobile devices, the human machine interface (HMI) based on automatic speech recognition (ASR) is becoming increasingly important. A client mobile device has the following properties: wireless access to the internet, a battery that provides limited power for a couple of days at most, a size small enough to be carried in one hand, and a touch-screen interface for entering and displaying information. In speech recognition, a hidden Markov model (HMM) speech recognizer requires intensive computation to carry out the Viterbi decoding process. In contrast, speech feature extraction requires much less computation. Since the computation capacity of a mobile device is limited, low-complexity speech-feature extraction can be carried out by the client mobile device and high-complexity speech decoding can be performed by a powerful back-end cloud server. This attractive client-server architecture is referred to as distributed speech recognition (DSR) [1][2][3][4]. Since under certain operating conditions the computational capacity, battery power, and transmission bandwidth of mobile devices may be very limited, there is a demand for the DSR system to work at a reduced frame rate. The apparent advantages of reducing the frame rate are not only less computation in the front-end device but also a lower computation cost in the back-end server, which allows the server to serve more client users simultaneously at the same level of recognition accuracy when adapted models are applied [5].
The model parameters of a speech recognition server are typically trained from full frame rate (FFR) observation data. FFR observation data provide a sequence of speech feature vectors produced by a front-end algorithm. The front-end algorithm consists of both framing and feature extraction processes. The former splits the speech samples into frames of constant length to simplify block-wise processing of the speech signal, and the latter calculates a compact parametric spectrum representation of the speech features that are most relevant for speech recognition. If the model parameters are directly applied to the recognition of reduced frame rate (RFR) speech, the performance will degrade significantly because the frame rates of the front-end feature sequence and the back-end model are mismatched. In our previous study [5], the experimental results show that, using models trained from FFR clean data, the word accuracies for FFR and half frame rate (HFR) clean data are 99.28% and 82.96%, respectively. This is a significant performance degradation. In the past, several approaches have been proposed to compensate for the performance degradation caused by the frame-rate mismatch. Tan et al. [6] reconstructed the FFR data sequence by repeating each frame in the HFR observation sequence, and then used the original FFR hidden Markov models (HMMs) to decode the reconstructed FFR feature sequence. The authors also suggested a multi-frame rate adaptation method, which allows the system to switch between the HFR and FFR. Linear and non-linear interpolation methods are also popular traditional approaches for reconstructing missing frames [7]. Instead of reconstructing an FFR feature sequence from the RFR one, we proposed a model adaptation method that adapts the state-transition probabilities of the FFR HMM to match the RFR of the input feature vector sequence [5,8].
In that adaptation method, each model is individually adapted and then connected to form a word network for speech decoding (i.e., automatic speech recognition, ASR). However, this model adaptation approach creates transitions that skip too many states and increase the number of insertion errors. Subsequently, we proposed an improved model adaptation method that first connects the individual models to form a word network and then adapts the connected network for speech recognition, the concept of which we briefly outlined in a letter [9]. In this paper, we present our detailed formulation, implementation, experiments, and analysis of experimental results for this new model adaptation approach.
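The frame-rate reduction and the frame-repetition reconstruction mentioned above can be illustrated with a minimal sketch. It assumes that the RFR sequence keeps every D-th frame of the FFR sequence and that repetition simply duplicates each retained frame D times; the function names `reduce_frame_rate` and `repeat_frames` and the toy feature values are hypothetical and are not taken from the cited methods.

```python
import numpy as np

def reduce_frame_rate(features, D):
    """Keep frames D, 2D, 3D, ... (1-based) of a (T, dim) feature array."""
    if D < 1:
        raise ValueError("D must be a positive integer")
    return features[D - 1::D]

def repeat_frames(rfr_features, D):
    """Rebuild a full-rate-length sequence by repeating each retained frame D times."""
    return np.repeat(rfr_features, D, axis=0)

# Toy FFR sequence of six 2-dimensional frames; the first column is the frame index.
frame_index = np.arange(1, 7, dtype=float)
ffr = np.column_stack([frame_index, 10.0 * frame_index])

hfr = reduce_frame_rate(ffr, 2)   # half frame rate: frames 2, 4, 6
rebuilt = repeat_frames(hfr, 2)   # each retained frame appears twice
print(hfr[:, 0], rebuilt[:, 0])
```

The model adaptation approach described in this paper avoids the second step: the decoder consumes the reduced sequence directly, so the server processes fewer frames.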

Adaptation of hidden Markov models for isolated word speech at reduced frame rate
Most speech recognition systems are classified as either isolated or continuous. Isolated word recognition demands a short pause after each spoken word, whereas continuous speech recognition does not. Nowadays, the hidden Markov model [10] is one of the most popular and successful speech recognizers. A hidden Markov model consists of a finite set of states. Transitions among the states and observations generated in emitting states are governed by two sets of probabilities called state-transition probability distributions and observation symbol probability distributions, respectively. The state is not directly visible, but the observation dependent on the state is visible to an external observer. In the following explanation, the matrix whose elements are the probabilities of moving from one state to another in a single step is called the transition matrix.
Assume that $o_1, o_2, \ldots, o_T$ is the FFR observation sequence of an isolated word to be recognized and $o_D, o_{2D}, \ldots, o_{KD}$ is an RFR subsequence with a reduction factor of $D$. Typically, a speech recognizer calculates the likelihood score for each model to generate the word to be recognized. In an HMM, we let $\{0, 1, 2, \ldots, N+1\}$ denote the indices of the model states, in which states $0$ and $N+1$ are the two special non-emitting starting and ending null states and states $1$ through $N$ are the emitting states. An emitting state produces a random output each time it is visited. The starting and ending non-emitting null states represent the boundaries before the first and after the last observations, respectively. The set of parameters of an HMM includes the state-transition probabilities and state-output probability distributions. We use $Q_t$ to denote the state of the system at time $t$, $a_{ij}$ to denote the transition probability from state $i$ to state $j$, and $b_j(o)$ to denote the probability density function for state $j$ to produce the observation $o$. To calculate the probability that the FFR observation sequence is generated by a model $\lambda$, we can define a forward probability density function for the model to be in state $i$ at time $t$ and produce observations up to time $t$:
$$\alpha_t(i) = p(o_1, o_2, \ldots, o_t, Q_t = i \mid \lambda). \tag{1}$$
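This forward function, written $\alpha_t(i)$, can be calculated recursively as follows:
$$\alpha_t(j) = \left[ \sum_{i=0}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), \quad 1 \le j \le N,\ 1 \le t \le T, \tag{2}$$
with $\alpha_0(0) = 1$, $\alpha_0(i) = 0$ for $i \ne 0$, and $\alpha_t(0) = 0$ for $t \ge 1$. We can calculate the probability of the FFR sequence as follows:
$$p(o_1, o_2, \ldots, o_T \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)\, a_{i,N+1}. \tag{3}$$
To calculate the probability density function for the model to produce the RFR subsequence, we can define the following forward probability density function:
$$\alpha^{(D)}_k(i) = p(o_D, o_{2D}, \ldots, o_{kD}, Q_{kD} = i \mid \lambda). \tag{4}$$
This forward function can be calculated as follows:
$$\alpha^{(D)}_k(j) = \left[ \sum_{i=0}^{N} \alpha^{(D)}_{k-1}(i)\, a^{(D)}_{i,j} \right] b_j(o_{kD}), \quad 1 \le j \le N,\ 1 \le k \le K, \tag{5}$$
where
$$a^{(D)}_{i,j} = \sum_{k_1=0}^{N+1} \sum_{k_2=0}^{N+1} \cdots \sum_{k_{D-1}=0}^{N+1} a_{i,k_1}\, a_{k_1,k_2} \cdots a_{k_{D-1},j} \tag{6}$$
is the transition probability from state $i$ to state $j$ for one step of the RFR sequence's observation period (which spans $D$ FFR observation periods). Comparing Eqs (5) and (2), we can see that the forward function for the RFR sequence is computed using an equivalent HMM with the adapted transition probabilities given by Eq (6) and the unchanged state-output distributions. The RFR adaptation of the transition matrix of an HMM creates transitions that skip more states than those of the original model. For example, suppose that the original HMM topology is a six-state left-to-right HMM, including the starting and ending null states, without any skip transitions, as shown in Fig 1A. The corresponding adapted HMMs for the HFR (D = 2) and one-third frame rate (D = 3) are quite different from the original HMM topology. A larger frame reduction factor will create more skip transitions over a greater length, as shown in Fig 1B and 1C.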
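The adaptation of an HMM can also be illustrated using the power of its transition probability matrix. Let the following matrix, with states indexed 1 through 6:
$$A = \begin{bmatrix} 0 & a_{12} & 0 & 0 & 0 & 0 \\ 0 & a_{22} & a_{23} & 0 & 0 & 0 \\ 0 & 0 & a_{33} & a_{34} & 0 & 0 \\ 0 & 0 & 0 & a_{44} & a_{45} & 0 \\ 0 & 0 & 0 & 0 & a_{55} & a_{56} \\ 0 & 0 & 0 & 0 & 0 & a_{66} \end{bmatrix} \tag{7}$$
be the state-transition matrix of a six-state FFR HMM, as shown in Fig 1A. Note that the sum of each row is equal to 1, and in particular that $a_{12} = a_{66} = 1$. Since the observation period of the HFR subsequence is twice that of the original FFR sequence, the FFR HMM must go through two state-transition steps to make one RFR state-transition step. Therefore, the state-transition matrix for the HFR subsequence must be the square of the transition matrix for the FFR sequence, i.e., using $a_{12} = a_{66} = 1$:
$$A^{(2)} = A^2 = \begin{bmatrix} 0 & a_{22} & a_{23} & 0 & 0 & 0 \\ 0 & a_{22}^2 & a_{22}a_{23} + a_{23}a_{33} & a_{23}a_{34} & 0 & 0 \\ 0 & 0 & a_{33}^2 & a_{33}a_{34} + a_{34}a_{44} & a_{34}a_{45} & 0 \\ 0 & 0 & 0 & a_{44}^2 & a_{44}a_{45} + a_{45}a_{55} & a_{45}a_{56} \\ 0 & 0 & 0 & 0 & a_{55}^2 & a_{55}a_{56} + a_{56} \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}. \tag{8}$$
The non-zero entries of the above matrix are simply the links in Fig 1B. We can easily generalize the state-transition matrix for the one-third frame rate observation sequence and obtain $A^{(3)} = A^3$, which leads to the adapted HMM topology for the one-third frame rate data shown in Fig 1C. For a subsequence with a frame reduction factor of $D$, one state-transition step between two consecutive observations is equivalent to $D$ state-transition steps in the original FFR model. Therefore, the transition matrix for the RFR subsequence should be the $D$th power of the FFR transition matrix, $A^{(D)} = A^D$.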
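The matrix-power view of the adaptation can be checked numerically. The following is a minimal sketch using NumPy; the self-loop probability of 0.6 is an arbitrary illustrative value (not a trained parameter), and the helper names `adapt_transition_matrix` and `max_forward_jump` are hypothetical.

```python
import numpy as np

def adapt_transition_matrix(A, D):
    """Return the D-th matrix power of an FFR transition matrix."""
    if D < 1:
        raise ValueError("D must be a positive integer")
    return np.linalg.matrix_power(np.asarray(A, dtype=float), D)

def max_forward_jump(M):
    """Largest (target index - source index) over the non-zero entries of M."""
    rows, cols = np.nonzero(M > 0)
    return int(np.max(cols - rows))

# Six-state left-to-right HMM (null start, four emitting states, null end),
# no skip transitions; p is an assumed self-loop probability.
p = 0.6
A = np.array([
    [0, 1, 0,     0,     0,     0],
    [0, p, 1 - p, 0,     0,     0],
    [0, 0, p,     1 - p, 0,     0],
    [0, 0, 0,     p,     1 - p, 0],
    [0, 0, 0,     0,     p,     1 - p],
    [0, 0, 0,     0,     0,     1],
], dtype=float)

A2 = adapt_transition_matrix(A, 2)  # half frame rate
A3 = adapt_transition_matrix(A, 3)  # one-third frame rate

print(np.round(A2, 3))
print(max_forward_jump(A), max_forward_jump(A2), max_forward_jump(A3))
```

Each row of the adapted matrices still sums to 1, and the longest forward jump grows with D, which corresponds to the additional skip links that appear as the frame reduction factor increases.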

Adaptation of hidden Markov models for recognition of continuous speech with reduced frame rate
In continuous speech recognition, HMMs must be connected to form a word network for decoding. A word network defines the sequence of words that can be recognized. When an HMM is connected to the following HMM, the self-transition link of the null ending state of the first HMM must be diverted to the starting null state of the next model, since the ending null state of an HMM represents the time after the last observation of that model. After the self-transition link of the ending null state of the previous HMM is diverted to connect to the starting null state of the next HMM, we can remove these two passing-through null states. There are two possible approaches for adjusting the transition probabilities of a connected word network for recognizing the RFR observation sequence. The first approach is an adapting-then-connecting approach, as shown in Fig 3, in which individual word models are adapted first, according to the RFR factor D, and are then connected to form the word network for RFR speech decoding. The elements of the transition matrix of the adapted combined model include the word-internal state transitions and word-outgoing state transitions. The latter are transitions from the states of the first model to those of the second model. In Fig 3A, we have two FFR HMMs with left-to-right topology and without any state-skipping transitions. In Fig 3B and 3C, we can see that the transition from state (N-2) to state 3' skips two states, which is unreasonable for HFR speech. Therefore, the adapting-then-connecting approach creates transitions that skip more states than an actual RFR model can jump over. In speech recognition, an insertion error occurs when a word is recognized although none was spoken. These excessive-jump state-transition links hinder the alignment of the RFR speech sequence with the connected adapted HMMs and consequently increase the number of insertion errors.
In contrast to the adapting-then-connecting order in the first approach, the second approach uses a connecting-then-adapting strategy to avoid creating links that skip too many states. In this approach, the transition probabilities from the states of an HMM to the states of a directly following HMM are determined first by connecting the two models to form a combined model, as shown in Fig 4A and 4B, and then the combined HMM is adapted according to the frame rate reduction factor D to fit the RFR speech, as shown in Fig 4C. We can see that there are no excessive-jump transitions in Fig 4C and that all the transitions skip at most one state in the HFR adaptation case. This connecting-then-adapting approach is more accurate since it avoids the problem of skipping too many states and alleviates the insertion-prone problem. In this adaptation approach, the destinations and associated probabilities that an emitting state may reach at the next RFR observation time are exactly the same as those it may reach after D FFR observation times in the FFR network. That is, at the same time, the (prior) probability of the emitting states in the RFR network is the same as that in the FFR network.
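The difference between the two orders of operation can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes that two models are connected by removing the exit null state of the first model and the entry null state of the second, and that the probability of a cross-model transition is the product of the exit-column entry of the first matrix and the entry-row entry of the second. The three-state models, the self-loop value 0.6, and all function names are hypothetical.

```python
import numpy as np

def left_to_right(n_emitting, self_loop=0.6):
    """FFR left-to-right HMM with entry/exit null states and no skip links."""
    n = n_emitting + 2
    A = np.zeros((n, n))
    A[0, 1] = 1.0                      # entry null state -> first emitting state
    for i in range(1, n - 1):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    A[n - 1, n - 1] = 1.0              # exit null state kept as absorbing
    return A

def connect(A, B):
    """Combine two models, dropping A's exit null and B's entry null state."""
    n1, n2 = A.shape[0], B.shape[0]
    C = np.zeros((n1 + n2 - 2, n1 + n2 - 2))
    C[:n1 - 1, :n1 - 1] = A[:n1 - 1, :n1 - 1]
    C[:n1 - 1, n1 - 1:] = np.outer(A[:n1 - 1, n1 - 1], B[0, 1:])
    C[n1 - 1:, n1 - 1:] = B[1:, 1:]
    return C

def max_states_skipped(C):
    """Most states jumped over by any emitting-to-emitting transition."""
    inner = C[1:-1, 1:-1]              # ignore the outer entry and exit null states
    rows, cols = np.nonzero(inner > 0)
    return int(np.max(cols - rows - 1))

def compare(D):
    A, B = left_to_right(3), left_to_right(3)
    power = np.linalg.matrix_power
    adapting_then_connecting = connect(power(A, D), power(B, D))
    connecting_then_adapting = power(connect(A, B), D)
    return (max_states_skipped(adapting_then_connecting),
            max_states_skipped(connecting_then_adapting))

print(compare(2))
print(compare(3))
```

In this toy setting, adapting first lets a transition near the model boundary combine a jump inside the first model with another jump inside the second, so it skips more states than D FFR steps allow, whereas adapting the connected model limits every link to at most D - 1 skipped states.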

Design and implementation of a decoder for adapted hidden Markov model network
In a connected-digit recognition system, digit, silence, and short-pause models are connected to form a word loop network and the Viterbi algorithm is used to find the best path for the feature vector sequence of a test utterance to move through the network. A traditional network of HMMs is characterized by each model in the connected network having only a single entrance and a single exit. This forces the emitting state of an HMM to make transitions to a following HMM's emitting states only by going through its exit null state and the entrance null state of the following HMM. This characteristic lessens the burden of designing the decoder program, and almost all publicly available HMM toolkits have this feature. After applying our proposed connecting-then-adapting method to the original FFR word network, the adapted RFR network no longer has the single entrance/exit characteristic, and an emitting state can jump directly to an emitting state of a following HMM. Moreover, there are transitions that can jump over a series of HMMs with a tee transition (a transition link from the entrance null state to the exit null state of an HMM) to an emitting state of an HMM following this series. Since almost all publicly available HMM toolkits rely on the single entrance/exit characteristic, they cannot be directly applied to our new adapted network, and we had to design and implement a new connected-digit decoder from scratch. In our design, we represent an HMM by a data structure that contains the FFR transition matrix and sub-data structures for each of its emitting states. The data structure for an emitting state contains both the parameters for its output probability distribution and transition links to each target emitting state that it can reach at the next observation time, including word-internal links and word-outgoing links that point to the data structure of an emitting state in another HMM.
The data structure for a transition link includes a pointer to the target emitting state and the associated transition probability to that state. When the HMMs are connected to form a network, we must first create the transition links for each emitting state. To do so, we first create transition links to the emitting states of the same HMM based on the transition matrix of that HMM. Next, we create transition links to the emitting states of other HMMs. If we suppose two HMMs with $N_1$ and $N_2$ states (including both starting and ending null states) are concatenated in a network, we can compose a combined HMM with $(N_1 + N_2 - 2)$ states to create links from the first HMM to the second HMM. We can compute the transition probability for the links from the first HMM to the second HMM as follows. Let the transition probability matrices of the two HMMs be
$$A_{HMM1} = [a_{ij}]_{N_1 \times N_1} \quad \text{and} \quad A_{HMM2} = [b_{ij}]_{N_2 \times N_2},$$
respectively. Here, the last column elements of $A_{HMM1}$ and the first row elements of $A_{HMM2}$ correspond to the red and blue arcs in Fig 5, respectively. The transition probability matrix of the combined model becomes the following:
$$\begin{bmatrix} a_{11} & \cdots & a_{1,N_1-1} & a_{1N_1} b_{12} & \cdots & a_{1N_1} b_{1N_2} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{N_1-1,1} & \cdots & a_{N_1-1,N_1-1} & a_{N_1-1,N_1} b_{12} & \cdots & a_{N_1-1,N_1} b_{1N_2} \\ 0 & \cdots & 0 & b_{22} & \cdots & b_{2N_2} \\ \vdots & & \vdots & \vdots & & \vdots \\ 0 & \cdots & 0 & b_{N_2 2} & \cdots & b_{N_2 N_2} \end{bmatrix}. \tag{11}$$
The upper left and lower right submatrices of the transition probability matrix for the combined model come from the upper left $(N_1 - 1) \times (N_1 - 1)$ submatrix of the first matrix and the lower right $(N_2 - 1) \times (N_2 - 1)$ submatrix of the second matrix, respectively. The elements in the upper right $(N_1 - 1) \times (N_2 - 1)$ submatrix represent the transitions from the first HMM to the second HMM, and the transition probability from state $i$ of the first HMM to state $j$ of the second HMM is given by $a_{iN_1} \cdot b_{1j}$. We can then create transition links from the emitting states of the first HMM to the emitting states of the second HMM using the non-zero elements in this upper right submatrix of (11), except the first row and the last column of that submatrix (because they represent transitions either from or to a null state).
Note that the non-zero element in the last column of the upper right submatrix of (11) represents a transition from the first HMM to the ending null state of the second HMM and therefore it can also reach the emitting states of a third HMM if the third HMM is connected to the end of the second HMM.
In that case, we must create links from the first HMM to the third HMM by combining the three HMMs, computing the transition matrix of the combined HMM, and then creating transition links from each non-zero element in the combined transition matrix that is associated with a transition from the first HMM to the third HMM. This process continues until we create all the transition links that correspond to all the targets that the emitting states in the first HMM can reach in one observation time step.
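The link-creation step described above can be sketched with two small data structures. This is a hypothetical illustration rather than the decoder's actual code: the class names `EmittingState` and `TransitionLink`, the `word` field used to flag word-outgoing links, and the example probabilities are assumptions. The example matrix has the form of (11) for two HMMs with two emitting states each.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass(eq=False)
class EmittingState:
    name: str
    word: str                                   # model (word) that owns this state
    links: List["TransitionLink"] = field(default_factory=list)

@dataclass(eq=False)
class TransitionLink:
    target: EmittingState                       # pointer to the target emitting state
    prob: float                                 # associated transition probability
    word_outgoing: bool                         # True if the link leaves the source word

def create_links(states, matrix):
    """Create one link per non-zero emitting-to-emitting entry of `matrix`.

    Row 0 and the last row/column belong to null states and produce no links;
    states[k] corresponds to row and column k + 1 of the matrix.
    """
    n = len(matrix)
    if len(states) != n - 2:
        raise ValueError("expected one state object per emitting state")
    for i in range(1, n - 1):
        source = states[i - 1]
        for j in range(1, n - 1):
            if matrix[i][j] > 0.0:
                target = states[j - 1]
                source.links.append(
                    TransitionLink(target, matrix[i][j], target.word != source.word))

# Combined FFR matrix of two 4-state HMMs (illustrative values):
# entry null, w1.s1, w1.s2, w2.s1, w2.s2, exit null.
combined = [
    [0.0, 1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # 0.4 = a_{3,4} * b_{1,2} crosses into word 2
    [0.0, 0.0, 0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
states = [EmittingState("w1.s1", "w1"), EmittingState("w1.s2", "w1"),
          EmittingState("w2.s1", "w2"), EmittingState("w2.s2", "w2")]
create_links(states, combined)

for s in states:
    print(s.name, [(l.target.name, l.prob, l.word_outgoing) for l in s.links])
```

The same function could be applied to an adapted matrix (for example, the D-th power of the combined matrix), in which case additional word-outgoing links would appear for states further from the word boundary.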
In the adapting-then-connecting approach, we first adapt the transition matrix of each HMM by raising it to the power of the frame-rate reduction factor $D$, and then create transition-link data structures for all possible destinations that an emitting state can reach at the next observation time (of the RFR). Let the HMM under consideration be denoted as the first HMM and let its transition matrix be denoted by $A_1$. Initially, the data structure for the HMM contains its FFR transition matrix but none of its emitting states contains a transition-link data structure for the frame-rate reduction factor $D$. In each of the HMM's emitting states, we then create a transition-link data structure from each of the non-zero elements in the corresponding row of $A_1^D$, except the elements in the last column.
After the model-internal transition links are created, we must create transition links to all the HMMs that immediately follow the first HMM. Let an HMM immediately following the first HMM be denoted as the second HMM. The word-outgoing transition links created in the emitting states of the first HMM point to the emitting states of the second HMM. Note that there may be several HMMs directly connected to the end of the first HMM, and we must create transition links in the first HMM to point to all the directly following HMMs.
If the adapted transition matrix of the second model includes a tee transition, we must create RFR transition links from the first model to the models that directly follow the second model. This process of creating transition links continues until links are created to all possible destinations that an emitting state can reach at the next observation time ($D$ times the FFR observation period).
In the connecting-then-adapting approach, we created transition links for RFR speech by first computing the transition probability matrix of the connected model, raising it to the power of the frame-rate reduction factor $D$, and then creating links using the adapted matrix. The process of creating model-internal transition links was the same as that shown in Fig 6. Fig 9 illustrates the process for creating transition-link data structures from an HMM (referred to as the first HMM) to a directly following HMM (referred to as the second HMM). We must create transition-link data structures for all the destinations that an emitting state can reach at the next RFR observation time.
In this study, we used the Aurora2 [11] database in our experiments to evaluate the performance of the adaptation methods. We simplified the word loop network provided by the Aurora2 database to reduce the programming complexity without sacrificing recognition accuracy. Fig 11A and 11B show the original Aurora2 word loop network and our simplified version, respectively. In Fig 12, we have expanded each word-level model to show the details of its HMM structure. The whole network contains one system start node, one system end node, and fourteen HMMs comprising a front silence, an end silence, a short pause (SP), and 11 English digits (zero, 'oh', one, two, ..., and nine). The front and end silence models share the same model parameter set.
We implemented a modified token-passing algorithm [12] to decode FFR and RFR speech. In our design, a token represents a candidate partial path and its associated likelihood score. The path information of a token includes state-level and word-level paths. A path is represented and implemented as a string of states or digits depending on the path level. At every observation time t, each emitting state holds a token that represents the best subpath that reaches that state at that time. A null state holds no token and represents a place where a token can pass through instantly. Initially, each emitting state holds a token with a score of negative infinity and empty paths, and the system-start node holds a token with a score equal to 0 and empty paths. For each new observation time, in each emitting state, the stored token is propagated along the state's transition links to its destinations. The system-start state propagates its token only at the first observation time. When a token is propagated along a link, we add the link's log probability to the token's score and append the new node to the token's path. When the propagation is through a word-outgoing link, we also update the word-level path. In each destination state, we collect the incoming tokens, select the one with the maximum score, add to its score the log probability that the observation was generated by the destination state, and then replace the stored token with this maximum-score token. We designate a state as the system-end node of the whole network for the purpose of collecting tokens after the last observation is processed. Finally, we select the token with the highest score in the system-end node and retrieve its word-level path as the recognition result.
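The token-passing procedure described above can be sketched compactly. The code below is a simplified illustration under several assumptions: output distributions are discrete lookup tables rather than Gaussian mixtures, the word-level path is extended whenever a token enters a word through the start node or a word-outgoing link, and the two single-state "words", their probabilities, and all names (`Token`, `State`, `advance`, `decode`) are hypothetical.

```python
import math
from dataclasses import dataclass, field

NEG_INF = float("-inf")

@dataclass
class Token:
    score: float = NEG_INF   # log-likelihood of the partial path
    states: tuple = ()       # state-level path
    words: tuple = ()        # word-level path

@dataclass(eq=False)
class State:
    name: str
    word: str
    log_out: dict                                  # symbol -> log output probability
    links: list = field(default_factory=list)      # (target, log prob, word_outgoing)
    token: Token = field(default_factory=Token)

def advance(token, target, log_prob, word_outgoing):
    """Move a token along one link, extending its score and paths."""
    words = token.words + ((target.word,) if word_outgoing else ())
    return Token(token.score + log_prob, token.states + (target.name,), words)

def decode(states, start_links, end_links, observations):
    """Return the best token collected at the system-end node."""
    for s in states:
        s.token = Token()
    for t, obs in enumerate(observations):
        if t == 0:   # the system-start node propagates only at the first time step
            moved = [(tgt, advance(Token(0.0), tgt, lp, True))
                     for tgt, lp in start_links]
        else:
            moved = [(tgt, advance(s.token, tgt, lp, out))
                     for s in states if s.token.score > NEG_INF
                     for tgt, lp, out in s.links]
        best = {}
        for tgt, tok in moved:   # keep the maximum-score token per destination
            if tgt.name not in best or tok.score > best[tgt.name].score:
                best[tgt.name] = tok
        for s in states:         # add the output log probability, store the token
            tok = best.get(s.name, Token())
            if tok.score > NEG_INF:
                tok.score += s.log_out.get(obs, NEG_INF)
            s.token = tok
    finals = [Token(s.token.score + lp, s.token.states, s.token.words)
              for s, lp in end_links]
    return max(finals, key=lambda tok: tok.score)

lg = math.log
a = State("a.1", "a", {"x": lg(0.9), "y": lg(0.1)})
b = State("b.1", "b", {"x": lg(0.1), "y": lg(0.9)})
a.links = [(a, lg(0.5), False), (b, lg(0.25), True)]
b.links = [(b, lg(0.5), False), (a, lg(0.25), True)]
result = decode([a, b],
                start_links=[(a, lg(0.5)), (b, lg(0.5))],
                end_links=[(a, lg(0.25)), (b, lg(0.25))],
                observations=["x", "x", "y", "y"])
print(result.words, result.states)
```

Because every emitting state carries its own list of links, the same loop works unchanged for an adapted network in which a state links directly to emitting states of other models.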

III. Experiments and results
In the experiments, we used the Aurora2 database to investigate the effectiveness of the HMM adaptation methods for the task of speaker-independent connected-digit recognition in clean and noisy environments.

Speech feature extraction, model structure and training methods
We used 12 mel-frequency cepstral coefficients (MFCCs) and one log energy as the static feature vector. We set the frame length and frame shift times for the FFR observation sequence to 25 ms and 10 ms, respectively. The dynamic feature vector was composed of delta and acceleration coefficients of the static feature sequence, and the feature vector for each frame of speech consisted of a total of 39 features. The processing details with respect to feature extraction and expansion were exactly the same as those provided by the Aurora2 database. In accordance with the recommendations of the European Telecommunication Standards Institute (ETSI), we transmitted only the static features from the client and appended the dynamic features after the static features were received at the recognition server. We modeled each digit using an HMM with 16 emitting states, modeled silence using an HMM with three emitting states, and modeled SPs using an HMM with a single emitting state. The emitting state of the short-pause model and the middle state of the silence model shared the same state-output probability distribution. The SP had a tee transition from the null start state to the null end state so that it could be skipped when there is no pause between two digits. We used the Gaussian mixture distribution for the output of the emitting states. The numbers of mixture components for states in the silence model (and hence the short-pause model) and the digit models were eight and four, respectively. We used the HTK Toolkit [10] to train the FFR speech model. We prepared two sets of FFR models, one of which was trained using the clean training condition and the other using the multi-training condition.

Recognition for RFR connected word speech
We investigated and compared the performances of the adapting-then-connecting and connecting-then-adapting approaches with respect to the recognition of RFR speech. We tested the two model adaptation methods for their recognition of clean and noisy test data at several SNR levels from 0 dB to 20 dB in 5-dB steps. An ETSI repetition concealment method for recognition of RFR speech is also included for comparison [13]. Table 1 shows the word accuracies of the ETSI repetition concealment method and the two adaptation approaches in various conditions, in which we can see that the connecting-then-adapting approach obtains slightly better accuracy than the adapting-then-connecting approach for D = 2, 3, and 4. Table 1 also includes the word accuracy at the original frame rate (D = 1) to allow an assessment of the word accuracy degradation caused by frame rate reduction. We can see that for multi-condition training the word accuracies for the ETSI repetition concealment method and the two adaptation approaches are all slightly worse than that of FFR data. Although the adapting-then-connecting approach performs the worst, its word accuracy degradation is limited to within 0.97%, 2.52%, and 4.58% (on average) for D = 2, 3, and 4, respectively. Table 2 lists the insertion error rate for various conditions, in which we can see that the connecting-then-adapting approach had a lower insertion error rate than the adapting-then-connecting approach, as expected. Table 3 lists the deletion error rate for various conditions, in which we can see that although the connecting-then-adapting approach can reduce the insertion error rate, it can also increase the number of deletion errors.
Some trade-off between the resulting insertion and deletion errors is inevitable, since the new approach puts a stricter constraint on the minimum length of a digit and can force a very rapid utterance to be aligned with fewer digits than it should. Table 4 lists the substitution error rate for various conditions, from which we can see that the substitution error rates of these two adaptation approaches are very similar.
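The relationship between the three error types and word accuracy can be made concrete with a short sketch. It assumes the commonly used HTK-style definition, in which word accuracy is the number of reference words minus substitutions, deletions, and insertions, divided by the number of reference words; the paper does not state its formula explicitly, and the counts below are made-up illustrative numbers, not experimental results.

```python
def error_rates(n_ref, substitutions, deletions, insertions):
    """Return error rates and word accuracy as percentages of the reference word count."""
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    errors = substitutions + deletions + insertions
    return {
        "substitution": 100.0 * substitutions / n_ref,
        "deletion": 100.0 * deletions / n_ref,
        "insertion": 100.0 * insertions / n_ref,
        "accuracy": 100.0 * (n_ref - errors) / n_ref,
    }

# Hypothetical counts for 1000 reference words: the second system trades
# six fewer insertions for three more deletions.
first = error_rates(1000, substitutions=10, deletions=5, insertions=15)
second = error_rates(1000, substitutions=10, deletions=8, insertions=9)
print(first["accuracy"], second["accuracy"])
```

Under this definition, a method improves word accuracy whenever the reduction in insertions exceeds the increase in deletions, provided the substitution count stays roughly constant.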
The total decoding time (in minutes), measured by decoding all three test data sets (set A, set B, and set C) of Aurora2, for the ETSI repetition method and the two adaptation approaches with clean condition training and multi-condition training is shown in Figs 13 and 14, respectively. The decoding time was measured on a personal computer with dual Intel Xeon E5-2690 CPUs at 2.90 GHz and 16 GB of random access memory. No multi-thread processing was employed and the decoding program was executed sequentially. The platform used in the experiments was 64-bit Windows 10 Education. The experimental results show that the decoding time for the two adaptation approaches is much less than that of the ETSI repetition concealment method. This means that, with either adaptation approach, the same back-end server is capable of serving many more client users than with the ETSI repetition method. As we can see from Table 1, for multi-condition training and a frame reduction factor of D = 2, for example, the word accuracy obtained with the proposed connecting-then-adapting approach is 0.3% (on average) worse than that obtained with ETSI repetition. However, Fig 14 shows that the proposed connecting-then-adapting approach allows the same back-end server to serve about twice as many client users as ETSI repetition without any extra cost for setting up new equipment. Therefore, using the proposed connecting-then-adapting approach at the back-end server has an appealing consequence: the price paid (performance degradation) is small but the gain (computation cost saving) is large.

Conclusions
In this paper, we presented a new HMM adaptation approach that first connects the HMMs and then adapts the combined HMM for the recognition of RFR continuous speech. This new approach avoids the problems associated with creating transition links that skip too many states and violate the skipping length constraint. Therefore, it can remedy the insertion-prone problem caused by the old adapting-then-connecting approach. In our new approach, the destinations and associated probabilities that an emitting state may reach at the next RFR observation time are exactly the same as those it may reach after D FFR observation times in the FFR network. That is, at the same time, the (prior) probability of the emitting states in the RFR network is the same as that in the FFR network. We derived the formula for computing the transition matrix of the frame-rate-adapted HMMs and for computing the transition matrix of an HMM obtained by concatenating HMMs. We described the design and implementation of the old and new adaptation methods in detail and conducted experiments to compare and analyze the performance of the two adaptation approaches. The experimental results show that our new connecting-then-adapting approach can reduce the insertion error rate and obtain slightly better accuracy than the adapting-then-connecting approach.
We thank the National Center for High-performance Computing (NCHC) of Taiwan for providing computational and storage resources.