Multi-resolution speech analysis for automatic speech recognition using deep neural networks: Experiments on TIMIT

doi:10.1371/journal.pone.0205355

Table 1.

TIMIT phone duration statistics for the longest (/aw/) and shortest (/b/) phones in the set of 48 phones used for training.

Columns show mean, standard deviation, maximum and minimum duration. The last column indicates the percentage of phones with a duration shorter than 25 ms.

Fig 1.

Spectrograms with different time-resolution trade-offs for a short phone.

Spectrograms obtained for a 0.5-second segment around the /b/ phone in the word “before” in the sentence “Drop five forms in the box before you go out” (Speaker FAKS0, sentence SX313). The only difference between the spectrograms is the length of the Hamming window used: 32, 16, 8 and 4 ms, from top to bottom. Vertical red lines mark the limits of the /b/ phone. There are substantial differences in the spectral representation of the short /b/ phone, for which an analysis using shorter windows is probably better.

Fig 2.

Spectrograms with different time-resolution trade-offs for a long phone.

Spectrograms obtained for a 0.5-second segment around the /aw/ phone (more precisely, diphthong) of “out” in the sentence “Drop five forms in the box before you go out” (Speaker FAKS0, sentence SX313). The only difference between the spectrograms is the length of the Hamming window used: 32, 16, 8 and 4 ms, from top to bottom. Vertical red lines mark the limits of the /aw/ diphthong. There are substantial differences in the spectral representation of the long /aw/ diphthong, for which an analysis using longer windows is probably better.
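
The window-length trade-off illustrated in Figs 1 and 2 can be reproduced with a short NumPy sketch. It computes a dB spectrogram with a Hamming window of configurable length. The function name, the 1 ms frame shift, the 512-point FFT and the small floor added before the logarithm are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def db_spectrogram(signal, sample_rate, window_ms, shift_ms=1.0, n_fft=512):
    """Return a (frames x bins) spectrogram in dB using a Hamming window.

    shift_ms and n_fft are assumed values chosen for illustration only.
    """
    win_len = int(round(sample_rate * window_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if win_len > n_fft:
        raise ValueError("window is longer than the FFT size")
    window = np.hamming(win_len)
    n_frames = 1 + (len(signal) - win_len) // shift
    # Cut the signal into overlapping frames and apply the window to each.
    frames = np.stack([
        signal[i * shift : i * shift + win_len] * window
        for i in range(n_frames)
    ])
    # Zero-pad every frame to n_fft so all window lengths share the same bins.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-10)  # small floor avoids log(0)
```

Calling the function with window_ms set to 32, 16, 8 and 4 on the same segment gives four spectrograms with identical frequency bins, so only the time-frequency resolution differs between them.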

Fig 3.

Mel-frequency scaled filterbank.

The figure shows the amplitude response of the 23 filters of a Mel-scaled filterbank ranging from 0 to 8000 Hz.
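
A filterbank of this kind can be sketched as follows. The number of filters (23) and the 0 to 8000 Hz range come from the caption. The Hz-to-Mel formula, the triangular filter shape, the 512-point FFT grid and all function names are common conventions assumed here for illustration and may differ from the authors' exact setup.

```python
import numpy as np

def hz_to_mel(f):
    # Widely used Mel formula (an assumption; the caption does not give one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=23, n_fft=512, sample_rate=16000,
                   f_low=0.0, f_high=8000.0):
    """Return an (n_filters x n_fft//2+1) matrix of triangular filters."""
    # Filter edges are equally spaced on the Mel scale.
    mel_edges = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    hz_edges = mel_to_hz(mel_edges)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = hz_edges[i : i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        # The triangle is the lower of the two slopes, clipped at zero.
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank
```

Multiplying a power spectrum of matching size by the transpose of this matrix yields 23 filterbank energies per frame.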

Fig 4.

Deep Neural Network (DNN).

This is a graphical representation of a standard feedforward DNN architecture. The DNN is fed with an input vector x of dimension D, which is transformed by the hidden layers h_j (composed of N_j hidden units) according to an activation function g and the parameters of the DNN (weight matrices W and bias vectors b). Finally, the output layer O produces the output of the DNN for the target task (for classification, the posterior probability that an input vector belongs to each of the C classes). Reprinted from [19] under a CC BY license, with permission from Alicia Lozano et al., original copyright 2017.
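
The forward pass described in this caption can be written in a few lines of NumPy. This is a minimal sketch: the choice of ReLU for g and of a softmax output layer are assumptions made for illustration (the caption only says "an activation function g"), and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    # Subtract the maximum for numerical stability before exponentiating.
    e = np.exp(z - z.max())
    return e / e.sum()

def dnn_forward(x, weights, biases, g=lambda a: np.maximum(a, 0.0)):
    """Propagate x through the hidden layers, then a softmax output layer.

    weights[j] has shape (N_j, N_{j-1}), with N_0 = D; the last pair
    (weights[-1], biases[-1]) belongs to the output layer with C units.
    """
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = g(W @ h + b)  # hidden layer: h_j = g(W_j h_{j-1} + b_j)
    return softmax(weights[-1] @ h + biases[-1])
```

The returned vector has C non-negative entries that sum to one, so it can be read as class posteriors.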

Fig 5.

Multiresolution spectrum computation.

The figure shows an example of combination of two spectra computed with different window lengths and frame periods to produce a multi-resolution spectrum with a fixed frame period.
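
One plausible way to realize such a combination is sketched below. It assumes that the short-window spectrum has a frame period that is an integer fraction of the long-window one, and that each output frame concatenates one long-window frame with the short-window frames covering the same period. This scheme, the function name and the parameter names are assumptions for illustration; the caption does not specify the exact procedure.

```python
import numpy as np

def combine_resolutions(spec_long, spec_short, ratio):
    """Stack two spectra into one feature matrix with a fixed frame period.

    spec_long:  (T, F) frames at the output frame period.
    spec_short: (about ratio*T, F) frames at 1/ratio of that period.
    ratio:      number of short frames per output frame (assumed integer).
    """
    n_frames = min(spec_long.shape[0], spec_short.shape[0] // ratio)
    # Group consecutive short frames so each row spans one output period.
    short = spec_short[: n_frames * ratio].reshape(n_frames, -1)
    return np.hstack([spec_long[:n_frames], short])
```

With ratio set to 2, for example, every output frame holds F values from the long window followed by 2F values from the two short windows in the same period.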

Table 2.

Baseline results with Kaldi.

In all cases features are MFCC+CMVN+Splice+LDA+MLLT+fMLLR (the same used in the Triphones3 HMM/GMM setup). Feature splicing indicated in the table is performed at the input of the DNN. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.

Table 3.

Results with simplified features.

In all cases the DNN is similar to the FC-pnorm with the only difference that the input layer is modified to fit the input dimensionality. In all cases a frame splicing of ±4 is used at the input of the DNN. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.
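
Frame splicing of ±4 means that each input vector is the current frame concatenated with its 4 left and 4 right neighbours, so the input dimension is 9 times the per-frame feature dimension. The sketch below illustrates this. Padding the utterance edges by repeating the first and last frames is an assumption, and the function name is hypothetical.

```python
import numpy as np

def splice(features, context=4):
    """Concatenate each frame with `context` neighbours on both sides.

    features: (T, D) matrix; the result has shape (T, (2*context+1)*D).
    Edge frames are repeated to fill missing context (an assumption).
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((context, context), (0, 0)), mode="edge")
    # Block k holds the frame at offset k - context from the current one.
    return np.hstack([padded[k : k + n_frames] for k in range(2 * context + 1)])
```

For a 10-frame utterance with 2-dimensional features, the spliced matrix has shape (10, 18), and the middle two columns of each row equal the original frame.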

Table 4.

Results with multi-resolution spectrograms and FC-pnorm networks.

In all cases DNNs have three hidden layers and include ±4 splicing of input features, which are raw spectrograms in dB obtained with Hamming windows. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.

Table 5.

Results with multi-resolution spectrograms and TDNNs-ReLU networks with ±10 feature splicing.

In all cases features are raw spectrograms in dB obtained with Hamming windows. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.

Table 6.

Results with multi-resolution spectrograms and Fully Connected feedforward DNNs trained with Keras and Theano with ReLU activation functions and ±4 feature splicing.

In all cases features are raw spectrograms in dB obtained with Hamming windows. Results are given as frame-by-frame phone state recognition accuracy over 1936 different phone states. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.
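
Frame-level accuracy of this kind can be computed as the fraction of frames whose most probable state matches the reference label. The sketch below assumes the network outputs are stored as a (frames x states) matrix and the reference alignment as an integer vector; both names are hypothetical.

```python
import numpy as np

def frame_accuracy(posteriors, labels):
    """Fraction of frames whose highest-scoring state equals the label.

    posteriors: (T, S) matrix of per-frame state scores.
    labels:     (T,) integer vector of reference state indices.
    """
    predictions = np.argmax(posteriors, axis=1)
    return float(np.mean(predictions == labels))
```

In the setting of this table, S would be 1936, the number of phone states stated in the caption.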

Table 7.

Configurations used with multi-resolution highly processed and speaker-adapted features and FC-sigmoid-RBM-pretrain DNNs.

Table 8.

Results with multi-resolution features and FC-sigmoid-RBM-pretrain DNNs with hidden layers of 1024 units.

In all cases features are MFCC+CMVN+Splice+LDA+MLLT+fMLLR. Input is spliced ±5 frames. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.

Table 9.

Results with multi-resolution features and FC-sigmoid-RBM-pretrain DNNs with hidden layers of 2048 units.

In all cases features are MFCC+Splice+LDA+fMLLR. Input is spliced (±5 frames). Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.

Table 10.

Frame accuracies with multi-resolution features and FC-sigmoid-RBM-pretrain DNNs with hidden layers of 1024 units.

In all cases features are MFCC+CMVN+Splice+LDA+MLLT+fMLLR. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.

Table 11.

Frame accuracies with multi-resolution features and FC-sigmoid-RBM-pretrain DNNs with hidden layers of 2048 units.

In all cases features are MFCC+Splice+LDA+fMLLR. Input Dim. is the dimension of the input of the network including feature splicing. Param. is the number of trainable parameters of the network.
