
Fig 1.

Architecture of proposed spiking neural network (SNN).

The network consists of an input layer, a convolutional layer, and a pooling layer. The input layer converts the Mel-Frequency Spectral Coefficients (MFSC) of the speech signal into spikes using the time-to-first-spike coding scheme. The convolutional layer contains multiple feature maps that are responsible for detecting different features, and their input weights are learned by spike-timing-dependent plasticity (STDP). Each feature map in the convolutional layer is divided into non-overlapping sections that share input weights. The pooling layer compresses the output of the convolutional layer, and its output is classified by a linear classifier.


Fig 2.

The input coding of the SNN.

(A) The MFSC spectrogram of the spoken digit "one". The horizontal axis represents the index of frequency bands, and the vertical axis represents the time frames. (B) The spike coding of one frame (the row of pixels inside the black box) in panel A. The MFSC features are encoded by the time-to-first-spike coding scheme: the higher the feature value, the earlier the neuron fires. Note that the time axis of panel B is unrelated to the time axis of panel A.
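The time-to-first-spike scheme described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact implementation: the linear mapping of feature values to a fixed time window `t_max` is an assumption.

```python
def time_to_first_spike(features, t_max=10.0):
    """Map feature values to spike times: larger value -> earlier spike.

    A minimal sketch; the linear scaling to [0, t_max] is an assumption,
    not necessarily the paper's exact encoding.
    """
    f_min, f_max = min(features), max(features)
    span = (f_max - f_min) or 1.0  # avoid division by zero for flat frames
    # The largest feature fires at t = 0, the smallest at t = t_max.
    return [t_max * (f_max - f) / span for f in features]

# One frame of (hypothetical) MFSC values across frequency bands:
times = time_to_first_spike([0.1, 0.9, 0.5])
```

Here the largest feature (0.9) fires first, at t = 0, and the smallest (0.1) fires last, at t = t_max.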


Fig 3.

The three stages of the model's evaluation.

First, the SNN is trained with STDP on the training set without supervisory labels. Then the fixed network is run on the training set, and the output of the pooling layer, together with the corresponding training labels, is used to train the classifier. Finally, the classifier is run to predict the labels of the test data; these are the predicted labels. The classification accuracy of the model is evaluated by comparing the predicted labels with the ground truth labels.
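The second and third stages above reduce to fitting a linear classifier on the pooled SNN outputs and scoring it on held-out data. A sketch with scikit-learn's LinearSVC as the linear classifier, using random stand-in features (the array shapes and class count here are illustrative assumptions, not the paper's):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the pooling-layer outputs of the trained SNN:
# 100 training samples and 20 test samples, 50 pooled features, 10 digit classes.
train_feats = rng.normal(size=(100, 50))
train_labels = rng.integers(0, 10, size=100)
test_feats = rng.normal(size=(20, 50))

# Stage 2: train the linear classifier on SNN outputs + training labels.
clf = LinearSVC()
clf.fit(train_feats, train_labels)

# Stage 3: predict labels for the test data; accuracy would then be
# computed against the ground truth labels.
predicted_labels = clf.predict(test_feats)
```

In the paper's setting, `train_feats` and `test_feats` would be the final pooling-layer responses produced by the frozen, STDP-trained network.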


Fig 4.

Confusion matrix of the classification result on the test set.

The diagonal values indicate the ratio of correct classifications for each digit, while off-diagonal values represent the misclassifications.


Table 1.

Comparison of proposed SNN and other models on the isolated spoken digit classification task.

RC: Reservoir computing; FC: Fully-connected layer; CSNN: Convolutional spiking neural network; HTM: Hierarchical temporal memory; RNN: Recurrent neural network; DN: Delta network.


Fig 5.

The SVM classification accuracy curve on the test set averaged over five runs.

The SVM accuracy on the SNN output exceeds the accuracy on the raw MFSC features after 900 samples, and converges to about 97.5% after 6000 samples.


Fig 6.

Performance comparison of local weight sharing and global weight sharing averaged over five runs.

Global weight sharing works well with a larger number of feature maps, while local weight sharing performs significantly better than global weight sharing with fewer feature maps.


Fig 7.

Visualization of the evolving receptive fields during training.

Before training, the receptive fields show no distinct features, since the weights are initialized randomly. As training proceeds, the features are gradually learned. Finally, the learned features are distinct from each other due to the effect of lateral inhibition.


Fig 8.

Visualization of MFSC features and SNN output with t-SNE.

All samples are color-coded according to their digit classes. (A) The t-SNE visualization of MFSC features. Most digit classes form more than one cluster. (B) The t-SNE visualization of SNN output. The SNN processing merges the clusters of each digit or brings them closer together.


Fig 9.

Visualization of SNN output.

Each image represents the SNN output after a sample from the corresponding digit class is processed. The brightness of each pixel represents the final membrane potential of a pooling neuron. Most pixels are dark in all images, so the output is sparsely coded.


Fig 10.

Confusion matrix of the classification result on the test set of the TIMIT dataset.

The diagonal values represent the ratio of correct classifications for each word, and the off-diagonal values represent the misclassifications.
