
Fig 1.

Auditory hierarchical spiking neural network (HSNN) model.

The model consists of (a) a cochlear model stage that transforms the sound waveform into a spectrogram (time vs. frequency), (b) a central hierarchical spiking neural network containing frequency-organized spiking neurons, and (c) a Bayesian classifier that reads out the spatio-temporal spike train outputs of the HSNN. Each dot in the output represents a single spike at a particular time-frequency bin. (d-f) A zoomed-in view of the HSNN illustrates the pattern of convergent and divergent connections between network layers for a single leaky integrate-and-fire (LIF) neuron. (d-e) Input spike trains from the preceding network layer are integrated with excitatory (red) and inhibitory (blue) connectivity weights that are spatially localized and modeled by Gaussian functions (f). The divergence and convergence between consecutive layers are controlled by the connectivity width (SD of the Gaussian model, σl). Each incoming spike generates excitatory and inhibitory post-synaptic potentials (EPSPs and IPSPs; red and blue kernels in e). The integration time constant (τl) of the EPSP and IPSP kernels can be adjusted to control the temporal integration between consecutive network layers, while the spike threshold level (Nl) is independently adjusted to control the output firing rates and the overall sensitivity of each neuron layer. (g, h) Example cochlear model outputs and the corresponding multi-neuron spike train outputs of the HSNN in speech babble noise (20 dB SNR). (g) HSNN response patterns for one sample each of the words zero, six, and eight illustrate the output pattern variability that can be used to differentiate words. (h) Example response variability for the word zero spoken by multiple talkers in the presence of speech babble noise (20 dB SNR).
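The single-neuron description in (d-f) can be summarized in a minimal sketch (not the authors' implementation): an excitatory-only LIF layer with Gaussian connectivity of width σl, EPSP time constant τl, and a spike threshold. All array sizes, rates, and parameter values below are illustrative assumptions.

```python
import numpy as np

def gaussian_weights(n_pre, n_post, sigma):
    """Spatially localized connectivity: each post-synaptic neuron
    weights pre-synaptic inputs by a Gaussian centered on its own
    tonotopic position (SD = sigma, in units of neuron index)."""
    pre = np.arange(n_pre)[None, :]
    post = np.linspace(0, n_pre - 1, n_post)[:, None]
    w = np.exp(-0.5 * ((pre - post) / sigma) ** 2)
    return w / w.sum(axis=1, keepdims=True)   # normalize total input drive

def lif_layer(spikes_in, sigma, tau, thresh, dt=1e-3):
    """One simplified HSNN layer: Gaussian-weighted input, exponential
    EPSP integration (time constant tau), threshold-and-reset spiking.
    Inhibitory kernels are omitted for brevity."""
    n_t, n_pre = spikes_in.shape
    w = gaussian_weights(n_pre, n_pre, sigma)
    v = np.zeros(n_pre)
    out = np.zeros_like(spikes_in)
    decay = np.exp(-dt / tau)
    for t in range(n_t):
        v = v * decay + w @ spikes_in[t]      # integrate weighted spikes
        fired = v >= thresh
        out[t] = fired
        v[fired] = 0.0                        # reset after a spike
    return out
```

Stacking such layers, with σl, τl, and the threshold scaled per layer, gives the hierarchical structure sketched in (b).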


Fig 2.

Global optimal solution that maximizes word recognition accuracy in the presence of background noise (-5, 0, 5, 10, 15 and 20 dB SNR).

Cross-validated word recognition accuracy (see Methods) is measured using the network outputs as a function of the three scaling parameters (ατ, γσ, and λN). Word recognition accuracy curves are shown at 5 and 20 dB SNR (a and b, respectively) as well as for the global solution (c, average accuracy between -5 and 20 dB SNR). In all cases shown, word recognition accuracy curves are tuned for the different scaling parameters and exhibit a similar optimal solution (green circles). (d) The optimal scaling parameters are relatively stable across SNRs and similar to the solution that maximizes average performance across all SNRs (optimal solution: ατ = 1.9, γσ = 1.0, and λN = 1.0).
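The search in (a-d) amounts to evaluating cross-validated accuracy on a grid of the three scaling parameters and keeping the maximizer of the SNR-averaged accuracy. A generic sketch, with a placeholder `evaluate` callable standing in for the full train/test pipeline:

```python
import itertools

def grid_search(evaluate, alphas, gammas, lambdas, snrs):
    """Exhaustive search for the global scaling parameters
    (alpha_tau, gamma_sigma, lambda_N) maximizing mean cross-validated
    word accuracy across SNRs. `evaluate` is a user-supplied callable:
    (alpha, gamma, lam, snr) -> accuracy."""
    best, best_acc = None, float("-inf")
    for a, g, l in itertools.product(alphas, gammas, lambdas):
        acc = sum(evaluate(a, g, l, s) for s in snrs) / len(snrs)
        if acc > best_acc:
            best, best_acc = (a, g, l), acc
    return best, best_acc
```

Optimizing at a single SNR, as in (a) and (b), corresponds to passing a one-element `snrs` list.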


Fig 3.

Receptive field transformations of the optimal HSNN predict transformations observed along the ascending auditory pathway.

(a) Example spectro-temporal receptive fields (STRFs) measured for the optimal network change systematically between consecutive network layers. All STRFs are normalized to the same color scale (red = increase in activity or excitation; blue = decrease in activity or inhibition/suppression; green tones = lack of activity). In the early network layers, STRFs are relatively fast, with short durations and latencies, and are relatively narrowly tuned. STRFs become progressively slower and slightly broader, and develop longer patterns of inhibition across the network layers, mirroring changes in spectral and temporal selectivity observed in the ascending auditory pathway. The measured (b) integration times, (c) latencies, and (d) bandwidths increase across the six network layers. (e) Example STRFs from the auditory nerve (AN) [26], inferior colliculus (IC) [7], thalamus (MGB), and primary auditory cortex (A1) [8] become progressively longer and exhibit progressively more complex spectro-temporal sensitivity along the ascending auditory pathway. Average integration times (f), latencies (g), and bandwidths (h) between AN and A1 follow similar trends as in the optimal HSNN (b-d). Asterisks (*) designate significant comparisons (t-test with Bonferroni correction, p<0.01) relative to layer 1 for the optimal network (b-d) or relative to the auditory nerve for the neural data (f-h); error bars designate SD.
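One standard way to measure an STRF like those in (a) and (e) is reverse correlation (spike-triggered averaging) of the input spectrogram. The sketch below assumes an approximately white stimulus and is illustrative rather than the paper's exact estimation procedure:

```python
import numpy as np

def strf_sta(stim, spikes, n_lags):
    """Estimate a neuron's STRF as the spike-triggered average of the
    cochlear spectrogram: average the n_lags stimulus frames that
    precede each spike. stim is (time, frequency); spikes is a binary
    vector aligned with stim."""
    n_t, n_f = stim.shape
    sta = np.zeros((n_lags, n_f))
    n_spk = 0
    for t in np.flatnonzero(spikes):
        if t >= n_lags:                # need a full stimulus history
            sta += stim[t - n_lags:t]
            n_spk += 1
    return sta / max(n_spk, 1)
```

Integration time, latency, and bandwidth as plotted in (b-d) and (f-h) can then be read off the resulting time-frequency weighting function.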


Fig 4.

Receptive field transformations of the high-resolution network indicate that spectro-temporal information propagates with minimal processing across network layers.

(a) Example spectro-temporal receptive fields (STRFs) measured for the high-resolution network maintain high resolution and change minimally across network layers. Unlike in the optimal network, the measured (b) integration times and (c) latencies change minimally and are relatively constant across the six network layers. (d) Bandwidths, by comparison, increase slightly across the six network layers. The figure format follows the same convention as in Fig 3.


Fig 5.

Optimal HSNN outperforms a high-resolution HSNN designed to preserve incoming acoustic information.

Sample network spike train outputs and Bayesian likelihood histograms for the words three, four, five, and nine are shown for the (a) high-resolution and (b) optimal HSNN at 5 dB SNR. The Bayesian likelihood histograms correspond to the average probability of firing at each time-frequency bin for each digit (averaged across all trials and talkers). The firing patterns and Bayesian likelihoods of the high-resolution network are spatio-temporally blurred compared to those of the hierarchical network. (b) Details such as spectral resonances (e.g., formants) and temporal transitions resulting from voicing onset are accentuated in the hierarchical network output. (c) The optimal HSNN (optimized to maximize performance across all SNRs) outperforms the high-resolution network in the word recognition task at all SNRs tested (blue = optimal; red = high-resolution), with an average accuracy improvement of 25.7%. The word recognition accuracy of the optimal HSNN also closely matches the performance obtained when the network is optimized and tested individually at each SNR (black, SNR-optimal HSNN), indicative of a stable network representation.
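The Bayesian readout described here (average firing probability per time-frequency bin per word, then maximum-likelihood classification) can be sketched as a naive Bayes classifier over binary spike patterns. The smoothing constant `eps` is an assumption added to avoid log(0), not a value from the paper:

```python
import numpy as np

def fit_likelihoods(patterns, labels, n_classes, eps=1e-3):
    """Per-word likelihood histograms: the average firing probability
    at each time-frequency bin, estimated from training patterns of
    shape (trials, time, frequency)."""
    p = np.stack([patterns[labels == c].mean(axis=0)
                  for c in range(n_classes)])
    return np.clip(p, eps, 1 - eps)

def classify(pattern, p):
    """Naive-Bayes log-likelihood of a binary spike pattern under each
    word model; returns the index of the most likely word."""
    ll = pattern * np.log(p) + (1 - pattern) * np.log(1 - p)
    return int(ll.reshape(len(p), -1).sum(axis=1).argmax())
```

Blurring in the likelihood histograms, as in (a), flattens the per-class probabilities and therefore shrinks the log-likelihood margins between words.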


Fig 6.

Hierarchical transformation between consecutive network layers enhances word recognition performance and robustness of the optimal HSNN.

(a) The average word accuracy at 5 dB SNR systematically increases across network layers for the optimal HSNN (a, blue), whereas the high-resolution HSNN exhibits a systematic reduction in word recognition accuracy (a, red). For the high-resolution HSNN, average firing rates (b, red), information rates (c, red), and information per spike (d, red) are relatively constant across layers, indicating minimal transformation of the incoming acoustic information. In contrast, average firing rates (b, blue) and information rates (c, blue) both decrease between the first and last network layers of the optimal network, consistent with a sequential sparsification of the response and a reduction in the acoustic information encoded in the output spike trains. However, the information conveyed by single action potentials (d, blue) in the optimal HSNN sequentially increases between the first and last layer, so that individual action potentials become progressively more informative across layers. Continuous curves show the mean, whereas error contours designate the SD.
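Panel (d) follows from panels (b) and (c) by a simple relation: the information carried per action potential is the information rate divided by the firing rate. A one-line sketch (the numbers in the test are made up for illustration):

```python
def bits_per_spike(info_rate_bits_s, firing_rate_hz):
    """Information carried by a single action potential: total
    information rate (bits/s) divided by mean firing rate (spikes/s)."""
    return info_rate_bits_s / firing_rate_hz
```

This is why a falling firing rate with a more slowly falling information rate, as in the optimal HSNN, yields rising information per spike.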


Fig 7.

Hierarchical transformations between consecutive network layers of the optimal HSNN serve to denoise the speech signal and selectively enhance word-related temporal information.

The average network outputs are shown for the word zero (over 50 trials) at different layers of the optimal (a) and high-resolution (b) networks. The response signal-to-noise ratio (SNR) remains relatively consistent for the high-resolution network across layers (c, red curve). By comparison, for the optimal network the response is sequentially lowpass filtered, so that the response SNR is sequentially reduced across layers for high modulation frequencies (c, black curve). (d) The difference SNR (optimal minus high-resolution; difference of the black and red curves in c) demonstrates a sequential lowpass filtering that accumulates across layers of the optimal HSNN. The band SNR within the fluctuation/rhythm range (1–25 Hz) decreases with layer, but is slightly enhanced for the optimal network (e, black curve) compared to the high-resolution network (e, red curve). The band SNR within the periodicity pitch range (75–150 Hz) is substantially reduced across layers of the optimal network (f, black) compared to the high-resolution network (f, red). The modulation index within the 1–25 Hz band increases, and is thus enhanced, across layers of the optimal network (g, black), whereas it is reduced for the high-resolution network (g, red). Error bars in panels e-g represent SEM.
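The band SNR measure in (e, f) can be approximated by comparing power in the trial-averaged response (signal) with power in the trial-to-trial residuals (noise) within a modulation band. This is a plausible sketch, not the authors' exact estimator:

```python
import numpy as np

def band_snr(trials, fs, f_lo, f_hi):
    """Response SNR within a modulation band: power of the
    trial-averaged response (signal) over power of the trial-to-trial
    residuals (noise), summed over f_lo..f_hi Hz. trials has shape
    (n_trials, n_time); fs is the sampling rate in Hz."""
    mean = trials.mean(axis=0)
    resid = trials - mean
    freqs = np.fft.rfftfreq(trials.shape[1], d=1.0 / fs)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    sig = np.abs(np.fft.rfft(mean))[band] ** 2
    noi = (np.abs(np.fft.rfft(resid, axis=1))[:, band] ** 2).mean(axis=0)
    return sig.sum() / noi.sum()
```

Evaluated in the 1–25 Hz and 75–150 Hz bands, this quantifies the selective preservation of rhythm-rate modulations versus the suppression of pitch-rate modulations.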


Fig 8.

Optimal HSNN enhances robustness and outperforms single-layer generalized linear model networks with matched linear and nonlinear receptive field transformation.

(a) Linear STRFs obtained at the output of the HSNN are used to model the linear receptive field transformation of each neuron (see Methods). The LP network consists of an array of linear STRFs followed by a Poisson spike generator. The LNP network additionally incorporates a rectifying output stage following each STRF. (b) The optimal HSNN outperforms the LP network, with an average performance improvement of 21.7% across SNRs. Nonlinear output rectification in the LNP network improves its performance to within 2% of the HSNN at 20 dB SNR. However, the average LNP performance was 7% lower than that of the optimal HSNN, and its performance degraded systematically with increasing noise levels (13.75% performance reduction at -5 dB SNR), demonstrating the enhanced robustness of the optimal HSNN. (c) Performance improvement of each of the tested models compared against the performance of the high-resolution network.
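An LNP neuron as described in (a) (linear STRF filtering, half-wave rectification, Poisson spiking) can be sketched as follows. The spike generator is a Bernoulli approximation to Poisson spiking, and the gain and time step are illustrative assumptions; omitting the rectification yields the LP variant:

```python
import numpy as np

def lnp_response(stim, strf, rng, dt=1e-3, gain=50.0):
    """Single-layer LNP neuron: linear STRF filtering of the
    spectrogram ('L'), half-wave rectification ('N'), and a Bernoulli
    approximation to Poisson spike generation ('P'). stim is
    (time, frequency); strf is (n_lags, frequency)."""
    n_lags = strf.shape[0]
    spikes = np.zeros(stim.shape[0], dtype=bool)
    for t in range(n_lags, stim.shape[0]):
        drive = float((stim[t - n_lags:t] * strf).sum())  # 'L' stage
        rate = max(drive, 0.0)                            # 'N' stage
        spikes[t] = rng.random() < min(gain * rate * dt, 1.0)
    return spikes
```

Because each such neuron applies only a static nonlinearity after one linear filter, the cascade of layer-by-layer spiking nonlinearities in the HSNN is what this comparison isolates.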


Fig 9.

Optimal temporal resolution that maximizes word recognition accuracy in noise.

(a) Word accuracy rate as a function of spike train temporal resolution (bin widths 0.5–100 ms) and SNR (-5 to 20 dB) for the optimal network. Each curve is computed by selecting the optimal scaling parameters for each SNR and measuring the word accuracy rate from the network outputs at multiple temporal resolutions. (b) Same as (a), except that the global optimal scaling parameters were used for all SNRs tested. The temporal resolution that maximizes the word accuracy rate of the global optimal HSNN is 6.5 ms. (c) Word accuracy rate as a function of temporal resolution and SNR for the high-resolution network; the optimal temporal resolution for the high-resolution HSNN is 2 ms.
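Varying the analysis resolution as in (a-c) reduces to re-binning the spike trains at different bin widths before classification. A minimal sketch, assuming a fixed fine simulation time step `dt` (the 0.5 ms value here is an illustrative assumption):

```python
import numpy as np

def rebin(spikes, bin_width, dt=5e-4):
    """Re-bin a fine-resolution spike train (sampled every dt seconds)
    into wider bins of bin_width seconds; each output bin counts the
    spikes it contains. Trailing samples that do not fill a whole bin
    are dropped."""
    step = max(int(round(bin_width / dt)), 1)
    n = (len(spikes) // step) * step
    return spikes[:n].reshape(-1, step).sum(axis=1)
```

Sweeping `bin_width` from 0.5 to 100 ms and re-evaluating word accuracy at each setting traces out curves like those shown.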
