Two stages of bandwidth scaling drives efficient neural coding of natural sounds

doi:10.1371/journal.pcbi.1010862

Fig 1.

Using a multi-stage auditory system model to measure the modulation power spectrum of natural sounds.

A cochlear filterbank stage first decomposes the sound pressure waveform (shown for speech) into a spectro-temporal output representation (cochleogram). The cochleogram is then decomposed into modulation bands by a bank of spectro-temporal receptive fields (STRFs) of varying resolution modeled after the principal auditory midbrain nucleus (inferior colliculus). The resulting multi-dimensional output represents the sounds in frequency, time, temporal modulation, and spectral modulation. The modulation power spectrum (MPS), as measured through this auditory midbrain-inspired representation, is generated by measuring and plotting the power in each of the modulation band outputs versus temporal and spectral modulation frequency.

Fig 2.

Comparing Fourier and cochlear model filterbanks.

(A) Cochlear filter transfer functions are shown for model filters with best frequencies between 0.1–10 kHz (color designates gain in dB). The cochlear filters are logarithmically spaced and have bandwidths that scale with frequency (proportional resolution). They exhibit a sharp high-frequency transition and a gradual low-frequency transition, as observed physiologically for auditory nerve fibers. A subset of the transfer functions is line plotted above. Three selected filters (103.5, 830.0, and 6653.5 Hz) are shown in different colors and their corresponding time-domain impulse responses are shown below. (B) The Fourier filterbank, by comparison, has constant-resolution filters (30 Hz bandwidth shown here) that are ordered on a linear scale (shown up to 2 kHz for clarity; a subset is line plotted above, with three example filters at 250, 750, and 1500 Hz). In the time domain, the cochlear filter impulse responses (C) have frequency-dependent peak amplitudes and delays, and the impulse response durations scale inversely with frequency. For visualization purposes and ease of comparison, the impulse response line plots for the three examples are normalized to a constant peak amplitude (C, top). The Fourier filterbank filters, by comparison, have constant duration and are designed for zero delay (D). (E) Time (Δt) and frequency (Δf) resolution of the cochlear filters (colored circles) and of three distinct Fourier filterbanks (+ symbols show Δf = 30 Hz, 120 Hz, and 480 Hz). The dotted line represents the uncertainty principle boundary. Whereas each Fourier filterbank is represented by a single point that falls on the uncertainty principle boundary, the time-frequency resolution of the cochlear filters is frequency dependent (colored circles).
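
The time-frequency trade-off in panel E can be illustrated with a small calculation. The sketch below assumes the Gabor form of the uncertainty principle, Δt·Δf ≥ 1/(4π), with both resolutions taken as standard deviations; this caption does not state how Δt and Δf are defined in the paper, so the numbers are illustrative only. The quality factor `q` and all function names are hypothetical.

```python
import math

# Assumed bound on dt * df (standard-deviation definitions of both widths).
GABOR_LIMIT = 1.0 / (4.0 * math.pi)

def min_time_resolution(delta_f_hz):
    """Smallest dt (seconds) permitted by the assumed bound for a given df (Hz)."""
    return GABOR_LIMIT / delta_f_hz

def proportional_bandwidth(best_freq_hz, q=4.0):
    """Hypothetical constant-Q filter: bandwidth grows in proportion to best frequency."""
    return best_freq_hz / q

# Fourier filterbanks: a single (df, dt) point each, independent of frequency.
fourier_points = {df: min_time_resolution(df) for df in (30.0, 120.0, 480.0)}

# Proportional-resolution filters: the dt lower bound shrinks as best frequency grows.
cochlear_points = [
    (bf, proportional_bandwidth(bf), min_time_resolution(proportional_bandwidth(bf)))
    for bf in (103.5, 830.0, 6653.5)
]

for df, dt in fourier_points.items():
    print(f"Fourier  df={df:6.1f} Hz  dt={1e3 * dt:.3f} ms")
for bf, df, dt in cochlear_points:
    print(f"BF={bf:7.1f} Hz  df={df:7.1f} Hz  dt>={1e3 * dt:.3f} ms")
```

Under these assumptions, quadrupling Δf cuts the attainable Δt to a quarter, which is why each fixed-resolution Fourier filterbank occupies one point while a proportional-resolution filterbank traces a range of points.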

Fig 3.

Example Fourier and cochlear model spectrogram decompositions for vocalizations and background environmental sounds: (A) Crackling fire, (B) owl vocalization, (C) speech, and (D) running water. Fourier-based spectrograms are shown for three different frequency resolutions (Δf = 30, 120 and 480 Hz). The Fourier spectrograms tend to have higher power and details that are more concentrated at low frequencies, while the cochlear spectrograms have spectro-temporal components and power distributions that are more evenly distributed across frequency. Black (1.6–6.4 kHz), magenta (0.4–1.6 kHz) and red (0.1–0.4 kHz) boxes for speech (C) illustrate regions of the Fourier or cochlear spectrograms that emphasize the voicing harmonic structure, second formant, and voicing temporal periodicity, respectively.

Fig 4.

Spectra of vocalization (VC) and background (BG) natural sound ensembles. Power spectra are shown for both the (A) cochlear and (B) Fourier-based model representations. Dotted lines represent the best linear fit between 0.1–10 kHz. All but one of the natural sounds have a Fourier spectrum with negative slope, while cochlear spectra, by comparison, have more varied slopes (positive and negative), indicating a more even distribution of power across frequencies. The spectral entropy of each sound category is listed on the right side of the panel.
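
The two summary statistics in this caption, the slope of a linear fit and the spectral entropy, can be sketched as follows. This is a minimal illustration, not the paper's procedure: it assumes entropy is the Shannon entropy (in bits) of the power spectrum normalized to sum to one, and that the fit is ordinary least squares of power in dB against frequency in octaves. The function names and the toy spectra are hypothetical.

```python
import math

def spectral_entropy(power):
    """Shannon entropy (bits) of a power spectrum normalized to unit sum."""
    total = sum(power)
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log2(p) for p in probs)

def fit_slope(x, y):
    """Ordinary least-squares slope of y against x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

# Toy spectra on eight octave-spaced channels starting at 100 Hz.
freqs = [100.0 * 2 ** k for k in range(8)]
octaves = [math.log2(f / freqs[0]) for f in freqs]
flat_power = [1.0] * len(freqs)            # whitened spectrum
one_over_f_power = [1.0 / f for f in freqs]  # power falls with frequency

for name, power in (("flat", flat_power), ("1/f", one_over_f_power)):
    power_db = [10.0 * math.log10(p) for p in power]
    slope = fit_slope(octaves, power_db)
    print(f"{name:>4}: slope={slope:6.2f} dB/octave, entropy={spectral_entropy(power):.2f} bits")
```

In this toy case the flat spectrum reaches the maximum possible entropy for eight channels (3 bits) with zero slope, while the 1/f spectrum has a negative slope and lower entropy, mirroring the qualitative contrast the caption describes.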

Fig 5.

Cochlear model bandwidth scaling whitens the power spectrum of natural sounds and maximizes spectral entropy.

(A) Violin plots showing the distribution of normalized slopes of the best regression fits for both the Fourier and cochlear models (from Fig 4). For both vocalization and background sounds, normalized spectral slopes for the Fourier decomposition are negative and not significantly different (t-test, p = 0.58). By comparison, the cochlear model yields positive and negative slopes for vocalizations and background sounds, respectively, with an average slope near zero (0.2), indicating a whitened average spectrum. (B and C) The cochlear model entropy is higher than the Fourier-based entropy regardless of the Fourier filter resolution used (30, 120 or 480 Hz). (D) Bandwidth scaling predicts the cochlear filter whitening. The average Fourier power spectrum has a decreasing trend (black), whereas the cochlear power spectrum is substantially flatter (red, continuous). The gain provided by the cochlear filter bandwidths (green curve) increases with frequency and counteracts the decreasing power trend of the Fourier power spectrum. The cochlear power spectrum is accurately predicted by considering the bandwidth-dependent gain (dotted red lines; bandwidth gain + Fourier power spectrum).
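
The prediction in panel D (bandwidth gain plus Fourier power spectrum, in dB) can be illustrated with idealized numbers. The sketch below assumes a power spectral density that falls exactly as 1/f and filters whose bandwidth is exactly proportional to best frequency; neither is claimed to match the measured spectra or the model's actual filters. The quality factor `Q` and the function name are hypothetical.

```python
import math

Q = 4.0  # hypothetical quality factor (best frequency / bandwidth)

def bandwidth_gain_db(bandwidth_hz, ref_bandwidth_hz):
    """Gain (dB) from integrating power over a wider band, relative to a reference band."""
    return 10.0 * math.log10(bandwidth_hz / ref_bandwidth_hz)

# Octave-spaced best frequencies from 0.1 to 6.4 kHz.
best_freqs = [100.0 * 2 ** k for k in range(7)]

# Assumed Fourier power spectral density: decreasing as 1/f (about -3 dB per octave).
fourier_db = [10.0 * math.log10(1.0 / f) for f in best_freqs]

# Proportional bandwidths and their gain relative to the lowest-frequency filter.
bandwidths = [f / Q for f in best_freqs]
gain_db = [bandwidth_gain_db(b, bandwidths[0]) for b in bandwidths]

# Predicted filterbank output: Fourier spectrum plus bandwidth-dependent gain.
predicted_db = [s + g for s, g in zip(fourier_db, gain_db)]

for f, s, g, p in zip(best_freqs, fourier_db, gain_db, predicted_db):
    print(f"{f:7.1f} Hz  fourier={s:7.2f} dB  gain={g:6.2f} dB  predicted={p:7.2f} dB")
```

In this idealized case the rising gain cancels the falling spectrum exactly, so the predicted output is flat; for real sounds the cancellation would only be approximate.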

Fig 6.

Fourier and midbrain modulation filterbanks.

Modulation decomposition filters are shown for (A) the midbrain filterbank and (C) the Fourier-based filterbank, with each transfer function contoured at the 3 dB level (50% power). Note that the Fourier-based modulation filters have equal resolution in both spectral and temporal dimensions, whereas the midbrain modulation filters have proportional resolution as observed physiologically (i.e., bandwidth scaling). The corresponding STRFs are shown for both the (B) midbrain filterbank and (D) Fourier-based filterbank. Note that the Fourier-based STRFs have equal duration and bandwidth, whereas the durations and bandwidths scale for the midbrain filters.

Fig 7.

Modulation power spectra of natural sound ensembles including vocalizations (VC) and background sounds (BG). The modulation power spectrum is shown for the (A) Fourier-based decomposition (Δf = 30 Hz), (B) cochlear model decomposition and (C) midbrain model decomposition. Whereas the Fourier MPS and cochlear model MPS overemphasize low frequency spectral and temporal modulations, the midbrain model MPS is substantially flatter. Black contours in each graph designate the MPS region accounting for 90% of the total sound power. The modulation entropy of each sound category is listed on the right side of the panel.
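
One way to delineate a region holding 90% of the total power is to keep the highest-power modulation bins until their cumulative power reaches the target fraction. The sketch below illustrates that idea on a toy grid; the caption does not say how the contours were computed, so this is an assumed procedure, and `power_region_mask` and the example values are hypothetical.

```python
def power_region_mask(mps, fraction=0.9):
    """Mark the highest-power bins whose cumulative power first reaches `fraction` of the total.

    `mps` is a list of rows of non-negative power values
    (for example, spectral modulation by temporal modulation).
    """
    # Rank every bin by power, strongest first, remembering its position.
    ranked = sorted(
        ((value, i, j) for i, row in enumerate(mps) for j, value in enumerate(row)),
        reverse=True,
    )
    total = sum(value for value, _, _ in ranked)
    mask = [[False] * len(row) for row in mps]
    running = 0.0
    for value, i, j in ranked:
        if running >= fraction * total:
            break
        mask[i][j] = True
        running += value
    return mask

# Toy MPS with power concentrated at low modulation frequencies (top-left corner).
toy_mps = [
    [50.0, 20.0, 5.0],
    [15.0, 6.0, 2.0],
    [1.0, 0.5, 0.5],
]
mask = power_region_mask(toy_mps)
n_bins = sum(cell for row in mask for cell in row)
print(f"{n_bins} of 9 bins hold at least 90% of the power")
```

A spectrum concentrated at low modulations yields a small region, whereas a flatter (whitened) spectrum needs many more bins to reach the same fraction.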

Fig 8.

Midbrain model decomposition maximizes the modulation entropy of natural sounds.

(A) Spectral and temporal modulation entropy are significantly higher for the midbrain model when compared against Fourier (black; Δf = 30 Hz) and cochlear model (red). (B) The total modulation entropy is highest for the midbrain model when compared against Fourier and cochlear models.

Fig 9.

Modulation bandwidth scaling in the midbrain model accounts for the modulation whitening.

Averaging over all sounds, the (A) cochlear model MPS overemphasizes low temporal and spectral modulations, whereas the (B) midbrain model MPS is substantially flatter. (C) Residual gain of the midbrain modulation filters arising from bandwidth scaling. (D) The predicted midbrain MPS, obtained by adding the cochlear MPS (A) and the bandwidth-dependent gain (C), accounts for the whitened output of the midbrain model.
