Fig 1.
Neural responses of AN fibers are invariably nonstationary, even when the stimulus is not.
(A, B) Spectrogram and waveform of a speech segment (s4 described in Materials and Methods). Formant trajectories (black lines in panel A) and short-term intensity (red line in panel B, computed over 20-ms windows with 80% overlap) vary with time, highlighting two nonstationary aspects of speech stimuli. (C) PSTH constructed using spike trains in response to a tone at the AN fiber’s characteristic frequency (CF, most-sensitive frequency; the fiber had CF = 730 Hz and a high spontaneous rate, or SR [28]). Tone intensity = 40 dB SPL. Even though the stimulus is stationary, the response is nonstationary (i.e., sharp onset followed by adaptation). (D) Period histogram, constructed from the data used in C, demonstrates the phase-locking ability of neurons to individual stimulus cycles. (E) PSTH constructed using spike trains in response to a sinusoidally amplitude-modulated (SAM) CF-tone (50-Hz modulation frequency, 0-dB modulation depth, 35 dB SPL) from an AN fiber (CF = 1.4 kHz, medium SR). (F) Period histogram (for one modulation period) constructed from the data used in E. The response to the SAM tone follows both the modulator (envelope, red, panels E and F) and the carrier (temporal fine structure, i.e., the rapid fluctuations in the signal; blue, panel F). Bin width = 0.5 ms for histograms in C-F. The numbers of stimulus repetitions for C and E were 300 and 16, respectively.
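The PSTH and period histogram in panels C-F can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names, the toy spike trains, and the use of vector strength as a phase-locking summary are assumptions, and the toy data do not model onset or adaptation.

```python
import numpy as np

def psth(spike_trains, duration, bin_width=0.5e-3):
    """Peristimulus time histogram in spikes/s, averaged over repetitions."""
    n_bins = int(round(duration / bin_width))
    edges = np.arange(n_bins + 1) * bin_width
    counts, _ = np.histogram(np.concatenate(spike_trains), bins=edges)
    return edges[:-1], counts / (len(spike_trains) * bin_width)

def period_histogram(spike_trains, freq, n_bins=32):
    """Histogram of spike phases (in cycles, 0 to 1) within one period of freq."""
    phase = (np.concatenate(spike_trains) * freq) % 1.0
    counts, edges = np.histogram(phase, bins=n_bins, range=(0.0, 1.0))
    return edges[:-1], counts

def vector_strength(spike_trains, freq):
    """Length of the mean phase vector (0 = no locking, 1 = perfect locking)."""
    phase = 2 * np.pi * np.concatenate(spike_trains) * freq
    return np.abs(np.mean(np.exp(1j * phase)))

# Hypothetical toy data: 300 repetitions of spikes locked to a 730-Hz tone.
rng = np.random.default_rng(0)
cf, n_cycles = 730.0, 73            # 73 tone cycles fit in 100 ms
trains = []
for _ in range(300):
    cycles = rng.choice(n_cycles, size=20, replace=False)
    jitter = np.clip(rng.normal(0.25, 0.05, size=20), 0.05, 0.45)  # preferred phase
    trains.append(np.sort((cycles + jitter) / cf))

t, rate = psth(trains, duration=0.1)
ph, counts = period_histogram(trains, cf)
vs = vector_strength(trains, cf)
```

Here `rate` plays the role of the histogram in panel C and `counts` that of panel D; for the SAM-tone case, passing the modulation frequency as `freq` gives a histogram over one modulation period, as in panel F.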
Fig 2.
apPSTHs can be directly compared to evoked potentials in response to the same stimulus.
(A) Time-domain waveforms for the difference FFR (blue) and mean difference PSTH [d(t), red] in response to a Danish speech stimulus, s3 (black). Mean d(t) was computed by taking the grand average of d(t)s from 246 AN fibers from 13 animals (CFs: 0.2 to 11 kHz). The difference FFR was estimated by subtracting FFRs to alternating stimulus polarities. (B) Spectra for the signals in A for a 100-ms segment (purple dashed lines in A). (C) Time-domain waveforms for the sum FFR (blue) and mean sum PSTH [s(t), red] for the same stimulus. Both responses show sharp onsets for plosive (/d/ and /g/) and fricative (/s/) consonants. (D) Spectra for the responses in C for the same segment considered in B. The mean s(t) was estimated as the grand average of s(t)s from 246 neurons. Sum FFR was estimated by halving the sum of the FFRs to both polarities. Stimulus intensity = 65 dB SPL.
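The sum and difference responses described above can be sketched in a few lines. The caption states that the sum is halved; halving the difference as well is an assumption made here for symmetry, and the half-wave-rectified toy responses are hypothetical stand-ins for PSTHs or FFRs.

```python
import numpy as np

def sum_and_difference(pos, neg):
    """Return s(t) = (pos + neg) / 2 and d(t) = (pos - neg) / 2.

    pos and neg are responses (PSTHs or FFRs) to the two stimulus polarities.
    The factor of 1/2 on the difference is an assumption of this sketch.
    """
    pos = np.asarray(pos, dtype=float)
    neg = np.asarray(neg, dtype=float)
    return 0.5 * (pos + neg), 0.5 * (pos - neg)

# Toy example: half-wave-rectified responses to a 500-Hz tone and its inverse.
fs = 10_000
t = np.arange(0, 0.1, 1 / fs)
carrier = np.sin(2 * np.pi * 500 * t)
p = np.maximum(carrier, 0.0)    # response to positive polarity
n = np.maximum(-carrier, 0.0)   # response to negative polarity
s, d = sum_and_difference(p, n)
```

In this toy case d(t) recovers the carrier (scaled by 1/2), whereas s(t) is a full-wave-rectified version of it and therefore carries energy at twice the carrier frequency.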
Fig 3.
Lower spectral-estimation variance can be achieved using apPSTHs (with multiple tapers) compared with difcor correlograms.
(A) Spectrum for the 100-ms segment in the speech sentence s3 (F0 ∼ 98 Hz, F1 ∼ 630 Hz) used for analysis. (B) Example spectra for an AN fiber (CF = 900 Hz, high SR) with spikes from 25 randomly chosen repetitions per polarity. The first two discrete-prolate spheroidal sequences were used as tapers corresponding to a time-bandwidth product of 3 to estimate D(f), the spectrum of d(t). No taper (i.e., rectangular window) was used to estimate the difcor spectrum. The AN fiber responded to the 6th, 7th, and 8th harmonics of the fundamental frequency. (C) Error-bar plots for fractional power (PowerFrac) at the frequency (green triangle) closest to the 6th harmonic. Error bars were computed for 12 randomly and independently drawn sets of 25 repetitions per polarity. The same spikes were used to compute the spectra for d(t) (blue) and difcor (red). (D) Diamonds denote the ratio of variances for the difcor-based estimate to the d(t)-based estimate. This ratio was greater than 1 (i.e., above the dashed gray line) for all units considered, which demonstrates that the variance for the multitaper-d(t) spectrum was lower than the difcor-spectrum variance. AN fibers with CFs between 0.3 and 2 kHz and with at least 75 repetitions per polarity of the stimulus were considered. Bin width = 0.1 ms for PSTHs. Sampling frequency = 10 kHz for FFRs. Stimulus intensity = 65 dB SPL.
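A multitaper estimate of D(f) along the lines described in panel B might look like the sketch below. It assumes that the stated time-bandwidth product corresponds to the `NW` argument of SciPy's `dpss`, and it uses a synthetic stand-in for d(t) (harmonics 6 to 8 of a 98-Hz F0 plus noise) rather than recorded data.

```python
import numpy as np
from scipy.signal.windows import dpss

def multitaper_psd(x, fs, nw=3.0, n_tapers=2):
    """Average the periodograms obtained with the first n_tapers DPSS tapers."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    tapers = dpss(len(x), nw, Kmax=n_tapers)        # shape (n_tapers, len(x)), unit energy
    spectra = np.abs(np.fft.rfft(tapers * x, axis=1)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    return freqs, spectra.mean(axis=0) / fs          # two-sided PSD scaling

# Hypothetical d(t): 100-ms segment at 0.1-ms bin width (fs = 10 kHz).
fs, f0 = 10_000, 98.0
t = np.arange(0, 0.1, 1 / fs)
rng = np.random.default_rng(1)
d = (0.5 * np.cos(2 * np.pi * 6 * f0 * t)
     + 1.0 * np.cos(2 * np.pi * 7 * f0 * t)
     + 0.5 * np.cos(2 * np.pi * 8 * f0 * t)
     + 0.1 * rng.standard_normal(t.size))

freqs, psd = multitaper_psd(d, fs)
peak_freq = freqs[np.argmax(psd)]
```

Averaging across tapers trades spectral resolution (here roughly ±30 Hz for a 100-ms window) for lower variance, which is the property compared against the untapered difcor spectrum in panels C and D.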
Fig 4.
Modulation-domain internal representations for speech coding can be obtained from PSTH-based envelopes.
The PSTH response [p(t)] from one AN fiber (CF = 290 Hz, SR = 12 spikes/s) is shown. (A) Time-domain waveforms for the stimulus (gray) and p(t) (blue). (B) Output of a modulation filter bank after the processing of p(t). Modulation filters were zero-phase, fourth-order, and octave-wide IIR filters. Center frequencies (Fm) for these filters ranged from 2 to 128 Hz (octave spacing), similar to those used in recent psychophysically based SI models (e.g., [16]). PSTH bin width = 0.5 ms. Number of stimulus repetitions = 15. Stimulus intensity = 60 dB SPL.
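A modulation filter bank of the kind described in panel B can be sketched as below. The caption does not name the filter family, so Butterworth band-pass filters are an assumption, as is the reading of "fourth-order" as `butter(2, ...)` in band-pass mode; note that forward-backward (zero-phase) filtering squares the magnitude response.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def modulation_filterbank(p, fs, centers=2.0 ** np.arange(1, 8)):
    """Filter p(t) with octave-wide, zero-phase band-pass filters (2 to 128 Hz)."""
    outputs = []
    for fc in centers:
        lo, hi = fc / np.sqrt(2), fc * np.sqrt(2)        # octave-wide band around fc
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        outputs.append(sosfiltfilt(sos, p))               # forward-backward: zero phase
    return np.asarray(centers), np.vstack(outputs)

# Hypothetical p(t): firing rate modulated at 16 Hz, 0.5-ms bins (fs = 2 kHz).
fs = 2000
t = np.arange(0, 2, 1 / fs)
p = 100.0 * (1 + np.cos(2 * np.pi * 16 * t))

centers, bands = modulation_filterbank(p, fs)
rms = np.sqrt(np.mean(bands ** 2, axis=1))
best_center = centers[np.argmax(rms)]
```

Each row of `bands` corresponds to one trace in panel B; for this toy input the 16-Hz channel carries most of the modulation energy.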
Fig 5.
Envelope-coding metrics should be spectrally specific to avoid artifacts due to rectifier distortion and neural stochasticity.
Simulated responses for 24 AN fibers (CFs log-spaced between 250 Hz and 8 kHz) were obtained from a computational model (parameters listed in S2 Table) with SAM tones at CF (modulation frequency, Fm = 20 Hz; 0-dB (100%) modulation depth) as stimuli. Stimulus intensity ∼ 65 dB SPL. S(f) (blue) and E(f) (red) for three example model fibers with CFs = 1.0, 1.7, and 4 kHz (panels A-C) illustrate the relative merits of s(t) and e(t), and the potential for rectifier distortion to corrupt envelope coding metrics. d(t) was band-limited to a 200-Hz band near the carrier frequency (Fc) for each fiber prior to estimating e(t) from the Hilbert transform of d(t). (A) For the 1-kHz fiber, S(f) and E(f) are nearly identical in the Fm band. S(f) is substantially affected by rectifier distortion at 2 × CF, which can be ignored using spectrally specific analyses. (B) The two envelope spectra are largely similar near the Fm bands since phase-locking near the carrier (1.7 kHz) is still strong (panel D). Rectifier distortion in S(f) is greatly reduced since phase-locking at twice the carrier frequency (3.4 kHz) is weak. (C) Fm-related power in E(f) and rectifier distortion in S(f) are greatly reduced as the frequencies for the carrier and twice the carrier are both above the phase-locking roll-off. (D) The strength of modulation coding was evaluated as the sum of the power near the first three harmonics of Fm (gray boxes in panels A-C) for S(f) (blue squares) and E(f) (red circles). VSpp was also quantified to CF-tones for each fiber (black dashed line, right y-axis). (E) Rectifier distortion (RD) analysis was limited to the second harmonic of the carrier (brown boxes in panels A-C). RD was quantified as the sum of power in 10-Hz bands around twice the carrier frequency (2 × CF) and the adjacent sidebands (2 × CF ± Fm). RD for E(f) is not shown because E(f) was virtually free from RD. (F) Raw and adjusted sumcor peak-heights across CFs. The sumcors were adjusted by band-pass filtering them in the three Fm-related bands. Large differences between the two metrics at low frequencies indicate that the raw sumcor peak-heights are confounded by rectifier distortion at these frequencies. (G) Relation of raw and adjusted sumcor peak-heights to Fm-related power (from panel D) in S(f). Good correspondence between Fm-related power in S(f) and adjusted sumcor peak-height supports the use of spectrally specific envelope analyses. (H) The difference between raw and adjusted sumcor peak-heights was largely accounted for by RD power. However, this difference was always greater than zero, suggesting broadband metrics can also be biased because of noise related to neural stochasticity.
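The contrast between s(t) and e(t) with respect to rectifier distortion can be reproduced with idealized signals. The sketch below is an illustration under stated assumptions: responses are modeled as half-wave-rectified SAM tones (no neural noise or phase-locking roll-off), band-limiting is done with a brick-wall FFT mask rather than whatever filter the authors used, and the helper names are hypothetical.

```python
import numpy as np

def analytic_in_band(x, fs, f_lo, f_hi):
    """Analytic signal of x restricted to the positive-frequency band [f_lo, f_hi]."""
    X = np.fft.fft(x)
    f = np.fft.fftfreq(len(x), 1.0 / fs)
    keep = (f >= f_lo) & (f <= f_hi)          # positive frequencies inside the band
    return np.fft.ifft(2.0 * X * keep)

def band_power(x, fs, f_center, half_bw=5.0):
    """Sum of spectral power in a 10-Hz band (default) centered on f_center."""
    X = np.fft.rfft(x) / len(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    return np.sum(np.abs(X[np.abs(f - f_center) <= half_bw]) ** 2)

fs, fc, fm = 20_000, 1000.0, 20.0
t = np.arange(fs) / fs                         # 1 s of signal
sam = (1 + np.cos(2 * np.pi * fm * t)) * np.sin(2 * np.pi * fc * t)
p, n = np.maximum(sam, 0.0), np.maximum(-sam, 0.0)   # idealized polarity responses
s, d = 0.5 * (p + n), 0.5 * (p - n)

# e(t): Hilbert envelope of d(t) band-limited to a 200-Hz band around the carrier.
e = np.abs(analytic_in_band(d, fs, fc - 100.0, fc + 100.0))

# Rectifier distortion: power near 2 x Fc and the adjacent sidebands (2 x Fc +/- Fm).
rd_bands = (2 * fc - fm, 2 * fc, 2 * fc + fm)
rd_s = sum(band_power(s, fs, f0) for f0 in rd_bands)
rd_e = sum(band_power(e, fs, f0) for f0 in rd_bands)
```

In this idealized case e(t) equals the stimulus envelope scaled by 1/2 and has essentially no power near 2 × Fc, whereas s(t) does, which mirrors the pattern in panels A and E.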
Fig 6.
Compared to d(t), the apPSTH ϕ(t) provides a better TFS representation.
(A-C) Spectra of d(t) and ϕ(t) for the same three simulated AN fiber responses for which ENV spectra were shown in Fig 5. D(f) has substantial power at CF (black triangle), as well as at lower (purple circle) and upper (purple square) sidebands. Φ(f), the spectrum of ϕ(t), shows maximum power concentration at CF (carrier frequency), with greatly reduced sidebands. (D) Ratio of power at CF (carrier, black triangle in panels A-C) to power at lower sideband (LSB, Fc − Fm, purple circles in panels A-C). (E) Ratio of power at CF (carrier) to power at upper sideband (USB, Fc + Fm, purple squares in A-C). ϕ(t) highlights the carrier and not the sidebands, and thus, compared to d(t), ϕ(t) is a better representation of the true TFS response.
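The carrier-to-sideband ratios in panels D and E can be illustrated with an idealized d(t). In this sketch ϕ(t) is taken as the cosine of the Hilbert phase of d(t) with no amplitude scaling, which is an assumption about normalization; the toy d(t) is a noise-free SAM tone rather than a model or recorded response.

```python
import numpy as np
from scipy.signal import hilbert

def hilbert_phase_psth(d):
    """phi(t): cosine of the instantaneous (Hilbert) phase of d(t), unscaled."""
    return np.cos(np.angle(hilbert(d)))

def power_at(x, fs, f):
    """Spectral power at the DFT bin closest to frequency f."""
    X = np.fft.rfft(x) / len(x)
    return np.abs(X[int(round(f * len(x) / fs))]) ** 2

fs, fc, fm = 20_000, 1000.0, 20.0
t = np.arange(fs) / fs                                   # 1 s, so bins fall on integer Hz
d = 0.5 * (1 + np.cos(2 * np.pi * fm * t)) * np.sin(2 * np.pi * fc * t)
phi = hilbert_phase_psth(d)

# Carrier-to-lower-sideband power ratios (Fc versus Fc - Fm).
ratio_d_lsb = power_at(d, fs, fc) / power_at(d, fs, fc - fm)
ratio_phi_lsb = power_at(phi, fs, fc) / power_at(phi, fs, fc - fm)
```

For a fully modulated tone the ratio for d(t) is 4 (about 6 dB), while discarding the envelope leaves ϕ(t) with almost all of its power at the carrier, so its ratio is orders of magnitude larger.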
Fig 7.
Spectral-domain application of various apPSTHs to spike trains recorded in response to natural speech.
Example of spectral analyses of spike trains recorded from an AN fiber (CF = 1.1 kHz, SR = 64 spikes/s) in response to a vowel snippet of a speech stimulus (s3). (A) Time-domain representation of p(t), n(t), and the stimulus (Stim). n(t) is reflected across the x-axis for display. Signals outside the analysis window are shown in gray. PSTH bin width = 0.1 ms. Number of stimulus repetitions per polarity = 50. Stimulus intensity = 65 dB SPL. (B) Stimulus spectrum (blue, left y-axis). In panels B-E, the frequency-threshold tuning curve (TC θ, black) of the neuron is plotted on the right y-axis. (C) P(f), which shows comparable energy at F0 (130 Hz) and F2 (1.2 kHz). (D) D(f) and S(f). (E) Φ(f) and E(f). Both S(f) and E(f) show peaks near F0. Similarly, both D(f) and Φ(f) show good F2 representations, although D(f) is confounded by the strong F0-related modulation in e(t) as d(t) = e(t) × ϕ(t). The significant representation of F0 in this near-F2 AN fiber response to a natural vowel is inconsistent with the synchrony-capture phenomenon for synthetic stationary vowels.
Fig 8.
p(t), n(t), and s(t) have robust representations of the onset response, whereas e(t) and d(t) do not.
Response of a high-frequency fiber (CF = 5.8 kHz, SR = 70 spikes/s) to a fricative portion (/s/) of the speech stimulus, s3. Stimulus intensity = 65 dB SPL. (A) Stimulus (black, labeled Stim), p(t) (blue), and n(t) (red, reflected across the x-axis). PSTH bin width = 0.5 ms. Number of stimulus repetitions per polarity = 50. (B) The sum envelope, s(t). (C) The difference PSTH, d(t). (D) The Hilbert-envelope PSTH, e(t). Since the onset envelope is a polarity-tolerant response, all PSTHs capture the response onset except for d(t) and e(t).
Table 1.
apPSTH-taxonomy for ENV & TFS.
Fig 9.
More accurate estimates of power along a spectrotemporal trajectory can be obtained using frequency demodulation.
(A) Spectrogram of a synthesized example signal that mimics a single speech-formant transition. The 2-s signal consists of two stationary tones (1.4 and 2 kHz) and a linear frequency sweep (400 to 800 Hz). Red dashed lines outline the spectrotemporal trajectory along which we want to compute the power. Both positive and negative frequencies are shown for completeness. (B) Fourier-magnitude spectrum of the original signal. Energy related to the target spectrotemporal trajectory is spread over a wide frequency range (400 to 800 Hz, red line). (C) Spectrogram of the frequency-demodulated signal, where the target trajectory was used for demodulation (i.e., shifted down to 0 Hz). (D) Magnitude-DFT of the frequency-demodulated signal. The desired trajectory is now centered at 0 Hz, with its (spectral) energy spread limited only by the signal duration (i.e., equal to the inverse of signal duration), and hence, is much narrower.
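The frequency-demodulation step in panels C and D amounts to multiplying the signal by a complex exponential whose phase is the running integral of the target trajectory. The sketch below rebuilds a signal like the one in panel A; the sampling rate and the equal component amplitudes are assumptions.

```python
import numpy as np

fs = 8000
t = np.arange(0, 2, 1 / fs)                       # 2-s signal

# Target trajectory: linear sweep from 400 to 800 Hz over 2 s.
f_traj = 400.0 + 200.0 * t
phase = 2 * np.pi * np.cumsum(f_traj) / fs        # running integral of frequency

x = (np.cos(phase)                                 # frequency sweep
     + np.cos(2 * np.pi * 1400 * t)                # stationary tone, 1.4 kHz
     + np.cos(2 * np.pi * 2000 * t))               # stationary tone, 2 kHz

# Frequency demodulation: shift the target trajectory down to 0 Hz.
x_demod = x * np.exp(-1j * phase)

freqs = np.fft.fftfreq(t.size, 1 / fs)
mag_orig = np.abs(np.fft.fft(x)) / t.size
mag_demod = np.abs(np.fft.fft(x_demod)) / t.size

peak_freq_demod = freqs[np.argmax(mag_demod)]
in_sweep_band = (freqs >= 400) & (freqs <= 800)
peak_in_band_orig = mag_orig[in_sweep_band].max()
peak_at_dc_demod = mag_demod[0]
```

Before demodulation the sweep energy is smeared across the 400 to 800 Hz band; afterwards it collapses into the bin at 0 Hz, while the formerly stationary tones become sweeps and spread out instead.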
Fig 10.
The harmonicgram can be used to visualize formant tracking in synthesized nonstationary speech.
Neural harmonicgrams for fibers with a CF below 1 kHz (A, N = 16) and for fibers with a CF between 1 and 2.5 kHz (B, N = 29) in response to the dynamic vowel, s2. Stimulus intensity = 65 dB SPL. The formant frequencies mimic formant trajectories of a natural vowel [21]. A 20-Hz bandwidth was employed to low-pass filter the demodulated signal for each harmonic. The harmonicgram for each AN-fiber pool was constructed by averaging the Hilbert-phase PSTHs of all AN fibers within the pool. PSTH bin width = 50 μs. Data are from one chinchilla. The black, purple, and red lines represent the fundamental frequency (F0/F0), the first formant (F1/F0) and the second formant (F2/F0) contours, respectively. The time-varying formant frequencies were normalized by the time-varying F0 to convert the spectrotemporal representation into a harmonicgram.
Fig 11.
The harmonicgram can be used to quantify the coding of time-varying stimulus features at superior spectrotemporal resolution compared to the spectrogram.
Harmonicgrams were constructed using ϕ(t) for the same two AN-fiber pools described in Fig 10. PSTH bin width = 50 μs. A 9-Hz bandwidth was employed to low-pass filter the demodulated signal for each harmonic. The data were collected from one chinchilla in response to the speech stimulus, s3. Stimulus intensity = 65 dB SPL. A 500-ms segment corresponding to the voiced phrase “amle” was considered. (A, B) Spectrograms constructed from the average ϕ(t) for the low-CF pool (A) and the medium-CF pool (B). (C, D) Average harmonicgrams for the same sets of fibers as in A and B, respectively. Warm (cool) colors represent regions of high (low) power. The first-formant contour (F1 in A and B, F1/F0 in C and D) is highlighted in purple. The second-formant contour (F2 in A and B, F2/F0 in C and D) is highlighted in red. Trajectories of the fundamental frequency (black in A and B, right y-axis) and the formants were obtained using Praat [67]. (E, F) Harmonicgram power near the first formant (purple) and the second formant (red) for the low-CF pool (E) and the medium-CF pool (F). Harmonicgram power for each formant at any given time (t) was computed by summing the power in the three F0 harmonics closest to the normalized formant contour [e.g., F1(t)/F0(t)] at that time. The noise floor (NF) for power was estimated as the sum of power for the 29th, 30th, and 31st harmonics of F0 because the frequencies corresponding to these harmonics were well above the CFs of both fiber pools. These time-varying harmonicgram power metrics are spectrally specific to F0 harmonics and are computed at a high temporal sampling rate (the same as that of the original signal). This spectrotemporal resolution is much better than that obtainable with spectrograms.
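A harmonicgram of the kind described above can be sketched by frequency-demodulating the response at each multiple of the time-varying F0 and low-pass filtering the result. The second-order Butterworth low-pass filter, the function name, and the synthetic input are assumptions; the caption specifies only the 9-Hz bandwidth and the 50-μs bin width.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def harmonicgram(x, f0, fs, n_harmonics=8, bandwidth=9.0):
    """Power along each F0 harmonic: rows are harmonics 1..n_harmonics, columns time."""
    phase0 = 2 * np.pi * np.cumsum(f0) / fs               # running phase of F0(t)
    sos = butter(2, bandwidth, btype="lowpass", fs=fs, output="sos")
    rows = []
    for k in range(1, n_harmonics + 1):
        demod = x * np.exp(-1j * k * phase0)               # shift k * F0(t) down to 0 Hz
        # Zero-phase low-pass, applied to real and imaginary parts separately.
        lp = sosfiltfilt(sos, demod.real) + 1j * sosfiltfilt(sos, demod.imag)
        rows.append(np.abs(lp) ** 2)
    return np.vstack(rows)

# Hypothetical response: F0 glides from 100 to 120 Hz; harmonics 5 and 3 are present.
fs = 20_000                                                 # 50-us bins
t = np.arange(0, 0.5, 1 / fs)
f0 = 100.0 + 40.0 * t
phase0 = 2 * np.pi * np.cumsum(f0) / fs
x = np.cos(5 * phase0) + 0.3 * np.cos(3 * phase0)

hg = harmonicgram(x, f0, fs)
center = slice(t.size // 4, 3 * t.size // 4)                # avoid filter edge transients
mean_power = hg[:, center].mean(axis=1)
strongest_harmonic = int(np.argmax(mean_power)) + 1
```

Each row keeps the sampling rate of the input, so power near a formant can be read out at every time point by summing the rows closest to, for example, F1(t)/F0(t); values near the segment edges are affected by filter transients and are best excluded.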
Fig 12.
The harmonicgram of the FFR to natural speech shows robust dynamic tracking of formant trajectories, similar to the AN-fiber harmonicgram.
Comparison of the spectrogram (A) and the harmonicgram (B) for the FFR recorded in response to the same stimulus, s3, that was used to analyze apPSTHs in Fig 11. Stimulus intensity = 65 dB SPL. Lines and colormap are the same as in Fig 11. These plots were constructed using the difference FFR, which reflects the neural coding of both stimulus TFS and ENV. To highlight the coding of stimulus TFS, the Hilbert-phase [ϕ(t)] FFR can be used instead of the difference FFR (S5 Fig). The FFR harmonicgram (B) is strikingly similar to the AN-fiber harmonicgrams in Fig 11C and 11D in that the representations of the first two formants are robust. The FFR data here and the spike-train data used in Fig 11 were obtained from the same animal.