Envelope reconstruction of speech and music highlights stronger tracking of speech at low frequencies

doi:10.1371/journal.pcbi.1009358

Fig 1.

(A) 20 seconds of example dB envelopes for each stimulus type. (B) Power spectra of the dB envelopes using Welch’s method with a 16 s Hamming window and half-overlap after normalizing by the average EEG spectrum for all subjects (see Methods). Lines indicate the median across stimuli of each type, and shaded regions indicate 95% quantiles of the distribution of 1000 bootstrapped median values.
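
The spectral estimate and bootstrapped band described in (B) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `welch_psd` and `bootstrap_median_band` are hypothetical, the normalization by the average EEG spectrum is omitted, and the "95% quantiles" are assumed to mean the 2.5th to 97.5th percentiles of the bootstrapped medians.

```python
import numpy as np

def welch_psd(x, fs, win_s=16.0):
    """Welch power spectral density with a Hamming window and half-overlap."""
    nper = int(round(win_s * fs))          # samples per segment
    step = nper // 2                       # 50% overlap between segments
    win = np.hamming(nper)
    starts = range(0, len(x) - nper + 1, step)
    segs = np.array([x[s:s + nper] for s in starts])
    segs = segs - segs.mean(axis=1, keepdims=True)  # remove each segment's mean
    spec = np.abs(np.fft.rfft(segs * win, axis=1)) ** 2
    spec /= fs * np.sum(win ** 2)          # density scaling
    # One-sided spectrum: double every bin except DC (and Nyquist if present).
    if nper % 2 == 0:
        spec[:, 1:-1] *= 2
    else:
        spec[:, 1:] *= 2
    freqs = np.fft.rfftfreq(nper, d=1.0 / fs)
    return freqs, spec.mean(axis=0)        # average over segments

def bootstrap_median_band(spectra, n_boot=1000, seed=0):
    """Median across stimuli plus an assumed 2.5%-97.5% band of bootstrapped medians.

    spectra: array of shape (n_stimuli, n_freqs).
    """
    rng = np.random.default_rng(seed)
    n = spectra.shape[0]
    idx = rng.integers(0, n, size=(n_boot, n))   # resample stimuli with replacement
    boot_medians = np.median(spectra[idx], axis=1)
    lo, hi = np.quantile(boot_medians, [0.025, 0.975], axis=0)
    return np.median(spectra, axis=0), lo, hi
```

With a 16 s window the frequency resolution is 1/16 Hz, which is what allows the low-frequency end of the envelope spectra to be resolved.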

Fig 2.

(A) Diagram of the stages for fitting the PCA & spline model. Throughout this study, the model is trained on all trials with one left out and tested on the left-out trial. (B) The difference in reconstruction accuracy (Pearson’s r) between the PCA & spline method and a standard approach based on regularized linear regression, computed for each left-out trial; negative values mean that the standard approach performs better. Error bars show the interquartile range of the reconstruction accuracy differences. Of all hyperparameter pairs, only 64 PCs with spline knots sampled at 32 Hz performed no differently from the standard approach. (C) The combined effect of removing the moving average and using basis splines restricts the frequency content of the reconstruction to a three-octave range; for a 500 ms window, this range is 2–16 Hz.
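
The leave-one-trial-out procedure in (A) and the regularized linear regression baseline in (B) can be sketched as below. This is an illustration under stated assumptions rather than the study's implementation: the names `fit_ridge`, `leave_one_trial_out`, and the regularization value `lam` are hypothetical, each EEG trial is assumed to be an already prepared design matrix (samples by features), and the PCA and spline stages are not shown.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation between two 1-D arrays."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def fit_ridge(X, y, lam):
    """Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def leave_one_trial_out(eeg_trials, env_trials, lam=1.0):
    """Train on all trials but one, test on the held-out trial, repeat for each trial.

    eeg_trials: list of (n_samples, n_features) arrays (assumed design matrices).
    env_trials: list of (n_samples,) envelope arrays.
    Returns one reconstruction accuracy (Pearson's r) per trial.
    """
    scores = []
    for k in range(len(eeg_trials)):
        X_train = np.vstack([x for i, x in enumerate(eeg_trials) if i != k])
        y_train = np.concatenate([y for i, y in enumerate(env_trials) if i != k])
        w = fit_ridge(X_train, y_train, lam)
        reconstruction = eeg_trials[k] @ w
        scores.append(pearson_r(reconstruction, env_trials[k]))
    return np.array(scores)
```

Because every trial is held out exactly once, the procedure yields one accuracy per trial, which is what allows the per-trial differences between methods shown in (B).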

Fig 3.

(A) For each stimulus type (6–7 trials per subject) the model was iteratively fit to all trials with one left out and tested on the left-out trial. To obtain a null distribution of reconstruction accuracies, we repeated this procedure, leaving out one trial at a time, after randomly circularly shifting the envelopes in each trial by the same amount. This was repeated 50 times for each stimulus type. (B) Schematic of our expectation for how reconstruction accuracy varies with frequency. We expected that EEG may be tracking a particular frequency range of the stimulus envelope. This could be identified by varying the range of the three-octave model bandwidth. The reconstruction accuracy increases from chance as the model bandwidth overlaps the relevant frequency range, and plateaus when the model bandwidth is fully contained within the relevant frequency range. (C) As lower frequencies are introduced into the stimulus envelope, the variance of the null distribution increases. Because of this, we z-scored the true trial-by-trial reconstruction accuracies relative to the null distribution to ease cross-frequency comparisons. (D) Median reconstruction accuracies across subjects and trials. Shaded regions show the 95% quantiles of the distribution of 1000 median values calculated using bootstrap resampling with replacement. Thicker lines indicate frequency ranges where median z-scored reconstruction accuracies were significantly greater than zero (one-tailed Wilcoxon signed-rank test with Bonferroni correction for 40 comparisons, p < 0.001). (E) Throughout the frequency range over which speech was tracked, speech reconstructions were significantly better than reconstructions of all the musical stimuli, with the difference peaking in the 0.5–4 Hz range. Thick lines indicate differences in reconstruction accuracy that are significantly greater than zero (two-tailed permutation test with Bonferroni correction for 40 comparisons, p < 0.001).
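
The null distribution in (A) and the z-scoring in (C) can be sketched as follows. The helper names are hypothetical, and `corr_score` is only a stand-in for the full leave-one-trial-out model fit so that the example runs on its own; the choice of a sample standard deviation for the z-score is also an assumption.

```python
import numpy as np

def circular_shift_null(score_fn, eeg_trials, env_trials, n_perm=50, seed=0):
    """Null accuracies from envelopes circularly shifted by the same random amount.

    score_fn(eeg_trials, env_trials) must return one accuracy per trial.
    """
    rng = np.random.default_rng(seed)
    n_samples = min(len(e) for e in env_trials)
    null = []
    for _ in range(n_perm):
        shift = int(rng.integers(1, n_samples))      # one shift for all trials
        shifted = [np.roll(e, shift) for e in env_trials]
        null.append(score_fn(eeg_trials, shifted))
    return np.concatenate(null)

def zscore_against_null(true_scores, null_scores):
    """Express true accuracies in units of the null distribution's spread."""
    null_scores = np.asarray(null_scores)
    return (np.asarray(true_scores) - null_scores.mean()) / null_scores.std(ddof=1)

def corr_score(eeg_trials, env_trials):
    """Stand-in scorer: per-trial correlation of a single EEG trace with the envelope."""
    return np.array([np.corrcoef(x, y)[0, 1] for x, y in zip(eeg_trials, env_trials)])

# Synthetic demonstration: each "EEG" trace is its envelope plus noise.
rng = np.random.default_rng(1)
envs = [rng.standard_normal(2000) for _ in range(6)]
eegs = [e + 0.5 * rng.standard_normal(2000) for e in envs]
true_scores = corr_score(eegs, envs)
null_scores = circular_shift_null(corr_score, eegs, envs, n_perm=50)
z_scores = zscore_against_null(true_scores, null_scores)
```

Circular shifting preserves the spectrum of each envelope while breaking its alignment with the EEG, so the null inherits the frequency-dependent variance noted in (C); dividing by the null's spread is what makes accuracies comparable across model bandwidths.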

Fig 4.

(A) Model weights, median across subjects and averaged across channels. Models are color-coded based on their frequency range (see to the right of the plots). The model of the range 0.0625–0.5 Hz was excluded because none of the stimuli exhibited significant neural tracking in this range, and the large values for the weights obscured the trends in the other models. (B) Mean and standard error across subjects of model weights for two frequency ranges (see S9 Fig for the models for individual subjects). (C) and (D) show the topographies of the model weights averaged over the range of delays corresponding to peaks and troughs in the 4–32 Hz and 0.5–4 Hz models, respectively. Note that the range of delays varies across stimulus types in the 4–32 Hz model in order to capture similar peaks and troughs.

Fig 5.

(A) Difference between the trial-by-trial reconstruction accuracies of the stimulus-specific and stimulus-general models. Lines show the median reconstruction accuracy differences across subjects and trials. Shaded regions show the 95% range of bootstrap-resampled median values. Significance values are based on a Wilcoxon signed-rank test with Bonferroni correction for 32 comparisons (*** p < 0.001, ** p < 0.01). The stimulus-specific speech model outperformed the stimulus-general model at 0.5–4 Hz and 1–8 Hz (blue text), while the other stimulus-specific models performed worse than the stimulus-general model for most frequency ranges (black text, solid lines at top). (B, C) The stimulus-general model was quite similar to the stimulus-specific models in its temporal (B) and spatial (C) pattern. (D) We assumed that the stimulus-general model is a scaled and phase-shifted version of each of the stimulus-specific models, so by circularly shifting and scaling the model we could quantify the difference between the model weights. The scaling was computed separately for each EEG channel, but we assumed the shift would be identical for all EEG channels. (E) R² fits of the scaled and shifted stimulus-general model to each stimulus-specific model on a trial-by-trial basis, based on the summed errors across all EEG channels. Solid black lines show the median values across all trials and subjects. The grey lines to the left of each set of black dots designate the 5% and 95% range of the chance R² distribution. Red asterisks show the stimulus types for which the fits were significantly better than chance (Wilcoxon rank-sum, p < 0.001). (F) Distribution of circular shifts plotted identically to E. Red asterisks show distributions whose medians are significantly different from zero (Wilcoxon signed-rank, p < 0.001). (G) Above, topography of scaling factors for vocals and speech. The scaling factors at channels Fz and Pz were then compared for each trial. Below, individual scaling factor differences (Pz–Fz) for each trial and subject for vocals (magenta) and speech (blue). Arrows in each plot indicate individual points that were outside of the y-axis limits in the plot. Comparisons between vocals and speech for each frequency range are based on a Wilcoxon rank-sum test. Comparisons between frequency ranges are based on a signed-rank test.
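
The scale-and-shift comparison described in (D) and (E) can be sketched as a search over circular shifts with a least-squares scale per channel. This is one plausible reading of the caption, not the authors' procedure: the function name `fit_shift_and_scale` is hypothetical, the exhaustive search over all shifts is an assumption, and R² is assumed to be computed against each channel's mean with errors summed over channels.

```python
import numpy as np

def fit_shift_and_scale(general, specific):
    """Fit specific ~ scale[c] * circshift(general, shift), one shift for all channels.

    general, specific: (n_channels, n_delays) arrays of model weights.
    Returns (shift, scale_per_channel, r_squared).
    """
    n_delays = general.shape[1]
    best = None
    for shift in range(n_delays):
        g = np.roll(general, shift, axis=1)
        # Least-squares scale for each channel: <g_c, s_c> / <g_c, g_c>.
        # (Assumes no channel of `general` is all zeros.)
        scale = np.sum(g * specific, axis=1) / np.sum(g * g, axis=1)
        sse = np.sum((specific - scale[:, None] * g) ** 2)  # summed over channels
        if best is None or sse < best[0]:
            best = (sse, shift, scale)
    sse, shift, scale = best
    # Assumed R^2 definition: total variance taken about each channel's mean.
    sst = np.sum((specific - specific.mean(axis=1, keepdims=True)) ** 2)
    if shift > n_delays // 2:
        shift -= n_delays            # report as a signed circular shift
    return shift, scale, 1.0 - sse / sst
```

In this form the returned per-channel scale plays the role of the scaling factors whose topography is shown in (G), and the signed shift corresponds to the quantity whose distribution is plotted in (F).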

Fig 6.

(A) Using the EEG data recorded while subjects were listening to the rock songs, we trained and tested PCA & spline models on the dB envelopes for the vocals, guitar, bass, and drums individually. Z-scored reconstruction accuracies were quantified as in Fig 3A–3D. All instrument envelopes were reconstructed above chance when the model included frequencies above 2 Hz (Wilcoxon signed-rank test: p < 0.001 with Bonferroni correction for 40 comparisons). The full rock envelope, shown with a dashed black line, is equivalent to the values shown in Fig 3D. (B) Pairwise differences between the z-scored reconstruction accuracy for the full envelope and the envelope for each individual instrument. The z-scored reconstruction accuracies for drums were not significantly different from those for the full rock envelope (based on the multi-tracked recording with all instruments), except for the 8–64 Hz model, where reconstruction accuracies were slightly but significantly better than for the full rock envelope (Wilcoxon signed-rank test with Bonferroni correction for 40 comparisons: z = 3.28, p = 0.042). (C) Welch’s power spectral density of the reconstructions was computed for each stimulus and averaged across subjects. The noise floor of the power spectra is shown with dashed lines. (D) We then adjusted the power spectral density by subtracting the true spectrum from the average of the null spectra, which made the peaks associated with temporally coherent reconstructions across subjects clearer. The maximum values in the adjusted power spectral density were then identified relative to the expected tempo of the music (1x tempo) as well as 2x to 4x the tempo. (E) Each of the 10 rock stimuli is plotted in a different color, and each dot corresponds to 1x to 4x the music’s tempo with increasing frequency. The darkest blue line and dots correspond to the example stimulus shown in C and D.
