Fig 1.
Feature schematic showing stimulus input for encoding model.
A) We fit models predicting neural activity from auditory features (phonological features, the acoustic envelope, and pitch) and visual features (scene cuts and Gabor wavelet filters) of movie trailer stimuli presented in an audiovisual, audio-only, or visual-only format. B) Correlation between features across all movie trailers. Correlations were higher among phonological features (for example, sonorant+voiced), but the other auditory and visual features were largely uncorrelated. C) Examples of speaker-congruent and speaker-incongruent visual scenes in the movie trailer stimuli. The percentage of time for each of these event types across all movie trailers is shown in the title above. Example images similar to those shown in the experiment were generated using ChatGPT DALL-E (https://www.dall-efree.com/). Original images presented to participants are not shown here due to copyright restrictions.
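The feature correlation in panel B can be illustrated with a minimal sketch, assuming the features are arranged as a time-by-feature matrix and compared with Pearson correlation. The feature values below are synthetic and the function name `feature_correlation` is hypothetical; none of this is taken from the original analysis code.

```python
import numpy as np

def feature_correlation(features):
    """Pearson correlation between columns of a (time x features) matrix."""
    return np.corrcoef(features, rowvar=False)

# Synthetic stand-ins for stimulus features (assumed, not the real stimuli).
rng = np.random.default_rng(0)
n_samples = 2000
sonorant = (rng.random(n_samples) > 0.5).astype(float)
voiced = sonorant.copy()
flip = rng.random(n_samples) < 0.1           # most sonorants are voiced
voiced[flip] = 1 - voiced[flip]
envelope = rng.random(n_samples)             # independent acoustic feature
scene_cut = (rng.random(n_samples) < 0.02).astype(float)  # sparse visual events

features = np.column_stack([sonorant, voiced, envelope, scene_cut])
corr = feature_correlation(features)         # 4 x 4 symmetric matrix
```

In this toy example the two phonological columns are strongly correlated by construction, while the envelope and scene cut columns are close to uncorrelated with everything else, mirroring the qualitative pattern described in the caption.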
Fig 2.
Comparison between auditory and visual feature models to predict EEG.
A) Scatter plot comparing model performance (r-value) between the AV condition, in which the original audiovisual movie trailer was presented, and the audio-only (A) condition. In both cases, the same set of auditory features was used to predict the data. Each gray dot represents the encoding model performance in a single channel in a single participant. Channel FT8 is shown for subject MT0033 with the corresponding weight matrices for each condition (A condition: top left, AV condition: bottom right) and the associated correlation value for the channel. B) Grand average correlation values for the AV and A conditions plotted on a topographical map and averaged across all participants (n = 11). The topography of selective channels was similar for AV and A. C) Average difference in prediction performance between AV and A across all participants. D) Average correlation between the acoustic and linguistic feature weights from the A and AV models across all participants. The receptive field structure was similar over temporal, frontal, and central sensors. E) Scatter plot comparing model performance (r-value) between the AV condition and the visual-only (V) condition. The visual feature model used a combination of 10 Gabor wavelet filter principal components (PCs) and scene cut (SC) information. Each gray dot represents the encoding model performance in a single channel in a single participant. An example channel, P4, is shown for subject MT0029 with the corresponding weight matrices for each condition and the associated model performance (correlation value) for the channel. F) Grand average correlation values for the AV and V conditions plotted on a topographical map and averaged across all participants (n = 11). A similar spatial distribution of high model performance was observed regardless of condition. G) Same as C for visual feature models fit on AV and V data. H) Average correlation between the visual feature weights from the V and AV models across all participants. Receptive field structure was most similar over occipital sensors.
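The weight comparison in panels D and H can be sketched as a per-channel Pearson correlation between the receptive fields fit in two conditions. This is an illustrative sketch only: the array layout (lags x features x channels), the function name `weight_similarity`, and the random weights are assumptions rather than details given in the caption.

```python
import numpy as np

def weight_similarity(w_a, w_b):
    """Per-channel Pearson r between two weight arrays.

    Both inputs are assumed to have shape (n_lags, n_features, n_channels).
    """
    n_channels = w_a.shape[-1]
    a = w_a.reshape(-1, n_channels)
    b = w_b.reshape(-1, n_channels)
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    return (a * b).sum(axis=0) / np.sqrt((a ** 2).sum(axis=0) * (b ** 2).sum(axis=0))

# Hypothetical weights: the unimodal model is a noisy copy of the AV model.
rng = np.random.default_rng(1)
w_av = rng.standard_normal((60, 12, 64))
w_uni = w_av + 0.5 * rng.standard_normal(w_av.shape)

similarity = weight_similarity(w_av, w_uni)  # one value per channel
```

Averaging `similarity` across participants would give one value per sensor, which is the kind of quantity that could be drawn on a topographical map.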
Fig 3.
Cross-prediction analysis shows that responses generalize between unimodal and multimodal stimulus information, with stronger generalizability for visual than for auditory information.
A) Model performance for audio-only (A) test sets with A-only training data (x-axis) or audiovisual (AV) training data (y-axis), calculated as the linear correlation between predicted and actual held-out EEG test data. Features for this model included phonological features, the acoustic envelope, and pitch. Each dot represents an individual electrode for an individual EEG subject (64 channels × 11 participants). Dashed black line = unity line; red line = regression line. B) Model performance for the visual-only (V) test set with V-only training data or AV training data. Features for this model included the 10 Gabor PCs and scene cuts. Similar model performance was observed for both within- and cross-condition predictions, though this relationship was stronger between V and AV than between A and AV. C) Model performance for visual-only responses predicted from scene cuts alone or from Gabor features alone, compared with performance for the same individual feature in the audiovisual condition. D) Normalized correlation coefficient for each EEG channel between the audiovisual and visual-only conditions and between the audiovisual and audio-only conditions. Overall, single-trial bandpass-filtered EEG (the input to the model) was more correlated between the AV and V-only conditions than between the AV and A-only conditions, suggesting a strong influence of visual information on the EEG signals.
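The cross-prediction logic in panels A and B can be sketched as follows: fit one model on unimodal training data and another on AV training data, then score both on the same held-out unimodal test set. The sketch assumes a time-lagged ridge regression, which the caption does not specify; the data are simulated, and `lag_matrix`, `fit_ridge`, `pearson_r`, and all parameter values are hypothetical.

```python
import numpy as np

def lag_matrix(stim, n_lags):
    """Stack delayed copies of stim (time x features) column-wise."""
    n_t, n_f = stim.shape
    lagged = np.zeros((n_t, n_f * n_lags))
    for lag in range(n_lags):
        lagged[lag:, lag * n_f:(lag + 1) * n_f] = stim[:n_t - lag]
    return lagged

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

def pearson_r(pred, actual):
    """Column-wise Pearson r between predicted and actual responses."""
    pred = pred - pred.mean(axis=0)
    actual = actual - actual.mean(axis=0)
    return (pred * actual).sum(axis=0) / np.sqrt(
        (pred ** 2).sum(axis=0) * (actual ** 2).sum(axis=0))

rng = np.random.default_rng(2)
n_feat, n_lags, n_chan = 3, 5, 4

def simulate(weights, n_t):
    """Simulated stimulus features and noisy EEG (assumed linear response)."""
    X = lag_matrix(rng.standard_normal((n_t, n_feat)), n_lags)
    return X, X @ weights + 4.0 * rng.standard_normal((n_t, n_chan))

# Assumed: AV and unimodal responses share similar, not identical, weights.
w_uni = rng.standard_normal((n_feat * n_lags, n_chan))
w_av = w_uni + 0.3 * rng.standard_normal(w_uni.shape)

X_uni_train, y_uni_train = simulate(w_uni, 3000)
X_av_train, y_av_train = simulate(w_av, 3000)
X_test, y_test = simulate(w_uni, 1000)       # held-out unimodal test set

r_within = pearson_r(X_test @ fit_ridge(X_uni_train, y_uni_train, 1.0), y_test)
r_cross = pearson_r(X_test @ fit_ridge(X_av_train, y_av_train, 1.0), y_test)
```

Plotting `r_within` against `r_cross` for every channel would give a scatter of the kind described in panels A and B, where points near the unity line indicate that the AV-trained model generalizes to the unimodal condition.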