Brain-optimized extraction of complex sound features that drive continuous auditory perception

doi:10.1371/journal.pcbi.1007992

Fig 1.

Architecture of the BO-NN model.

(a) The BO-NN model received two inputs: a 1D time-domain audio signal at 8 kHz (audio waveform) and a 2D time-frequency signal at 50 Hz (audio spectrogram). Six-second audio chunks were input to the model at once (48000 and 300 time points, respectively). Two convolutional streams were trained to process each input separately. Then, both representations were concatenated and passed to the recurrent layer. The output was a prediction of the HFB neural signal at 50 Hz (300 time points over 396 electrodes). (b) Results of testing various ANN architectures. Dots represent cross-validated model performance for individual electrodes. Boxes outline the 25th and 75th percentiles of the model performance over electrodes with a significant fit; caps show the 5th and 95th percentiles. The solid line within each box is the median. The red box outlines the model with significantly better cross-validated performance. Four comparisons were performed: (1) computations within the ANN nodes (convolution, CNN; recurrence, RNN; and a combination of both, RCNN); (2) the type of audio input (a waveform, a spectrogram and a combination of both); (3) the depth of the convolutional block of the ANN (one layer vs multiple layers); and (4) the size of the temporal window in the audio input (1-second, 3-second and 6-second). For each comparison, a set of ANNs with relevant architectures was constructed and trained on the data of Movie I. In general, wherever possible we matched the number of parameters across the different ANNs. Spearman correlation (ρ) between the predicted and observed HFB responses was used as the model performance metric. The performance of each model was cross-validated over ten test folds. A strict parametric threshold of p<1×10⁻²⁰ on the t-transformed ρ values was used to estimate the significance of the fit per electrode. Other thresholds were tested as well (1×10⁻¹⁰, 1×10⁻⁵⁰, 1×10⁻⁸⁰ and 1×10⁻¹⁰⁰) and yielded the same results in all ANN comparisons.
The electrodes with a significant ANN fit (reported here at p<1×10⁻²⁰) in any of the models were selected for each comparison (for example, when comparing the computations in the main nodes, we selected all electrodes with a significant fit in any of the three models: CNN, RNN or RCNN). Non-parametric Wilcoxon signed-rank tests were applied to compare the cross-validated model performance across the ANNs in each comparison. The reported results were significant at p<.001 (***), Bonferroni corrected for the number of ANN models and the number of architecture comparisons.
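The performance metric described in this caption can be illustrated with a minimal sketch, assuming the standard t-transform of a correlation coefficient, t = ρ·sqrt((n − 2)/(1 − ρ²)). The function names and the synthetic signals below are hypothetical and are not taken from the authors' code; the conversion of t to a p-value (Student's t with n − 2 degrees of freedom) is omitted because the Python standard library has no t-distribution CDF.

```python
import math

def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def t_statistic(rho, n):
    """Assumed t-transform of a correlation over n samples (requires |rho| < 1)."""
    return rho * math.sqrt((n - 2) / (1 - rho ** 2))

# Hypothetical example: one six-second chunk at 50 Hz gives 300 time points.
SAMPLES_PER_CHUNK = 6 * 50
predicted = [math.sin(0.1 * t) for t in range(SAMPLES_PER_CHUNK)]
observed = [p + 0.5 * math.cos(0.37 * t) for t, p in enumerate(predicted)]
rho = spearman_rho(predicted, observed)
t_value = t_statistic(rho, SAMPLES_PER_CHUNK)
```

In this sketch each electrode would contribute one ρ per test fold; how the authors pooled folds before thresholding is not stated in the caption.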

Fig 2.

BO-NN model performance with Movie I (6 participants).

(a) Overview of the training data (Movie I). The distributions of sound types were derived from the manual annotation of the soundtrack. (b) Cross-validated BO-NN model performance (over ten folds) estimated as the Spearman correlation (ρ) between the predicted and observed HFB responses on the test set of Movie I. Spearman’s ρ is shown per type of sound with a sufficient amount of data (at least 10% of the soundtrack): speech, noise, music and ambient sounds. In each test fold, all fragments of the same sound type were concatenated; Spearman’s ρ was then calculated for each sound type and averaged over the ten cross-validation folds. Each dot is an ECoG electrode with a significant fit at p<1×10⁻²⁰, Bonferroni corrected for the number of electrodes and folds. Boxes show the 25th and 75th percentiles; caps show the 5th and 95th percentiles. The solid line within each box shows the median. The colored lines connect the model performance for the same electrodes across the sound types. The colored lines are less visible for smaller ρ values; however, the plot shows that five electrodes (ρ≥.4) were associated with comparable prediction accuracy across different sound types. (c) Projection of electrodes with significant cross-validated model performance (Spearman’s ρ for ~7.8 min of data) on individual cortical surfaces. Spearman’s ρ values significant at p<1×10⁻²⁰, Bonferroni corrected for the number of electrodes and folds, are shown. The size of each colored electrode is proportional to the ρ value, which is additionally color-coded. Small black dots show electrodes that did not obtain a significant BO-NN fit and are displayed for coverage reference. (d) Distribution of the prediction accuracy over different sound types for an example electrode E216 (outlined in green on the cortical map in c). First, we concatenated all ten test folds into a full soundtrack of predicted HFB responses.
Then, per sound type we sampled 30 fragments, each 30 seconds long, at random starting points throughout the soundtrack and correlated the predicted and observed HFB responses. The rug plots show ρ values per individual 30-second fragment and per sound type. The probability density plots outline the distribution of the ρ values per sound type. This plot further illustrates that the best-fitted electrodes were associated with high prediction accuracies regardless of the type of sound.

Fig 3.

BO-NN model performance with unseen data (29 new participants, Movie II).

Cross-validated model performance (over six folds), estimated as Spearman’s ρ values (as for Movie I), is projected onto the corresponding electrode locations on the average MNI cortical surface. The displayed results were obtained with the features from the top (recurrent) layer of the BO-NN model. Individual electrode locations were normalized to the MNI space using patient-specific affine transformation matrices obtained with SPM8. Spearman’s ρ values significant at p<.001, Bonferroni corrected for the number of electrodes, are shown (significance testing was based on a surrogate distribution of shifted data, see Methods for details). The BO-NN model performance is shown separately for speech and music test sets. Note the difference in the range of the Spearman’s ρ values between speech (up to .6) and music (up to .3). For visualization purposes, a 2D Gaussian kernel (FWHM = 8 mm) was applied to the coordinate on the brain surface corresponding to the center of the electrode, so that the model performance score (Spearman’s ρ value) faded out from the center of the electrode toward its borders. Small black dots show electrodes that did not obtain a significant BO-NN fit and are displayed for coverage reference.
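The fade-out produced by the Gaussian kernel can be made concrete with a small sketch. It assumes only the standard relation between full width at half maximum and standard deviation, σ = FWHM / (2·sqrt(2·ln 2)); the function names are hypothetical, and the actual rendering on the cortical surface was presumably done with dedicated plotting tools not described in the caption.

```python
import math

FWHM_MM = 8.0  # kernel width stated in the caption

def fwhm_to_sigma(fwhm):
    """Convert full width at half maximum to a Gaussian standard deviation."""
    return fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))

def gaussian_weight(distance_mm, fwhm=FWHM_MM):
    """Unnormalized Gaussian weight at a given distance from the electrode center."""
    sigma = fwhm_to_sigma(fwhm)
    return math.exp(-distance_mm ** 2 / (2.0 * sigma ** 2))

def smoothed_score(rho, distance_mm):
    """Displayed score: the electrode's rho scaled by the kernel weight."""
    return rho * gaussian_weight(distance_mm)
```

With FWHM = 8 mm, σ is about 3.4 mm, so under these assumptions a displayed ρ falls to half of its central value 4 mm from the electrode center.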

Fig 4.

Comparison of the data-driven BO-NN model trained on the neural responses with two control feature sets for Movie II.

The first control feature set is the theory-driven spectrotemporal modulation feature (STMF) model. The second control feature set is the top layer of an ANN with the same architecture as the data-driven BO-NN but not optimized to fit the neural responses (Rand-NN). (a) The top panel displays second-order dissimilarity matrices showing how dissimilar the audio representations are between STG, the top layer of the data-driven BO-NN model, the theory-driven STMF model and the top layer of the non-optimized Rand-NN. The dissimilarity matrices are shown separately for speech and music data. The degree of similarity of the three audio representations to STG was tested using Wilcoxon signed-rank tests (see main text and Methods for details). The results were significant only for speech at p<.001, Bonferroni corrected for the number of models. The bottom panel shows the cross-validated prediction accuracy in speech (over six folds) achieved by the top layer of the BO-NN model per individual STG electrode. Scatter plots display the difference between the prediction by the data-driven BO-NN model (top layer) and the predictions by the non-optimized Rand-NN model (top layer) and the theory-driven STMF model. Each dot is an ECoG electrode with a significant Spearman’s ρ between the predicted and observed HFB responses on the test set of Movie II. Spearman’s ρ values significant at p<.001, Bonferroni corrected for the number of cortical regions, are shown. (b) Cross-validated prediction accuracy (over six folds) achieved by the top layer of the BO-NN model per individual electrode in higher-level speech ROIs: IFG, MTG and postcentral gyrus. The mode of display is analogous to the STG plots in a.

Fig 5.

Visualization and interpretation of the key BO-NN features.

(a) Optimal choice of the number of key features using AP clustering, selected as the knee of the curve over the preference parameter (left panel). The drop-off in the prediction accuracy is shown as a function of the number of clusters (right panel). The accuracy for predicting the neural responses to speech and music is shown separately. The prediction accuracy averaged over all significant electrodes is shown. The shaded area shows the standard error of the mean. (b) The top plot shows the key BO-NN features with the maximal average activation across speech or music fragments. The bottom plot shows music and speech specificity values per feature as assessed with the d′ statistic (signal separability index). Boxplots show surrogate distributions used for significance testing (obtained by permuting speech and music blocks and recalculating d′ values per feature 10000 times). The boxes show the 25th and 75th percentiles of the surrogate d′ values computed on permuted feature activation time courses; caps show the 1st and 99th percentiles. The solid line in the middle shows the median. The actual d′ statistics (from non-permuted data) are shown as circle markers per feature. The markers are filled if the actual d′ statistics fall above the 99th (red markers, speech specificity) or below the 1st (blue markers, music specificity) percentile of the surrogate d′ distributions. (c) Example of a ~4-second fragment of activity for a number of key BO-NN features with the corresponding audio spectrogram and language annotations. The top three selected features (#1, #2 and #36) were most active during speech blocks (and exhibited specificity to speech), whereas the bottom three selected features (#48, #40 and #15) were most active during music blocks (and exhibited specificity to music). Feature activation values are the result of the tanh transformation and are therefore in the range of [–1, 1]. The black dotted line shows the border between music and speech blocks.
The yellow contour shows sound intensity and the red contour shows pitch; both were extracted automatically with Praat.
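The permutation test in panel b can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: d′ is taken in its common form (difference of means divided by the root of the average variance), each list entry is assumed to be the mean activation of one feature in one speech or music block, and percentiles use a simple nearest-rank rule. All function names are hypothetical.

```python
import math
import random
import statistics

def d_prime(speech, music):
    """Assumed d' form: mean difference over the root of the average variance."""
    pooled_var = (statistics.variance(speech) + statistics.variance(music)) / 2
    return (statistics.mean(speech) - statistics.mean(music)) / math.sqrt(pooled_var)

def surrogate_d_primes(speech, music, n_permutations=10000, seed=0):
    """Recompute d' after randomly reassigning blocks to the two sound types."""
    rng = random.Random(seed)
    pooled = list(speech) + list(music)
    n_speech = len(speech)
    surrogates = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        surrogates.append(d_prime(pooled[:n_speech], pooled[n_speech:]))
    return sorted(surrogates)

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list (q between 0 and 100)."""
    rank = math.ceil(q / 100 * len(sorted_values))
    index = min(len(sorted_values) - 1, max(0, rank - 1))
    return sorted_values[index]

def specificity(speech, music, n_permutations=10000, seed=0):
    """Label a feature by comparing its d' with the 1st and 99th surrogate percentiles."""
    actual = d_prime(speech, music)
    surrogates = surrogate_d_primes(speech, music, n_permutations, seed)
    if actual > percentile(surrogates, 99):
        return "speech"
    if actual < percentile(surrogates, 1):
        return "music"
    return "none"

# Hypothetical block-level activations of one feature (tanh range, [-1, 1]).
speech_blocks = [0.5 + 0.03 * i for i in range(10)]
music_blocks = [-0.5 + 0.03 * i for i in range(10)]
label = specificity(speech_blocks, music_blocks, n_permutations=2000)
```

A feature labelled "speech" here corresponds to a filled red marker in the caption's description, and "music" to a filled blue marker.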

Fig 6.

Visualization and interpretation of the key BO-NN features.

(a) Example of an electrode tuning to various key BO-NN features. The top right panel shows the location of the selected electrode on the MNI cortical surface. The bottom right panel shows the distribution of β-weights over the 53 key BO-NN features for the selected electrode. (b) Cortical map of the maximal positive β-weights across all key BO-NN features, highlighting that most electrodes associated with the time courses of the key BO-NN features were located in the perisylvian cortex and primarily on the left (note the left hemisphere bias in ECoG coverage). (c) Cortical weight maps over the electrodes for a number of the key BO-NN features. Per plot, bars in the top right corner show the normalized mean activity of the feature during speech and music blocks; half-pie charts in the bottom right corner show the distribution of the electrodes contributing to the corresponding cluster over all 29 subjects.

Fig 7.

Relation of the key BO-NN features to STMFs.

(a) Cross-validated prediction accuracy (over six folds) for modeling STMFs from the top BO-NN features, separately for speech and music, using either the full set of 300 top BO-NN features or only the 53 key BO-NN features. The prediction accuracy was estimated as the Spearman correlation (ρ) between the predicted and observed STMF time courses in a held-out test set. Shown are Spearman’s ρ values significant at p<.001 (based on a surrogate distribution of shifted data, see Methods for details), averaged over six cross-validation folds. Colored boxes show the prediction accuracy for aligned BO-NN-STMF data during speech (in red) and music (in blue) fragments. White boxes show the distribution of the prediction accuracy for the surrogate shifted data (the alignment between the STMF and BO-NN time courses was shifted 1000 times). The boxes show the 25th and 75th percentiles of the prediction accuracy over all STMFs; the caps show the 5th and 95th percentiles. The solid line in the middle shows the median. (b) Distribution of the highest Spearman correlations with STMFs over all key BO-NN features, separately for speech and music fragments. Dotted lines show ρ values per key BO-NN feature significant at p<.001 (based on a surrogate distribution of shifted data). Each key BO-NN feature with a significant ρ value best captures a certain STMF that is a combination of values along three STMF dimensions: SMs, TMs and frequency. Therefore, per key BO-NN feature we mapped its maximal significant ρ value in all three STMF dimensions. For example, key BO-NN feature #2 exhibits ρ_max = .71 with an STMF that is a combination of SM = 2.8 cyc/oct, TM = 4 Hz and frequency = 2.7 kHz. This ρ value is then shown at the corresponding values along each STMF dimension. (c) In addition, we investigated which of the three STMF dimensions each key BO-NN feature is selectively tuned to.
For this, per key BO-NN feature we calculated the variance in BO-NN-STMF correlations along each STMF dimension and marked the dimension of the largest variance (see Methods for more details). A preference for certain values along a specific STMF dimension was used as an estimate of dimension selectivity and of tuning to a specific value along that dimension for each key BO-NN feature. The distributions of dimension selectivity for all key BO-NN features with significant correlations to STMFs are shown in the rug plots and the probability density plots above the correlation profiles per STMF dimension and sound type. Thus, the distribution plots show information that is different from the information shown in the correlation plots (dotted lines in b). For the example key BO-NN feature #2, the dimension selectivity is shown as a solid line in the correlation plot in b. It is also displayed in the rug plots and the probability density plots in c. The plots show that during speech the majority of the key BO-NN features were selective to the TM and frequency dimensions (as there was more variance in correlation for the different STMF values along these dimensions). In contrast, the SM dimension, while showing high BO-NN-STMF correlations in b, was associated with less variance in the BO-NN-STMF correlations and therefore with a lower preference of the key BO-NN features for its specific values.
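The surrogate distribution of shifted data mentioned for panel a can be sketched as a circular-shift test. Several choices here are assumptions rather than details from the caption or Methods: the shift is circular, shift sizes are drawn uniformly at random, and Pearson correlation is swapped in for brevity (applying the same function to rank-transformed signals would give Spearman's ρ, as used in the paper). Function names are hypothetical.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def shifted_surrogates(x, y, n_shifts=1000, min_shift=1, seed=0):
    """Correlations after circularly shifting y to break its alignment with x."""
    rng = random.Random(seed)
    n = len(y)
    scores = []
    for _ in range(n_shifts):
        # Draw a shift between min_shift and n - min_shift (never zero).
        shift = rng.randrange(min_shift, n - min_shift + 1)
        shifted = y[shift:] + y[:shift]
        scores.append(pearson(x, shifted))
    return scores

def shift_p_value(x, y, n_shifts=1000, seed=0):
    """Return the aligned correlation and its one-sided surrogate p-value."""
    actual = pearson(x, y)
    surrogates = shifted_surrogates(x, y, n_shifts, seed=seed)
    exceed = sum(1 for s in surrogates if s >= actual)
    return actual, (exceed + 1) / (n_shifts + 1)

# Hypothetical time courses: y follows x with a little added noise.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [a + 0.1 * rng.gauss(0, 1) for a in x]
actual, p = shift_p_value(x, y)
```

With 1000 shifts the smallest attainable p-value in this sketch is 1/1001, which sits just below the p<.001 threshold quoted in the caption.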

Fig 8.

Temporal profiles of the key BO-NN features.

(a) Autocorrelation profiles of the 53 key BO-NN features. Per feature, a black line shows the time point where the autocorrelation drops below .5. (b) The accuracy of predicting the activity of the key BO-NN features from a set of speech features: speechON, sound intensity, pitch and frequency values for the first and second formants. A linear model was trained for each audio-ECoG shift in the range of -600 ms to 1 s. Spearman correlations between the predicted and observed activity of the key BO-NN features were calculated and averaged over the six cross-validation folds. The shifts associated with non-significant prediction accuracies are greyed out. (c) The results of the MDS for the key BO-NN features. Various types of information were used for the MDS, including the results of the previous analyses (correlation to STMFs and linear fit using Praat features), the magnitude of the associated cortical weights (β-weights) and the anatomical constraints (association with the cortical weights specifically from the perisylvian regions). Each circle represents one key BO-NN feature, and the labels correspond to the indices of the key features. The cluster of features with the highest values along all MDS dimensions is highlighted in red. Per MDS dimension we also show its maximal correlation with the specific type of information used in the analysis (e.g. correlation with STMFs). (d) The left plot shows the results of the ordinary least squares fit of the sorted key features (highlighted features in c) to the y-coordinate (anterior-posterior direction) of the center of mass for the ECoG electrodes associated with each feature (based on the β-cortical weights). The boxes show the 25th and 75th percentiles of the electrode y-coordinates per feature; caps show the 5th and 95th percentiles. The solid line in the middle shows the median. The line across the boxplots represents the result of the ordinary least squares fit.
The slope and intercept of the fit are also reported at the top of the plot. The right plot shows the gradual decrease in the TM tuning for the features highlighted in c as well as the gradual increase in the temporal response profile (i.e. optimal shifts for the prediction of the key BO-NN features using Praat features shown in b). (e) A tentative timeline of the neural processing of audiovisual speech from -50 to 250 ms around the sound onset. The plots contain cortical weight maps for the selected key features (highlighted in c) together with their example time course during a speech fragment. Only the electrodes from the perisylvian cortices are shown; the weights of the electrodes in the right hemisphere are mapped onto the analogous location in the left hemisphere.
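The marker in panel a (the time point where the autocorrelation drops below .5) can be computed along the following lines. This is a minimal sketch assuming a normalized autocorrelation of the mean-centered time course and a 50 Hz feature sampling rate, matching the model's output rate; the function names and example signals are hypothetical.

```python
import math

def autocorrelation(signal, lag):
    """Normalized autocorrelation of a mean-centered signal at a given lag."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [s - mean for s in signal]
    denom = sum(c * c for c in centered)
    num = sum(centered[t] * centered[t + lag] for t in range(n - lag))
    return num / denom

def half_drop_time(signal, sampling_rate_hz=50, threshold=0.5):
    """Time in seconds of the first lag whose autocorrelation is below the threshold."""
    for lag in range(1, len(signal)):
        if autocorrelation(signal, lag) < threshold:
            return lag / sampling_rate_hz
    return None  # the autocorrelation never dropped below the threshold

# Hypothetical feature time courses: one slowly varying, one rapidly alternating.
slow_feature = [math.sin(2 * math.pi * t / 100) for t in range(1000)]
fast_feature = [1.0, -1.0] * 50
slow_time = half_drop_time(slow_feature)
fast_time = half_drop_time(fast_feature)
```

A feature with a longer half-drop time integrates over a longer temporal window, which is the property the autocorrelation profiles are meant to expose.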

Table 1.

Electrode grid information for all participants (both Movie I and II).

Shown is information about the number of electrodes, grid hemisphere, covered cortices, handedness, and language-dominant hemisphere per patient. L, Left; R, right; F, frontal cortex; M, motor cortex; T, temporal cortex; P, parietal cortex; O, occipital cortex; fMRI, functional magnetic resonance imaging; fTCD, functional transcranial Doppler sonography.

Table 2.

Parameters of the BO-NN model.

Conv, convolutional layer; BN, batch normalization layer; ReLU, rectified linear units; LSTM, long short-term memory units. The table is divided into blocks corresponding to the two processing streams (CNN1 on the time-domain input and CNN2 on the time-frequency input, i.e. the sound spectrogram), the concatenation layer, the RNN layer (LSTM) and the output layer, which contains predictions of the HFB ECoG responses across electrodes (396 electrodes).
