Brain-optimized extraction of complex sound features that drive continuous auditory perception
Fig 5
Visualization and interpretation of the key BO-NN features.
(a) Optimal number of key features determined with affinity propagation (AP) clustering, selected as the knee of the curve over the preference parameter (left panel). The drop-off in prediction accuracy is shown as a function of the number of clusters (right panel). The accuracy for predicting the neural responses to speech and to music is shown separately, averaged over all significant electrodes. The shaded area shows the standard error of the mean. (b) The top plot shows the key BO-NN features with the maximal average activation across speech or music fragments. The bottom plot shows music and speech specificity values per feature, assessed with the d′ statistic (signal separability index). Boxplots show the surrogate distributions used for significance testing (obtained by permuting speech and music blocks and recalculating the d′ value per feature 10000 times). The boxes show the 25th and 75th percentiles of the surrogate d′ values computed on permuted feature activation time courses; the caps show the 1st and 99th percentiles. The solid line in the middle shows the median. The actual d′ values (from non-permuted data) are shown as circle markers per feature. A marker is filled if the actual d′ value falls above the 99th percentile (red markers, speech specificity) or below the 1st percentile (blue markers, music specificity) of the surrogate d′ distribution. (c) Example of a ~4-second fragment of activity for a number of key BO-NN features with the corresponding audio spectrogram and language annotations. The top three selected features (#1, #2 and #36) were most active during speech blocks (and were specific to speech), whereas the bottom three selected features (#48, #40 and #15) were most active during music blocks (and were specific to music). Feature activation values are the output of a tanh transformation and therefore lie in the range [–1, 1]. The black dotted line shows the border between music and speech blocks.
The yellow contour shows sound intensity and the red contour shows pitch. Both were extracted automatically using Praat.
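The significance test in panel (b) can be illustrated with a minimal sketch. It assumes a common definition of d′ (difference of means divided by the square root of the average of the two variances); the caption does not state the exact formula, so this is not necessarily the one used for the figure. The function names, the block layout (one list of activation values per block) and the toy data are hypothetical and serve only as illustration.

```python
import random
import statistics


def _mean(values):
    return sum(values) / len(values)


def _variance(values):
    m = _mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def d_prime(speech, music):
    # ASSUMED definition: mean difference over the root of the average variance.
    pooled_sd = (0.5 * (_variance(speech) + _variance(music))) ** 0.5
    return (_mean(speech) - _mean(music)) / pooled_sd


def d_prime_from_blocks(blocks, labels):
    # Pool the activation samples of all blocks that share a label.
    speech = [v for b, lab in zip(blocks, labels) if lab == "speech" for v in b]
    music = [v for b, lab in zip(blocks, labels) if lab == "music" for v in b]
    return d_prime(speech, music)


def permutation_test(blocks, labels, n_perm=10000, seed=0):
    """Compare the actual d' with a surrogate distribution from shuffled block labels."""
    rng = random.Random(seed)
    actual = d_prime_from_blocks(blocks, labels)
    shuffled = list(labels)
    surrogate = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute speech/music labels across blocks
        surrogate.append(d_prime_from_blocks(blocks, shuffled))
    cuts = statistics.quantiles(surrogate, n=100)  # 99 percentile cut points
    low, high = cuts[0], cuts[-1]  # 1st and 99th percentiles
    if actual > high:
        verdict = "speech"
    elif actual < low:
        verdict = "music"
    else:
        verdict = "none"
    return actual, low, high, verdict


def toy_blocks(seed=1, n_blocks=10, n_samples=50):
    # Hypothetical feature: higher activation in speech blocks, clipped to [-1, 1].
    rng = random.Random(seed)
    blocks, labels = [], []
    for i in range(2 * n_blocks):
        lab = "speech" if i % 2 == 0 else "music"
        centre = 0.5 if lab == "speech" else -0.5
        blocks.append(
            [max(-1.0, min(1.0, rng.gauss(centre, 0.2))) for _ in range(n_samples)]
        )
        labels.append(lab)
    return blocks, labels
```

Calling `permutation_test(*toy_blocks(), n_perm=1000)` returns the actual d′, the 1st and 99th surrogate percentiles, and a verdict. In this sketch a positive d′ above the 99th percentile is read as speech specificity and a negative d′ below the 1st percentile as music specificity, mirroring the red and blue filled markers.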