
Natural sounds can be reconstructed from human neuroimaging data using deep neural network representation

Fig 4

Sound reconstruction.

(A) Audio generator. A multi-stage system is used to reconstruct sounds from DNN features. The audio transformer is trained to generate compact spectrogram representations, as sequences of codebook indices, from decoded DNN features. The codebook decoder reconstructs the spectrogram from these indices, and the spectrogram vocoder then converts the spectrogram into a time-domain waveform. (B) Reconstruction pipeline. The audio generator transforms DNN features decoded from fMRI responses into sound. As a recovery check, the same generator is applied to the true DNN features of the stimulus. (C) Reconstructed spectrograms (ROI: AC; DNN layer: Conv5; for the reconstructed sounds, see https://www.youtube.com/watch?v=kNSseidxFJU). The top row shows the original spectrograms of the stimulus sounds, the second row shows the spectrograms recovered from the true DNN features, and the following five rows show the spectrograms reconstructed for each subject. (D) Identification accuracy of reconstructed sounds based on behavioral evaluation. Bars represent the mean accuracies of pairwise identification, averaged across evaluation pairs for each sample. Error bars indicate the 95% CI, calculated from 50 data points; for the category-specific analyses, 20 data points were used for speech and 10 data points for the other categories. (E) Acoustic features. Three acoustic features are evaluated: the fundamental frequency (F0), the spectral centroid (SC), and the harmonic-to-noise ratio (HNR). (F) Evaluation of reconstructed sounds based on objective, feature-based measures. (G) Similarity of evaluation metrics to human ratings. For each objective metric, the Pearson correlation between the metric and the human identification accuracy is computed across the 50 reconstructed samples. Each dot represents a subject, and the bars indicate the mean across the five subjects. The data underlying this figure are provided in S2 Data.
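The evaluations in panels D–G can be illustrated with a minimal sketch. The code below is not the authors' evaluation code; it assumes a simple correlation-based similarity for pairwise identification and shows a textbook spectral-centroid computation (all function and variable names are hypothetical):

```python
import numpy as np

def spectral_centroid(mag_spec, freqs):
    """Magnitude-weighted mean frequency per frame, averaged over time.

    mag_spec: (n_freq_bins, n_frames) magnitude spectrogram
    freqs:    (n_freq_bins,) bin center frequencies in Hz
    """
    weights = mag_spec / (mag_spec.sum(axis=0, keepdims=True) + 1e-12)
    return float((freqs[:, None] * weights).sum(axis=0).mean())

def pairwise_identification(recon, true, distractors, sim=None):
    """Fraction of pairs in which the reconstruction is more similar
    to its true target than to a distractor (chance level = 0.5)."""
    if sim is None:
        # Assumed similarity measure: Pearson correlation of flattened arrays
        sim = lambda a, b: np.corrcoef(a.ravel(), b.ravel())[0, 1]
    s_true = sim(recon, true)
    wins = sum(s_true > sim(recon, d) for d in distractors)
    return wins / len(distractors)
```

A reconstruction that preserves the target's structure wins most pairwise comparisons against unrelated distractors, which is the quantity plotted in panel D (behavioral) and panel F (objective).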


doi: https://doi.org/10.1371/journal.pbio.3003293.g004
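The codebook-decoding stage of panel A can be sketched schematically. In the paper this decoder is a trained neural network; the snippet below only illustrates the indexing skeleton, mapping a sequence of codebook indices to a spectrogram-like array, with a random stand-in codebook (all names and shapes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a learned codebook: each row is a code
# vector representing one spectrogram frame.
n_codes, code_dim, n_frames = 16, 8, 5
codebook = rng.normal(size=(n_codes, code_dim))

def codebook_decode(indices, codebook):
    """Look up each codebook index and stack the code vectors into a
    (code_dim, n_frames) spectrogram-like array by table lookup."""
    return codebook[np.asarray(indices)].T

indices = rng.integers(0, n_codes, size=n_frames)
spec = codebook_decode(indices, codebook)  # shape (code_dim, n_frames)
```

In the full pipeline, the resulting spectrogram would then be passed to a vocoder to obtain a time-domain waveform.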