
Fig 1. Schematic overview of the proposed sound reconstruction pipeline.

(A) DNN feature extraction from sound. A deep neural network (DNN) extracts auditory features at multiple levels of complexity using a hierarchical framework. (B) Sound reconstruction. The reconstruction pipeline starts by decoding DNN features from fMRI responses using trained brain decoders. The audio generator then transforms these decoded features into the reconstructed sound.
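
Panel (B) amounts to a two-stage mapping: brain decoders trained to predict DNN features from fMRI responses, followed by an audio generator that turns the decoded features into sound. The sketch below illustrates that structure only; the ridge decoder, the array dimensions, and the placeholder audio_generator are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative data: fMRI voxel patterns and DNN features of the training sounds.
rng = np.random.default_rng(0)
X_train = rng.standard_normal((200, 500))   # n_samples x n_voxels (placeholder sizes)
F_train = rng.standard_normal((200, 128))   # n_samples x n_DNN_units
X_test = rng.standard_normal((10, 500))

# (1) Train decoders that map fMRI responses to DNN features.
decoder = Ridge(alpha=100.0).fit(X_train, F_train)

# (2) Decode DNN features from new fMRI responses.
decoded_features = decoder.predict(X_test)

# (3) Pass the decoded features to an audio generator; in the paper this stage
#     is a trained neural audio generator, stubbed out here as a placeholder.
def audio_generator(features):
    return np.zeros((features.shape[0], 16000 * 4))   # dummy 4-s waveforms

reconstructed_sounds = audio_generator(decoded_features)
```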


Fig 2. Sound stimuli and brain data.

(A) Training dataset. Waveform examples and category labels are displayed, with each category—human speech, animal sounds, musical instruments, and environmental sounds—represented in distinct colors. (B) Test dataset. Examples selected from each of the four categories are shown, with corresponding waveforms and labels. (C) Examples of spectrograms from the test dataset. One example from each category is displayed. (D) Data samples for machine learning analyses. Each 8-s sound stimulus is divided into three overlapping 4-s windows, and corresponding fMRI responses (3 volumes) are averaged within each window to create data samples. Single-trial fMRI volumes are used for training data, while test data utilize either single-trial volumes or volumes averaged across eight repetitions. (E) Definition of auditory cortex (AC). The AC, outlined by blue lines, is delineated as a combination of two regions: the early auditory cortex, shown in brown, and the auditory association cortex, shown in orange.
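
Panel (D)'s sample construction reduces to a windowing-and-averaging step over the volumes of each trial. In the sketch below, the mapping from each 4-s window to three volume indices, and the voxel/volume counts, are placeholder assumptions for illustration; the exact onsets and TR are not given in the caption.

```python
import numpy as np

def make_samples(trial_volumes, window_volume_indices=((0, 1, 2), (2, 3, 4), (4, 5, 6))):
    """Split one 8-s trial into three overlapping 4-s windows and average the
    3 fMRI volumes assigned to each window. trial_volumes: (n_volumes, n_voxels)."""
    return np.stack([trial_volumes[list(idx)].mean(axis=0)
                     for idx in window_volume_indices])

def average_repetitions(repeated_trials):
    """Average the volumes across the 8 repetitions of a test stimulus.
    repeated_trials: (n_repetitions, n_volumes, n_voxels)."""
    return repeated_trials.mean(axis=0)

rng = np.random.default_rng(0)
one_trial = rng.standard_normal((7, 1000))        # single-trial volumes (training)
test_trials = rng.standard_normal((8, 7, 1000))   # 8 repetitions of one test stimulus
train_samples = make_samples(one_trial)                            # (3, n_voxels)
test_samples = make_samples(average_repetitions(test_trials))      # repetition-averaged
```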


Fig 3. Feature decoding analysis.

(A) DNN feature decoding and evaluation. Decoders are trained to predict DNN features from voxel patterns of fMRI responses. Decoding performance is assessed by evaluating the ability of the decoded features to identify perceived sounds from the test set. (B) Example of true and decoded features for a DNN feature unit. The graph displays the true and decoded values for a single DNN feature unit across 50 test stimuli. This unit (#21060) was from the Conv5 layer of the VGGish-ish model (ROI: AC). (C) Profile correlation of decoded auditory features. Each bar represents the mean profile correlation, with distinct colors indicating different subjects. Error bars denote the 95% confidence interval (CI). (D) Identification accuracy for decoded auditory features. Each bar represents the mean identification accuracy across 50 test stimuli, with error bars denoting the 95% CI. Although Pearson correlation was used as the primary evaluation metric, similar results were confirmed when applying Spearman correlation as an alternative measure for both profile correlation and identification accuracy. The data underlying this figure are provided in S1 and S2 Data.
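
Panels (C) and (D) use two complementary measures: the per-unit correlation between decoded and true feature values across the 50 test stimuli (profile correlation), and pairwise identification, in which a decoded feature pattern must correlate more strongly with the true stimulus's features than with a lure's. Below is a minimal sketch of both, assuming Pearson correlation; candidate selection and averaging details may differ from the paper's exact procedure.

```python
import numpy as np

def profile_correlation(decoded, true):
    """Mean per-unit Pearson correlation across test stimuli.
    decoded, true: (n_stimuli, n_units)."""
    dz = (decoded - decoded.mean(0)) / decoded.std(0)
    tz = (true - true.mean(0)) / true.std(0)
    return float(np.mean((dz * tz).mean(axis=0)))

def pairwise_identification(decoded, true):
    """For each stimulus, compare the decoded pattern's correlation with the true
    pattern against its correlation with every other (lure) pattern; accuracy is
    the fraction of pairwise comparisons won."""
    dz = (decoded - decoded.mean(1, keepdims=True)) / decoded.std(1, keepdims=True)
    tz = (true - true.mean(1, keepdims=True)) / true.std(1, keepdims=True)
    corr = dz @ tz.T / decoded.shape[1]   # Pearson correlations, decoded vs. true
    wins = [(corr[i, i] > np.delete(corr[i], i)).mean() for i in range(len(corr))]
    return float(np.mean(wins))

# Toy example with 50 "test stimuli" and noisy decoded features.
rng = np.random.default_rng(0)
true = rng.standard_normal((50, 300))
decoded = true + rng.standard_normal((50, 300))
print(profile_correlation(decoded, true), pairwise_identification(decoded, true))
```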


Fig 4. Sound reconstruction.

(A) Audio generator. A multi-stage system is used to reconstruct sounds from DNN features. The audio transformer is trained to generate compact spectrogram representations as sequences of codebook indices based on decoded DNN features. The codebook decoder reconstructs the spectrogram from these indices, which are then converted into a time-domain waveform using the spectrogram vocoder. (B) Reconstruction pipeline. The audio generator transforms decoded DNN features from fMRI responses into sound. A recovery check is performed using the true DNN features of the stimulus. (C) Reconstructed spectrograms (ROI: AC, DNN layer: Conv5; for reconstructed sounds, see https://www.youtube.com/watch?v=kNSseidxFJU). The top row shows the original spectrograms of the stimulus sounds. The second row displays the recovered spectrograms using true DNN features. The following five rows present reconstructed spectrograms derived for each subject. (D) Identification accuracy of reconstructed sounds based on behavioral evaluation. Bars represent the mean accuracies of pairwise identification evaluations, averaged across samples for evaluation pairs. Error bars indicate the 95% CI, calculated from 50 data points. For category-specific analyses, 20 data points were used for speech and 10 data points for the other categories. (E) Acoustic features. Three key acoustic features are evaluated: the fundamental frequency (F0), the spectral centroid (SC), and the harmonic-to-noise ratio (HNR). (F) Evaluation of reconstructed sounds based on objective feature-based measures. (G) Similarity of evaluation metrics with human ratings. The similarity between the identification accuracies obtained from human ratings and those obtained from an objective evaluation metric is assessed using Pearson correlation across the 50 reconstructed data points. Each dot represents a subject, and the bars indicate the mean across five subjects. The data underlying this figure are provided in S2 Data.
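
Of the three acoustic features in panel (E), the spectral centroid has the most compact definition: the magnitude-weighted mean frequency of each spectrogram frame. The sketch below computes it with SciPy's STFT; F0 and HNR normally require dedicated pitch and harmonicity estimators and are omitted here. The sampling rate and STFT parameters are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from scipy.signal import stft

def spectral_centroid(waveform, sr, n_fft=1024, hop=256):
    """Frame-wise spectral centroid (Hz): magnitude-weighted mean frequency."""
    freqs, _, spec = stft(waveform, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(spec)
    return (freqs[:, None] * mag).sum(axis=0) / (mag.sum(axis=0) + 1e-12)

# Example: the centroid of a 1 kHz tone sits near 1 kHz in every frame.
sr = 16000
t = np.arange(sr * 2) / sr
sc = spectral_centroid(np.sin(2 * np.pi * 1000 * t), sr)
```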


Fig 5. Evaluation using temporally perturbed stimuli.

(A–C) Analysis using textured stimuli. (A) Identification analysis is conducted between the true or textured stimuli and lure candidates using extracted features. (B) The first row shows the original spectrograms of the stimulus sounds, while the second row illustrates the spectrograms of textured stimuli. (C) Each panel displays the identification accuracy for individual subjects (e.g., S3, S4). The dark blue bars represent the identification accuracy for reconstructed sounds using the original true stimuli, while the orange bars indicate accuracy for reconstructed sounds using the textured true stimuli. Error bars denote the 95% CI, calculated from 50 data points. (D–F) Analysis using temporally shuffled stimuli. (D) Identification analysis is performed between the true or shuffled true stimuli and lure candidates using extracted features. (E) The first row depicts the original spectrograms of the stimulus sounds, and the second row presents examples of temporally shuffled stimuli. Here, spectrograms are divided into equal-sized time windows (e.g., 48 ms), and the segments are randomly shuffled to introduce temporal perturbations. (F) Each panel shows the identification accuracy for individual subjects. The bars represent the mean identification accuracy for various segment sizes, with different colors indicating specific segment sizes. The data underlying this figure are provided in S2 Data.
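
The temporal shuffling described in panel (E) can be expressed as a random permutation of equal-sized spectrogram segments. In the sketch below, the spectrogram dimensions and the number of frames per segment are placeholders; converting a 48-ms window into frames depends on the STFT hop size actually used.

```python
import numpy as np

def shuffle_spectrogram(spec, segment_frames, rng=None):
    """Randomly permute equal-sized time segments of a spectrogram.
    spec: (n_freq_bins, n_frames); segment_frames: frames per segment."""
    rng = np.random.default_rng() if rng is None else rng
    n_segments = spec.shape[1] // segment_frames
    usable = spec[:, :n_segments * segment_frames]     # drop any trailing remainder
    segments = np.split(usable, n_segments, axis=1)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order], axis=1)

rng = np.random.default_rng(0)
spec = rng.random((80, 400))                           # placeholder spectrogram
shuffled = shuffle_spectrogram(spec, segment_frames=5, rng=rng)
```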


Fig 6. Reconstructions with leave-category-out analysis.

Each panel corresponds to a sound category excluded during training. The upper row in each panel illustrates the spectrograms of the original stimuli. The middle row displays the reconstructed spectrograms generated by decoders trained on the full dataset. The bottom row depicts the reconstructed spectrograms obtained using decoders trained on a dataset where data from the test category were excluded during training (ROI: AC; DNN layer: Conv5). Audio examples of the reconstructed sounds are accessible at https://www.youtube.com/watch?v=znm6NWL1YYY. Adjacent to the spectrograms, the upper bar graphs represent the identification accuracies achieved using decoders trained on the full dataset, while the lower bar graphs show the accuracies for decoders trained on the leave-category-out dataset. Error bars represent the 95% CIs. Each color in the bar graphs corresponds to a different subject. The data underlying this figure are provided in S2 Data.
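
The leave-category-out analysis retrains the decoders after discarding every training sample from the held-out category. Below is a minimal sketch of that split, assuming per-sample category labels are available; array sizes and names are illustrative, not the paper's code.

```python
import numpy as np

def leave_category_out(X, F, labels, held_out):
    """Drop every training sample whose category matches the held-out test category.
    X: fMRI samples, F: DNN features, labels: category label per sample."""
    keep = np.asarray(labels) != held_out
    return X[keep], F[keep]

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))          # fMRI samples (placeholder)
F = rng.standard_normal((200, 128))          # DNN features (placeholder)
labels = rng.choice(["speech", "animal", "instrument", "environment"], size=200)
X_wo_speech, F_wo_speech = leave_category_out(X, F, labels, "speech")
```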


Fig 7. Effect of hierarchical auditory regions and DNN layers.

(A) Evaluation of reconstructed sounds from individual auditory regions (DNN layer: Conv5). (B) Evaluation of reconstructed sounds from different DNN layers (ROI: AC). Each bar represents the mean identification accuracy calculated for each subject, with the error bar indicating the 95% CI estimated from 50 data points. The data underlying this figure are provided in S2 Data.
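
The error bars here, as in several earlier figures, are 95% CIs of the mean estimated from 50 data points. The sketch below shows a standard t-based interval; whether the paper used a t interval, a bootstrap, or another estimator is not specified in the caption.

```python
import numpy as np
from scipy import stats

def mean_ci95(values):
    """Two-sided 95% t-based confidence interval for the mean of a 1-D sample."""
    values = np.asarray(values, dtype=float)
    m = values.mean()
    half = stats.sem(values) * stats.t.ppf(0.975, df=values.size - 1)
    return m, (m - half, m + half)

rng = np.random.default_rng(0)
accuracies = rng.uniform(0.6, 1.0, size=50)   # e.g., 50 per-stimulus accuracies
mean_acc, ci = mean_ci95(accuracies)
```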


Fig 8. Sound reconstruction with attention.

(A) Reconstructed spectrograms under selective auditory attention tasks (ROI: AC, DNN layer: Conv5; for reconstructed sounds, see https://www.youtube.com/watch?v=1ZHCoiyqPb4). The top panel displays the spectrograms of two superimposed sounds presented during the task, where subjects were instructed to focus on one specific sound. The bottom panel shows the spectrograms reconstructed by different subjects (S4, S5). (B) Evaluation of attentional bias. Identification analysis was conducted to evaluate the attentional bias by comparing the correlation of reconstructed features with those of the attended and unattended stimuli. (C) Identification accuracy of attended sound. Each bar represents the correct rate among the 48 identification trials. Since the identification of the attended vs. unattended sound was scored as a binary outcome per trial, conventional error bars are not shown. Instead, the dashed line indicates the significance level (p < 0.05) based on a binomial test (N = 48). For the pilot study S1 (N = 32) and cases where F0 and HNR calculations were unsuccessful, a higher significance threshold was required (not depicted here). The data underlying this figure are provided in S2 Data.
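
The dashed significance line in panel (C) follows from a binomial test against chance (0.5) over N = 48 trials. The sketch below derives the smallest number of correct trials reaching p < 0.05; the one-sided formulation is an assumption, and the same function shows why the 32-trial pilot study needs a higher accuracy threshold.

```python
from scipy.stats import binom

def binomial_threshold(n_trials, p_chance=0.5, alpha=0.05):
    """Smallest number of correct trials whose one-sided binomial p-value
    P[X >= k] under chance performance falls below alpha."""
    for k in range(n_trials + 1):
        if binom.sf(k - 1, n_trials, p_chance) < alpha:
            return k, k / n_trials
    return None

k_main, rate_main = binomial_threshold(48)     # main attention experiment (48 trials)
k_pilot, rate_pilot = binomial_threshold(32)   # pilot study S1: fewer trials, higher rate needed
```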
