Music in Our Ears: The Biological Bases of Musical Timbre Perception

doi:10.1371/journal.pcbi.1002759

Figure 1.

Neurophysiological receptive fields.

Each panel shows the receptive field of one neuron, with red indicating excitatory (preferred) responses and blue indicating inhibitory (suppressed) responses. Examples range from narrowly tuned neurons (top row) to broadly tuned ones (middle and bottom rows). They also highlight variability in temporal dynamics and orientation (upward or downward sweeps).

Figure 2.

Schematic of the timbre recognition model.

An acoustic waveform from a test instrument is processed through a model of cochlear and midbrain processing, yielding a time-frequency representation called the auditory spectrogram. The latter is further processed in the cortical stage through neurophysiological or model spectro-temporal receptive fields. Cortical responses of the target instrument are tested against the boundaries of a statistical SVM timbre model in order to identify the instrument.
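The cortical stage in this schematic can be illustrated with the standard linear receptive-field model, in which a neuron's response is the spectrogram filtered by its spectro-temporal receptive field and summed across frequency. The sketch below is a minimal illustration under that assumption and is not the authors' implementation; the function name `strf_response` and the toy arrays are hypothetical.

```python
import numpy as np

def strf_response(spectrogram, strf):
    """Linear STRF response: r[t] = sum_f sum_k strf[f, k] * spectrogram[f, t - k].

    spectrogram: array of shape (n_freq, n_time)
    strf:        array of shape (n_freq, n_lags), lag 0 first (assumed layout)
    """
    n_freq, n_time = spectrogram.shape
    if strf.shape[0] != n_freq:
        raise ValueError("STRF and spectrogram must share the frequency axis")
    response = np.zeros(n_time)
    for f in range(n_freq):
        # Causal convolution along time, truncated to the input length.
        response += np.convolve(spectrogram[f], strf[f], mode="full")[:n_time]
    return response

# Toy input: one impulse in the middle frequency channel at time index 2.
spec = np.zeros((3, 5))
spec[1, 2] = 1.0
# Toy STRF: excitation at lag 0 followed by weaker inhibition at lag 1.
strf = np.zeros((3, 2))
strf[1] = [1.0, -0.5]
toy_response = strf_response(spec, strf)
```

With a bank of such filters, each note yields one response per receptive field, and those responses form the feature vector passed to the classifier.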

Table 1.

Classification performance for the different models.

Figure 3.

Spectro-temporal modulation profiles highlighting timbre differences between piano and violin notes.

(A) The plot shows the time-frequency auditory spectrograms of piano and violin notes. The temporal and spectral slices shown on the right are marked. (B) The plots show the magnitude of the cortical responses for four piano notes (left panels), played normally (left) and staccato (right) at F4 (top) and F#4 (bottom), and for four violin notes (right panels), played normally (left) and pizzicato (right), also at F4 (top) and F#4 (bottom). The white asterisks (upper leftmost note in each quadruplet) indicate the notes shown in part (A) of this figure.

Figure 4.

The confusion matrix for instrument classification using the auditory spectrum.

Each row sums to 100% (red represents high values and blue low values). Rows represent the instruments to be identified and columns represent the instrument classes. Off-diagonal values that are not dark blue represent classification errors. The overall accuracy from this confusion matrix is 79.1 ± 0.7%.
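As a minimal sketch of how such a matrix is read, the snippet below row-normalizes raw classification counts so that each row sums to 100% and computes an overall accuracy from the diagonal. The counts are made up for illustration, and the helper names are hypothetical; the reported figure in the caption may have been averaged differently (for example across cross-validation folds).

```python
import numpy as np

def row_normalize(counts):
    """Convert raw counts to percentages so that every row sums to 100."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

def overall_accuracy(counts):
    """Percentage of all test items that fall on the diagonal."""
    counts = np.asarray(counts, dtype=float)
    return 100.0 * np.trace(counts) / counts.sum()

# Hypothetical counts: rows are true instruments, columns are predicted classes.
counts = np.array([[8, 2, 0],
                   [1, 9, 0],
                   [0, 5, 5]])
percent = row_normalize(counts)
accuracy = overall_accuracy(counts)
```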

Figure 5.

The average KL divergence between support vectors of instruments belonging to different broad classes.

Each panel depicts the three-dimensional average distances between pairs of instruments drawn from a given pair of classes: (A) wind vs. percussion; (B) string vs. percussion; (C) wind vs. string. The three-dimensional vectors are displayed along eigenrates (x-axis), eigenscales (y-axis) and eigenfrequencies (across small subpanels). Red indicates high values of KL divergence and blue indicates low values.
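For reference, the discrete Kullback-Leibler divergence between two distributions is D(p || q) = sum_i p_i log(p_i / q_i). The sketch below assumes the inputs are non-negative vectors that are normalized to sum to one, with a small smoothing constant `eps` added to avoid division by zero; both choices are assumptions made for illustration and the caption does not state how the support vectors were converted to distributions.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D(p || q) in nats.

    eps is an illustrative smoothing term (an assumption, not from the source).
    """
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))

p = [0.5, 0.5]
q = [0.9, 0.1]
d_pq = kl_divergence(p, q)
d_qp = kl_divergence(q, p)
```

Note that the measure is not symmetric, so `d_pq` and `d_qp` generally differ; averaging over instrument pairs, as the caption describes, would be applied on top of this pairwise quantity.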

Figure 6.

Human listeners' judgments of musical timbre similarity.

(A) The mean (top row) and standard deviation (bottom row) of the listeners' responses show the similarity between every pair of instruments for three notes: A3, D4 and G#4. Red (values close to 1) indicates high dissimilarity and blue (values close to 0) indicates high similarity. (B) Timbre similarity is averaged across subjects, musical notes, and upper and lower half-matrices, and used for validation of the physiological and computational models. (C) Multidimensional scaling (MDS) applied to the human similarity matrix, projected onto 2 dimensions (shown to correlate with attack time and spectral centroid).
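The two processing steps in panels (B) and (C) can be sketched as follows: averaging the upper and lower half-matrices amounts to symmetrizing the dissimilarity matrix, and a 2-dimensional embedding can then be obtained with classical (Torgerson) MDS. The caption does not say which MDS variant was used, so classical MDS is an assumption here, and the toy points are hypothetical.

```python
import numpy as np

def symmetrize(d):
    """Average the upper and lower half-matrices and zero the diagonal."""
    d = np.asarray(d, dtype=float)
    s = 0.5 * (d + d.T)
    np.fill_diagonal(s, 0.0)
    return s

def classical_mds(d, n_dims=2):
    """Classical MDS: eigendecompose the double-centered squared distances."""
    n = d.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]   # largest eigenvalues first
    lam = np.clip(eigvals[order], 0.0, None)     # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(lam)

# Toy check: distances between four planar points should be recovered exactly.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
coords = classical_mds(symmetrize(dist), n_dims=2)
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
```

The embedding is defined only up to rotation and reflection, so the axes of such a plot have to be interpreted afterwards, as the caption does with attack time and spectral centroid.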

Figure 7.

Model musical timbre similarity.

Instrument similarity matrices based on the kernel optimization technique applied to (A) the neurophysiological receptive fields and (B) the cortical model receptive fields. (C) Control experiments using the auditory spectral features (left), separable spectro-temporal modulation features (middle), and global modulation features [separable spectral and temporal modulations integrated across time and frequency] (right). Red depicts high dissimilarity. All matrices show only the upper half-matrix, with the diagonal omitted.
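Because only the upper half-matrix carries information, a natural way to compare a model matrix with the human one is to correlate their upper-triangle entries. The sketch below uses the Pearson coefficient, which is an assumption since the type of correlation is not stated here; the helper names and the toy matrices are hypothetical.

```python
import numpy as np

def upper_triangle(m):
    """Return the entries strictly above the diagonal as a flat vector."""
    m = np.asarray(m, dtype=float)
    rows, cols = np.triu_indices(m.shape[0], k=1)
    return m[rows, cols]

def matrix_correlation(a, b):
    """Pearson correlation between the upper triangles of two matrices."""
    return float(np.corrcoef(upper_triangle(a), upper_triangle(b))[0, 1])

# Toy dissimilarity matrices for three instruments.
human = np.array([[0.0, 0.2, 0.9],
                  [0.2, 0.0, 0.4],
                  [0.9, 0.4, 0.0]])
model = 2.0 * human + 1.0   # a perfectly linearly related "model" matrix
r = matrix_correlation(human, model)
```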

Table 2.

Correlation coefficients for different feature sets.

Figure 8.

Correlation between human and model similarity matrices as a function of reduced feature dimensionality.

In each simulation, a cortical representation is projected onto a lower-dimensional space before being passed to the SVM classifier. Each projection retains a given percentage of the variability in the original data (shown adjacent to each correlation data point). We contrast performance using the full cortical model (black curve) vs. the separable spectro-temporal modulation model (red curve). The empirically optimal performance is achieved at around 420 dimensions, which is the setting reported in the main text in Table 2.
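One common way to project features so that a chosen fraction of the variability is retained is principal component analysis. The sketch below assumes plain PCA via the singular value decomposition, which may differ from the reduction actually used for the cortical features; `pca_project` and the toy data are hypothetical.

```python
import numpy as np

def pca_project(x, variance_kept):
    """Project rows of x onto the fewest principal components whose
    cumulative explained variance reaches variance_kept (between 0 and 1)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)
    # First index where the cumulative variance reaches the target, plus one.
    k = int(np.searchsorted(np.cumsum(explained), variance_kept) + 1)
    k = min(k, len(s))
    return centered @ vt[:k].T, k

# Toy data with per-axis sums of squares 18, 8 and 2 (about 64%, 29%, 7%).
x = np.array([[3.0, 0.0, 0.0], [-3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0], [0.0, -2.0, 0.0],
              [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]])
z90, k90 = pca_project(x, 0.90)
z50, k50 = pca_project(x, 0.50)
```

Sweeping `variance_kept` and re-running the classifier on each projection would trace out a curve of the kind shown in the figure.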
