Figure 1.
Block diagram of the encoding process.
a) The audio signal is segmented into non-overlapping analysis windows. b) The power spectrum of the audio segment is computed. c) The shape of the power spectrum is approximated by Bark-bands. d) Each Bark-band is binary-quantized by comparing the normalized energy of the band against a pre-computed energy threshold. These 22 quantized bands form a timbral code-word.
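The four steps in this caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation. The band edges are the standard Zwicker critical-band edges (the captions confirm only that there are 22 bands and that the first spans 0 to 100 Hz). The normalization divides each band energy by the total over all bands. The thresholds are hypothetical values supplied by the caller. The function names are made up for illustration.

```python
import cmath
import math

# Assumption: standard Zwicker critical-band edges in Hz (23 edges -> 22 bands).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270, 1480,
                 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400, 7700, 9500]


def power_spectrum(frame):
    """Step b: naive DFT power spectrum for bins 0..N/2 (slow, demo only)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def bark_band_energies(spectrum, sample_rate, n_fft):
    """Step c: sum power per band, then normalize so the bands sum to 1."""
    energies = [0.0] * (len(BARK_EDGES_HZ) - 1)
    for k, power in enumerate(spectrum):
        freq = k * sample_rate / n_fft
        for b in range(len(energies)):
            if BARK_EDGES_HZ[b] <= freq < BARK_EDGES_HZ[b + 1]:
                energies[b] += power
                break
    total = sum(energies)
    return [e / total if total > 0 else 0.0 for e in energies]


def encode_frame(frame, sample_rate, thresholds):
    """Step d: one bit per band, 1 if the normalized energy exceeds its threshold."""
    spectrum = power_spectrum(frame)
    energies = bark_band_energies(spectrum, sample_rate, len(frame))
    return tuple(1 if e > t else 0 for e, t in zip(energies, thresholds))


def encode_signal(signal, sample_rate, window_size, thresholds):
    """Step a: cut the signal into non-overlapping windows and encode each one."""
    return [encode_frame(signal[s:s + window_size], sample_rate, thresholds)
            for s in range(0, len(signal) - window_size + 1, window_size)]
```

For example, a pure 1,000 Hz tone sampled at 8,000 Hz and encoded with a flat, hypothetical threshold of 0.5 per band activates only the band covering 920 to 1,080 Hz. In practice a fast FFT routine would replace the naive DFT.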
Figure 2.
Spectrogram vs. timbral code-word example.
a) Spectrogram representation for a sinusoidal sweep in logarithmic progression over time going from 0 to 9,500 Hz. The color intensity represents the energy of the signal (white = no energy, black = maximum energy). This standard representation is obtained by means of the short-time Fourier transform. b) Timbral code-word representation of the same audio signal. The horizontal axis corresponds to temporal windows of 186 ms and the vertical axis shows the quantized values per Bark-band (black = 1 and white = 0). For instance, in the first 40 temporal windows only the first Bark-band is quantized as one (the first Bark-band corresponds to frequencies between 0 and 100 Hz). A total of 37 different code-words are used to encode this sinusoidal sweep.
Table 1.
Number of different timbral code-words used to describe each sound.
Figure 3.
Timbral code-words encoded from Bark-bands.
a) Rank-frequency distribution of timbral code-words per database (encoded Bark-bands, analysis window = 186 ms). b) Probability distribution of frequencies for the same timbral code-words. Music-W means Western Music, Music-nW means non-Western Music and Elements means Sounds of the Elements.
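As a rough sketch of how the two panels relate, both distributions can be derived from the same list of code-words. The function names are hypothetical, and the sketch uses plain (unbinned) counts; any binning used for the published plot is not stated in the caption and is not reproduced here.

```python
from collections import Counter


def rank_frequency(code_words):
    """Panel a: (rank, frequency) pairs, with rank 1 the most frequent code-word."""
    counts = Counter(code_words)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def frequency_distribution(code_words):
    """Panel b: fraction of distinct code-words that occur exactly f times."""
    counts = Counter(code_words)
    types_per_frequency = Counter(counts.values())
    n_types = len(counts)
    return {f: m / n_types for f, m in sorted(types_per_frequency.items())}
```

Code-words are assumed to be hashable (for instance tuples of 22 bits), so they can be counted directly.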
Table 2.
Power-law fitting results for Bark-band code-words per database and window size.
Figure 4.
Probability distribution of frequencies of timbral code-words for non-Western Music analyzed with window sizes of 46, 186, and 1,000 ms.
Figure 5.
Most (left) and least (right) frequent timbral code-words per database (window size = 186 ms).
The horizontal axis corresponds to individual code-words (the 200 most common and a random selection of 200 of the less common), so every position on the abscissa represents a particular code-word. The vertical axis corresponds to quantized values per Bark-band (white = 0, black = 1).
Figure 6.
Smoothness values per database.
For a better visualization we plot the mean and standard deviation of the smoothness value of 20 logarithmically-spaced points per database (window size = 186 ms).
Figure 7.
Rank-frequency distribution of timbral code-words (window = 1,000 ms) and Yule-Simon model with memory [13] per database.
Gen. Model stands for the computed generative model. For clarity's sake the curves for non-Western Music, Western Music, and Speech are shifted up by one, two, and three decades respectively. The model's three parameters (the probability of adding a new code-word, the memory parameter, and the number of initial code-words) were manually adjusted to match the experimental data, separately for Sounds of the Elements, Speech, Western Music, and non-Western Music. All model curves were computed by averaging 50 realizations with identical parameters.
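A minimal sketch of a Yule-Simon process with memory may help in reading this figure. The caption fixes only the meaning of the three parameters. The hyperbolic memory kernel used below, where an element x steps in the past is copied with weight 1 / (x + tau), is an assumption about the form of the model in [13]. All names (`yule_simon_memory`, `p_new`, `tau`, `n_initial`) are hypothetical.

```python
import random


def yule_simon_memory(n_steps, p_new, tau, n_initial, seed=None):
    """Generate a stream of code-word ids from a Yule-Simon process with memory.

    p_new     -- probability of adding a new code-word at each step
    tau       -- memory parameter of the assumed kernel 1 / (x + tau)
    n_initial -- number of initial code-words (must be at least 1)
    """
    rng = random.Random(seed)
    stream = list(range(n_initial))  # start with n_initial distinct code-words
    next_id = n_initial
    for _ in range(n_steps):
        if rng.random() < p_new:
            # Innovation: introduce a code-word that has never been used.
            stream.append(next_id)
            next_id += 1
        else:
            # Copy a past element; x = 1 is the most recent one.
            t = len(stream)
            lags = range(1, t + 1)
            weights = [1.0 / (x + tau) for x in lags]
            x = rng.choices(lags, weights=weights)[0]
            stream.append(stream[t - x])
    return stream
```

Each call yields one realization. Rank-frequency curves from several realizations with identical parameters can then be averaged, as the caption describes. The per-step weight computation makes this sketch quadratic in the stream length, so it is suited only to short runs.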