Zipf's Law in Short-Time Timbral Codings of Speech, Music, and Environmental Sound Signals

doi:10.1371/journal.pone.0033993

Figure 1.

Block diagram of the encoding process.

a) The audio signal is segmented into non-overlapping analysis windows. b) The power spectrum of each audio segment is computed. c) The shape of the power spectrum is approximated by Bark-bands. d) Each Bark-band is binary-quantized by comparing the normalized energy of the band against a pre-computed energy threshold. These 22 quantized bands form a timbral code-word.
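The four steps in this caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the band edges are the standard Zwicker Bark table (an assumption that happens to give 22 bands between 0 and 9,500 Hz), the sample rate and window length are example values (8,192 samples at 44,100 Hz is roughly 186 ms), and the uniform threshold is purely hypothetical, because the caption does not list the pre-computed thresholds.

```python
import numpy as np

# Assumed band edges in Hz (standard Zwicker Bark table): 23 edges -> 22 bands.
BARK_EDGES_HZ = np.array([0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                          1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700,
                          4400, 5300, 6400, 7700, 9500], dtype=float)


def encode_timbral_codewords(signal, sample_rate, window_size, thresholds):
    """Return one tuple of 22 binary values per non-overlapping window."""
    n_bands = len(BARK_EDGES_HZ) - 1
    n_windows = len(signal) // window_size
    freqs = np.fft.rfftfreq(window_size, d=1.0 / sample_rate)
    # Band index of every FFT bin; bins at or above 9,500 Hz get index 22
    # and are never selected below.
    band_of_bin = np.digitize(freqs, BARK_EDGES_HZ) - 1

    codewords = []
    for w in range(n_windows):
        # a) non-overlapping analysis window
        frame = signal[w * window_size:(w + 1) * window_size]
        # b) power spectrum of the segment
        power = np.abs(np.fft.rfft(frame)) ** 2
        # c) energy per Bark-band
        band_energy = np.array(
            [power[band_of_bin == b].sum() for b in range(n_bands)]
        )
        # d) normalize (here: by total energy, an assumption) and quantize
        total = band_energy.sum()
        if total > 0:
            band_energy = band_energy / total
        codewords.append(tuple(int(v) for v in band_energy > thresholds))
    return codewords


# Illustrative call with hypothetical settings (not taken from the paper).
example_rate = 44100
example_window = 8192
example_thresholds = np.full(22, 0.5)
example_time = np.arange(2 * example_window) / example_rate
example_codes = encode_timbral_codewords(
    np.sin(2 * np.pi * 50 * example_time),
    example_rate, example_window, example_thresholds,
)
```

Representing each code-word as a tuple keeps it hashable, so code-words can be counted directly with a dictionary or `collections.Counter`.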

Figure 2.

Spectrogram vs. timbral code-word example.

a) Spectrogram representation for a sinusoidal sweep in logarithmic progression over time going from 0 to 9,500 Hz. The color intensity represents the energy of the signal (white = no energy, black = maximum energy). This standard representation is obtained by means of the short-time Fourier transform. b) Timbral code-word representation of the same audio signal. The horizontal axis corresponds to temporal windows of 186 ms and the vertical axis shows the quantized values per Bark-band (black = 1 and white = 0). For instance, in the first 40 temporal windows only the first Bark-band is quantized as one (the first Bark-band corresponds to frequencies between 0 and 100 Hz). A total of 37 different code-words are used to encode this sinusoidal sweep.

Table 1.

Number of different timbral code-words used to describe each sound.

Figure 3.

Timbral code-words encoded from Bark-bands.

a) Rank-frequency distribution of timbral code-words per database (encoded Bark-bands, analysis window = 186 ms). b) Probability distribution of frequencies for the same timbral code-words. Music-W means Western Music, Music-nW means non-Western Music and Elements means Sounds of the Elements.
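The two panels describe the same counts from different angles, which a short sketch can make concrete. The function names below are hypothetical, and the second function is a plain unbinned estimate; the caption does not say whether the published curves use any binning.

```python
from collections import Counter


def rank_frequency(codewords):
    """Occurrence counts sorted in decreasing order; index 0 is rank 1."""
    counts = Counter(codewords)
    return sorted(counts.values(), reverse=True)


def frequency_distribution(codewords):
    """Map each frequency n to the fraction of distinct code-words seen n times."""
    counts = Counter(codewords)
    # Count how many distinct code-words share each occurrence count.
    freq_of_freq = Counter(counts.values())
    n_types = len(counts)
    return {n: k / n_types for n, k in sorted(freq_of_freq.items())}
```

Plotting `rank_frequency` against rank on log-log axes gives a curve of the kind shown in panel a); plotting `frequency_distribution` on log-log axes corresponds to panel b).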

Table 2.

Power-law fitting results for Bark-band code-words per database and window size.

Figure 4.

Probability distribution of frequencies of timbral code-words for non-Western Music analyzed with window sizes of 46, 186, and 1,000 ms.

Figure 5.

Most (left) and least (right) frequent timbral code-words per database (window size = 186 ms).

The horizontal axis corresponds to individual code-words (the 200 most common and a random selection of 200 of the less common ones). The vertical axis corresponds to quantized values per Bark-band (white = 0, black = 1). Every position on the abscissa represents a particular code-word.

Figure 6.

Smoothness values () per database.

For a better visualization we plot the mean and standard deviation of the smoothness value of 20 logarithmically-spaced points per database (window size = 186 ms).

Figure 7.

Rank-frequency distribution of timbral code-words (window = 1,000 ms) and Yule-Simon model with memory [13] per database.

Gen. Model stands for the computed generative model. For clarity, the curves for non-Western Music, Western Music, and Speech are shifted up by one, two, and three decades, respectively. The model's parameters , , and were manually adjusted to match the experimental data. They correspond to the probability of adding a new code-word, the memory parameter, and the number of initial code-words, respectively. The adjusted parameters are , , and for Sounds of the Elements; , , and for Speech; , , for Western Music and , , and for non-Western Music. All model curves were computed by averaging 50 realizations with identical parameters.
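To illustrate how a generative model with these three parameters can produce a code-word sequence, here is a minimal sketch. It assumes a hyperbolic memory kernel, in which a past element of age `a` is copied with weight proportional to `1 / (a + tau)`; the caption does not spell out the kernel, so this may differ from the exact formulation in [13]. The names `p_new`, `tau`, and `n_initial` are hypothetical stand-ins for the three parameters, and the values in the example call are illustrative, not the adjusted values from the figure.

```python
import random


def yule_simon_with_memory(n_steps, p_new, tau, n_initial, seed=None):
    """Generate a sequence of integer code-word labels."""
    rng = random.Random(seed)
    sequence = list(range(n_initial))  # start with n_initial distinct words
    next_word = n_initial
    for _ in range(n_steps):
        if rng.random() < p_new:
            # Add a brand-new code-word.
            sequence.append(next_word)
            next_word += 1
        else:
            # Copy a past element; recent elements are favoured.
            # Assumed kernel: weight = 1 / (age + tau), age counted from 1.
            t = len(sequence)
            weights = [1.0 / (t - i + tau) for i in range(t)]
            sequence.append(rng.choices(sequence, weights=weights, k=1)[0])
    return sequence


# Illustrative call with hypothetical parameter values.
example_sequence = yule_simon_with_memory(
    n_steps=200, p_new=0.05, tau=10.0, n_initial=5, seed=0
)
```

Recomputing the weights at every step makes this sketch quadratic in the sequence length, which is acceptable for short runs. Averaging the rank-frequency curves of several runs with different seeds mirrors the averaging over realizations mentioned in the caption.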
