Fig 1.

The model.

The model is composed of three main sections: (A) The cochlear model [26–29] transduces a one-dimensional input stimulus, s_in(t), into a two-dimensional matrix that represents the auditory nerve (AN) population response, SAN(t,fCF). (B) The AN spatiotemporal response is fed into the sparse coding (SC) block to produce the sparse coefficient vector, h. The vector h carries invariant information about the input stimulus that we refer to as pitch cues. The (sparse) information in h represents the harmonics in s_in(t). (C) Finally, the probability of each candidate pitch given the vector h is extracted and denoted as pdf(fp).


Fig 2.

Different complex harmonic stimuli with the same pitch.

(A) The Fourier transform (FT) of three harmonic complex stimuli with a fundamental frequency of f0 = 240 Hz. The three signals have different spectral components: Sin,1(f) consists of the first harmonic of f0; Sin,2(f) consists of the first four successive harmonics of f0; and Sin,3(f) is formed from harmonics 10–13. (B) The corresponding output of the cochlear model [26], i.e., the AN population responses for the three stimuli. The y-axis represents the normalized characteristic frequencies (CFs), i.e., CF divided by f0, on a linear scale, and the x-axis shows the post-stimulus time in milliseconds. The cochlear input is a 15 ms stimulus, and the output is taken from the last 5 ms. Note the different patterns of AN activity in the three cases: a stimulus with low frequencies excites the apical parts of the cochlea (lower part of the images), while a stimulus with higher frequencies excites the basal parts. Note also that the AN population responses define a unique spatiotemporal pattern of activity for each stimulus. All three stimuli have a relatively low sound level (30 dB SPL), which means that the cochlear response is linear.


Fig 3.

Composing the dictionary D.

(A) An example of one atom in D. It is the AN population response to a 1.5 kHz sine tone, generated by the cochlear model [26]. The atom is normalized to a peak value of 1, as are all other population responses. The y-axis of the two-dimensional matrix represents the CF along the basilar membrane (BM), and the x-axis is the post-stimulus time in milliseconds. (B) Each of the atoms, dj, j ∈ [1, M], is vectorized into a column of D. These M columns are concatenated to form the dictionary matrix D. All the input signals used to create the dictionary have the same level of 30 dB SPL (i.e., in the linear region of the cochlea). In this example, we used only one atom per group (g = 1).
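The construction described in (B) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' code: `toy_population_response` is a hypothetical stand-in for the cochlear model (a Gaussian tuning curve times a sinusoid), and the matrix sizes, CF range, and atom frequencies are arbitrary.

```python
import numpy as np

def toy_population_response(freq_hz, n_cf=32, n_t=40, fs=8000.0):
    """Hypothetical stand-in for a cochlear model.

    Returns an (n_cf, n_t) matrix: rows are CF channels, columns are time samples.
    """
    cfs = np.linspace(100.0, 3000.0, n_cf)                   # assumed CF axis (Hz)
    t = np.arange(n_t) / fs                                  # time axis (s)
    tuning = np.exp(-0.5 * ((cfs - freq_hz) / 150.0) ** 2)   # assumed Gaussian tuning
    return np.outer(tuning, np.sin(2 * np.pi * freq_hz * t))

def build_dictionary(atom_freqs):
    """Normalize each response to a peak of 1, vectorize it, and stack as columns."""
    columns = []
    for f in atom_freqs:
        atom = toy_population_response(f)
        atom = atom / np.max(np.abs(atom))   # peak value of 1
        columns.append(atom.reshape(-1))     # two-dimensional response -> column vector
    return np.stack(columns, axis=1)

atom_freqs = np.linspace(200.0, 2800.0, 50)  # one atom per frequency (g = 1)
D = build_dictionary(atom_freqs)
print(D.shape)  # (n_cf * n_t, M)
```

Each column of `D` is one vectorized atom, so a population response can later be compared against the dictionary by vectorizing it in the same way.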


Fig 4.

The sparse coefficient vector h and the final pitch probability vector.

(A) A simplified view of the SC methodology. The algorithm decomposes the two-dimensional signal SAN,2(t,fCF) into a linear combination of four atoms (columns) of D. This simplified view shows only the primary values in h2 (green indices) multiplied by the corresponding atoms. (B) The sparse coefficient solution vectors, hk, for the three cases (k ∈ {1, 2, 3}). The green circles in the plot of h2 correspond to the four terms in the simplified example of (A). All x-axes are normalized by the fundamental frequency f0 = 240 Hz for convenience. Observe that the solutions hk resemble the FTs of the respective stimuli (Fig 2A). (C) Using the pitch estimation unit (harmonic sieve), we can easily map the information in hk, for k ∈ {1, 2, 3}, into a pitch probability vector, pdf(fp). The y-axis of each pdf is multiplied by a constant (×100) for visual clarity. The red arrows indicate the locations of the maximum peaks, all of which occur at the fundamental frequency. In other words, it is most probable that all three stimuli represent the same pitch. Still, note that other options are also plausible, especially at rational ratios of f0.
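The mapping in (C) from a sparse vector to a pitch pdf can be illustrated with a very simple sieve. The caption does not specify the sieve, so everything here is an assumption made for illustration: the scoring rule (sum of |h| at atoms close to integer multiples of the candidate), the tolerance `tol`, the candidate grid of ±0.5 octaves, and the normalization.

```python
import numpy as np

def harmonic_sieve(h, atom_freqs, candidates, tol=0.02):
    """Score each candidate pitch by the |h| mass that falls near its harmonics.

    An atom counts for candidate fp when atom_freq / fp lies within `tol`
    of a positive integer. Scores are normalized to sum to 1.
    """
    weights = np.abs(h)
    scores = np.zeros(len(candidates))
    for i, fp in enumerate(candidates):
        ratio = atom_freqs / fp      # atom frequency in units of fp
        nearest = np.round(ratio)    # closest harmonic number
        in_sieve = (nearest >= 1) & (np.abs(ratio - nearest) <= tol)
        scores[i] = weights[in_sieve].sum()
    return scores / scores.sum()

# Toy sparse vector: nonzero only at harmonics 10-13 of 240 Hz (no fundamental).
atom_freqs = np.arange(100.0, 4000.0, 10.0)
h = np.where(np.isin(atom_freqs, [2400.0, 2640.0, 2880.0, 3120.0]), 1.0, 0.0)

# Candidate pitches within roughly +/- 0.5 octaves of 240 Hz.
candidates = np.arange(170.0, 340.0, 1.0)
pdf = harmonic_sieve(h, atom_freqs, candidates)
best = candidates[np.argmax(pdf)]
print(best)
```

In this toy case only the 240 Hz candidate collects all four components, even though the stimulus has no energy at 240 Hz. Other candidates collect at most a single component, which is one way to picture the weaker alternatives mentioned in the caption.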


Fig 5.

Comparing LS with SC.

(A) From left to right: the AN population response for a harmonic complex with the 1st–4th harmonics. The y-axis shows the CFs normalized by the fundamental frequency, on a linear scale (f0 = 240 Hz). The x-axis indicates the post-stimulus time (between 10 ms and 15 ms). Next, the h coefficient vectors for the least-squares (LS) case (λ = 0.0) and for the sparse case (λ = 0.01). (B) Same as in (A) but for a complex tone stimulus that contains the 10th–13th harmonics. Note that for the lower-harmonic stimulus (A), the results of the two cases, i.e., hLS vs. hSC, are almost identical. On the other hand, for the stimulus with the higher harmonics (B), the difference is more substantial. Specifically, hLS has many more nonzero coefficients than hSC that are unrelated to the original spectral structure of the signal (compare with the FTs in Fig 2A).
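The contrast between λ = 0 and λ > 0 can be reproduced on a toy problem. The sketch below assumes the common objective 0.5·||s − D h||² + λ·||h||₁ and solves it with iterative soft thresholding (ISTA). The solver, the random dictionary, and the value of λ are illustrative choices, not the authors' setup.

```python
import numpy as np

def ista(D, s, lam, n_iter=500):
    """Minimize 0.5 * ||s - D h||^2 + lam * ||h||_1 by iterative soft thresholding.

    With lam = 0 this reduces to plain gradient descent on the least-squares term.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2   # 1 / Lipschitz constant of the gradient
    h = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = h - step * (D.T @ (D @ h - s))                        # gradient step
        h = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft threshold
    return h

# Toy overcomplete dictionary with unit-norm columns (random, for illustration only).
rng = np.random.default_rng(0)
D = rng.standard_normal((60, 120))
D /= np.linalg.norm(D, axis=0)

# Signal built from four atoms plus a little noise.
h_true = np.zeros(120)
h_true[[10, 20, 30, 40]] = 1.0
s = D @ h_true + 0.01 * rng.standard_normal(60)

h_ls = ista(D, s, lam=0.0)    # least-squares case
h_sc = ista(D, s, lam=0.05)   # sparse case

print(np.count_nonzero(h_ls), np.count_nonzero(h_sc))
```

Without the penalty, the solution spreads small weights over most atoms. The soft-threshold step sets many of them exactly to zero, which is the qualitative difference the figure shows between hLS and hSC.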


Fig 6.

Stimulus level invariance.

(A) The AN population response for the missing-fundamental harmonic complex tone of Eq 6, with f0 = 225 Hz. The stimulus level is 30 dB SPL, and the AN population response is normalized to one, as usual. The x-axis shows post-stimulus time, and the y-axis denotes the (linear) mapping between locations along the cochlea and CFs. (B) The AN population response for the same spectral structure as in (A) (harmonics 3–8), but for a stimulus level of 90 dB SPL. At this relatively high stimulus level, the effects of cochlear nonlinearity on the AN population response are apparent. (C–D) The solutions of the LS case (hLS) and the SC case (hSC) for the 30 dB SPL (C) and 90 dB SPL (D) stimulus levels, respectively. (E–F) Probability functions of the LS (Sp,LS) and the SC (Sp,SC) cases for the two levels, respectively. In the 30 dB SPL case (E), the same pitch is successfully estimated by both the LS and the SC simulations (blue and red arrows indicate the maximum peaks). However, in the 90 dB SPL case (F), only the SC solution is robust and invariant to the stimulus level, as desired (the red arrow indicates the maximum peak). To account for the cochlear nonlinearities that arise as the stimulus level changes, all simulations of the AN fibers in this section were made using Carney's cochlear model (Zilany et al. [27–29]).


Fig 7.

Comparing the performance of different dictionaries over moderate and high amplitude stimulus levels.

All simulations have the same spectral structure (Eq 6). This spectral structure is simulated for various fundamental frequencies, f0, and the panels show the estimated pitch for each case (i.e., the maximum peak of each pdf). The estimations are taken from an interval of ±0.5 octaves around f0. Each row (panels A-B and panels C-D) shows the estimation results of the SC model for one of the two dictionaries, Dsine and Dstack, respectively (see text). The columns refer to different stimulus levels: moderate (45 dB SPL) and high (90 dB SPL). The x-axis denotes the frequency of the lowest harmonic in the stimulus (i.e., the 3rd harmonic); the thick black dashed lines mark the main octave (f0), and the thin black dashed lines mark the lower and upper octaves, i.e., 0.5f0 and 2f0, respectively. (A-B) When the lowest harmonic of the complex stimulus is below about 4 kHz, the estimations of the Dsine dictionary converge to the expected frequencies for both moderate and high stimulus levels. However, from 4 kHz and above, the pitch estimations at the high stimulus level diverge from the main octave to other ratios of f0. (C-D) The pitch estimations of the Dstack dictionary converge to the main octave more reliably, at both low and high frequencies and for both levels.


Fig 8.

Detailed results for f0 = 606.4 Hz.

The selected examples are taken from Fig 7 and show the SC coefficient vectors h and the pdfs for the two dictionaries and the two stimulus levels. (A, C) The SC coefficient vectors h for the Dsine and Dstack dictionaries, respectively. (B, D) The resulting pdfs, over one octave around f0 = 606.4 Hz, for the Dsine and Dstack dictionaries, respectively. Note that the SC coefficients of the two dictionaries differ, yet the two pdfs are qualitatively similar.


Fig 9.

Resolved vs. unresolved representation of harmonic cues.

(A) The solutions hk, k ∈ [1, 5], for the stimuli of Eq 7. We compare the SC solutions of the two dictionaries, Dsine (solid lines) and Dstack (dashed lines). Dsine consists of tone-atoms, and Dstack consists of complex tones that contain six harmonics with decreasing amplitudes (1 to 1/6). All stimuli contain four harmonics of the same fundamental frequency, f0 = 433 Hz, but at different spectral locations (r ∈ {1, 6, 10, 17, 22}). The x-axis is normalized by f0 for convenience. The correlation between the SC solutions and the spectral components of the stimuli (Eq 7) is apparent. Note that signals with low-frequency components (such as h1) have more prominent nonzero coefficients than those with higher harmonics (e.g., h5). A closer look at h5 (the inset) shows that only two of the four harmonics are successfully reconstructed (the 23rd and 24th of harmonics 22–25). (B) Pitch probabilities (pdfs) for the five complex tones for the Dsine dictionary (see text). The right panel shows all fp frequencies, and the left panel zooms in on fewer octaves around f0. The numbers above the curves mark the four most prominent peaks of the pdfs, from the highest (1) to the fourth highest (4). Observe that all five solutions peak at the first harmonic, that is, the model predicts the same 433 Hz pitch for all stimuli. Additionally, most of the other plausible pitches, i.e., other peaks, are located at harmonic ratios of f0, that is, they represent octave-equivalence options. It is also instructive to note the fLOCUS frequencies in the right panel of (B). These peaks indicate the additional possibility of perceiving the pitch at the locus of the stimulus's spectral energy rather than at f0 [1]. All simulations were performed with Slaney's model and with a sound level of 45 dB SPL.


Fig 10.

Salience of complex tones.

(A) A comparison between the probability functions of two of the complex tones from Fig 9: the blue line is the pdf of the harmonic complex tone with harmonics 1–4, and the green line is the pdf of the fifth stimulus, which comprises harmonics 22–25. The x-axis is limited to one octave in order to compare the relative heights of the pitch peaks without considering the octave equivalence of consecutive harmonics. The blue and green arrows show the 1st and 2nd largest peaks of the two curves, respectively. For each curve, the ratio between the 1st and the 2nd peaks yields a measure of the pitch's salience; a larger ratio indicates a more prominent percept of the pitch. (B) The ratio between the 1st and 2nd peaks for harmonic tones with four consecutive components at different harmonic numbers. The x-axis indicates the location of the first harmonic in each stimulus, and the y-axis shows the ratio between the 1st and the 2nd peaks (as demonstrated in (A)). Colored circles indicate the stimuli that are shown in Fig 9.
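The salience measure in (A) is a ratio of two peak heights, which can be written down directly. The sketch below is a minimal illustration; the peak-picking rule (strict local maxima of a sampled curve) and the toy pdf values are assumptions, not taken from the paper.

```python
def peak_ratio(pdf):
    """Ratio between the largest and second-largest local maxima of a sampled pdf."""
    peaks = [
        pdf[i]
        for i in range(1, len(pdf) - 1)
        if pdf[i] > pdf[i - 1] and pdf[i] >= pdf[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("need at least two local maxima")
    first, second = sorted(peaks, reverse=True)[:2]
    return first / second

# Toy curves: one dominant peak versus two peaks of similar height.
salient_pdf = [0.0, 0.8, 0.0, 0.2, 0.0]
ambiguous_pdf = [0.0, 0.5, 0.0, 0.4, 0.0]

print(peak_ratio(salient_pdf), peak_ratio(ambiguous_pdf))
```

A ratio near 1 means the two best candidates are almost equally likely, while a large ratio corresponds to a single clearly dominant pitch.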


Fig 11.

Pitch shift of equally spaced harmonics.

(A-D) The vectors h for complex harmonic stimuli that contain the four harmonics 4–7 (Eq 8). The x-axis denotes fd normalized by the fundamental frequency, f0 = 200 Hz. The four panels show the stimulus in Eq 8 for Δf = 0 Hz, 40 Hz, 100 Hz, and 200 Hz, respectively. The zero-shift case (A) is a regular harmonic complex. With the 40 Hz shift (B), the stimulus is no longer a harmonic complex of 200 Hz. The 100 Hz shift (C) yields a harmonic complex of 100 Hz (with harmonics 9, 11, 13, and 15). Finally, the Δf = 200 Hz shift (D) again yields a harmonic complex of f0 = 200 Hz, this time with harmonics 5–8. (E) The peaks of the probability functions, pdf(fp), for 500 uniformly shifted stimuli. Each stimulus is given by Eq 8, i.e., each signal includes the first four terms (1–4) of the fundamental f0 = 200 Hz, plus an incremental frequency shift of Δf. The x-axis denotes the frequency of the lowest harmonic component of the input stimulus (f0 + Δf), normalized by f0 for visual clarity. The y-axis denotes the estimated pitch. To demonstrate the ambiguity of this process, we included the four largest peaks of each of the resulting pdfs. We focused the view on the regions around 100 Hz, 200 Hz, and 400 Hz on the y-axis; all other regions are mostly empty. Note the linear shifts in the pitch estimations and the change in these slopes as a function of Δf [47].
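The arithmetic behind panels (A-D) can be checked directly. The sketch assumes that the components of Eq 8 lie at n·f0 + Δf for n = 4, …, 7, which is consistent with the harmonic numbers quoted in the caption. The greatest common divisor gives the mathematical common fundamental of the components, which is not necessarily the perceived pitch.

```python
from functools import reduce
from math import gcd

def shifted_components(f0, delta_f, harmonics=range(4, 8)):
    """Component frequencies n * f0 + delta_f (assumed form of the shifted complex)."""
    return [n * f0 + delta_f for n in harmonics]

def common_fundamental(freqs):
    """Largest frequency (in whole Hz) of which every component is an integer multiple."""
    return reduce(gcd, freqs)

for delta_f in (0, 40, 100, 200):
    freqs = shifted_components(200, delta_f)
    fund = common_fundamental(freqs)
    print(delta_f, freqs, fund, [f // fund for f in freqs])
```

For Δf = 100 Hz the components are 900, 1100, 1300, and 1500 Hz, i.e., harmonics 9, 11, 13, and 15 of 100 Hz. For Δf = 200 Hz they are harmonics 5–8 of 200 Hz. For Δf = 40 Hz the only common fundamental is 40 Hz, so the stimulus is not a harmonic complex of 200 Hz.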


Fig 12.

Transposing low-frequency tones into high-frequency regions of the cochlea.

(A) An example of three sparse coefficient vectors, h, for transposed tones (TTs) with the three frequencies f0 = 229 Hz, 249 Hz, and 269.7 Hz. The resulting h vectors have the same nonzero indices, i.e., these stimuli cannot be differentiated based on their sparse representations. (B) The pdfs of the three TTs are noisy and inconclusive, as expected. (C) Predictions for 100 epochs; only the 1st peak of each pdf is considered. There are two distinct types of stimuli: (i) pure tones (blue), and (ii) TTs (red). Both types of stimuli are simulated with incremental fundamental frequencies of f0 ∈ [100 Hz, 500 Hz]. Each estimate is normalized relative to the fundamental f0. The model could estimate the f0 of the pure tones with a high degree of accuracy but could not predict those of the TTs at all (compare with [13]).


Fig 13.

Iterated rippled noise for different time delays and repetitions.

The panels show the results of 500 simulations for each type of iterated rippled noise (IRN) stimulus. The columns correspond to the delays d = 5, 4, and 2 ms, i.e., to the fundamental frequencies of 200, 250, and 500 Hz, respectively. The first row shows the delay-add simulations, and the second row shows the delay-subtract simulations. The results are derived from the first peaks of the resulting pdfs, and all estimations are taken from an interval of one octave around the appropriate fundamental frequency [42]. The simulations use Carney's model (Zilany et al. [27–29]) with stimuli of 70 dB SPL. The dictionary contained 1000 groups of sine-atoms with distinct CFs and 10 phases in each group (g = 10, Eq 3). Blue dots indicate rippled noise (one repetition), red dots correspond to IRN with 2 repetitions, and yellow dots correspond to IRN with 10 repetitions. (A-C) The delay-add simulations show distinct peaks around the 1/d frequencies. (D-F) The delay-subtract simulations show an accumulation of the inferred pitches at frequencies equal to or greater than 1/d ± 10%, but the results for this case are noisy and inaccurate relative to psychoacoustic measurements.
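The relation between the delay d and the 1/d frequency can be illustrated by generating a delay-add IRN and locating its strongest autocorrelation lag. This is a minimal sketch under assumptions: the sampling rate, the noise length, and the "add the delayed copy of the running output" network are illustrative, and the autocorrelation is used here only as a sanity check, not as the SC model described in the paper.

```python
import numpy as np

def iterated_rippled_noise(noise, delay_samples, n_iter, gain=1.0):
    """Delay-and-add network applied n_iter times.

    gain = +1 gives delay-add; gain = -1 gives delay-subtract.
    """
    y = noise.astype(float).copy()
    for _ in range(n_iter):
        delayed = np.zeros_like(y)
        delayed[delay_samples:] = y[:-delay_samples]   # copy of y delayed by d
        y = y + gain * delayed
    return y

def autocorr(x, lag):
    """Normalized autocorrelation of x at a positive integer lag."""
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

fs = 20000                          # assumed sampling rate (Hz)
d_ms = 5                            # delay of 5 ms, i.e., 1/d = 200 Hz
delay_samples = fs * d_ms // 1000   # 100 samples

rng = np.random.default_rng(0)
noise = rng.standard_normal(fs)     # 1 s of white Gaussian noise
irn = iterated_rippled_noise(noise, delay_samples, n_iter=10)

# Search lags below 2d so that the multiple at 2d is excluded.
lags = np.arange(1, 2 * delay_samples)
best_lag = lags[np.argmax([autocorr(irn, k) for k in lags])]
pitch_hz = fs / best_lag
print(best_lag, pitch_hz)
```

With `gain=-1.0` the same function produces the delay-subtract variant, for which the correlation at lag d becomes negative. The simple "largest positive lag" rule then no longer points at 1/d, in line with the less clear-cut results reported for panels (D-F).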


Fig 14.

Analyzing a recorded stimulus of a violin.

(A) The Fourier transform of the recorded signal. This is the note A5 (880 Hz) played with a bow (arco). The 880 Hz component and its harmonics are clearly seen. (B) Each time step Tsteps of the stimulus is processed separately. The results are collected to form the columns of the matrix Hg. (C) Each of the SC vectors (columns) of Hg is processed separately by the harmonic sieve to produce the pitch probability of that time step (Pg). (D) To compare between simulations, we average over the time steps to extract the most prominent pitch of the signal. The result is the usual pdf vector, and the estimated pitch is set to the maximum of this pdf.
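Step (D) amounts to averaging the columns of Pg and taking the location of the maximum. The sketch below uses a synthetic Pg with made-up values purely to show the shapes involved; the candidate grid and the number of time steps are assumptions.

```python
import numpy as np

def average_pitch_pdf(P_g):
    """Average per-time-step pitch pdfs (the columns of P_g) into a single pdf."""
    pdf = P_g.mean(axis=1)
    return pdf / pdf.sum()

candidates = np.arange(400.0, 1000.0, 20.0)   # assumed candidate pitches (Hz)
n_steps = 8                                   # assumed number of time steps

# Synthetic P_g: six time steps favor 880 Hz, two favor 440 Hz.
P_g = np.full((candidates.size, n_steps), 0.01)
i880 = int(np.argmin(np.abs(candidates - 880.0)))
i440 = int(np.argmin(np.abs(candidates - 440.0)))
P_g[i880, :6] += 1.0
P_g[i440, 6:] += 1.0
P_g /= P_g.sum(axis=0, keepdims=True)         # each column sums to 1

pdf = average_pitch_pdf(P_g)
estimated_pitch = candidates[np.argmax(pdf)]
print(estimated_pitch)
```

Averaging lets the pitch that dominates most time steps win even when a few steps point to another candidate.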


Fig 15.

Results for musical notes on a chromatic scale.

We analyzed different notes played on three musical instruments: a flute, a violin, and a piano. The results are shown on a chromatic (equal-tempered) musical scale. The colored labels next to the colored dots specify the notes played in the individual recordings. All of the recordings were downloaded from [55]. Although not exact, the model does manage to assign most of the measurements to the right note (pitch).
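Placing an estimated pitch on the equal-tempered chromatic scale is a short computation. The sketch below assumes the A4 = 440 Hz reference (consistent with A5 = 880 Hz in Fig 14) and the MIDI note-numbering convention; the function name is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return the nearest equal-tempered note name and the deviation in cents."""
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))   # MIDI number, A4 = 69
    name = NOTE_NAMES[midi % 12]
    octave = midi // 12 - 1
    note_hz = a4_hz * 2 ** ((midi - 69) / 12)            # frequency of that note
    cents = 1200 * math.log2(freq_hz / note_hz)          # signed deviation
    return f"{name}{octave}", cents

print(nearest_note(880.0))
print(nearest_note(261.63))
```

The deviation in cents gives one way to quantify how close an estimate is to the labeled note, since neighboring notes on the chromatic scale are 100 cents apart.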
