PLOS Computational Biology

Fig 1.

The model.
The model is composed of three main sections: (A) The cochlear model [26–29] transduces a one-dimensional input stimulus, s_in(t), into a two-dimensional matrix that represents the AN population response, S_AN(t,f_CF). (B) The AN’s spatiotemporal response is introduced into the sparse coding (SC) block to produce the sparse coefficient vector, h. The vector h carries invariant information of the input stimulus that we refer to as pitch cues. The (sparse) information in h represents harmonics in s_in(t). (C) Finally, the likelihood probability of the pitch given the vector h is extracted and denoted as pdf(f_p).

Fig 2.

Different complex harmonic stimuli with the same pitch.
(A) The Fourier transform (FT) of three complex harmonic stimuli with a fundamental frequency of f₀ = 240 Hz. The three signals have different spectral components: S_in,1(f) is composed of the first harmonic of f₀; S_in,2(f) consists of the first four successive harmonics of f₀; and S_in,3(f) is formed from harmonics 10–13. (B) The corresponding output of the cochlear model [26], i.e., the AN population responses for the three stimuli. The y-axis represents the normalized characteristic frequencies (CFs), i.e., CF divided by f₀, on a linear scale, and the x-axis shows the post-stimulus time in milliseconds. The cochlear input is a 15 ms long stimulus, and the output is taken from the last 5 ms. Note the different patterns of AN activity in the three cases: a stimulus with low frequencies excites the apical parts of the cochlea (lower part of the images), while a stimulus with higher frequencies excites the basal parts. Note also that the AN population responses define a unique spatiotemporal pattern of activity for each stimulus. All three stimuli have relatively low sound levels (30 dB SPL), which means that the cochlear response is linear.
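
As a rough illustration of the three stimuli in (A), the following sketch synthesizes harmonic complexes with f₀ = 240 Hz and a 15 ms duration. The sampling rate, the equal component amplitudes, and the zero starting phases are assumptions made for this example; the caption specifies only the harmonic numbers, f₀, and the duration.

```python
import numpy as np

def harmonic_complex(f0, harmonics, duration=0.015, fs=100_000):
    """Sum of equal-amplitude, zero-phase sinusoids at the given harmonics of f0.

    fs, the equal amplitudes, and the zero phases are illustrative assumptions.
    """
    t = np.arange(int(round(duration * fs))) / fs
    s = sum(np.sin(2 * np.pi * n * f0 * t) for n in harmonics)
    return t, s / len(harmonics)  # scale so that |s| never exceeds 1

f0 = 240.0
stimuli = {
    "s_in_1": harmonic_complex(f0, [1])[1],               # first harmonic only
    "s_in_2": harmonic_complex(f0, [1, 2, 3, 4])[1],      # harmonics 1-4
    "s_in_3": harmonic_complex(f0, [10, 11, 12, 13])[1],  # harmonics 10-13
}
```

All three signals share the same f₀ but occupy different spectral regions, which is why they excite different places along the cochlea.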

Fig 3.

Composing the dictionary D.
(A) An example of one atom in D: the AN population response to a 1.5 kHz sine tone, generated by the cochlear model [26]. The atom was normalized to a peak value of 1, as were all other population responses. The y-axis of the two-dimensional matrix represents the CF along the BM, and the x-axis is the post-stimulus time in milliseconds. (B) Each of the atoms, d_j, j ∈ [1, M], is vectorized into a column of D. These M columns are concatenated to form the dictionary matrix D. All the input signals used to create the dictionary have the same level of 30 dB SPL (i.e., within the linear region of the cochlea). In this example, we used only one atom per group (g = 1).
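
The construction in (B) can be sketched as follows. The atoms below are random placeholders standing in for cochlear-model outputs, and the function name and array sizes are hypothetical; only the peak normalization and the column-wise vectorization follow the caption.

```python
import numpy as np

def build_dictionary(atoms):
    """Stack 2-D population responses (CF x time) as the columns of D.

    Each atom is scaled to a peak absolute value of 1 before it is vectorized.
    """
    columns = []
    for atom in atoms:
        atom = np.asarray(atom, dtype=float)
        peak = np.max(np.abs(atom))
        if peak > 0:
            atom = atom / peak
        columns.append(atom.ravel())  # vectorize the 2-D response
    return np.column_stack(columns)

# Placeholder atoms (NOT real AN responses): 4 atoms, 8 CF channels x 16 time samples.
rng = np.random.default_rng(0)
atoms = [rng.random((8, 16)) * (j + 1) for j in range(4)]
D = build_dictionary(atoms)  # shape: (8 * 16, 4)
```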

Fig 4.

The sparse coefficient vector h and the final pitch probability vector.
(A) A simplified view of the SC methodology. The algorithm decomposes the two-dimensional signal S_AN,2(t,f_CF) into a linear combination of four atoms (columns) of D. The view shows only the primary values in h₂ (green indices) multiplied by the corresponding atoms. (B) The sparse coefficient solution vectors, h_k, for the three cases (k ∈ {1, 2, 3}). The green circles in the plot of h₂ correspond to the four terms in the simplified example of (A). All x-axes are normalized by the fundamental frequency f₀ = 240 Hz for convenience. Observe that the solutions h_k resemble the FTs of the respective stimuli (Fig 2A). (C) Using the pitch estimation unit (harmonic sieve), we can map the information in h_k, for k ∈ {1, 2, 3}, into a pitch probability vector, pdf(f_p). The y-axis of each pdf is multiplied by a constant (×100) for visual clarity. The red arrows indicate the locations of the maximum peaks, all of which occur at the fundamental frequency. In other words, it is most probable that all three stimuli represent the same pitch. Still, note that other options are also plausible, especially at rational ratios of f₀.
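
To make the mapping in (C) more concrete, here is a minimal sketch of a harmonic sieve. It is not the pitch estimation unit of the paper: the Gaussian slot shape, its width sigma, the atom frequency grid f_d, and the candidate grid f_p are all assumptions chosen for illustration. Each candidate pitch collects the coefficients of h that lie close to one of its integer multiples, and the scores are normalized to sum to one.

```python
import numpy as np

def harmonic_sieve(f_d, h, f_p, sigma=0.03):
    """Toy harmonic sieve (illustrative assumption, not the paper's estimator).

    f_d: atom frequencies (Hz), h: sparse coefficients, f_p: candidate pitches (Hz).
    """
    f_d = np.asarray(f_d, dtype=float)
    h = np.asarray(h, dtype=float)
    f_p = np.asarray(f_p, dtype=float)
    ratio = f_d[None, :] / f_p[:, None]       # harmonic number of each atom per candidate
    slot = np.maximum(np.round(ratio), 1.0)   # nearest sieve slot, at least the 1st harmonic
    weight = np.exp(-0.5 * ((ratio - slot) / sigma) ** 2)
    score = weight @ np.abs(h)
    return score / score.sum()                # normalize to a probability vector

f0 = 240.0
f_d = np.arange(60.0, 4000.0, 60.0)           # hypothetical atom frequencies
h = np.zeros_like(f_d)
h[np.isin(f_d, f0 * np.arange(1, 5))] = 1.0   # idealized h for harmonics 1-4
f_p = np.arange(100.0, 500.0, 1.0)
pdf = harmonic_sieve(f_d, h, f_p)
```

In this toy version the subharmonic at 120 Hz scores exactly as high as 240 Hz, because every harmonic of f₀ is also a harmonic of f₀/2. This echoes the remark that candidates at rational ratios of f₀ remain plausible; a practical sieve needs an additional rule to prefer the fundamental.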

Fig 5.

Comparing LS with SC.
(A) From left to right: the AN population response for a harmonic complex containing the 1st–4th harmonics, followed by the h coefficient vectors for the LS case (λ = 0.0) and for the sparse case (λ = 0.01). The y-axis of the population response shows the CFs normalized by the fundamental frequency (f₀ = 240 Hz) on a linear scale, and the x-axis indicates the post-stimulus time (from 10 ms to 15 ms). (B) Same as in (A), but for a complex tone stimulus that contains the 10th–13th harmonics. Note that for the lower-harmonic stimulus (A), the two solutions, h_LS and h_SC, are almost identical. For the stimulus with the higher harmonics (B), on the other hand, the difference is more substantial. Specifically, h_LS contains many more nonzero coefficients than h_SC that are unrelated to the original spectral structure of the signal (compare with the FTs in Fig 2A).
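
The contrast between the two solutions can be reproduced on synthetic data. The sketch below assumes that the SC block minimizes 0.5·||s − Dh||² + λ·||h||₁ and solves it with iterative soft thresholding (ISTA); both the objective and the solver are assumptions, since the caption states only that λ = 0 yields the LS case. The random dictionary, the noise level, and λ = 0.05 are arbitrary choices for this toy problem.

```python
import numpy as np

def ista(D, s, lam, n_iter=500):
    """Minimize 0.5*||s - D h||^2 + lam*||h||_1 by iterative soft thresholding."""
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    h = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = h - step * D.T @ (D @ h - s)    # gradient step on the quadratic term
        h = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft threshold
    return h

# Synthetic stand-in for the AN response: 4 active atoms out of 200, plus noise.
rng = np.random.default_rng(1)
D = rng.standard_normal((60, 200))
D /= np.linalg.norm(D, axis=0)
h_true = np.zeros(200)
h_true[[20, 40, 60, 80]] = 1.0
s = D @ h_true + 0.01 * rng.standard_normal(60)

h_ls = ista(D, s, lam=0.0)    # lam = 0: plain least squares (gradient descent)
h_sc = ista(D, s, lam=0.05)   # lam > 0: sparse solution
n_ls, n_sc = np.count_nonzero(h_ls), np.count_nonzero(h_sc)
```

With λ = 0 the soft threshold is inactive and essentially every coefficient ends up nonzero, whereas the penalized solution sets most coefficients exactly to zero.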

Fig 6.

Stimulus level invariance.
(A) The AN population response for the missing-fundamental harmonic complex tone of Eq 6, with f₀ = 225 Hz. The stimulus has an amplitude level of 30 dB SPL, and the AN population response is normalized to one, as usual. The x-axis shows post-stimulus time, and the y-axis denotes the (linear) mapping between locations along the cochlea and CFs. (B) The AN population response for the same spectral structure as in (A) (harmonics 3–8), but for a stimulus level of 90 dB SPL. At this relatively high stimulus level, the effects of the cochlear nonlinearity on the AN population response are apparent. (C–D) The solutions of the LS case (h_LS) and the SC case (h_SC) for the 30 dB SPL (C) and 90 dB SPL (D) stimulus levels, respectively. (E–F) Probability functions of the LS (S_p,LS) and the SC (S_p,SC) cases for the two amplitude levels, respectively. In the 30 dB SPL case (E), the same pitch is successfully estimated by both the LS and the SC simulations (blue and red arrows indicate maximum peaks). In the 90 dB SPL case (F), however, only the SC solution is robust and invariant to the stimulus level, as desired (the red arrow indicates the maximum peak). To account for the cochlear nonlinearities caused by changes in stimulus level, all simulations of the AN fibers in this section were made using Carney's cochlear model (Zilany et al. [27–29]).

Fig 7.

Comparing the performance of different dictionaries over moderate and high amplitude stimulus levels.
All simulations have the same spectral structure (Eq 6). This spectral structure is simulated for various fundamental frequencies, f₀, and the figures show the estimated pitch for each case (i.e., the maximum peak in each pdf). The estimations are taken from an interval of ±0.5 octaves around f₀. The rows, i.e., panels A-B and panels C-D, show the estimation results of the SC model for the two dictionaries D_sine and D_stack, respectively (see text). The columns refer to different stimulus levels: moderate (45 dB SPL) and high (90 dB SPL). The x-axis denotes the frequency of the lowest harmonic within the stimuli (i.e., the 3rd harmonic); the thick black dashed lines mark the main octave (f₀), and the thin black dashed lines mark the lower and upper octaves, i.e., 0.5f₀ and 2f₀, respectively. (A-B) At low frequencies, up to about 4 kHz for the lowest harmonic in the complex stimulus, the estimations of the D_sine dictionary converge to the expected frequencies for both moderate and high stimulus levels. From 4 kHz and above, however, the pitch estimations for the high stimulus level diverge from the main octave to other ratios of f₀. (C-D) The pitch estimations of the D_stack dictionary converge to the main octave more reliably, at both low and high frequencies and for both amplitudes.

Fig 8.

Detailed results for f₀ = 606.4 Hz.
The selected examples are taken from Fig 7 and show the SC coefficient vectors h and the pdfs for the two dictionaries and the two amplitudes. (A, C) The SC coefficient vectors h for the D_sine and D_stack dictionaries, respectively. (B, D) The resulting pdfs, over one octave around f₀ = 606.4 Hz, for the D_sine and D_stack dictionaries, respectively. Note that the SC coefficients of the two dictionaries differ, whereas the two pdfs are qualitatively similar.

Fig 9.

Resolved vs. unresolved representation of harmonic cues.
(A) The solutions h_k, k ∈ [1, 5], for the stimuli of Eq 7. We compare the SC solutions of the two dictionaries, D_sine (solid lines) and D_stack (dashed lines). D_sine consists of tone atoms, and D_stack consists of complex tones that contain six harmonics with decreasing amplitudes (1 to 1/6). All stimuli contain four harmonics of the same fundamental frequency, f₀ = 433 Hz, but at different spectral locations (r ∈ {1, 6, 10, 17, 22}). The x-axis is normalized by f₀ for convenience. The correlation between the SC solutions and the spectral components of the stimuli (Eq 7) is apparent. Note that solutions for signals with low-frequency components (such as h₁) have more prominent nonzero coefficients than those for the higher harmonics (e.g., h₅). A closer look at h₅ (the inset) shows that only two of the four harmonics are successfully reconstructed (the 23rd and 24th of harmonics 22–25). (B) Pitch probabilities (pdfs) of the five complex tones for the D_sine dictionary (see text). The right figure shows all f_p frequencies, and the left one shows fewer octaves around f₀. The numbers above the curves mark the four most prominent peaks of the pdfs, from the highest (1) to the fourth highest. Observe that all five solutions peak at the first harmonic, that is, the model predicts the same 433 Hz pitch for all stimuli. Additionally, most of the other plausible pitches, i.e., the other peaks, are located at harmonic ratios of f₀, that is, they represent octave-equivalence options. It is also instructive to note the f_LOCUS frequencies in the right figure of (B). These peaks indicate the additional possibility of perceiving the pitch at the locus of the stimulus's spectral energy rather than at f₀ [1]. All simulations were performed with Slaney's model and with a sound level of 45 dB SPL.

Fig 10.

Salience of complex tones.
(A) A comparison between the probability functions of two complex tones from Fig 9: the blue line is the pdf of the complex harmonic tone with harmonics 1–4, and the green line is the pdf of the fifth stimulus, which comprises harmonics 22–25. The x-axis is limited to one octave in order to compare the relative heights of the pitch candidates without considering the octave equivalence of consecutive harmonics. The blue and green arrows show the 1st and the 2nd largest peaks of the two curves, respectively. Computing the ratio between the 1st and the 2nd peaks of each curve yields a measure of the pitch's salience; a larger ratio indicates a more prominent percept of the pitch. (B) The ratio between the 1st and 2nd peaks for harmonic tones with four consecutive components at different harmonic numbers. The x-axis indicates the location of the first harmonic in each stimulus, and the y-axis shows the ratio between the 1st and the 2nd peaks (as demonstrated in (A)). Colored circles indicate the stimuli that are shown in Fig 9.
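
The salience measure described in (A) amounts to a ratio of two peak heights. A minimal sketch, assuming that a peak is any interior sample strictly larger than both of its neighbors; the synthetic two-bump pdf is a placeholder, not model output.

```python
import numpy as np

def salience_ratio(pdf):
    """Ratio between the largest and second-largest local maxima of a sampled pdf."""
    pdf = np.asarray(pdf, dtype=float)
    inner = pdf[1:-1]
    is_peak = (inner > pdf[:-2]) & (inner > pdf[2:])  # strict local maxima only
    peaks = np.sort(inner[is_peak])[::-1]             # descending peak heights
    if len(peaks) < 2:
        raise ValueError("need at least two local maxima")
    return peaks[0] / peaks[1]

def bump(f, center, width, height):
    """Gaussian bump used to fabricate a toy pdf."""
    return height * np.exp(-0.5 * ((f - center) / width) ** 2)

f = np.linspace(300.0, 600.0, 3001)  # one-octave window, as in the caption
pdf = bump(f, 433.0, 5.0, 1.0) + bump(f, 520.0, 5.0, 0.25)
ratio = salience_ratio(pdf)          # close to 4 for this toy pdf
```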

Fig 11.

Pitch shift of equally spaced harmonics.
(A-D) The vectors h for complex harmonic stimuli that contain the four harmonics 4–7 (Eq 8). The x-axis denotes f_d normalized by the fundamental frequency, f₀ = 200 Hz. The four panels show the stimulus of Eq 8 for the cases of Δf = 0 Hz, 40 Hz, 100 Hz, and 200 Hz, respectively. The zero-shift case represents a regular complex harmonic signal. With the 40 Hz shift, the stimulus is no longer a complex tone of 200 Hz. The third case (C) is a harmonic complex of 100 Hz (with harmonics 9, 11, 13, and 15). Finally, the Δf = 200 Hz shift again results in a harmonic complex of f₀ = 200 Hz, but this time with harmonics 5–8. (E) The peaks of the probability functions, pdf(f_p), for 500 uniformly shifted stimuli. Each stimulus is given by Eq 8, i.e., each signal includes the first four terms (1–4) of the fundamental f₀ = 200 Hz, plus an incremental frequency shift of Δf. The x-axis denotes the frequency of the lowest harmonic component of the input stimulus (f₀ + Δf), normalized by f₀ for visual clarity. The y-axis denotes the estimated pitch. To demonstrate the ambiguity of this process, we included the four largest peaks of each of the resulting pdfs. We focused the view on the regions around 100 Hz, 200 Hz, and 400 Hz on the y-axis; all other regions are mostly empty. Note the linear shifts in the pitch estimations and the change of these slopes as a function of Δf [47].
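
The arithmetic behind cases (A-D) can be checked directly. The sketch assumes that the shifted components have the form n·f₀ + Δf for n = 4, …, 7, which is consistent with the harmonic numbers quoted in the caption but is not taken from Eq 8 itself; the function name is hypothetical.

```python
from math import gcd

def shifted_complex(f0, delta_f, harmonics=range(4, 8)):
    """Return the shifted components, their greatest common divisor, and the
    harmonic numbers of the components relative to that divisor.

    Assumes components of the form n*f0 + delta_f with integer frequencies in Hz.
    """
    components = [n * f0 + delta_f for n in harmonics]
    common = 0
    for c in components:
        common = gcd(common, c)
    return components, common, [c // common for c in components]

for delta_f in (0, 40, 100, 200):
    components, common, numbers = shifted_complex(200, delta_f)
    print(delta_f, components, common, numbers)
```

For Δf = 100 Hz the components are 900, 1100, 1300, and 1500 Hz, i.e., harmonics 9, 11, 13, and 15 of 100 Hz; for Δf = 200 Hz they are harmonics 5–8 of 200 Hz. For Δf = 40 Hz the only common fundamental is 40 Hz, so the stimulus is no longer a harmonic complex of 200 Hz.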

Fig 12.

Transposing low-frequency tones into high-frequency regions of the cochlea.
(A) An example of three sparse coefficient vectors, h, for transposed tones (TTs) with the three frequencies f₀ = 229 Hz, 249 Hz, and 269.7 Hz. The resulting h vectors have the same nonzero indices, i.e., these stimuli cannot be differentiated based on their sparse representations. (B) The pdfs of the three TTs are noisy and inconclusive, as expected. (C) Predictions of 100 epochs; only the 1st peak in the pdf is considered. There are two distinct types of stimuli: (i) pure tones (blue) and (ii) TTs (red). Both types are simulated with incremental fundamental frequencies of f₀ ∈ [100 Hz, 500 Hz]. Each estimate is normalized by the fundamental f₀. The model estimates the f₀ of the pure tones with a high degree of accuracy but cannot predict that of the TTs at all (compare with [13]).

Fig 13.

Iterated rippled noise for different time delays and repetitions.
The figures show the results of 500 simulations for each IRN stimulus. The columns show the delays d = 5, 4, and 2 ms, which correspond to fundamental frequencies of 200, 250, and 500 Hz, respectively. The subplots in the first row show the delay-add simulations, and those in the lower row show the delay-subtract simulations. The results are derived from the first peaks of the resulting pdfs, and all estimations are taken from an interval of one octave around the appropriate fundamental frequency [42]. Simulations were done using Carney's model (Zilany et al. [27–29]) with stimuli of 70 dB SPL. The dictionary contained 1000 groups of sine atoms with distinct CFs and 10 phases in each group (g = 10, Eq 3). The blue dots indicate rippled noise (one repetition), the red dots correspond to IRN with 2 repetitions, and the yellow dots to IRN with 10 repetitions. (A-C) The delay-add simulations show distinct peaks around the 1/d frequencies. (D-F) The delay-subtract simulations show an accumulation of the inferred pitches at frequencies equal to or greater than 1/d ± 10%, but the results for this case are noisy and inaccurate relative to psychoacoustic measurements.
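
For readers unfamiliar with IRN, the sketch below generates delay-add and delay-subtract stimuli with d = 5 ms (1/d = 200 Hz). It assumes one common construction in which the running signal is repeatedly delayed and added to (gain +1) or subtracted from (gain −1) itself; the sampling rate, the noise source, and this particular network are assumptions and may differ from the stimuli used in the paper.

```python
import numpy as np

def iterated_rippled_noise(n_samples, delay_samples, n_iter, gain=1.0, seed=0):
    """Gaussian noise passed n_iter times through a delay-and-add stage.

    gain=+1 gives delay-add, gain=-1 gives delay-subtract (assumed construction).
    """
    rng = np.random.default_rng(seed)
    y = rng.standard_normal(n_samples)
    for _ in range(n_iter):
        delayed = np.zeros_like(y)
        delayed[delay_samples:] = y[:-delay_samples]  # shift the signal by the delay
        y = y + gain * delayed
    return y / np.max(np.abs(y))

def normalized_autocorr(x, lag):
    """Autocorrelation at the given lag, normalized by the zero-lag value."""
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

fs = 20_000                    # assumed sampling rate (Hz)
lag = int(round(0.005 * fs))   # d = 5 ms, i.e. 1/d = 200 Hz
irn_add = iterated_rippled_noise(fs, lag, n_iter=2, gain=+1.0)
irn_sub = iterated_rippled_noise(fs, lag, n_iter=2, gain=-1.0)
r_add = normalized_autocorr(irn_add, lag)
r_sub = normalized_autocorr(irn_sub, lag)
```

The delay-add signal is positively correlated with itself at lag d, while the delay-subtract signal is negatively correlated at that lag, which is one way to see why the two cases need not produce the same pitch estimates.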

Fig 14.

Analyzing a recorded stimulus of a violin.
(A) The Fourier transform of the recorded signal, an A5 note (880 Hz) played with a bow (arco). The 880 Hz component and its harmonics are clearly visible. (B) Each time step T_steps of the stimulus is processed separately. The results are collected to form the columns of the matrix H_g. (C) Each SC vector (column) of H_g is processed separately by the harmonic sieve to produce the pitch probability of that time step (P_g). (D) To compare between simulations, we average over the time steps to extract the most prominent pitch of the signal. The result is the usual pdf vector, and the estimated pitch is set to the maximum of this pdf.
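
Step (D) can be sketched in a few lines. The matrix P_g below is fabricated for illustration (small random values plus a consistent bump at 880 Hz), and the candidate grid f_p and the function name are hypothetical; only the averaging over time steps and the final argmax follow the caption.

```python
import numpy as np

def estimate_pitch(P_g, f_p):
    """Average the per-time-step pitch probabilities (columns of P_g) and
    return the candidate frequency at the maximum of the averaged pdf."""
    P_g = np.asarray(P_g, dtype=float)
    pdf = P_g.mean(axis=1)   # average over time steps
    pdf = pdf / pdf.sum()    # renormalize to sum to one
    return f_p[np.argmax(pdf)], pdf

# Fabricated example: 20 time steps, candidates from 400 Hz to 1790 Hz.
rng = np.random.default_rng(2)
f_p = np.arange(400.0, 1800.0, 10.0)
P_g = 0.1 * rng.random((len(f_p), 20))
P_g[f_p == 880.0, :] += 1.0  # every time step favors 880 Hz
pitch, pdf = estimate_pitch(P_g, f_p)
```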

Fig 15.

Results for musical notes on a chromatic scale.
We analyzed recordings of three musical instruments (a flute, a violin, and a piano) playing different notes. The results are shown on an equal-tempered chromatic scale. The colored labels next to the colored dots specify the notes played in the respective recordings. All of the recordings were downloaded from [55]. Although not exact, the model assigns most of the measurements to the correct note (pitch).
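
Assigning an estimated pitch to a note on the equal-tempered scale can be done as in the sketch below. It assumes the common A4 = 440 Hz reference and MIDI-style note numbering, neither of which is stated in the caption; the function name is hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz, a4_hz=440.0):
    """Return the nearest equal-tempered note name and the deviation in cents.

    Assumes A4 = a4_hz and MIDI numbering (A4 = 69, C4 = 60).
    """
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    note_hz = a4_hz * 2 ** ((midi - 69) / 12)
    cents = 1200 * math.log2(freq_hz / note_hz)
    return name, cents

name, cents = nearest_note(880.0)  # the violin note of Fig 14
```

Under these assumptions 880 Hz maps to A5 with zero deviation, in agreement with the note named in Fig 14.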
