Figure 1.
The temporal coherence model consists of two stages.
(A) Transformation of sound into a cortical representation [34]: It begins with a computation of the auditory spectrogram (left panel), followed by an analysis of its spectral and temporal modulations in two steps (middle and right panels, respectively): a multi-scale (or multi-bandwidth) wavelet analysis along the spectral dimension to create the frequency-scale responses, followed by a wavelet analysis of the modulus of these outputs to create the final cortical outputs
(right panel). (B) Coincidence and clustering: The cortical outputs at each time-step are used to compute a family of coincidence matrices (left panel). Each matrix is the outer product of the cortical outputs, computed separately for each modulation rate. The C-matrices are then stacked (middle panel) and simultaneously decomposed by a nonlinear auto-encoder network (right panel) into two principal components corresponding to the foreground and background masks, which are used to segregate the cortical response.
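The coincidence stage described in (B) can be sketched as follows; the array shapes, function name, and toy sizes are illustrative assumptions, not taken from the model's actual implementation:

```python
import numpy as np

def coincidence_stack(cortical_outputs):
    """Stack one coincidence matrix per modulation rate.

    cortical_outputs: array of shape (n_rates, n_channels), the cortical
    responses at a single time-step, one row per modulation rate. Each
    C-matrix is the outer product of that rate's channel vector with
    itself; the stacked result has shape (n_rates, n_channels, n_channels).
    """
    return np.einsum('ri,rj->rij', cortical_outputs, cortical_outputs)

# Toy usage: 4 modulation rates, 8 frequency-scale channels.
x = np.random.default_rng(0).standard_normal((4, 8))
C = coincidence_stack(x)
```

In the model, the stacked C-matrices would then feed the auto-encoder stage, which is not sketched here.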
Figure 2.
Stream segregation of tone sequences and complexes.
The top row of panels represents the "mixture" audio whose two segregated streams are depicted in the middle and bottom rows. (A) The classic case of well-separated alternating tones (top panel) becoming rapidly segregated into two streams (middle and bottom panels). (B) Continuity of the streams causes the crossing alternating tone sequences (top) to bounce, maintaining an upper and a lower stream (middle and bottom panels). (C) Continuity also helps a stream maintain its integrity despite a transient synchronization with another tone. (D) When the tone complexes in a sequence become desynchronized by more than 40 ms (top panel), they segregate into different streams despite a significant overlap (middle and bottom panels).
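An alternating two-tone stimulus like the one in (A) can be generated with a short sketch; the frequencies, tone duration, and sampling rate below are arbitrary choices, not the values used in the figure:

```python
import numpy as np

def alternating_tones(f_low=440.0, f_high=880.0, tone_dur=0.1,
                      n_pairs=5, fs=16000):
    """Build an A-B-A-B... sequence of alternating pure tones."""
    t = np.arange(int(tone_dur * fs)) / fs
    low = np.sin(2 * np.pi * f_low * t)
    high = np.sin(2 * np.pi * f_high * t)
    return np.concatenate([np.concatenate([low, high])
                           for _ in range(n_pairs)])

seq = alternating_tones()
```

With the defaults this yields one second of audio: ten 100 ms tones alternating between 440 Hz and 880 Hz.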
Figure 3.
Segregation of harmonic complexes by the temporal coherence model.
(A) A sequence of alternating harmonic complexes (pitches = 500 and 630 Hz). (B) The complexes are segregated using all spectral and pitch channels. Closely spaced harmonics (1890 and 2000 Hz) mutually interact, and hence their channels are only partially correlated with those of the remaining harmonics, becoming weak or even vanishing in the segregated streams.
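A harmonic complex like those in (A) can be synthesized as a sum of sinusoids at integer multiples of the fundamental; the number of harmonics, duration, and sampling rate below are illustrative assumptions:

```python
import numpy as np

def harmonic_complex(f0, n_harmonics=8, dur=0.2, fs=16000):
    """Sum of equal-amplitude sinusoids at integer multiples of f0."""
    t = np.arange(int(dur * fs)) / fs
    return sum(np.sin(2 * np.pi * f0 * k * t)
               for k in range(1, n_harmonics + 1))

# The two pitches used in the figure.
c500 = harmonic_complex(500.0)
c630 = harmonic_complex(630.0)
```

Alternating such complexes in time reproduces the kind of stimulus shown in panel (A).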
Figure 4.
Segregation of speech mixtures.
(A) Mixture of two sample utterances (left panel) spoken by a female (middle panel) and a male (right panel); pitch tracks of the utterances are shown below each panel. (B) The segregated speech using all C-matrix columns. (C) The segregated speech using only coincidences among the frequency-scale channels (no pitch information). (D) The segregated speech using the channels surrounding the pitch channels of the female speaker as the anchor.
Figure 5.
Segregation of speech utterances based on auxiliary functions.
(A) Mixture of two sample utterances (right panel) spoken by a female (left panel) and a male (middle panel) speaker; (B) The inter-lip distance of the female saying “twice each day” used as the anchor to segregate the mixture into its target female speech (middle panel) and the remaining male speech (bottom panel); (C) The envelope of the female speech “twice each day” used as the anchor to segregate the mixture into its target female speech (middle panel) and the remaining male speech (bottom panel).
Figure 6.
(A) Box plot of the SNR of the segregated speech and the mixture over 100 mixtures from the TIMIT corpus. (B) (Top) Notation used for the coincidence measures computed between the original and segregated sentences plotted in the panels below. (Middle) Distribution of coincidence in the cortical domain between each segregated speech signal and its corresponding original version (violet) and the original interferer (magenta); 100 pairs of sentences from the TIMIT corpus were mixed together with equal power. (Bottom) Scatter plot of the difference between the correlations of each segregated sentence with the two original sentences, demonstrating that the two segregated sentences correlate well with different original sentences.
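The evaluation measures in this figure can be sketched as a simple SNR and a normalized correlation; the function names and exact definitions below are our assumptions, not necessarily those used to produce the plotted measures:

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of an estimated signal against its clean reference, in dB."""
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def norm_corr(a, b):
    """Normalized (zero-mean) correlation coefficient between two signals."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy check: a noisy copy of a sinusoid.
t = np.linspace(0, 1, 8000, endpoint=False)
clean = np.sin(2 * np.pi * 100 * t)
noisy = clean + 0.1 * np.random.default_rng(1).standard_normal(t.size)
```

A segregated sentence would be compared against both originals this way; the difference of the two correlations gives the quantity scattered in the bottom panel.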
Figure 7.
Extraction of speech from noise and music.
(A) Speech mixed with street noise containing many overlapping spectral peaks (left panel). The two signals are uncorrelated and hence readily segregated, allowing the speech to be reconstructed (right panel). (B) Extraction of speech (right panel) from a mixture of speech and a sustained oboe melody (left panel).