Segregating Complex Sound Sources through Temporal Coherence
Figure 1
The temporal coherence model consists of two stages.
(A) Transformation of sound into a cortical representation [34]: The model first computes the auditory spectrogram (left panel), then analyzes its spectral and temporal modulations in two steps (middle and right panels, respectively): a multi-scale (i.e., multi-bandwidth) wavelet analysis along the spectral dimension to create the frequency-scale responses, followed by a wavelet analysis of the modulus of these outputs to create the final cortical outputs
(right panel). (B) Coincidence and clustering: The cortical outputs at each time-step are used to compute a family of coincidence matrices (left panel). Each C-matrix is the outer product of the cortical outputs, computed separately for each modulation rate. The C-matrices are then stacked (middle panel) and simultaneously decomposed by a nonlinear auto-encoder network (right panel) into two principal components corresponding to the foreground and background masks, which are used to segregate the cortical response.
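
The coincidence step in (B) can be illustrated with a minimal NumPy sketch. It assumes, for illustration only, that the cortical outputs at one time-step are real-valued and stored as an array with one row per modulation rate and one column per flattened frequency-scale channel. The function name `coincidence_matrices` and the array sizes are hypothetical and do not come from the original model code; the auto-encoder decomposition is not shown.

```python
import numpy as np

def coincidence_matrices(cortical):
    """Return one outer-product (coincidence) matrix per modulation rate.

    cortical: array of shape (n_rates, n_channels) holding the cortical
    outputs at a single time-step (assumed real-valued for this sketch).
    Result: array of shape (n_rates, n_channels, n_channels), i.e. the
    per-rate matrices already stacked along the first axis.
    """
    cortical = np.asarray(cortical, dtype=float)
    # For each rate r, C[r, i, j] = cortical[r, i] * cortical[r, j]
    return np.einsum("ri,rj->rij", cortical, cortical)

# Hypothetical sizes: 4 rates, 8 frequency channels, 3 scales
rng = np.random.default_rng(0)
n_rates, n_freqs, n_scales = 4, 8, 3
outputs = rng.standard_normal((n_rates, n_freqs * n_scales))

C = coincidence_matrices(outputs)
print(C.shape)  # (4, 24, 24)
```

Each slice `C[r]` is symmetric, and its entry `(i, j)` is large in magnitude when channels `i` and `j` are strongly active at the same time-step, which is the coincidence measure that the clustering stage operates on.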