Sparse Codes for Speech Predict Spectrotemporal Receptive Fields in the Inferior Colliculus
(a) Stimuli used to train the model consisted of examples of recorded speech. The blue curve represents the raw sound pressure waveform of a woman saying, “The north wind and the sun were disputing which was the stronger, when a traveler came along wrapped in a warm cloak.” (b) The raw waveforms were first put through one of two preprocessing steps meant to model the earliest stages of auditory processing, producing either a spectrogram or a “cochleogram” (not shown; see Methods for details). In either case, the power spectrum across acoustic frequencies is displayed as a function of time, with warmer colors indicating high power and cooler colors indicating low power. (c) The spectrograms were then divided into overlapping 216 ms segments. (d) Subsequently, principal components analysis (PCA) was used to project each segment onto the space of the first two hundred principal components (first ten shown), reducing the dimensionality of the data to make further analysis tractable while retaining its basic structure. (e) These projections were then input to a sparse coding network in order to learn a “dictionary” of basis elements analogous to neuronal receptive fields, which can then be used to form a representation of any given stimulus (i.e., to perform inference). We explored networks capable of learning either “hard” (L0) sparse dictionaries or “soft” (L1) sparse dictionaries (described in the text and Methods) that were undercomplete (fewer dictionary elements than PCA components), complete (equal number of dictionary elements), or overcomplete (greater number of dictionary elements).
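The steps in panels (c) through (e) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the network used in the paper: it covers only the "soft" (L1) case, it substitutes ISTA (iterative soft thresholding) for inference and a simple gradient step for dictionary updates, and it runs on random data standing in for a spectrogram. All function names, array sizes, and parameter values (`segment`, `pca_project`, `ista`, `learn_dictionary`, `lam`, `lr`, and so on) are assumptions chosen for illustration; the paper's actual settings (216 ms segments, 200 principal components) are much larger.

```python
import numpy as np

def segment(spec, width, hop):
    """Cut a (freq, time) spectrogram into overlapping windows, one flattened row each."""
    n_time = spec.shape[1]
    starts = range(0, n_time - width + 1, hop)
    return np.stack([spec[:, s:s + width].ravel() for s in starts])

def pca_project(X, k):
    """Project the rows of X onto the first k principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    return Xc @ components.T, components, mean

def ista(Y, D, lam, n_iter=100):
    """L1 inference: minimize 0.5*||Y - A D||^2 + lam*||A||_1 over coefficients A."""
    L = np.linalg.norm(D, 2) ** 2  # Lipschitz constant of the quadratic term's gradient
    A = np.zeros((Y.shape[0], D.shape[0]))
    for _ in range(n_iter):
        Z = A - ((A @ D - Y) @ D.T) / L                         # gradient step
        A = np.sign(Z) * np.maximum(np.abs(Z) - lam / L, 0.0)   # soft threshold
    return A

def learn_dictionary(Y, n_atoms, lam=0.1, n_epochs=20, lr=0.1, seed=0):
    """Alternate sparse inference with a gradient step on the dictionary rows."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((n_atoms, Y.shape[1]))
    D /= np.linalg.norm(D, axis=1, keepdims=True)
    for _ in range(n_epochs):
        A = ista(Y, D, lam)
        D += lr * A.T @ (Y - A @ D) / Y.shape[0]
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12   # keep unit-norm atoms
    return D

# Synthetic stand-in for a spectrogram: 16 frequency bins by 200 time frames.
rng = np.random.default_rng(0)
spec = rng.standard_normal((16, 200))

X = segment(spec, width=10, hop=5)          # (c) overlapping segments
Y, components, mean = pca_project(X, k=20)  # (d) dimensionality reduction
D = learn_dictionary(Y, n_atoms=40)         # (e) overcomplete: 40 atoms > 20 PCA dims
A = ista(Y, D, lam=0.1)                     # inference: sparse code for each segment

# Map dictionary elements back to spectrogram space to view them as receptive fields.
fields = (D @ components).reshape(-1, 16, 10)
```

Setting `n_atoms` below, equal to, or above `k` gives the undercomplete, complete, and overcomplete regimes described in the caption. A "hard" (L0) variant would replace the soft-threshold line with a rule that keeps only a fixed number of nonzero coefficients; that variant is not shown here.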