Understanding Auditory Spectro-Temporal Receptive Fields and Their Changes with Input Statistics by Efficient Coding Principles

Spectro-temporal receptive fields (STRFs) have been widely used as linear approximations to the signal transform from sound spectrograms to neural responses along the auditory pathway. Their dependence on statistical attributes of the stimuli, such as sound intensity, is usually explained by nonlinear mechanisms and models. Here, we apply an efficient coding principle which has been successfully used to understand receptive fields in early stages of visual processing, in order to provide a computational understanding of the STRFs. According to this principle, STRFs result from an optimal tradeoff between maximizing the sensory information the brain receives, and minimizing the cost of the neural activities required to represent and transmit this information. Both terms depend on the statistical properties of the sensory inputs and the noise that corrupts them. The STRFs should therefore depend on the input power spectrum and the signal-to-noise ratio, which is assumed to increase with input intensity. We analytically derive the optimal STRFs when signal and noise are approximated as Gaussians. Under the constraint that they should be spectro-temporally local, the STRFs are predicted to adapt from being band-pass to low-pass filters as the input intensity reduces, or the input correlation becomes longer range in sound frequency or time. These predictions qualitatively match physiological observations. Our prediction as to how the STRFs should be determined by the input power spectrum could readily be tested, since this spectrum depends on the stimulus ensemble. The potentials and limitations of the efficient coding principle are discussed.


Introduction
In response to acoustic input signals, neurons in the auditory pathway are typically selective to sound frequency $f$ and have particular response latencies. At least ignoring cases with $f < 4$ kHz, in which neuronal responses often phase lock to the sound waves, a spectro-temporal receptive field (STRF) is often used to describe the tuning properties of a neuron [1,2,3,4]. This is a two-dimensional function $\mathrm{STRF}(f,\tau)$ that reports the sensitivity of the neuron at response latency $\tau$ to acoustic inputs of frequency $f$ for a given stimulus ensemble (i.e., given input statistics). More specifically, in a stimulus ensemble, the power $S(f,t)$ of the acoustic input at frequency $f$ at time $t$ fluctuates around an average level denoted by $\bar{S}(f)$. If we let $O(t)$ denote the neuron's response at time $t$ (typically its spike rate), then $\mathrm{STRF}(f,\tau)$ best approximates the linear relationship between $O(t)$ and $S(f,t)$ in this stimulus ensemble as

$$O(t) = \iint \mathrm{STRF}(f,\tau)\, S(f,t-\tau)\, d\tau\, df + \text{spontaneous activity} \qquad (1)$$

Note that in this paper, we refer to $S(f,t)$ as the input spectrogram, although some authors also include the average input power $\bar{S}(f)$. Though $S(f,t)$ is not a full description of the acoustic input, since it ignores features such as the phase of the oscillation in the sound wave, it is the only relevant aspect of the auditory input as far as the STRF is concerned. Note that if we use $O(t)$ to denote the deviation of the neural response from its spontaneous activity level, then both $O(t)$ and $S(f,t)$ have zero mean. We will use this simplification throughout the paper. In studies in which the temporal dimension is omitted, the STRF is called the spectral receptive field (SRF). Figure 1 sketches a typical STRF. This has excitatory and inhibitory regions, reflecting its preferred frequency and response latency. For example, if $\mathrm{STRF}(f,\tau)$ peaks at frequency $f = \hat{f}$ and latency $\tau = \hat{\tau}$, then this neuron prefers frequency $\hat{f}$ and should respond to an input impulse $S(f,t) = \delta(f-\hat{f})\,\delta(t)$ of this frequency with latency $\hat{\tau}$. We will also refer to $\mathrm{STRF}(f,\tau)$ as the receptive field, the filter kernel, or the transfer function from input to neural responses, as these all convey the same or similar meanings. A neuron's STRF is typically estimated using reverse correlation methods [5,4].
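As a minimal sketch of how the linear model in equation (1) is evaluated on sampled data, the following Python function replaces the double integral by a sum over frequency channels and latency bins. The function name `strf_response`, the list-of-lists layout, and the toy STRF values are illustrative assumptions, not part of the original study.

```python
def strf_response(strf, spectrogram, spontaneous=0.0):
    """Discrete version of equation (1).

    strf[f][tau]      : weight for frequency channel f at latency bin tau
    spectrogram[f][t] : zero-mean input power fluctuation S(f, t)
    Returns O(t) for every time bin t.
    """
    n_freq = len(strf)
    n_lat = len(strf[0])
    n_time = len(spectrogram[0])
    out = []
    for t in range(n_time):
        total = spontaneous
        for f in range(n_freq):
            for tau in range(n_lat):
                if t - tau >= 0:  # causal: only present and past inputs contribute
                    total += strf[f][tau] * spectrogram[f][t - tau]
        out.append(total)
    return out


# Hypothetical STRF: 2 frequency channels, 3 latency bins,
# excitatory at latency 1 and inhibitory at latency 2 in channel 1.
toy_strf = [[0.0, 0.0, 0.0],
            [0.0, 1.0, -0.5]]
# Input impulse in channel 1 at time bin 2.
impulse = [[0.0] * 6,
           [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]]
response = strf_response(toy_strf, impulse)
```

With an impulse input, the response traces out the STRF row of the stimulated channel, delayed by the impulse time, which is the sense in which the STRF reports sensitivity as a function of latency.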
However, there are extensive nonlinearities in the signal transformation along the auditory pathway. Indeed, the STRF formulation of neural responses, though linear in spectral power, is already a second-order nonlinear function of the auditory sound wave. There are two kinds of nonlinearities when inputs are represented as spectrograms. The simpler one is a static nonlinearity $f_{\mathrm{nonlinear}}(O(t))$, which when applied to the linear approximation $O(t)$ of equation (1) enables better predictions of the neural responses [6,7]. This static nonlinearity however does not alter the spectro-temporal selectivity of the neuron seen in the linear STRF. This paper is interested in the more complex nonlinearity that the STRFs are dependent on the stimulus ensemble used to estimate them [1,5,8,9]. For example, the STRFs are wider when the stimuli are narrow-band rather than wide-band [10], or when the stimuli are animal vocalizations rather than noise [11]. The STRF (or SRF) also becomes more band-pass when sound intensity increases. The dependence of the STRFs on the stimulus ensemble holds, for example, for type IV neurons in the cochlear nucleus of cats [12,13], the inferior colliculus (IC) of the frog [8] and the gerbil [7], and the field L region of the songbird (which is analogous to mammalian auditory cortex) [14]. (The dependence on sound intensity also holds for the linear relationship between the auditory nerve responses and input sound waves [5].) Nonlinearities in the auditory system become progressively stronger further from the periphery.
Despite the nonlinearities, the concept of the STRF is still widely used, not only because it provides a meaningful description of the spectro-temporal selectivity of the neurons in a given stimulus ensemble, but also because it can predict neural responses to novel stimuli reasonably well, as long as the stimuli are drawn from the same stimulus ensemble as that used to estimate the STRF in the first place. Reasonable predictions from the STRFs have been obtained for the responses of auditory nerves (see [15]) and auditory midbrain neurons [6,7,16] (also see [2]). They have also been obtained for responses of the auditory cortical neurons when the stimulus ensemble is composed of biologically more meaningful static or dynamic ripples (broadband sound with sinusoidally modulated spectral envelopes and their linear combinations [17,18,19]). If the linear neural filter is augmented to include the filtering performed by the head and ears, it is also possible to predict the preferred locations of sound sources of auditory cortical neurons based on the linear neural filter for input spectrograms [20]. Meanwhile, linear STRF models fail to capture many complex phenomena, particularly in the auditory cortex, and nonlinearities are not limited to being just static or monotonic. It has been suggested that some auditory cortical neurons process auditory objects in a highly non-linear manner, by selectively responding to a weak object component while ignoring loud components that occupy the same region in frequency space in auditory mixtures of these object components [21], and some prefer low over high spectral contrast sounds [22]. Strong nonlinearities in the auditory processes have long motivated nonlinear models of auditory responses (e.g., [5,12,23]).
This paper aims to understand from a computational, rather than a mechanistic, perspective why the auditory encoding transform should depend on the stimulus ensemble in the ways observed. More specifically, the paper focuses on cases in which STRFs can reasonably capture neural responses, and aims to identify and understand the computational goal of the STRFs for a given stimulus ensemble, that is, to find a metric according to which the STRFs are optimal for the ensemble. This would provide a rationale for how the physiologically measured STRFs should depend on or adapt to the stimulus ensemble. This paper does not address what linear or nonlinear mechanisms could build the optimal STRFs, or whether or how nonlinear auditory processes enable the adaptation of the STRFs to the stimulus ensemble. Existing computational models of auditory neurons, including ones with the notion that cochlear hair cells perform independent component analysis to provide an efficient code for inputs using spikes in the auditory nerves [24,25], cannot explain the observed dependence of the STRFs on the stimulus ensemble (see Discussion for more details).
Restricting attention to the temporal properties of STRF, Lesica and Grothe [26] observed that the temporal filter in STRF adapted to the level of ambient noise in the input environment. In particular, the temporal receptive field in the STRF changed from being bandpass to being low pass with the increase of ambient noise. They argued using a simple model that such adaptation in the STRF enables more efficient coding of the input information.

Author Summary
Spectro-temporal receptive fields (STRFs) have been widely used as linear approximations of the signal transform from sound spectrograms to neural responses along the auditory pathway. Their dependence on the ensemble of input stimuli has usually been examined mechanistically as a possibly complex nonlinear process. We propose that the STRFs and their dependence on the input ensemble can be understood by an efficient coding principle, according to which the responses of the encoding neurons report the maximum amount of information about the sensory input, subject to limits on the neural cost in representing and transmitting information. This proposal is inspired by the success of the same principle in accounting for receptive fields in the early stages of the visual pathway and their adaptation to input statistics. The principle can account for the STRFs that have been observed, and the way they change with sound intensity. Further, it predicts how the STRFs should change with input correlations, an issue that has not been extensively investigated. In sum, our study provides a computational understanding of the neural transformations of auditory inputs, and makes testable predictions for future experiments.

Figure 1. A schematic example of a typical spectro-temporal receptive field, plotted with a reversed abscissa. This STRF has one excitatory and three inhibitory regions, prefers frequency $\hat{f}$, and evokes response at a typical latency $\hat{\tau}$.

This study applies the principles of efficient coding to understand the auditory STRF and its variations with sound intensities and other input characteristics. It generalizes the work of Lesica and Grothe [26] to understand the temporal and spectral filtering characteristics of STRF adaptation to changes in noise, signal and correlations in input statistics. Explicitly, the principle of efficient coding states that the neural receptive fields should enable the neural responses to transmit as much sensory information as possible to the central nervous system, subject to the limitation in neural cost in representing and transmitting information. This principle has been proposed [27] and successfully applied to the visual system to understand the receptive fields in the early visual pathway [28,29,30,31,32,33] (see review [34]). We will borrow heavily from the techniques and intuitions developed for vision to derive and explain the results in this paper.
To make initial progress, it is necessary to start with some simplifying assumptions. First, we assume that the statistical characteristics of the stimulus ensemble do not change more rapidly than the speed at which the sensory encoding adapts, so that the stimulus ensemble can be approximated as being stationary as far as optimal encoding is concerned. Knowing when this assumption does not hold tells us when the encoding is not optimal, e.g., when one sees poorly for a brief moment before the visual encoding adapts to a sudden change from a dark room to a bright garden. Second, for mathematical convenience, we assume that the linear STRF model as in equation (1) can approximate adapted auditory neural responses reasonably well. As noted above, this assumption often does not hold, particularly for auditory cortical neurons. This paper leaves the extension of the optimal encoding to nonlinear cases for future studies. Third, to derive a closed-form analytical solution to the optimal STRF, we assume that the input statistics in the stimulus ensemble can be approximated as being Gaussian, with higher order correlations in the input contributing only negligibly to the inefficiency of the representation in the original sensory inputs. Although it is known that the natural auditory inputs are far from Gaussian [35], as for the case of vision, the discrepancy may have only a limited impact on the input inefficiency, as measured by the amount of information redundancy in the original sensory input [36,37,38].
To understand how sensory inputs should be recoded to increase coding efficiency, we start with visual encoding to draw insights and make analogies with auditory encoding. In vision, large amounts of raw data about the visual world are transduced by photoreceptors. However, the optic nerve, which transmits the input data to the visual cortex via the thalamus, can only accommodate a dramatically smaller data rate. It has thus been proposed that early visual processes use an efficient coding strategy to encode as much information as possible given the limited bandwidth [27,34], in other words, to recode the data such that the redundancy in the data is reduced and consequently the data can be transmitted by the limited bandwidth. Compression (while preserving most information) is possible since images are very redundant [39,40,41,42], e.g., with strong correlations between visual inputs at nearby points in time and space. Removing such correlations can cut down the data rate substantially [34].
One way to remove the correlations is to transform the raw input $S$ into a different representation $O$ in neural responses that has a much smaller data rate than $S$, yet preserves essential input information. This transform is often approximated by the visual receptive field, analogous to the auditory STRFs. For instance, the (spatial) center-surround receptive fields of the retinal ganglion cells help remove spatial redundancy [30,31,43]. They do this by making the ganglion cells preferentially respond to spatial contrast in the input, and so eliminating responses to visual locations whose input is redundant with that of their neighbors. Consequently, the responses of retinal ganglion cells are much less correlated than those of the photoreceptors, making their representation much more efficient. One facet of this efficient encoding hypothesis is that the optimal receptive field transform should depend on the statistical properties, such as the correlation structure and intensity, of the input. This dependence has been used to explain how visual receptive field characteristics, such as the sizes of center-surround regions and the color tuning of retinal neurons, or the ocular dominance properties of striate cortical neurons, adapt to changes in input statistics [32,34,44,45,46,47]. In the auditory system, information redundancy is also reduced along the auditory pathway [48]. Although this redundancy reduction was only investigated in the neural responses to sensory inputs rather than in the coding (STRF) transform leading to the neural responses, it suggested that coding efficiency is one of the goals of early auditory processes.
More formally, the efficient coding scheme is depicted in Figure 2A. The input contains sensory signal $S$ and noise $N$ (e.g., input sampling noise). The net input $S+N$ is encoded by a linear transfer function $K$ into output

$$O = K(S+N) + N_o,$$

which also contains additional noise $N_o$ introduced in the encoding process. When the input has multiple channels, e.g., many different photoreceptors or hair cells, $S = (S_1, S_2, \ldots, S_j, \ldots)$ is a vector with many components, as indeed is $N$. Output $O$ is a vector representing the neural population responses from many neurons. For output neuron $i$, we have

$$O_i = \sum_j K_{ij}(S_j + N_j) + N_{o,i}.$$

Therefore $K$ is a matrix, and its $i$th row $(K_{i1}, K_{i2}, \ldots, K_{ij}, \ldots)$ models the receptive field for output neuron $i$ as the array of effective weights from input receptors $j$ to output neuron $i$. In the particular example when input neurons are photoreceptors and output neurons are retinal ganglion cells, $K_{ij}$ is the effective connection from photoreceptor $j$ to ganglion cell $i$ (implemented via the interneurons in the amacrine cell layers of the retina), and collectively, $(K_{i1}, K_{i2}, \ldots, K_{ij}, \ldots)$ describe the linear receptive field of this ganglion cell. We consider the problem of finding an optimal $K$ that maximizes the information extracted by $O$ about $S$, i.e., the mutual information $I(O;S)$ [49] between $O$ and $S$, subject to a given cost of the neural encoding, which depends on the responses in a way we will describe shortly.
Therefore, the optimal $K$ should minimize the objective function

$$E(K) = \text{neural cost} - \lambda\, I(O;S),$$

where $\lambda$ is a parameter whose value specifies a particular balance between the needs to minimize costs and to maximize extracted information. Neural costs can arise from various sources, such as the metabolic energy cost for generating neural activities or spikes [50] and the cost of thicker axons to transmit higher rates of neural firing. We follow a formulation that has been productive in vision [31,34], and model the neural cost as $\sum_i \langle O_i^2\rangle$, where $\langle\ldots\rangle$ indicates the average over the stimulus ensemble. This gives

$$E(K) = \sum_i \langle O_i^2\rangle - \lambda\, I(O;S) \qquad (4)$$

It has been shown [29,33,51,34] that the $K$ that provides the most efficient coding according to $E(K)$ has the following properties. At high signal-to-noise ratio (SNR), $K$ is such that $O$ extracts the difference between correlated channels, and thus avoids transmitting redundant information. Hence, for example, in photopic conditions, retinal ganglion cells have center-surround spatial receptive fields which extract the spatial contrast of the input. By contrast, at low SNR, $K$ is a smoothing filter that averages out input noise instead of reducing redundancy. This avoids spending neural cost on transmitting noise. Hence, for example, in scotopic conditions, when SNR can be considered as being low, the receptive fields of retinal ganglion cells expand the sizes of their center regions and weaken their suppressive surrounds [52]. We will apply this framework to auditory encoding to understand STRFs and their adaptation to stimulus ensembles.

Auditory encoding system and its comparison to vision
To apply the efficient coding principle to auditory STRFs, we borrow insights from vision by making an analogy between (aspects of) the auditory and visual systems. For simplicity, we start by ignoring input noise. While sound signals are typically air vibrations over time, at the input sampling stage, they are sampled as $S_{f,t}$ from a continuous time-frequency representation $S(f,t)$, namely the response at time $t$ of a hair cell tuned to sound vibration frequency $f$. This is analogous to visual input sampling, in which the response of a photoreceptor at location $i$ samples the light signal in the form of electromagnetic vibrations. Auditory hair cells are tonotopically arranged in the cochlea, so that neighboring hair cells are tuned to nearby sound frequencies. Therefore, at any instant $t$, the response pattern $(S_{f_1,t}, S_{f_2,t}, \ldots, S_{f_i,t}, \ldots)$ as a function of the hair cell's location $i$ over the cochlea is an auditory "image" of the pattern of powers across sound frequencies, analogous to a retinal image. (In our formulation, we focus on sampling the intensity or power in $S_{f,t}$, and ignore the phase of the sound wave at frequency $f$. This is because (1) auditory nerve responses do not encode the phase except for low frequency inputs via phase locking, and (2), as mentioned, our goal is to understand the STRFs, which do not concern the phase information.) While a retinal image is two dimensional in space (with one additional dimension in time), the auditory "image" at any instant $t$ is one dimensional in sound frequency $f$. One may use time $t$ as the second dimension such that $S_{f,t}$ for all $f$ and $t$ collectively can be seen as a single discrete sample of the two-dimensional auditory "image". When input noise $N$ is included, input $S$ becomes $S+N$.

Figure 2 (partial caption). ... into a time-frequency representation $S(f,t)$, as the population activities of the auditory nerves, which is the input to the efficient encoding system. Signal and noise pass through a series of brain nuclei such as cochlear nucleus, superior olive, inferior colliculus, etc. The current work proposes that the effective transform STRF of the spectrogram that is collectively realized by these nuclei is, in its linear form, the optimal filter $K$ implied by the efficient coding principle. The output $O(t)$ is the activity of neurons in a higher nucleus. (C) Three steps of signal flow within the linear encoding step $K$ or STRF in (A) and (B). Note that these three steps are merely abstract algorithmic steps, rather than neural implementation processes for the effective transform $K$ or STRF. doi:10.1371/journal.pcbi.1002123.g002
As for vision, we explore whether the auditory STRFs can be partly understood by the goal of efficiently coding auditory information. The sensory input is sampled as $S+N$, the responses of the cochlear hair cells. This input is encoded by the STRFs to give rise to outputs $O$ as the neural activities of a higher nucleus, such as the inferior colliculus (IC) or the auditory cortex (Figure 2B). The STRF is then analogous to a spatial receptive field, such as that of the retinal ganglion cells. Thus the STRF should be determined by the statistics of the auditory inputs, and in particular, the correlation

$$R^S_{ij} = \langle S_{(f,t)_i}\, S_{(f,t)_j}\rangle,$$

where $(f,t)_i$ labels a particular spectrotemporal combination of a frequency value $f$ and time $t$. Note that for $i \neq j$, either the frequency $f$ or the time $t$, but not both, may be equal in the two indices $(f,t)_i$ and $(f,t)_j$. (Here, for simplicity we assume, or preprocess the signal, such that all inputs have zero mean, i.e., $\langle S_{(f,t)_i}\rangle = 0$, just like the input signal fluctuation $S(f,t)$ around the ensemble average in the definition of the STRF in equation (1).) As in vision, natural auditory inputs express substantial correlations between inputs of neighboring frequencies and at neighboring temporal instances. When the input SNR is sufficiently high, an optimal STRF should reduce these correlations to achieve efficient transmission. Such an STRF will have neighboring excitatory and inhibitory regions in the frequency-latency domain, making the neuron tuned to spectro-temporal contrast and insensitive to spectro-temporal redundancy.

Auditory STRF filter as an efficient coding transform
The general formulation and derivation of the efficient coding transform $K$ (or STRF) can be found in its application to vision [34]. Here we outline these results and illustrate their consequences for auditory coding. Let $S$ be the input with $p$ input channels:

$$S = (S_1, S_2, \ldots, S_p)^T \qquad (5)$$

(superscript $T$ denotes vector or matrix transpose). These $p$ input channels may correspond to $p$ auditory nerves if we omit the temporal dimension, $p$ time instances if we focus on a single frequency channel, or they may correspond to $p$ spectro-temporal labels $(f,t)_i$ for $i = 1, 2, \ldots, p$. Let the input correlation be described by the correlation matrix $R^S$ with elements $R^S_{ij} = \langle S_i S_j\rangle$. The optimal transform $K$ that minimizes $E(K)$ in equation (4) can be decomposed in three steps (Figure 2C): (1) a principal component transform to de-correlate the inputs, (2) gain control of each principal component, (3) an ortho-normal or unitary transform on the array of the gain-controlled components to arrive at various output channels. We now elaborate and elucidate these three steps.
The first step is a coordinate rotation, or ortho-normal transform, $S \to K_o S$, by an ortho-normal matrix $K_o$ that de-correlates the input channels such that each of the channels in the transformed signal $K_o S$ contains a principal component of the original signal. We denote these principal components as $S_k = \sum_j (K_o)_{kj} S_j$, with subindex $k$ (instead of $i,j$) as the index of the de-correlated channels (later, we also use $\omega$ to denote the de-correlated channels in the temporal domain, or $(\Omega,\omega)$ in the spectro-temporal domain). Since the correlation between $S_k$ and $S_{k'}$ is $\langle S_k S_{k'}\rangle = (K_o R^S K_o^T)_{kk'}$, decorrelation between principal components implies that $\langle S_k S_{k'}\rangle = \langle S_k^2\rangle\,\delta_{kk'}$, where $\langle S_k^2\rangle$ is the $k$th eigenvalue of matrix $R^S$ and also the average signal power of the $k$th principal component $S_k$. As we will see later, when the input correlation $\langle S_{f,t} S_{f',t'}\rangle$ depends mainly on the differences $(f-f', t-t')$ in frequency and time, it turns out that $S_k$ (with the index $k$ denoting the spectro-temporal modulation frequency $(\Omega,\omega)$) is the amplitude of a dynamic or moving ripple that some experiments use to estimate the STRFs of cortical and midbrain neurons [17,18,19,16,2].
The second step is gain control $g_k$ on each component $S_k$, giving output $g_k S_k$. Including noise $N_k$, which is the original input noise $N$ projected to the $k$th channel by the transform $K_o$, and the encoding noise $N_{o,k}$ (in the decorrelated $k$ space), the total output becomes $O_k = g_k(S_k + N_k) + N_{o,k}$. It can be shown (see [34]) that the gain $g_k$ that minimizes $E(K)$ in equation (4) is determined by the input signal-to-noise ratio $\langle S_k^2\rangle/\langle N^2\rangle$ to satisfy

$$g_k^2 = \frac{\langle N_o^2\rangle}{\langle N^2\rangle}\,\max\left\{\frac{1}{2\left(1+\langle N^2\rangle/\langle S_k^2\rangle\right)}\left[1+\sqrt{1+\frac{2\lambda}{(\ln 2)\,\langle N_o^2\rangle}\,\frac{\langle N^2\rangle}{\langle S_k^2\rangle}}\,\right]-1,\ 0\right\} \qquad (6)$$

where $\langle N^2\rangle$ is the variance of $N_k$, and also of the input noise $N$ (assumed to be independent, identically distributed and Gaussian in each channel), and $\langle N_o^2\rangle$ is the variance of the encoding noise $N_{o,k}$ in each channel $k$ (and of the encoding noise $N_{o,i}$ in each $i$, since different encoding noise channels are also assumed to be independently and identically distributed).
Note that the total noise at output neuron $i$ is $\text{output noise}_i = \sum_j K_{ij} N_j + N_{o,i}$. One effect of the encoding transform $K$ is that noise corrupting different output neurons can be correlated, even when the original input noise is independent. The additional encoding noise $N_{o,i}$ could also be correlated in different output neurons, since it could also reflect a common origin in intermediate stages of the encoding processes. Our assumption of independence between $N_{o,i}$ and $N_{o,j}$ for $i \neq j$ is thus a simplification for mathematical convenience.
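Since all the variables are assumed to be Gaussian, each output $O_k$ extracts the following amount of information about the input $S$,

$$I(O_k; S_k) = \frac{1}{2}\log_2\frac{g_k^2\left(\langle S_k^2\rangle + \langle N^2\rangle\right) + \langle N_o^2\rangle}{g_k^2\langle N^2\rangle + \langle N_o^2\rangle},$$

and has an output power

$$\langle O_k^2\rangle = g_k^2\left(\langle S_k^2\rangle + \langle N^2\rangle\right) + \langle N_o^2\rangle.$$

With $E = \sum_k\left[\langle O_k^2\rangle - \lambda\, I(O_k; S_k)\right]$, one can then verify that $g_k^2$ in equation (6) indeed minimizes this $E$ since $dE/dg_k^2 = 0$ at that value. Note that if $S_k$ is the amplitude of a moving ripple indexed by $k$, $g_k$ will be the sensitivity of the neuron to the moving ripple.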
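The gain-control step can be sketched numerically as follows. This is an illustrative reconstruction rather than the authors' code: the closed form in `optimal_gain_sq` is obtained by setting $dE/dg_k^2 = 0$ for a single Gaussian channel with output power $g_k^2(\langle S_k^2\rangle + \langle N^2\rangle) + \langle N_o^2\rangle$ and information measured in bits. The $1/k^2$ signal power spectrum, the value of $\lambda$, and the unit noise variances are hypothetical choices made only to show the qualitative behavior.

```python
import math


def optimal_gain_sq(signal_power, noise_power, enc_noise_power, lam):
    """Squared gain minimizing E = <O_k^2> - lam * I(O_k; S_k) for one Gaussian channel."""
    if signal_power <= 0.0:
        return 0.0
    ratio = noise_power / signal_power  # inverse signal-to-noise ratio
    root = math.sqrt(1.0 + 2.0 * lam * ratio / (math.log(2.0) * enc_noise_power))
    bracket = 0.5 / (1.0 + ratio) * (1.0 + root) - 1.0
    return (enc_noise_power / noise_power) * max(bracket, 0.0)


def objective(g_sq, signal_power, noise_power, enc_noise_power, lam):
    """E = output power minus lam times mutual information (in bits) for one channel."""
    out_power = g_sq * (signal_power + noise_power) + enc_noise_power
    out_noise = g_sq * noise_power + enc_noise_power
    info_bits = 0.5 * math.log2(out_power / out_noise)
    return out_power - lam * info_bits


def gain_profile(intensity, n_channels=64, noise_power=1.0, enc_noise_power=1.0, lam=20.0):
    """Gains for a hypothetical ensemble whose k-th component has power intensity / k^2."""
    return [optimal_gain_sq(intensity / k ** 2, noise_power, enc_noise_power, lam)
            for k in range(1, n_channels + 1)]


loud = gain_profile(1000.0)   # high input intensity, high SNR
quiet = gain_profile(1.0)     # low input intensity, low SNR
loud_peak = loud.index(max(loud)) + 1     # index k of the largest gain
quiet_peak = quiet.index(max(quiet)) + 1
```

Under these assumptions the gain peaks at an intermediate component when the input is intense (a band-pass profile) and at the lowest component when the input is weak (a low-pass profile), with weak components receiving zero gain.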
We can write these two steps as the product $gK_o$, where $K_o$ is the principal component transform, and $g$ performs the gain control. $g$ is a diagonal matrix with diagonal elements $g_k$. The net output is then $O = gK_o(S+N) + N_o$. Consider imposing on this transform an orthonormal or unitary transform $U$ (with $UU^T = 1$), the third step in building the efficient coding filter $K$, giving $K = UgK_o$. It follows [34] from the properties of unitary matrices that neither the first term nor the second term in $E$ in equation (4) will be affected by $U$ (at least when signal and noise are Gaussian and when the components of $N_o$ are independent and identically distributed).
Each row vector of the matrix $K$ determines the receptive field of a particular output channel or neuron. Without $U$, $K = gK_o$ would specify receptive fields that would be gain controlled eigenvectors or principal components of the input correlation matrix. For example, they would look like ripples covering the entire spectro-temporal range. An appropriate choice of nontrivial $U$ will alter the receptive field shape dramatically, giving rise to receptive field properties found in real neurons such as a finite span in input channel space. For example, if we consider only the input frequency channels $f$ for auditory inputs and omit the time dimension, we may prefer the STRF for an output neuron to be selective to only a finite band of input frequencies such that the neural responses $O$ resemble peripheral inputs $S$ while maintaining coding efficiency. It can be shown [34,35] that this can be achieved by choosing $U = K_o^{-1}$, giving $K = K_o^{-1} g K_o$. We will use this choice, $U = K_o^{-1}$, in building our STRF in the frequency domain. However, the critical feature of the STRF, which is insensitive to the exact form of $U$, comes from the gain $g_k$ specified in the second step of the encoding model (as long as one does not impose additional computational goals that may restrict the final STRFs, see Discussion). We will show later that $g_k$ often corresponds to the modulation transfer functions (MTFs, also called ripple transfer functions, RTFs, in some of the literature) of the STRFs.
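The three steps can be put together in a short NumPy sketch. It is a sketch under stated assumptions, not the authors' implementation: the gain uses the single-channel Gaussian closed form derived from $dE/dg_k^2 = 0$, the third step uses the choice $U = K_o^{-1}$ (equal to $K_o^T$ for an ortho-normal $K_o$), and the exponentially decaying correlation matrix, the noise variances, and $\lambda$ are hypothetical example values.

```python
import numpy as np


def efficient_coding_filter(R_s, noise_power=1.0, enc_noise_power=1.0, lam=20.0):
    """Build K = U g K_o from an input correlation matrix R_s, with U = K_o^{-1}."""
    # Step 1: principal component transform; rows of K_o are eigenvectors of R_s.
    eigvals, eigvecs = np.linalg.eigh(R_s)
    K_o = eigvecs.T
    signal_power = np.clip(eigvals, 1e-12, None)  # guard against tiny negative eigenvalues

    # Step 2: gain control of each principal component (closed form from dE/dg^2 = 0).
    ratio = noise_power / signal_power
    root = np.sqrt(1.0 + 2.0 * lam * ratio / (np.log(2.0) * enc_noise_power))
    bracket = 0.5 / (1.0 + ratio) * (1.0 + root) - 1.0
    g = np.sqrt(enc_noise_power / noise_power * np.maximum(bracket, 0.0))

    # Step 3: rotate back to the input channel space, U = K_o^{-1} = K_o^T.
    K = K_o.T @ np.diag(g) @ K_o
    return K, g, K_o


# Hypothetical ensemble: 64 channels with exponentially decaying correlations.
n = 64
idx = np.arange(n)
R_s = 100.0 * np.exp(-np.abs(idx[:, None] - idx[None, :]) / 5.0)
K, g, K_o = efficient_coding_filter(R_s)
receptive_field = K[n // 2]  # row of K: receptive field of the output neuron at mid-range
```

Each row of `K` is the receptive field of one output neuron; changing the intensity factor (here 100) or the correlation length (here 5 channels) in the hypothetical `R_s` changes the gains and therefore the receptive field shape.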
We now apply this general framework to the case of auditory encoding. Sound spectrogram $S(f,t)$ is derived from the sound waveform $W(t)$ as follows. The first step is to perform a temporally-windowed Fourier transform of $W(t)$ to obtain the windowed spectrum $\hat{W}(\tilde{f},t)$ at linear frequency $\tilde{f}$ and time $t$. Since the cochlea performs approximately a log scale frequency analysis, we first let $f = \log\tilde{f}$ to obtain $\hat{W}(f,t)$ (although the more accurate form would be $f = 21.4\log_{10}(4.37\tilde{f}+1)$, with $\tilde{f}$ in kHz [53]). Then the input power is $\hat{S}(f,t) = |\hat{W}(f,t)|^2$. One may employ a further logarithmic transform $S(f,t) = \log\hat{S}(f,t)$ to characterize the cochlear response better (through capturing the compressive input/output transform realized by processes in the basilar membrane and hair cells) [54,55]. However, this further logarithmic transform is not essential for our formulation, and, as pointed out previously [56], it does not significantly affect the qualitative characteristics of the empirical STRFs. If one omits this logarithmic transform, then $S(f,t) = \hat{S}(f,t)$. We then subtract the mean $\langle S(f,t)\rangle$ from $S(f,t)$, and, for simplicity, denote the resulting zero mean signal still by $S(f,t)$, as in the definition of STRF. We next consider discrete samples $S_{f,t}$ of the continuous $S(f,t)$. This leads to the input correlation matrix $R^S_{ij} = \langle S_{(f,t)_i} S_{(f,t)_j}\rangle$. Finally, we follow the three encoding steps above to obtain the optimal encoding transform as $\mathrm{STRF} = K$. In the sub-section "The spectral filter SRF", we discuss the simple case in which the temporal dimension $t$ is omitted. Then, the input vector (equation (5)) is $S = (S_{f_1}, S_{f_2}, \ldots)^T$, and the input correlation matrix is $R^S_{ij} = \langle S_{f_i} S_{f_j}\rangle$. The efficient encoding procedure specifies the optimal spectral receptive field (SRF) $K_{ij}$ for neuron $i$, with $O_i = \sum_j K_{ij} S_{f_j} + \text{noise}$. When the temporal dimension is included, $S = (S_{(f,t)_1}, S_{(f,t)_2}, \ldots)^T$, $R^S_{ij} = \langle S_{(f,t)_i} S_{(f,t)_j}\rangle$, and efficient coding specifies the optimal STRF as input weights or selectivity associated with the spectrogram $\{S_{(f,t)_i}\}$.
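The preprocessing chain described above (windowed Fourier transform, power, optional logarithm, mean subtraction) might be sketched as below. Several details are assumptions made for illustration only: the Hann window, the window length and hop size, the small constant `eps` that avoids taking the logarithm of zero, and the use of a time average per frequency channel as the estimate of the ensemble mean. The remapping of the linear frequency axis to a logarithmic one is omitted here.

```python
import numpy as np


def zero_mean_log_spectrogram(waveform, win_len=256, hop=128, eps=1e-12):
    """Return S[f, t]: log power spectrogram with the mean of each frequency channel removed."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(waveform) - win_len) // hop
    frames = np.stack([waveform[i * hop: i * hop + win_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)      # windowed Fourier transform, one row per frame
    power = np.abs(spectrum) ** 2               # input power at each (time, frequency)
    log_power = np.log(power + eps)             # optional compressive transform
    centered = log_power - log_power.mean(axis=0, keepdims=True)
    return centered.T                           # shape (n_freq, n_frames)


# Hypothetical input: 4096 samples of Gaussian noise.
rng = np.random.default_rng(0)
waveform = rng.standard_normal(4096)
S = zero_mean_log_spectrogram(waveform)
```

The columns of `S` can then be stacked into input vectors from which the correlation matrix $R^S$ is estimated.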
It is apparent that the optimal SRF and STRF depend on input statistics via the input correlation R S and the input SNR (through the steps 1 and 2 in the encoding scheme). Therefore, when the stimulus ensemble changes, altering the input correlations and signal intensity, the form of the encoding receptive field should adapt in order to maintain encoding optimality. We propose that it is this that explains the input ensemble dependence of the STRFs.
A special class of input statistics has translation invariant correlations, i.e., with $R^S_{ij} = \langle S_{(f,t)_i} S_{(f,t)_j}\rangle$ depending only on the differences $f_i - f_j$ (quantified in octaves) and $t_i - t_j$. This is a reasonable approximation of the input correlations in natural auditory scenes under two conditions. The first is that a local frequency range is considered that is not much larger than the range of the frequencies to which a neuron is sensitive, i.e., from the perspective of a neuron, the dependence of $\langle S_{(f,t)_i} S_{(f,t)_j}\rangle$ on the frequency is mainly through $f_i - f_j$. This is analogous to approximating spatial correlation of visual inputs as translation invariant to understand the retinal ganglion cell's spatial receptive fields although the spatial sampling density varies substantially with input eccentricity [31,34]. The second is that the environment is statistically stationary, as then the correlations in time depend only on the temporal difference $t_i - t_j$. It can then be shown [34] that the principal components are moving ripples $\propto e^{i(2\pi\Omega f + 2\pi\omega t)}$, each of which has a 2D modulation frequency $(\Omega,\omega)$, which can be indexed by $k \equiv (\Omega,\omega)$. The first encoding step is then a 2D Fourier transform,

$$S(\Omega,\omega) \propto \iint S(f,t)\, e^{-i(2\pi\Omega f + 2\pi\omega t)}\, df\, dt.$$

Meanwhile, the original input can be written as

$$S(f,t) \propto \iint S(\Omega,\omega)\, e^{i(2\pi\Omega f + 2\pi\omega t)}\, d\Omega\, d\omega,$$

as a weighted sum of the moving ripples [19]. The second encoding step determines the gains $g(\Omega,\omega)$ for the ripple amplitudes $S(\Omega,\omega)$ [34] as in equation (6), i.e., replacing $g_k$ and $\langle S_k^2\rangle$ in equation (6) by the corresponding $g(\Omega,\omega)$ and $\langle S^2(\Omega,\omega)\rangle$. If $U$ is chosen as the inverse Fourier transform with an extra phase function $\phi(\Omega,\omega)$, then the encoding transform is

$$K(f_i - f_j, t_i - t_j) \propto \iint g(\Omega,\omega)\, e^{i\left[2\pi\Omega(f_i - f_j) + 2\pi\omega(t_i - t_j) + \phi(\Omega,\omega)\right]}\, d\Omega\, d\omega,$$

which depends only on the differences $f_i - f_j$ and $t_i - t_j$. Applying this transform to input $S$ to give output $O_i(t_i) = \iint df_j\, dt_j\, K(f_i - f_j, t_i - t_j)\, S(f_j, t_j)$, we see, by comparison with equation (1), that the STRF is $\mathrm{STRF}(f,\tau) = K(f_i - f, \tau)$.
This is a temporal filter tuned to sound frequency with a tuning pattern governed by $g(\Omega,\omega)$, and centered around frequency $f_i$. Changing the center frequency from $f_i$ to $f_j$ is like shifting from one output neuron $i$ to another neuron $j$. Altering the phase $\phi(\Omega,\omega)$ in equation (9) alters the STRF shape, and the phase can be chosen, in particular, to ensure temporal causality. In physiology, the modulation transfer function (MTF) is often defined as the Fourier transform of the auditory receptive field [19]. Therefore, it is clear from equation (10) that the gain profile $g(\Omega,\omega)$, which is determined by efficient coding, corresponds to the magnitude of the MTF. However, the shape of an STRF is determined by the phase as well as the magnitude of the MTF, and efficient coding does not strongly constrain the phase. Therefore, while we will illustrate the general properties of some example STRFs predicted by the theory by choosing particular $U$ transforms (governed by the additional requirements of spectro-temporal locality and causality), in the Results we will generally compare physiological data to the magnitudes of the MTFs that the theory predicts.
In the Results, we will discuss the efficient coding framework for situations both with (e.g., to study temporal aspects of STRFs) and without (e.g., to study their spectral aspects) translation invariance in input statistics.

Results
To illustrate how the framework explains and predicts physiological experiments, we first discuss a few examples when the temporal or the spectral dimension is omitted, and then show a full spectro-temporal STRF.

The spectral filter SRF
We first omit time, treating the input $S(f)$ as varying only in frequency. In this case, the encoding filter reduces from being an STRF to an SRF. We take $f_i$ as one of 250 discrete values, $i = 1, 2, \ldots, 250$, from low to high frequencies; hence input $S$ is a one-dimensional vector $S = (S_{f_1}, S_{f_2}, \ldots, S_{f_{250}})^T$. In simulations, an input sample $S$ is generated by smoothing a random noise vector $S' = (S'_{f_1}, S'_{f_2}, \ldots, S'_{f_{250}})^T$ (Figure 3A), with all the components $S'_{f_i}$ taken to be independent, zero mean, unit variance, Gaussian noise. Specifically, $S = \sqrt{I_F}\, M S'$ (equation (11)), where $I_F$ is a factor to scale the overall input power intensity, and $M$ is the smoothing matrix, whose elements (equations (12) and (13)) involve a scale $A_i$ for the signal strength at frequency $f_i$, a normalization constant, and a length $L$ that controls the range of frequency difference $|f_i - f_j|$ for significant correlation coefficient between the variation of $S_{f_i}$ and that of $S_{f_j}$. Consequently, each $S_{f_i}$ is also a zero mean Gaussian random variable, and the input correlations comprise a $250 \times 250$ matrix $R^S = I_F M M^T$. One could also estimate $R^S$ from input samples $S$ (as when animals adapt their auditory system to environmental sound through experience), in which case element $R^S_{ij} = \langle (S_{f_i} - \langle S_{f_i} \rangle)(S_{f_j} - \langle S_{f_j} \rangle) \rangle$. Figure 3B illustrates $R^S$ (obtained numerically from 250 samples of $S$ in Figure 3A; one could of course use more than 250 samples to estimate $R^S$) for $L = 14$. The correlation $R^S_{ij} = I_F A_i A_j (\hat{M}^2)_{ij}$ scales with the strengths of the original signals $S_{f_i}$ and $S_{f_j}$ through the scales $A_i$ and $A_j$, and so decays with frequency $f_i$ and $f_j$. Thus the statistics of the stimulus ensemble are not translation invariant in the spectral frequency $f$. Nevertheless, the correlation coefficient does depend mainly on the (frequency) difference $|i - j|$, since $(\hat{M}^2)_{ii}$ is almost independent of $i$ and $(\hat{M}^2)_{ij}$ depends mainly on $|i - j|$ except for very small or very large $i$ and $j$. This is evident in the fact that the rate of decay of $R^S_{ij}$ with the difference $f_i - f_j$ in Figure 3B is almost constant.
Since the stimulus ensemble is not translation invariant, we will use the general formulation to obtain the SRF. From $R^S$, we obtain its 250 eigenvalues and the corresponding eigenvectors. Each of these is a vector with 250 components. We list them in the order of descending eigenvalues, denoting the $k$th eigenvector as $V_k \equiv [(K_o)_{k1}, (K_o)_{k2}, \ldots, (K_o)_{kj}, \ldots, (K_o)_{k,250}]^T$, and placing it as the $k$th row vector of the $K_o$ transform matrix. Figure 3C depicts the eigenvectors for $k = 5, 10, \ldots, 50$, where smaller $k$ is associated with a larger eigenvalue. Each principal component or eigenvector can be seen as a special input spectrum pattern $S = V_k$, while a general input $S = \sum_k S_k V_k$ is a linear sum of the principal components with weights $S_k$. The first encoding step is thus a transformation of the original input $S$ by $K_o$ to obtain the decorrelated signals $S_k$, for $k = 1, 2, \ldots, 250$. The average power $\langle S_k^2 \rangle$ in $S_k$ is the $k$th eigenvalue of the matrix $R^S$. The eigenvectors look roughly like oscillating waveforms (spectral oscillations) with different oscillation rates, and are comparable to the sinusoidal bases in the Fourier transform. They also resemble the "ripples" used in physiological experiments. This is because the input correlations are roughly translation invariant, at least within a small range of frequencies in which the signal power $\langle S_f^2 \rangle$ is roughly independent of $f$ (just as in vision, where the statistics of inputs sampled at the retina can be seen as roughly translation invariant within a local region). Also note that smaller or larger $k$ is associated with eigenvectors with fewer or more oscillations. This makes $k$ relate monotonically to the spectral modulation frequency (corresponding to the "ripple frequency" $\Omega$ in physiological experiments).
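The first encoding step can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not the simulation code behind Figure 3: it assumes a Gaussian smoothing kernel for $M$ with all scales $A_i = 1$ and a simple normalization (the actual elements of $M$ are those of equations (12) and (13)), and the names `K_o`, `signal_power` and `sign_changes` are hypothetical.

```python
import numpy as np

n_freq, L, I_F = 250, 14.0, 2.0           # channel count, smoothing length, intensity
idx = np.arange(n_freq)

# ASSUMPTION: Gaussian smoothing kernel with A_i = 1; equations (12) and (13)
# of the paper define the actual elements of M.
M = np.exp(-(idx[:, None] - idx[None, :]) ** 2 / (2.0 * L ** 2))
M /= M.sum(axis=1).max()                  # hypothetical normalization

R_S = I_F * (M @ M.T)                     # input correlation matrix R^S = I_F M M^T

# eigh returns ascending eigenvalues, so reorder them to descending.
eigvals, eigvecs = np.linalg.eigh(R_S)
order = np.argsort(eigvals)[::-1]
signal_power = eigvals[order]             # <S_k^2> for k = 1, 2, ...
K_o = eigvecs[:, order].T                 # k-th row is the k-th eigenvector V_k

# Step 1 decorrelates: K_o R^S K_o^T is diagonal with <S_k^2> on the diagonal.
decorrelated = K_o @ R_S @ K_o.T

def sign_changes(v, tol=1e-12):
    """Count sign changes along a vector, ignoring near-zero entries."""
    s = np.sign(v[np.abs(v) > tol])
    return int(np.sum(s[1:] != s[:-1]))
```

Calling `sign_changes(K_o[k])` for increasing `k` gives a rough count of the spectral oscillations in each eigenvector, which is one way to see why `k` behaves like a ripple frequency.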
Larger eigenvalues, i.e., larger signal powers $\langle S_k^2 \rangle$, are associated with fewer spectral modulations or smaller indices $k$, because inputs of more similar sound frequencies are more correlated with each other, i.e., $R^S_{ij}$ decreases with increasing $|f_i - f_j|$. The analogy between the eigenvectors and the Fourier bases can be understood as follows: if $R^S$ is strictly translation invariant, then the eigenvectors are sine waves with different spectral modulation frequencies $\Omega$. The eigenvalues are the Fourier transforms of $R^S_{ij} \equiv R^S(f_i - f_j)$, and hence they decrease with the modulation frequency $\Omega$ because $R^S(f_i - f_j)$ is non-negative and decreases with increasing $|f_i - f_j|$.
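The link between translation invariance and Fourier modes can be checked with a small example. The sketch below assumes a hypothetical exponentially decaying correlation on a ring of channels (so that the correlation matrix is exactly circulant); the channel count and correlation length are illustrative choices, not values from the paper.

```python
import cmath
import math

n = 64                       # number of frequency channels on a ring (assumption)
corr_len = 2.0               # hypothetical correlation length, in channels

# Translation-invariant correlation R(d), using circular distance so that
# the matrix R_ij = R(i - j) is exactly circulant.
corr = [math.exp(-min(d, n - d) / corr_len) for d in range(n)]

def dft(values):
    """Plain discrete Fourier transform of a real sequence."""
    size = len(values)
    return [sum(v * cmath.exp(-2j * math.pi * m * d / size)
                for d, v in enumerate(values)) for m in range(size)]

# Eigenvalues of a circulant matrix are the DFT of its first row.
eigenvalues = [x.real for x in dft(corr)]

def apply_R(vec):
    """Multiply the circulant correlation matrix with a vector."""
    return [sum(corr[(i - j) % n] * vec[j] for j in range(n)) for i in range(n)]

# A ripple with spectral modulation frequency m is an eigenvector of R.
m = 3
ripple = [cmath.exp(2j * math.pi * m * j / n) for j in range(n)]
residual = max(abs(r - eigenvalues[m] * v)
               for r, v in zip(apply_R(ripple), ripple))
```

For this particular correlation function, `eigenvalues` falls monotonically from modulation frequency 0 up to `n // 2`, and `residual` is at the level of floating-point rounding.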
The second encoding step is to assign the gain $g_k$ to each of these channels $S_k$ according to equation (6), giving $S_k \to g_k S_k$ (see Figure 3D; $I_F = 2$, $\langle N^2 \rangle = 1$ and $\lambda/\langle N_o^2 \rangle = 10$). Note that while the signal power $\langle S_k^2 \rangle$ decreases with increasing $k$, the gain magnitude $g_k$ first increases with $k$ and then decreases and drops to zero at higher $k$.
The gain for small $k$ is low since the SNR $\langle S_k^2 \rangle/\langle N^2 \rangle$ is high enough to make amplifying $S_k$ less necessary. From equation (6) [34], the gain in this high-SNR regime takes the whitening form $g_k \propto \langle S_k^2 \rangle^{-1/2}$ (equation (14)). This implies that $g_k^2 \langle S_k^2 \rangle = \text{constant}$ for sufficiently large SNRs. When each principal component $S_k$ is a modulation frequency mode, this gain profile $g_k$ is often called whitening. At smaller signal powers, the gain increases so as to utilize the channel's dynamic range fully. However, when the SNR is too small, for example, when the noise power is higher than the signal power, $\langle S_k^2 \rangle/\langle N^2 \rangle < 1$, the gain decreases with decreasing $\langle S_k^2 \rangle$ [34]. This is because such input components are dominated by noise, and amplifying noise increases neural cost. Thus, in general, when $\langle S_k^2 \rangle$ decreases with increasing $k$, the gain profile has a band-pass shape, first increasing, and then decreasing with increasing $k$ (see the red curve in Figure 3D).
As the overall encoding transform gives outputs $O = KS + \text{noise}$, where $\text{noise} = KN + N_o$, the $i$th output neuron $O_i$ has its SRF as a vector of weights $K_{ij}$ for inputs $S_{f_j}$ of various frequencies $j = 1, 2, \ldots, 250$, with $K_{ij} = \sum_k g_k (V_k)_i (V_k)_j$. It can thus be seen as a weighted sum of the eigenvectors $V_k$ of the input correlation matrix, with weights $g_k (V_k)_i$ for output neuron $i$. Figure 3E shows SRFs for four different output neurons (or channels $i$). These SRFs have different preferred frequencies $f$, so that the preferred frequencies of all the output neurons span the whole input frequency range. The shapes of the SRF depend on the input statistics via the dependence of $V_k$ and $g_k$ on the input correlation matrix $R^S$. In particular, for sufficiently high input SNR, while a neuron is excited by its preferred frequency, it is suppressed by nearby frequencies. This form of contrast enhancement achieves a measure of decorrelation between neighboring output neurons that would otherwise reflect the strong correlations between neighboring frequencies. For SRFs tuned to higher frequencies, the center excitatory regions are larger and the surround suppression is weaker. This is because SNRs are weaker for higher frequency inputs (the dependency of SRF on SNR will be discussed in the next sub-section). If the input statistics are strictly translation invariant, the SRFs for different output channels will have the same shape, and will just be centered on different frequencies.
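The contrast-enhancing shape of a high-SNR SRF can be illustrated with a noise-free toy case. The sketch below is an assumption-laden simplification rather than the model of equation (6): it uses a hypothetical exponentially decaying correlation matrix (chosen because it is well conditioned), the pure whitening gain $g_k = \langle S_k^2 \rangle^{-1/2}$ for every $k$ with no noise-driven roll-off, and $U = K_o^{-1}$.

```python
import numpy as np

n = 100                                   # number of frequency channels (assumption)
rho = 0.8                                 # hypothetical neighbor correlation
idx = np.arange(n)
R_S = rho ** np.abs(idx[:, None] - idx[None, :])

# Principal components and their powers <S_k^2>.
power, V = np.linalg.eigh(R_S)            # columns of V are the eigenvectors V_k

# ASSUMPTION: pure whitening gain for all k (the noise-free, high-SNR limit).
gain = power ** -0.5

# K_ij = sum_k g_k (V_k)_i (V_k)_j, i.e. K = U g K_o with U = K_o^{-1}.
K = (V * gain) @ V.T

# Output correlation in the absence of noise: K R^S K^T.
out_corr = K @ R_S @ K.T

# SRF of an output neuron in the middle of the frequency range.
center = n // 2
srf = K[center]
```

In this toy case `out_corr` is close to the identity matrix, so neighboring outputs are decorrelated, and `srf` is positive at the preferred channel with negative weights on its immediate neighbors, a center-surround profile.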

Adaptation of SRF to input signal-to-noise ratio
When sound intensity decreases, the basilar membrane in the cochlea undergoes a smaller vibration. This decreases the magnitudes of input signals $S$, and so, if the level of the noise stays unchanged, the signal-to-noise ratio $\langle S_k^2 \rangle/\langle N^2 \rangle$ will decrease. This will change the optimal encoding gain $g_k$ via equation (6), and thus change the final SRFs. In our example, we simulate the change in input intensity by changing $I_F$ in equation (11). Figure 4A shows three example input intensity profiles $\langle S_k^2 \rangle$, and the corresponding gain profiles $g_k$. While an overall change of input intensity merely scales the profile $\langle S_k^2 \rangle$ up and down, the gain profile $g_k$ does not trivially scale up and down. When input intensity decreases, the $k$ at which $\langle S_k^2 \rangle/\langle N^2 \rangle = 1$ becomes smaller, thereby decreasing the $k_p$ at which $g_k$ peaks. Consequently, the gain profile turns from being band-pass to being low-pass (Figure 4A).
The non-zero gain at higher $k$ implies sensitivity to weaker principal components with more spectral oscillations (or higher "ripple frequencies"). Thus, as input intensity decreases, the overall SRF filter changes in two ways (Figure 4B): (1) it fluctuates less (i.e., has fewer excitatory and inhibitory regions, and weaker inhibitory regions); (2) the width of the excitatory and inhibitory regions increases, as the result of losing contributions from spectral modulations $V_k$ with higher modulation frequencies.
The insights from Figure 4B can help to understand the difference between the four SRFs in Figure 3E. Given the $I_F$ as in Figure 3, one may divide the whole sound frequency range into two ranges of equal bandwidth, one for the lower and the other for the higher $f$'s, and treat the two ranges as if they were two different stimulus ensembles. If one ignores the overall sound frequency difference between these two ensembles, then these two ensembles differ from each other only in their SNRs, with a higher SNR for the ensemble for the lower sound frequencies $f$. In this perspective, one can understand why an SRF tuned to the lower frequencies in Figure 3E has a narrower excitatory region and a stronger surround suppression than an SRF tuned to higher frequencies, using the insights gained from Figure 4. (In comparing Figure 4B with Figure 3E, one should note that each SRF in Figure 4B is depicted by zooming to the frequency region around the preferred frequency $f$ of the SRF.) One may even view the four SRFs in Figure 3E as if they were each exposed to one of four different stimulus ensembles that differ in SNRs (and in sound frequency $f$, a difference that we ignore). Within each of these stimulus ensembles, the input statistics may be seen as approximately translation invariant, since $\langle S_f^2 \rangle$ is almost independent of $f$ and the correlation $\langle S_f S_{f'} \rangle$ is approximately only a function of the frequency difference $f - f'$ within a small range of frequency $f$.

Adaptation of SRF to input signal correlation
As well as adapting to the input SNR, the SRF can adapt to the signal correlations in the input. These can also vary across auditory environments. We generate two stimulus ensembles (Ensemble$_{\text{short}}$ and Ensemble$_{\text{long}}$) based on equation (11), with short and long range (in frequency space) correlations between inputs $S_{f_i}$ and $S_{f_j}$ of different sound frequencies. We do this by setting the smoothing length $L$ in equation (13) to be $L_{\text{short}} = 10$ and $L_{\text{long}} = 20$. Since short and long range correlations give respectively smaller and larger correlations or degrees of input redundancy, in this paper we use the terms short/long-range and small/large correlations interchangeably. The two stimulus ensembles are made to have the same overall signal power $\sum_k \langle S_k^2 \rangle$, and consequently their $\langle S_k^2 \rangle$ vs. $k$ curves cross each other at a particular frequency $k_x$ (Figure 5A). In Ensemble$_{\text{long}}$, signal power $\langle S_k^2 \rangle$ is more concentrated in lower $k$'s, and the "bandwidth" of gain, i.e., the range of $k$'s with substantial $g_k$, is consequently narrower.
If $\langle S_k^2 \rangle > \langle N^2 \rangle$ at $k = k_x$, the $k$ at which $\langle S_k^2 \rangle/\langle N^2 \rangle = 1$ is larger in Ensemble$_{\text{short}}$ (Figure 5A, upper panel, $I_F = 2$, $\langle N^2 \rangle = 1$, $\lambda/\langle N_o^2 \rangle = 10$). Thus, the frequency $k_p$ at which gain $g_k$ peaks is also larger in Ensemble$_{\text{short}}$. If the SNR is lower, so that $\langle S_k^2 \rangle < \langle N^2 \rangle$ at $k = k_x$, then $k_p$ is instead smaller in Ensemble$_{\text{short}}$ than in Ensemble$_{\text{long}}$. However, this is less apparent since gain profiles in both ensembles become "low-pass" in $k$, implying that there is no obvious "peak position" (Figure 5A, lower panel, $I_F = 0.2$). Nevertheless, the cutoff frequency $k$ where $g_k = 0$ is always smaller for Ensemble$_{\text{long}}$ (Figure 5A), and the optimal SRFs for it consequently enjoy a greater spectral extent, i.e., the SRFs are non-zero for a larger range of $f$ (Figure 5B).
The intuition for this effect is that, to be effective as either a contrast enhancing filter at a high SNR or a smoothing filter at a low SNR, the SRF should have a spectral extent that matches the range of the input correlations.

The temporal filter TRF
We can similarly ignore the frequency dimension of the input to understand the temporal receptive field (TRF). This is determined from the way in which, through $O_t = \sum_{t'} K_{tt'} S_{t'} + \text{noise}$, the input temporal sequence $S = (S_{t_1}, S_{t_2}, \ldots, S_{t_i}, \ldots)$ is transformed to the output temporal sequence $O$. In a statistically stationary auditory environment, the input correlation should be time shift invariant, i.e., $R^S_{tt'} = \langle S_t S_{t'} \rangle$ should depend only on $t - t'$. The decorrelating transform $K_o$ is then a temporal Fourier transform, with components indexed by the temporal modulation frequency $\omega$ and with gains $g_\omega$. However, the actual procedure to obtain the TRF is trickier in that the $U$ transform in the third encoding step to give the overall $K = U g K_o$ has to be chosen to satisfy the causality constraint. That is, the output $O_t$ at time $t$ should only depend on past input $S_{t'}$ for $t' \le t$, i.e., $K_{tt'} = 0$ for $t' > t$. Moreover, it is better for the TRF to have a short temporal span and latency, an outcome that can be achieved by assuming that the optimal temporal filter $K$ has a minimum phase-shift [57]. Short latency can feasibly be implemented by neural synaptic and membrane mechanisms that typically have time constants no longer than a few hundred milliseconds [58]. Hence, these offer credible constraints on the TRF. Note that if we choose $U = K_o^{-1}$, i.e., $U_{t\omega} \propto e^{i 2\pi\omega t}$, then $K_{tt'} \propto \sum_\omega g_\omega e^{i 2\pi\omega(t - t')}$ would be an even function of $t - t'$ and thus not a causal temporal filter given gains $g_\omega$ that are all real.

Figure 5. Adaptation of gain $g_k$ and spectral filter kernel SRF to input correlations under high/low SNR. Same input ensemble as that in Figure 3A, except that the smoothing parameters, $L = 10$ and $L = 20$, are set for short and long range correlations, respectively. Analogous figure format as in Figure 4, with added illustrations of the adaptation to input correlations. The thick and thin curves correspond to quantities for inputs with large and small correlations respectively; blue/red curves plot signal power $\langle S_k^2 \rangle$ and gain $g_k$ respectively. doi:10.1371/journal.pcbi.1002123.g005
The filter $K$ can be made causal and minimum phase by choosing another $U$ simply as $U_{t\omega} \propto e^{i 2\pi\omega t + i\phi(\omega)}$ with a particular phase function $\phi(\omega)$, so that $K_{tt'} \propto \sum_\omega g_\omega e^{i 2\pi\omega(t - t') + i\phi(\omega)}$. Instead of directly obtaining this phase function $\phi(\omega)$, we can equivalently obtain this minimum phase shift causal filter by transforming the acausal $K$ using standard procedures in signal processing theory as follows (see [57] for the proof). Given a noncausal filter $K(\hat{t})$ with finite non-zero values in discrete time $\hat{t} = -M, -M+1, \ldots, 0, \ldots, N-M-1, N-M$, first let $t = \hat{t} + M$ to make a causal filter $K(t)$ whose nonzero values are at $t = 0, 1, \ldots, N$. Second, define $\tilde{K}(z)$, the transform of $K(t)$ in the complex variable $z$. Among the $N$ complex roots of the equation $\tilde{K}(z) = 0$, let $z_i$ denote the roots with $|z_i| > 1$ and $z_j$ the other roots with $|z_j| \le 1$.
Third, construct from these two sets of roots the minimum phase counterpart of $\tilde{K}(z)$. Its coefficients $K_m(t)$, $t = 0, 1, \ldots, N$, are the values of the desired causal minimum phase filter. One example of this process is demonstrated in Figure 6A (before the minimum phase adjustment) and Figure 6B (after the minimum phase adjustment) ($I_F = 2$, $L = 14$).
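The three steps above can be sketched with the standard root-reflection construction. This is a minimal illustration under stated assumptions, not necessarily the exact formula used in the paper: it assumes that $\tilde{K}(z)$ is the z-transform $\sum_t K(t) z^{-t}$, so that the roots with $|z_i| > 1$ are the ones reflected to $1/z_i^*$, and the kernel values and the function name `minimum_phase` are hypothetical.

```python
import numpy as np

def minimum_phase(causal_kernel):
    """Return the causal minimum-phase filter with the same magnitude response.

    ASSUMPTION: zeros are those of the z-transform sum_t K(t) z^{-t}, so
    minimum phase means all zeros lie on or inside the unit circle.
    """
    k = np.asarray(causal_kernel, dtype=float)
    zeros = np.roots(k).astype(complex)    # requires k[0] != 0
    outside = np.abs(zeros) > 1.0
    # Reflecting z_i to 1 / conj(z_i) rescales the magnitude response by
    # 1 / |z_i|, so compensate in the overall gain.
    gain = k[0] * np.prod(np.abs(zeros[outside]))
    zeros[outside] = 1.0 / np.conj(zeros[outside])
    return np.real(gain * np.poly(zeros))

# Hypothetical acausal kernel, even in latency tau = -2, ..., 2.
k_acausal = np.array([0.1, 0.5, 1.0, 0.5, 0.1])
k_causal = k_acausal.copy()                # re-index t = tau + 2, so t = 0, ..., 4
k_min = minimum_phase(k_causal)
```

The shifted kernel `k_causal` and the adjusted kernel `k_min` share the same gain magnitude at every modulation frequency, but `k_min` concentrates its weight at the shortest latencies, which is the property motivating the minimum phase choice.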
The temporal kernel also depends on the SNR and the input correlations. The change in $g_\omega$ when sound intensity becomes lower is similar to that in the spectral case: from band-pass to low-pass. A temporal kernel under lower SNR is demonstrated in Figure 6C. The changes in $g_\omega$ and TRF with input correlations are analogous to those in the spectral case as well (figure not shown).

Finally, we show examples of the two-dimensional $\mathrm{STRF}(f,t)$.
Here, we extended the assumption of shift invariance in the input correlations to the spectral dimension for the convenience of calculation. This assumption is reasonable when individual STRFs cover sufficiently small ranges of frequencies that the correlation in the spectral space is almost translation invariant within that range, as we see in our SRF examples. Then, spectral and temporal dimensions can be decorrelated at the same time by performing a 2-D Fourier transform on inputs $S(f,t)$, with the moving ripples as decorrelated channels, each denoted by a 2D index $(\Omega,\omega)$ marking the spectral and temporal modulation frequencies.
Let the signal power in the decorrelated channels $(\Omega,\omega)$ for input $S(f,t)$ be $\langle S^2(\Omega,\omega) \rangle = I_F F(\Omega,\omega)$. Here, $F(\Omega,\omega)$ typically decays with modulation frequency $|\Omega|$ and $|\omega|$ since most natural inputs have input correlation $\langle S(f,t) S(f',t') \rangle$ that decays with $|f - f'|$ and $|t - t'|$. $I_F$ is a scale factor that controls the SNR. We use the following example in our simulations: $F(\Omega,\omega) = \exp[-(|\Omega|/\Omega_0)^3 - a(|\omega|/\omega_0)^3]/\mathrm{NORM}$, where $a = 1.8$, $\Omega_0$ and $\omega_0$ are parameters that control input correlation, and $\mathrm{NORM} = \sum_\Omega \sum_\omega \exp[-(|\Omega|/\Omega_0)^3 - a(|\omega|/\omega_0)^3]$ is a normalization factor. Figure 7A shows an example with $\Omega_0 = 4$, $\omega_0 = 4$. According to equation (8), the gain $g(\Omega,\omega)$ can be obtained as shown in Figure 7B ($\langle N^2 \rangle = 1$, $\lambda/\langle N_o^2 \rangle = 10$, and $I_F = 60, 500$). In particular, in the frequency range $(\Omega,\omega)$ in which noise is negligible relative to the signal, the gain $g(\Omega,\omega) \propto \langle S^2(\Omega,\omega) \rangle^{-1/2}$ specifies the whitening filter of equation (14). This gain profile changes from being a band-pass to a low-pass two-dimensional filter as the SNR is lowered. As we noted before, efficient coding predicts the gain $g(\Omega,\omega)$, or the modulation transfer function (MTF), but does not precisely determine the STRF shape. The latter depends on the less constrained $U$ transform.

Figure 6. Simulation of temporal receptive field TRF, when the spectral dimension is omitted. The same stimulus ensemble is used as in Figure 3A, except the factor $A_i = 1$ in equation (12)

Therefore, we qualitatively compare our $g(\Omega,\omega)$ for two different $I_F$'s with the MTFs obtained from physiological experiments under two different input sound levels. Figure 7E and Figure 7F are obtained from data on STRFs of 40 cells in the inferior colliculus of animals exposed to natural rain sound at low and high sound levels [7]. We first did a two-dimensional Fourier transform on the STRF of each cell to obtain its MTF. Then the spectral modulation frequency $\Omega_p$ and the temporal modulation frequency $\omega_p$ where the MTF has its maximum value were identified and normalized by a fixed value across cells.
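The MTF extraction described above can be sketched as follows. This is an illustrative outline only, not the analysis code used for Figures 7E and 7F: the function name `mtf_peak`, the toy ripple used as a stand-in STRF, and the choice of expressing modulation frequencies in cycles per analysis window are all assumptions, and the normalization across cells is omitted.

```python
import numpy as np

def mtf_peak(strf):
    """Return the MTF magnitude and the peak (spectral, temporal) modulation
    frequencies, in cycles per analysis window."""
    n_f, n_t = strf.shape
    mtf = np.abs(np.fft.fft2(strf))        # magnitude of the 2D Fourier transform
    i, j = np.unravel_index(np.argmax(mtf), mtf.shape)
    # fftfreq gives signed frequencies; the sign only encodes ripple direction.
    peak_spectral = abs(np.fft.fftfreq(n_f)[i]) * n_f
    peak_temporal = abs(np.fft.fftfreq(n_t)[j]) * n_t
    return mtf, peak_spectral, peak_temporal

# Toy stand-in for a measured STRF: a single moving ripple with 2 spectral
# and 3 temporal cycles across a 32 x 40 (frequency x time) window.
f = np.arange(32)[:, None] / 32.0
t = np.arange(40)[None, :] / 40.0
toy_strf = np.cos(2 * np.pi * (2 * f + 3 * t))

mtf, omega_spec_p, omega_temp_p = mtf_peak(toy_strf)
```

Applied to a measured STRF sampled on a known frequency and time grid, the two peak values would be rescaled by the grid spacing to obtain modulation frequencies in physical units.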
The average $\Omega_p$ and $\omega_p$ across all cells are shown in Figure 7E. These two "peak frequencies" both increased when sound intensity increased. The physiological MTF averaged across all cells (Figure 7F) also becomes higher pass, both spectrally and temporally, under higher sound intensities, as predicted by efficient coding (Figure 7B).
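The model side of this comparison can be sketched in a few lines. The snippet below is a rough illustration under stated assumptions rather than a reproduction of Figure 7B: it takes $F(\Omega,\omega)$ in the form implied by the normalization factor NORM, evaluates it on a hypothetical integer grid of modulation frequencies, and replaces the smooth gain of equation (8) by a crude stand-in (whitening where signal power exceeds noise power, zero elsewhere).

```python
import math

def model_power(omega0_spec=4.0, omega0_temp=4.0, a=1.8, max_index=8):
    """F(Omega, omega): normalized modulation power on an integer grid.
    The grid extent max_index is an assumption of this sketch."""
    grid = range(-max_index, max_index + 1)
    raw = {(W, w): math.exp(-(abs(W) / omega0_spec) ** 3
                            - a * (abs(w) / omega0_temp) ** 3)
           for W in grid for w in grid}
    norm = sum(raw.values())
    return {key: value / norm for key, value in raw.items()}

def crude_gain(F, I_F, noise_power=1.0):
    """Whitening gain where signal exceeds noise, zero elsewhere.
    ASSUMPTION: a hard cutoff standing in for the roll-off of equation (8)."""
    gain = {}
    for key, value in F.items():
        power = I_F * value                # <S^2(Omega, omega)> = I_F * F
        gain[key] = power ** -0.5 if power > noise_power else 0.0
    return gain

F = model_power()
gain_low = crude_gain(F, I_F=60.0)         # lower input intensity
gain_high = crude_gain(F, I_F=500.0)       # higher input intensity

# Channel with the largest gain, and number of channels with non-zero gain.
peak_low = max(gain_low, key=gain_low.get)
peak_high = max(gain_high, key=gain_high.get)
passband_low = sum(g > 0 for g in gain_low.values())
passband_high = sum(g > 0 for g in gain_high.values())
```

Under these assumptions, raising `I_F` widens the set of channels with non-zero gain and moves the gain peak to a channel with lower signal power, i.e., further out in modulation frequency, which is the qualitative trend compared with the physiological peak frequencies.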
For completeness, we illustrate in Figure 7C the model STRFs from the gain profiles $g(\Omega,\omega)$, using an inverse Fourier transform with a proper phase function $\phi(\Omega,\omega)$ as the candidate $U$ matrix. Specifically, the model STRF is the inverse Fourier transform of $g(\Omega,\omega) e^{i\phi(\Omega,\omega)}$, where the phase $\phi(\Omega,\omega)$ is chosen to make the STRF causal, with minimum phase shifts in the temporal dimension. In practice, the STRF is obtained as follows, by extending our method for obtaining the causal 1-D TRF. For each $\Omega$, we first obtain the temporal acausal filter $K(\Omega,t)_{\text{acausal}} = \int g(\Omega,\omega) e^{i 2\pi\omega t}\, d\omega$ and then transform this into a causal minimum phase filter $K(\Omega,t)$ as for the one-dimensional TRF filter. The final two-dimensional STRF is then obtained by an inverse Fourier transform of $K(\Omega,t)$ over $\Omega$. In general the model STRF has its highest amplitude at the preferred frequency on the spectral axis and for short latencies (i.e., the early part of the temporal axis). At low $I_F$, the STRF has a large excitatory region and a weak inhibitory surround (Figure 7C). At larger $I_F$, the STRF involves more excitatory and inhibitory regions with an increased inhibitory strength. Overall this has a more band-pass gain profile. Meanwhile, the bandwidth for the gain $g(\Omega,\omega)$ increases with $I_F$, thus shrinking the width of the main excitatory region. Therefore, adaptation to higher sound levels makes the frequency-time tuning curve sharper, or equivalently more narrowly tuned, and so, at a single cell level, supports a more precise read out of the time and frequency of auditory input. Qualitatively, physiologically observed STRFs adapt to the input intensity in the same way [7] (also see [14]). The model also predicts changes to MTFs and STRFs for different input correlations. Figure 7D shows the gain function $g(\Omega,\omega)$ and STRF for an example in which the input has longer-range correlations in both spectral and temporal dimensions (we set $\Omega_0 = 3.2$, $\omega_0 = 3.2$ while holding $I_F = 500$ as in the high SNR case in Figures 7B and 7C).
The peak modulation frequency in $g(\Omega,\omega)$ is decreased, and the excitatory region is wider compared with counterparts in Figures 7B and 7C at high SNR. This is consistent with our 1-D results in the spectral dimension (Figure 5).

Summary of findings and predictions
In summary, this study set out to understand the computational role of auditory spectro-temporal receptive fields (STRFs). In particular, we generalized previous work [26] by proposing that STRFs are efficient codes for inputs which retain maximal information for a given neural cost associated with the output. We analyzed this proposal in detail for the case that input signals and noise are approximated as Gaussian. Mathematically, the STRF transform can be shown [34] to be composed of three abstract steps: input de-correlation, gain control, and multiplexing. For typical input statistics that are shift-invariant in sound frequency and time, the transform can be compared with two sorts of experimental data. First, gain control corresponds to the magnitude of the modulation transfer function of the STRFs. Second, by choosing the form of multiplexing to arrange the STRFs to have minimal phase, one can predict their full form. That the STRFs or the MTFs adapt to input statistics is a direct prediction of this efficient coding framework, since both the information conveyed and the neural coding cost depend on these statistics. Our efficient coding proposal is thus experimentally testable.
We made two particular predictions about the adaptation of the STRFs, one associated with input intensity, the other with input correlation. For the case of intensity, we predicted that the MTF of the STRFs should become more low pass when input intensity is lowered. Intuitively, as long as inputs at nearby frequencies and times are correlated, a low pass filter smoothes the input to reduce noise, whereas a band pass filter extracts differences between input frequencies and times to remove redundancy. Compared with a band pass STRF, a low pass STRF has one or all of the following characteristics: (1) it has fewer excitatory and inhibitory regions; (2) each excitatory/inhibitory region has a larger size; (3) the secondary or opponent region, e.g., the inhibitory region for an STRF with a primary excitatory region, is weaker. All three characteristics help to smooth noise, a necessary strategy for weak inputs. In contrast, a band-pass filter has the opposite characteristics, so as not to increase the neural cost due to the transmission of redundant input information. These predictions are analogous to those seen in adaptations of visual coding to input SNR [29,33,34,51,52]. They also generalize previous accounts of the adaptation of the temporal auditory filter [26] to input intensity.
For the case of adaptation to input correlation, our framework predicts that the sizes of the excitatory and inhibitory regions of the STRFs should adapt to the range of input correlations. That is, input ensembles with longer range correlations in frequency and/or time should lead to STRFs with larger excitatory and inhibitory regions in the corresponding feature dimensions. Longer range input correlations are typically equivalent to greater input modulation power in the lower modulation frequency range in the stimulus ensemble. Equally, larger excitatory/inhibitory regions in the STRF are typically equivalent to its MTF being tuned to lower modulation frequencies. Thus, our prediction can be stated equivalently as saying that a stimulus ensemble with greater input power in the lower modulation frequency range, spectrally and/or temporally, should lead to neural MTFs tuned to the lower modulation frequency ranges. We demonstrated this form of adaptation for SRFs in Figure 5, and for STRFs in Figure 7. In particular, with a sufficiently high SNR, the MTF profile $g(\Omega,\omega)$ should whiten the ensemble specific input modulation power $\langle S^2(\Omega,\omega) \rangle$.

Experimental evidence and tests of the predictions
Various experimental observations pertain to these predictions about adaptation to input intensity. Lesica and Grothe [7] presented natural rain sounds to gerbils and found that, for a majority of cells in inferior colliculus (IC), the STRFs have more excitatory/inhibitory regions for higher input sound levels, and only have excitatory regions, or at least very weak inhibitory regions, for lower sound levels. Nagel and Doupe [14] conducted a similar study in field L of songbirds, an area analogous to mammalian auditory cortex. In both spectral and temporal dimensions, they found that the excitatory/inhibitory regions of the STRFs become smaller and sharper under higher sound intensity, while the number of such regions does not increase. These results paralleled those of an earlier study in which they only examined the temporal dimension of the receptive fields [58]. Both studies are consistent with our proposal that the MTF changes from lower to higher pass when input intensity (and hence, SNR) increases. They thus offer complementary confirmation of our predictions.
As mentioned in the Introduction, Lesica and Grothe [26] also examined the adaptation of the temporal receptive field(TRF) to vocalizations and ambient noises. They found that the TRF changed from being bandpass to lowpass when noise was mixed into the ensemble of vocalizations, and accounted for this finding in terms of efficient temporal coding. Their result can be understood as a special case of adaptation to SNR in our framework, focusing on the temporal dimension of the STRF, and treating the addition of noise as a reduction in input SNR. According to the principle of efficient coding, the spectral receptive field should also have changed from bandpass to lowpass when this noise was added.
There are as yet few physiological experiments that pertain to our prediction about adaptation to input correlations. One study by Woolley et al [11] examined the STRFs of midbrain neurons in zebra finch in response to bird songs or modulation-limited noise. Compared to that of the noise, the input modulation power of the songs is more concentrated in lower modulation frequencies. The MTFs of the STRFs matched the corresponding modulation frequency spans, consistent with our theoretical prediction.
The studies by Woolley et al. [11] and Lesica and Grothe [26] could be extended to different ensembles of natural stimuli, e.g., songs, speech, animal vocalization, and environmental background, each with its own particular input correlations [59]. Findings from such extended studies would provide a stern test of the efficient coding framework. Generally, the input modulation power $\langle S^2(\Omega,\omega) \rangle$ in natural sounds decays with increasing modulation frequency $(\Omega,\omega)$, at a rate that is specific to the ensemble [59]. Ensembles with faster decays have longer range input correlations (or larger correlations), as modelled in our Figure 5A and Figures 7B, 7C and 7D. We predict that this decay rate in $\langle S^2(\Omega,\omega) \rangle$ should dictate the shape of the neural MTFs $g(\Omega,\omega)$, such that ensembles with faster decay should lead to neural MTFs focusing on lower modulation frequency ranges. In particular, for high input SNR, the MTF profile should be that of a whitening filter $g(\Omega,\omega) \propto \langle S^2(\Omega,\omega) \rangle^{-1/2}$, with the upper frequency limit $(\Omega,\omega)$ for this whitening (beyond which the MTF quickly decays to zero) being around the frequency at which $\langle S^2(\Omega,\omega) \rangle$ is comparable to the power level of the noise. The recent study by Rodriguez et al. [59] showed that inferior colliculus (IC) neurons, when examined collectively as a population, do seem to whiten typical natural stimuli, in that the population MTF $g(\Omega,\omega)$ increases with frequency $(\Omega,\omega)$ (up to a high frequency limit). This is to be expected for an efficient code, since natural input power $\langle S^2(\Omega,\omega) \rangle$ decreases with frequency. However, the neural STRFs in this study were obtained (using the moving ripple stimuli) without specific adaptation to any particular natural stimulus ensemble.
We predict that if the STRFs had been measured under adaptation to the natural sounds for high SNR, then the neural MTF profile, at a neural population level if not at the individual neuron level, should be ensemble specific, i.e., whitening the input power $\langle S^2(\Omega,\omega) \rangle$ of the adapting stimuli.

The neural implementation of the efficient STRF and its adaptations
We seek to understand the overall effective STRF rather than its realization. Thus, it is important to note that the three separate steps of our mathematical analysis of the efficient STRFs are purely abstract. They do not correspond to an actual physiological implementation. In principle, when a receptive field is entirely linear, it can be implemented as well in a single step as in multiple linear steps in a cascade. Meanwhile, the observations that STRFs adapt to changes in the statistics of auditory inputs, and indeed that visual receptive fields expand when the visual environment changes from bright outdoors to dark indoors [52], attest to the availability of mechanisms for implementing (and thus adapting) efficient sensory coding.
We speculate that the adaptation of an STRF in a midbrain auditory neuron is likely to involve gain control in many intervening and distributed neural processes upstream along the auditory pathway [60]. Even a simple adaptation of efficient coding, in the large monopolar cells (LMCs) in an insect compound eye to changes in the distribution of input contrasts in the visual environment, involves multiple stages of processes, some in the photoreceptors and others in the lamina from the receptors to the LMCs [61]. Synaptic and intrinsic mechanisms were also found in the adaptation of retinal bipolar and ganglion cells to temporal contrast [62,63]. Considering the multiple synapses from the hair cells to IC or auditory cortex, and the many recurrent and feedback networks with both excitatory and inhibitory connections [64,65] in this pathway (for example, medial olivocochlear (MOC) efferent effects [66]), we speculate that gain control processes are likely to include synaptic facilitation and depression and distributed channel based adaptations. They should collectively achieve the effective adaptation in the gain such as the $g_k$ in equation (6) and/or the underlying eigenmodes. Because there are multiple, redundant, and distributed synapses from the auditory periphery to the neuron whose STRF we model, an STRF could be implemented in multiple ways. Such implementational redundancy is likely to be needed to accommodate the many forms of adaptation that might be needed, given a limited degree of flexibility in any individual mechanism.
The timescale of STRF adaptation to sound levels or input SNRs should be no longer than several seconds to tens of seconds, and may be much shorter, since, in the physiological experiments, the stimulus duration for one sound intensity level is 40 s in [7] and 5 s in [14], while adaptation to mixing noise into the vocalization inputs occurs within hundreds of milliseconds in [26]. Adaptation has been observed to occur over multiple timescales, ranging from tens of milliseconds to minutes in the fly visual system [67]. In the auditory system, midbrain neurons adapt to sound levels within hundreds of milliseconds [68,69], while cortical adaptation happens over multiple timescales and is likely to arise from network activities [70,71]. We still know too little about the actual mechanisms for STRF adaptation [26] or sensory adaptation in general, although it has been suggested that channel-based mechanisms at the cellular level are plausible candidates [67]. Understanding the computational roles of the STRFs should motivate future investigations of these mechanisms.

Limitations of the framework
As an initial attempt to understand the computational role of the STRFs, our framework has various limitations. First, the STRF model as a whole is quantitatively inaccurate since it specifies a linear mapping between sensory inputs and neural responses (in each adapted state). The accuracy could be improved in future work through the addition of a static nonlinearity after the STRF [6,7]. However, this would not be expected to lead to a qualitative change in STRFs or their adaptation. Extensions to dynamic nonlinearities would be much more complex. Second, for analytical convenience, we assumed that the input statistics are Gaussian, meaning that there are no input signal correlations higher than second order. The same approximation was made for the case of efficient visual coding, in the absence of good information about higher order input correlations [30,32,34]. Subsequent work using independent component analysis (ICA) on natural visual images avoided the Gaussian assumption, leading to models of visual encoding in primary visual cortex V1 [72,73]. This approach has been adopted to understand the STRFs in the auditory cortex [74] and avian primary auditory area field L [75], although it cannot predict adaptation to SNR and its whitening prediction does not go beyond that obtained under the Gaussian assumption. It is still controversial whether higher order statistics are the cause of the dramatic difference between the V1 encoding and that in the retina and the lateral geniculate nucleus [34].
Furthermore, higher order correlations in natural visual inputs contribute much less redundancy (measured in signal entropy) than second order correlations [36,37,38]. This may explain why the Gaussian assumption was not overly deleterious to the predictions of the efficient coding principle in vision. Although higher order correlations in auditory inputs are also poorly understood, they do cause auditory adaptation, e.g., in stimulus-specific adaptation to complex temporal patterns of tones [76]. To what extent higher order input statistics can influence auditory encoding remains to be answered in future studies.
Our focus on coding efficiency ignores aspects of auditory processing devoted to additional tasks such as sound source localization or stream segmentation. The observed STRFs may reflect elements of both efficient coding and requirements associated with these tasks. In fact, some variations are possible within the context of an efficient code. For instance, we have so far restricted ourselves by making all neurons share the same MTF profile predicted by efficient coding (by restricting the U transform to that in equation (9)). Relaxing this restriction would allow other STRFs. In particular, different neurons in the coding population could be tuned to different modulation frequency regions within the $(\Omega,\omega)$ extent covered by the overall MTF envelope $g(\Omega,\omega)$, and could have different shapes. Accordingly, different STRFs could have different spectral bandwidths (or resolution) and shapes, in addition to preferring different center frequencies f. Indeed, in the auditory cortex, different neurons exhibit different spectral resolutions, and even prefer different motion directions of the spectral ripples [77,78,19]. (Analogously, primary visual cortical neurons are tuned to multiple spatial sizes and prefer different orientations, a coding scheme that can be shown to be consistent with efficient coding [36].) Such a collection of STRFs could satisfy the joint goals of coding efficiency and detecting ecologically meaningful auditory objects (such as vocalizations). Diversity in the shape and bandwidth of the STRFs is already present, although perhaps less so, sub-cortically, e.g., in inferior colliculus [78]. When different neurons have different STRF bandwidths, our prediction that the input modulation power will be whitened by the neural MTFs should be modified, such that the 'neural MTFs' should mean the collective MTF of the whole neural population within a particular auditory stage (such as IC, see [59]).
There could be alternative formulations (other than equation (4)) of the efficient coding principle, in particular in the formulation of the neural cost. Our formulation, neural cost $= \sum_i \langle O_i^2 \rangle$, causes the degeneracy of the efficient coding solution, i.e., the existence of many equally efficient coding transforms, when the signals are Gaussian. Other formulations of the neural cost could break this degeneracy. For example, the formulation neural cost $= \sum_i H(O_i)$, in terms of the sum of the individual neural channel capacities (or entropies $H(O_i)$), or neural cost $= \sum_i \langle |O_i| \rangle$, in terms of the total activity level, would generate neural codes that encourage very different MTFs for different neurons. In both audition and vision, the MTFs (in audition) and the contrast sensitivity functions (the vision analog of the MTFs) for different neurons tend to be similar in the sensory periphery (cochlear nucleus and retina), but they are increasingly disparate further towards the central brain. These changes could be caused by different cost functions in the nervous system or, as discussed in the previous paragraph, by the breaking of the degeneracy by additional computational tasks further downstream along the sensory pathway.
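The difference between these cost formulations can be illustrated with a small numerical sketch. It assumes two zero-mean Gaussian output channels and mixes them by a rotation, used here as a hypothetical stand-in for the orthogonal freedom in the coding transform. It compares only the three cost terms and says nothing about the information term. The function names and the example variances are illustrative assumptions.

```python
import math

def rotate_cov(c11, c22, c12, theta):
    """Covariance entries of O' = R O for a 2-D rotation R by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    r11 = c * c * c11 - 2 * c * s * c12 + s * s * c22
    r22 = s * s * c11 + 2 * c * s * c12 + c * c * c22
    r12 = c * s * (c11 - c22) + (c * c - s * s) * c12
    return r11, r22, r12

def costs(v1, v2):
    """Three candidate costs for zero-mean Gaussian outputs with variances v1, v2."""
    power = v1 + v2  # sum_i <O_i^2>
    # For a zero-mean Gaussian, <|O|> = sigma * sqrt(2 / pi).
    mean_abs = math.sqrt(2 / math.pi) * (math.sqrt(v1) + math.sqrt(v2))
    # Sum of marginal differential entropies, in nats.
    entropy = 0.5 * sum(math.log(2 * math.pi * math.e * v) for v in (v1, v2))
    return power, mean_abs, entropy

# Hypothetical uncorrelated outputs with unequal variances.
before = costs(4.0, 1.0)
r11, r22, _ = rotate_cov(4.0, 1.0, 0.0, math.pi / 4)
after = costs(r11, r22)

print("power cost:   ", before[0], after[0])  # unchanged by the rotation
print("mean |O| cost:", before[1], after[1])  # changed by the rotation
print("entropy cost: ", before[2], after[2])  # changed by the rotation
```

In this example the power cost equals the trace of the output covariance, which a rotation leaves unchanged, so it cannot distinguish between the rotated codes. The other two costs depend on how the variance is distributed across individual channels, and therefore favor some rotations over others.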
Redundancy reduction and information preservation are two essential ingredients of the efficient coding principle. While this principle has been quite successful in understanding retinal coding, it cannot explain the enormous increase in the redundancy of the visual coding in the primary visual cortex (in which the number of neurons is about 100 times that in the retina) [34], nor the drastic loss of visual information outside the focus of attention in the higher visual areas, without introducing task-dependent factors. It remains to be investigated to what extent, and in what form, efficient coding applies further along the auditory pathway. One can expect that more processes will be devoted to solving specific auditory tasks, in addition to the task of sensory encoding, in the higher stages of auditory processing.

Concluding remarks
This study was partly inspired by the success of the efficient coding principle in understanding receptive fields in the early stages of visual processing, and the way these receptive fields adapt across sensory environments. Analogies between visual and auditory processes have been explored by previous researchers [79], and we expect that they can be carried further in higher level sensory processes including segmentation, selective attention [80], and even object recognition.
In conclusion, efficient coding provides a plausible computational interpretation of various recent experimental observations on STRFs, and notably the way they adapt to input environments. By making testable predictions, it motivates experimental directions which should hopefully lead to further insights and understanding.