Invariant neural responses for sensory categories revealed by the time-varying information for communication calls

Although information theoretic approaches have been used extensively in the analysis of the neural code, they have yet to be used to describe how information is accumulated in time while sensory systems are categorizing dynamic sensory stimuli such as speech sounds or visual objects. Here, we present a novel method to estimate the cumulative information for stimuli or categories. We further define a time-varying categorical information index that, by comparing the information obtained for stimuli versus categories of these same stimuli, quantifies invariant neural representations. We use these methods to investigate the dynamic properties of avian cortical auditory neurons recorded in zebra finches that were listening to a large set of call stimuli sampled from the complete vocal repertoire of this species. We found that the time-varying rates carry 5 times more information than the mean firing rates even in the first 100 ms. We also found that cumulative information has slow time constants (100–600 ms) relative to the typical integration time of single neurons, reflecting the fact that the behaviorally informative features of auditory objects are time-varying sound patterns. When we correlated firing rates and information values, we found that average information correlates with average firing rate but that higher-rates found at the onset response yielded similar information values as the lower-rates found in the sustained response: the onset and sustained response of avian cortical auditory neurons provide similar levels of independent information about call identity and call-type. Finally, our information measures allowed us to rigorously define categorical neurons; these categorical neurons show a high degree of invariance for vocalizations within a call-type. Peak invariance is found around 150 ms after stimulus onset. Surprisingly, call-type invariant neurons were found in both primary and secondary avian auditory areas.


Introduction
Information theoretic analyses are well suited to the study of neural representation since this mathematical framework was developed to quantify and optimize the encoding of informative signals in communication channels [1]. In sensory systems, Information Theory (IT) has been applied extensively as a complementary approach to the estimation of stimulus-response functions such as tuning curves, spatio-temporal or spectro-temporal receptive fields or other higher-level encoding models [2]. Information theoretic approaches have been particularly powerful in explorations of the nature of the neural code and its redundancy or efficiency [3][4][5][6]. For example, IT was used in early studies in the visual system to demonstrate that spike patterns contain information beyond average rate both for static images [7] and dynamic visual stimuli [8]. IT was also used to show that spike doublets can contain synergistic information that cannot be explained by an analysis of successive single spikes [9] and that, although information can only decrease in a signal processing chain, the neural coding efficiency increases as one moves to higher levels of sensory processing [10]. Finally, IT investigations also revealed that neural efficiency is higher when sensory systems process natural stimuli versus synthetic stimuli [11][12][13], in support of ethological theories of optimal sensory processing [14].
In sensory systems, the mutual information between a stimulus and the neural responses has often been estimated in a stimulus reconstruction framework and for continuous dynamic stimuli in stationary conditions, where time averages can be performed. In the stimulus reconstruction framework, one attempts to estimate the information about all aspects of the stimulus; for example, in audition, the stimulus would be represented by its exact sound pressure waveform. As long as the stimulus set is rich (i.e. has very large entropy), the mutual information can be an estimate of the maximum information that can be transmitted by a neural communication channel, also known as the channel capacity [3,4]. For instance, one can obtain the mutual information of an adapted auditory neuron processing white noise or colored noise sounds [11]. As long as the stationary assumption is valid, using continuous stimuli is also beneficial as it provides large data sets that are needed to estimate the joint probability of stimuli and neural responses, both of which can have high dimensions. Even in these conditions, it is noteworthy that a direct estimation of information is only possible when many repeats of the same stimulus can be obtained [15] or when simplifying assumptions are made [9]. Ultimately, the calculation of information based on stimulus reconstruction gives a single number corresponding to the information transmitted by a single neuron or an ensemble of neurons for a particular stimulus ensemble. By repeating the calculation for different stimulus ensembles, one can investigate how the channel capacity of particular neurons or neural ensembles might depend on the stimulus statistics (e.g. for natural vs synthetic stimuli). Furthermore, by repeating the calculation using different symbols to represent the response, the potential nature of the neural code (e.g. time patterns vs. rate) can be revealed.
Here we are using IT in a different sensory encoding context: the accumulation of information in a recognition task, such as face recognition in the visual system [16] or word recognition in the auditory system [17]. Recognition or identification is one of the key computations performed by higher sensory areas as opposed to the task of efficient stimulus representations that is performed in lower sensory areas and that might therefore be well quantified by information values based on stimulus reconstruction. In the recognition tasks, each stimulus is described by a simple label, such as the word corresponding to a given speech sound or that is used to label a given visual object. The relevant value of information in that task is then the information about these discrete labels and the information capacity of the system in its ability to identify the stimulus as a whole. In the recognition framework, one can ask how the information about the stimulus identity or label changes as a function of time relative to the stimulus onset and to what extent that time-varying information is redundant and, thus, how it accumulates over time. For example, one could ask at what time after stimulus onset does the performance of single neurons or ensemble of neurons match a behavioral performance of word recognition. Such an IT analysis has been performed in the primate visual system using a delay-matching to sample paradigm, and using spike counts, estimated in progressively longer windows, as the neural symbols [18].
In information studies based on continuous stimulus reconstruction, the neural code can be investigated in terms of its temporal resolution (i.e. letter size) and its integration time (i.e. word length). While the same properties of the neural code can be deciphered in the recognition framework, one can also examine the relationship between spikes at different points in time and time-varying information. This analysis is meaningful because a time zero corresponding to stimulus onset can be clearly defined and is behaviorally relevant. Moreover, stimulus-response functions for such discrete stimuli are not time-invariant. Responses in sensory neurons, in vision [19,20] and in audition [21,22], are often characterized by an onset response (or on-response) and a sustained response, where both the precision of spikes and the information coded might be different [23,24]. For example, a first spike latency code has been proposed as a fast encoding scheme in vision [25], audition [26] and somato-sensation [27]. Rolls et al. tested this hypothesis, by quantifying the fraction of information that is present in the first spike relative the on-going response [28].
Finally, in the recognition framework, one can also compare the information values obtained when different labelling schemes are used for identifying the stimuli as objects. For example, speech sounds could be labeled hierarchically as unique utterances, as phonemes, as syllables, as words, etc. One can then compare time-varying information about each of the levels in such hierarchical labelling scheme and gain insight on the neuro processing involved in object categorization. Although such a hierarchical representation of stimulus features has been used in encoding models for studying human processing [29], it has not yet been used in an IT analysis.
In this study, we developed a new approach for estimating time-varying information and cumulative information for sensory object identification task. Our approach assumes that time-varying neural responses can be modeled as inhomogeneous Poisson processes and generalizes well to large number of stimulus categories and to long integration times relative to the dynamics of the time-varying response. Our motivation for developing this methodology was to gain additional understanding on the neural representation of communication signals in high level auditory areas. Animal communication calls, just as speech sounds in humans, are categorized into behaviorally meaningful units. Significant progress has been made in identifying brain regions involved in categorizing sounds, in particular in the primate brain, where neural responses that are correlated with progressively more abstracts concepts are found in primary auditory cortex, the lateral belt of the auditory cortex and the prefrontal cortex [30]. However, the neural computations involved in generating categorical responses remain poorly described [31] and only a small number of studies have examined the neural categorization of natural communication calls in non-human species [32][33][34][35][36]. We and others have been developing an avian model system to study the neural processing of relatively large and complex vocal repertoires [37,38]. Our prior studies include a detailed bioacoustical analysis of the features that define each call-type of the complete vocal repertoire of the zebra finch [39] and the first characterization of neural responses to the calls from that large repertoire in primary and secondary avian auditory cortical areas [40]. In that study, we found that approximately 45% of auditory neurons encode information about call-type categories. Among those, a minority show strong selectivity for single call-type categories and invariance for calls within that category. Here, we investigated the processing in time that could lead to those observed categorical responses by comparing the time-varying information for stimuli labelled as individual utterances to the time-varying information for the same stimuli labelled by their call-type category. With that analysis, we were able to obtain values of temporal integration for stimulus identification and call-type category identification. We also analyzed the relationship between the time-varying firing rate and the time-varying information and, in particular, examined differences in selectivity in the onset versus sustained response. Finally, we used anatomical data to examine the distribution of neurons in primary and secondary avian auditory cortical areas with distinct responses properties as revealed by this IT analysis.

Results
We studied the time-varying information in a population of neurons recorded from primary and secondary regions in the avian auditory cortex of head-fixed urethane anesthetized zebra finches listening to a large set of their own species natural communication calls. Zebra finches emit various calls in different behavioral contexts and have a complete vocal repertoire composed of 11 call-types. The acoustical characteristics of each call-type have previously been described in detail [39]. Here we focused on the neural representation of 9 calltypes: 1) 3 pro-social calls emitted for pair bonding and social cohesion: the Distance Call (DC), the Tet Call (Te), the Nest Call (Ne); 2) the Song (So) that is emitted as a sexual display in males; 3) 2 calls emitted in aggressive encounters: the aggressive Wsst Call (Ws) and the Distress Call (Di); 4) 1 alarm call, the Thuk Call (Th) and 5) 2 calls emitted by juveniles: the Begging Call (Be), used by young to request food, and the Long Tonal Call (LT), a contact call that is a precursor of the adult DC. These 9 call-types show distinctive acoustic features, but they are also idiosyncratically produced by each individual bird such that the caller identity is encoded in the vocalization [41]. Here we investigate how auditory neurons extract the call-type label, irrespective of the individual and rendition variability. In this context, the categorical neural responses are not only defined by their discrimination of call-types but also by their invariance for caller identity or rendition within each particular call-type. To identify categorical neurons, our stimulus set was composed of approximately 10 different exemplars of calls, train of calls or songs produced by different vocalizers for each of the 9 calltypes. The call stimuli were randomly sampled from a large annotated data base of calls and songs from the complete repertoire of the male and female zebra finch. Neural responses were recorded using electrode arrays implanted in both hemispheres of 4 male and 2 female adult zebra finches. We recorded from a total of 914 single auditory units in both primary (Field L) and secondary (CLM, CMM and NCM) avian auditory areas. In previous analysis, we showed using a decoding approach that information about call-types was found in 404 (44%) of these units [40]. Here, we further restricted our population analysis of information to neurons, from that same data set, that reached a significant level of information about the stimulus (see Methods) and analyzed the time-varying information of 337 neurons during the first 600 ms of their response. Note that many calls are shorter than 600 ms, but also that they are often produced in succession with short inter-call intervals. Thus, this analysis window could contain two or more calls or the beginning of a longer song motif composed of multiple syllables. We only analyzed the response in the first 600 ms because the estimation of the cumulative information for longer time windows became unreliable, as we will explain below. Additional details on the neurophysiological recordings can be found in the methods section and in Elie and Theunissen [40].
In the Results, we first describe the approach we developed to estimate instantaneous and cumulative time varying information using a Poisson statistic assumption. We illustrate these calculations with specific examples of model neurons. We then validate our assumption and use this methodology to estimate the cumulative information for neurons in the avian auditory cortex as they process natural vocalizations.
We then analyze the time-varying coding properties of the avian cortical auditory neurons with an emphasis on: the relationship between spike rate and instantaneous information, the time constants observed for the cumulative information and the relative fraction of stimulus cumulative information that is used for extracting the behaviorally relevant categories corresponding to distinct call-types.

Estimation of the time-varying information
At a given time t, the instantaneous mutual information between the stimulus S and the response Y t can be written as a difference in Shannon entropies: Here, H(Y t ) is the response entropy for a window at time t, while H(Y t |S) corresponds to the entropy of the response given the stimulus or the conditional response entropy. H(Y t |S) can also be called the neural noise since it represents the variability in the neural response to the same stimulus. For spiking neurons, y t represents the number of spikes in the window at time t (Note: in our notation, capitals are used for random variables and lower case for a sample from that random variable).
Similarly, the cumulative mutual information in neural responses that are discretized into time intervals is given by: The entropies now include the time course of the neural responses starting at t = 0 and up to time t. The reference time t = 0 is set to the stimulus onset in our analyses but could be any arbitrary reference point.
The conditional response entropy and the response entropy are obtained from the distribution of the conditional probability of neural responses given the stimulus, p(y t |s), and the distribution of probability of each stimulus p(s i ): À pðy t js i Þ log 2 pðy t js i Þ the neural response at time t, is measured in spike counts and takes values from zero to a maximum rate value, R Max (for example, as dictated by the neuron's refractory period or numerically as p log p becomes infinitely small for high spike counts that have very small probability of occurring). The probability of the stimulus p(s i ) is usually taken as 1/n s , where n s is the number of stimuli, unless the study incorporates natural stimulus statistics. The probability of spike counts at time t given a stimulus, (y t |s i ), could be estimated empirically by recording hundreds or thousands of responses of the same neuron to the same stimulus. Although this approach has been shown to be possible in certain preparations [15], it severely limits the number of stimuli that can be investigated in most neurophysiological experiments. Here we propose a parametric approach where we model the distribution of neural responses to a given stimulus s i as an inhomogeneous Poisson process. The conditional probability of response (spike count) given the stimulus is then: where m s i ðtÞ is the mean response at time t for stimulus s i . This mean rate was estimated empirically using the time varying kernel density estimation (KDE) proposed by Shimazaki and Shinomoto [42]. The instantaneous information estimated in this fashion is relatively straightforward, as long as the Poisson assumption is valid and a sufficient number of trials (i.e. stimulus presentations) is obtained to estimate m s i ðtÞ (see Methods). The validity of the Poisson assumption for our dataset is assessed and discussed further below. Note that the Poisson distribution corresponds to the distribution with maximum entropy for unbounded count probabilities given a mean rate [43]. Thus, the value of entropy calculated for the empirical data, p(y t |s i ), constitutes an upper-bound when using the Poisson distribution assumption. If the empirical data follows a different distribution, the true conditional response entropy, H(Y t |S), will be lower and, thus, the actual mutual information higher. In other words, the assumption of a Poisson distribution yields a lower bound for the estimation of the mutual information.
The estimation of the cumulative information then extends this approach to joint probabilities of responses in successive time windows, (y t , y t−1 , y t−2 , . . .). Due to what has been labelled as the "curse of dimensionality", the numerical estimation of the unconditional probability becomes exponentially more expensive as the integration time increases. We evaluated multiple approaches based on different assumptions and found that Monte Carlo with importance sampling yielded the best results (see Methods and S3 Fig for comparison to alternative approaches).
If the Poisson assumption also holds for successive time-windows, the conditional probability of response at t is independent of the conditional response at previous times. pðy t ; y tÀ 1 ; y tÀ 2 ; . . . js i Þ ¼ pðy t js i Þpðy tÀ 1 js i Þpðy tÀ 2 js i Þ . . .
Given this probabilistic independence, it can be shown that the joint conditional response entropy is simply the sum of the conditional response entropies at each time point (see Methods): HðY t ; Y tÀ 1 ; Y tÀ 2 ; . . . jSÞ ¼ HðY t jSÞ þ HðY tÀ 1 jSÞ þ HðY tÀ 2 jSÞ þ � � � Thus, the estimation of the conditional response entropy is straightforward and not affected by the integration time. The problem of dimensionality arises in the estimation of the unconditional probabilities of response and the corresponding response entropy.
The probability of the time varying response is the joint probability of observing (y t , y t−1 , y t−2 , . . .). This joint probability cannot be expressed as the product of the probabilities at different times because these are not independent. More intuitively observing a particular y t−1 will affect the probability of observing y t . This is true because the time varying means of the Poisson distributions m s i ðtÞ are correlated in time; for example, if m s i ðt À 1Þ is high we might expect m s i ðtÞ to also have high values. These high values could be true for one particular stimulus s i but not for the other stimuli. Then observing a high value of y t−1 would predict a higher value than expected for y t (and an increase in probability that it was caused by s i ). The joint unconditional probability distribution is: pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þ ¼ X s i pðs i Þ½pðy t js i Þpðy tÀ 1 js i Þpðy tÀ 2 js i Þ . . .� 6 ¼ pðy t Þpðy tÀ 1 Þpðy tÀ 2 Þ . . .
Given the lack of independence, the response entropy must then be calculated from the joint probability distribution: . . . pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þlog 2 pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þ The estimation of this entropy was performed using Monte Carlo with importance sampling. In Monte Carlo, random samples of a vector (y t , y t−1 , y t−2 , . . .) are drawn from a proposal distribution q(y t , y t−1 , y t−2 , . . .) and used to estimate the expected value of log 2 p(y t , y t−1 , y t−2 , . . .) by an algebraic average weighted by the likelihood ratio of p(y t , y t−1 , y t−2 , . . .)/q(y t , y t−1 , y t−2 , . . .). The sampling stops when entropy estimations reach an equilibrium. Information estimations are also known to suffer from positive bias [44]. Here, biased-corrected estimates and errors on information values were obtained from Jackknifing the estimation of the time-varying rates and bootstrapping the Monte Carlo samples.
For these calculations, one has also to determine the size of the time window used for estimating the instantaneous information and correspondingly the steps for the cumulative information. This time window is used to estimate the spike counts and the average rate m s i ðtÞ and depends both on the dynamics of the stimulus and on the response properties of the neurons. By performing a coherence analysis on spike trains and a power spectral analysis on time varying rate in response to natural stimuli, we found that 10 ms (or 50 Hz) captured between 97% and 99% of the dynamics in our system (see Methods and S1 Fig).
Finally, we also estimated the information values for stimulus categories by performing the weighted sum of probabilities for stimuli belonging to each category. The information about categories at time t: I t ¼ HðY t Þ À HðY t jCÞ: is obtained from the conditional probability of response given the category c k , which is in turn calculated as the average conditional probability of response for the stimuli belonging to that category: Here, p(s k ) is the probability of occurrence of stimulus s k within the category c k . In controlled playback experiments (as here), p(s k ) is 1=n s k where n s k is the number of stimuli used to sample the category c k . Similarly, p(c k ), the probability of occurrence of vocalizations in category c k was taken as 1/n c where n c is the number of categories. In our system, the stimuli are individual renditions of vocal communication calls that fall into 9 call categories of the zebra finch vocal repertoire. We will contrast stimulus information to categorical information, both instantaneous and cumulative for these behaviorally relevant categories of call-types.

Time-varying information for model neurons
To validate our approach and to illustrate the behavior of time-varying information values, we calculated instantaneous and cumulative information for model neurons with simple and stereotyped response properties. Fig 1 shows the firing rates, raster plots and information values for 3 model neurons in response to 4 stimuli (S1-S4). One model neuron responds to the 4 stimuli with different mean firing rates that are constant in time (Rate Neuron). A second model neuron responds to the four stimuli with the same fixed firing rate but with different latencies (Onset Neuron). The third model neuron also responds with equal average firing rates to the four stimuli but the response occurs at different times (Temporal Neuron).
These simulations illustrate some very basic principles of neural coding. First, many different response profiles can lead to very similar rates of information: in all three simulations, the cumulative information approaches the maximum possible value (2 bits). Second, the coding capacities of neurons are a function of both the range of firing rates that can be achieved (as in the rate neuron) and the modulations in time of this neural activity. Third, the estimation of the instantaneous information gives an incomplete picture of the neural coding of a neuron as it does not incorporate the redundancy or independence of the neural representation over time. For example, on the one hand, when comparing the instantaneous information in the Rate neuron to that of the Onset neuron, one might erroneously conclude that the Rate neuron has more information while, in fact, the cumulative information shows that the Onset neuron is more informative at short time scales. On the other hand, one can also observe that the cumulative information in the Rate neuron continues to increase while the firing rate and the instantaneous information are constant; additional time points allows for a better assessment of that firing rate by time-averaging out neural noise.
These simulations also allowed us to validate our methods. The KDE for the empirical estimation of the time-varying rate based on the generated spike rasters gave very good predictions (solid lines Fig 1A-1C) of the actual model rates (dashed lines Fig 1A-1C): over all stimuli and model neurons, the average error was less than 0.02 spikes/ms. Not surprisingly then, using the actual rate versus the estimated rate yielded practically identical results in the information calculations (dashed vs solid lines, Fig 1D-1F). We also checked that the bias corrected estimates were accurate: the instantaneous information was indeed centered at zero when the response to the 4 stimuli was identical. We verified that the actual values of instantaneous and cumulative information were correct. For example, in the Temporal neuron (Fig 1C  & 1F) a peak instantaneous information of 0.5 bits is expected as 1 out 4 stimuli will be almost perfectly discriminated. Finally, we also assessed the limitations of Monte Carlo with importance sampling for the estimation of the cumulative information. For neurons, with continuously high firing rates, this estimation can become unreliable at longer integration times as illustrated by the calculation for the Rate neuron (Fig 1A & 1D). However, in those cases, the estimate of the standard error also increased drastically and allowed us to define end points for the calculation of the cumulative information.

Avian auditory neurons have approximate Poisson statistics
Our estimation of the time-varying information values is based on a Poisson parametrization for the distribution of spike counts in each time window of analysis and on the independence of the conditional probability of spike counts, p(y t |s i ), between successive time windows. Here, we assessed the limits of these assumptions.
First, we examined for each neuron, for each stimulus and for each time bin of 10 ms, whether the empirical distribution of spike counts across stimulus presentations was different than the one expected from a Poisson distribution whose mean was given by the kernel density estimation (resampling test; see Methods). For each neuron, we obtained the proportion of time bins where the Poisson assumption was rejected. Fig 2A shows  The model rates for the four stimuli are shown in colored dotted lines. These rates are used to generate spike patterns using an inhomogeneous Poisson process. Ten realizations for S1-S4 are shown as spike rasters. The rate recovered from those rasters using a time-varying kernel density estimate (KDE) are shown as colored solid lines. The shaded area around each line indicate the error on the KDE rate estimated by the standard error on the Jackknife estimates. The bottom row (D,E,F) shows the instantaneous and cumulative information obtained using either the true model rate or the recovered rate from KDE. The error bars in the instantaneous information calculations were obtained by jackknifing the KDE estimates. For the cumulative information calculations, the Monte Carlo simulations were repeated for each jackknife estimate. The error bars therefore include both the errors in the estimation of the true rate from spike rasters (data limitation) and the errors in the estimation of the cumulative information due to limited sampling (computational limitation). The time-varying response rates were chosen to illustrate neural representation for stimuli using different neural codes labelled here as Rate (A, D), Onset (B, E), and Temporal (C, F). https://doi.org/10.1371/journal.pcbi.1006698.g001 Time-varying information for auditory objects the Poisson statistics assumption for spike is warranted in more than half of our neurons. For the other neurons, the deviation is relatively small. As stated above, the Poisson assumption used for non-Poisson data yields a lower bound on information. Given the small number of presentations of each stimulus (n = 10), one can argue that the test of conformity to a Poisson distribution is relatively conservative (i.e. we might fail to reject the Poisson statistic because of small sample size). Thus, we also assessed whether the first and second moments (mean and variance) obeyed Poisson statistics. Fig 2C shows the Fano Factor estimated at each Time-varying information for auditory objects consecutive time bins after stimulus onset, both for individual stimuli (insets) and average across stimuli (solid line). The mean Fano factor for the population was 1.04 with SEM of 0.04 ( Fig 2D). This second analysis further quantifies the degree with which our neural data follow Poisson statistics and also shows that for a majority of neurons the assumption is warranted.
The second assumption used in the calculation of cumulative information is that the conditional probability of spike counts, p(y t |s i ), is independent across successive time bins. To assess the validity of this assumption, we estimated the Pearson correlation coefficient of spike counts obtained in successive time windows in responses to the same stimulus, also known as the noise correlation. Fig 2B shows the distribution of noise correlations for the 404 neurons examined. Significant noise correlations were observed (red bars versus non-significant values shown in blue), but they were small, centered around zero and their distribution showed a slight positive skew. These positive values correspond to cases were a slight increase of the spike count above average at a given stimulus presentation and time bin is somewhat giving a higher chance of greater than average spike count in the successive time bin of the same stimulus presentation. Ignoring noise correlations can lead to either under or over estimates of mutual information depending on the functions that describe the mean time-varying response to each stimulus [45,46]. In the majority of cases, not taking into account noise correlations leads to an under-estimate of mutual information. For instance, if responses in successive time bins decrease for one stimulus and increase for another stimulus, positive or negative noise correlations will result in lower conditional response entropies, H(Y t , Y t−1 , Y t−2 , . . .|S), and thus higher mutual information. In the general case of stimulus time-varying rates that have different and random shapes depending on the stimulus, one can also show that noise correlations increase mutual information. In some cases, however, noise correlations can increase the conditional response entropies and not taking them into account lead to an over-estimate of the mutual information. Overestimation of mutual information commonly happens when the responses for stimuli change in the same manner in successive time bins (increase or decrease in mean rates for all sounds), then positive noise correlations can lead to an increase in conditional response entropies and thus a decrease in mutual information. In our data, the time varying rates of the different stimuli are positively correlated at the onset response, but mostly uncorrelated at other times (see . We might therefore slightly overestimate mutual information at stimulus onset, but for most of our analysis we might be slightly underestimating cumulative information. Given that the noise correlations are small, this estimation bias is small. While we can neglect noise correlations in our calculations and assume independence of conditional probability of spike counts, p(y t |s i ), across successive time bins, we are not assuming independence of the unconditional probability, p(y t ), across successive time bins. Indeed, the distribution of stimulus correlation (correlation of mean rate between pairs of consecutive time bins in the trial average response of each stimulus) across the neural population show high positive values centered around 0.4 ( Fig 1B). These values indicate that the change in the spike counts in successive time bins is highly driven by the stimulus identity. As such, the unconditional probability distributions in successive time windows are not independent. We do take into account these large stimulus correlations in our estimation of the cumulative mutual information.

Time-varying Information for 3 example neurons
In Figs 3-5, we show the time-varying firing rates, the time-varying instantaneous information and the cumulative information for 3 zebra finch auditory neurons with distinct response properties. The neuron in Fig 3 responded robustly to all communication calls with high and reliable firing rates. It also responded in a time locked fashion sometimes at multiple time points for single calls. Although this neuron was not selective for a particular stimulus or calltype category, by combining rate and temporal codes it reached very high levels of instantaneous information. Moreover, this instantaneous information showed little redundancy yielding very high cumulative information. One can also observe, that for this particular neuron, the average time-varying firing rate is not correlated with instantaneous information. This neuron shows a strong onset response in its firing rate while the instantaneous information is two examples of calls from each call-type; the color scheme is as in B; 0 ms corresponds to stimulus onset. B. Averaged time-varying firing rate obtained for each call-type category. These average responses are obtained by averaging the KDE estimates of firing rates for each stimulus belonging to the same call-type (~10 example stimuli per call-type). Shaded error bars show ± one SEM. C. The overall (averaged over all stimuli) time-varying firing rate for the same example neuron is shown by the solid black line (right y-axis) with ± SEM as shaded error bars. The green line and yellow line (left yaxis) correspond to the instantaneous information for stimuli and call-type categories calculated in successive 10 ms windows. Shaded error bars were obtained by a jackknife procedure D. Cumulative information for stimuli and call-type categories for the same example neuron. The shaded error bars are ± SE obtained by jackknifing and resampling (see Fig 1 and Methods). Each cumulative information curve is also fitted using an exponential function (dashed lines) characterized by three parameters: a latency (dt, in ms), an exponential rate (tau, in ms) and the infinite time limit value expressed as a fraction of the maximum information achievable (k). The MSE is the mean square error of the fit in bits. This example neuron had a very high stimulus evoked mean firing rate with rapid and reliable time-varying dynamics leading to high information rates. Spontaneous firing was very low (~0 spikes/s). Spike rasters for this neuron are shown on S4 The cumulative information curves were fitted with an exponential function that allows us to quantify the time constant of information accumulation (τ), the saturation level (k) relative to the maximum information that could be achieved (I Max ) and the latency (dt): Most of the calls in the zebra finch repertoire have durations that are shorter than 300 ms [39]. In order to investigate the cumulative information accumulated at this behaviorally  relevant fixed point in time, we estimated the relative cumulative information as the value of the model at 300 ms relative to I Max .; The results of the fit are shown in dashed lines in the cumulative information plot for each neuron. One can observe that the neuron in Fig 3 has exceptionally high values of saturation: with sufficiently long integrations time, single spike trials could be used to perfectly assess what stimulus (out of those used in the experiment) was heard and, thus, which call-type category it belonged to. The selective neurons in Figs 4 and 5 have much lower saturation levels as expected since they mostly respond to stimuli from a single call-type category. The neuron in    Fig 6A shows the time course of the firing rate averaged across all stimuli and its relationship with the instantaneous information. On average, avian cortical auditory neurons show an onset response followed by a sustained response as observed at many stages of auditory processing and in many sensory systems. The instantaneous information, however, remains almost constant during the entire time. This is true both for the information about individual stimuli or the information about categories. Thus, the onset response is only less informative than the sustained response in bits/spike but not in bits/s; when processing natural vocalizations, the onset and sustained response are equally informative. Moreover, the information in the sustained response continues to provide new information as reflected by the continuous and relatively fast increase in cumulative information shown in Fig 6B. The non-redundancy of the information in the sustained period is also well captured by examining the derivative or slope of the cumulative information shown in Fig 6C. That plot shows how much new information is acquired at each time bin. The new information for categories is relatively constant (red line Fig 6C) and similar to the instantaneous information about category (yellow line Fig 6A), while the spike rate is drastically decreasing between 50 and 150ms. Thus, as time progresses, individual spikes carry more new information for categories: there is very little redundancy in the code for call-type categories up to 200 ms and during that time the coding efficiency in bits/ spike increases for the new information that is acquired about categories. In contrast, the new information for individual stimuli decreases and is, on average, remarkably correlated with spike rate (Fig 6C). Thus, while, in terms of instantaneous information, the coding efficiency (in bits/spike) is greater in the sustained versus onset period, in terms of non-redundant information, the coding efficiency is remarkably constant throughout the response. This result suggests that an integrating average spike rate measure (i.e. the time running average spike count over stimuli) could serve as a relatively good proxy of cumulative information on individual stimuli. However, note that this cumulative information needs to take into account time varying spike rates (or stimulus locked spike patterns) as in our calculation.

Rate and time-varying information: Population analysis
Indeed, taking into account time varying spike rates for each stimulus as opposed to the mean firing rate across the entire time-window for each stimulus is crucial in the accurate calculation of time-varying cumulative information. As seen in Fig 6B, the increases in the cumulative information for the time-varying rate code (solid lines) is much larger than the increases obtained with the fixed rate code (dashed lines; the information still increases because of noise averaging as explained above). The time-varying rates observed in neural responses (as illustrated in Fig 3A) provide additional information. How much more? For the coding of individual stimuli (comparing the dashed blue line to the solid blue line in Fig 6B), a fixed rate code (or assumption) captures only 24% of the information at 100 ms and 21% at 300 ms. The effect for categorical information is smaller because some of the coding dynamics in time-varying responses to stimuli belonging to the same category effectively become neural noise: for categorical information, a fixed rate code captures 50% of the information at 100 ms and 41% at 300 ms.
The distributions of time constants, τ for the cumulative information for stimuli and calltype categories are shown on Fig 7A. The distributions of relative cumulative information at 300ms (k 300 ) are shown on Fig 7B. The range of time constants observed across the population Time-varying information for auditory objects of neurons was large (100 ms-600 ms) with average time constants of 459 ms for cumulative stimulus information and 372 ms for cumulative categorical information. This difference in means of time constants is statistically significant (Paired t-test t(214) = 3.49, p = 0.00058); the ongoing time-varying rate changes continue to provide more information for decoding stimuli and less so for decoding categories of stimuli. There is also a wide distribution of relative cumulative information values (k 300 ranging from close to zero to 0.6). On average across neurons, the k 300 was 0.2 (or 20%) for stimuli and 0.15 (or 15%) for categories. These differences in relative information values are highly significant suggesting that, single neurons, capture more variability in stimuli than in categories. Note however that measures of relative information depend on the number of stimuli or categories. Therefore, a direct statistical comparison is not warranted. The comparison between stimulus representation and category representation requires estimations of expected values of categorical information given stimulus information which is performed below. Relative cumulative information values can, however, be used for stimulus and categories independently to assess other coding properties: although average time-varying rates and instantaneous time varying information are not well correlated within single neurons (as shown in Figs 3 and 6), the relative cumulative information is correlated with average firing rates. For cumulative stimulus information, one finds an increase in k 300 of 1% per spike/s (Adj

Analysis of the categorical information
We are interested in identifying neurons that could play an important role in categorizing vocalizations. We had previously identified example neurons that were highly selective for particular call-types and showed a high degree of invariance such as those shown in Figs 4 and 5 Time-varying information for auditory objects [40]. Here, we attempted to quantify the neural invariance for call renditions within call-type categories along time. For this purpose, we computed a Categorical Information Index (CII). The CII compares the actual categorical cumulative information to three potential values (see cartoon probability distributions of neural responses in Fig 8A): a floor or minimum value (set at 0), an expected value for shared information between stimuli and categories (set at 1) and a ceiling or maximum value (set at 2). The floor is the categorical information that one would . CII is defined by comparing the cumulative categorical information to a floor (CII = 0), an expected value predicted from the stimulus information (CII = 1) and a ceiling value (CII = 2) as described in the text and methods. Each plot represents the distributions of hypothetical neural responses for different stimuli (S1 to S4) that yield different values of CII. When stimuli are grouped randomly into categories (S 1 and S 2 belong to C 1 , while S 3 and S 4 belong to C 2 ), CII is close to 0 (left panel). When stimulus categories preserve the order of the stimulus conditioned response distributions and these distributions are equally separated, effectively coding for both stimuli and categories, CII is equal to 1 (middle panel). When stimuli belonging to the same category have identical responses, the stimulus information is equal to the category information, the response being effectively invariant within categories and CII is equal to 2 (right panel). Note that the actual response probability distributions are multidimensional (one dimension for each time lag) with Poisson marginals. B. Example of the cumulative information for stimuli and calltype categories and CII for 3 different representative neurons. The blue and red lines are the cumulative information for stimuli and categories respectively. Estimates of the floor (dashed-green), expected (dashed-orange) and ceiling (dashed-purple, here overlapping with blue) values for the categorical cumulative information as described in A are given as well. The CII is shown as a solid line in the second row for each example neuron. Cell 1 has long periods of time with CII < 1, Cell 2 has long periods of time with CII~1 and Cell 3 has long periods of time with CII > 1 as shown by the arrows. https://doi.org/10.1371/journal.pcbi.1006698.g008 Time-varying information for auditory objects obtain if stimuli are randomly grouped. The shared-information value is the information that one would obtain if the information about stimuli is equally shared across all stimuli and the neural responses for stimuli are perfectly sorted for each natural call-type category; for example, at a given point in time the 10 renditions of DC would elicit the 10 highest rates, the 10 renditions of LT Call the next 10 higher firing rates and so forth. This shared-information value does therefore assume that, for a given neuronal signal to noise ratio, neural responses segregate categories maximally while also preserving the maximum discrimination between stimuli within categories. The ceiling value is the categorical information value that one would obtain if a maximum amount of information about stimuli was used for the discrimination of categories and a minimum for discriminating stimuli within categories; it assumes maximum invariance to variations within a category. The ceiling value is equal to the stimulus information until it reaches log 2 (n c ), where n c is the number of categories, corresponding here to the 9 call-types. Fig 8B shows Fig 9A). The average CII over neurons (bold black line) is very close to the shared value of 1 as one might expect if acoustical differences across stimuli drive neural responses in a linear fashion along some acoustical feature and call categories segregate perfectly along that same acoustical feature. However, the average CII is also slightly (and significantly) above 1 between 120 and 260 ms showing some small degree of average invariance for call renditions within a call-type category during those times. In addition it is clear that there is a wide distribution of CII around the shared value of 1. And this distribution is positively skewed towards values higher than 1 as exemplified by the density plots calculated at 150ms. This distribution includes many neurons that exhibit a high degree of invariance for call renditions within call-type categories as also shown by the average CII for the top quartile (red solid line). Focusing on these 25% of neurons with the highest CII, we found a maximum value of CII at 175 ms, indicating that, relative to the maximum invariance achievable given their individual stimulus information values, these neurons are maximally invariant around that time. In terms of absolute value of additional categorial cumulative information relative to the expected shared information value, these same neurons reach a maximum of 0.16 bits at 320 ms (S7 Fig).
Do these high CII neurons exhibit other characteristic response properties? In the scatter plots of Fig 9B and 9C, we examined the relationship between CII (color coded) and, respectively, time constants and relative level of the cumulative information. It can be seen from Fig  9B, that neurons with high CII have relatively long stimulus time constants in comparison to their corresponding time constant observed for categorical information. This relationship is also significant for the entire population of neurons (Linear Regression explaining CII from τ cat /τ stim : Adj R 2 = 0.11, F(1,213) = 28.5, p = 2.37 10 −7 ). As shown in Fig 9C, neurons with high CII also have higher relative levels of categorical information in comparison to their relative values for stimulus information although this result is expected given our definition of CII (Linear Regression explaining CII from k 300,cat -k 300,stim : Adj R 2 = 0.76, F(1,213) = 675, p = 5.5 10 −68 ). Finally, one can also notice that neurons with high CII have low values of relative stimulus cumulative information (Linear Regression explaining CII from k stim (300): Adj R 2 = 0.28, F(1,213) = 84, p = 3.96 10 −17 ). This effect is caused by the correlation between invariance and selectivity as we have shown previously [40]: neurons that show the highest degrees of invariance also tend to respond to a small number of call-type categories and thus a small number of stimuli. As such, high invariance and high selectivity goes hand in hand with lower values of information.

Anatomical organization
We examined whether neurons were organized in the avian auditory cortex based on their CII, cumulative information time constants (τ) and the relative cumulative information at 300 ms relative to the maximum (k 300 ). Most of our recording sites were identified histologically and could be assigned to avian cortical areas that had been segregated into the thalamic recipient area, L2; intermediate primary auditory regions, L1, L3, CLM and L; and secondary areas, CMM and NCM [47][48][49][50] (n = 303/337 neurons). We also obtained spatial x,y,z coordinates of the recording sites relative to the midline, the position along the rostral-caudal axis where the lamina pallio-subpallialis (LPS) is the most dorsal, and the top of the brain (Fig 10). shown on a parasagittal view (the feedfoward pathways are in solid arrows, the feedback connections are in dotted arrows and motor areas are delineated by grey dotted lines). Our recording sites were from the primary and secondary avian auditory cortical areas (the auditory pallium) and are shown on the right panel. The recording sites from different birds were aligned in the rostral-caudal dimension using the oblique rostro-caudal long axis of L2 (dotted line). Recording site positions also varied along the medial-lateral dimension (in and out of the page) and are all superimposed here. Abbreviations: CN, cochlear nucleus; SO, superior olive; LL, lateral lemniscus; MLd, dorsolateral nucleus of the mesencephalon; Ov, nucleus ovoidalis; RA, robust nucleus of the arcopallium; HVC, used as a proper name; NCM, caudomedial nidopallium; CM, caudal mesopallium. B. Functional anatomical maps of the Categorical Information Index, the time constant τ and the relative value at 300ms k 300 of the cumulative information for stimuli for all single units with anatomical data (303/337 neurons). Each site is colored according to the value(s) of the variable being displayed. Transparency is used to reveal overlapping sites in this view as well as, potentially, combination of single units that were recorded from the same electrode and site. https://doi.org/10.1371/journal.pcbi.1006698.g010

Time-varying information for auditory objects
In all regions, we found a range of CII and thus only weak anatomical effects across areas. Using the L2 dorsal-ventral oblique axis as reference point, CII was slightly higher as one moved rostrally or caudally to higher regions of auditory processing (Linear regression: Adj R 2 = 0.036, F(2,195) = 8.39, p = 0.0042). However, an ANOVA also suggested that both regions NCM and L2 had slightly higher mean CII (Adj R 2 = 0.027, F(4,185) = 2.77, p = 0.042). We also observed an increase in the time constant (τ) for the stimulus cumulative information that parallel the increase in CII as one moved away from the L2 axis (Linear regression: Ajd R 2 = 0.04, F(2,195) = 9.32, p = 0.00258). The saturation constant at 300 ms (κ 300 ) for stimulus information was also higher in the central auditory region and smaller as one moved away from the L2 axis (Linear Regression: Ajd R 2 = 0.035, F(2,195) = 8.18, p = 0.0047). Thus, there is more invariance in higher auditory areas and longer time constants but less absolute information about the stimulus. Note again that these significant anatomical trends were characterized by very small effect sizes. We did not find, however, any effects of anatomy on the time constant, (τ), or saturation constants (κ 300 ) for the cumulative categorical information; in this data set, the subset of neurons that have high invariance have similar response properties in both lower and higher regions of the avian auditory pallium.

Discussion
We modeled responses that are observed in the auditory system as inhomogeneous Poisson processes in order to estimate the time-varying instantaneous and cumulative information for vocalizations used in communications. We showed that using Kernel Density Estimation for the time-varying firing rate and Monte Carlo with importance sampling for estimating probabilities, we were able to obtain accurate and bias-free estimates of these time-varying information values. This parametric approach is powerful because a relatively small number of trials can be used to estimate the time-varying response and thus information values can be estimated in response to a relatively large set of stimuli (here over 100 distinct vocalizations) given typical recording times. The auditory avian cortical neurons recorded here and stimulated with natural communication calls had approximate Poisson statistics with small violations that mostly resulted in our estimation of information being a tight lower bound. More generally, the same procedures could also be used with spike trains statistics that can be parametrized with probability functions that depend only on a time varying rate such as inhomogeneous gamma or inhomogeneous inverse Gaussian [12,51] and could also be extended to ensemble of neurons. Although Poisson statistics are often observed in neural data and will correctly fit any data set obtained from pooling responses across a large number of trials [52], we realize that they make the strong assumption that the firing rate for a particular trial and for a particular neuron depends only on time. Refractory period in spiking neurons and other correlations in single neurons or in an ensemble of neurons that are not phased locked to the stimulus (i.e. noise correlations) are common violations of this assumption. In our data, we found small positive noise correlations reflecting the fact that neural noise effects extend beyond our analysis windows of 10 ms. Taking noise correlations into account can increase stimulus decoding accuracies [23]. In cases where noise correlations are informative, the method proposed here would only yield lower bound estimates of information theoretic values. Alternatively, one should try the use of spike metrics measures that can capture noise correlations [53,54] in combination with stimulus decoding approaches to then obtain measures of information from the confusion matrix of predicted versus actual stimuli [13,40,55,56]. If spike metrics can be estimated accurately (with limited data), repeating the stimulus decoding procedure for progressively longer time windows would yield cumulative information values such as those calculated here but which could consider noise correlations.
Beyond our methodological contribution, the principal goal of this analysis was to characterize the neural code in higher auditory areas for communication calls. We found that, on average, auditory cortical neurons responded to these natural stimuli with time-varying firing rates that exhibited an onset and a sustained component of the response. Although in some situations the auditory cortex appears to respond only transiently [26,57], our data supports findings from mammalian species that have showed strong sustained responses when neurons are driven by preferred stimuli [21]. Similarly, in the human superior temporal gyrus, the sustained responses to speech have been shown to be more informative for decoding speech phonemes [58]. Natural sounds in general have also been shown to be particularly efficient at driving auditory areas [12,[59][60][61][62][63][64] and thus the presence of informative sustained responses is not surprising. Indeed, in some of our most selective neurons, such as the neuron in Fig 5, the onset response is missing and only the sustained response is observed. More generally, and on average, we found that both the transient and sustained response had information about the stimulus identity and that the information in these two response phases was not redundant: the stimulus space of natural vocalizations is very large and although the initial response provides some clue as to the nature of the vocalization, new and additional information is observed in the sustained response. On the one hand, one might argue that, in natural vocalizations, the sound itself is changing with time. These stimulus changes occur both within calls that are made of composite notes or in call bouts and song motifs made of multiple syllables. Sustained responses in such cases could be interpreted as multiple onset responses (but with decreasing amplitude). On the other hand, from a behavioral perspective, these communication calls correspond to a single auditory object: a message from a particular individual about a particular state. From either perspective, these observations and analyses illustrate the importance of using behaviorally relevant stimuli when analyzing the nature of the neural code.
The information about stimulus identity was shown to be approximately equal in bits/s in the onset and sustained response. Given that the average onset response (in spikes/s) is greater than the sustained response, one might conclude that the onset coding (in bit/spikes) is not as efficient. Although, this is true for instantaneous information, we also found that the coding efficiency for novel information was remarkably constant for stimulus information while increasing in the first 200 ms for categorical information.
The information efficiency in the onset versus sustained response might also be different when an ensemble of neurons is considered; the relative timing of the first spike has been shown both in audition [26] and in other sensory modalities [25,27] to be highly informative. Moreover, in addition to stimulus identity other stimulus features are also encoded in neural responses; the timing of the stimulus is clearly marked by the onset response or transient response [22,58,65,66] and other stimulus attributes such as the location, fundamental and loudness/distance of the sound source are also processed in auditory cortex [67][68][69]. It is therefore very likely that the onset response contains information about stimulus features other than stimulus identity or category and, potentially to a greater extent than in the sustained response (e.g. relative timing of onset). We note, however, that the opposite was found for the neural discrimination of two vowel sounds in the ferret auditory cortex where the onset response or early response was the most informative for decoding vowel identity relative to other attributes [69].
Our IT analysis also revealed the importance of stimulus locked spike patterns even in relatively short neural responses: in just 100 ms, the mean rate captured only a quarter of the information present when time-varying firing rates are considered. Thus, our analyses provide additional evidence that spiking patterns carry a significant amount of stimulus information and should therefore not be ignored in the analysis of neural responses [70,71]. The spiking precision analyzed here was relatively coarse (10 ms windows) and matched to the time scales of the relevant dynamics in the stimulus: although zebra finch calls are much longer than 10 ms, they are complex sounds with fast spectro-temporal features [39]. Therefore, although the neural code observed here uses fast varying spike rates, it cannot be labelled as a temporal code. In a rigorous definition of a temporal code, temporal information in spike patterns must code stimulus attributes other than the stimulus dynamics [72].
The cumulative information for stimulus identity or for categories increased for sustained periods of time before saturating. These saturating curves were well fitted with exponentials and yielded relatively long-time constants of approximately 460 ms for stimuli identity and 370 ms for call categories. These information time constants are long in comparison to the integration times that are usually found for auditory cortical neurons; the spectro-temporal receptive fields of avian auditory neurons rarely extend beyond 50 ms [73][74][75][76] although adaptive responses on longer time scales have also been described [77][78][79]. Information time constants depend on the integration and adaptation time constants of neurons but also on the stimulus dynamics: although stimulus identity and stimulus category are fixed in time, the sound itself has time varying features that can provide additional information as time goes on. It is the triple combination of the dynamics of the natural stimulus statistics, the neuronal integration time and the neuronal signal to noise levels that are going to affect the information time constant. Natural sounds and in particular communication calls are informative objects not only because of their spectral structure but also their rich dynamical structure. The importance of time in the neural code used in the auditory system has been emphasized multiple times [80][81][82][83][84] and our cumulative information analysis further stresses the importance of using natural sounds or synthetic sounds that carefully match natural dynamics when probing auditory neural encoding. Ultimately, it is an information time constant that includes stimulus dynamics that should be compared to behavioral responses [18]. A behavioral assay of reaction time for all the calls in the zebra finch repertoire has not been performed but the values of a few hundred of ms correspond to the shortest time intervals between call and call back in anti-phonal calling in paired zebra finches [85]. Although, faster times could be obtained in reaction to any sounds (e.g. startle or orientation), the processing of stimulus identity to extract the information on who is calling and what is being said might require these longer processing times.
Finally, we quantified the fraction of the information about stimulus identity that could be used for extracting the call-type category. We used that analysis to characterize neurons not only in terms of their absolute coding capacities for call-type categories but also in terms of their invariance in their response to different stimuli belonging to the same call-type category. We found that, on average, information for categories is close to what one would expect if neural representation for stimulus identity is also segregated along call-type categories. Nonetheless, the invariance for categories was significantly above that expected value between 120 and 260 ms after stimulus onset. In addition, the distribution of our categorial index also had a positive skew with a long tail: many neurons had a high degree of invariance and could therefore be classified as categorical. We have shown in previous work that this highly invariant neurons could not be simply the result of sampling; when invariance metrics are calculated using stimuli that are grouped randomly but that preserve the acoustical distance found in the natural categories, one does not find the similar highly invariant neurons [40]. To better understand the response properties that give rise to this category specific invariance, one could compare the empirical distribution of Categorical Information Indices (CII) to the distribution of CII obtained either with a bank of linear modulation filters or with models that include particular non-linearities as was done in [86]. The properties of the model neurons that yield the closest CII distribution to the one empirically observed could then be compared to the actual non-linear receptive fields of the neurons that could be estimated by a multiple filter decomposition [87].
Are the categorical neurons in our study anatomically organized? On the one hand, the categorical and invariant neurons described here had even longer time constants for cumulative stimulus information than the non-invariant neurons in our dataset suggesting that they could be higher in the auditory processing stream. On the other hand, we did not find a separate population of invariant neurons across all of our recordings of neurons informative for categories: the distribution of our categorical index was unimodal. Moreover, we found only weak anatomical correlations with very small increases in the Categorical Information Index along an anatomical axis corresponding to lower vs higher auditory areas. Higher avian auditory cortical areas have been associated with more complex spectro-temporal receptive fields [73,76,88], increase robustness to noise [89][90][91] and more specificity for processing natural sounds [12,92]. It is possible that we failed to find a large anatomical effect because our sampling of higher auditory areas was relatively sparse; in particular we have a small number of recording sites in CLM and in the more ventral and medial regions of CMM. It is also possible that, although high-invariance neurons are found throughout the avian auditory pallium, this invariance is achieved differently and potentially for different purposes in different regions. A better description of the non-linear receptive fields of these neurons would also allow us to investigate this possibility.
However, it is also striking to see that both in our study and in previous research that investigated invariant properties of avian auditory neurons to classes of natural sounds [92,93] there is a high degree of heterogeneity in the neural responses within each area. For example, although small differences across auditory areas were reported (and of similar effect size than the spatial effect described here), Meliza and Margoliash [92] also found neurons that were tolerant to different renditions of song types throughout the auditory pallium of the starling. Here, we found that neurons with high Categorical Information Indices could be found anatomically next to neurons with low Categorical Information Indices. Some of this functional heterogeneity could be associated with different cell types [75,92] and a better understanding of the micro-circuitry in the avian auditory cortical areas is clearly needed [30]. It is interesting to note, however, that this mixing of low level and high-level response properties is not unique to the avian auditory system as similar heterogeneity has been found in the mouse [94] and ferret auditory cortex [95,96]. Contrary to the visual system, the auditory system might preserve a higher mixture of low-level and high-level sensory responses properties at multiple stages of processing including the higher auditory areas involved in auditory object recognition. If this is true and universally found in vertebrates, it might be a necessary property of the computations needed for auditory object recognition, potentially related to the fact that complexity in auditory signals is in time-varying spectral patterns that quickly disappear; the fleeting nature of sounds could prevent higher processing stages from subsequently accessing lower-level representations for additional information.

Ethics statement
All animal procedures were approved by the Animal Care and Use Committee of the University of California Berkeley (AUP-2016-09-9157) and were in accordance with the NIH guidelines regarding the care and use of animals for experimental procedures. Birds subjects were euthanized after neural recordings by overdose of Isoflurane.

Animals and stimuli
Four male and two female adult zebra finches (Taeniopygia guttata) were used for the electrophysiological experiments. The birds were bred and raised in family cages until they reached adulthood, and then maintained in uni-sex groups. Although birds could only freely interact with their cage-mates, all cages were in the same room allowing for visual and acoustical interactions between all birds in the colony. All birds were given seeds, water, grid and nest material ad libitum and were supplemented with eggs, lettuce and bath once a week.
Vocalizations used as stimuli during neurophysiological experiments were recorded from 15 adult birds and 8 chicks (20-30 days old). The vocalization bank obtained contains 486 vocalizations that included for each bird most of the calls in the Zebra finch repertoire: 7 calltypes in adults and 2 in chicks. The adult calls included the following affiliative calls: Song (So), Distance Call (DC), Tet call (Te) and Nest call (Ne); and the following non-affiliative calls: Wsst or aggressive call (Ws), the Distress call (Di) and one of the two alarm calls, the Thuk (Th). The juvenile calls included the Begging call (Be) and the Long Tonal call (LT). Additional information about these stimuli and their behavioral meanings can be found in [39,40,97].
For the neurophysiological experiments, a new subset of the vocalization bank was used at each electrophysiological recording site. This subset was made from a representative number of vocalizations from the repertoire of individuals: three adult females, three adult males, two female chicks and two male chicks. From each individual caller, we randomly chose 3 call bouts from each category or fewer if fewer than 3 call bouts were obtained for that particular call-type and individual. The maximum number of stimuli that could be selected in that procedure was therefore 3x7x3 (adult males x repertoire x renditions) + 3x6x3 (adult females x repertoire x renditions) + 4x2x3 (juveniles x repertoire x renditions) = 141. The average number of stimuli played for each single unit was 114 (sd = 22, min = 34, max = 123). Ten trials were acquired for each stimulus with a few exceptions (min = 9, max = 11). Sounds were broadcasted in a random order using an RX8 processor (TDT System III, sample frequency 24414.0625 Hz) connected to a speaker (PCxt352, Blaupunkt, IL, USA) facing the bird at approximately 40cm. The sound level was calibrated on song stimuli to obtain playbacks at 75dB SPL measured at the bird's location using a sound meter (Digital Sound Level Meter, RadioShack).

Neurophysiological and histological procedures
Extra-cellular electrophysiological recordings were performed in 6 urethane anesthetized adult zebra finches. The birds were placed in a sound-attenuated chamber (Acoustic Systems, MSR West, Louisville, CO, USA) and sound presentation and neural recording were performed using custom code written in TDT software language and TDT hardware (TDT System III). Sounds were broadcasted in a random order as described above. Neural responses were recorded using the signal of two (5 subjects) or one (1 subject) electrode arrays, bandpass filtered between 300Hz and 5kHz and collected by an RZ5-2 processor (TDT System III, sample frequency 24414.0625 Hz). The electrode arrays consisted of two rows of 8 tungsten electrodes with row separation of 500 μm and inter-electrode separation within row of 250 μm (Tucker Davis Technologies, FL). Electrode impedances were approximately 2 MOhms. When two electrode arrays were used, they were placed each in one hemisphere. Spike arrival times and spike shapes of multiple units were obtained by voltage threshold. The level of the threshold was set automatically by the TDT software using the variance of the voltage trace in absence of any stimuli. Electrodes were progressively lowered and neural responses were collected as soon as auditory responses to song, white noise, Distance call or limited modulation noise could be identified on half of the electrodes in each hemisphere (the stimuli used to identify auditory neurons were different from the stimuli used in the analysis). Several recording sites were randomly selected by progressively deepening the penetration of the electrodes and ensuring at least 100 μm between two sites. This distance between recordings insured that we did not record from the same neurons at separate sites.
On average 4.2±2 sites (mean ± sd) were recorded per bird and per hemisphere at a depth ranging from 400 μm to 2550 μm.
After the last recording site, the subject was euthanized by overdose of isoflurane and transcardially perfused. Coronal slices of 20μm obtained with a cryostat were then alternatively stained with Nissl staining or simply mounted in Fluoroshield medium (F-6057, Fluoroshield with DAPI, Sigma-Aldrich). While Fluoroshield slices were used to localize electrode tracks, Nissl stained slices were used to identify the position of the 6 auditory areas investigated here: the three regions of Field L (L1, L2 and L3), 2 regions of Mesopallium Caudale (CM): Mesopallium Caudomediale (CMM) and Mesopallium Caudolaterale (CLM); and Nidopallium Caudomediale (NCM). By aligning pictures, we were able to anatomically localize most of the recording sites and calculate the approximate coordinates of these sites. Since we could not localize the Y-sinus on slices, we used the position of the Lamina Pallio-Subpallialis (LPS) peak as the reference point for the rostro-caudal axis in all subjects. The surface of the brain and the midline were the reference for respectively the dorsal-ventral axis and the medial-lateral axis. The approximate coordinates of units were used to build 3-D reconstructions of all single unit positions in an hypothetic brain. We were not able to recover all the electrode traces in one of the six birds and excluded that data from the anatomical analysis. Table 1 shows the number of units from each bird for which we had an anatomical location and that had significant stimulus cumulative information (303/337 neurons; see below).
Single unit isolation was performed off-line using custom software that used a combination of supervised and unsupervised clustering algorithms. These clustering algorithms used the spike-snippets shape as described by a PCA. Sorted units where declared to be single units based on spike shape reliability across snippets. The spike shape reliability measure was a signal to noise ratio (SNR) where the signal was the difference between the maximum and the minimum of the average snippet and the noise was the standard deviation of this measure across all snippets. Single units in our data set have an SNR > 5. Additional details on these experimental procedures can be found in [40].
The data of neural responses from 404 out of 914 isolated single units were used in this study. The 404 neurons were selected based on a prior analysis that showed that this subset of units were not only auditory but also contained information about call-types, in the sense that call-types could be decoded above chance level from neural responses (see [40]). Here, we analyzed the neural response in the first 600ms after stimulus onset.

Data analysis: Time-varying spike rate estimation
Calculations of cumulative information requires the estimation of very large distributions which sizes grow exponentially with the number of time points investigated. It is therefore imperative to find the optimal width of the time bin for the analysis: as small as possible to capture information in spike patterns but as large as possible to reduce the dimensionality for the Table 1 (n = 303). The UNC column counts the units that we left as unclassified because they were found at the boundary between regions. Time-varying information for auditory objects estimation of mutual information. It is also well known that the relevant time scale can be different in different epochs of the responses (e.g. more precise spike timing at the onset phase of the response and less precise during the sustained phase) [98]. Here, we addressed this issue by first estimating the time-varying rate of our neurons using methods of density estimation based on adaptive kernels. In this methods, narrower kernels are used when spike precision is high and wider kernels at times when the spike precision is low. We then estimated the largest fixed time window that could capture most of the fastest changes in those average time-varying spike rate. Alternative methods based directly on measures of information have also been proposed [98]. For each stimulus, the 9 to 11 raw spike patterns of 600ms, sampled at 10kHz, were combined to obtain the corresponding time varying spike rate (sample frequency set at 1kHz) by applying a locally adaptive kernel bandwidth optimization method [42]. In cases where the neuron did not respond to any of the presentations of the stimulus or responded only once over all presentations, the rate was estimated as being constant for the 600ms duration of the neural response. For those two unresponsive cases, the rate was set to be 1/(2 � Ntrials � NTimes) in the absence of any spike or 1/(Ntrials � NTimes) in case of one spike, with Ntrials the number of stimulus presentations (9)(10)(11) and NTimes the number of time points at which the rate was estimated (here 600, for a 600ms neural response section with a sampling rate set at 1kHz).

Bird
To investigate cumulative information values up to 600ms after stimulus onset, time-varying rates were sampled at 10ms (Nyquist limit frequency of 50Hz). To estimate, the amount of information potentially lost by this low-pass filtering, we estimated an information value based on coherence analysis of the signal to noise ratio in the raw spike train. The coherence between a single spike train (R) and the actual time-varying mean response (A) g 2 AR can be derived from the coherence between the peristimulus time histogram (PSTH) obtained from half of the trials and the PSTH obtained from the other half [99].
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 1 where M the total number of trials (presentations of the stimuli) and g 2 � R 1;M=2 � R 2;M=2 the coherence between the two PSTHs calculated on half of the trials. The coherence between two responses is a function of frequency (ω). An estimate of the mutual information (in bits per second) between R and A responses can then be obtained by integrating over all frequencies [4,99]: For each unit, we estimated the percentage of information preserved as the ratio between I AR calculated up to 50Hz and I AR calculated over all frequencies. Over all units, 96.7% ±6.9% (mean ± SD) of information was conserved by a lowpass filtering at 50Hz and only 88 out of 404 cells had information losses greater than 5%. For each unit, we also calculated the proportion of cumulative power across frequencies in the time varying spike rate estimation obtained with the KDE before low-pass filtering and down-sampling (averaged periodogram in overlapping 200 ms Hanning windows and 1 kHz sampling rate). The cumulative sum of the power was calculated across frequencies and normalized by the maximum power value to obtain the proportion of cumulative power. On average across units, the cumulative power reached 98.8%±1.6% (mean ± SD) at 50Hz, further validating our choice of the temporal resolution (S1 Fig).

Testing the poisson assumptions
To estimate values of mutual information, we assumed that spike counts in our 10 ms windows had a Poisson distribution. To assess the validity of this assumption, we calculated for each neuron, for each stimulus s i and for each time bin t, the log-likelihood function from the 10 spike counts y j each obtained in 1 trial j assuming a Poisson distribution with the mean response m s i;j given by the delete one estimate.
We then generated a bootstrap distribution of 1000 of these likelihood measures each obtained for 10 samples from a Poisson with mean rate given by our KDE estimate. Finally, we calculated the number of times the log-likelihood value obtained for the data was greater or smaller than the bootstrap values to generate a p-value (two-tailed test). Given the very large number of statistical tests, a False Discovery Rate correction, the Benjamini-Hochberg procedure, was used for each neuron to estimate the number of bins for which the Poisson hypothesis could be rejected with p = 0.05.
In our estimations of mutual information, we also assumed that the conditional probability of the response given a stimulus was independent between time bins. In other words, knowing that the spike count is greater (or smaller) than the mean in one time bin provides no information about probability of spike counts in the same trial in another time bin. To test this assumption, we calculated the noise correlation. The noise correlation is simply the Pearson correlation coefficient obtained from successive time bins. A single noise correlation was obtained for each neuron by averaging across pairs of time bins and stimuli.

Information theoretic calculations
As described in the results, the time-varying instantaneous mutual information between the stimulus S and the response Y t can be written as a difference in Shannon entropies: I t ¼ HðY t Þ À HðY t jSÞ and the cumulative mutual information for neural responses that are discretized into time intervals is given by: In the present paper we calculated 4 different types of information: the stimulus instantaneous information, the categorical instantaneous information, the stimulus cumulative information and the categorical cumulative information. Stimulus instantaneous and cumulative information were calculated for all 404 units, while categorical instantaneous and cumulative information were calculated on a restricted set of 337 neurons that presented at least one time point with a significant value of stimulus cumulative information (significant threshold set as 3 times the local error, see below for error calculations). While we verified our assumptions (minimal information loss with spike rate binning, Poisson distributions of spike counts and maximum value of spike counts) on the full set of 404 units, the population analysis of timevarying information presented in the results section only include the relevant dataset of 337 neurons.
The custom Matlab code used to calculate time-varying information values is available at https://github.com/julieelie/PoissonTimeVaryingInfo along with a tutorial on how to use the core functions.
The instantaneous information. The conditional response entropy and the response entropy are obtained from the distribution of the conditional probability of neural responses given the stimulus, p(y t |s), and the distribution of probability of each stimulus p(s i ): À pðy t js i Þ log 2 pðy t js i Þ We modeled the distribution of neural responses to a given stimulus s i as an inhomogeneous Poisson process. The conditional probability of response (spike count) given the stimulus is then: And the local entropy is: Because, in our data, the probability of response is very small for high values of y t , calculations of entropies were bounded for y t between zero and R Max . Here we set R Max = 20 which corresponds to a rate of 2 spike/ms in the 10 ms analysis windows. The maximum rate observed in all of our neurons across all time bins was 0.8 spike/ms (S2 Fig). Note that this very high firing rate (800 Hz) was only observed once; in other words, it only occurred in one 10 ms time windows across all neurons (404) and all stimuli (114 � 60 = 6840) or with a p = 1/ 2763360. This is clearly the very upper limit of a distribution with a long tail. Our time-varying rates were well below that upper bound but we verified that the cumulative probability up to R Max = 20 was numerically identical to 1 before estimating entropies. The error due to this sum truncation (from an analytical bound; summing up to 20 instead of up to infinity) can be estimated to a value that is below machine level precision relative to the estimated entropy: H error < 10 −16 bits). As described above (see time-varying spike rate estimation), we also enforced a lower bound for m s i ðtÞ of 1/20.
The total conditional entropy at time t is: À pðs i Þðpðy t js i Þ log 2 pðy t js i ÞÞ where p(s i ) is the probability of observing stimulus s i and n s is the number of stimuli sampled: it is the average of the local entropies obtained for the conditional probability of response for each stimulus s i . Assuming that our sample is representative of stimuli encountered, each stimulus is equally probable, p s i ð Þ ¼ 1 n s or: HðY t js i Þ Alternatively, one can assume that each call category is equally probable. If k i is the number of stimuli in the particular call category c to which s i belongs and n c is the number of categories, then: for s i 2 c. Then: The unconditional probability (i.e. across all stimuli) of a response at time t is not Poisson but a Poisson mixture characterized by its set of mean rates: The response entropy is obtained from these unconditional probabilities: The cumulative information. The conditional probability of a time varying response is the joint probability of observing (y t , y t−1 , y t−2 ,. . .) given s i . Given our Poisson assumption, the conditional probability of response at t is independent of the conditional response at previous times. pðy t ; y tÀ 1 ; y tÀ 2 ; . . . js i Þ ¼ pðy t js i Þpðy tÀ 1 js i Þpðy tÀ 2 js i Þ . . .
We can show that the conditional entropy of the joint responses is the sum of the individual entropies: The probability of the time varying response is the joint probability of observing (y t , y t−1 , y t−2 ,. . .). This joint probability cannot be expressed as the product of the probabilities at different times because these are not independent. The joint unconditional probability distribution is: This joint probability distribution could be expressed as a product of probabilities by assuming that most of the interdependence can be calculated from the previous time point, as in the Markov chain assumption at a beginning of the time series at t = 0. In all cases, the joint probability distribution can be written as: pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þ ¼ pðy t jy tÀ 1 ; y tÀ 2 ; . . .Þpðy tÀ 1 jy tÀ 2 ; y tÀ 3 ; . . .Þ . . . pðy 0 Þ or when it is approximated by a first-order Markov chain: pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þ ffi pðy t jy tÀ 1 Þpðy tÀ 1 jy tÀ 2 Þ . . . pðy 0 Þ The conditional probability of y t given y t−1 is:  19], the number of probabilities that must be estimated is 20 nt where nt is the number of windows in time. For example, calculating all the probability of all the outcomes for 10 windows (100ms) requires 20 10~1 0 13 calculations. On the other hand, the Markov chain approximation only requires the estimation of all pair-wise joint probability distributions: for 10 windows and 20 outcomes, the number of calculations is (10)(20 2 )~4000.
The response entropy is then calculated from the joint probability distribution: . . . pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þlog 2 pðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þ To estimate this response entropy, we investigated various approaches: a time-running cumulative information, the Markov chain approximation and Monte Carlo with importance sampling. Monte Carlo with importance sampling gave the best results and was therefore used in our analyses. The Markov chain approximation overestimates the entropy and therefore can also be used to obtain an upper bound on the cumulative information, which could be useful in other applications. We briefly describe the three approaches. The time-running cumulative information consisted in calculating the full cumulative information (using all possible spike events) but only for a fixed number of successive time windows. We estimated that we could easily calculate all possible probabilities for 4 time-windows, corresponding to a 40 ms history. This approach gave the best estimate of the information in 40 ms windows but, in our system, grossly underestimated the cumulative information: some of the information in successive 40 ms is clearly independent (S3 Fig).
The second approximation was based on Markov chain of variable orders up to 4 (also 40 ms). With this approximation, we overestimated the cumulative information: the correlation time of the time-varying rates for different stimuli is clearly also greater than 40 ms. Using the first order Markov chain approximation:    The second to last term is: The third to last term is:  Finally, we estimated cumulative information using Monte Carlo with importance sampling. In Monte Carlo in conjunction with importance sampling, N i samples or, here, time varying responsesỹ i , are taken from a proposal distribution, qðỹÞ. The actual probability, pðỹ i Þ, is calculated exactly at those samples and an estimate of the expected value of the measure of interest (here f ðỹÞ ¼ log pðỹÞ) is obtained by the average of f ðỹÞ at the sample points, y i , weighted by the likelihood ratio p/q: Our proposal distribution was based on the distribution at each time point and assuming independence across time: qðy t ; y tÀ 1 ; y tÀ 2 ; . . .Þ ¼ pðy t Þpðy tÀ 1 Þpðy tÀ 2 Þ . . .
For each sample obtained from the proposal distribution q we calculated the actual probability value using: 1 k i pðy t js i Þpðy tÀ 1 js i Þpðy tÀ 2 js i Þ . . .

½ �
and used that probability value in the estimation of the entropy. Monte Carlo samples were chunked in groups of 100,000 samples and at each additional sample chunk, a bootstrapped/ jackknife bias corrected mean and standard error were estimated (see below). Sampling stopped when the standard error was below 0.2 bits or at a max number of samples set to 5,000,000 samples. If the error at the maximum number of samples was greater than 0.6 bits, the estimation was deemed unreliable and the calculation was not performed for any successive time points. Bias correction and standard error for information calculations. The small number of trials used to estimate spike rates is the source of bias and uncertainty in our calculation of information: a small number of stimulus presentations increase the probability of obtaining by chance estimated spike rate that are different between stimuli and so yields a positive bias on information calculations. We used a Jackknife procedure on the estimation of spike rate for each stimulus to correct for this positive bias. Jackknife kernel density estimate of the rate were obtained by applying the locally adaptive kernel bandwidth optimization method on the Ntrials possible sets of Ntrials-1 spike patterns of each stimulus (Ntrials being the number of stimulus presentation of a given stimulus). Moreover, uncertainty about information values comes from the sampling errors on spike rate and on the Monte Carlo estimation of joint spike rate probability distributions (cumulative information only). These errors on information calculations were estimated by bootstrapping the jackknife procedure:

Calculation of the expected value of categorical information given the stimulus information
We computed a Categorical Information Index (CII) that compared the empirical categorical cumulative information for call-type categories, CCI, to three hypothetical values: a floor (CCI Floor ), an expected value (CCI Exp ) and a ceiling value (CCI Ceil ). The floor value is the categorical cumulative information obtained from random categories. The expected value is the categorical cumulative information that would be achieved if the stimulus information was 1) equally distributed for each stimulus and 2) could be used for classifying stimuli into groups. Note that the second assumption is not necessarily true in the actual data because the categorical information is based on averaging the probabilities for stimulus from the same category and thus effectively averaging time-varying rates. If time-varying rates are not grouped by categories, then it is possible that two stimuli from two different categories are distinguishable based on their time-varying rate, but that, the average time-varying rates for the two categories are not distinguishable, or less so than expected from the average pair-wise distances. The ceiling value corresponds to the case where all the cumulative information about stimuli is used for the categorization and none to discriminate stimuli belonging to the same category: CCI Ceil . The CII is a number between 0 and 2 that is then calculated as: The following three steps were taken to calculate the expected categorical information (CCI Exp ) from the stimulus cumulative information. First, the stimulus mutual information, mi, was expressed as the conditional probability of correct stimulus decoding, p, for any given stimulus (and assumed to be equal for all stimuli). Given a confusion matrix of size nxn obtained from a decoder for n stimuli, with p as the conditional probability given a stimulus of correct decoding (diagonal terms) and thus (1-p)/(n-1) as the conditional probability of error (off-diagonal terms), the mutual information is equal to: mi ¼ p 2 log p À log p n h i þ 1 À p ð Þ 2 log 1 À p n À 1 À log 1 À p ðn À 1Þn

� �
The above equation was inverted numerically to solve for p given mi. Second, a new matrix was generated by grouping rows and columns of joint probabilities (and not conditional) to form a confusion matrix for categories. The number of stimulus in each category was matched to the actual values in the neurophysiological data on a unit per unit basis. Third, the expected mutual information for categories was then estimated from this new confusion matrix by subtracting the total entropy obtained from the joint probabilities, from the sum of the entropies of the marginal distributions for the rows and columns: Supporting information S1 Fig. Tests for temporal resolution. We performed two tests to assess the potential information loss from sampling the time-varying rate at 50 Hz (10 ms bins). A. The Coherence Test is based on the coherence between individual spike trains. A measure of total coherence (Information Coherence) can be obtained by integrating over frequencies (see Methods). The Information Coherence obtained by integrating from 0 to 50 Hz can then be compared to the Information Coherence obtained for the entire frequency range of 0 to 500 Hz. The histogram shows the number of cells versus the fraction of Information Coherence in 0-50 Hz. B. The Power Spectrum Test is based on the power spectrum of the time-varying rates for each neuron obtained with the Kernel Density estimation (sample frequency: 1kHz). Just as for the Coherence Information, the fraction of the power between 0-50 Hz relative to the power between 0-500 Hz was estimated for all cells. The histogram shows the number of neurons as a function of that fraction. This neuron had a very high stimulus driven firing rate and responded to all sound stimuli. The mlnoise stimulus is modulation limited noise: white noise that is low-pass filtered in amplitude and spectral modulations. This stimulus was used here to search for auditory regions but the responses to these synthetic sounds were not included in these analyses. and in the text, the cumulative information (CI) for call-type categories can be analyzed in terms of a floor or minimum value, a shared value and a ceiling or maximum value. This graph represents the difference in the observed cumulative information in bits relative to the shared value. The bold black line is the average across all neurons (n = 337; error bars show 2 SEMs). The thin colored lines correspond to individual neurons. Each line color corresponds to the time average Categorical Information Index (CII) of the neuron obtained between 50 and 300ms (Fig 9A). The same color code is used in all panels of Fig 9. The solid red line in the main plot is the average obtained for the quartile of neurons with the highest CII and the blue solid line is average obtained for the quartile of neurons with the lowest CII. The vertical plot on the right represents the density of neurons for each value of difference in information at 150 ms. (EPS)