Stimulus-dependent Maximum Entropy Models of Neural Population Codes

Neural populations encode information about their stimulus in a collective fashion, by joint activity patterns of spiking and silence. A full account of this mapping from stimulus to neural activity is given by the conditional probability distribution over neural codewords given the sensory input. For large populations, direct sampling of these distributions is impossible, and so we must rely on constructing appropriate models. We show here that in a population of 100 retinal ganglion cells in the salamander retina responding to temporal white-noise stimuli, dependencies between cells play an important encoding role. We introduce the stimulus-dependent maximum entropy (SDME) model—a minimal extension of the canonical linear-nonlinear model of a single neuron, to a pairwise-coupled neural population. We find that the SDME model gives a more accurate account of single cell responses and in particular significantly outperforms uncoupled models in reproducing the distributions of population codewords emitted in response to a stimulus. We show how the SDME model, in conjunction with static maximum entropy models of population vocabulary, can be used to estimate information-theoretic quantities like average surprise and information transmission in a neural population.


INTRODUCTION
Neurons represent and transmit information using temporal sequences of short stereotyped bursts of electrical activity, or spikes [1]. Much of what we know about this encoding has been learned by studying the mapping between stimuli and responses at the level of single neurons, and by building detailed models of what stimulus features drive a single neuron to spike [2][3][4]. In most of the nervous system, however, information is represented by joint activity patterns of spiking and silence over populations of cells. In a sensory context, these patterns can be thought of as codewords that convey information about external stimuli to the central nervous system. One of the challenges of neuroscience is to understand the neural codebook, the map from stimuli to neural codewords, a task made difficult by the fact that neurons respond to the stimulus neither deterministically nor independently.
The structure of correlations among the neurons determines the organization of the code, that is, how different stimuli are represented by the population activity [5][6][7][8]. These correlations also determine what the brain, having no access to the stimulus apart from the spikes coming from the sensory periphery, can learn about the outside world [9][10][11]. The sources of these correlations, which arise from correlated external stimuli to the neurons, from "shared" local input from other neurons, or from "private" independent noise, have been heavily debated [12][13][14][15]. In many neural systems, the correlation between pairs of (even nearby or functionally similar) neurons was found to be weak [16,17,26]. Similarly, the redundancy between pairs in terms of the information they convey about their stimuli was typically weak as well [18][19][20]. The low correlations and redundancies between pairs of neurons led to the suggestion that neurons in larger populations might encode information independently [21], a view echoed by theoretical ideas of maximally efficient neural codes [22][23][24].
Recent studies of the neural code in large populations have, however, revealed that while the typical pairwise correlations may be weak, larger populations of neurons can nevertheless be strongly correlated as a whole [25][26][27][28][29][30][31][32][33]. Maximum entropy models of neural populations have shown that such strong network correlations can be the result of collective effects of pairwise dependencies between cells and, in some cases, of sparse high-order dependencies [26,34,35]. Most of these studies have characterized the strength of network effects and spiking synchrony at the level of the total vocabulary of the population, i.e. the distribution of codewords averaged over all stimuli. It is not immediately clear how these findings affect stimulus encoding, where one needs to distinguish the impact of correlated stimuli that the cells receive ("stimulus correlations") from the impact of covariance of the cells conditional on the stimulus ("noise correlations"). For small populations of neurons, taking correlations into account when decoding or reconstructing the stimulus has been shown to be beneficial compared to the case where correlations are neglected (e.g. [35][36][37][38][39]). Similarly, generalized linear models highlighted the importance of dependencies between cells in accounting for correlations between pairs and triplets of retinal ganglion cell responses [40].
Here we present a new encoding model that allows us to study in fine detail the codebook of large neural populations. Importantly, this model gives a joint probability distribution over the activity patterns of the whole population for a given stimulus, while capturing both the stimulus and noise correlations. The new model belongs to a class of maximum entropy models with strong links to statistical physics [27,41-52] and is directly related to maximum entropy models of neural vocabulary [26][27][28][29][30][31][32], allowing us to estimate the entropy and derived information-theoretic quantities for the neural code. In sum, the maximum entropy framework lets us focus on the joint patterns of population activity, rather than only on the low-level statistics (e.g., the individual firing rates) of the neural code.
We start by showing that linear-nonlinear (LN) models of retinal ganglion cells responding to spatially unstructured stimuli capture a significant part of the single neuron response, but still miss much of the detail; in particular, we show that they fail to capture the correlation structure of firing among the cells. We next present our new stimulus-dependent maximum entropy (SDME) model, which is a hybrid between linear-nonlinear models for single cells and the pairwise maximum entropy models. Applied to groups of ∼100 neurons recorded simultaneously, we find that SDME models outperform the LN models for the stimulus-response mapping of single cells and, crucially, give a significantly better account of the distribution of codewords in the neural population.

RESULTS
We recorded the simultaneous spiking activity of ∼110 ganglion cells from the salamander retina [53], presented with repeats of a 10 s long full-field flicker ("Gaussian FFF") movie, where the light intensity on the screen was sampled independently from a Gaussian distribution with a frequency of 30 Hz (Fig. 1a). This "frozen noise" stimulus was repeated 726 times, for a total of ∼2 h of stimulation. Most of the recorded cells exhibited temporal OFF-like behaviors (Fig. 1b). We chose for further analysis N = 100 cells that were reliably sorted, demonstrated a robust and stable response over repeats, and generated at least 2500 spikes during the course of the experiment.
We discretized neural responses into ∆t = 10 ms bins, and represented the activity of the neurons in response to the stimulus as binary words in each of the time bins. If neuron i = 1, . . ., N was active in time bin t, we denoted a spike (or more spikes) as x_i(t) = 1, and x_i(t) = 0 if it was silent. In this representation, the whole experiment yielded a total of about T ∼ 7.3 × 10^5 binary word samples. Using repeated presentations of the same movie, we estimated the average response of each of the cells across repeats, r_i(t) = ⟨x_i(t)⟩_rep, or the peristimulus time histogram (PSTH). Following Refs. [4,54], we fitted a linear-nonlinear model for each of the cells in the experiment, such that the predicted rate is

r_i^LN(t) = N_i(k_i · s(t)),     (1)

where k_i is a linear filter matched to the i-th cell, N_i is its point-wise nonlinear function, and s(t) is the stimulus fragment from time t − τ until t (here we used τ = 400 ms, making s(t) a vector of light intensities with 40 components). Linear filters were reconstructed using reverse correlation (spike-triggered average), and nonlinearities N_i were obtained by histogramming P(k_i · s(t)|spike) and inverting it using Bayes' rule. These LN models captured most of the structure of the PSTH, yet as the example cell in Fig. 2a shows, they often misestimated the exact firing rates of the neuron, or sometimes even missed parts of the neural response altogether. In the Gaussian FFF condition, the normalized (Pearson) correlation between the measured and predicted PSTH, Corr(r_i(t), r_i^LN(t)), was 0.69 ± 0.06 (mean ± std across 100 cells).
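As an illustration of the reverse-correlation fit described above, the sketch below recovers a filter by spike-triggered averaging and a nonlinearity by Bayes' rule. It is a toy under stated assumptions (a synthetic Gaussian stimulus and spikes generated from a hypothetical filter shape), not the authors' analysis code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the experiment: a Gaussian full-field stimulus and one
# binarized spike train generated from a hypothetical filter (illustrative only).
T, tau = 20000, 40                       # time bins; filter length (40 bins = 400 ms)
stim = rng.normal(size=T)                # light intensity in each bin
true_k = np.exp(-np.arange(tau) / 8.0) * np.sin(np.arange(tau) / 4.0)

# Stimulus-history matrix: row t holds s(t), the last tau intensities.
X = np.stack([stim[t - tau:t][::-1] for t in range(tau, T)])
g_true = X @ true_k
spikes = (rng.random(len(g_true)) < 0.3 * (g_true > 0.5)).astype(int)

# Spike-triggered average: for Gaussian white noise, this recovers the
# linear filter k_i up to an overall scale.
k_sta = X[spikes == 1].mean(axis=0)

# Nonlinearity via Bayes' rule: N(g) = P(g | spike) P(spike) / P(g),
# with both densities estimated by histogramming the generator signal.
g = X @ k_sta
bins = np.linspace(g.min(), g.max(), 21)
p_g = np.histogram(g, bins=bins, density=True)[0]
p_g_spk = np.histogram(g[spikes == 1], bins=bins, density=True)[0]
nonlinearity = np.where(p_g > 0, p_g_spk * spikes.mean() / np.maximum(p_g, 1e-12), 0.0)
```

With 10 ms bins and τ = 400 ms, the 40-component stimulus history here matches the dimensionality of s(t) in the text.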
The performance gap of the canonical LN models in predicting single neuron responses suggests either that the single-neuron models need to be improved to account for the observed behavior, or that interactions between neurons play an important encoding role and need to be included. While firing rate prediction can be improved for single neurons by models with higher-dimensional stimulus sensitivity (e.g. [54,55]) or dynamical aspects of spiking behavior (e.g. [56,57]), previous work, as well as the results below, demonstrates that even conditionally independent models, which by construction perfectly reproduce the firing rate behavior of single cells, often fail to capture the measured correlation structure of firing between pairs of cells, as well as higher-order statistical structure [26]. We find two salient features of the correlations between pairs of neurons: (i) the pairwise correlations between cells in response to the Gaussian FFF movie are typically weak but are not zero (Fig. 1c, consistent with previous reports [26,27,32]); (ii) the correlation in neural activities decays quickly with distance despite the infinite correlation length of the stimulus, but does not reach zero even at relatively large distances (Fig. 1d). This salient structure, along with any other statistical correlation at the pairwise order, is characterized by the covariance matrix of activities, C_ij = ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩, where the averages are taken across time and repeats.
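For concreteness, the covariance C_ij pooled over time and repeats can be computed as below; the binarized raster here is random stand-in data, since the point is only the estimator itself:

```python
import numpy as np

# Binarized activity with shape (repeats, time_bins, N); random stand-in data.
rng = np.random.default_rng(1)
x = (rng.random((50, 1000, 3)) < 0.05).astype(float)

# C_ij = <x_i x_j> - <x_i><x_j>, with averages taken across time AND repeats.
flat = x.reshape(-1, x.shape[-1])        # pool repeats and time bins into samples
C = flat.T @ flat / len(flat) - np.outer(flat.mean(0), flat.mean(0))
```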
We would like to find a model of the neural code that is able to reproduce these pairwise statistics. Motivated by the shortcomings of the canonical LN model, we therefore asked whether a model that combined the LN (receptive-field based) aspect of single cells with the interactions between cells could give a better account of the neural stimulus-response mapping. Importantly, the new model should capture not only the firing rate of single cells and the full pairwise correlation structure between them, but should also accurately predict the full distribution of the joint activity patterns across the whole population. Because the joint distributions of activity are high-dimensional (e.g., the distribution over codewords across the duration of the experiment, P({x_i}), has 2^N components), this is a very demanding benchmark for any model.
Here we propose the simplest extension to the conditionally independent set of LN models for each cell in the recorded population, by including pairwise couplings between cells, so that the spiking of cell i can increase or decrease the probability of spiking for cell j [58,59]. In contrast to previous proposals, this coupling is introduced so that the resulting model is a maximum entropy model for P({x_i}|s), the conditional distribution over population activity patterns given the stimulus. We recall that maximum entropy models give the most parsimonious probabilistic description of the joint activity patterns, one which perfectly reproduces a chosen set of measured statistics over these patterns without making any additional assumptions [60].
We start by introducing the least structured (maximum entropy) distribution P(x_1, x_2, . . ., x_N|t) that reproduces exactly the observed average firing rate for each time bin t and for each neuron i, r_i(t) = ⟨x_i(t)⟩_data = ⟨x_i(t)⟩_P, as well as the overall correlation matrix C_ij between all pairs of cells (c.f. [61]). Thus, we seek P({x_i}|t) that maximizes

L = S[P] − Σ_t Σ_i α_i(t) (⟨x_i(t)⟩_P − ⟨x_i(t)⟩_data) − Σ_{i<j} β_ij (⟨x_i x_j⟩_P − ⟨x_i x_j⟩_data) − Σ_t µ(t) (Σ_{{x_i}} P({x_i}|t) − 1),

where the subscript to the brackets ⟨•⟩ denotes whether the averaging is done over the maximum entropy distribution (P) or over the recorded data; the Lagrange multipliers µ(t) ensure that the distributions are normalized. This is an optimization problem for the parameters α_i(t) and β_ij, which has a unique solution since the entropy is a concave function of P.
The functional form of the solution to this optimization problem is well known; in our case it can be written as

P({x_i}|t) = (1/Z(t)) exp[ Σ_i α_i(t) x_i + Σ_{i<j} β_ij x_i x_j ],     (2)

where the individual time-dependent parameters for each of the cells, α_i(t), and the stimulus-independent pairwise interaction terms β_ij, are set to match the measured firing rates r_i(t) and the pairwise correlations C_ij; Z(t) is a normalization factor or partition function for each time bin t, given by Z(t) = Σ_{{x_i}} exp[ Σ_i α_i(t) x_i + Σ_{i<j} β_ij x_i x_j ]. The time-dependent maximum entropy (TDME) model in Eq. (2) is equivalent to an Ising model from physics, where the single-cell parameters are time-dependent local fields acting on each of the neurons (spins), and static (stimulus-independent) infinite-range interaction terms couple each pair of spins. In the limit where the interactions go to zero, β_ij = 0, the model in Eq. (2) becomes the full conditionally independent model, itself a maximum entropy model that reproduces exactly the firing rate of every neuron, r_i(t); in this case the probability distribution factorizes, and the solution for α_i(t) and Z(t) is trivially computable from the firing rates, r_i(t). Time-dependent maximum entropy models are powerful, since they make no assumptions about how the stimulus drives the response; they can serve as useful benchmarks for other models (especially for the conditionally independent model with β_ij = 0). On the other hand, these models require repeated stimulus presentations to fit, involve a number of parameters that grows linearly with the duration of the stimulus, do not generalize to new stimuli, and do not provide an explicit map from the stimuli to the responses.
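For a small population, the TDME distribution of Eq. (2) can be evaluated exactly by enumerating all 2^N codewords, which makes the structure of the model concrete. The parameter values below are arbitrary stand-ins, not quantities fitted to data:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N = 5
alpha_t = rng.normal(-2.0, 0.5, size=N)        # time-dependent fields alpha_i(t)
beta = rng.normal(0.0, 0.3, size=(N, N))
beta = np.triu(beta, 1) + np.triu(beta, 1).T   # symmetric couplings, zero diagonal

# Enumerate all 2^N codewords and evaluate Eq. (2) exactly.
words = np.array(list(itertools.product([0, 1], repeat=N)))
log_weights = words @ alpha_t + 0.5 * np.einsum("ki,ij,kj->k", words, beta, words)
Z_t = np.exp(log_weights).sum()                # partition function Z(t)
P = np.exp(log_weights) / Z_t                  # P({x_i} | t)

# Moments that the fitted alpha_i(t) and beta_ij must reproduce from data:
rates = P @ words                              # <x_i(t)>_P
corrs = np.einsum("k,ki,kj->ij", P, words, words)   # <x_i x_j>_P
```

This brute-force evaluation is only feasible for small N; for the 100-cell population in the text, Z(t) is intractable and sampling methods are required.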
To make a direct link to the stimulus and allow comparison with a set of uncoupled LN models, we take the general time-dependent maximum entropy model of Eq. (2) and specialize it to a particular form of stimulus dependence. Rather than having an arbitrary time-dependent parameter for every neuron in each time bin, α_i(t), we assume that this dependence takes place through the stimulus projection alone, i.e. α_i(t) = α_i(k_i · s(t)), much as in an LN model, where the neural firing depends on the value of the stimulus projection onto the linear filter k_i. This choice is made purely for the sake of convenience: the model could be generalized to, e.g., neurons that depend on two linear projections of the stimulus, by making α_i depend jointly on (k_i^(1) · s(t), k_i^(2) · s(t)), although such models would be progressively more difficult to infer from data.
Concretely, we estimated the linear filter k_i for each cell i using reverse correlation, and convolved the filter with the stimulus sequence, s(t), to get the "generator signal" g_i(t) = k_i · s(t). We then looked for the maximum entropy probability distribution P({x_i}|s(t)), by requiring that the average firing rate of every cell given the generator signal is the same in the data and under the model, i.e. ⟨x_i(g_i)⟩_data = ⟨x_i(g_i)⟩_P (see Methods); as before, we also required that the model reproduce the overall correlation between every pair of cells, C_ij. This gives a stimulus-dependent maximum entropy (SDME) model, which takes the following form:

P({x_i}|s(t)) = (1/Z(s(t))) exp[ Σ_i α_i(g_i(t)) x_i + Σ_{i<j} β_ij x_i x_j ].     (3)

The parameters of this model are: N × (N − 1)/2 couplings β_ij, K × N parameters specifying the nonlinearities α_i, and a linear filter k_i for each cell. We used a Monte Carlo based gradient descent learning procedure to find the model parameters α, β numerically (see Methods). By construction, the SDME model exactly reproduces the covariance in activities, C_ij, between all pairs of cells, and also the LN model properties of every cell: an arbitrary nonlinear function N_i can be encoded by properly choosing how the parameters α_i depend on the linear projections of the stimulus, g_i. We can also construct a maximum entropy model with β_ij = 0 (no constraints on the pairwise correlations C_ij). The result is a set of uncoupled (conditionally independent) LN models:

P_LN({x_i}|s(t)) = Π_i exp[α_i(g_i(t)) x_i] / (1 + exp[α_i(g_i(t))]).     (4)

In sum, the time-dependent maximum entropy (TDME) model of Eq. (2) is an extension of the conditionally independent model that additionally reproduces the measured pairwise correlations between cells. In a directly analogous way, the stimulus-dependent maximum entropy (SDME) model of Eq. (3) is an extension of the set of uncoupled LN models, Eq. (4), that additionally reproduces the measured pairwise correlations between cells. Because P_SDME (Eq. 3) agrees with P_LN (Eq. 4) exactly in all constrained single-neuron statistics, any improvement in the predictions of the SDME model, be it in the firing rate or the codeword distributions, can be directly ascribed to the effect of the interaction terms, β_ij. We found that the SDME model predicted the firing rate of single cells better than the LN models, with the normalized correlation coefficient between the true and predicted firing rate, Corr(r_i(t), r_i^SDME(t)), being 0.74 ± 0.06 (mean ± std across 100 cells), as shown in Fig. 2b. The differences between the SDME and the LN models become more striking at the level of the activity patterns of the whole population. Figures 3a,b show the log-likelihood ratio for the population activity patterns x = {x_i} under the two models, showing that the SDME model can be orders of magnitude better in predicting the population neural response. These differences are large in particular for those stimuli that elicit a strong response (Fig. 3c), that is, precisely where the response consists of synchronous spiking and the structure of the codewords can be nontrivial. Moreover, the difference between the models becomes increasingly significant with the size of the population N, and particularly apparent for groups of more than 20 cells (Fig. 3d).

We next examined how well various models for the neural codebook, P({x_i}|s), explain the total vocabulary, that is, the distribution of neural codewords observed across the whole duration of the experiment, P({x_i}) = ⟨P({x_i}|s(t))⟩_t. Despite the nominally large space of possible codewords, much larger than the total number of samples in the experiment (2^N ≫ T), the sparsity of spikes and the correlations between neurons restrict the vocabulary to a much smaller set of patterns. Some of these occur many times during our stimulus presentation, allowing us to estimate their empirical probability, P_data({x_i}), directly from the experiment, and compare it to the model prediction [35]. The most prominent example of such frequently observed codewords is the silent pattern, x_i = 0 for all i, which is seen ∼72% of the time.
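Because the partition function Z(s(t)) of Eq. (3) is intractable for N = 100, moments and codeword frequencies under the SDME model must be estimated by sampling. The sketch below shows a minimal Gibbs sampler for one time bin; the fields and couplings are arbitrary stand-ins, and the actual fitting loop (which iteratively adjusts α and β to match the measured moments, per Methods) is omitted:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10
alpha_g = rng.normal(-2.5, 0.5, size=N)        # stand-in alpha_i(g_i(t)) for one time bin
beta = rng.normal(0.0, 0.2, size=(N, N))
beta = np.triu(beta, 1) + np.triu(beta, 1).T   # symmetric couplings, zero diagonal

def gibbs_sample(alpha, beta, n_sweeps=2000, burn_in=500):
    """Resample each x_i in turn from its conditional given all other cells."""
    x = np.zeros(len(alpha))
    samples = []
    for sweep in range(n_sweeps):
        for i in range(len(alpha)):
            h = alpha[i] + beta[i] @ x         # log-odds of x_i = 1 (beta_ii = 0)
            x[i] = float(rng.random() < 1.0 / (1.0 + np.exp(-h)))
        if sweep >= burn_in:
            samples.append(x.copy())
    return np.array(samples)

samples = gibbs_sample(alpha_g, beta)
rates = samples.mean(0)                        # model <x_i>, to be matched to data
p_silent = np.mean(~samples.any(axis=1))       # frequency of the all-silent codeword
```

The same kind of sampling run, repeated over time bins, yields the model-predicted codeword frequencies that Fig. 4 compares with the data.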
Figure 4 shows the likelihood ratio of the model probability and the empirical probability for various codewords observed in the experiment, as a function of the rate at which these codewords appear in the experiment. While the SDME model in Fig. 4a does not predict the frequencies of all patterns perfectly, it strongly outperforms the uncoupled set of LN models in Fig. 4b, and performs slightly better than the full conditionally independent model (Fig. 4c), despite the fact that the latter is determined by N × 1000 = 1 × 10^5 parameters, the firing rates of every cell in every time bin. On average, the SDME model predicts the probabilities of the patterns of activity with no bias, and with a standard deviation of log(P_SDME/P_data) of about 1; the uncoupled LN models in comparison are biased and have a spread that is more than twice as large. Even more striking is the fact that the LN models assign such low probabilities to some codewords that they are never generated during our Monte Carlo sampling (and are therefore not even shown in the scatterplots of Fig. 4), while they are frequently observed in the experiment. This discrepancy is quantified by enumerating the M most probable patterns in the data and in the model (by sampling; see Methods), and measuring the size of the intersection of the two sets of patterns; in other words, we ask whether the model is even able to access all the patterns that one is likely to record in the experiment. As shown in the third row of Fig. 4, the SDME model does well on this task, with 434 codewords in the intersection of the 500 most likely patterns in the data and the model; this is a much better performance than for the uncoupled model, and slightly better than for the conditionally independent model.

The SDME model was constructed to capture exactly the total correlations in neuronal spiking, C_ij = ⟨x_i x_j⟩ − ⟨x_i⟩⟨x_j⟩. With a repeated stimulus, this total correlation can be broken down into signal and noise components. The signal correlations, C^s_ij, are inferred by applying the same formula as for the total correlation, but on a spiking raster in which the repeated-trial indices have been randomly and independently permuted for each time bin. This removes any correlation due to interactions between spikes on simultaneously recorded trials, and leaves only the correlations induced by the response being locked to the stimulus. The noise correlation, C^n_ij, is then defined as the difference between the total and the signal components, C^n_ij = C_ij − C^s_ij. We calculated the noise correlations between all pairs in our N = 100 neuron dataset. By definition, the conditionally independent models cannot reproduce C^n_ij, which for them are always zero. To assess the performance of the SDME model, we drew samples from our model distribution using Monte Carlo simulation and compared the noise correlations in the simulated rasters to the true noise correlations. The model prediction tightly correlates with the measured values, as shown in Fig. 5. We observe a systematic deviation of ∼25%, most likely because the assumed dependence on the stimulus through one linear filter per neuron is insufficient to capture the complete stimulus dependence, thereby underestimating the full structure of stimulus correlation and inducing an excess in the noise correlation. Despite this, the degree of correspondence in noise correlations observed in Fig. 5 tells us that the SDME model has clearly captured a large amount of the noise covariance structure in neural firing.

How should we interpret the inferred parameters of the SDME model? LN models have a clear "mechanistic" interpretation in terms of the cell's receptive field and the nonlinear spiking mechanism. Here, similarly, the stimulus-dependent part of the model for each cell, α_i, is a nonlinear function of a filtered version of the stimulus, g_i(t) = k_i · s(t); in the absence of neuron-to-neuron couplings, the nonlinearity of every neuron would correspond to N_i(g_i) ∼ f(α_i(g_i)), where f(u) = exp(u)/(1 + exp(u)), according to Eq. (4). The dependence of α_i on the stimulus projection g_i is similar across the recorded cells, as shown in Fig. 6a; as expected, higher overlaps with the linear filter induce a higher probability of spiking.
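The shuffle-based decomposition of the total correlation into signal and noise parts described above can be sketched as follows. The raster here is synthetic, with a hypothetical shared noise source added so that C^n_ij is visibly nonzero:

```python
import numpy as np

rng = np.random.default_rng(4)

def total_cov(x):
    """Total covariance C_ij, pooling repeats and time bins."""
    flat = x.reshape(-1, x.shape[-1])
    return np.cov(flat.T, bias=True)

def signal_and_noise_cov(x):
    """x: (repeats, time_bins, N). Returns (C^s, C^n) via trial shuffling."""
    R, T, N = x.shape
    shuffled = np.empty_like(x)
    for t in range(T):                      # independent permutation per time bin...
        for i in range(N):                  # ...and per cell, across repeats
            shuffled[:, t, i] = x[rng.permutation(R), t, i]
    C_total = total_cov(x)
    C_signal = total_cov(shuffled)          # only stimulus-locked correlations survive
    return C_signal, C_total - C_signal     # C^n_ij = C_ij - C^s_ij

# Synthetic raster with a hypothetical common noise source, which produces
# nonzero noise correlations for the shuffle test to detect.
common = rng.random((20, 500, 1)) < 0.1
x = ((rng.random((20, 500, 4)) < 0.05) | common).astype(float)
C_s, C_n = signal_and_noise_cov(x)
```

Permuting each cell's repeat index independently is what destroys the within-trial (noise) component while preserving the stimulus-locked one.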
The pairwise interaction terms in the model, β_ij, are symmetric, static, and stimulus independent by construction. As such, they represent only functional, and not physical (i.e. synaptic), connections between the cells. Fig. 6b shows the pairwise interaction map for 100 cells; the histogram of their values (Fig. 6c) shows that they can be of both signs, but the distribution has a stronger positive tail, i.e. a number of cell pairs tend to spike together or be silent together with a probability that is higher than expected from their respective LN models. We can compare these interactions to the interactions of a static (non-stimulus-dependent) maximum entropy model for the population vocabulary [26,28]:

P_ME({x_i}) = (1/Z) exp[ Σ_i α^0_i x_i + Σ_{i<j} β^0_ij x_i x_j ].     (5)

In this model for the total distribution of codewords, there is no stimulus dependence, and the parameters α^0_i and β^0_ij are chosen so that the distribution is as random as possible, while reproducing exactly the measured mean firing rate of every neuron, ⟨x_i⟩_data = ⟨x_i⟩_P_ME, and every pairwise correlation, ⟨x_i x_j⟩_data = ⟨x_i x_j⟩_P_ME, across the whole duration of the experiment.
Interestingly, we find that the pairwise interaction terms in the SDME model of Eq. (3) are closely related to the interactions in the static pairwise maximum entropy model of Eq. (5): the SDME interactions, β_ij, tend to be smaller in magnitude, but have the same sign and relative ordering as the static ME interactions, β^0_ij. Some degree of correspondence is expected: an interaction between neurons i and j in the static ME model captures the combined effect of the stimulus and noise correlations, while in the corresponding SDME interaction, (most of) the stimulus correlation has been factored out into the correlated dynamics of the inputs to the neurons i and j, i.e. α_i(g_i(t)) and α_j(g_j(t)). The surprisingly high degree of correspondence, however, indicates that even the interactions learned from static maximum entropy models can account, up to a scaling factor, for the pairwise neuron dependencies that are not due to the correlated stimulus inputs.
The SDME model is an approximation to the neural codebook, P({x_i}|s), while the static ME model describes the population vocabulary, P({x_i}). With these two distributions in hand, we can explore how the population jointly encodes the information about the stimulus into neural codewords, the joint activity patterns of spiking and silence. We make use of the fact that we can estimate the entropy of the maximum entropy distributions using a procedure of heat capacity integration, as explained in Refs. [27,32] (see Methods). Information (in bits) per codeword is

I({x_i}; s) = S[P({x_i})] − ⟨S[P({x_i}|s)]⟩_s,     (6)

that is, the information can be written as a difference of the entropy of the neural vocabulary and the noise entropy (the average of the entropy of the codebook), where the entropy is S[p(x)] = − Σ_x p(x) log_2 p(x). Because of the maximum entropy property of our model for P_ME({x_i}), the entropy of our static pairwise model in Eq. (5) is an upper bound on the transmitted information; expressed as an entropy rate, this amounts to s ≡ S[P_ME({x_i})]/∆t ≈ 730 bit/s. The brain does not have direct access to the stimulus, but only receives codewords {x_i}, "drawn" from P({x_i}) by the retina. At every moment in time, − log_2 P({x_i}) measures the surprise about the output of the retina, and thus about the stimulus. We, as experimenters, but not the brain, have access to stimulus repeats and thus to P({x_i}|s(t)), so we can compute the average value of the surprise (per unit time) at every instant t in the stimulus:

S(t) = −(1/∆t) ⟨log_2 P({x_i})⟩_P({x_i}|s(t)).     (7)

This quantity can be expressed using the entropies and the learned parameters of our maximum entropy models, and is plotted as a function of time in Fig. 7. Since averaging across time is equal to averaging over the stimulus ensemble, we see from Eq. (7) that ⟨S(t)⟩_t would have to be identically equal to the entropy rate S[P({x})]/∆t under the condition that ⟨P({x_i}|s(t))⟩_t = P({x_i}) (marginalization). Since we build models for P({x_i}) (static ME) and P({x_i}|s) (SDME) from the data independently, they need not obey the marginalization condition exactly, but they will do so if they provide a good account of the data. Indeed, using the static ME and SDME distributions in Eq. (7) for the surprise, we find that ⟨S(t)⟩_t ≈ 740 bit/s, very close to the entropy rate s of the total vocabulary and within our estimated 1% error bars on the entropy computation.
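The information and surprise calculations of Eqs. (6) and (7) can be illustrated exactly on a toy problem, replacing the fitted SDME/ME models with small random distributions and the heat-capacity integration with direct enumeration (everything below is a stand-in for illustration, expressed in bits per codeword rather than per second):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
N, n_stim = 4, 6
words = np.array(list(itertools.product([0, 1], repeat=N)))   # all 2^N codewords

# Random conditional distributions standing in for the codebook P(x|s_t),
# and the vocabulary P(x) as their average over a uniform stimulus ensemble.
logits = rng.normal(size=(n_stim, len(words)))
P_cond = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
P_marg = P_cond.mean(0)

def entropy_bits(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

S_total = entropy_bits(P_marg)                          # vocabulary entropy S[P(x)]
S_noise = np.mean([entropy_bits(p) for p in P_cond])    # noise entropy <S[P(x|s)]>_s
info = S_total - S_noise                                # Eq. (6), bits per codeword

# Surprise for each "time bin" t, as in Eq. (7) (per codeword, not per second):
surprise = np.array([-(P_cond[t] * np.log2(P_marg)).sum() for t in range(n_stim)])
# Marginalization: since P_marg is exactly <P(x|s_t)>_t, <surprise>_t = S_total.
```

Because the toy vocabulary is the exact average of the conditionals, the marginalization identity from the text holds exactly here, whereas for independently fitted models it holds only approximately.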
To estimate the information transmission, we have to subtract the noise entropy rate from the output entropy rate s, as dictated by Eq. (6). The entropy of the SDME model is an upper bound on the noise entropy; since this is not a lower bound, we cannot put a strict bound on the information transmission, but we can nevertheless estimate it. Figure 7 shows the "instantaneous information", I(t) = S(t) − S[P_SDME({x_i}|s(t))]/∆t, as a function of time; from Eq. (6), the mutual information rate is the time average of this quantity, R = I({x_i}; s)/∆t = ⟨I(t)⟩_t. We find R ≈ 130 bit/s. This quantity can be compared to the total entropy rate of the stimulus itself (which must be higher than R), which in our case is ≈ 210 bit/s (see Methods). While our estimates seem to indicate that a lot of vocabulary bandwidth (730 bit/s) is "lost" to noise (600 bit/s), the last comparison shows that the Gaussian FFF stimulus source itself is not very rich, so that the estimated information transmission takes up more than half of the actual entropy rate of the source.
Lastly, we asked how important the inclusion of the pairwise interactions, β_ij, into the SDME model is, compared to a set of uncoupled LN models, when accounting for information transmission. We therefore estimated the noise entropy rate for a set of uncoupled LN models, S[P_LN({x_i}|s(t))]/∆t, which was found to be ≈ 770 bit/s, considerably higher than the noise entropy of the SDME model. Crucially, this noise entropy rate is larger than the total entropy rate s estimated above, which is impossible for consistent models of the neural codebook and the vocabulary (since it would lead to negative information rates). This failure is a quantitative demonstration of the inability of the uncoupled LN models to reproduce the statistics of the population vocabulary, as shown in Fig. 4b, despite a seemingly small performance difference at the level of single cell PSTH prediction.

DISCUSSION
We presented a modeling framework for stimulus encoding by large populations of neurons, which combines an individual neuronal receptive field model with the ability to include pairwise interactions between neurons. The result is the stimulus-dependent maximum entropy (SDME) model, the most parsimonious model of the population response to the stimulus that reproduces the linear-nonlinear (LN) aspect of single cells, as well as the correlation structure between neurons. In two limiting cases, the SDME model reduces to known models: if the single cell parameters α are static, SDME becomes the static maximum entropy model of the population vocabulary; if the couplings β are 0, SDME becomes a set of uncoupled LN models. The framework is general, and could be easily applied to other neural systems.
We applied this modeling framework to the salamander retina presented with Gaussian white noise stimuli, and found that the interactions between neurons play an important role in determining the detailed patterns of the population response. In particular, the SDME model gave a better prediction of the PSTH of single cells, yielded orders of magnitude improvement in describing the population patterns, and captured significant aspects of the noise correlations. The deviations between the SDME and the uncoupled LN model became significant for > 20 cells, and tended to occur at "interesting" times in the stimulus, precisely when the neural population was not silent.
The responses of the neural system in the maximum entropy framework are binary codewords of spiking and silence across the neural population.The choice of timescale over which these codewords are defined, here ∆t = 10 ms, is short enough such that multiple spikes are rarely observed in the same time bin, but long enough so that most of the strong spike-spike interactions (as well as fine temporal detail, such as spike-timing jitter) occur within a single bin.This allows us to view successive time bins as codewords, although some statistical dependence between them remains, possibly in the conditional SDME model (due to multiple timescales on which the neurons respond to stimuli), and certainly in the static ME model [31].If we were to make the time scale much shorter, e.g. by an order of magnitude, we could make the conditional independence assumption of the responses given the stimuli and previous spiking, which would lead us to GLM models [40] or nonequilibrium generalizations of Ising models [47].GLMs, in particular, are excellent generating models for precise spiking rasters, are easy to infer, and allow for asymmetric couplings between neurons.However, the inference in all these cases is tractable because there are no interactions between the spikes within the same time bin (as there are in SDME).This necessitates the use of very short time bins and introduces strong dependencies between successive time bins, making the interpretation of the discretized neural responses in terms of individual codewords difficult.For this reason, GLM and SDME are complementary approaches: the first allows for a temporally-detailed probabilistic description of a spiking process, while the second gives an explicit expression for the probability distribution over codewords in longer temporal bins.SDME allowed us to improve over LN models for salamander retinal ganglion cells both in terms of the PSTH prediction and, especially, in terms of population activity 
patterns. Interestingly, for parasol cells in the macaque retina under flickering checkerboard stimulation, the generalized linear model did not improve the firing-rate prediction relative to uncoupled LN models (though it did improve higher-order statistics of firing) [40]. In both cases, however, the improvements reflect the role of dependencies among cells in encoding the stimulus, and their effect becomes apparent when we ask questions about information transmission by a neural population. Maximum entropy models can only put upper bounds on the total entropy and the noise entropy of the neural code (a statement that remains true even if successive codewords are not independent), and as such cannot set a strict bound, but only give an estimate, for the information transmission. Nevertheless, ignoring the inter-neuron dependencies and using an uncoupled set of LN models predicts the total population responses so poorly that the estimated noise entropy exceeds the upper bound on the total entropy, which is a clear impossibility, while the SDME model gives transmission rates that appear reasonable (and positive), amounting to about 60% of the source entropy rate.
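As a concrete illustration of the codeword construction used throughout, spike trains can be discretized into binary population words at ∆t = 10 ms. The sketch below is our own, minimal rendering of that step; the function name and input layout are assumptions, not the authors' analysis code:

```python
import numpy as np

def spikes_to_codewords(spike_times, n_neurons, t_start, t_end, dt=0.010):
    """Discretize spike trains into binary population codewords.

    spike_times[i] is an array of spike times (in seconds) for neuron i.
    Returns a (n_bins, n_neurons) array whose row t is the codeword in
    time bin t: entry 1 if neuron i fired at least once in that 10 ms bin
    (multiple spikes per bin are rare at this dt and are collapsed to 1).
    """
    edges = np.arange(t_start, t_end + dt / 2, dt)
    words = np.zeros((len(edges) - 1, n_neurons), dtype=np.uint8)
    for i in range(n_neurons):
        counts, _ = np.histogram(spike_times[i], bins=edges)
        words[:, i] = counts > 0
    return words
```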
Tkačik and colleagues [61] have suggested that one can interpret β_ij in an SDME model as a prior over the activity patterns that the population would use to optimally encode the stimulus. For low noise levels they argued that the prior should be minimal (and could help decorrelate the responses), as the population could faithfully encode the stimulus, whereas in the noisy regime the prior should match the statistics of the sensory world and thus counteract the effects of noise. Berkes and colleagues [62] proposed a similar explanation for the similarity of ongoing and induced activity patterns in the visual cortex. Our results show that interactions are necessary for capturing the network encoding, and implicitly reflect the existence of such a prior. The recovered interactions are strongly correlated with the interaction parameters of a static, stimulus-independent model of the distribution of patterns, making it possible for the brain (which only has access to the spikes, not the stimulus) to learn these values. Whether the interactions are matched to the statistics of the visual inputs, as suggested by Ref. [61], is the focus of future work. In parallel, increasingly detailed statistical models of neural codes going beyond SDME (e.g., including temporal dependencies as in Ref. [48]), and efforts to infer such models from experimental data, should focus our attention on population-level statistics and on finding principled information-theoretic measures for quantifying the code, like the surprise and instantaneous information suggested here.

METHODS
Electrophysiology. Experiments were performed on the adult tiger salamander, Ambystoma tigrinum. All experiments were in accordance with Ben-Gurion University of the Negev and government regulations. Extracted retinas were placed with the ganglion cell layer facing a multielectrode array with 252 electrodes (Ayanda Biosystems, Switzerland), and superfused with oxygenated Ringer medium at room temperature. Extracellularly recorded signals were amplified (MultiChannel Systems, Germany), digitized at 10,000 samples/s, and spike-sorted using custom software written in MATLAB.
Visual stimulation. Stimuli were projected onto the retina from a CRT video monitor (ViewSonic G90fB) at a frame rate of 60 Hz, with each movie frame presented twice, using standard optics. Full-field flicker (FFF) stimuli were generated by independently sampling spatially uniform gray levels (at 8-bit resolution) from a Gaussian distribution with a mean luminance of 147 lux and a standard deviation of 33 lux. These data allow us to estimate the entropy rate of the source (as used in the main text) by multiplying the entropy of the luminance distribution by the refresh rate. To estimate the cells' receptive fields, a checkerboard stimulus was generated by randomly setting each checker (∼100 µm on the retina) to black or white every 33 ms. To identify the RF centers, a two-dimensional Gaussian was fitted to the spatial profile of the response. The movies were gamma-corrected for the computer monitor. In all cases the visual stimulus entirely covered the retinal patch used for the experiment.
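The source entropy-rate estimate described above can be sketched as follows. This is an illustrative reconstruction, not the authors' exact procedure: the ±4 std truncation, the linear mapping of the 256 gray levels onto luminance, and the function name are all our own assumptions.

```python
import numpy as np
from math import erf, sqrt

def quantized_gaussian_entropy_bits(mean, std, levels=256, n_std=4.0):
    """Entropy (bits) of a Gaussian luminance distribution quantized to
    `levels` equally spaced gray levels spanning mean +/- n_std * std.
    The linear lux-to-gray-level mapping is an illustrative assumption."""
    edges = np.linspace(mean - n_std * std, mean + n_std * std, levels + 1)
    # probability mass in each gray-level bin, from the Gaussian CDF
    cdf = np.array([0.5 * (1.0 + erf((e - mean) / (std * sqrt(2.0))))
                    for e in edges])
    p = np.diff(cdf)
    p = p[p > 0]
    p = p / p.sum()  # renormalize after truncating the tails
    return float(-(p * np.log2(p)).sum())

# Entropy rate of the source: bits per frame times the effective 30 Hz
# refresh (60 Hz monitor, each frame shown twice)
rate_bits_per_s = quantized_gaussian_entropy_bits(147.0, 33.0) * 30.0
```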
Inferring SDME from data. The LN model for each neuron i consists of a linear filter k_i and a nonlinear function N_i, which is defined pointwise on a set of binned values of the generator signal, g_i = k_i · s. We binned g_i into K = 20 bins such that initially each bin contained roughly the same number of values; the binning was subsequently adjusted adaptively (separately for each neuron) to be denser at higher values of g_i, where the firing rates are higher. We fitted LN models with varying numbers of bins and chose K = 20, at which the performance of the LN models appeared to saturate [63].
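The linear stage and the initial equal-occupancy binning of the generator signal can be sketched as below; the subsequent per-neuron adaptive refinement is omitted, and the function names are hypothetical:

```python
import numpy as np

def generator_signal(stimulus, k):
    """Linear stage of the LN model: g(t) = k . s(t), the stimulus
    filtered by the cell's receptive-field filter k (causal convolution)."""
    return np.convolve(stimulus, k, mode='full')[:len(stimulus)]

def equal_occupancy_edges(g, K=20):
    """Initial binning: bin edges at quantiles of g, so that each of the
    K bins holds roughly the same number of samples."""
    return np.quantile(g, np.linspace(0.0, 1.0, K + 1))

def bin_indices(g, edges):
    """Assign each value of g to one of the K bins defined by edges."""
    return np.clip(np.digitize(g, edges[1:-1]), 0, len(edges) - 2)
```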
To find the parameters of the stimulus-dependent maximum entropy model (α_i(g_i), β_ij), we retained the binning of the generator signal used for LN model construction. Given trial values of the SDME parameters, we estimated the chosen expectation values (the covariance matrix C_ij of firing, and the firing rate conditional on g_i, r_i(g_i)) by Monte Carlo sampling from the trial distribution in Eq. (3); the learning step of the algorithm is computed by comparing the expectation values under the trial distribution and the empirical distribution (computed over the training half of the stimulus repeats). In detail, we used a gradient ascent algorithm, applying a combination of Gibbs sampling and importance sampling to estimate the gradient efficiently, with optimizations similar to those described in Ref. [64]. Sampling was carried out in parallel on a 16-node cluster with two 2.66 GHz Intel Quad-Core Xeon processors and 16 GB of memory per node. The calculation was terminated when the average errors in firing rates and coincident firing rates fell below 1% and 5%, respectively, which is within the experimental error.
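A minimal sketch of the Gibbs sampling step is given below, assuming the pairwise form of the SDME distribution at a fixed time bin, P(x) ∝ exp(Σ_i α_i x_i + Σ_{i<j} β_ij x_i x_j); the importance sampling, gradient ascent, and parallelization used in the actual inference are not shown:

```python
import numpy as np

def gibbs_sample(alpha, beta, n_sweeps=50, rng=None):
    """Draw one sample from a pairwise binary model
        P(x) ∝ exp( sum_i alpha_i x_i + sum_{i<j} beta_ij x_i x_j ),
    where alpha_i stands for the stimulus-dependent field alpha_i(g_i)
    at a fixed time bin. beta must be a symmetric (N, N) matrix with
    zero diagonal. A sketch of the sampler used to estimate model
    expectation values, not the authors' optimized implementation."""
    rng = np.random.default_rng() if rng is None else rng
    N = len(alpha)
    x = rng.integers(0, 2, size=N)
    for _ in range(n_sweeps):
        for i in range(N):
            # P(x_i = 1 | rest) is logistic in the local field
            h = alpha[i] + beta[i] @ x
            x[i] = rng.random() < 1.0 / (1.0 + np.exp(-h))
    return x
```

Averaging many such samples over time bins yields the model expectations (r_i(g_i), C_ij) that the gradient step compares against the data.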
To compute the single-neuron PSTH and compare the distributions of codewords from the model to the empirical distribution, we used Metropolis Monte Carlo sampling to draw codewords from the model distributions; we drew 5000 independent samples for every time point (to obtain uncorrelated configurations, a sample was recorded only after 100 "spin-flip" trials), for a total of 5 × 10^6 samples; the same procedure was used to draw from the uncoupled (β = 0) models. To estimate the entropies of high-dimensional SDME distributions, we used the "heat capacity integration" method, detailed in Ref. [32]. Briefly, a maximum entropy model P(x) = Z^{-1} exp(-E(x)) (where E is the Hamiltonian determined by the choice of constrained operators and the conjugate parameters) is extended by introducing a new parameter T, much like the temperature in physics, so that P_T(x) = Z_T^{-1} exp(-E(x)/T). The entropy of the distribution is given by S[P_{T=1}] = ∫_0^1 C(T)/T dT, where the heat capacity C(T) = σ_E^2(T)/T^2, and the variance in energy can be estimated at each T by Monte Carlo sampling. In practice, we ran a separate Monte Carlo sampling for a finely discretized interval of temperatures, T ∈ [0, 1], estimated C(T) at each temperature, and numerically integrated to obtain the entropy S. We have previously shown that this procedure yields robust entropy estimates even for large numbers of neurons [27, 32].
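Heat-capacity integration can be illustrated on a toy system in which exact enumeration of states replaces the Monte Carlo estimate of the energy variance; the function name and the trapezoidal temperature grid are our own choices:

```python
import numpy as np

def entropy_by_heat_capacity(E, temps):
    """Entropy of P(x) = Z^{-1} exp(-E(x)) via heat-capacity integration,
        S = integral_0^1 C(T)/T dT,   C(T) = Var_T[E] / T^2.
    E is the vector of energies of all states: exact enumeration stands in
    here for the Monte Carlo estimates of Var_T[E] used for large N."""
    integrand = np.empty(len(temps))
    for k, T in enumerate(temps):
        w = np.exp(-E / T)           # Boltzmann weights at temperature T
        p = w / w.sum()
        varE = p @ (E * E) - (p @ E) ** 2
        integrand[k] = varE / T**3   # C(T)/T = Var_T[E] / T^3
    # trapezoidal rule over the (finely discretized) temperature grid
    dT = np.diff(temps)
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * dT))
```

For a two-state system with energies 0 and 1, the integral reproduces the Gibbs entropy of P_{T=1} to high accuracy, which is the consistency check behind the method.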

FIG. 1: Response of a large population of ganglion cells to a 10 s long repeated visual stimulus. (a) White-noise uncorrelated Gaussian stimulus presented at 30 Hz, and the spiking patterns of 3 cells in response to repeated presentations of the stimulus. (b) Spike-triggered averages of 110 simultaneously recorded cells; a subset of 100 cells was chosen for further analysis. (c) Histogram of pairwise correlations between cells for a repeated Gaussian white-noise stimulus (green), a repeated natural pixel movie (red), and a non-repeated natural pixel movie (blue) [35]. (d) Average pairwise correlation coefficient between cells as a function of distance (mean and std across pairs of cells at a given distance).

FIG. 2: The SDME model predicts the firing rates of single cells better than LN models. (a) Example PSTH segment for one cell (blue), with the best prediction of an LN model (green) and of an SDME model (red). (b) Correlation between the true PSTH and the SDME model prediction (vertical axis) vs. the correlation between the true PSTH and the LN model prediction (horizontal axis); each plot symbol is a separate cell, and the dotted line shows equality. The neuron chosen in panel (a) is shown in orange.

FIG. 3: The SDME model predicts population activity patterns for N = 100 neurons better than conditionally independent LN models. (a) The log-likelihood ratio of the population firing patterns under the SDME model and under a collection of LN models, shown as a function of time (red) for an example stimulus repeat. For reference, the average population firing rate is shown in grey. (b) Scatterplot of the log-likelihoods under the SDME and LN models; each dot represents an average over activity patterns {x_i} (across repeats) at a given time bin; the dotted line shows equality. (c) The log-likelihood ratio under the SDME and LN models as a function of the average firing rate in the population; SDME outperforms the LN models in particular for patterns with more spiking activity. (d) The average likelihood ratio between the SDME and LN models as a function of the population size, N (error bars = std over 10 randomly chosen groups of neurons at each N).

FIG. 4: The performance of various models in accounting for the total vocabulary of the population, P({x_i}). Results are shown for the SDME model in (a), for an uncoupled set of LN models in (b), and for a full conditionally independent model in (c). The first row displays the log ratio of model to empirical probabilities for various codewords (dots) as a function of each codeword's empirical frequency in the recorded data. The model probabilities were estimated by generating Monte Carlo samples from the corresponding model distributions (see Methods); only patterns that were generated in the MC run as well as found in the recorded data are shown. The second row summarizes this scatterplot by binning codewords according to their frequency and showing the average log probability ratio in each bin (solid line), as well as the 1 std scatter across the codewords in the bin (shaded area). The highly probable all-silent state, {x_i} = 0, is shown separately as a circle. The third row shows the overlap between the 500 most frequent patterns in the data and the 500 most likely patterns generated by the model (see text).

FIG. 5: Measured vs. predicted noise correlations for the SDME model. Noise correlation (see text) is estimated from recorded data for every pair of neurons and plotted against the noise correlation predicted by the SDME model (each pair of neurons = one dot; shown are N(N − 1)/2 dots for N = 100 neurons). Conditionally independent models predict zero noise correlation for all pairs.

FIG. 6: SDME model parameters. (a) Average values of the LN-like driving term, α_i(g_i), where g_i = k_i · s, across all cells i (error bars = std across cells), for each of the K = 20 adaptive bins for g_i (see Methods). (b) Pairwise interaction map β_ij of the SDME model, between all N = 100 neurons in the experiment. (c) Histogram of pairwise interaction values from (b), and their average value as a function of the distance between cells (inset). (d) For each pair of cells i and j, the value of β^0_ij under the static maximum entropy model of Eq. (5) vs. β_ij from the stimulus-dependent maximum entropy model of Eq. (3).

FIG. 7: Surprise and information transmission estimated from the SDME model. (a) The surprise rate (blue) is estimated from the static ME and SDME models assuming independence of codewords across time bins. The instantaneous information rate (red) is the difference between the surprise and the noise entropy rate, estimated from the SDME model (see text). The information transmission rate is the average of the instantaneous information across time. (b) Population firing rate as a function of time, showing that bursts of spiking strongly correlate with bursts of surprise and information transmission in the population. (c) The stimulus (normalized to zero mean and unit variance), shown for reference as a function of time.