The structured ‘low temperature’ phase of the retinal population code

Recent advances in experimental techniques have allowed the simultaneous recordings of populations of hundreds of neurons, fostering a debate about the nature of the collective structure of population neural activity. Much of this debate has focused on the empirical findings of a phase transition in the parameter space of maximum entropy models describing the measured neural probability distributions, interpreting this phase transition to indicate a critical tuning of the neural code. Here, we instead focus on the possibility that this is a first-order phase transition which provides evidence that the real neural population is in a ‘structured’, collective state. We show that this collective state is robust to changes in stimulus ensemble and adaptive state. We find that the pattern of pairwise correlations between neurons has a strength that is well within the strongly correlated regime and does not require fine tuning, suggesting that this state is generic for populations of 100+ neurons. We find a clear correspondence between the emergence of a phase transition, and the emergence of attractor-like structure in the inferred energy landscape. A collective state in the neural population, in which neural activity patterns naturally form clusters, provides a consistent interpretation for our results.


Estimating Chromatic Sensitivities
We estimated color-dependent spike-triggered averages (STAs) for each ganglion cell. To identify the receptive field (RF) centers within each STA, we took the normalized dot product of each checker's time course with the time course of the extremal checker. Checkers were identified as belonging to the RF center if this dot product exceeded 0.3 and they were connected, by 4-neighbor connectivity, to the extremal checker. The time course of the sum over the RF-center checkers was defined as the temporal time course of the STA.
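The RF-center procedure above (dot-product threshold plus 4-neighbor connectivity) can be sketched as follows. This is an illustrative reimplementation, not the authors' analysis code; the function and variable names are ours.

```python
import numpy as np

def rf_center_checkers(sta, threshold=0.3):
    """Identify RF-center checkers from an STA.

    sta : array of shape (ny, nx, nt), the time course of each checker.
    Returns a boolean (ny, nx) mask of RF-center checkers: those whose
    normalized dot product with the extremal checker exceeds `threshold`
    and that are 4-connected to the extremal checker.
    """
    ny, nx, nt = sta.shape
    # Extremal checker: largest absolute deviation anywhere in its time course.
    flat_peak = np.abs(sta).reshape(ny * nx, nt).max(axis=1)
    ey, ex = np.unravel_index(flat_peak.argmax(), (ny, nx))
    ref = sta[ey, ex] / np.linalg.norm(sta[ey, ex])
    # Normalized dot product of each checker's time course with the extremal one.
    norms = np.linalg.norm(sta, axis=2)
    norms[norms == 0] = 1.0
    similarity = (sta @ ref) / norms
    candidate = similarity > threshold
    # Keep only candidates 4-connected to the extremal checker (flood fill).
    mask = np.zeros((ny, nx), dtype=bool)
    stack = [(ey, ex)]
    while stack:
        y, x = stack.pop()
        if 0 <= y < ny and 0 <= x < nx and candidate[y, x] and not mask[y, x]:
            mask[y, x] = True
            stack.extend([(y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)])
    return mask
```

The flood fill ensures that a distant checker with a similar time course (e.g. noise) is excluded unless it is contiguous with the extremal checker.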
To estimate each ganglion cell's chromatic sensitivity we found (across all three color-dependent temporal time courses) the absolute maximum deviation from zero. We then took the three values of the color-dependent STAs at this time point and normalized them to generate a 3 × 1 unit vector Φ^(d/l)_i representing the color sensitivity of each ganglion cell i in each light condition (d/l). To model the chromatic sensitivity of the 'red rod' or 'green cone' photopigment in our experimental setup, we measured the spectral output of the monitor's red, green and blue guns directly [Ocean Optics USB 2000]. We took estimates of the red rod and green cone absorption spectra from the literature [Perry et al., 1991; Baylor et al., 1987; Cornwall et al., 1984], and took the dot product of these spectra with the measured monitor outputs. Our estimates of the green cone and red rod photopigment sensitivities to the red, green and blue monitor guns were X_g = [1.15, 1.23, 0.42] and X_r = [0.20, 1.49, 0.86]. For example, for cell i in the dark adapted condition (d), the projection of the STA onto the rod (r) was defined as z^d_{i,r} = X_r · Φ^d_i / 3. The rod-cone separation for each projection was defined as the distance of the projection to the unity line.
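As a minimal sketch, the projection z = X · Φ / 3 can be computed as below, using the sensitivity vectors quoted in the text. The explicit distance-to-unity-line formula, |z_r − z_g|/√2, is our assumption about how that distance was measured, and the function name is ours.

```python
import numpy as np

# Photopigment sensitivities to the monitor's R, G, B guns (values from the text).
X_g = np.array([1.15, 1.23, 0.42])  # green cone
X_r = np.array([0.20, 1.49, 0.86])  # red rod

def chromatic_projections(phi):
    """Project a cell's unit chromatic-sensitivity vector phi (shape (3,))
    onto the rod and cone sensitivities, z = X . phi / 3 as in the text."""
    z_r = X_r @ phi / 3.0
    z_g = X_g @ phi / 3.0
    # Rod-cone separation: distance of the point (z_r, z_g) to the line
    # z_r = z_g. This particular formula is our assumption.
    separation = abs(z_r - z_g) / np.sqrt(2.0)
    return z_r, z_g, separation
```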

Model Inference and Fitting the Expectation Values
We binned each ganglion cell's spike train in 20 ms time windows, assigning 0 for no spikes and 1 for one or more spikes, r_i ∈ {0, 1}. We denote a particular population activity pattern as R = {r_i}. Maximum entropy models describing the distributions of time-averaged variables are fit with unitless 'energies':

E(R) = Σ_i h_i r_i + (1/2) Σ_{i≠j} J_ij r_i r_j + Σ_{K=0}^{N} λ_K δ(K, Σ_i r_i),

with δ the Kronecker delta. This model is constrained to match the expectation values ⟨r_i⟩, ⟨r_i r_j⟩, and ⟨δ(K, Σ_i r_i)⟩, averaged over the training data.
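The k-pairwise energy just described can be evaluated directly for any binary pattern. The sketch below follows our reconstruction of the equation (sign conventions vary between papers); names are illustrative.

```python
import numpy as np

def kpairwise_energy(r, h, J, lam):
    """Unitless energy of a binary pattern r under the k-pairwise model,
        E(R) = sum_i h_i r_i + (1/2) sum_{i!=j} J_ij r_i r_j + lambda_K,
    where K = sum_i r_i selects one entry of the k-potential.

    r   : (N,) array of 0s and 1s
    h   : (N,) local fields
    J   : (N, N) symmetric interaction matrix with zero diagonal
    lam : (N+1,) potential on the summed activity K
    """
    K = int(r.sum())
    return h @ r + 0.5 * r @ J @ r + lam[K]
```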
The maximum entropy model inference process implements sequential coordinate gradient descent, described in [Dudik et al., 2004; Broderick et al., 2007; Tkačik et al., 2014], which uses an L1 regularization cost on parameters, derived from the expected measurement error on the corresponding expectation values. For the k-pairwise model, we inferred without a regularization cost on the local fields. This emphasizes solutions to the inference of an 'independent' nature, so that terms in the interaction matrix, and the k-potential, are inferred only when necessary.
During inference, we randomly selected four fifths of the data states for training, and withheld the rest for testing. All models presented in our work were inferred with the same preset hyperparameters. On each round of inference we monitored the deviations between model and data expectation values, normalized by the appropriate error estimate (i.e. these were z-scores), excluding those rare expectation values which were zero in the training data. If the rms of z-scores fell below 1 on a particular round, that round exited early, and we progressed to the next set of hyperparameters. These early exits were implemented to avoid overfitting in the networks with artificially low correlation (such as the α = 0.1 network in the scaled covariance manipulation in the main text).

Figure S1: Comparing Error Estimates. Error estimate for the pairwise expectation value ⟨r_i r_j⟩ in the light dataset, for each pair of cells, over 128 · 127/2 pairs. Bootstrap resamples were estimated as standard deviations over 30 resamples of the data. The standard error was given by the standard deviation over samples, divided by the square root of the number of samples (1.4 · 10^5 training data states). The dashed line is the unity line.
The error estimate in the z-scores had contributions from both error estimates in the testing data and MC samples drawn from the model. We used the standard error of the mean as estimates of the error, because bootstrap resampling to estimate the errors of our expectation estimates was not always an available option. For example, in the manipulation where we scaled the covariance matrix by a constant α, we had direct access to the new covariance matrix, but there were no 'data' states to resample. Thus for consistency across all inference procedures we worked with the standard error of the mean. We have found the two estimates of variability to be comparable in our data (Fig. S1).
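The comparison between the two variability estimates can be reproduced on synthetic data. This sketch (our construction, with an arbitrary synthetic firing rate) mirrors Fig. S1: the standard error of the mean versus the standard deviation over 30 bootstrap resamples.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic binarized spike train: 20,000 bins at a 5% firing probability.
x = rng.binomial(1, 0.05, size=20000).astype(float)

# Standard error of the mean.
sem = x.std(ddof=1) / np.sqrt(x.size)

# Bootstrap: standard deviation of the mean over 30 resamples, as in the text.
boot_means = [rng.choice(x, size=x.size, replace=True).mean() for _ in range(30)]
boot = np.std(boot_means)
```

For well-sampled expectation values the two estimates should agree to within the sampling noise of the bootstrap itself, consistent with the comparability reported in Fig. S1.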
We found that fit quality improvement (estimated on the testing data) slowed substantially after the first several thousand parameter updates [Fig. S2 E,G and Fig. S3 E,G]. Note that the error bar depends on the number of samples in both the data (there was 4X more training data than testing data), and in the model (the number of MC samples on a given iteration). Thus on each subsequent increase in MC sample number the estimated errors decreased and the z-scores increased.
In detail, we show the quality of fits for the two datasets. These are from the recording in Experiment #1, for the natural movie stimulus (M1), in the two light conditions ( Fig. S2 and Fig. S3, panels A-G).

Matching Other Statistics In the Distributions of Neural Responses
To test how well our models captured aspects of the distribution that were not explicitly fit by the models, we followed two recent publications [Tkačik et al., 2014;Ganmor et al., 2011].
We estimated the triplet correlations C_ijk = ⟨(r_i − ⟨r_i⟩)(r_j − ⟨r_j⟩)(r_k − ⟨r_k⟩)⟩ over all states in the data, and over 5 · 10^5 MC samples drawn from the model. These were then sorted by value in the data, and grouped into one thousand equally populated groupings. The mean and standard deviation over those groupings were close to the unity line (Panel A in Fig. S4, S5). These results appear very similar to those in [Tkačik et al., 2014] (there is a factor of 8 shift in the absolute values of C_ijk arising from a change in variables from their spin notation, [−1, +1], to our spiking notation, [0, 1]).
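The triplet correlation estimator can be written compactly. This is an illustrative loop version (our naming); for N ~ 100 cells one would vectorize over the ~10^5 triplets.

```python
import numpy as np

def triplet_correlations(R):
    """C_ijk = <(r_i - <r_i>)(r_j - <r_j>)(r_k - <r_k>)> over samples.

    R : (T, N) binary array of population states.
    Returns a dict keyed by (i, j, k) with i < j < k.
    """
    T, N = R.shape
    d = R - R.mean(axis=0)  # deviations from each cell's mean rate
    C = {}
    for i in range(N):
        for j in range(i + 1, N):
            for k in range(j + 1, N):
                C[(i, j, k)] = (d[:, i] * d[:, j] * d[:, k]).mean()
    return C
```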
Maximum entropy models also do fairly well in matching the conditional probability of a spike [Tkačik et al., 2014]. Specifically, for any given state R, each cell i has a predicted probability of a spike,

P(r_i = 1 | {r_{j≠i}}) = 1 / (1 + e^{−h^eff_i}),

where the effective field h^eff_i depends on the local field h_i, the interactions J_ij with the other active cells, and the difference of k-potential terms λ_{K+1} − λ_K (with K = Σ_{j≠i} r_j).
Following [Tkačik et al., 2014] we compared how well the model captured the experimental conditional probabilities (Panel B in Fig. S4,5). To do this, we took all the states in the data, and estimated the effective field for every state for every cell. We then sorted the actual binary activities by the strength of the effective field, and averaged these populations of activities in equally spaced bins (black dots). Error bars are standard deviations over the appropriate populations. These should be compared to the model prediction, which is in red (a logistic function). The gray shaded area is the probability density function of the effective field values. Inset is the same graph, on a log scale, to demonstrate that for low effective fields the conditional probabilities are well matched.
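A minimal sketch of the conditional spike probability follows. It assumes the energy convention E(R) = Σ_i h_i r_i + (1/2) Σ_{i≠j} J_ij r_i r_j + λ_K with P(R) ∝ e^{−E(R)}; the sign conventions and names are our assumptions, not the authors' code.

```python
import numpy as np

def conditional_spike_prob(r, i, h, J, lam):
    """P(r_i = 1 | all other cells) under a k-pairwise model.

    With E = sum h r + (1/2) sum J r r + lam[K] and P ~ exp(-E), flipping
    cell i from 0 to 1 changes the energy by
        dE = h_i + sum_{j != i} J_ij r_j + lam[K+1] - lam[K],
    where K = sum_{j != i} r_j, so P(r_i = 1 | rest) = 1 / (1 + exp(dE)).
    """
    K = int(r.sum() - r[i])
    dE = h[i] + J[i] @ r - J[i, i] * r[i] + lam[K + 1] - lam[K]
    return 1.0 / (1.0 + np.exp(dE))
```

Sorting the observed binary activities by this effective field and comparing the binned spike fractions to the logistic curve reproduces the comparison in Panel B of Fig. S4, S5.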
We also estimated the ability of our models to capture the probabilities of particular states. Here we compared our results to [Ganmor et al., 2011] (Panel C in Fig. S4, S5). All states in the data were grouped by their unique probability (estimated directly from the data). Over these groups, we estimated the average and standard deviation of the log likelihood ratio of the model to the data. The shaded area corresponds to the 95% confidence interval given by a binomial distribution.
Our model fits performed well in capturing these higher order statistics, with comparable quality to the fits in previously published results [Tkačik et al., 2014;Ganmor et al., 2011].

Analytical vs. Model Fits for Independent Networks
The analytical solution for independent neurons consists of a local field h^ind_i for each neuron, which depends solely on the measured firing rate ⟨r_i⟩. All other terms (J_ij, λ_K) are zero. However, for shuffled data, inference returned models with non-zero entries in the interaction matrix and k-potential. Because of this discrepancy we compared the shuffled-data model to the analytical solution by comparing the expectation values and shapes of the specific heat predicted by both (Fig. S6).
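For concreteness, under the energy convention E(R) = Σ_i h_i r_i with P(R) ∝ e^{−E(R)}, the independent-model field has a closed form; the sign convention here is our assumption.

```python
import numpy as np

def independent_field(rate):
    """Analytical local field for an independent 0/1 neuron with mean
    rate <r_i>, under E = sum_i h_i r_i and P ~ exp(-E):
        P(r_i = 1) = 1 / (1 + exp(h_i))  =>  h_i = ln((1 - <r_i>) / <r_i>).
    """
    return np.log((1.0 - rate) / rate)
```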

Pairwise Maximum Entropy Model Results
Our main result concerning the presence of a phase transition was qualitatively similar in pairwise maximum entropy model fits to the data. As in the main text, the pairwise maximum entropy model was fit to subsets of cells in the light and dark datasets. A fictitious temperature was introduced, and the variance of energy levels was evaluated over Monte Carlo samples from these distributions.
The specific heat at different system sizes for the light dataset is plotted in Fig. S7A, for the dark in Fig. S7B. The peak temperature, T max , and peak value, C(T max ), of the specific heat are plotted as a function of the inverse size of the analyzed neural population, 1/N (Fig. S7C,D). As for the k-pairwise model, the peak sharpens, yet is always to the right of the operating point of the network, suggesting that the network is on the low-temperature side of the phase transition.
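The fictitious-temperature construction amounts to dividing the inferred energies by T and measuring the variance of the energy over Monte Carlo samples, C(T) = var(E)/T². The single-chain Metropolis sketch below is illustrative (our naming; the paper uses an annealing schedule, and its normalization of C may include per-neuron factors).

```python
import numpy as np

def specific_heat(h, J, T, n_samples=20000, burn_in=2000, seed=0):
    """Specific heat C(T) = var(E) / T^2 from Metropolis sampling of a
    pairwise model P(R) ~ exp(-E(R) / T), with
    E(R) = sum_i h_i r_i + (1/2) sum_ij J_ij r_i r_j."""
    rng = np.random.default_rng(seed)
    N = len(h)
    r = rng.integers(0, 2, size=N).astype(float)

    def energy(state):
        return h @ state + 0.5 * state @ J @ state

    E = energy(r)
    samples = []
    for t in range(n_samples + burn_in):
        # Propose a single-spin flip and accept with the Metropolis rule.
        i = rng.integers(N)
        r_new = r.copy()
        r_new[i] = 1.0 - r_new[i]
        E_new = energy(r_new)
        if rng.random() < np.exp(-(E_new - E) / T):
            r, E = r_new, E_new
        if t >= burn_in:
            samples.append(E)
    return np.var(samples) / T**2
```

Sweeping T on a grid around T = 1 and locating the maximum of C(T) yields T_max and C(T_max) for each subnetwork size N.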

Characterizing the Transition With Respect to Correlation, α
We found a transition in the properties of inferred networks while scaling the correlation strength down by a constant α, with α ranging from 0 to 1 (N = 128 curve in Figure 7C). To understand this transition further, we subsampled networks of different system size, inferred pairwise maximum entropy models for these subnetworks of neurons, and calculated the specific heat of each subnetwork (Figure 7C, Figure S8A). When we plotted the specific heat as a function of correlation strength for different subnetwork sizes, we found that the shapes of all of the curves appeared qualitatively similar (Fig. S8A). This observation suggested that a relatively simple scaling relationship for the contribution to the specific heat that depended on the system size might be able to account for this set of different curves.
To estimate this N-dependent scaling, we subtracted out the specific heat for subnetworks with N = 20 cells from all the other larger networks, and asked how this change in the specific heat varied with system size (Fig. S8C). The shape of the specific heat above some critical point (α*) could be described as having an additional sigmoidal term whose amplitude depended on N, but whose shape was mostly unchanged across system size.
We write mostly because the scaled curves in Fig. S8C did exhibit some minor systematic dependence on system size N. There were two possibilities here: First, that the contribution to the specific heat above α * vanishes at α * . This would indicate a discontinuity in the derivative of the specific heat, classifying the phase transition as third-order. Second, the contribution to the specific heat is finite, providing a jump at the critical point α * . Such a discontinuity would classify the phase transition as second-order. We applied finite-size extrapolation to help disambiguate between the two possibilities.
To do this, we fit the scaled change in the specific heat to the inverse of the scaled change in system size for the three largest networks (N = 70, 100, 128), for a given value of the correlation, α. The intercept of such a fit is the extrapolated estimate of the behavior of the specific heat at the thermodynamic limit (as N → ∞). An example of this extrapolation for α = 0.34 is plotted in Fig. S8D. When we plotted the extrapolated values of the specific heat (N → ∞) as a function of the correlation strength, we found that this curve again appeared to be sigmoidal (Fig. S8E).
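The extrapolation step is a linear fit against inverse system size, with the intercept taken as the N → ∞ estimate. This sketch (our naming; the exact scaling variables in the paper may differ) illustrates the procedure on synthetic values with a known 1/N correction.

```python
import numpy as np

def extrapolate_to_thermodynamic_limit(Ns, values):
    """Fit values against 1/N and return the intercept, i.e. the
    extrapolated N -> infinity estimate."""
    slope, intercept = np.polyfit(1.0 / np.asarray(Ns, float),
                                  np.asarray(values, float), 1)
    return intercept

# Example: synthetic observable behaving as v(N) = 2.0 + 5.0 / N,
# evaluated at the three largest network sizes used in the text.
Ns = [70, 100, 128]
vals = [2.0 + 5.0 / n for n in Ns]
```

Repeating this fit at each value of α traces out the extrapolated specific-heat curve of Fig. S8E.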
This suggests that the term in the specific heat that depends on system size vanishes at the critical point, which is more consistent with a third-order phase transition than a second-order one.
To estimate the critical point, we fit the finite-size extrapolation with a hyperbolic tangent truncated at the halfway point, with all values less than α* set to zero (MATLAB's nlinfit function, dashed line in Fig. S8E). Such a function is continuous but its derivative is discontinuous at α*. This provided us with an estimate of the critical point, α* = 0.232.
To summarize, the extrapolated behavior of the specific heat in the thermodynamic limit was similar in shape to the measured behavior of the full system with N = 128 neurons. Thus, extrapolating did not change the qualitative nature of the result: this phase transition remained more consistent with third-order than second-order.

Persistence Indices
We saw in the main text that the persistence index for any given neural activity state correlated well with its dwell time estimated by finite temperature (T = 1) MC sampling dynamics (Fig. 9C).
We show here a plot of the persistence indices from the same neural activity state measured in the light vs dark conditions, which exhibited high correlation (Fig. S9A). This correlation is evidence that the energy landscape has a similar shape in light and dark conditions, even though the specific details of the neural code change considerably (Fig. 2). This similarity is important, because the stimulus was the same in both conditions. We expect that the retina should encode many of the same stimulus features across different light levels. One possibility is that the retina uses different codewords across different light levels, in which case the brain must change its decoding strategy as the light level changes. While single-cell responses are known to change across light levels (Fig. 1 and [Tikidji-Hamburyan et al., 2015]), these data provide the first evidence that a form of light-level invariance does exist at the population level.
For comparison, we also plot the persistence indices estimated from independent models in the light and dark adapted conditions (Fig. S9B). Here, the correspondence between the neural code in light and dark conditions was poor, with a correlation coefficient of 0.37 in Fig. S9B vs 0.90 in Fig. S9A.
Clearly, the form of invariance to light level that we observed in the population neural code (Fig. S9A) is a consequence of the correlations in the experiment. To test the reproducibility of this measure, we split the light data into two random, complementary halves, and inferred the k-pairwise maximum entropy model parameters for these two halves separately. The persistence indices display a high degree of similarity across these two random halves (Fig. S9C), indicating that this is a highly reproducible measure that can be well sampled in our data. The comparison with Fig. S9A also shows that the invariance of the population code to light level nearly approaches the limit set by sampling noise.

Figure S10: Zipf Relationships. Probability vs. rank for states from the light (light blue) and dark (dark blue) datasets, evaluated directly. On the left, all states are plotted for both datasets. On the right, only the probabilities of the states ranked from 50 to 1000 are plotted, and a linear fit to this subset of states is shown in dashed lines. The fitted slopes are −0.96 for the light and −1.11 for the dark datasets.

Zipf-like Relationships in the Distribution of Codewords
A separate line of work has investigated the presence of Zipf-like relationships in the probability distribution of neural codewords [Mora & Bialek, 2011; Schwab et al., 2014]. We largely avoided working with these quantities out of concerns about the adequacy of sampling that we could achieve.
In Fig. S10 we plot the Zipf relationship between the probability and rank of a state (estimated directly from the natural movie dataset in Experiment #1). While the slopes of a power law fit to our experimental data were close to -1, it is important to note that a clear power law relationship existed only over a small range of rank ∼ 200 to 1000. This leaves a substantial uncertainty in the estimate of the slope. But more fundamentally, we are not sure how to interpret deviations from a slope of -1 as well as how to think about deviations from the power law form itself, such as the "bump" near rank ∼ 70 in the dark and near rank ∼ 300 in the light (Fig. S10).
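The rank-range-restricted power-law fit can be sketched as follows (our naming): rank the state probabilities, then fit log P against log rank over the restricted range used in Fig. S10.

```python
import numpy as np

def zipf_slope(probs, rank_min=50, rank_max=1000):
    """Slope of a linear fit to log P vs log rank over a restricted
    rank range, as in Fig. S10.

    probs : probabilities of observed states, in any order.
    """
    p = np.sort(np.asarray(probs, float))[::-1]  # rank by probability
    ranks = np.arange(1, p.size + 1)
    sel = (ranks >= rank_min) & (ranks <= rank_max) & (p > 0)
    slope, _ = np.polyfit(np.log(ranks[sel]), np.log(p[sel]), 1)
    return slope
```

On an exact Zipf distribution (P ∝ 1/rank) this recovers a slope of −1; the sensitivity of the fitted slope to the chosen rank range illustrates the uncertainty discussed above.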
Nonlinear Susceptibility
We calculated the nonlinear susceptibility for the retinal population code in the light. This quantity is defined as (see Chpt. 2.5 in [Fischer & Hertz]):

χ^(nl) = (1/N) Σ_{i,j} (⟨r_i r_j⟩ − ⟨r_i⟩⟨r_j⟩)^2,

where the inner brackets denote averages over 2.5 · 10^5 Monte Carlo samples at the given temperature. The nonlinear susceptibility is a higher-order derivative of the free energy. This quantity is typically studied in spin glass models, where a peak in χ^(nl) indicates a spin glass phase transition at finite size. We evaluated this term in the pairwise maximum entropy models corresponding to the retinal population code in the light, as a function of system size N and temperature T (Fig. S11A).
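Given a matrix of Monte Carlo samples, one common form of this quantity (the spin-glass susceptibility of Fischer & Hertz; the paper's exact normalization may differ) can be computed as:

```python
import numpy as np

def nonlinear_susceptibility(samples):
    """Spin-glass (nonlinear) susceptibility from MC samples,
        chi_nl = (1/N) * sum_ij (<r_i r_j> - <r_i><r_j>)^2.

    samples : (T, N) binary array of sampled population states.
    """
    m = samples.mean(axis=0)
    # Connected pairwise correlations over the sample set.
    corr = samples.T @ samples / samples.shape[0] - np.outer(m, m)
    return (corr ** 2).sum() / samples.shape[1]
```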
When the nonlinear susceptibility is divergent, as it is near the freezing transition in a spin glass (a high-order critical point), its value at a given temperature increases with increasing system size. Because the nonlinear susceptibility is a sum of squared correlations, and the correlations can only increase as larger system sizes are considered (near a critical point of this type), the value of the nonlinear susceptibility grows everywhere in the vicinity of such a critical point (see, for example, [Aspelmeier et al., 2016]).
What we observe is that the nonlinear susceptibility decreases with system size for T < 1 (see Fig. S11A). At T = 1, the inference constraint on correlations means that the nonlinear susceptibilities are all equal. For T > 1, there is a large peak in the nonlinear susceptibility that increases and sharpens with system size. Somewhere around T = 1.5 the nonlinear susceptibility of the full, N = 128 neuron network falls below the nonlinear susceptibility of the N = 100 and N = 70 neuron networks (see Fig. S11B). This is not the dependence on system size that one would observe near a spin glass critical point. Similar cross-overs were seen at higher temperatures; for instance, the nonlinear susceptibility for N = 70 drops below that of N = 40 around T = 1.9.
The fact that these intersections occur at different locations in temperature means that there is no single point at which the nonlinear susceptibility starts to decrease with system size. That rules out the existence of a second phase transition at any higher temperature. Finally, this type of behavior in the nonlinear susceptibility is very similar to the sharpening that was observed in the specific heat, which can be interpreted to mean that no additional discontinuities were found at these higher order derivatives of the free energy. This evidence is then again consistent with the underlying phase transition being first-order.

Figure S11: Nonlinear susceptibility. The nonlinear susceptibility is evaluated here during an annealing process (same as for the specific heat calculations), for the networks of different system size N, taken from the light condition. Error bars are standard error of the mean, evaluated over ten different networks of the same size N, as in the specific heat calculations earlier. B is the same as A, magnified along the horizontal axis.