Signatures of criticality arise from random subsampling in simple population models

The rise of large-scale recordings of neuronal activity has fueled the hope to gain new insights into the collective activity of neural ensembles. How can one link the statistics of neural population activity to underlying principles and theories? One attempt to interpret such data builds upon analogies to the behaviour of collective systems in statistical physics. Divergence of the specific heat—a measure of population statistics derived from thermodynamics—has been used to suggest that neural populations are optimized to operate at a “critical point”. However, these findings have been challenged by theoretical studies which have shown that common inputs can lead to diverging specific heat. Here, we connect “signatures of criticality”, and in particular the divergence of specific heat, back to statistics of neural population activity commonly studied in neural coding: firing rates and pairwise correlations. We show that the specific heat diverges whenever the average correlation strength does not depend on population size. This is necessarily true when data with correlations is randomly subsampled during the analysis process, irrespective of the detailed structure or origin of correlations. We also show how the characteristic shape of specific heat capacity curves depends on firing rates and correlations, using both analytically tractable models and numerical simulations of a canonical feed-forward population model. To analyze these simulations, we develop efficient methods for characterizing large-scale neural population activity with maximum entropy models. We find that, consistent with experimental findings, increases in firing rates and correlation directly lead to more pronounced signatures. Thus, previous reports of thermodynamical criticality in neural populations based on the analysis of specific heat can be explained by average firing rates and correlations, and are not indicative of an optimized coding strategy. We conclude that a reliable interpretation of statistical tests for theories of neural coding is possible only in reference to relevant ground-truth models.


Introduction
Recent advances in neural recording technology [1,2] and computational tools for describing neural population activity [3] make it possible to empirically examine the statistics of large neural populations and search for principles underlying their collective dynamics [4].One intriguing hypothesis that has emerged from this approach is the idea that neural populations might be poised at a thermodynamic critical point [5][6][7], and that this might have important consequences for how neural populations process and encode sensory information [7,8].As similar observations have been made in other biological systems (e.g.[9][10][11]), it has been suggested that this might reflect a more general organizing principle [12].
In the case of neural coding, evidence in favour of this hypothesis has been put forward by a series of studies which measured neural activity from large populations of retinal ganglion cells and reported that their statistics resemble those of physical systems at a critical point [7,8].Using large-scale multielectrode array recordings [2], spike-sorting methods that scale to large (N > 100) populations [2,13] and specially developed maximum entropy models [3,12,[14][15][16][17][18][19], Tkačik et al. observed that the specific heat-a global population statistic which measures the range of probabilities of spike patterns-diverges as a function of population size.In addition, when an artificial temperature parameter is introduced, specific heat is maximised for the statistics of the observed data rather than for statistics which have been perturbed by changing the temperature parameter.These properties of retinal populations resemble the behaviour of physical systems at critical points, and gave rise to the hypothesis that neural systems might also be poised at critical points.
What neural mechanisms can explain these observations?It had been hypothesised that the properties of the system need to be finely tuned [7,12] to keep the system at a critical point, for example through adaptation [20].A competing hypothesis [21][22][23] had stated that generic mechanisms based on latent-variable models could be sufficient to give rise to activity data with these statistics, but neither of these theoretical studies had investigated mechanistic models of retinal population activity.Thus, subsequent studies advocating criticality in the retina [7,8], continued to interpret their observations as indication for the retina to be poised at a special state that is advantageous for coding.It is therefore still an open question as to whether previously reported signatures of criticality reveal a new mechanism of retinal coding, or they are a direct consequence of the standard enconding models of retinal ganglion cell responses [24][25][26][27][28].
We here challenge the conclusion of studies which used tools from statistical physics to search for signatures of criticality by applying exactly the same data analysis approach to a simplistic feed-forward cascade model of retinal ganglion cell responses and showing that it exhibits the same effects.Focusing on how the specific heat of this simulated data varies with population size and temperature, we show that this simple model exhibits signatures of criticality and reproduces the experimentally reported dependence on different stimulus ensembles [7].We provide a theoretical analysis of an analytically tractable model [21,29,30], and show mathematically that it exhibits signatures of criticality under a wide range of parameters.
This analysis also points to a subtle but important difference between how practical neural data analysis and theoretical studies often differ in how they study scaling behaviour of the system: Whereas many theoretical studies describe different systems of size N , in practical neural data analysis populations of different size are typically constructed by randomly subsampling a large (but fixed) recording of neural activity.We show that this sampling process produces 'signatures of criticality' whenever neural data has non-zero correlations, which could arise from a shared stimulus drive, recurrent connectivity or global state-fluctuations [31][32][33][34].

Signatures of criticality arise in a simple model of retinal ganglion cell activity
A hallmark of criticality is that the specific heat of the model diverges when the temperature reaches the critical temperature [5].Tkačik et al. [7] developed a statistical approach for translating this concept to neural data analysis.In their analysis, neural populations of different size n are generated from the full recording by randomly subsampling the entire population.The statistics of activity for each population of size n are characterized using a maximum entropy model [3,14,15,17,18].Finally, the maximum entropy models are perturbed by introducing a temperature parameter, and specific heat is computed for each population size n and temperature T from the (perturbed) maximum entropy model fit.Divergence of specific heat with population size n, and a peak of the specific heat near unit temperature T = 1 (the 'temperature' of the original data) are interpreted as evidence for the system being at a critical point [7].
To test if these signatures of criticality can be reproduced by canonical properties of retinal circuits, we first created a simple phenomenological model of retinal ganglion cell (RGC) activity based on linear-nonlinear neuron models [24,25,28].In this model (Fig. 1a), we assumed retinal ganglion cells to have centre-surround receptive fields [28,35] with linear spatial integration [36], sigmoid nonlinearities and stochastic binary spikes, i.e. in each time bin of size 20ms, each neuron i either emitted a spike (x i = 1) or not (x i = 0).We used a sequence of natural images (see Methods 3.1 for details).In addition to the feed-forward drive by the stimulus, nearby neurons received shared Gaussian noise, mimicking common input from bipolar cells [37].Thus, cross-neural correlations in the model arise from correlations in the stimulus, receptive-field overlap and shared noise, but not from lateral connections between RGCs.Parameters of the model were chosen to approximate the statistics of receptive-field centre locations of RGCs (Fig. 1b), as well as histograms of firing rates, pairwise correlation-coefficients and population spike-counts (Fig. 1d).Nevertheless, the model clearly cannot accurately capture all statistics of real RGC activity: Our goal was not to provide a realistic model of retinal processing.Rather, we wanted to directly test whether canonical mechanisms of retinal processing (overlapping centre-surround receptive fields, spiking nonlinearities, shared Gaussian noise) are sufficient for the signatures of criticality to arise, or whether this would require fine-tuning or sophisticated neural circuitry.As a next step in the analysis, we subsampled populations of different size n by uniformly sampling cells from our simulated recording of size N = 316 neurons.For each population we fit a 'K-pairwise' maximum entropy model [3].This model assigns a probability P (x) to each spike-pattern x.It is an extension of pairwise maximum entropy models (i.e.Ising models) [14,15] which reproduces the firing rates and pairwise covariances which has additional terms which make sure that the model also captures the population spike-counts of the data [3] (see Fig. 1d, and Methods 3.2 for details of model specification and parameterisation).As we needed to efficiently fit this model [38][39][40] to multiple simulated data sets, we developed an improved fitting algorithm based on maximum-likelihood techniques using Markov chain Monte Carlo (MCMC) techniques (Matlab implementation available at https://github.com/mackelab/CorBinian),building on work by [16].In particular, we made the most computationally expensive component of the algorithm, the estimation of pairwise covariances via MCMC sampling, more efficient by using a 'pairwise' Gibbs-sampling scheme with Rao-Blackwellisation [41,42] (see Methods 3.2 for details).Rao-Blackwellisation resulted in a reduction of the number of samples (and computation time) needed for achieving low-variance estimates of the covariances by a factor of approximately 3 (Fig. 2a, Suppl.Inf.S1).After parameter fitting, the model reproduced the statistics of the simulated data relevant for the model (Fig. 2b).Using the formalism developed by Tkačik et al., we then introduced a temperature parameter which rescales the probabilities of the model, Here, temperature T = 1 corresponds to the statistics of the empirical data.By changing T to other parameter values one can perturb the statistics of the system [43]: Increasing temperature leads to models with higher firing rates and weaker correlations (Fig. 2c), with P T (x) approaching the uniform distribution for very large T .If the temperature is decreased towards zero, P T (x) has most of its probability mass over the most probable spike patterns.In many probabilistic systems, lowering T leads to increasing correlations, as the systems then 'jumps' between several different patterns and thus the activation probabilities of different elements are strongly dependent on each other.However, for the simulated RGC activity, the sparsity of data leads to a decrease of correlations: At a bin size of 20 ms [14], the most probable state is the silent state, followed by patterns in which exactly one neurons spikes.In an example population of size n = 100, 53.8% of observed spike patterns contain at most one spike.When decreasing the temperature to T < 1, patterns with at most one spike dominate the systems even more strongly: For the same population and temperature T = 0.8, we find 95.6% of observed patterns to contain at most one spike.Thus when the temperature is lowered, the shift in probability mass to single-spike patterns decreases correlations.
We compute the specific heat of a population directly from the probabilistic model fit to data [7], using i.e. the variance of the log-probabilities of the model, normalised by n [7].Specific heat is minimal for data in which all patterns x that occur in the data are equally probable, and big for data in which pattern-probabilities span a large range.We used MCMC-sampling to approximate the variance across all probabilities (see Methods 3.3), and used this approach to calculate, for each population of size n, the specific heat as a function of temperature (Fig. 2d).We found that the temperature curves obtained from the simulated data qualitatively reproduces the critical features of those that had been observed for large-scale recordings in the salamander [7] and rat [8] retina: The peak of the curves diverges as the population size n is increased, and moves closer to unit temperature for increasing n (Fig. 2e).Consistent with experimental findings, we found that specific heat diverged linearly with population size (Fig. 2e).These results show that signatures of criticality arise in a simple feed-forward LN cascade model based on generic properties of retinal ganglion cells, and do not require finely tuned parameters or sophisticated circuitry.2.2 Specific heat diverges linearly in flat population models In the phenomenological population model above, we observed that specific heat grew linearly with population size, as it did in previous studies built on experimental data [7,8,44].Can we understand this phenomenon analytically in a simplified model?In particular, is the divergence indeed linear, and what determines its rate?To address these questions, we replaced the K-pairwise maximum entropy model by a model which only captures the distribution of population spike-count K = i x i [21,29,30,32] of the data, and in which all neurons have the same mean firing rate and pairwise correlations.This 'flat' model can be fit to data by matching its parameters to the population spike-count distribution, side-stepping the computational challenges of the K-pairwise model (see Methods 3.4 for details).We here introduce a new parametrised flat model in which the spike-count distribution is given by beta-binomial distribution P (K|α, β, n), reducing the number of free parameters from n to 2. The beta-binomial model is a straightforward extension of an independent (i.e.binomial) population model: At each time-point, a new firing probability p is drawn from a beta-distribution with parameters α and β, and neurons then spike independently with probability p.The fact that the underlying fluctuations in p are shared across the population leads to correlations in neural activity.This beta-binomial model provided a good fit to the population spike-count distributions of the simulated data (Fig. 3a) across different population sizes n (Fig. 3b).The best-fitting parameters α and β did not vary systematically across population sizes, and converged to values of α = 0.38 and β = 12.35 (Fig. 3 c), corresponding to an average firing rate of µ = 1.5 Hz (i.e. each neuron has a probability of spiking of 0.03 in each bin) and average pairwise correlations of ρ = 0.073.The beta-binomial model also provided good fits to population spike-count distributions published in [30] and [32], [8] (Fig. 3d).
When we applied this flat model to populations subsampled from the RGC simulation, we could qualitatively reproduce the heat curves of the K-pairwise model.In particular, we found a linearly diverging peak that moved closer to T = 1 as the population size was increased (Fig. 3 e).Thus, linear divergence of specific heat is qualitatively captured by flat models.We note that the absolute values of the specific heat do not match those of the K-pairwise model or simulated data, but are substantially bigger (c max = 4.02 at T = 1.07).
One of the difficulties of interpreting the scaling behaviour of maximum entropy models fit to neural data is the fact that the construction of the limit in n differs from those studied in statistical physics: In statistical physics, different 'n' typically correspond to systems of different total size, and the parameters are scaled as a deterministic function of n (e.g.drawn from a Gaussian with variance proportional to 1/n in spin-glasses [45,46]).In studies using maximum entropy models for neural data analysis, populations of different n are obtained by randomly subsampling a fixed large recording, and the parameters are fit to each subpopulation individually.Thus, there is no analytical relationship between population size and parameter values in this approach, and this has made it hard to determine whether the scaling observed in these studies is surprising or not.
With the flat model, it is possible to analytically characterise the behaviour of the specific heat for large population sizes for this sampling process: We assume that each population of size n is randomly drawn from an underlying, infinitely large flat population model [21,29].Using this approach, one can mathematically show (see Methods 3.4, Suppl.Inf.S2. 3 and [21] for details) that for virtually all flat models, the specific heat diverges linearly at unit temperature, but not for any other temperature T > 1 or T < 1 (Suppl.Inf.S2.4).As a consequence, the peak must move to T = 1 as n is increased.Hence almost any data set analysed with the methods developed by [7] will under the flat model exhibit signatures of criticality.These results hold irrespective of the details of the properties of the full populations that the subpopulations are sampled from, including full populations that are more weakly or more strongly correlated than real neural populations, and even for models with unrealistic population spike-count distributions (see Suppl.Fig. S3 for an illustration).There are only two exceptions: The first one is a model in which all neurons are independent (i.e. a binomial population model), and the second one is a flat pairwise maximum entropy model-indeed, this is the only flat model with non-vanishing correlations for which the specific heat does not have its peak at unit temperature (see Fig. 3f and [21] for an illustration).

Strong neural correlations lead to fast divergence of specific heat.
The rate at which the specific heat diverges provides a mean of quantifying the 'strength' of criticality.What is the relationship between correlations in a neural population and the rate of divergence?To study how the specific heat rate c = c(T = 1)/n depends on the strength of correlations, we used a beta-binomial model to generate simulated data with firing rate µ = 1.5 Hz (i.e. each neuron has a probability of spiking of 0.03 in each bin), and different (population-wide, as all neuron pairs have the same correlation) pairwise correlation coefficients ρ ranging from ρ = 0.01 to ρ = 0.25 (Fig. 4a).We found that the heat curves had the same shape as in the analyses above, with a peak that increases and moves to unit temperature (Fig. 4b).Comparing the results for different specified correlation strengths within the populations, we found that the specific heat rates c increased strictly monotonically with ρ (Fig. 4b,c).For the beta-binomial model, the large-n value of c can be calculated analytically (see Suppl.Inf.S3.2 for details) as a function of the parameters α and β, This analytical evaluation of c (valid for large n) was in good agreement with numerical simulations (Fig. 4c left).In the case of weak correlations ρ, equation 3 can be simplified: In this case, the specific heat rate is proportional to the strength of correlations (see Suppl.Inf.S3.1 for details), i.e.
This expression can also be derived from the Gaussian model in [8] equation ( 4), by inserting the expected values of the mean and variance of the population spike-count under random subsampling.Thus, at least for flat models and the analysis based on specific heat proposed previously, 'being very critical' is a consequence of 'being strongly correlated'.

Specific heat depends on average correlation strength in Kpairwise model
Is the relationship between the strength and correlations and the 'strength' of criticality (i.e. the divergence rate of specific heat) also true in more general models?In the original study [7], specific heat was computed from K-pairwise model fits to RGC activity resulting from three different kind of stimuli: Checker-board stimuli (which do not have long-range spatial correlations, although stimulus-driven cross-neural correlations can arise from receptive field overlap), natural images, which exhibit strong spatial correlations, and full-field flicker (which constitutes an extreme case of spatial correlations since all pixels in the display are identical).Tkačik et al. found that specific heat diverges in all three conditions, and interpreted this as evidence that signatures of criticality are not 'inherited from the stimulus' [7].Comparing the specific heat values for n = 100 reported in [7] across stimulus conditions, Tkačik et al. found the smallest peak for checkerboard stimuli (c max = 0.54 for n = 100), intermediate for natural images (c max = 0.92) and strongest for full-field flicker (c max = 2.4).We tested whether we find the same pattern of results in K-pairwise model fits to our retinal simulation.Specific heat divergence also followed the pattern predicted by the flat models (Fig. 4d): Checkerboard (which gave an average correlation between neural activity of ρ = 0.033) had the smallest peak (peak specific heat c max = 0.87) followed by natural images (ρ = 0.075, c max = 1.32) and full-field flicker (ρ = 0.341, c max = 3.09).We conclude that the experimental evidence-which showed that the specific heat diverges, and how the speed of divergences depends on the stimulus ensemble-is entirely consistent with a simple, feed-forward phenomenological model of retinal processing.

Sources of criticality-inducing correlations in neural activity
In the above, we showed that a beta-binomial spike count distribution can be sufficient for signatures of criticality to arise.For this to hold we need the variance of the population spike-count to grow at least quadratically with population size, i.e.Var(K) ∝ n 2 .The variance of the population spike-count is equal to the sum of all variances and covariances in the population, Var(K) = n i=1 Var(x i ) + i =j Cov(x i , x j ).A sufficient condition for signatures of criticality to arise in these models is that the average covariances (and hence 9/36 correlations) between neurons are independent of n, [5,6].One possible correlation structure which has this properties are so called 'infinite range' correlations (Fig. 5a): correlation between neurons do not drop off to 0 for large spatial distances.In this case, adding more and more neurons to a population will not change the average pairwise correlation within the population (Fig. 5b).
In neural systems, there are at least two reasons that can facilitate the required correlation structure.First, as shown above, the choice of stimuli has a clear effect on the heat capacity indicating an important effect of input-induced correlations.In particular for full-field flicker stimuli infinite-range correlations are to be expected but also white noise input can generate correlations of considerable extent due to overlapping receptive fields.Second, even a neural population which does not have infinite range correlations can appear critical if it is randomly subsampled during analysis: Suppose that different populations of size n are obtained as above by (uniformly) subsampling a large recording of size N .Then, for any correlation structure on the full recording (including limited-range correlations, Fig. 5c), the average correlation in a population of size n will be independent of n (Fig. 5c): If neurons are randomly subsampled from the large recording, then the pairwise correlations in each subpopulation are also a random subsample of the large correlation matrix.As a consequence, the average correlation will be independent of n, and specific heat will diverge with constant slope (Fig. 5d).In contrast, if different population sizes are constructed by taking into account the spatial structure of the population (i.e. by iteratively adding neighbouring cells) then the average correlation in each subpopulation will drop with n, and the slope of specific heat growth will decrease with population size.
In our RGC simulation, correlations did drop off to zero with spatial distance for checkerboard and natural images, but not for full-field flicker (Fig. 5e).Correlations in the full-field flicker condition initially drop off due to distance-dependent shared noise, but eventually saturate at a level far above zero that is determined by the full-field stimulus.Due to these strong infinite-range correlations, both spatially structured sampling and uniform sampling then give rise to linear growth in specific heat (Fig. 5f left).For the other two stimulus conditions, however, the choice of subsampling scheme does result in markedly different behavior of the specific heat growth: Both for natural images and checkerboard stimuli, we can see the rate of growth decreases for large n under spatially structured subsampling (Fig. 5f centre, right).This effect will be more pronounced for larger simulations, and in additional simulations we found specific heat to saturate completely once populations are substantially bigger than the spatial range of correlations.
In summary, populations will exhibit critical behaviour if correlations have infinite range (over the size of the recording), irrespective of the sampling scheme.In addition, if a population is randomly subsampled (as was done in [7,8]), then signatures of criticality will arise even if the underlying correlations have limited range.10/36 3 Materials and Methods

Numerical simulation of retinal ganglion cell activity
Retina simulation: We simulated a population of N = 316 retinal ganglion cells as linear threshold neurons whose receptive fields were modelled by difference-of-Gaussian filters with ON-centres [25,28,36].The simulation comprised two subgroups of cells with different receptive field sizes (surrounds 56µm and 30µm in retinal space, centres 28µm and 15µm, respectively, one third cells with large receptive fields).For both subgroups, the weight of the surround was 0.5 of the centre weight.Locations of receptive field centres were based on a reconstruction of 518 soma locations from a patch of mouse retina [47].
As the reconstructed locations in that data set also comprised about 40% amacrine cell somata, we randomly discarded 40% of the cell locations.The resulting patch of retina covered an area of 200 × 300µm 2 , corresponding to 100 × 150 pixels in stimulus space.Correlated noise across neurons was modelled using correlated additive Gaussian noise.
Correlations dropped off exponentially with soma distance with a decay constant of τ = 30µm i.e. noise covariance matrix was chosen as Σ = σ 2 noise (aI n + be −∆/τ ), where ∆ ij is the distance between neurons i and j and a 2 + b 2 = 1.We set σ noise = 0.022 and a = 0.45.We modelled neural spiking in discrete time using 20ms bins.In each bin t, the total input z i (t) to neuron i was given by z i (t) = w i s(t) + i (t), where w i is the receptive field of neuron i, s(t) the vectorised stimulus and i (t) the input noise of neuron i.A neuron in a given bin is active (x i = 1) if z i + d > 0.5 and inactive (x i = 0) otherwise, with offset d = 0.168.Parameters of the simulation (centre and surround sizes, relative strength of centre and surround, magnitude and correlations of noise, spiking threshold) were chosen to roughly match the statistics of neural spiking (firing rates, pairwise correlations, population activity counts) reported in studies of salamander retinal ganglion cells [2,3,14].(Code will be available at www.mackelab.org/code).
Stimuli: We used three types of stimuli for this study: natural images, checkerboard patterns and full-field flicker.For natural image stimuli, we used a sequence of 101 images of meadow sceneries taken from low hight.Each image was 400 × 400 pixels, and each image was presented for 20ms with 300 repetitions total.The luminance histograms of the images were transformed to a normal distribution with mean 0.5 and pixel values between 0 and 1.
For the full-field flicker stimulus, luminance levels were drawn from a Gaussian distribution with mean µ = 0.5 and variance σ 2 = 0.06.Checkerboard stimuli consisted of 80 × 80 tiles of size 5 × 5 pixels each.Luminance levels (from within the interval [0, 1]) of each tile were chosen to be either 0.15 or 0.77 with probability 0.5.The parameters of both stimulus sets were chosen to match the dynamic range of the simulated retinal ganglion cells.For both types of stimuli, 2000 images were generated and the image sequences were presented with 10 repetitions.To calculate specific heat as function of increasing population size, we randomly selected 10 subsamples of the full simulated population of N = 316 cells at population sizes n ∈ {20, 40, 60, 80, 100, 120} by uniformly drawing n neurons out of the full population without replacement.

Modeling neural population data with maximum entropy models
Model definition: We modelled retinal ganglion cell activity by using a 'K-pairwise' maximum entropy model [3].In a maximum entropy model [48], the probability of observing the binary spike word x ∈ {0, 1} n for parameters λ = {h, J, V } is given by Here, the parameter vector h (of size n × 1) and the upper-triangular matrix J ∈ R n×n correspond to the bias-terms and interaction terms in a pairwise maximum entropy model (also known as an Ising model or spin-glass) [14].The term K(x) = n i=1 x i denotes the population spike-count, i.e. the total number of spikes across the population within a single time bin, and the indicator-term δ (K = k) is 1 whenever the population spike-count equals k, and is 0 otherwise.The term n k=0 V k δ (K = k) was introduced by [3] to ensure that the model precisely captures the population spike-count distribution of the data using n additional free parameters.The partition function Z for given λ is chosen such that the probabilities of the model sum to 1.
Parameter fitting: To fit the model parameters λ = {h, J, V } to a data set D = {x (1) , x (2) , . . ., x (M ) }, we maximised the penalised log-likelihood [49,50] of the data D 12/36 under the model, Here, the l1-penalty controlled the magnitudes of parameters h, J, the term J 1 favoured sparse coupling matrices, and the regularisation term Σ on the V -parameters ensures that the terms controlling the spike count distribution vary smoothly in k (Suppl.Inf.S1).This smoothness prior is particularly important for large spike counts, as it makes it possible to interpolate parameters for which the number of observed counts is small.In maximum entropy models, exact evaluation of the penalised log-likelihood and its gradients requires the calculation of expectations under the model, E[x i ], E[x i x j ] or equivalently cov(x i , x j ), and P (K = k) (Suppl.Inf.S1.1), which in turn requires summations over all 2 n possible states x and is prohibitive for n > 20.Following previous work [16], we used Gibbs sampling to approximate the relevant expectations (Suppl.Inf.S1.1 for derivations and implementation details).We used two modifications over previous applications of Gibbs sampling to fitting maximum entropy models to neural population spike train data, with the goals of speeding up parameter learning and alleviating memory usage: First, we use Rao-Blackwellisation [41,42] to speed up convergence of the estimation of covariances of x.We used pairwise Gibbs sampling (blocked Gibbs with block size 2), where each new sample in the MCMC chain was obtained by updating two entries i and j of x at a time, rather than just a single entry.This allowed us to get estimates of the conditional probabilities P (x i x j = 1|x ∼{i,j} ), and to use them to speed up the estimation of the second moment E[x i x j ] from empirical average of these conditional probabilities (Suppl.Inf.S1.1).
Second, we used a variant of coordinate ascent that calculated all relevant quantities as running averages over the MCMC sample, and thereby avoided having to store the entire n × M MCMC sample in memory [16], where M is the length of the sample.Because all features of the maximum entropy model are either 0 or 1 (x i , x i x j and the indicator function for the spike count), the gain in log-likelihood obtainable from either updating a single element of h or J [16,40], or from updating all V simultaneously (but not from updating multiple entries of h and J) can be computed directly from MCMC estimates of E[x i ], E[x i x j ] and P (K = k) (Suppl.Inf.S1.2).For each iteration, we calculated the gain in log-likelihood for each possible update of h i , J ij and full V , and picked the update which led to the largest gain [16,51].
We measured the length of Markov chains in sweeps, where one sweep corresponds to one round of n(n − 1)/2 Markov chain updates that encompasses all pairs of entries of x in random order.We set a learning schedule that started at 800 sweeps for the first parameter update and doubled the number of sweeps in the chain after each set of 1000 parameter updates.We monitored convergence of the algorithm using a normalised mean square error between empirical E[x i ], cov(x i , x j ), P (K = k) and their estimates from the MCMC sample.For normalisation, we used the average squared values of the target quantity, e.g.

Calculating specific heat and temperature curves
Specific heat calculations: To investigate thermodynamic properties of neural population codes, Tkačik et al [7] introduced a temperature parameter T for equation 5: Model fits are obtained at T = 1, and the temperature parameter T is scaled to study the system (i.e.characterised by P T (x|h, J, V ) for T = 1).We note that varying T , in effect, modulates probabilities by exponentiating them with 1/T , and that the family of probability distributions obtained by varying T can be constructed for any distribution, not just maximum entropy models.For large temperatures P T approaches a uniform distribution (P T (x) ≈ 2 −n for each x), whereas for small temperatures it converges to a singleton, P T (x * ) ≈ 1 with x * = argmax x (P T =1 (x)).The specific heat, as given in equation 2, can be obtained from the variance of the log-probabilities of the model.As the variance in practice can not be outright computed for n beyond 20, we obtained estimates of c(T ) using a pairwise Gibbs sampler.We note that the specific heat does not depend on Z T , as changing Z T results in a constant, additive shift in log-probabilities which does not affect the variance.We tracked the variance of log-probabilities over an MCMC chain x (1) , . . ., x ( M ) of length M sampled at temperature T , For each population, we evaluated c(T ) for 31 temperatures between T = 0.8 and T = 2, and found the Gibbs sampler to provide reliable estimates over this temperature range.We used a burn-in of 2 * 10 4 sweeps, and ran the sampler for n 100 2 × 4h of CPU time, resulting in between 9.97e5 and 1.72e6 sweeps (mean ± std) for n = 100 (i.e. between 4.94e9 and 8.52e9 sampled individual spike words).

Simplified population models and the beta-binomial model
For the theoretical analysis, we adopted a class of population models (here referred to as 'flat' models) in which all neurons have identical mean firing rates, pairwise correlations and higher-order correlations [3,21,29,52,53].Such a model is fully specified by the population spike-count distribution P (K = k), and all spike words with the same spike count are equally probable.As a result, the probabilities of individual patterns x can be read off from the spike count distribution by whenever n i=1 x i = k.In a maximum entropy formalism, this model can be obtained by setting h i = 0 and J ij = 0 for all i, j ∈ {1, . . ., n} and only optimising entries of V .Without loss of generality, we fixed fixed V 0 = 0 [30], resulting in n degrees of freedom for the model.
In flat models, it is possible to explicitly construct a limit n → ∞ which will help us understand population analyses performed on experimental data: We assume that there is a spike count density f (r), r ∈ [0, 1], which describes the population spike-count distribution of an infinitely large population.f (r) denotes the probability density of a fraction of r neurons spiking simultaneously.Finite-size populations of n cells are then obtained as random subsamples out of this infinitely large system.Based on previous findings by [21], we show in Suppl.Inf.S2.3 that, in this construction, flat models always exhibit a linear divergence of specific heat, unless the limit f (r) is given by either a single delta peak or a mixture of two symmetric delta peaks.These two models corresponds to systems that (for large n) either behave like a fully independent population (whose spike count distribution converges to a single delta peak), or a population described by a pure pairwise maximum entropy model (which converges to two delta peaks).In particular, any flat model with higher-order correlations [18,27,52,53], or a non-degenerate f (r), will exhibit 'signatures of criticality'.Furthermore, we show that, for continuous f (r), c(T ) does not diverge for any T = 1.In combination, these results show that the peak of the specific heat is mathematically bound to converge to T = 1 for n → ∞ in this model class.
We further simplified the flat model by re-parametrising P (K = k) by a betabinomial distribution, thereby reducing the number of parameters from n to two, andimportantly-obtaining parameters which do not explicitly depend on n.In this model, and For simulated data, we found values for α, β extracted from the beta-binomial fits to populations of different sizes n to be stable over a large range of n (Fig. 3b).We used the beta-binomial parameters obtained from the largest investigated n to estimate the divergence rate c for n → ∞.

Discussion
An intriguing hypothesis about the collective activity of large neural populations has been the idea that their statistics resemble those of physical systems at a critical point.Using a definition of criticality which is based on temporal dynamics with power-law statistics, numerous studies have reported and studied critical behaviour in neural population activity [8,20,54,55].Multiple possible mechanisms for these dynamics have been proposed (e.g.[20,56,57]).It has been argued that such temporal dynamics might be beneficial for neural computation and communication [20,58,59] (see [5] for an overview).More recently, a second line of studies [5-8, 11, 12] has studied the statistics of time-instantenous patterns of neural activity using tools from statistical mechanics, and argued that they also exhibit critical behaviour.This hypothesis could open up further questions on how the system maintains its critical state, and what implications this observation has for how neural populations encode sensory information and perform computations on it.Similarly, signatures of criticality have also been observed in natural images [11] and small cortical populations [6], and have been studied using the theory of finite-size scaling and critical exponents [6].It has been argued that systems close to a critical point might be optimally sensitive to external perturbations [6] and that the large dynamic range of the code (i.e. the large variance of log-probabilities) might be beneficial for encoding sensory events which likewise have a large distribution of occurrence-probabilities [17].Alternatively, generic mechanisms could be sufficient to give rise to activity data with these statistics.We had demonstrated in a previous theoretical study [21] that a simple models with common input can exhibit signatures of criticality.More recently, Schwab et al. [22] and Aitchison et al. [23] elaborated on these findings, showing that common input (or other latent variables which lead to shared modulations in firing rates) can give rise to Zipf-like scaling of pattern probabilities (a second signature of criticality).Mathematically, Zipf's Law is equivalent to stating that the plot of entropy vs energy (i.e.log-probability) is a straight line with unit slope [22,23].Schwab et al [22] showed that particular latent variable models give rise to Zipf's law.This result was generalized by [23] which showed that, under fairly general circumstances, high-dimensional latent variable models exhibit a wide distribution of energies (i.e.log-probabilities) and hence a large specific heat.In addition, they showed that large fluctuations in the specific heat are (under some additional assumptions) sufficient to achieve Zipfian scaling.While it has also been argued that the use of data-sets which are too small might give rise to spuriously big specific heats [60]-while this is true in principle, additional analyses e.g. in [7] show that their results are robust with respect to data-set size.
However, neither of these previous theoretical studies analysed mechanistic models of neural population activity, nor did they have tools for studying population statistics in large simulations or recordings, and they were therefore limited to studying very small (N < 20) systems.It has thus been an open question of whether and how these theoretical considerations can account for effects observed in retinal ganglion cells.We here showed that surprisingly simple mechanisms are sufficient for two key signatures of thermodynamic criticality-a divergence of specific heat and a peak of the specific heat near unit temperature-to arise.
We found that neural population activity exhibits signatures of criticality whenever the average correlation in population of different sizes is larger than zero and does not depend on population size.In the thermodynamic analysis of physical systems at equilibrium, long-range correlations typically vanish in the thermodynamic limit.In neural systems, however, such 'criticality-inducing' correlations can arise as a consequence of various factors: In a local patch of retina, retinal ganglion cells have a large degree of receptive field overlap, and natural stimuli also contain strong spatial correlations.This can lead to correlations which do have unlimited range within the experimentally accessible length scales.Thus, fluctuations in the stimulus will lead to common activity modulations amongst neurons within the population.Empirically, activity correlations between pairs of retinal ganglion cells only fall of slowly with the distance between somas (or receptive field centres) [28].Similarly, Mora et al [8] used a moving-bar stimulus with strong temporal correlations, and found that including activity from multiple time-lags markedly increase the strength of specific heat.We hypothesise that this increase in specific heat is a consequence of temporal correlations being stronger than inter-neural correlations in this stimulus condition.In addition, firing rates of cortical neurons are modulated by global fluctuations in excitability [31][32][33][34], resulting in neural correlations with infinite range.
Finally, we showed that criticality-inducing correlations arise as a consequence of constructing different subpopulations by uniformly subsampling a large recording with correlations.Signatures of criticality are entirely consistent with canonical properties of neural population activity, and require neither finely-tuned parameters in the population, nor sophisticated circuitry or active mechanisms for keeping the system at the critical point.Signatures of criticality are likely going to be found not just in retinal ganglion cells, but in multiple brain areas and model systems.These observation raise the question of whether signatures of criticality are really indicative of an underlying principle, or rather are a consequence of viewing the statistics of neural populations through the lens of equilibrium thermodynamics.In order to realise the potential of large-scale recordings of neural activity in the search of a theory of neural computation, we will need data-analysis methods which are adapted to the specific properties of biological 16/36 data [4,19].
R n(n+3)/2+1 .The derivative of the log-likelihood with respect to any single parameter λ l , l = 1, . . ., n(n + 3)/2 + 1 is given by As can be seen from equation ( 14), the gradient of the log-likelihood vanishes if and only if the data means match the expectations of f (x) under the model.
To deal with data-sets of limited size, we maximised a regularised variant of the log-likelihood, Here, the matrix Σ implements a combined ridge and smoothing regression over V , with (n + 1) × (n + 1) identity matrix I and smoothing matrix S corresponding to a squared-exponential kernel [61].We set V 0 = 0 and accounted for this by conditioning on V 0 and correspondingly subtracted S 0• (σ S + σ I ) −1 S 0• T from Σ.We used σ h = σ J = 10 4 , σ S = 10, σ I = 400 and τ S = 10.
To fit maximum entropy models to large neural populations, one needs to 1. efficiently approximate the feature moments E λ [f (x)] needed for the gradients of both eq.( 13) and eq.( 15), which for large populations (n > 20) can not be calculated exactly 2. find efficient methods for updating the parameters λ.
We introduce two modifications over previous approaches to fitting maximum entropy models to neural data [16] to improve computational efficiency: 1. We used pairwise Gibbs sampling and Rao-Blackwellisation to considerably improve estimation of the second-order feature moments 2. The authors of [51] described a trick for efficiently updating the parameters in pairwise binary maximum entropy models: If one restricted updates to coordinatewise updates, then one can calculate the gain from updating a single variable in closed form, which makes it easy to select both the variable to update as well as the step-length in closed form.We show how this trick can be extended to allow a joint update of all the population-count features V .In addition, the gain in log-likelihood is linear in the feature-moments, which makes it possible to compute it from a running average over the MCMC sample, and avoids having to store the entire sample in memory at any point.
We describe our contributions in the sections S1.1 and S1.2, respectively.

S1.1 Pairwise Gibbs sampling and Rao-Blackwellisation
Following previous work [16], we used MCMC sampling to approximate the expectations of the feature functions f (x) under the K-pairwise model with parameters λ.These expected values E λ [f (x)] are required to evaluate the gradients of the (penalised) loglikelihood, as well as the log-likelihood gains resulting from parameter updates.As the number of pairwise terms grows quadratically with population size n, most of the parameters of the model P (x|λ) for large n control pairwise moments E λ [x i x j ].To make the estimation of these pairwise interactions more efficient, we implemented a pairwise Gibbs sampler that for each update step of the Markov chain samples two variables x i and x j , i = j, i, j ∈ 1, . . ., n.This furthermore allowed us to 'Rao-Blackwellise' the single-cell and pair-wise feature components f i (x) = x i and f ij (x) = x i x j [41,42,62], i.e. to use the conditional probabilities P (x i = 1|x ∼i , λ) and P (x i x j = 1|x ∼{i,j} , λ) for moment estimation, instead of the binary x i and x i x j .Rao-Blackwellisation provably reduces the variance of the resulting estimators, and empirically resulted in substantially faster convergence of the MCMC-estimated model firing rates E λ [f i (x)], second moments E λ [f ij (x)], and thus also of the covariances cov λ (x i , S1).Unlike the binary variables x i , x i x j however, the conditional probabilities are real numbers from the interval (0, 1) and cannot be stored in memory-efficient sparse matrices.We thus implemented a running average over conditional probabilities that discards the current chain element immediately after drawing the next one, while keeping track of the quantities as m increases from 1 to MCMC sample size M .We also kept track of the non-Rao-Blackwellised estimates for the expectations of the population-level indicator feature functions E λ [f k (x)] = P (K = k|λ), with Kronecker delta function δ(x, y) = 1 if x = y, and δ(x, y) = 0 otherwise.We quantified the advantage of Rao-Blackwellising the Gibbs sampler with long Markov chains drawn from the K-pairwise maximum entropy model fits to populations of size n = 100 drawn from the simulated RGC data.For each investigated parameter fit, we ran two chains under different conditions: a first chain for which we Rao-Blackwellised the single-cell and pairwise feature moments, and a second chain for which we did not.These Markov chains were run for M = 10 6 sweeps and hence orders of magnitude longer than had occurred for the invidivual parameter updates within this study, which comprised 800 to 30000 sweeps, or 3.96 × 10 6 to 1.485 × 10 6 individual MCMC chain updates at n = 100.The long sample runs served to give an approximation for the "true" expected values of the target quantities of interest to us: firing rates E λ [f i (x)], covariances cov λ (x i , x j ) and population spike count distribution P (K = k|λ).
We quantified the speed of convergence of the estimates to the "true" expected feature moments by the normalised MSE between sampler-derived feature moments after any given length 0 < m < M of the MCMC chain and the results we got after the full chain length.After the full M = 10 6 sweeps, the Rao-Blackwellised and non-Rao-Blackwellised estimates on average differed by 1.7 × 10 −4 %, 0.013% and 4 × 10 −6 % normalised MSE for firing rates, covariances and population spike count distributions, respectively.We computed the distance to "truth" for each condition as the normalised MSE to the E λ [f (x)] averaged over both conditions.We obtained MCMC estimates for the feature moments of the K-pairwise maximum entropy models fits to 10 subsampled populations of n = 100 neurons each drawn from our retina simulation.Supplementary figure S1a displays the results for the two conditions, Rao-Blackwellised vs. non-Rao-Blackwellised, for each of the 10 investigated fits.
MSEs of firing rates for single-cell features E λ [x i ] did not benefit from Rao-Blackwellisation.This is expected, as each x i is sampled n − 1 times per sweep and thus the moments are already well estimated relative to the second-order features.For covariances cov λ (x i , x j ), normalised MSEs showed clear improvement under Rao-Blackwellisation, visible as an approximately constant offset between the avarages over all 10 parameter fits in the loglog-domain as seen in figure S1b.The normalised MSE on average was 3.19 times higher for non-Rao-Blackwellised (given by the downwards offset of the normalised MSEs of the Rao-Blackwellised estimates).The fraction of samples needed from Rao-Blackwellised runs to achieve the same normalised MSE on the pariwise moments than non-Rao-Blackwellised runs (given by the leftward offset of the normalised MSEs of the Rao-Blackwellised) overall was 32.02%.The fraction ranged from 34.93% at 800 sweeps to 31.74% at 30000 sweeps.The ratio of normalised MSEs was similarly stable, being 2.96 times higher at 800 sweeps and 3.27 times higher at 30000 sweeps for non-Rao-Blackwellised samples than for Rao-Blackwellised ones.S1.2 Exploiting the structure of the K-pairwise feature functions allows blockwise parameter updates.
As described in the previous section, we can use MCMC to obtain the expected values of the feature function E λ [f (x)] that are needed to to optimise the model parameters λ.To find the parameter setting λ which maximise the log-likelihood over the given data vectors x (m) , m = 1, . . ., M , we follow an iterative update scheme introduced previously [51], and extend it to the K-pairwise model.The update scheme optimises parameter changes λ new − λ old relative to a current parameter estimate λ old , rather than the parameters λ directly.The benefit of this scheme over standard gradient ascent on the regularised log-ligkelihood as in eq. ( 14) is that we can give closed-form solutions for optimal values of a single component λ l when temporarily holding all other components λ ∼l fixed.
Changing the current parameter estimate λ old to λ new leads to a change in loglikelihood of which requires the higher-order moments E λ old l∈I f l (x) for all I ⊆ J and J ⊆ {1, . . ., n} being the index set of components that are not set to zero.
The population spike count features f k (x), however, are mutually exclusive (only one of the n+1 features can be non-zero at any time), and therefore we can all parameters of V jointly, and still pull the expectation term outside of the expectation.For the population-spike count features f k (x), hereafter collectively called f V (x) ∈ {0, 1} n+1 , all such terms of order ||I|| > 1 are zero due to the sparsity of f V (x).When restricting the current parameter update of λ to only update components corresponding to V , we have and We obtained estimates of the values of P (k|λ old ) = E λ old [f k (x)] from the MCMC sample using the indicator functions f k (x), and optimising w.r.t.V new k , k ∈ {1,. . .,n} using gradient-based methods [64].
In summary, our update-scheme for maximising the log-likelihood proceeds as follows: For a given parameter vector λ old , we first estimate the expectation of the feature functions f i (x), f ij (x) and f k (x) using a running average over an MCMC sampling and Rao-Blackwellization.We then calculate, for each possible single-neuron parameter h i and each possibly pairwise term J ij the gain in penalised log-likelihood that we would get from updating it, using methods as described above and derived in [51].We additionally compute the gain in penalised log-likelihood that would result from optimising all n of the free V parameters jointly, using a convex optimization.Finally, we choose the update that brings the largest gain, and either update a single h i , a single J ij , or all V parameters.Subsequently, we again estimate the new feature functions using MCMC sampling given the current estimate of λ old ← λ new before we update again.We initialised the algorithm assuming independent neurons (i.e.setting each h i using the firing rate of each neuron, and leaving J and V zero).The algorithm then typically first updated all V parameters, before proceeding to jump between different J, h and V updates.

S2 Supplementary Text: Specific heat in simple models
We refer to a maxmimum entropy model as 'flat' if it is fully specified by the population spike count distribution P ( n i=1 x i = k), i.e. the model class studied in [21,29,30].In this model class, all neurons have the same firing rate µ and pairwise correlation ρ.As neuron identities become interchangeable, all n k possible patterns x with n i=1 = k are assigned the same probability P (k) = P (x) n k .In flat models, all relevant population properties can be computed from summing over n + 1 different spike counts, and one never has to (explicitly) sum over the entire 2 n possible spike patterns.

S2.1 A non-critical special case: Independent neurons
A special case of a flat model is an independent model in which all neurons have the same firing rates and zero correlations.Assuming independent spiking for each of the n neurons and a shared probability q ∈ [0, 1] to fire in a time bin, the distribution of population spike counts K = n i=1 x i is given by a binomial distribution, To compute specific heat capacities for the underlying neural population of size n, Entropy: Recalling that P (k) = P (x) n k , the entropy of the flat model for general P (k) can be written as Thus, the entropy of a flat model is Asymptotic entropy: We assume that P (k) has a limiting distribution f (r), where r ∈ [0, 1] is the probability density of a proportion of r neurons spiking simultaneously.Therefore, for large n Here, we used the fact that, for large n, As the first term only grows with log(n), and the second with n, we get that the entropy of a flat model, for large n, is given by

S2.3 Asymptotic specific heat in flat models at unit temperature
Next, we calculate the specific heat, first exactly and then for large n, and finally for weakly correlated models: First, the specific heat is given by Using E[log P (x)] = −H n , we get that For large n, we have that P For large n, this integral is dominated by the term in n 2 , and thus the specific heat is asymptotically given by Therefore, in general, the specific heat grows linearly, and hence diverges (see Fig S3).The only exception to this are models for which η(r) − h n = 0 for almost all r.This happens if f (r) is a delta-distribution, f (r) = δ(r − µ), in which case h n = η(µ) and therefore the integral vanishes.This occurs whenever the pairwise correlations do not grow proportionally with n 2 , as then the variance of the population spike count collapses in the limit.One such special case is the binomial distribution over k, as already demonstrated above using a more direct approach.There is a second special case, namely if f (r) is a combination of two δ-peaks at µ and 1 − µ (See [21] for details)this special case corresponds to a flat Ising model.

S2.4
In flat models, specific heat does not diverge for temperatures which are not equal to 1: Above we showed that at unit temperature, the specific heat for flat models (almost) always diverges.Now, we show that this is NOT true for any other temperature.This explains that, for any f (r), we will find that the unit temperature is 'special'.For large populations, this expression is dominated by the exponential term exp (n(1 − β)η(r)).For β < 1, the exponential term is in turn dominated by the mode of η(r), which is at r = 1 2 .Thus, for β < 1, f β (r) = δ(r − 1 2 ), a delta-peak at r = 1 2 .Conversely, for β > 1, the argument of the exponential has its peaks at r = 0 and r = 1, and therefore f β (r) = 1  2 δ(r − 1) + 1 2 δ(r − 0).In this case, we also have that the integral in the specific heat vanishes, and that the specific heat does not diverge.
S3 Specific heat divergence rate in flat models as function of correlation strength In the next two sections, we will derive analytic expressions to predict the specific heat divergence rate in flat models as a function of the correlation strength within the population.Starting out from eq. ( 20), we will use two different approximations to f (r) that will each yield results that allow us to better understand the behavior of the specific heat at unit temperature c(T = 1) in flat models.
S3.1 Asymptotic entropy and specific heat in weakly correlated flat models: Next, we examine entropy and specific heat in models with weak correlations.If the model is weakly correlated and its mode is not at 0 or 1 we can assume it to be approximately Gaussian with mean µ and variance σ 2 , We first calculate the entropy: We expand η(r) to second order around µ, We further investigate the variance, again neglecting all terms which are of higher order than 2, obtaining In other words, a population of a given size n at fixed firing rate µ that has a high specific heat is simply a population which is very correlated.Inspecting the equations above, we see that the final results do not critically depend on the Gaussian assumptionthe only requirement for the calculation to be accurate is that the distribution is reasonably peaked around its mean.

S3.2 Asymptotic specific heat in the beta-binomial population model
For the beta-binomial model, we assume f (r) to be given by a beta distribution, i.e.
Such f (r) arise for large populations when the population spike count k is described by a beta-binomial distribution, and the choice for the beta distribution as a model for f (r) was motivated by the successful application of beta-binomial models P (k|α, β) to our simulated RGC activity (see Fig. S4).
For beta-distributed r, we have

50 Figure 1 .
Figure 1.A simple phenomenological model of retinal ganglion cell activity a) Model schematic: Neurons have linear stimulus selectivity with centre-surround receptive fields and also receive correlated Gaussian noise.Neural activity is modelled in discrete time.b) Receptive field centres in simulation.c) Example raster plot of the simulated activity of 100 neurons in response to natural stimuli.d) Statistics of population activity in response to natural stimuli.Histogram of firing rates (top left), correlation coefficients (bottom left) and frequency of population spike-counts (right).

Figure 2 .
Figure 2. Signatures of criticality in a simple simulation of RGC activity a) Estimation-error (normalised mean square error) in pairwise covariances as function of sample size, averaged across 10 populations of size n = 100.Use of Rao-Blackwellization reduces the number of samples needed for a given level of accuracy by a factor of approximately 3, making it possible to explore multiple large population models.b) Quality of fit: After convergence, the population models (here n = 100, example population) capture the mean firing rates (top), covariances (centre) and spike count distribution (bottom) of the data.c) Changing the temperature parameter scales both mean firing rates (top), covariances (centre) and population spike-counts (bottom) d) Estimating specific heat via MCMC sampling: MCMC estimates of specific heat from a K-pairwise maximum entropy model fit to an example population (same model as in b-d).Final estimates were taken from the average over first 4h sampling time.Right: Resulting plot of specific heat as function of temperature.e) Divergence of specific heat: Average and individual traces for 10 randomly sampled populations for each of 6 different population sizes, exhibiting divergence of specific heat and peak in heat near unit temperature.Inset: Specific heat at unit temperature and at peak vs. population size.

Figure 3 .
Figure 3. Signatures of criticality in a flat population model a) Population spike-count distribution in RGC simulation, and approximation by different models.Only the beta-binomial population model fits the simulated data accurately.b) Beta-binomial model fits for different population sizes, indicating the goodness-of-fit is robust across population size.c) Estimates for beta-binomial parameters α, β for data from the simulation for different population sizes (mean ± 1 s.t.d.), Best-fitting parameters do not vary systematically with population size.d) Beta-binomial model approximations to published empirically measured population spike-count distributions.e) Specific heat traces for the beta-binomial model, exhibiting signatures of criticality.Average and individual traces for 30 randomly sampled populations for each of 6 different population sizes.Inset: Specific heat at unit temperature and at peak vs. population size.f) Heat traces for independent model and flat pairwise maximum entropy model, which do not exhibit a divergence of the specific heat.

Figure 4 .
Figure 4. Relationship between correlations and criticality.a) Specific heat traces for beta-binomial model of different correlation strengths and population sizes.Heat traces are qualitatively similar, but differ markedly quantitatively (see y-axes).b) Specific heat diverges linearly, and the slope depends on the strength of correlations (left).Divergence rate of specific heat for beta-binomial model as a function of correlation strength (centre).Rightmost point (at infinity) corresponds to analytical prediction of large-n behaviour.Divergence rates are strictly increasing with correlation strength (right) which is captured by a weak-correlation approximation (dashed line).c) Specific heat increases with correlation in the K-pairwise maximum entropy model: average and individual traces for 10 randomly subsampled populations for 6 different population sizes.Left to right: checkerboard, natural images and full-field flicker stimuli presented to the population.Correlation strengths denote mean correlation coefficient in each population.

Figure 5 .
Figure 5. Random subsampling leads to criticality-inducing correlations.a) Illustration: A population with 100 neurons and infinite-range correlations, the average correlation between any pair of neurons is close to 0.05.Correlation as function of inter-neuron distance (left) and full correlation matrix (right).b) Average correlation in subpopulation of different size n (left) and specific heat as function of n (right), when neurons are sampled from 1 to 100.Random sampling gives identical results (not shown).c) Population with limited-range correlations, same plots as in panel a. d) Left: Average correlation as function of population size for ordered sampling (green) and uniform subsampling (gray).Right: Specific heat grows linearly for random subsampling, but shows signs of saturation for ordered sampling.e) Average correlation as function of inter-neuron distance in RGC simulation.For checkerboard and natural images, correlations drop to 0 for large distances.f) Specific heat for different stimulation conditions, for ordered (colour) or random subsampling (gray).

Figure S2 .
Figure S2.Quality of fits for K-pairwise maximum entropy model across multiple populations and stimulus conditions a) Normalised MSEs for firing rates, covariances and P (K) during parameter learning.Error values collapses across 10 subpopulations at n = 100, fit to simulated activity in response to natural images, one point for each displayed iteration and each subpopulation.Lines are moving averages (smoothing kernel width = 150 param.updates).b) Quality of fit after parameter learning.Data vs. model estimates for firing rates, covariances and P (K), collapsed over all 10 subpoplations with size n = 100.c) Quality of fit for different stimulus types.Normalised MSEs after maximum entropy model fitting shown for 10 subpopulations for natural images (nat) and 5 subpopulations each for checkerboard (cb) and full-field flicker (fff).All subpopulations of size n = 100.Vertical bars give averages.Colours as in a), b).

E 1 0fFigure
Figure S4.(No) Influence of beta-binomial approximation on heat capacitySpecific heat capacities computed from population spike count distributions P (K).Spike count distributions for population sizes n = 20, . . ., 300 were obtained from 50 uniformly drawn subpopulations each.Simulated retinal activity was taken from of the retina simulation with in total N = 316 RGCs that responded to natural image stimulation.Resulting specific heat traces computed from beta-binomial approximations to the spike count distributions (left) and from raw P (K) (right) do not display visible differences.