On the Number of Neurons and Time Scale of Integration Underlying the Formation of Percepts in the Brain

All of our perceptual experiences arise from the activity of neural populations. Here we study the formation of such percepts under the assumption that they emerge from a linear readout, i.e., a weighted sum of the neurons’ firing rates. We show that this assumption constrains the trial-to-trial covariance structure of neural activities and animal behavior. The predicted covariance structure depends on the readout parameters, and in particular on the temporal integration window w and typical number of neurons K used in the formation of the percept. Using these predictions, we show how to infer the readout parameters from joint measurements of a subject’s behavior and neural activities. We consider three such scenarios: (1) recordings from the complete neural population, (2) recordings of neuronal sub-ensembles whose size exceeds K, and (3) recordings of neuronal sub-ensembles that are smaller than K. Using theoretical arguments and artificially generated data, we show that the first two scenarios allow us to recover the typical spatial and temporal scales of the readout. In the third scenario, we show that the readout parameters can only be recovered by making additional assumptions about the structure of the full population activity. Our work provides the first thorough interpretation of (feed-forward) percept formation from a population of sensory neurons. We discuss applications to experimental recordings in classic sensory decision-making tasks, which will hopefully provide new insights into the nature of perceptual integration.


Predictions for choice probabilities
Most often, choice signals are computed in the form of choice probabilities (CPs). This first requires computing some temporal average r̄_i of the underlying spike trains. Then, the CP for every neuron i measures the area under the ROC curve between the two distributions of r̄_i, respectively conditioned on c = 0 and c = 1 [5]. With the same assumptions as in our main text (optimal behavioral model, Gaussian statistics), Haefner et al. [6] have shown that the CP signal for each neuron i is approximately proportional to the (Pearson) correlation between r̄_i and ŝ. Rewritten in the notations of our article, CP_i − 1/2 is thus proportional to d̄_i / σ(r̄_i), up to a factor depending on the stimulus statistics through Z and κ(Z). For example, let us consider a typical neuron from our simulations, and its spike count r̄_i over w = 400 msec of stimulation (as in main text, Figure 4a). We assume a CC value d̄_i = 0.1 Hz, and a Poisson-like firing rate at 20 Hz, leading to σ(r̄_i) = √(20/0.4) ≈ 7 Hz. The JND for our "animal" is Z ≈ 3 Hz, and with our chosen stimuli we have κ(Z) ≈ 0.067 (eq. 46 from the main text). In these conditions, we recover a typical value CP_i ≈ 0.53.
In our notations, CP_i is predicted to be proportional to the choice covariance d̄_i, computed with the same temporal averaging as r̄_i. However, CP_i is also inversely proportional to the standard deviation of the spike counts, σ(r̄_i), which depends on the time window w used to compute them. For example, if we used instead an integration window of w = 40 msec, we would find σ(r̄_i) = √(20/0.04) ≈ 22.3 Hz, and consequently CP_i ≈ 0.51. As a result, there is no single "reference" experimental value for CPs. Conversely, the choice covariance signals d̄_i(t) used in our article do not suffer from that complication.
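The window dependence of CPs can be illustrated with a small Monte-Carlo sketch. All parameter values below are hypothetical, chosen only to make the effect visible (they are not calibrated to the numbers of the previous paragraph); we assume a jointly Gaussian pair (r̄_i, ŝ) and a binary choice c = 1 whenever ŝ > 0:

```python
import numpy as np

rng = np.random.default_rng(0)

def roc_area(x1, x0):
    """Area under the ROC curve: P(x1 > x0), via the Mann-Whitney U statistic."""
    z = np.concatenate([x1, x0])
    ranks = z.argsort().argsort() + 1.0        # ranks 1..n (continuous: no ties)
    u = ranks[:len(x1)].sum() - len(x1) * (len(x1) + 1) / 2
    return u / (len(x1) * len(x0))

def choice_probability(w, rate_hz=20.0, d_cov=1.0, sigma_s=3.0, n_trials=100_000):
    """Monte-Carlo CP of one neuron under a Gaussian approximation.

    The rate estimate r_i over window w has Poisson-like variance rate_hz/w,
    and fixed covariance d_cov (in Hz) with the percept s_hat; the binary
    choice is c = (s_hat > 0).  All parameter values are illustrative only.
    """
    sigma_r = np.sqrt(rate_hz / w)             # std of r_i, in Hz
    rho = d_cov / (sigma_r * sigma_s)          # correlation between r_i and s_hat
    s_hat = rng.standard_normal(n_trials)
    r = rho * s_hat + np.sqrt(1 - rho**2) * rng.standard_normal(n_trials)
    c = s_hat > 0
    return roc_area(r[c], r[~c])

print(choice_probability(w=0.4))               # long window: CP well above 0.5
print(choice_probability(w=0.04))              # short window: CP closer to 0.5
```

Shrinking w inflates σ(r̄_i) while leaving the covariance d_cov unchanged, so the same choice covariance is diluted into a smaller correlation, and the CP moves toward 0.5.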

Sensitivity as a function of w
In the formulas of the main text, it is not explicit how the value of w influences the variance of ŝ, and thus the readout's sensitivity. Here we show that, under mild assumptions, the doubly-integrated covariance matrix C̄ (eq. 40 from the main text) scales as 1/w.
To allow a direct comparison between different possible choices for the shape function h(x), we impose that it be positive (h(x) ≥ 0), normalized (∫ h(x) dx = 1), and have a unit time constant (∫ h(x)² dx = 1).
Common choices include a square kernel, a decreasing exponential, a Gaussian kernel, etc. For concision, we note the integration kernel at scale w as h_w(u) := w⁻¹ h(u/w). Then, for a deterministic readout time t_R, let us note G_w(u, v) := h_w(t_R − u) h_w(t_R − v). It verifies the following property: ∫_t G_w(t, t + τ) dt = (h_w ⋆ h_w)(τ), the autocorrelation of kernel h_w. As a result, we can rewrite the variance equation (Methods, eq. 36) in the form:

Var(ŝ | s) = Σ_{ij} a_i a_j ∫∫ G_w(u, v) C_ij(u, v) du dv
           = Σ_{ij} a_i a_j ∫_τ (h_w ⋆ h_w)(τ) ∫_t [ G_w(t, t + τ) / (h_w ⋆ h_w)(τ) ] C_ij(t, t + τ) dt dτ
           = Σ_{ij} a_i a_j ∫_τ (h_w ⋆ h_w)(τ) C̄_ij(τ) dτ.     (eq. 2)

In the second line, the function of t defined by the fraction is positive and has an integral of 1, so it operates as a temporal averaging on C_ij(t, t + τ). The resulting average over t, noted C̄_ij(τ) in the third line, is thus a form of cross-correlogram between neurons i and j, measuring the average covariance between the spikes from i and j separated by a time lag τ.
Since the shape h has a unit time constant, its autocorrelation function typically has support on [−1, 1], and verifies (h ⋆ h)(0) = 1; consequently, (h_w ⋆ h_w)(τ) = w⁻¹ (h ⋆ h)(τ/w) has support on [−w, w] and peak value 1/w. On the other hand, C̄_ij(τ) typically has support on some interval [−τ_C, τ_C], where τ_C is the typical time scale of noise correlations in the population. As a result, as soon as w gets bigger than τ_C, the integral in eq. 2 approaches a constant value, and the variance of ŝ scales as 1/w. The variance of ŝ also scales as 1/w when w → 0, because all auto-correlograms C̄_ii(τ) display a Dirac peak at τ = 0, due to the spiking nature of the neurons [3].
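This 1/w scaling can be checked numerically on a toy stationary process standing in for the population activity (the exponential correlation time tau_c below, and all other parameters, are arbitrary choices for the illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy check of the 1/w scaling: a stationary correlated signal (white noise
# filtered by an exponential kernel of time constant tau_c) stands in for the
# population activity; its averages over windows w >> tau_c have variance ~ 1/w.
dt, tau_c = 1e-3, 0.02                         # time step and correlation time (s)
n = 500_000                                    # 500 s of signal
kernel = np.exp(-np.arange(0.0, 5 * tau_c, dt) / tau_c)
x = np.convolve(rng.standard_normal(n), kernel, mode="same")

def var_windowed_mean(x, w, dt=dt):
    """Variance of the signal averaged over non-overlapping windows of length w."""
    m = int(round(w / dt))
    chunks = x[: (len(x) // m) * m].reshape(-1, m)
    return chunks.mean(axis=1).var()

for w in (0.2, 0.4, 0.8, 1.6):                 # all windows >> tau_c
    print(w, var_windowed_mean(x, w) * w)      # the product stays roughly constant
```

The printed products w · Var stay approximately constant once w exceeds a few correlation times, which is the regime described in the text.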
This whole analysis holds similarly in the more general case of a non-deterministic t R , only with a slightly different definition of kernel G w (u, v) (supplementary section 6).

Encoding neural network
We detail here the architecture of the artificial encoding network used to test our method. This ad hoc network was designed to display some classic features of sensory cortical neurons involved in perceptual decision-making tasks (e.g., V2, MT, S1, S2, etc.). To reproduce the diversity of responses naturally observed at the population level [7], neurons in our network have broadly distributed firing rates and pairwise noise correlations (main text). The second layer (L2) consists of 5000 leaky integrate-and-fire (LIF) neurons, some of which receive input from L1, and which are all coupled through a sparse, balanced connectivity. The neurons' voltages follow LIF differential equations, and a neuron emits a spike at each time t_i when its voltage V crosses the firing threshold. The feedforward connections from L1 are sparse, with (Erdös–Renyi) connection probability p = 0.01. As these asymmetries create biases in the total synaptic inputs to each type of cell, the intrinsic currents I^(p), I^(n) and I^(u) also vary depending on neuron subtype, to ensure homogeneous firing properties inside the three populations (see Table 1).
Then, all L2 neurons are connected through a single matrix W^(2) of recurrent connections with (Erdös–Renyi) connection probability p = 0.03, independently of their subtype. Non-zero connection strengths are picked uniformly in an interval [w_min, w_max] (see Table 1). Note that L2 recurrent connections can be both excitatory and inhibitory, a departure from biology that allows for an easier implementation. Finally, the recurrent connections in L2 are associated with synaptic delays: for each pair (i, k) of connected L2 neurons, the random delay ∆_ki is drawn uniformly between 0 and 5 msec. This substantially increases the diversity of neural responses in the population, particularly at the level of JPSTHs (Figure 4d from the main text). This is useful because our method is specifically designed to analyze generic, heterogeneous population activities.
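For illustration, the main ingredients of this recurrent layer (Erdös–Renyi connectivity, signed uniform weights, random per-synapse delays) can be sketched in plain numpy. The network below is a much smaller, hypothetical stand-in for the actual 5000-neuron model, with invented parameters, not the exact parametrization of Table 1:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical, down-scaled stand-in for the recurrent L2 layer: LIF neurons,
# Erdos-Renyi connectivity (p = 0.03), signed uniform weights, and per-synapse
# delays drawn uniformly in (0, 5] ms, handled through a ring buffer.
N, dt, steps = 200, 1e-4, 5000                 # 200 neurons, 0.1 ms step, 0.5 s
tau, v_thresh, v_reset, drive = 0.02, 1.0, 0.0, 1.2
p_conn, w_min, w_max, max_delay = 0.03, -0.05, 0.05, 0.005

mask = rng.random((N, N)) < p_conn             # mask[i, k]: connection k -> i
W = np.where(mask, rng.uniform(w_min, w_max, (N, N)), 0.0)
n_slots = int(max_delay / dt) + 1
delays = rng.integers(1, n_slots, (N, N))      # delays in time steps (>= 1)

V = rng.uniform(0.0, 1.0, N)
buf = np.zeros((n_slots, N))                   # delayed synaptic inputs
spikes = []

for t in range(steps):
    V += dt / tau * (drive - V) + buf[t % n_slots]
    buf[t % n_slots] = 0.0                     # slot consumed, free it for reuse
    fired = np.nonzero(V >= v_thresh)[0]
    V[fired] = v_reset
    spikes.append(fired)
    for k in fired:                            # queue delayed effects on targets
        targets = np.nonzero(mask[:, k])[0]
        slots = (t + delays[targets, k]) % n_slots
        np.add.at(buf, (slots, targets), W[targets, k])

rate = sum(len(f) for f in spikes) / (N * steps * dt)
print(f"mean firing rate: {rate:.1f} Hz")
```

The ring buffer `buf` is one simple way to implement heterogeneous synaptic delays; a simulator such as Brian handles this bookkeeping automatically.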
We implemented and simulated the network using Brian, a spiking neural network simulator in Python [4].
Our simulation consisted of many successive epochs of 300 msec, with all possible successions of the three stimulus values s (as in Figure 1a from the main text). Since the input Poisson neurons were always firing close to 30 Hz, there was no strong transient at stimulus onset, as is often observed in real sensory neurons. In our case, the change of activity between two successive stimuli was always only differential, and rather weak (see Figure 4b from the main text). Nonetheless, we observed a large proportion of cells with significant correlations (around 70% of tested pairs, at time lag τ = 0 msec). We deemed "significant" any value |CCG_ij(τ)| exceeding 3 σ_ij(τ), where σ²_ij(τ) is the expected variance of the measure CCG_ij(τ) under the hypothesis that the two neurons are independent. If CCG_ij(τ) is estimated from T (discrete, contiguous) time bins over N trial repetitions, and we assume stationary firing for the neurons, then a classic Bartlett-type argument shows that

σ²_ij(τ) ≈ (1/(N T)) Σ_∆ CCG_ii(∆) CCG_jj(∆).

As a result, σ²_ij(τ) can be estimated experimentally from the neurons' auto-correlograms CCG_ii(∆) and CCG_jj(∆).
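This type of variance estimate can be sanity-checked on surrogate data. The snippet below uses AR(1) series as a crude stand-in for binned spike counts, and compares the empirical variance of the trial-averaged lag-0 cross-correlogram of two independent "neurons" with the Bartlett-type estimate built from their auto-correlograms:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy check of the Bartlett-type estimate: for two independent but individually
# autocorrelated "binned spike counts", the variance of the trial-averaged
# lag-0 cross-correlogram should match (1/(N*T)) * sum_D ccg_ii(D)*ccg_jj(D).
T, N, n_lags = 400, 200, 25                    # bins per trial, trials, lags kept

def ar1_counts(n, rho=0.6):
    """Surrogate binned activity with exponential autocorrelation (AR(1))."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for k in range(1, n):
        x[k] = rho * x[k - 1] + np.sqrt(1 - rho**2) * rng.standard_normal()
    return x

def acov(x, max_lag):
    """Autocovariance at lags 0..max_lag-1 (biased normalization)."""
    x = x - x.mean()
    return np.array([x[:len(x) - d] @ x[d:] / len(x) for d in range(max_lag)])

ccg0 = np.empty(N)                             # per-trial CCG at lag 0
acov_i = np.zeros(n_lags)
acov_j = np.zeros(n_lags)
for q in range(N):
    xi, xj = ar1_counts(T), ar1_counts(T)      # two independent neurons
    ccg0[q] = (xi - xi.mean()) @ (xj - xj.mean()) / T
    acov_i += acov(xi, n_lags) / N
    acov_j += acov(xj, n_lags) / N

var_empirical = ccg0.var() / N                 # variance of the mean CCG
var_bartlett = (acov_i[0] * acov_j[0]
                + 2 * acov_i[1:] @ acov_j[1:]) / (N * T)
print(var_empirical, var_bartlett)             # should agree within sampling error
```

The two printed numbers agree up to sampling error, which is the property used in the text to threshold significant CCG values.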

Bayesian regularization of the optimal readout
The regularization procedure aims to counteract overfitting effects that may appear in Fisher's linear discriminant (main text, eq. 12) when the computation is based on too few data points. Our procedure is directly inspired by Bayesian linear regression, as presented in Bishop 2006 [2], chapter 3. We refer the reader to this book for additional details.
We start with our notations, mostly similar to those introduced in the main text. We directly consider the time-integrated versions of the spike trains, with some fixed parameters w and t_R, and focus on some candidate readout ensemble E of size K. Then, our experimental data consist of the recorded spike counts r̄_iq, where i = 1 … K runs over neurons in E, and q = 1 … T represents the set of recorded trials. We will note r̄_q for the vector of neural activities on trial q, and s_q for the stimulus value used on trial q.
We will generally note E[X] := T⁻¹ Σ_{q=1}^{T} X_q for the (empirical) average of quantity X_q over trials. Without loss of generality, we apply the offsets r̄_q ← r̄_q − E[r̄] and s_q ← s_q − E[s]. This will simplify the following formulas by removing cumbersome constant terms. As in the main text, we introduce the following (empirical) statistics:

A := E[r̄ r̄ᵀ],   b̄ := σ_s⁻² E[s r̄],   σ_s² := E[s²].

Matrix A is the total covariance matrix, b̄ is the tuning vector, and σ_s² is the variance of the tested stimuli. Note the simplified definitions as compared to the main text, because we now have E[r̄] = 0 and E[s] = 0.
The goal of Fisher's linear discriminant is to find a vector a maximizing the signal-to-noise ratio (SNR) of the variable aᵀ r̄, with respect to the stimulus s. Under our assumption that the system depends linearly on s, the signal term writes σ_s² (aᵀ b̄)² (main text, eq. 51). Mathematically, the simplest option is to fix this signal term once and for all. Then, the SNR will simply be maximized by minimizing the total covariance, aᵀ A a (main text, eq. 53). The logical choice for the signal term is to impose aᵀ b̄ = 1 (unbiased percept). This constrains vector a to lie in a certain hyperplane, of dimension K − 1. We hard-code this constraint by the following reparametrization:

a = b̄ / ‖b̄‖² + M c,     (eq. 3)

where M is a (K × (K − 1)) matrix whose columns constitute an orthogonal basis of {b̄}⊥, the orthogonal space of vector b̄. Vector c ∈ R^{K−1} constitutes our reparametrization of vector a.
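On synthetic data, this reparametrized minimization takes a few lines of numpy. The construction of M below (via an SVD) is one possible choice of basis, and the comparison with the Lagrangian closed form is only a sanity check, not part of the procedure:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy sketch of the constrained Fisher discriminant: minimize a^T A a under
# a^T b = 1, via the reparametrization a = b/||b||^2 + M c with M a basis of
# the hyperplane orthogonal to b.  Data, and the choice of M, are illustrative.
K, T = 10, 500
b_true = rng.standard_normal(K)
s = rng.standard_normal(T); s -= s.mean()                  # centered stimuli
noise = rng.standard_normal((T, K)) @ (0.3 * rng.standard_normal((K, K)))
R = np.outer(s, b_true) + noise
R -= R.mean(axis=0)                                        # centered responses

A = R.T @ R / T                                            # total covariance
b = R.T @ s / (T * s.var())                                # empirical tuning vector

a0 = b / (b @ b)                                           # one point with a0^T b = 1
U, _, _ = np.linalg.svd(b.reshape(-1, 1), full_matrices=True)
M = U[:, 1:]                                               # orthonormal basis of {b}_perp
c = -np.linalg.solve(M.T @ A @ M, M.T @ (A @ a0))          # unconstrained optimum in c
a = a0 + M @ c

# sanity check: the constraint holds, and a matches the Lagrangian closed form
a_lagr = np.linalg.solve(A, b)
a_lagr /= b @ a_lagr
print(np.allclose(a @ b, 1.0), np.allclose(a, a_lagr))
```

Because the constraint is hard-coded in the parametrization, the problem in c is an ordinary unconstrained quadratic minimization, which is what makes the Bayesian treatment below straightforward.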
At this point, we can make the link with classic linear regression. Indeed, the mean square error of aᵀ r̄ as an estimator of stimulus s writes:

F_data(a) := Σ_{q=1}^{T} (aᵀ r̄_q − s_q)² = T (aᵀ A a − 2 σ_s² aᵀ b̄ + σ_s²).

Having imposed aᵀ b̄ = 1, we are left with F_data(a) = T (aᵀ A a − σ_s²). In turn, aᵀ A a is the total covariance. So ultimately, maximizing the SNR is equivalent to minimizing F_data(a) under the constraint that aᵀ b̄ = 1.
In Bayesian linear regression, one does not simply minimize F_data(a), but uses it to derive a full posterior distribution for vector a given the data [2]. One first considers a (Gaussian) generative model for the data given a, depending on a hyperparameter named β:

p(s_q | r̄_q, a, β) := N(s_q | aᵀ r̄_q, β⁻¹).

On the other hand, one requires that the inferred vector a be regular: its entries a_i should not take too large values, because this is typically a mark of overfitting, see [2]. We thus impose a regularization prior on a, depending on a hyperparameter named α:

p(a | α) := N(a | 0, α⁻¹ Id).

This gives rise to the overall generative model for the data:

p({s_q}_q, c | α, β) = p(c | α) Π_q p(s_q | r̄_q, a(c), β).     (eq. 4)

Note the reparametrization of vector a with vector c (eq. 3), to account for the fact that aᵀ b̄ = 1. Just as in classic Bayesian regression, the log-probability in eq. 4 is a quadratic function of vector c.
The regularization procedure per se consists of finding the hyperparameters (α, β) which maximize the overall likelihood of the data, when marginalizing over the readout vector c. In other words, we seek (α, β) that maximize the so-called evidence function:

p({s_q}_q | α, β) = ∫ p({s_q}_q, c | α, β) dc.     (eq. 5)

This maximization is often referred to as empirical Bayes. It can be performed with an EM algorithm, where c is the latent variable. We refer to [2] for details, and only provide here the algorithm. We note c ∼ N(m_T, Σ_T) for the posterior distribution of vector c. In the E step, we deduce the values of m_T and Σ_T from eq. 4, assuming fixed values of α and β:

Σ_T = ( α Id + β Σ_q x_q x_qᵀ )⁻¹,   m_T = β Σ_T Σ_q x_q ( s_q − a₀ᵀ r̄_q ),

where x_q := Mᵀ r̄_q and a₀ := b̄/‖b̄‖² denotes the constant part of the reparametrization (eq. 3). (Vector a_T is directly obtained from m_T with eq. 3.) In the M step, we find the hyperparameters α and β which maximize the likelihood in eq. 4, marginalized over c assuming the distribution c ∼ N(m_T, Σ_T):

α ← (K − 1) / ( ‖m_T‖² + Tr Σ_T ),   β ← T / Σ_q [ ( s_q − a₀ᵀ r̄_q − m_Tᵀ x_q )² + x_qᵀ Σ_T x_q ].

By iterating these two steps, the EM algorithm is guaranteed to converge to a (local) maximum of the evidence function in eq. 5.
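The following numpy sketch implements this EM loop on synthetic data, using the standard Bayesian-regression update formulas from [2]; the exact handling of the constant offset of the reparametrization may differ in detail from the article's implementation:

```python
import numpy as np

rng = np.random.default_rng(5)

# EM evidence maximization (Bishop 2006, ch. 3/9) for the reparametrized
# readout a = a0 + M c with a^T b = 1, on synthetic data.  The updates below
# are the standard Bayesian-regression EM steps, used here as an illustration.
K, T = 8, 60
b = rng.standard_normal(K)
a0 = b / (b @ b)                               # satisfies a0^T b = 1
U, _, _ = np.linalg.svd(b.reshape(-1, 1), full_matrices=True)
M = U[:, 1:]                                   # orthonormal basis of {b}_perp

R = rng.standard_normal((T, K))                # centered "spike counts"
c_true = 0.5 * rng.standard_normal(K - 1)
a_true = a0 + M @ c_true
s = R @ a_true + 0.3 * rng.standard_normal(T)  # noisy linear percept

X = R @ M                                      # design matrix for c
t = s - R @ a0                                 # residual targets

alpha, beta = 1.0, 1.0
for _ in range(200):
    # E step: Gaussian posterior over c, at fixed hyperparameters
    Sigma = np.linalg.inv(alpha * np.eye(K - 1) + beta * X.T @ X)
    m = beta * Sigma @ X.T @ t
    # M step: re-estimate the hyperparameters
    alpha = (K - 1) / (m @ m + np.trace(Sigma))
    resid = t - X @ m
    beta = T / (resid @ resid + np.einsum('ij,jk,ik->', X, Sigma, X))

a_T = a0 + M @ m                               # MAP readout vector
print(a_T @ b)                                 # the constraint a^T b = 1 holds exactly
```

Note that the constraint aᵀ b̄ = 1 is satisfied by construction at every step of the loop, since only c is ever inferred.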
At convergence, we obtain a solution to our original problem of regularization. The hyperparameters α and β implement the optimal trade-off between minimizing aᵀ A a (maximizing the SNR) and minimizing ‖a‖² (regularization). This optimal trade-off will mostly depend on T, the number of trials entering the computation of A and b̄. The final vector a_T is the maximum a posteriori estimate for vector a. A simple computation shows that Mᵀ (βT A + α Id) a_T = 0, meaning that (βT A + α Id) a_T is proportional to b̄. Hence, as stated in the main text, one has

a_T ∝ (A + λ Id)⁻¹ b̄,

with the strength of regularization given by λ = α/(βT). (We have not investigated whether the optimal λ found by our procedure coincides with the one that would be found in classic Bayesian regularization, without imposing aᵀ b̄ = 1. In general, they are probably of comparable magnitude.) We define the corrected JND, Z, as the expected value of the noise variance aᵀ C̄ a under the posterior distribution for a (plus the potential decision noise, σ_d²). One can show that it verifies

Z² = a_Tᵀ C̄ a_T + Tr( Mᵀ C̄ M Σ_T ) + σ_d²

(see main text, eq. 53). Similarly, we define CC curves (main text, eq. 9) as the expected value of C̄(t) a under the posterior distribution for a. By linearity, this is simply C̄(t) a_T.
Naturally, Bayesian linear regression is just a model-based inference from the data, so it is not guaranteed to always counteract overfitting in the optimal fashion. In our empirical tests, the JNDs recovered after regularization were still somewhat smaller than their true values (as can be seen in the main text, Figure 5d-e), meaning that there was some residual overfitting. Nonetheless, the situation was much improved compared to when the correction was not applied, and we strongly recommend applying this procedure in data-scarce situations. As a drawback, the simple linear inversion required in the original formula for Fisher's LD (main text, eq. 12) is replaced by multiple iterations of the EM algorithm, which must be performed anew for every tested set of readout parameters. So the regularization procedure is quite costly in computation time.

Unbiased estimation of CC indicators q and V
When it comes to computing CC indicators q(u, t) and V, the finite amount of data can bias the results in two ways. First, due to the finite number of trials, each neuron's individual statistic (say, d_i) will differ from the "perfect" value that would be obtained from an infinite number of trials. Second, due to the finite number of neurons, each population-wide indicator (say, q = ⟨b_i d_i⟩_i) will differ from the "perfect" value that would be obtained if all neurons in the population had been recorded.
Here, we present a generic description of this problem, which applies both to the "true" and "predicted" CC indicators from the main text, and we detail the specific corrections to ensure unbiasedness.

Nature of the problem, notations
We are given a neural population of interest, Ω, of cardinal M. Each neuron i in the population has two attributes, b_i and d_i, and we wish to estimate the following population-wide indicators:

q := ⟨b_i d_i⟩_{i∈Ω},     (eq. 6)
B := ⟨b_i²⟩_{i∈Ω},     (eq. 7)
D := ⟨d_i²⟩_{i∈Ω},     (eq. 8)
V := q² / (B D).     (eq. 9)

The names are directly inspired from the main text, without the complications due to temporal dependencies, temporal averaging, stars, readout parameters, etc. For clarity, we also give explicit names to the two intermediate quantities B and D, involved in the computation of V.
Unfortunately, not all the data are available to perform these computations exactly, and we need to introduce some more notations. First, we will use the generic notation ⟨x_i⟩_{i∈R} to denote the population average of quantity x_i over neurons i in some fixed neural ensemble R, possibly smaller than Ω. That is:

⟨x_i⟩_{i∈R} := (1/R) Σ_{i∈R} x_i.

In our case, R is a generic notation for any subset of the neurons that were actually recorded; we also note R for its cardinal. We assume that the choice of R amongst Ω is totally random and independent from everything else.
Second, due to the finite number of recording trials, we do not access the "perfect" values b_i and d_i, but experimental versions b̂_i and d̂_i which are corrupted by measurement noise. In our case, we may assume that b̂_i and d̂_i are unbiased estimators of their true values, meaning that

E[b̂_i] = b_i,   E[d̂_i] = d_i.

Here, the expectation refers to the random nature of the recording trials which lead to the measures of b̂_i and d̂_i.
In turn, the variances Var[b̂_i] and Var[d̂_i] give the strength of the respective measurement errors. For simplicity, we assume that the measurement error on each b̂_i or d̂_i is independent from all other errors. There are several ways of estimating these errors in practice (e.g., based on a model). In this work, we estimated each quantity Var[x̂_i] with a bootstrap principle, as detailed below.

Unbiased estimators
Unbiased estimator for q. Given the above notations, our naive estimator for q in eq. 6 is:

q̂ := ⟨b̂_i d̂_i⟩_{i∈R}.     (eq. 10)

This estimator is directly suitable for our needs, because it is unbiased. Indeed, the measurement errors on b̂_i and d̂_i are independent, so E[b̂_i d̂_i] = b_i d_i; and since R is a random subset of Ω, averaging over i ∈ R is unbiased for the average over Ω, so that E[q̂] = ⟨b_i d_i⟩_{i∈Ω} = q. Conversely, the squaring operations involved in eq. 7-9 systematically transform measurement errors into positive biases. For example, q̂² is not an unbiased estimator of q², so we need to introduce specific corrections. The following lemma allows us to derive all the corrections required below.

Lemma. Let x_i and y_i be two attributes over the population Ω, and suppose that:
• we access the values of x_i through estimators x̂_i for i ∈ S, a random sub-ensemble of Ω of cardinal S ≤ M;
• we access the values of y_i through estimators ŷ_i for i ∈ R, a sub-ensemble of S of cardinal R ≤ S,
and assume the following properties for the various estimators:

E[x̂_i] = x_i,   E[ŷ_i] = y_i,   Cov(x̂_i, ŷ_i) = W_i,

with measurement errors independent across neurons, and Ŵ_i an unbiased estimator of W_i. Then, the following quantity:

(S(M−1)) / ((S−1)M) · [ ⟨x̂_i⟩_{i∈S} ⟨ŷ_i⟩_{i∈R} − (M−S)/(S(M−1)) · ⟨x̂_i ŷ_i − Ŵ_i⟩_{i∈R} ("X term") − (1/S) · ⟨Ŵ_i⟩_{i∈R} ("W term") ]

provides an unbiased estimator of ⟨x_i⟩_{i∈Ω} ⟨y_i⟩_{i∈Ω}.
The proof is given at the end of this section. Rephrased: the "X" and "W" terms correct for the bias E[⟨x̂_i⟩_{i∈S} ⟨ŷ_i⟩_{i∈R}] − ⟨x_i⟩_{i∈Ω} ⟨y_i⟩_{i∈Ω} = Cov(⟨x̂_i⟩_{i∈S}, ⟨ŷ_i⟩_{i∈R}). On the one hand, the "X term" assesses the covariations of x_i and y_i from one neuron to the other. On the other hand, the "W term" appears when the same data are used to produce x̂_i and ŷ_i, so that their measurement errors are correlated (W_i ≠ 0).

Unbiased estimator for q². In this case, we apply the lemma with S = R, x_i = y_i = b_i d_i, and their unbiased estimators x̂_i = ŷ_i = b̂_i d̂_i; the lemma then directly yields an unbiased estimator for q² (eq. 11), with W_i = Var[b̂_i d̂_i].

Unbiased estimator for BD. First, it is simple to obtain unbiased estimators for the intermediate quantities B and D in eq. 7-8. For example:

E[⟨b̂_i²⟩_{i∈R}] = ⟨b_i²⟩_{i∈Ω} + ⟨Var[b̂_i]⟩_{i∈Ω},     (eq. 12)

where the second term (average measurement error in the population) corresponds to the bias. A similar relationship holds for d̂_i. As a result:

B̂ := ⟨b̂_i² − Var[b̂_i]⟩_{i∈R},   D̂ := ⟨d̂_i² − Var[d̂_i]⟩_{i∈R}.     (eq. 13)

(Note that the available population for the average can be different for B̂ and D̂.) In turn, B̂ D̂ is not an unbiased estimator of BD. To derive the corrections, we apply the lemma with x_i = b_i², measured over ensemble S, and y_i = d_i², measured over ensemble R (assumed to be included in S). Their respective unbiased estimators are x̂_i = b̂_i² − Var[b̂_i] and ŷ_i = d̂_i² − Var[d̂_i], whose measurement errors are independent (and so W_i = 0). Furthermore, concerning the term ⟨x̂_i ŷ_i⟩_{i∈R} which appears in the lemma, we note that for every i ∈ R, E[x̂_i ŷ_i] = E[x̂_i] E[ŷ_i], so the one term can be replaced by the other without modifying the expected value. We deduce the following unbiased estimator for BD:

(BD)^ := (S(M−1)) / ((S−1)M) · [ B̂ D̂ − (M−S)/(S(M−1)) · ⟨x̂_i ŷ_i⟩_{i∈R} ].     (eq. 14)
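The lemma can be checked by brute force in the simplest setting (no measurement noise, so all W_i = 0): averaging the corrected estimator over many random draws of the sub-ensembles R ⊆ S should recover the product of full-population means, ⟨x_i⟩_Ω ⟨y_i⟩_Ω:

```python
import numpy as np

rng = np.random.default_rng(6)

# Brute-force check of the lemma with W_i = 0 (noiseless attributes): averaged
# over random draws of R <= S <= Omega, the corrected product of sub-ensemble
# means recovers the product of full-population means, <x>_Omega * <y>_Omega.
M, S_size, R_size, n_rep = 50, 20, 8, 100_000
x = rng.standard_normal(M) + 1.0
y = rng.standard_normal(M) - 0.5
target = x.mean() * y.mean()

est = np.empty(n_rep)
prefactor = (S_size * (M - 1)) / ((S_size - 1) * M)
for k in range(n_rep):
    perm = rng.permutation(M)
    S, R = perm[:S_size], perm[:R_size]        # R is included in S
    x_term = (x[R] * y[R]).mean()              # "X term" content (W_i = 0 here)
    est[k] = prefactor * (x[S].mean() * y[R].mean()
                          - (M - S_size) / (S_size * (M - 1)) * x_term)

print(est.mean(), target)                      # the two should closely agree
```

Note that the naive product of sub-ensemble means is biased even without measurement noise; the "X term" is precisely what removes this sampling bias.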

Application: "true" and "predicted" indicators
Estimating measurement errors. We estimate all measurement errors thanks to a bootstrap principle. We generate T_bs surrogate versions of our data, by resampling with replacement from the original trials. Then, each neuron's individual statistic (say, d̂_i) gives rise to T_bs surrogate measures d̂_i^(rs), and we estimate the measurement error on d̂_i as:

Var[d̂_i] ≈ (1/(T_bs − 1)) Σ_{rs} ( d̂_i^(rs) − ⟨d̂_i^(rs)⟩_rs )².     (eq. 15)

For the large-scale simulation presented in the article, we use T_bs = 14 resamplings. Furthermore, we use the same 14 resamplings to derive error bars on our final estimates (see main text). This departure from the statistical canon was imposed by the length of the whole inference procedure. As a result, when correcting for measurement errors on a resampled set of data, we remove twice the estimated measurement error from eq. 15.
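For concreteness, here is the bootstrap estimate of a measurement error in its simplest form; d̂_i is taken to be a plain trial average, and the comparison with the analytic value σ²/T is only possible because the surrogate data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(7)

# Bootstrap estimate of the measurement error on a per-neuron statistic d_i
# (here a plain trial average): resample trials with replacement, recompute
# the statistic, and take the variance across resamplings.
n_trials, n_boot = 300, 14                     # 14 resamplings, as in the text
trials = rng.normal(loc=0.1, scale=1.0, size=n_trials)   # per-trial measures
d_hat = trials.mean()

d_rs = np.array([rng.choice(trials, n_trials, replace=True).mean()
                 for _ in range(n_boot)])
var_boot = d_rs.var(ddof=1)                    # estimated Var[d_hat]
print(d_hat, var_boot, trials.var(ddof=1) / n_trials)   # last two agree on average
```

With only 14 resamplings, this estimate is itself quite noisy, which is the trade-off discussed above between statistical rigor and computation time.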
For example, for D̂ in eq. 13, our "unbiased estimator" on resamplings is

D̂^(rs) := ⟨ (d̂_i^(rs))² − 2 Var[d̂_i] ⟩_{i∈R}.

This allows D̂^(rs) to have the same expectation as the original D̂. Without this ad hoc procedure, the final CC indicators computed from the resamplings (q̂^(rs) and V̂^(rs)) would be systematically biased compared to the original ones (q̂ and V̂).

Predicted indicators. For the "predicted" versions, we face two additional difficulties. First, for each candidate readout ensemble E, the corresponding CC indicators are obtained as the compound of two predictions: on the one hand, the prediction for CC signals inside the readout ensemble E; on the other hand, the prediction for CC signals outside the readout ensemble E. Both predictions are then mixed with a weighting p := K/N_tot (see main text). Second, the final CC indicators are obtained by averaging q̂ and V̂ over several candidate ensembles E (see main text). To ease the implementation of this final averaging, we always use the same estimator B̂, computed from all recorded neurons, independently of the candidate ensemble E. For CC signals inside E, we apply eq. 10-11 with Ω = R = E, of cardinal M = R = K; then we apply eq. 14 with S = N and R = E. For CC signals outside E, the population of interest Ω has cardinal M = N_tot − K; we apply eq. 10-11 with R = I, the set of complementary neurons, of cardinal R = I (see main text), and then eq. 14 with S = N and R = I. Finally, our unbiased compound estimators are obtained by combining the "inside" and "outside" estimators with the weighting p (see main text). These "predicted" estimators are generally not reliable on their own, owing to the small size of the complementary ensemble I.
However, the final prediction is obtained as an average over a large number of pairs (E, I)-see main text.
Thanks to the linear structure of q and V , this averaging can be performed "online", without storing the results for each tested pair (E, I). Since each component is unbiased, the final average will also be unbiased, and reliable.

Proof of the lemma
The expectation of ⟨x̂_i⟩_{i∈S} ⟨ŷ_i⟩_{i∈R} writes as follows:

E[ ⟨x̂_i⟩_{i∈S} ⟨ŷ_i⟩_{i∈R} ] = (1/(SR)) E[ Σ_{i∈S} Σ_{j∈R} x̂_i ŷ_j ]
= (1/(SR)) E[ Σ_{i∈S, j∈R, i≠j} x_i y_j + Σ_{i∈R} ( x_i y_i + W_i ) ]
= ((S−1)M) / (S(M−1)) · ⟨x_i⟩_{i∈Ω} ⟨y_i⟩_{i∈Ω} + (M−S)/(S(M−1)) · ⟨x_i y_i⟩_{i∈Ω} + (1/S) · ⟨W_i⟩_{i∈Ω}.

In the second line, we used the unbiasedness and independence across neurons of the measurement errors, which only contribute the covariance terms W_i on the diagonal i = j ∈ R. In the third line, we note that R ⊂ S, so that each ordered pair (i, j) with i ≠ j is uniformly distributed amongst the ordered pairs of distinct neurons in Ω, while each diagonal index i ∈ R is uniformly distributed in Ω. The last two terms in the above sum can be replaced by their unbiased estimators. Namely: ⟨Ŵ_i⟩_{i∈R} is an unbiased estimator of ⟨W_i⟩_{i∈Ω}, and ⟨x̂_i ŷ_i − Ŵ_i⟩_{i∈R} is an unbiased estimator of ⟨x_i y_i⟩_{i∈Ω}.
Then, multiplying the above relationship by S(M−1)/((S−1)M) and collecting terms again, we find that

⟨x_i⟩_{i∈Ω} ⟨y_i⟩_{i∈Ω} = (S(M−1)) / ((S−1)M) · E[ ⟨x̂_i⟩_{i∈S} ⟨ŷ_i⟩_{i∈R} − (M−S)/(S(M−1)) · ⟨x̂_i ŷ_i − Ŵ_i⟩_{i∈R} − (1/S) · ⟨Ŵ_i⟩_{i∈R} ].

The final result of the lemma is a rearranged form of this equation, making more explicit the corrective nature of the term "X".

Extended model with non-deterministic extraction time t R
We detail here the analytical modifications when the model's extraction time t_R is non-deterministic. We thus assume that t_R is itself a random variable, drawn on each trial according to some density function g(t), independently of the neural activities r(t). (Note that this is a restrictive hypothesis. In particular, it rules out a direct application of the model to an integration-to-bound framework.) The full readout model is then given by:

ŝ := Σ_i a_i ∫_t h_w(t_R − t) r_i(t) dt,   with t_R ∼ g.     (eq. 16)

For concision, we have noted the integration kernel at scale w as h_w(u) := w⁻¹ h(u/w). This model naturally encompasses the simpler version presented in the main text: to obtain a deterministic readout time t_R, we simply need to set g(t) = δ(t − t_R), where δ(·) is the Dirac delta function.
Derivation. For any deterministic function of time f, let us note E_R[f] := ∫_t g(t) f(t) dt, its average under the random extraction time t_R. Then, for any random process X(t) constructed from the spike trains (and hence, independent of t_R), we have:

E[X(t_R)] = E_R[ E(X) ],
Var(X(t_R)) = E_R[ Var(X) ] + Var_R[ E(X) ],

this last line stemming from the law of total variance. These two formulas allow us to compute the moments of ŝ given s, in a fashion similar to the main text (Methods, eq. 35-37). After these computations, we find that the general form of the characteristic equations still holds: Cov[r̄(t), ŝ | s] = C̄(t) a, but with more general definitions for b̄, C̄(t) and Γ. In particular,

C̄_ij(t) := ∫_u (g ⋆ h_w)(u) C_ij(t, u) du,     (eq. 17)
Γ_ij := ∫∫_{u,v} G_w(u, v) C_ij(u, v) du dv + V_ij^temp,     (eq. 18)

having defined (g ⋆ h_w)(u) := ∫_t g(t) h_w(t − u) dt and G_w(u, v) := ∫_t g(t) h_w(t − u) h_w(t − v) dt; the tuning vector b̄ is generalized analogously, by averaging with the aggregate kernel (g ⋆ h_w).
Interpretation. Through eq. 17, g(t) acts as a weighting factor over the CC curves that would be obtained for each deterministic value of t_R: C̄(t | g) = ∫_u g(u) C̄(t | t_R = u) du. This leads to the spreading of CC curves sketched in main text, Figure 8a. Furthermore, the matrix V_ij^temp is an additional source of variance that appears only when g(t) has an extended temporal support, i.e., when t_R is non-deterministic: it corresponds to the term Var_R[E(X)] above, the trial-to-trial variability of ŝ created by the randomness of t_R itself, acting on the trial-averaged neural activities. Extension of our method. Could the statistical approach introduced in the main text be extended, to recover a non-deterministic extraction function g(t)? One possible concern is that the temporal evolution of CC signals is only determined by the aggregate function (g ⋆ h_w)(t) (eq. 17), which cannot be used to disentangle g(t) and w separately.
However, note that in the sensitivity equation (eq. 18), one now has C̄ ≠ Γ in general, as opposed to the case with deterministic t_R. Instead, the respective effects of g(t) and w on C̄(t) and Γ (eq. 17-18) can roughly be thought of as a scaling:

C̄ ≈ ρ(w, g) Γ,     (eq. 19)

because the overall "shape" of covariance between neurons (as opposed to its "strength") does not depend much on the precise temporal integration used to compute their activity. For example, eq. 19 is exactly verified if (1) neural activities are stationary and (2) the profile of temporal correlation is homogeneous across neurons: C_ij(t, u) = C̄_ij F(|t − u|). Precisely, in this case,

ρ(w, g) = ∫_ξ F̂(ξ) |ĥ_w(ξ)|² |ĝ(ξ)|² dξ / ∫_ξ F̂(ξ) |ĥ_w(ξ)|² dξ,

expressed in terms of Fourier transforms.
As a result, the CC indicator q(u, t) (main text, eq. 14) is predicted to scale as ρ(w, g). So, while matching the temporal support of q̂_E(u, t) and q̂(u, t) constrains the value of (g ⋆ h_w)(t), matching their overall amplitude constrains ρ(w, g). Thus, we may hope that minimizing the loss term ∫∫_{u,t} ( q̂_E(u, t) − q̂(u, t) )² du dt will allow us to disentangle the values of g(t) and w separately. In practice though, this would require fitting at least one additional temporal parameter; typically, the standard deviation of t_R from trial to trial.