Optimal learning with excitatory and inhibitory synapses

Characterizing the relation between weight structure and input/output statistics is fundamental for understanding the computational capabilities of neural circuits. In this work, I study the problem of storing associations between analog signals in the presence of correlations, using methods from statistical mechanics. I characterize the typical learning performance in terms of the power spectrum of random input and output processes. I show that optimal synaptic weight configurations reach a capacity of 0.5 for any fraction of excitatory to inhibitory weights and have a peculiar synaptic distribution with a finite fraction of silent synapses. I further provide a link between typical learning performance and principal components analysis in single cases. These results may shed light on the synaptic profile of brain circuits, such as cerebellar structures, that are thought to engage in processing time-dependent signals and performing on-line prediction.


Introduction
At the most basic level, neuronal circuits are characterized by the subdivision into excitatory and inhibitory populations, a principle known as Dale's law. Even though the role of Dale's law is not yet fully understood, synaptic sign constraints are pivotal in constructing biologically plausible models of synaptic plasticity in the brain [1][2][3][4][5].
The properties of synaptic couplings strongly impact the dynamics and response of neural circuits, thus playing a crucial role in shaping their computational capabilities. It has been argued that the statistics of synaptic weights in neural circuits could reflect a principle of optimality for information storage, both at the level of single-neuron weight distributions [6,7] and of intercell synaptic correlations [8] (e.g. the overabundance of reciprocal connections). A number of theoretical studies, stemming from the pioneering Gardner approach [9], have investigated the computational capabilities of binary [10][11][12][13] and analog perceptrons [14,15] in stylized classification and memorization tasks, using synthetic data. With some exceptions mentioned in the following, these studies considered random uncorrelated inputs and outputs, a usual approach in statistical learning theory. One interesting theoretical prediction is that nonnegativity constraints imply that a finite fraction of synaptic weights are set to zero at critical capacity [6,15,16], a feature which is consistent with experimental synaptic weight distributions observed in some brain areas, e.g. input fibers to Purkinje cells in the cerebellum. The need to understand how the interaction between excitatory and inhibitory synapses mediates plasticity and dynamic homeostasis [17,18] calls for the study of heterogeneous multi-population feed-forward and recurrent models. A plethora of mechanisms for excitatory-inhibitory (E-I) balance of input currents onto a neuron have been proposed [19,20]. At the computational level, it has recently been shown that a peculiar scaling of excitation and inhibition with network size, originally introduced to account for the high variability of neural firing activity [21][22][23][24][25][26][27], carries the computational advantage of noise robustness and stability of memory states in associative memory networks [13].
Analyzing training and generalization performance in feed-forward and recurrent networks as a function of statistical and geometrical structure of a task remains an open problem both in computational neuroscience and statistical learning theory [28][29][30][31][32]. This calls for statistical models of the low-dimensional structure of data that are at the same time expressive and amenable to mathematical analyses. A few classical studies investigated the effect of "semantic" (among input patterns) and spatial (among neural units) correlations in random classification and memory retrieval [33][34][35]. The latter are important in the construction of associative memory networks for place cell formation in the hippocampal complex [36].
For reasons of mathematical tractability, the vast majority of analytical studies in binary and analog perceptron models focused on the case where both inputs and outputs are independent and identically distributed. In this work, I relax this assumption and study optimal learning of input/output associations with realistic statistics using a linear perceptron with heterogeneous synaptic weights. I introduce a mean-field theory of an analog perceptron in the presence of weight regularization with sign constraints, considering two different statistical models for input and output correlations. I derive its critical capacity in a random association task and study the statistical properties of the optimal synaptic weight vector across a diverse range of parameters.
This work is organized as follows. In the first section, I introduce the framework and provide the general definitions for the problem. I first consider a model of temporal (or, equivalently, "semantic") correlations across input and output patterns, assuming statistical independence across neurons. I show that optimal solutions are insensitive to the fraction of E and I weights, as long as the external bias is learned. I derive the weight distribution and show that it is characterized by a finite fraction of zero weights also in the general case of E-I constraints and correlated signals. The assumption of independence is subsequently relaxed in order to provide a theory that depends on the spectrum of the sample covariance matrix and on the dimensionality of the output signal along the principal components of the input. The implications of these results are discussed in the final section.

Mean-field theory with correlations
Consider the problem of linearly mapping a set of correlated inputs $x_{i\mu}$, with $i = 1,\ldots,N$ and $\mu = 1,\ldots,P$, from $N_E = f_E N$ excitatory (E) and $N_I = (1-f_E)N$ inhibitory (I) neurons, onto an output $y_\mu$ using a synaptic vector $w$, in the presence of a learnable constant bias current $b$ (Fig 1). To account for different statistical properties of E and I input rates, we write the elements of the input matrix as $(X)_{i\mu} = \bar{x}_i + \sigma_i \xi_{i\mu}$, where the mean rate $\bar{x}_i$ takes the values $\bar{x}_{E}$ or $\bar{x}_{I}$ in the two populations, and the same for $\sigma_i$. At this stage, the quantities $\xi_{i\mu}$ have unit variance and are uncorrelated across neurons: $\langle \xi_{i\mu}\xi_{j\nu}\rangle = \delta_{ij} C_{\mu\nu}$. In the following, we refer to $x$ and $y$ as signals and to $\mu$ as a time index, although we consider general "semantic" correlations across the patterns $x_\mu$ [34]. The output signal has average $\langle y_\mu\rangle = \bar{y}$ and variance $\langle (y_\mu - \bar{y})^2\rangle = \sigma_y^2$. We initially consider output signals $y_\mu$ with the same temporal correlations as the input, namely $\langle \delta y_\mu\, \delta y_\nu\rangle = C_{\mu\nu}$, where $\delta y_\mu = (y_\mu - \bar{y})/\sigma_y$. For a given input-output set, we are faced with the problem of minimizing the following regression loss (energy) function:
$$E = \frac{1}{2}\,\bigl|X^T w + b - y\bigr|^2 + \frac{N\gamma}{2}\sum_{i=1}^{N} w_i^2 \qquad (1)$$
The rationale for using a regularization term lies not only in alleviating ill-conditioning due to input correlations, but also in controlling the metabolic cost of synaptic plasticity and transmission. Preliminary numerical experiments showed that the typical vector $w$ that solves this sign-constrained least-squares problem has a squared norm $\sum_{i=1}^N w_i^2 = O(1)$, irrespective of the L2 regularization, as in the special case of i.i.d. input/output and non-negative synaptic weights [15]. Synaptic weights $w_i$ are thus of $O(1/\sqrt{N})$, hence the scaling of the regularization term $N\gamma$ and of the bias current $b = I\sqrt{N}$. In order to consider a well-defined $N\to\infty$ limit for $E$ and for the spectrum of the matrix $C$, we take $P = \alpha N$, with $\alpha$ called the load, as is customary in mean-field analyses of perceptron problems [9].
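To make the setup concrete, here is a minimal numerical sketch of the sign-constrained, regularized regression problem of Eq (1). The network sizes, the rbf time constant and the use of scipy's bounded least-squares solver are illustrative assumptions, not the procedure used to produce the figures in this work.

```python
# Sketch of Eq (1): sign-constrained ridge regression of temporally correlated
# inputs onto a correlated output, with a learnable bias column.
import numpy as np
from scipy.linalg import toeplitz, cholesky
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
N, P, f_E, gamma, tau = 200, 80, 0.8, 0.1, 3.0

# Toeplitz (rbf) temporal covariance C_{mu nu} = exp(-(mu - nu)^2 / 2 tau^2)
C = toeplitz(np.exp(-np.arange(P) ** 2 / (2 * tau ** 2)))
L = cholesky(C + 1e-8 * np.eye(P))                  # C = L^T L (upper-triangular L)

xi = rng.standard_normal((N, P)) @ L                # inputs, correlated in time
y = rng.standard_normal(P) @ L                      # output with the same correlations

# Design matrix: one column per synapse plus an unbounded bias column; the ridge
# term is implemented by stacking sqrt(N * gamma) * I below the data rows.
A = np.hstack([xi.T, np.ones((P, 1))])
A_reg = np.vstack([A, np.sqrt(N * gamma) * np.hstack([np.eye(N), np.zeros((N, 1))])])
y_reg = np.concatenate([y, np.zeros(N)])

# Sign constraints: the first f_E * N weights are excitatory (>= 0), the rest inhibitory (<= 0).
N_E = int(f_E * N)
lb = np.concatenate([np.zeros(N_E), -np.inf * np.ones(N - N_E), [-np.inf]])
ub = np.concatenate([np.inf * np.ones(N_E), np.zeros(N - N_E), [np.inf]])

sol = lsq_linear(A_reg, y_reg, bounds=(lb, ub))
w, b = sol.x[:-1], sol.x[-1]
err = 0.5 * np.sum((xi.T @ w + b - y) ** 2) / N
print(f"error per neuron = {err:.4f}, fraction of zero weights = {np.mean(np.abs(w) < 1e-8):.2f}")
```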
Optimizing with respect to the bias $b$ naturally yields solutions $w$ for which the average excitatory and inhibitory weights $\langle w\rangle_c$, with $c \in \{E, I\}$, are such that the $O(\sqrt{N})$ mean contributions of the E and I populations to the output cancel (Eq (2)). We call this property balance, in that the same scaling is used in balanced state theory of neural circuits [21,22,24]. In order to derive a mean-field description for the typical properties of the learned synaptic vector $w$, we employ a statistical mechanics framework in which the minimizer of $E$ is evaluated after averaging across all possible realizations of the input matrix $X$ and output $y$. To do so, we compute the free energy density
$$f = -\lim_{N\to\infty}\frac{1}{\beta N}\,\bigl\langle \log Z \bigr\rangle_{x,y} \qquad (3)$$
where $Z = \int d\mu(w)\, e^{-\beta E}$ is the so-called partition function and the measure $d\mu(w) \propto \prod_{i\in E}\theta(w_i)\,dw_i \prod_{k\in I}\theta(-w_k)\,dw_k$ implements the sign constraints over the synaptic weight vector $w$. The brackets in Eq (3) stand for the quenched average over all the quantities $x_{i\mu}$ and $y_\mu$, and the inverse temperature $\beta$ will allow us to select weight configurations $w$ that minimize the energy $E$. The free energy density $f$ acts as a generating function from which all the statistical quantities of interest can be calculated by appropriate differentiation and by taking the $\beta\to\infty$ limit. In particular, we will be interested in the (normalized) average loss $\epsilon = \frac{\langle E\rangle}{N}$ and the error $\epsilon_{\mathrm{err}} = \frac{1}{2N}\langle |X^T w + b - y|^2\rangle$, corresponding to the average value of the first term in Eq (1), where $b$ is a $P$-dimensional vector containing $b$ in every element. The average in Eq (3) can be computed in the $N\to\infty$ limit with the help of the replica method, an analytical continuation technique that entails the introduction of a number $n$ of formal replicas of the vector $w$. A general expression for $f$ can be obtained in the large $N$ limit using the saddle-point method. The crucial quantity in our derivation is the (replicated) cumulant generating function $Z_{\xi,\delta y}$ for the (mean-removed) input $x$ and output $y$, which can be easily expressed as a function of the eigenvalues $\lambda_\mu$, $\mu = 1,\ldots,\alpha N$, of the covariance matrix $C$, plus a set of order parameters to be evaluated self-consistently (Methods).

Critical capacity
The existence of weight vectors $w$ with a certain value of the regression loss $E$ in the error regime ($\epsilon > 0$) is described by the so-called overlap order parameter $\Delta q_w$. In the replica-based derivation of the mean-field theory, overlap parameters are introduced with the purpose of decoupling the $w_i$'s over the index $i$, and represent the scalar product of two different configurations of the weights $w$ (Methods: Replica formalism: ensemble covariance matrix (EC)). For finite $\beta$, $\Delta q_w$ is $\beta$ times the variance of the synaptic weights across different solutions. In the asymptotic limit $\beta\to\infty$ of Eq (3), a simple saddle-point equation for $\Delta q_w$ (Eq (4)) can be derived when $b$ is chosen to minimize Eq (1), where $\rho(\lambda)$ is the distribution of eigenvalues of $C$.
In the absence of weight regularization ($\gamma = 0$), we define the critical capacity $\alpha_c$ as the maximal load $\alpha = P/N$ for which the patterns $x_\mu$ can be correctly mapped to their outputs $y_\mu$ with zero error. When the synaptic weights are not sign-constrained, the critical capacity is obviously $\alpha_c = 1$, since the matrix $X$ is typically full rank. In the sign-constrained case, $\alpha_c$ is found to be the minimal value of $\alpha$ such that Eq (4) is satisfied for $0 < \Delta q_w < +\infty$. Noting that the left-hand side in Eq (4) is a non-decreasing function of $\Delta q_w$ with an asymptote in $\alpha$, the order parameter $\Delta q_w$ goes to $+\infty$ as the critical capacity is approached from the right. We thus find for $\gamma = 0$ the surprisingly simple result $\alpha_c = 1/2$. As shown in Fig 2A in the case of i.i.d. $x$ and $y$, the loss has a sharp increase at $\alpha = 0.5$. This holds irrespective of the structure of the covariance matrix $C$ and the ratio of excitatory weights $f_E$. In Fig 2A, we also show the average minimal loss $\epsilon$ for increasing values of the regularization parameter $\gamma$.
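A rough empirical probe of this prediction can be obtained by sweeping the load and recording the training error of the sign-constrained solution with a learnable bias. The sketch below assumes i.i.d. Gaussian inputs and outputs, no regularization, and an off-the-shelf bounded least-squares solver; sizes and the grid of loads are arbitrary choices.

```python
# Empirical probe of the critical capacity alpha_c = 0.5: for gamma = 0 and
# i.i.d. inputs/outputs, the sign-constrained training error should stay near
# zero below alpha = 0.5 and grow above it.
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(1)
N, f_E = 300, 0.5
N_E = int(f_E * N)
lb = np.concatenate([np.zeros(N_E), -np.inf * np.ones(N - N_E), [-np.inf]])
ub = np.concatenate([np.inf * np.ones(N_E), np.zeros(N - N_E), [np.inf]])

for alpha in [0.3, 0.45, 0.5, 0.55, 0.7]:
    P = int(alpha * N)
    X = rng.standard_normal((N, P))
    y = rng.standard_normal(P)
    A = np.hstack([X.T, np.ones((P, 1))])        # last column: learnable bias
    sol = lsq_linear(A, y, bounds=(lb, ub))
    err = 0.5 * np.sum((A @ sol.x - y) ** 2) / N
    print(f"alpha = {alpha:.2f}  error per neuron = {err:.5f}")
```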
In [15], the authors showed that, in the case with excitatory synapses only and uncorrelated inputs and outputs, $\alpha_c$ approaches 0.5 in a particular limit of the input and output statistics, and analyzed which conditions on input and output statistics maximize capacity. Here we take a complementary approach, where the $x$ and $y$ statistics are fixed and capacity is optimized within the error regime, so that the optimal bias $b = I\sqrt{N}$ is well defined in terms of minimizing $\langle E\rangle$ at any load $\alpha$. The bias optimization leads to a massive simplification of the saddle-point equations and makes results independent of the E/I ratio and the input/output statistics (Methods: EC, Saddle-point equations). One may observe that, in the particular case studied by [15], $\alpha_c$ is maximal for very large $I$, due to the divergence of the norm of $w$ at critical capacity for an optimal bias in the absence of regularization.
The independence of our results with respect to the E/I ratio for an optimal bias current signals a local gauge invariance, as observed by [37,38] for a sign-constrained binary perceptron. Indeed, calling $g_i = \mathrm{sign}\, w_i$, we can write the mean-removed output as $\sum_{i=1}^N g_i |w_i| \sigma_i \xi_{i\mu}$ and redefine the $\xi$'s as $g_i \xi_{i\mu}$, without changing their occurrence probability. This establishes an equivalence to a linear perceptron with non-negative weights (see [37] for more details), once the mean contribution has been removed. Any residual dependence of $\alpha_c$ or $\epsilon$ on external parameters must therefore be ascribed to the volume of weights satisfying Eq (2), for a suboptimal external current $b$.
For a generic value of the bias current $b$, there are strong deviations from the condition in Eq (2). In Fig 2B, we compare the value of the average output $\bar{y}$ with its two contributions $\bar{h}$ and $c$, such that $\bar{y} = \bar{h} + c$. The quantity $c$ measures weight-rate correlations that are responsible for the cancellation of the $O(\sqrt{N})$ bias. The deviation from Eq (2), shown here for a rapidly decaying covariance of the form $C_{\mu\nu} = e^{-(\mu-\nu)^2/2\tau^2}$, has been previously described in the context of a target-based learning algorithm used to build E-I-separated rate and spiking models of neural circuits capable of solving input/output tasks [3]. In this approach, a randomly initialized recurrent network $n_T$ is driven by a low dimensional signal $z$. Its currents are then used as targets to train the synaptic couplings of a second (rate or spiking) network $n_S$, in such a way that the desired output $z$ can later be linearly decoded from the self-sustained activity of $n_S$. Each neuron of $n_S$ has to independently learn an input/output mapping from firing rates $x$ to currents $y$, using an on-line sign-constrained least-squares method. In the presence of an L2 regularization and a constant $b \propto \sqrt{N}$ external current, the on-line learning method typically converges onto a solution for the recurrent synaptic weights for which Eq (2) does not hold. As also shown in [3], in the peculiar case of a self-sustained periodic dynamics (in which case off-diagonal terms of the covariance matrix $C_{\mu\nu}$ do not vanish for large $\mu$ or $\nu$), the two contributions $\bar{h}$ and $c$ scale approximately like $\sqrt{N}$ and cancel each other to produce an $O(1)$ total average output $\bar{y} = \bar{h} + c$. In the effort to build heterogeneous functional network models, the emergence of synaptic connectivity compatible with the balanced scaling thus depends on the statistics of incoming currents. Ad-hoc regularization can be avoided by adjusting the external currents onto each neuron.
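A minimal sketch of an on-line sign-constrained least-squares scheme of the kind referred to above is given next: stochastic gradient descent on the regularized loss of Eq (1), followed at each step by a projection onto the sign constraints. The network sizes, learning rate and fixed external current are illustrative choices, not the parameters used in [3].

```python
# Projected stochastic gradient descent for sign-constrained regression.
import numpy as np

rng = np.random.default_rng(6)
N, P, f_E, gamma, lr, b = 200, 60, 0.8, 0.05, 0.002, 0.5
N_E = int(f_E * N)
sign = np.concatenate([np.ones(N_E), -np.ones(N - N_E)])   # E synapses first

X = rng.standard_normal((N, P))          # firing rates
y = rng.standard_normal(P)               # target currents
w = 0.01 * sign / np.sqrt(N)             # initialize inside the feasible cone

for epoch in range(300):
    for mu in rng.permutation(P):
        err = X[:, mu] @ w + b - y[mu]
        w -= lr * (err * X[:, mu] + gamma * N * w / P)     # SGD step on Eq (1)
        w = sign * np.maximum(sign * w, 0.0)               # enforce sign constraints

print("final error per neuron:", 0.5 * np.sum((X.T @ w + b - y) ** 2) / N)
```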

Power spectrum and synaptic distribution
The theory developed thus far applies to a generic covariance matrix $C$. To connect the spectral properties of $C$ with the signal dynamics, we further assume the $x_{i\mu}$ to be $N$ independent stationary discrete-time processes. In this case, $C_{\mu\nu} = C(\mu-\nu)$ is a matrix of Toeplitz type [39], leading to an expression for the average minimal loss density in the $N\to\infty$ limit in terms of the average power spectrum $\lambda(\phi)$ of the $x$ and $y$ stochastic processes, with $\Delta q_w$ given by Eq (4). The function $\lambda(\phi)$ can be computed exactly in some cases (Methods: Power spectrum and synaptic distribution). Fig 3 shows two representative input signals with Gaussian and exponential covariance matrix $C$ (Fig 3A) and a comparison between the average power spectrum of the input and the analytical results for the eigenvalue spectrum of the matrix $C$ (Fig 3B). From now on, we use the terms Gaussian or rbf (radial basis function) interchangeably to denote the un-normalized Gaussian function $C_{\mu\nu} = e^{-(\mu-\nu)^2/2\tau^2}$. As shown in Fig 4A in the case of input $x$ and output $y$ with rbf covariance, the squared norm of the optimal synaptic vector $w$ (red curve) is in general a non-monotonic function of $\alpha$, its maximum being attained at larger values of $\alpha$ as the time constant $\tau$ increases. We also show the minimal loss density $\epsilon$ and the mean error $\epsilon_{\mathrm{err}}$ for $\gamma = 0.1$. The curves in Fig 4A are the same for any ratio $f_E$: the use of an optimal bias current $b$ cancels any asymmetry between the E and I populations. For a finite $\gamma$, the minimal average loss $\epsilon$ for a given $f_E$ decreases as either $\sigma_E$ or $\sigma_I$ increases. For a given set of parameters $f_E$ and $\gamma$, the optimal bias $b$ will in general depend on the load $\alpha$ and the structure of the covariance matrix $C$, as shown in Fig 4B. Using the same analytical machinery employed for the calculation of the free energy Eq (3), the probability distribution of the typical weight $w_i$ can be easily derived. This can be seen by employing a variant of the replica trick (Methods: Distribution of synaptic weights) that links the so-called entropic part of $f$ to $\langle p(w_i)\rangle$, expressed in terms of the saddle-point values of the same (conjugated) overlap parameters employed thus far. Interestingly, the optimal bias $b$ implies that half of the synapses are zero, irrespective of $f_E$ and the properties of the covariance matrix $C$. The probability density of the synaptic weights is composed of two truncated Gaussian densities with zero mean for the E and I components, plus a finite fraction $p_0 = 0.5$ of zero weights.
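The correspondence between the eigenvalues of a stationary Toeplitz covariance and the power spectrum of the process (Szegő's theorem) can be checked numerically. The sketch below uses the rbf covariance with an illustrative time constant; the spectrum is obtained by a direct Fourier sum of the covariance sequence rather than from a closed form.

```python
# Toeplitz eigenvalues vs. power spectrum for the rbf covariance C(k) = exp(-k^2 / 2 tau^2).
import numpy as np
from scipy.linalg import toeplitz

P, tau = 1024, 4.0
c = np.exp(-np.arange(P) ** 2 / (2 * tau ** 2))        # covariance sequence c(0), ..., c(P-1)
eig = np.sort(np.linalg.eigvalsh(toeplitz(c)))         # eigenvalues of the P x P Toeplitz matrix

# Power spectrum lambda(phi) = c(0) + 2 sum_{k>0} c(k) cos(k phi), sampled at P frequencies.
phi = 2 * np.pi * np.arange(P) / P
k = np.arange(1, P)
spectrum = np.sort(c[0] + 2 * (c[1:] @ np.cos(np.outer(k, phi))))

print("largest eigenvalue:", eig[-1], " spectrum max:", spectrum[-1])
print("max |sorted eig - sorted spectrum|:", np.max(np.abs(eig - spectrum)))
```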
We show in Fig 4C the shape of the optimal weight distribution for a linear perceptron with 80% excitatory synapses, trained on exponentially correlated $x$ and $y$ and with a ratio $\sigma_I/\sigma_E = 2$. It is interesting to note that, in the presence of an optimal external current, neither the means of the Gaussian components nor the fraction of silent synapses depend on the specific properties of the input and output signals.
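The shape of this density can be rendered with a short sampler. In the sketch below, the standard deviations of the two truncated Gaussian components are free parameters chosen for illustration; in the full theory they are fixed by the saddle-point equations (Methods: Distribution of synaptic weights).

```python
# Sampler for the mean-field weight density with an optimal bias: a delta at
# zero with weight 1/2 plus zero-mean truncated Gaussians for the E and I parts.
import numpy as np

def sample_rescaled_weights(n, f_E=0.8, sigma_E=1.0, sigma_I=0.5, rng=None):
    """Draw sqrt(N) * w_i from the delta + truncated-Gaussian mixture."""
    rng = rng or np.random.default_rng()
    n_E = int(f_E * n)
    signs = np.concatenate([np.ones(n_E), -np.ones(n - n_E)])
    scales = np.concatenate([sigma_E * np.ones(n_E), sigma_I * np.ones(n - n_E)])
    silent = rng.random(n) < 0.5                          # p_0 = 1/2 silent synapses
    w = signs * scales * np.abs(rng.standard_normal(n))   # |N(0, sigma)| with the E/I sign
    w[silent] = 0.0
    return w

w = sample_rescaled_weights(10_000)
print("fraction silent:", np.mean(w == 0.0), " fraction excitatory (> 0):", np.mean(w > 0))
```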
The shape of the synaptic distribution appeared in previous studies of both the binary [8,11,13] and the linear perceptron [15]. In the linear case with only excitatory synapses [15], for a fixed bias $b \propto \sqrt{N}$, the fraction of zero E weights is larger than 0.5 at criticality, and it generally depends on the input parameters and on the load in the error region $\alpha \geq \alpha_c$. Let us also mention that a similar property is also apparent in the binary perceptron, where the scale of the typical solutions is set by the robustness [13] to input and output noise. For weights $w_i = O(1/\sqrt{N})$, the sparsity of critical solutions generically depends on the properties of E and I inputs. For weights of $O(1/N)$, robust solutions have a fraction of zero E weights generically larger than 0.5 [6,11]. When inhibitory synapses are added, their weights are less sparse [11]. Interestingly, in the case without robustness, half of the E and I weights are zero at critical capacity for all $f_E \geq 0.5$.
The dynamic properties of input/output mappings affect the shape of the weight distribution in a computable manner. As an example, in a linear perceptron with non-negative synapses, the explicit dependence of the variance of the weights on the input and output autocorrelation time constant is shown in Fig 5A for various loads α. Previous work considered an analog perceptron with purely excitatory weights as a model for the graded rate response of Purkinje cells in the cerebellum [15]. In the presence of heterogeneity of synaptic properties across cells, a larger variance in their synaptic distribution is expected to be correlated with high frequency temporal fluctuations in input currents. Analogously, the auto-correlation of the typical signals being processed sets the value of the constant external current that a neuron must receive in order to optimize its capacity.
When the input and output have different covariance matrices $C_x \neq C_y$, a joint diagonalization is not possible in general (Methods: EC, Energetic part). We can nevertheless write an expression (Eq (23)) that holds when input and output patterns are defined on a ring (with periodic boundary conditions) and use it as an approximation for the general case. Fig 5B shows good agreement between numerical experiments and theoretical predictions for the error $\epsilon_{\mathrm{err}}$ and the squared norm of the synaptic weight vector $w$, when input and output processes have two different time constants $\tau_x$ and $\tau_y$.

Sample covariance and dimensionality
In the discussion thus far, we assumed independence across the "spatial" index $i$ in the input. It is often the case for input signals to be confined to a manifold of dimension smaller than $N$, a feature that can be described by various dimensionality measures, some of which rely on principal component analysis [40,41]. In order to relax the independence assumption, we build on a framework originally introduced in the theory of spin glasses with orthogonal couplings [42][43][44] and further developed in the context of adaptive Thouless-Anderson-Palmer (TAP) equations [45][46][47]. In the TAP formalism, a set of mean-field equations is derived for a given instance of the random couplings (in our case, for a fixed input/output set). In its adaptive generalization [46], the structure of the TAP equations depends on the specific data distribution, in such a way that averaging the equations over the random couplings yields the same results as the replica approach. Here, following previous work in the context of information theory of linear vector channels and binary perceptrons [48][49][50][51], we employ an expression valid for an ensemble of rectangular random matrices and use the replica method to average over the input $X$ and output $y$.
Let us write the input matrix $(X)_{i\mu} = \bar{x}_i + \sigma_i \xi_{i\mu}$, with $\xi = USV^T$, $S$ being the matrix of singular values. To analyze the properties of the typical case, we start from a generic singular value spectrum $S$ and consider i.i.d. outputs $y_\mu$. In calculating the cumulant generating function $Z_{\xi,\delta y}$, we perform a homogeneous average across the left and right principal components $U$ and $V$ (Methods: SC, Energetic part). Calling $\rho_{\xi\xi^T}(\lambda)$ the eigenvalue distribution of the sample covariance matrix $\xi\xi^T$, we can express $Z_{\xi,\delta y}$ in terms of a function $G_{\xi,\delta y}$ of an enlarged set of overlap parameters, which depends on the so-called Shannon transform [52] of $\rho_{\xi\xi^T}(\lambda)$, a quantity that measures the capacity of linear vector channels. The resulting self-consistent equations, which describe the statistical properties of the synaptic weights $w_i$, are expressed in terms of the Stieltjes transform of $\rho_{\xi\xi^T}(\lambda)$, an important tool in random matrix theory [53] (Methods: SC, Saddle-point equations).
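For readers unfamiliar with these transforms, the snippet below evaluates both on an empirical spectrum of a sample covariance matrix; the Gaussian test matrix and the evaluation points are placeholders, and the sign convention of the Stieltjes transform is one of the common ones.

```python
# Stieltjes transform S(z) = <1/(lambda - z)> and Shannon transform V(g) = <log(1 + g lambda)>
# of the empirical spectrum of a sample covariance matrix xi xi^T.
import numpy as np

rng = np.random.default_rng(3)
N, P = 400, 200
xi = rng.standard_normal((N, P)) / np.sqrt(P)
lam = np.linalg.eigvalsh(xi @ xi.T)          # eigenvalues of the sample covariance

def stieltjes(z, eigs):
    return np.mean(1.0 / (eigs - z))

def shannon(g, eigs):
    return np.mean(np.log1p(g * eigs))

print("S(-1) =", stieltjes(-1.0, lam), " V(1) =", shannon(1.0, lam))
```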
We show the validity of the mean-field approach by employing two different data models for the input signals. In the first example, valid for $\alpha \leq 1$, all the $P$ vectors $\xi_\mu$ are orthogonal to each other. This yields an eigenvalue distribution of the simple form $\rho(\lambda) = \alpha\,\delta(\lambda - 1) + (1-\alpha)\,\delta(\lambda)$, for which the function $G_{\xi,\delta y}$ can be computed explicitly [51]. Additionally, we use a synthetic model where we explicitly set the singular value spectrum of $\xi$ to be $s(a) = \chi\, e^{-a^2/2\sigma_x^2}$, with $\chi$ a normalization factor ensuring that the matrix $\xi$ has unit variance. The shape of the singular value spectrum $s$ controls the spread of the data points $\xi_\mu$ in the $N$-dimensional input space, as shown in Fig 6A. As shown in Fig 6B for i.i.d. Gaussian output, learning degrades as $\sigma_x$ decreases, since inputs tend to be confined to a lower dimensional subspace rather than being equally distributed along input dimensions.
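A sketch of this synthetic input model follows: a Gaussian profile of singular values, random orthogonal factors, and an overall normalization fixing the entrywise variance to one. All sizes and the value of $\sigma_x$ are illustrative.

```python
# Build xi = U S V^T with prescribed singular-value profile s(a) ~ exp(-a^2 / 2 sigma_x^2).
import numpy as np
from scipy.stats import ortho_group

N, P, sigma_x = 200, 100, 30.0

s = np.exp(-np.arange(P) ** 2 / (2 * sigma_x ** 2))    # decaying singular values
U = ortho_group.rvs(N, random_state=0)[:, :P]          # N x P, orthonormal columns
V = ortho_group.rvs(P, random_state=1)
xi = (U * s) @ V.T                                     # U diag(s) V^T
xi /= xi.std()                                         # chi: unit entrywise variance

# Smaller sigma_x confines the P patterns to fewer input directions.
lam = np.linalg.eigvalsh(xi @ xi.T)
pr = lam.sum() ** 2 / np.sum(lam ** 2)
print("participation ratio of the sample-covariance spectrum:", pr)
```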
For $N$ large enough (in practice, for $N \gtrsim 500$), the statistics of single cases is well captured by the equations for the average case (self-averaging). To get a mean-field description for a single case, where a given input matrix $X$ is used, we further assume we have access to the linear expansion $c_\mu$ of the output $y$ in the set $\{v_\mu\}$ of the columns of the $V$ matrix, namely $y = \bar{y} + \sigma_y V c$. The calculation can be carried out in a similar way and yields an expression (Eq (6)) for the average regression loss. The average in Eq (6) is computed over the eigenvalues $\lambda^x$ of the sample covariance matrix, which correspond to the PCA variances, and over $\lambda^y_\mu = c_\mu^2$ (Methods: SC, Energetic part). The quantity $\tilde{\Lambda}_w$ can be computed from a set of self-consistent equations that link the order parameter $\Delta q_w$ and the first two moments of the synaptic distribution. To better understand the role of the parameter $\tilde{\Lambda}_w$, it is instructive to compare Eq (6) with the corresponding result for unconstrained weights, which can be derived from the pseudo-inverse solution $w^* = (\xi\xi^T + \gamma)^{-1}\xi\, y$ (Methods: SC, i.i.d. and unconstrained cases). Comparing the average losses in the two cases (Eqs (7) and (8)), we find that $\tilde{\Lambda}_w$ acts as an implicit regularization in the sign-constrained case. The mean-field theory is thus carried out through a diagonalization over independent contributions along the components $v_\mu$, with prescribed input and output variances $\lambda^x$ and $\lambda^y$, respectively. The coupling between different components, induced by the averages $\langle\cdot\rangle_{x,y}$ and the sign constraints, is incorporated in the effective regularization $\tilde{\Lambda}_w$, acting on each component equally, which depends only on the structure of the input $x$ (see Eqs (56) and (67) in Methods).
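The unconstrained baseline referred to above is straightforward to compute directly. The sketch below evaluates the ridge/pseudo-inverse solution and its quadratic loss; the data generation and the normalization of the loss are simplified placeholders for the scalings used in the text.

```python
# Unconstrained ridge baseline: w* = (xi xi^T + gamma I)^{-1} xi y and its loss.
import numpy as np

rng = np.random.default_rng(5)
N, P, gamma = 300, 150, 0.1
xi = rng.standard_normal((N, P)) / np.sqrt(N)
y = rng.standard_normal(P)

w_star = np.linalg.solve(xi @ xi.T + gamma * np.eye(N), xi @ y)
loss = (0.5 * np.sum((xi.T @ w_star - y) ** 2) + 0.5 * gamma * np.sum(w_star ** 2)) / N
print("average loss per neuron (unconstrained):", loss)
```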
In Fig 6C, we show results when the dimensionality of the output $y$ along the (temporal) components of the input is modulated by taking $c(a) = e^{-a^2/2\sigma_y^2}$. The perceptron performance improves as the output signal spreads out across multiple components $v_\mu$. The case of i.i.d. output is recovered by taking $c_\mu = 1$.

Discussion
In this work, I investigated the properties of optimal solutions of a linear perceptron with sign-constrained synapses and correlated input/output signals, thus providing a general mean-field theory for constrained regression in the presence of correlations. I treated both the case of known ensemble covariances and the case where the sample covariance is given. The latter approach, built on a rotational invariance assumption, allowed me to link the regression performance to the input and output statistical properties expressed by principal component analysis.
I provided the general expression of the weight distribution for regularized regression and found that half of the weights are set to zero, irrespective of the fraction of excitatory weights, provided the bias is optimized. The shape of the synaptic distribution has been previously described in the binary perceptron with independent inputs at critical capacity, as well as in the theory of compressed sensing [54]. I elucidated the role of the optimal bias current and its relation to the optimal capacity and the scaling of the solution weights. This analysis also sheds light on the structural properties of synaptic matrices that emerge when target-based methods are used for building biologically plausible functional models of rate and spiking networks.
The theory presented in this work is relevant to the effort of establishing quantitative comparisons between the synaptic profile of neural circuits involved in temporal processing of dynamic signals, such as the cerebellum [55][56][57], and normative theories that take into account the temporal and geometrical complexity of computational tasks. On the other hand, the construction of progressively more biologically plausible models of neural circuits calls for normative theories of learning in heterogeneous networks, which can be coupled to dynamic mean-field analyses of E-I separated circuits [24,25,58].
As shown in this work, the interaction between the correlational structure of input signals, the synaptic metabolic cost and the constant external current shapes the distribution of synaptic weights. In this respect, the results presented here offer a first approximation (static linear input-output associations) to account for the heterogeneity of the fraction of E and I inputs to single cells in local circuits. Even though a heterogeneous linear neuron is capable of memorizing N/2 associations without error for any E/I ratio, the optimal bias does depend on $f_E$, its minimal value being attained for $f_E = 0.5$. The input current in turn sets the neuron's operating regime and its input/output properties. Moreover, trading memorization accuracy (small output error $\epsilon_{\mathrm{err}}$) for smaller weights (small $|w|^2$) could be beneficial when synaptic costs are considered ($\gamma > 0$). It is therefore likely that, for an optimality principle of the 80/20 ratio to emerge from purely representational considerations, dynamical and metabolic effects should be examined together.
The importance of a theory of constrained regression with realistic input/output statistics goes beyond the realm of neuroscience. Non-negativity is commonly required to provide interpretable results in a wide variety of inference and learning problems. Off-line and on-line least-square estimation methods [59,60] are also of great practical importance in adaptive control applications, where constraints on the parameter range are usually imposed by physical plausibility.
In this work, I assumed statistical independence between inputs and outputs. For the sake of biological plausibility, it would be interesting to consider more general input-output correlations for regression and binary discrimination tasks. The classical model for such correlations is provided by the so-called teacher-student (TS) approach [61], where the output y is generated by a deterministic parameter-dependent transformation of the input x, with a structure similar to the trained neural architecture. The problem of input/output correlations is deeply related to the issue of optimal random nonlinear expansion both in statistical learning theory [62,63] and theoretical neuroscience [41,64], with a history dating back to the Marr-Albus theory of pattern separation in cerebellum [65]. In a recent work, [28] introduced a promising generalization of TS, in which labels are generated via a low-dimensional latent representation, and it was shown that this model captures the training dynamics in deep networks with real world datasets.
A general analysis that fully takes into account spatio-temporal correlations in network models could shed light on the emergence of specific network motifs during training. In networks with non-linear dynamics, the mathematical treatment quickly gets challenging even for simple learning rules. In recent years, interesting work has been done to clarify the relation between learning and network motifs, using a variety of mean-field approaches. Examples are the study of associative learning in spin models [8] and the analysis of motif dynamics for simple learning rules in spiking networks [66]. Incorporating both the temporal aspects of learning and neural cross-correlations in E-I separated models with realistic input/output structure is an interesting topic for future work.

Replica formalism: Ensemble covariance matrix (EC)
Using the replica formalism [67], the free energy density is written as
$$-\beta N f = \bigl\langle \log Z \bigr\rangle_{x,y} = \lim_{n\to 0}\frac{\bigl\langle Z^n\bigr\rangle_{x,y} - 1}{n}.$$
The average $\langle Z^n\rangle$ can be computed by considering a finite number $n$ of replicas of the vector $w$ and subsequently taking a continuation to $n \in \mathbb{R}$. The introduction of $n$ replicas allows one to factorize $\langle Z^n\rangle_{x,y}$ over individual weights $w_i$, at the cost of coupling different replicas after the averages over $x$ and $y$ are performed. Introducing a small set of overlap order parameters, factorization across replicas is restored, so that in the large $N$ limit the replicated partition function takes the form $\langle Z^n\rangle_{x,y} = e^{-\beta N n f}$. In the following, we will usually drop the subscript in the average $\langle\cdot\rangle_{x,y}$.
To simplify the formulas, we introduce the $O(1)$ weights $J_i = \sigma_i \sqrt{N}\, w_i$. In terms of these rescaled variables, the loss function in Eq (1) can be rewritten in terms of the mean-removed input $\xi$ and output $\delta y$. We proceed by inserting the definitions of the order parameters $M_a$ and $\Delta_{\mu a} = \frac{1}{\sqrt{N}}\sum_i \xi_{i\mu} J_{ia} - \sigma_y\, \delta y_\mu$ with the aid of appropriate $\delta$ functions. The averaged replicated partition function $\langle Z^n\rangle$ is given in Eq (10), where the replicated cumulant generating function $Z_{\xi,\delta y}$ is defined in Eq (11). In Eq (10), we used a Fourier expansion of the $\delta$ functions and introduced the real variables $u_{\mu a}$ as conjugate variables for $\Delta_{\mu a}$. Analogously, we employed the purely imaginary $\hat{M}_a$ for the variables $M_a$. Once the average is carried out, second cumulants of $\xi$ and $\delta y$ get coupled to replica mixing terms of the form $J_{ia}J_{ib}$, which can be dealt with by introducing appropriate overlap order parameters $N q^{ab}_w = \sum_{i=1}^N J_{ia}J_{ib}$ with the use of $n(n+1)/2$ additional $\delta$ functions, together with their conjugate variables $\hat{q}^{ab}_w$. Cumulants of higher order will not contribute to the expression in the large $N$ limit. Expanding the $\delta$ functions for the overlap parameters, we obtain an expression in which the two contributions $G_E$ and $G_S$, respectively called the energetic and entropic part, can be calculated separately, as done in the following for ease of exposition. Owing to the convexity of the regression problem, we use a Replica Symmetric (RS) [67] ansatz $q^{ab}_w = q_w + \delta_{ab}\Delta q_w$ and $M_a = M$.
EC, Entropic part. The total volume of configurations $w_a$ for fixed values of the overlap parameters is given by the entropic part $G_S$. Using the RS ansatz $\hat{q}^{ab}_w = \hat{q}_w - \delta_{ab}\,\Delta\hat{q}_w$ and $\hat{M}_a = \hat{M}$, together with the explicit definition of the measure $d\mu(J) \propto \prod_{i\in E}\theta(J_i)\,dJ_i \prod_{k\in I}\theta(-J_k)\,dJ_k$, one obtains, up to constant terms, an expression in which we introduced the notations $f_I = 1 - f_E$ and $s_E = -s_I = 1$. In order to disentangle the term $\sum_{ab} J_a J_b = (\sum_a J_a)^2$, we employ the so-called Hubbard-Stratonovich transformation $e^{x^2/2} = \int \frac{dz}{\sqrt{2\pi}}\, e^{-z^2/2 + z x}$. Taking the limit $n\to 0$ yields the final expression for $G_S$.
EC, Energetic part. In order to compute the energetic part, we first need to evaluate the average with respect to $\xi$ and $\delta y$ in Eq (11). Performing the two Gaussian integrals and the translation $\Delta_{\mu a} + M_a - \bar{y} \to \Delta_{\mu a}$, we obtain an expression for $G_E$. In the special case $C_x = C_y \equiv C$, we can use $C = V\Lambda V^T$ to jointly rotate $\Delta_a \to V\Delta_a$ and $u_a \to V u_a$, thus leaving scalar products invariant. By doing so, we obtain, within the RS ansatz, an expression involving the quantities $z_\mu = \sum_\nu V_{\mu\nu}$. Using a Hubbard-Stratonovich transformation on the term $\sum_{ab} u_{\mu a} u_{\mu b}$, and observing that the free energy only depends on $M$ through the term $(M-\bar{y})^2$ in $G_E$, we conveniently eliminate the quantities $z_\mu$ using a simple saddle-point relation, thus getting Eq (22) after some algebra. The brackets $\langle\cdot\rangle_\lambda$ in Eq (22) stand for an average over the eigenvalue distribution $\rho(\lambda)$ of $C$ in the $N\to\infty$ limit, assuming self-averaging. A similar expression for $G_E$ was previously derived in [34] for spherical weights, i.e. $\sum_{i=1}^N w_i^2 = 1$, in the presence of outputs $y_\mu$ generated by a teacher linear perceptron. To map Eq (45) in [34] to Eq (22), one substitutes $(1-q) \to \Delta q_w$ (observing that $q_{aa} = 1$ thanks to the spherical constraint) and sets $R = 0$, since the learning task only involves pattern memorization.
When $C_x \neq C_y$, we can derive a similar expression under the assumption of a ring topology in pattern space (corresponding to periodic boundary conditions in the index $\mu$): in this case, both covariance matrices are circulant and can be jointly diagonalized by the discrete Fourier transform, leading to Eq (23). In the limit $\beta\to\infty$, the order parameters and their conjugates are rescaled with appropriate powers of $\beta$ (Eqs (30)-(33)), while $q_w = O(1)$. In this scaling, Eqs (25)-(27), together with the two remaining saddle-point equations, take the form of Eqs (34)-(38), where $G(x) = \frac{e^{-x^2/2}}{\sqrt{2\pi}}$ and $H(x) = \int_x^{\infty} Dz$. Optimizing $f$ with respect to the bias $b = I\sqrt{N}$ immediately implies $B = 0$, by virtue of Eq (33), and greatly simplifies the saddle-point equations. Using the scaling assumptions Eqs (30)-(33) together with the saddle-point Eqs (34)-(38), we get Eq (4) in the main text, which is valid for any $\alpha$ when $\gamma > 0$. In the unregularized case ($\gamma = 0$), it describes solutions in the error regime $\alpha > \alpha_c$. The optimal bias $b = I\sqrt{N}$ can be computed using Eq (36), which is valid up to an $O(1)$ term equal to $\bar{y}$ (Fig 4B). Keeping only the leading terms in the limit $\beta\to\infty$, Eq (24) can be written in the simplified form of Eq (39). From the definition of the free energy density $-\beta N f = \langle \log \int d\mu(w)\, e^{-\beta E}\rangle$, one has that $\frac{\langle E\rangle}{N} = \partial_\beta(\beta f)$. Using Eq (39) and the relevant saddle-point equations, one obtains the expression for the average minimal energy density $\epsilon$. Also, noting that $\partial_\gamma E = \frac{N}{2}\sum_{i=1}^N w_i^2$, we can compute the average squared norm of the weights $v = \sum_{i=1}^N \langle w_i^2\rangle$ as $v = 2\,\partial_\gamma f$. The error $\epsilon_{\mathrm{err}} = \frac{1}{2N}\langle|X^T w + b - y|^2\rangle$ can then be computed as $\epsilon_{\mathrm{err}} = \epsilon - \frac{\gamma}{2}\,v$.
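The last relation follows directly from the form of the energy in Eq (1) and the definitions of $\epsilon$ and $v$ given above; written out explicitly:

```latex
% Splitting the normalized average loss into error and regularization terms:
\begin{equation}
  \epsilon \;=\; \frac{\langle E \rangle}{N}
  \;=\; \underbrace{\frac{1}{2N}\,\bigl\langle\, \lvert X^{T} w + b - y \rvert^{2} \bigr\rangle}_{\displaystyle \epsilon_{\mathrm{err}}}
  \;+\; \frac{\gamma}{2}\,\underbrace{\sum_{i=1}^{N} \bigl\langle w_i^{2} \bigr\rangle}_{\displaystyle v}
  \qquad\Longrightarrow\qquad
  \epsilon_{\mathrm{err}} \;=\; \epsilon \;-\; \frac{\gamma}{2}\, v .
\end{equation}
```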

Distribution of synaptic weights
The synaptic weight distribution appearing in Eqs (28) and (29) can be obtained using a variant of the replica trick [6,67]. Using the expression $Z^{-1} = \lim_{n\to 0} Z^{n-1}$, the density of excitatory weights can be written as an average in which we single out the first E weight in the first replica, $w_{11}$, without loss of generality. The calculation proceeds along the same lines as for the entropic part above, since the energetic part does not depend on $w_a$ explicitly. Isolating the first replica and taking the limit $n\to 0$, one gets the expression for $\langle p(w_i)\rangle$, and analogously for the I weights. This expression holds for uncorrelated inputs and outputs and any fixed bias $b$, as well as for any correlated $x$ and $y$ with optimal bias $b$, where deviations from Eq (2) do not occur. In the $\beta\to\infty$ limit, using the scaling relations Eqs (30)-(33), it can be easily shown that the mean-field probability density of the rescaled weights $\sqrt{N}\, w_i$ is a superposition of a $\delta$ function in zero and two truncated Gaussian densities $G(\cdot\,; M, \Sigma)$, whose means and standard deviations are determined by the saddle-point order parameters. This weight density is valid for $\gamma > 0$ at any $\alpha$, and at critical capacity for $\gamma = 0$. The fraction of zero weights $p_0$ follows from the normalization of the truncated Gaussian components.

Spectrum of exponential and rbf covariance
For the exponential covariance $C_{\mu\nu} = e^{-|\mu-\nu|/\tau}$, the spectrum $\lambda(\phi)$ has a simple closed form [33]. In the rbf case $C_{\mu\nu} = e^{-(\mu-\nu)^2/2\tau^2}$, the spectrum can be computed by Fourier series [39], yielding the Jacobi theta function of the third kind.
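For the discrete-time convention used in this work, the closed form for the exponential covariance is the geometric-series sum of its Fourier series; the specific expression below is the standard result stated as an assumption (it is not copied from [33]), and the snippet checks it against a truncated Fourier sum.

```python
# Power spectrum of C(k) = exp(-|k|/tau) in discrete time:
#   lambda(phi) = (1 - a^2) / (1 - 2 a cos(phi) + a^2),  a = exp(-1/tau),
# checked against a direct truncation of the Fourier series.
import numpy as np

tau, K = 4.0, 5000                      # time constant and series truncation
a = np.exp(-1.0 / tau)
phi = np.linspace(-np.pi, np.pi, 7)

closed = (1 - a ** 2) / (1 - 2 * a * np.cos(phi) + a ** 2)
k = np.arange(1, K + 1)
series = 1 + 2 * np.sum(np.exp(-k[:, None] / tau) * np.cos(k[:, None] * phi), axis=0)

print("max |closed - series| =", np.max(np.abs(closed - series)))
```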

Replica formalism: Sample covariance matrix (SC)
Also in the case of a sample covariance matrix, we are interested in statistically structured inputs and outputs. An independent average across $x$ and $y$ would result in a simple dependence on the variance of $y$ in the energetic part. To capture the geometric dependence between $x$ and $y$, we thus extend the calculations in [50,51] to the case where the linear expansion of $y_\mu$ on the right singular vectors $V_{\cdot\mu}$ is known, by taking $\delta y_\mu = \sum_\nu V_{\mu\nu} c_\nu$. In order to compute the replicated cumulant generating function Eq (11), we again introduce overlap parameters $q^{ab}_w$, whose volume is given by the previously computed entropic part $G_S$. The fact that the entropic part is unchanged in turn implies that the mean-field weight distribution takes the form of Eq (44), with the values of $\{A, B, C\}$ being determined by a new set of saddle-point equations.
SC, Energetic part. Using again the expressions $(X)_{i\mu} = \bar{x}_i + \sigma_i\xi_{i\mu}$ and $\xi = USV^T$, the replicated cumulant generating function for the joint (mean-removed) input and output is computed with the change of variables $\tilde{J}_{ia} = \sum_k U_{ki} J_{ka}$ and $\tilde{u}_{\mu a} = \sum_k V_{k\mu} u_{ka}$. The average in Eq (47) is taken over the joint distribution $p(\tilde{J}_a, \tilde{u}_a)$ resulting from averaging over the Haar measure on the orthogonal matrices $U$ and $V$. For a single replica, $Z_{\xi,\delta y}$ only depends on the squared norms of $\tilde{J}$ and $\tilde{u}$, and the resulting Gaussian integral involves the block matrix
$$\mathcal{M} = \begin{pmatrix} \Lambda_w \mathbb{1}_N & -iS \\ -iS^T & \Lambda_u \mathbb{1}_P \end{pmatrix},$$
where $\mathbb{1}_K$ is the identity matrix of dimension $K$. Following [51], the determinant of $\mathcal{M}$ can be easily calculated, where the limit is taken for $N\to\infty$ and the average is with respect to the eigenvalue distribution $\rho(\lambda^x)$. As for the quadratic portion of the Gaussian integral, calling $\lambda^y_k = c_k^2$, we will use the shorthand $K(\Lambda_w,\Lambda_u)$ defined below. Considering now the replicated generating function, all the $n(n+1)$ cross-products $J_a\cdot J_b = \tilde{J}_a\cdot\tilde{J}_b$ and $u_a\cdot u_b = \tilde{u}_a\cdot\tilde{u}_b$ must be conserved under the multiplication by $U$ and $V$.
Together with the overlap parameters $N q^{ab}_w = \sum_i J_{ia}J_{ib}$, we additionally introduce the quantities $P q^{ab}_u = \sum_\mu u_{\mu a} u_{\mu b}$. In the RS case, we again take $q^{ab}_w = q_w + \delta_{ab}\Delta q_w$ and, similarly for the $u$'s, $q^{ab}_u = -q_u + \delta_{ab}\Delta q_u$. In the basis where both $q^{ab}_w$ and $q^{ab}_u$ are diagonal, the expression for $Z_{\xi,\delta y}$ factorizes, so that, calling $G_{\xi,\delta y} = \frac{1}{N}\lim_{n\to 0}\log Z_{\xi,\delta y}$, we have
$$2 G_{\xi,\delta y} = F(\Delta q_w, \Delta q_u) + q_w \frac{\partial F(\Delta q_w, \Delta q_u)}{\partial \Delta q_w} - q_u \frac{\partial F(\Delta q_w, \Delta q_u)}{\partial \Delta q_u} - \alpha\sigma_y^2\, K(\Lambda_w, \Lambda_u) \qquad (54)$$
with the function $F$ given by
$$F(x,y) = \underset{\Lambda_w,\,\Lambda_u}{\mathrm{Extr}}\Bigl\{ -\bigl\langle \log(\lambda^x + \Lambda_w\Lambda_u)\bigr\rangle_{\lambda^x} - (\alpha - 1)\log\Lambda_u + \Lambda_w x + \alpha\Lambda_u y \Bigr\} - \log x - \alpha\log y - (1+\alpha) \qquad (55)$$
and $K(\Lambda_w, \Lambda_u) = \Lambda_w \left\langle \frac{\lambda^y}{\lambda^x + \Lambda_w\Lambda_u} \right\rangle_{\lambda^x,\lambda^y}$. In Eq (54), it is intended that $\Lambda_w$ and $\Lambda_u$ are implied by the Legendre transform conditions (cf. Eq (56)). The remaining terms in the energetic part $G_E$ involve the $q^{ab}_u$ overlaps and their conjugate parameters $\hat{q}^{ab}_u$. Introducing the RS ansatz for $\hat{q}^{ab}_u$, the calculation follows along the same lines as in the section EC, Energetic part, yielding Eq (58). Eliminating $M$, $\hat{q}_u$ and $\Delta\hat{q}_u$ at the saddle point in Eq (58), $G_E$ reduces to a simpler expression.
SC, Saddle-point equations. The final expression for the free energy density implies a set of saddle-point equations, in addition to the entropic saddle-point Eqs (25)-(27), which are unchanged. The saddle-point values of the conjugate Legendre variables $\Lambda_w$, $\Lambda_u$ greatly simplify the expressions for the first and second derivatives of $F$. Indeed, from Eqs (61) and (62), setting $\tilde{\Lambda}_w = \beta\Lambda_w$, one obtains simplified relations. In particular, Eq (56) shows that $\Delta q_w$ is expressed by a Stieltjes transform of $\rho(\lambda^x)$, and the first term in Eq (55) is its Shannon transform. In the limit $\beta\to\infty$, we use the additional scaling relation $\Delta\tilde{q}_u = \beta\,\Delta q_u$ for the $u$ overlaps, analogous to Eqs (30)-(33).
Together with the overlap parameters Nq ab w ¼ P i J ia J ib , we additionally introduce the quantities Pq ab u ¼ P m u ma u mb , thus obtaining: In the RS case, we again take q ab w ¼ q w þ d ab Dq w and, similarly for the u's, q ab u ¼ À q u þ d ab Dq u . In the basis where both q ab w and q ab u are diagonal, the expression for Z ξ,δy becomes Z x;dy ¼ e iJ T 1 Sũ 1 À is y c T ffi ffi so, calling G x;dy ¼ 1 N lim n!0 log Z x;dy , we have: 2G x;dy ¼ F Dq w ; Dq u ð Þ þ q w @FðDq w ; Dq u Þ @Dq w À q u @FðDq w ; Dq u Þ @Dq u À as 2 y K L w ; L u ð Þ ð54Þ with the function F given by: Fðx; yÞ ¼ Extr L w ;L u fÀ hlogðl x þ L w L u Þi l x À ða À 1Þ log L u þ L w x þ aL u yg À log x À a log y À ð1 þ aÞ ð55Þ and K L w ; L u ð Þ ¼ L w h l y l x þL w L u i l x ;l y . In Eq (54), it is intended that Λ w and Λ w are implied by the Legendre Transform conditions: The remaining terms in the energetic part G E involve the q ab u overlaps and their conjugated parametersq ab u . Introducing the RS ansatzq ab u ¼q u þ d ab Dq u Àq u 2 , the calculation follows along the same lines of the section SC, Energetic part. We get: þ Dq u Dq u À q u ð Þ þq u Dq u À log 1 þ bDq u ð Þ À bq u þ ðM À � yÞ Eliminating M,q u and Dq u at the saddle-point in Eq (58), G E reduces to: SC, Saddle-point equations. The final expression for the free energy density implies the following saddle-point equations: in addition to the entropic saddle-point Eqs (25)- (27), which are unchanged. The saddle-point values of the conjugate Legendre variables Λ w , Λ u greatly simplify the expression for the first and second derivatives of F. Indeed, from Eqs (61) and (62) one has: or, setting L w ¼ bL w :L In particular, Eq (56) shows that Dq w is expressed by a Stieltjes transform of ρ(λ x ) and the first term in Eq (55) is its Shannon transform. In the limit β ! 1, using the following additional scaling relations for the u overlaps: Dq u ¼ bDq u Methodology: Alessandro Ingrosso.