Optimal Properties of Analog Perceptrons with Excitatory Weights

The cerebellum is a brain structure traditionally thought to be devoted to supervised learning. According to this theory, plasticity at the Parallel Fiber (PF) to Purkinje Cell (PC) synapses is guided by the Climbing Fibers (CFs), which encode an ‘error signal’. Purkinje cells have thus been modeled as perceptrons, learning binary input/output associations. At maximal capacity, a perceptron with excitatory weights has a large fraction of zero-weight synapses, in agreement with experimental findings. However, numerous experiments indicate that the firing rate of Purkinje cells varies in an analog, not binary, manner. In this paper, we study the perceptron with analog inputs and outputs. We show that the optimal input has a sparse binary distribution, in good agreement with the burst firing of Granule cells. In addition, we show that the weight distribution contains a large fraction of silent synapses, as in previously studied binary perceptron models, and as seen experimentally.


Introduction
Purkinje cells (PCs) are the only outputs of the cerebellar cortex, a brain structure involved in motor learning. They receive a very large number (~150,000) of excitatory synaptic inputs from Granule Cells (GCs) through parallel fibers (PFs), and a single very strong input from the inferior olive through climbing fibers (CFs).
Single PCs have long been considered a neurobiological implementation of a perceptron [1,2], the simplest feedforward network endowed with supervised learning [3], since CFs are thought to provide PCs with an error signal [4]. A perceptron learns associations between input patterns and a binary output that are imposed on it. Learning is due to synaptic modifications, under the control of an error signal. The learning capabilities of perceptrons have been extensively studied for unbiased [5,6] as well as biased patterns [6], and for unconstrained synapses [5,6]. In real neurons, synapses are either excitatory (glutamatergic synapses) or inhibitory (GABAergic synapses), depending on the identity of the pre-synaptic neurons (except during early development, when GABAergic synapses are initially excitatory and then become inhibitory). A multitude of experiments characterizing synaptic plasticity have shown that the strength, but not the sign, of a synapse can be modified by patterns of neuronal activity. This has led to the study of perceptrons with sign-constrained weights [7,8,9,10]. In particular, Brunel et al. [10] showed that when synaptic weights are constrained to be excitatory (positive or zero), a perceptron at maximal capacity has a distribution of synaptic weights with two components: a finite fraction of zero-weight ('silent') synapses, and a truncated Gaussian distribution for the rest of the synapses. They further showed that this distribution is in striking agreement with experimental data [10].
Numerous experiments show, however, that in the course of specific motor tasks, the firing rate of Purkinje cells varies in an analog, not binary, fashion [11,12,13,14]. We therefore set out to investigate the capacity and the distribution of synaptic weights of a perceptron storing associations between analog inputs and outputs. More precisely, each input or output unit can take an analog value drawn from a distribution with a given mean and variance. We show that the optimal input distribution matches the firing pattern of Granule cells, and that the weight distribution at maximal capacity reproduces the experimental Parallel Fiber to Purkinje cell synaptic weight distribution.

The analog perceptron
The perceptron consists of $N$ inputs and one output. Both inputs and outputs take continuous values. We require this perceptron to learn a set of $p$ prescribed random input-output associations, where the inputs $G_i^\mu$ ($i = 1, \ldots, N$; $\mu = 1, \ldots, p$) are drawn randomly and independently from a distribution $\rho_{\mathrm{in}}(G)$ with mean $\mu_G$ and standard deviation $\sigma_G$, while the target outputs $P_t^\mu$ are drawn randomly and independently from a distribution $\rho_{\mathrm{out}}(P)$ with mean $\mu_P$ and standard deviation $\sigma_P$. Note that since $G_i^\mu$ and $P_t^\mu$ represent firing rates of input and output cells, respectively, they must be nonnegative quantities. In particular, $\mu_G > 0$ and $\mu_P > 0$ represent the mean firing rates of granule and Purkinje cells, respectively. The output of the perceptron when a pattern $\mu$ is presented as input is given by
$$P^\mu = \phi\left(\sum_{i=1}^{N} w_i G_i^\mu - \theta N\right) \qquad (1)$$
where $\phi$ is a monotonically increasing static transfer function (f-I curve), $w_i$ are the synaptic weights from inputs $i = 1, \ldots, N$, and $\theta N$ represents inhibitory inputs that cancel the leading-order term in $\sum_{i=1}^{N} w_i G_i^\mu$ so that the argument of $\phi$ is of order 1. In Purkinje cells, these inhibitory inputs are provided by interneurons of the molecular layer. The goal of perceptron learning is to find a set of synaptic weights $\{w_i \geq 0\}$, $i = 1, \ldots, N$, for which $P^\mu = P_t^\mu$ for all $\mu = 1, \ldots, p$.
We focus for simplicity on a linear transfer function $\phi(x) = x$, but our results can be applied to arbitrary invertible transfer functions $\phi$. Indeed, the problem of learning associations $(G_i^\mu \to P^\mu)$ in a perceptron with an arbitrary invertible transfer function $\phi$ is equivalent to the problem of learning $(G_i^\mu \to \phi^{-1}(P^\mu))$ in a linear perceptron. All the results derived in this paper can then be applied to a perceptron with transfer function $\phi$, except that $\mu_P$ and $\sigma_P$ are now defined as the mean and standard deviation of $\phi^{-1}(P_t^\mu)$.
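The equivalence between a nonlinear and a linear perceptron can be illustrated with a minimal sketch. The snippet evaluates the output from the quantities named in the text (weights, inputs, an inhibitory offset proportional to the number of inputs, and a transfer function). The exponential transfer function, the function name `perceptron_output` and all numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def perceptron_output(weights, inputs, theta, phi=lambda x: x):
    """Output for one pattern: phi(sum_i w_i * G_i - theta * N)."""
    n = len(weights)
    drive = sum(w * g for w, g in zip(weights, inputs)) - theta * n
    return phi(drive)

# Hypothetical example values (non-negative weights and rates).
weights = [0.5, 0.0, 1.5, 1.0]
inputs = [2.0, 3.0, 0.0, 1.0]
theta = 0.25

# Linear perceptron, phi(x) = x.
linear_out = perceptron_output(weights, inputs, theta)

# Same weights with an invertible transfer function (exp, chosen for illustration).
exp_out = perceptron_output(weights, inputs, theta, phi=math.exp)

# Applying the inverse transfer function to the nonlinear output recovers
# the linear output, so the nonlinear target exp_out corresponds to the
# linear target log(exp_out).
recovered = math.log(exp_out)
```

With these numbers the linear drive is 1.0, so learning the target `math.e` with the exponential transfer function is the same problem as learning the target 1.0 with a linear unit.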

Storage capacity
In the large $N$ limit, the probability of finding a set of weights that satisfies $P^\mu = P_t^\mu$ for all $\mu = 1, \ldots, p$ is expected to be 1 if $\alpha \equiv p/N$ is below a critical value $\alpha_c$, while it is 0 when $\alpha > \alpha_c$ [15]. $\alpha_c$ is therefore the number of associations that can be learned per synapse, and is commonly used as a measure of storage capacity.
This storage capacity can be computed analytically using the replica method (see Methods) [6,10,15,16,17]. The capacity $\alpha_c$ is given by Eq. (2), in which $B$ is determined by Eq. (3) and $\gamma$ depends on the statistics of the associations through Eq. (4). Therefore, the maximal capacity depends only on a single parameter $\gamma$, which is a function of the statistics of the patterns that need to be learned. This dependence is shown in Fig. 1A. It shows that the capacity is exactly equal to 0.5 when $\gamma = 0$, while it decreases monotonically as $\gamma$ increases.
If the number of patterns to be learned exceeds the maximal capacity, the mean squared error becomes strictly positive. It can also be computed using the replica method (see Methods, Eq. (17)). As expected, it increases monotonically with $\alpha$, as shown in Fig. 1B, which displays the result of the analytical calculation together with numerical simulations. If uncorrelated noise is added to the perceptron, the total mean squared error is the sum of the error without noise (Eq. (17)) and the variance of the uncorrelated noise.
In the simulations, inputs and outputs are drawn from an exponential distribution. The weight update at each presentation is the standard perceptron one, i.e.
$$\Delta w_i = \beta\, G_i^\mu \left(P_t^\mu - P^\mu\right) \qquad (5)$$
where $\beta$ is the learning rate. $w_i$ is set to zero if application of the update leads to a negative weight. This corresponds to a gradient descent on a cost function proportional to $\sum_\mu (P^\mu - P_t^\mu)^2$. This learning rule is in qualitative agreement with experimental data on synaptic plasticity at GC to PC synapses [18,19]. In Purkinje cells, the error signal is thought to be conveyed by climbing fiber (CF) activation. Two protocols have been shown to be effective in eliciting long-term plasticity. Pairing GC and CF activation leads to Long-Term Depression (LTD) of the synapse, while Long-Term Potentiation (LTP) is induced by stimulating the GC alone (see Fig. 3AB of [19] for details). Writing climbing fiber activation as $C = P - P_t + C_0$, we see that Eq. (5) is recovered if one chooses $\Delta w_i \propto G_i (C_0 - C)$, which captures the two experimental protocols described above.
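The update described above (a step proportional to the input times the output error, followed by setting any negative weight to zero) can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation code used for the figures: the function name `update_weights`, the network size, the number of patterns, the learning rate, the initial weights and the choice of a zero inhibitory offset are all hypothetical.

```python
import numpy as np

def update_weights(w, g, p_target, theta, beta):
    """Present one pattern (g, p_target) and return the updated weights."""
    p = w @ g - theta * w.size              # linear output, transfer function phi(x) = x
    w_new = w + beta * g * (p_target - p)   # step proportional to input times error
    return np.maximum(w_new, 0.0)           # a weight that would turn negative is set to zero

# Hand-checkable single step (hypothetical numbers): the output 1.2 exceeds
# the target 0.2, so both weights are depressed and the second is clipped at zero.
w_next = update_weights(np.array([1.0, 0.1]), np.array([1.0, 2.0]),
                        p_target=0.2, theta=0.0, beta=0.1)

# Short run with exponentially distributed inputs and targets (sizes are assumptions).
rng = np.random.default_rng(0)
n_inputs, n_patterns = 50, 10
G = rng.exponential(1.0, size=(n_patterns, n_inputs))
P_t = rng.exponential(1.0, size=n_patterns)
w = np.full(n_inputs, 0.02)
for _ in range(20):                         # sweeps through the pattern set
    for mu in range(n_patterns):
        w = update_weights(w, G[mu], P_t[mu], theta=0.0, beta=0.001)
```

In terms of the plasticity protocols, a pattern whose output exceeds its target plays the role of GC and CF co-activation (depression), while an output below target corresponds to GC activation alone (potentiation).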

Distribution of synaptic weights
The distribution of synaptic weights at maximal capacity can also be computed using the replica method (see [10] for details of the calculation). It turns out that the distribution obeys exactly the same equation as in the binary perceptron, Eq. (6), in which $\bar{w}$ is the average synaptic weight. In particular, the fraction of zero-weight synapses is $S = H(-B)$. Interestingly, there is a very simple relationship between capacity and fraction of silent synapses, $S + \alpha_c = 1$, that holds for any value of $\gamma$. The fraction of silent synapses $S$ is shown as a function of $\gamma$ in Fig. 2A. It shows that $S = 0.5$ when $\gamma = 0$, and increases monotonically with $\gamma$.
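The relationship between silent fraction and capacity can be checked numerically with a short sketch. It assumes that H denotes the standard Gaussian tail integral (the probability that a standard normal variable exceeds its argument), which is the usual convention in this literature but is not defined in the text above. The function names and the sample values of B are illustrative.

```python
import math

def H(x):
    """Gaussian tail integral: probability that a standard normal variable exceeds x."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def silent_fraction(B):
    """Fraction of zero-weight synapses, S = H(-B)."""
    return H(-B)

def capacity_from_silent_fraction(S):
    """Capacity implied by the relation S + alpha_c = 1."""
    return 1.0 - S

# Sample values of B (hypothetical); B = 0 gives S = 0.5 and alpha_c = 0.5.
for B in (0.0, 0.5, 1.0):
    S = silent_fraction(B)
    alpha_c = capacity_from_silent_fraction(S)
    print(f"B = {B:.1f}: S = {S:.4f}, alpha_c = {alpha_c:.4f}")
```

Under this assumption the relation S + alpha_c = 1 is equivalent to alpha_c = H(B), because H(B) + H(-B) = 1 for any B.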
The full distribution of weights is shown in Fig. 2B, together with the results of a numerical simulation (see parameters in the caption of Fig. 2B). The theoretical distribution of synaptic weights is in good agreement with experimental measurements of the efficacy of a large set of GC to PC synapses, using paired recordings in vitro (see Fig. 6A of [10] for details) [20,21,10].
Above maximal capacity, $\alpha > \alpha_c$, the distribution of synaptic weights is still given by Eq. (6), but the fraction of zero-weight synapses decreases monotonically with $\alpha$, and goes to zero in the large $\alpha$ limit (see Fig. 2C). In that limit the distribution becomes increasingly close to a Gaussian distribution peaked around a positive value, with a width that tends to zero.

Author Summary
Learning properties of neuronal networks have been extensively studied using methods from statistical physics. However, most of these studies ignore a fundamental constraint in networks of real neurons: synapses are either excitatory or inhibitory, and cannot change sign during learning. Here, we characterize the optimal storage properties of an analog perceptron with excitatory synapses, as a simplified model for cerebellar Purkinje cells. The information storage capacity is shown to be optimized when inputs have a sparse binary distribution, while the weight distribution at maximal capacity contains a large fraction of zero-weight synapses. Both features are in agreement with electrophysiological data.

Statistics of inputs and outputs maximizing storage capacity
To maximize storage capacity, $\gamma$ should be as small as possible. We first ask which distribution of inputs maximizes capacity. From Eq. (4), it is clear that to maximize capacity, $\mu_G$ should be as small as possible, while $\sigma_G$ should be as large as possible. Since $\rho_{\mathrm{in}}$ is a distribution of firing rates, it must be bounded between 0 and a maximal firing rate $G_{\max}$. The distribution of a bounded variable that maximizes the variance at a fixed mean $\mu_G$ is a binary distribution, $\rho_{\mathrm{in}}(G) = (1 - \mu_G/G_{\max})\,\delta(G) + (\mu_G/G_{\max})\,\delta(G - G_{\max})$. Thus, we predict that to optimize capacity, patterns of activity in the Granule cell layer should be sparse (to ensure $\mu_G$ is small), but active cells should fire close to their maximal firing rates. Interestingly, this is in striking agreement with available data [19,22,23] showing that (i) Granule cells have very sparse activity in vivo (average firing rates of 0.5 Hz [22]) and (ii) they can respond to sensory inputs with brief, high-frequency bursts of action potentials (with an average frequency of 77 Hz within the burst, and maximal frequencies up to 250 Hz, see e.g. Fig. 3 of [22]).
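To make the variance argument concrete, the sketch below computes the fraction of active inputs and the variance for the two-point distribution on {0, G_max} at a fixed mean, and compares it with a uniform distribution of the same mean. Plugging in a mean of 0.5 Hz and a maximal rate of 250 Hz is an illustrative choice based on the figures quoted above, not a fit performed in the paper; the function names are hypothetical.

```python
def binary_rate_distribution(mean_rate, max_rate):
    """Return (fraction_active, variance) for the two-point distribution on {0, max_rate}."""
    if max_rate <= 0.0 or not 0.0 <= mean_rate <= max_rate:
        raise ValueError("require max_rate > 0 and 0 <= mean_rate <= max_rate")
    fraction_active = mean_rate / max_rate          # weight of the delta peak at max_rate
    variance = mean_rate * (max_rate - mean_rate)   # E[G^2] - E[G]^2 for the two-point law
    return fraction_active, variance

def uniform_rate_variance(mean_rate):
    """Variance of a uniform distribution on [0, 2 * mean_rate], for comparison."""
    return (2.0 * mean_rate) ** 2 / 12.0

# Illustrative numbers: mean rate 0.5 Hz, maximal rate 250 Hz.
fraction_active, binary_variance = binary_rate_distribution(0.5, 250.0)
uniform_variance = uniform_rate_variance(0.5)
```

With these values only 0.2% of inputs are active in a given pattern, and the variance (124.75 Hz²) is roughly three orders of magnitude larger than that of the uniform distribution with the same mean.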
The next question is which distribution of output firing rates optimizes the capacity. Eq. (4) makes it clear that the capacity is optimized for $\sigma_P = 0$. In this limit, however, all input patterns lead to exactly the same output, and the Purkinje cell output contains no information on which input was presented. This is of course not a desirable outcome, and suggests that the capacity is not the correct measure to maximize in this case. We therefore turn to the Shannon mutual information between the Purkinje cell output and its inputs as a more suitable measure. In the presence of additive Gaussian noise of zero mean and standard deviation $\sigma_n$, this is simply the mutual information of a Gaussian channel with a signal-to-noise ratio $\sigma_P^2/\sigma_n^2$, i.e. $\log_2(1 + \sigma_P^2/\sigma_n^2)/2$ bits per pattern (see e.g. [24]). The total information in bits per synapse is therefore $I = \alpha_c \log_2(1 + \sigma_P^2/\sigma_n^2)/2$. The information is zero when $\sigma_P = 0$, and reaches a maximum for a finite value of $\sigma_P$, which depends on both the noise standard deviation $\sigma_n$ and $\sigma_{\mathrm{eff}} = \sigma_G \theta/\mu_G$. Fig. 3A shows the information as a function of $\sigma_P$, for different values of $\sigma_{\mathrm{eff}}$, for $\sigma_n = 1$. It shows that the optimal value of $\sigma_P$ increases approximately linearly with $\sigma_{\mathrm{eff}}$ for large $\sigma_{\mathrm{eff}}$ (see Fig. 3B).
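The information measure can be written as a short sketch. The per-pattern term is the Gaussian channel formula quoted above; the capacity is passed in as an argument because its dependence on the output statistics is set by the capacity equations, which are not reimplemented here. The function names and the sample values are assumptions for illustration.

```python
import math

def bits_per_pattern(sigma_p, sigma_n):
    """Mutual information of a Gaussian channel with signal-to-noise ratio (sigma_p / sigma_n)^2."""
    return 0.5 * math.log2(1.0 + (sigma_p / sigma_n) ** 2)

def bits_per_synapse(alpha_c, sigma_p, sigma_n):
    """Total information per synapse: capacity times information per pattern."""
    return alpha_c * bits_per_pattern(sigma_p, sigma_n)

# Hypothetical values: equal signal and noise standard deviations, capacity 0.5.
per_pattern = bits_per_pattern(1.0, 1.0)        # log2(2) / 2
per_synapse = bits_per_synapse(0.5, 1.0, 1.0)
no_signal = bits_per_pattern(0.0, 1.0)          # constant output carries no information
```

The trade-off described in the text arises because increasing the output standard deviation raises the per-pattern term while lowering the capacity, so their product peaks at a finite value.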

Discussion
In this paper, we have considered an analog firing rate model for a Purkinje cell with plastic excitatory weights, and derived both its maximal capacity and the distribution of weights at maximal capacity. We showed that the capacity is of the same order as in a binary perceptron model.
The distribution of synaptic weights of the analog perceptron is composed at maximal capacity of two parts: a large fraction ($> 0.5$) of silent synapses and a truncated Gaussian. It has exactly the same shape as in several other models: a standard binary perceptron [10], and a bistable perceptron [25]. This distribution is in quantitative agreement with a combination of electron microscopy and electrophysiological data in adult rat slices [10,20,21]. Furthermore, a gradient descent learning rule leading to maximal capacity bears strong similarities with synaptic plasticity experiments: LTD when PF and CF are coactivated, LTP when PF fires alone (i.e. CF below baseline, thus $P_t > P$) [18,19].
We found that, in order to maximize the capacity, the input variance should be as large as possible. We argue that GCs in vivo are close to such an optimal distribution, since they fire high-frequency bursts at very low rates [19,22,23]. Furthermore, GC bursts have been found in some experiments to be critical to induce plasticity at PF to PC synapses [26]. Indeed, no plasticity is induced in those protocols with a single GC spike. We also found that lower variance in the output increases the capacity, but at the cost of losing information contained in the output in the presence of noise. For a given variance of the noise, there is an optimal variance of the output that maximizes the information contained in the output.
The model we have studied here is essentially equivalent to the ADALINE (Adaptive Linear Neuron) model [27], whose storage capacity, in the absence of constraints on synaptic weights, is equal to 1. This result can be understood intuitively: when $\alpha = 1$, there are exactly $N$ linear equations to solve, Eq. (1), with $N$ unknowns, $w_i$ (see e.g. [15]). We have shown here that the constraint that all synaptic weights should be positive or zero leads to a capacity which is decreased by a factor of 2 or more, depending on the value of $\gamma$. This decrease in capacity is similar to what is observed in the standard perceptron with excitatory synapses [7,8,9,10]. Note that learning associations with constrained weights is conceptually similar to non-negative matrix factorization [28,29]. Generalizations of such models in the temporal domain (the so-called adaptive filter models) have been proposed to describe learning in the cerebellar cortex [30,31,32,33]. It would be of interest to investigate the capacity and distribution of synaptic weights of such models.
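The counting argument for the unconstrained case can be illustrated with a small numerical sketch: with as many patterns as inputs, the associations form a square linear system that generically has an exact solution, but nothing forces that solution to be non-negative. The system size, the random seed, the exponential statistics and the zero inhibitory offset are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20                                      # number of inputs, equal to number of patterns
G = rng.exponential(1.0, size=(n, n))       # one input pattern per row
P_t = rng.exponential(1.0, size=n)          # target outputs
theta = 0.0                                 # inhibitory offset (set to zero here)

# Unconstrained linear perceptron: solve G w - theta * n = P_t for w.
w = np.linalg.solve(G, P_t + theta * n)

# Largest deviation between produced and target outputs.
residual = np.max(np.abs(G @ w - theta * n - P_t))

# Number of weights that violate the excitatory (non-negative) constraint.
n_negative = int(np.sum(w < 0))
print(f"max residual = {residual:.2e}, negative weights = {n_negative} of {n}")
```

For random patterns the exact solution typically contains many negative weights, which is why imposing the sign constraint reduces the number of associations that can be stored.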
In this paper, we have focused on a single plasticity site, the GC to PC synapse. Many other sites of plasticity are known to exist in the cerebellum [18]. Future studies are needed to clarify the impact of these additional sites of plasticity on the learning capabilities of this structure.

Calculation of the storage capacity
The replica method involves calculating the average logarithm of the volume of the space of weights satisfying all constraints given by Eq. (1) [6]. To compute the average logarithm, one uses the replica trick: $n$ replicas of the system are introduced, and one computes the average of the $n$-th power of the volume, where $\langle \cdot \rangle$ represents an average over the patterns and $a$ is a replica index. This calculation follows a standard procedure. After introducing integral representations for the delta functions, one averages over the distribution of the patterns. One then introduces order parameters $M_a$, $Q_a$ and $q_{ab}$, together with conjugate parameters $\hat{M}_a$, $\hat{Q}_a$ and $\hat{q}_{ab}$. We then use a replica-symmetric ansatz (all the order parameters are taken to be independent of the replica index), take the limit $n \to 0$ and obtain Eqs. (11,12), where in the equation for $F$ (Eq. (12)) the first two lines are identical to those of the binary perceptron with excitatory weights [10], while the last line is specific to the analog perceptron.
In the large $N$ limit, the integral in Eq. (11) is dominated by the extremum of $F$. The typical values of all order parameters are then obtained from the resulting saddle-point equations, setting the derivatives of $F$ with respect to all order parameters to zero. The maximal capacity $\alpha_c$ is obtained in the limit $q \to Q$, for which the volume vanishes. This limit yields Eqs. (2,4).

Calculation of the mean squared error
Following [34], we introduce a cost function given by the sum of the squared errors over all patterns, and compute its minimum over the space of weights. This is done by introducing a partition function $Z(h)$, where $h$ is an inverse temperature, and computing $\langle \log Z(h) \rangle$ using the replica method, from which the mean squared error is obtained. To perform this calculation, a new parameter $r$ has to be introduced, which remains finite when $\alpha > \alpha_c$ in the limit $h \to \infty$, $q \to Q$. The mean squared error $E_{\min}$ is then given by the resulting equations, Eqs. (19,20). When $\alpha = \alpha_c$, $r$ diverges to infinity, $E_{\min} = 0$, and Eqs. (19,20) reduce to Eqs. (2,3).