Error-based or target-based? A unifying framework for learning in recurrent spiking networks

Learning in biological or artificial networks means changing the laws governing the network dynamics in order to better behave in a specific situation. In the field of supervised learning, two complementary approaches stand out: error-based and target-based learning. However, there is no consensus on which is better suited for which task, nor on which is the most biologically plausible. Here we propose a comprehensive theoretical framework that includes these two approaches as special cases. This novel theoretical formulation offers major insights into the differences between the two approaches. In particular, we show how target-based learning naturally emerges from error-based learning when the number of constraints on the target dynamics, and as a consequence on the internal network dynamics, is comparable to the degrees of freedom of the network. Moreover, given the experimental evidence on the relevance of spikes in biological networks, we investigate the role of coding with specific patterns of spikes by introducing a parameter that defines the tolerance to precise spike timing during learning. Our approach naturally lends itself to Imitation Learning (and Behavioral Cloning in particular), and we apply it to solve relevant closed-loop tasks such as the button-and-food task and the 2D Bipedal Walker. We show that a high-dimensional feedback structure is extremely important when it is necessary to solve a task that requires retaining memory for a long time (button-and-food). On the other hand, we find that coding with specific patterns of spikes enables optimal performance in a motor task (the 2D Bipedal Walker). Finally, we show that our theoretical formulation suggests protocols to deduce the structure of the learning feedback in biological networks.


Introduction
When first confronted with reality, humans learn with high sample efficiency, benefiting from the fabric of society and its abundance of experts in all relevant domains. A conceptually simple and effective strategy for learning in this social context is Imitation Learning. One can conceptualize this learning strategy in the Behavioral Cloning framework, where an agent observes a target, close-to-optimal behavior (the expert demonstration) and progressively improves its imitation performance by minimizing the differences between its own and the expert's behavior. Behavioral Cloning can be directly implemented in a supervised learning framework. In recent years, a competition between two opposite interpretations of supervised learning has emerged: error-based approaches [33,29,2,3,19], where the error information computed at the environment level is injected into the network and used to improve later performances, and target-based approaches [25,21,10,28,8,14,37], where a target for the internal activity is selected and learned. In this work, we provide a general framework in which these different approaches are reconciled and can be retrieved via a proper definition of the error propagation structure the agent receives from the environment: target-based and error-based learning are particular cases of our comprehensive framework. This novel formulation, being more general, offers new insights into the importance of the feedback structure for the network learning dynamics, a still under-explored degree of freedom. Moreover, spike-timing-based neural codes have been experimentally suggested to be important in several brain systems [9,17,30,13]. This evidence led us to investigate the role of coding with specific patterns of spikes by introducing a parameter that defines the tolerance to precise spike timing during learning. Although many studies have approached learning in feedforward [28,24,11,22,38,27] and recurrent spiking networks [2,29,10,15,8], only a small number of them have successfully faced real-world problems and reinforcement learning tasks [2,36]. In this work, we apply our framework to the problem of behavioral cloning in recurrent spiking networks and show how it produces valid solutions for relevant tasks (button-and-food and the 2D Bipedal Walker). From a biological point of view, we focus on a tantalizing novel route opened by such a framework: the exploration of the feedback strategy actually implemented by biological networks in the different brain areas. We propose an experimental measure that can help elucidate the error propagation structure of biological agents, offering an initial step toward a potentially fruitful cloning of naturally evolved learning expertise.

The spiking model
In our formalism, neurons are modeled as real-valued variables $v_j^t \in \mathbb{R}$, where the label $j \in \{1, \dots, N\}$ identifies the neuron and $t \in \{1, \dots, T\}$ is a discrete time variable. Each neuron exposes an observable state $s_j^t \in \{0, 1\}$, which represents the occurrence of a spike from neuron $j$ at time $t$. We then define the following dynamics for our model:

$$\hat{s}_j^{\,t} = \exp\left(-\Delta t/\tau_s\right)\hat{s}_j^{\,t-1} + \left(1 - \exp\left(-\Delta t/\tau_s\right)\right)s_j^{\,t-1}, \quad (1)$$

$$v_i^{t+1} = \exp\left(-\Delta t/\tau_m\right)v_i^t + \left(1 - \exp\left(-\Delta t/\tau_m\right)\right)\Big(\sum_j w_{ij}\,\hat{s}_j^{\,t} + v_{\mathrm{rest}} + I_i^t\Big) + w_{\mathrm{res}}\,s_i^t, \quad (2)$$

$$s_i^{t+1} = \Theta\!\left[v_i^{t+1} - v_{\mathrm{th}}\right], \quad (3)$$

where $\Delta t = 1\,$ms is the discrete time-integration step, while $\tau_s = 2\,$ms and $\tau_m = 8\,$ms are respectively the spike-filtering time constant and the membrane time constant. Each neuron is a leaky integrator of a recurrent filtered input, received via a synaptic matrix $w \in \mathbb{R}^{N \times N}$, and of an external signal $I_i^t$. $w_{\mathrm{res}} = -20$ accounts for the reset of the membrane potential after the emission of a spike. $v_{\mathrm{th}} = 0$ and $v_{\mathrm{rest}} = -4$ are the threshold and the rest membrane potential.
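As a concrete illustration, the following is a minimal NumPy sketch of these dynamics, assuming the update equations above (the function and variable names are illustrative, not part of the original formulation):

```python
import numpy as np

# Model constants as defined above
dt, tau_s, tau_m = 1.0, 2.0, 8.0                      # ms
beta_s, beta_m = np.exp(-dt / tau_s), np.exp(-dt / tau_m)
v_th, v_rest, w_res = 0.0, -4.0, -20.0

def simulate(w, I, T, N):
    """Run the discrete-time spiking network for T steps.
    w: (N, N) recurrent weights; I: (T, N) external input."""
    v = np.full(N, float(v_rest))        # membrane potentials
    s = np.zeros(N)                      # spikes at the current step
    s_hat = np.zeros(N)                  # filtered spike trains, eq. (1)
    spikes = np.zeros((T, N))
    for t in range(T):
        s_hat = beta_s * s_hat + (1.0 - beta_s) * s
        # leaky integration of recurrent + external input, plus spike reset, eq. (2)
        v = beta_m * v + (1.0 - beta_m) * (w @ s_hat + v_rest + I[t]) + w_res * s
        s = (v > v_th).astype(float)     # Heaviside spike condition, eq. (3)
        spikes[t] = s
    return spikes
```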

Basics and definitions
We face the general problem of an agent interacting with an environment with the purpose of solving a specific task. This is in general formulated in terms of an association, at each time $t$, between a state defined by the vector $x_h^t$ and actions defined by the vector $y_k^t$. The agent evaluates its current state and decides an action through a policy $\pi(\{y_k^{t+1}\}\,|\,\{x_h^t\})$. Two possible and opposite strategies to learn an optimal policy are Reinforcement Learning and Imitation Learning. In the former, the agent starts by trial and error, and the most successful behaviors are potentiated. In the latter, the optimal policy is learned by observing an expert that already knows a solution to the problem. Behavioral Cloning belongs to the category of Imitation Learning, and its aim is to learn to reproduce a set of expert behaviors (actions) $y_k^{*t+1} \in \mathbb{R}$, $k = 1, \dots, O$ (where $O$ is the output dimension), given a set of states $x_h^t \in \mathbb{R}$, $h = 1, \dots, I$ (where $I$ is the input dimension). Our approach is to explore the implementation of Behavioral Cloning in recurrent spiking networks.

Behavioral Cloning in spiking networks
In what follows, we assume that the action of the agent at time $t$, $y_k^t$, is evaluated by a recurrent spiking network and can be decoded through a linear readout

$$y_k^t = \sum_i B_{ki}\,\bar{s}_i^t, \qquad \bar{s}_i^t = \exp\left(-\Delta t/\tau\right)\bar{s}_i^{t-1} + \left(1 - \exp\left(-\Delta t/\tau\right)\right)s_i^t, \quad (4)$$

where $\bar{s}_i^t$ is a temporal filtering of the spikes $s_i^t$. To train the network to clone the expert behavior, it is necessary to minimize the error

$$E = \sum_{t,k}\left(y_k^{*t} - y_k^t\right)^2, \quad (5)$$

where $y_k^{*t}$ is the expert behavior. It is possible to derive the learning rule by differentiating the previous error function (by following the gradient), similarly to what was done in [2]:

$$\Delta w_{ij} \simeq \eta \sum_t \Big[\sum_k B_{ki}\left(y_k^{*t} - y_k^t\right)\Big]\, p_i^t\, e_j^t, \quad (6)$$

where we have used $p_i^t$ for the pseudo-derivative (similarly to [2]) and reserved $e_j^t = \partial v_i^t/\partial w_{ij}$ for the spike response function, which can be computed iteratively as

$$e_j^{t+1} = \exp\left(-\Delta t/\tau_m\right)e_j^t + \left(1 - \exp\left(-\Delta t/\tau_m\right)\right)\hat{s}_j^{\,t}. \quad (7)$$

In our case the pseudo-derivative, whose purpose is to replace the derivative of the spike-generation function (which is non-differentiable, see eq. (3)), is a bell-shaped function of the membrane potential (eq. (8)): it peaks at $v_i^t = 0$, and $\delta v$ is a parameter defining its width. For the complete derivation we refer to the supplemental material (where we also discuss the approximation symbol $\simeq$ in eq. (6)).

Figure 1: (A) An agent (here a recurrent network) observes the current state-action pair of a target agent and is trained to emulate its behaviour. The model assumes the presence of additional constraints; the total number of independent constraints D defines the rank of the error propagation matrix. (B) Schematics of different spike-filtering timescales τ: a larger τ is more tolerant of precise spike timing. (C) Schematics of our generalized framework: by changing the D and τ parameters, it is possible to derive different learning algorithms.
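A minimal NumPy sketch of eqs. (4)-(8) may clarify how the pieces combine; the exact functional form of the pseudo-derivative (here a logistic-derivative bump of width δv) and all names are our illustrative assumptions:

```python
import numpy as np

dt, tau_m, tau_ro = 1.0, 8.0, 5.0                    # ms
beta_m, beta_ro = np.exp(-dt / tau_m), np.exp(-dt / tau_ro)

def pseudo_derivative(v, dv=0.05):
    # Bell-shaped surrogate for the spike derivative, peaked at v = v_th = 0
    # with width dv (the exact functional form here is an assumption).
    return 1.0 / (4.0 * dv * np.cosh(v / (2.0 * dv)) ** 2)

def error_based_update(spikes, s_hat, v, y_star, B, eta=0.01):
    """One pass of the error-based rule, eq. (6).
    spikes, s_hat, v: (T, N) histories; y_star: (T, O) targets; B: (O, N)."""
    T, N = spikes.shape
    dw = np.zeros((N, N))
    e = np.zeros(N)                      # spike response function e_j^t, eq. (7)
    s_bar = np.zeros(N)                  # readout filter of the spikes, eq. (4)
    for t in range(T):
        s_bar = beta_ro * s_bar + (1.0 - beta_ro) * spikes[t]
        y = B @ s_bar                    # decoded action y_k^t
        e = beta_m * e + (1.0 - beta_m) * s_hat[t]
        err = B.T @ (y_star[t] - y)      # output error projected onto neurons
        dw += eta * np.outer(err * pseudo_derivative(v[t]), e)
    return dw
```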

Resources
The code to run the experiments is written in Python 3. Simulations were executed on a dual-socket server with an eight-core Intel(R) Xeon(R) E5-2620 v4 CPU per socket. The cores are clocked at 2.10 GHz with HyperThreading enabled, so that each core can run 2 processes, for a total of 32 processes per server.

Generalization
In eq. (6) we used the expert behavior $y_k^{*t}$ as the target output. However, it is possible to imagine that in both biological and artificial systems there are many more constraints to be satisfied, not directly related to the behavior. One example is the following: it might be necessary for the network to encode an internal state that is useful to produce the behavior $y_k^t$ and to solve the task (e.g., an internal representation of the position of the agent). The encoding of this information can emerge automatically during training; however, suggesting it directly to the network might significantly facilitate the learning process. This signal is referred to as a hint in the literature [15]. For this reason we introduce a further set of output targets $q_k^{*t}$, $k = O+1, \dots, D$, and define $Y_k^{*t}$, $k = 1, \dots, D$, as the collection of $y^*$ and $q^*$. $Y_k^t$ is decoded from the network activity through a linear readout $Y_k^t = \sum_i R_{ik}\,\bar{s}_i^t$ and should be as similar as possible to the target. This can be done by minimizing the error

$$E = \sum_{t,k}\left(Y_k^{*t} - Y_k^t\right)^2. \quad (9)$$

The resulting learning rule is

$$\Delta w_{ij} \simeq \eta \sum_t \Big[\sum_k R_{ik}\left(Y_k^{*t} - Y_k^t\right)\Big]\, p_i^t\, e_j^t. \quad (10)$$
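A short sketch of the extended error of eq. (9), assuming precomputed filtered activities; all names are illustrative:

```python
import numpy as np

def generalized_error(s_bar_hist, y_star, q_star, R):
    """Extended error of eq. (9): behavioral targets y* (O dims) and hint
    targets q* (D - O dims) are stacked into Y* and decoded by a single
    readout matrix R of shape (N, D)."""
    Y_star = np.concatenate([y_star, q_star], axis=1)   # (T, D) stacked targets
    Y = s_bar_hist @ R                                  # (T, D) decoded outputs Y_k^t
    return ((Y_star - Y) ** 2).sum()
```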

Target-based approach
The possibility of broadcasting specific local errors in biological networks has been debated for a long time [32,23]. On the other hand, the propagation of a target appears to be more coherent with biological observations [18,26,35,20]. For this reason we propose an alternative formulation that evaluates targets rather than errors [25,23]. This can easily be done by writing the target output as:

$$Y_k^{*t} = \sum_i R_{ik}\,\bar{r}_i^t, \quad (11)$$

where $r_i^t$ is the target activity of the recurrent network (and $\bar{r}_i^t$ its temporal filtering). We observe that if the matrix $R_{ik}$ is full rank, the internal target is uniquely defined; otherwise there exists a degeneracy in its choice. Substituting this expression in eq. (9), the learning rule becomes

$$\Delta w_{ij} \simeq \eta \sum_t \Big[\sum_k D_{ik}\left(\bar{r}_k^t - \bar{s}_k^t\right)\Big]\, p_i^t\, e_j^t. \quad (12)$$

By inspection, we notice the occurrence of a novel matrix $D = R R^\top$, $D \in \mathbb{R}^{N \times N}$, which acts recurrently on the network. If one now forgets the origin of this novel matrix, the previous relation can be read in terms of a general square matrix $D \in \mathbb{R}^{N \times N}$. The two core new terms are the target activity $r_i^t$ and the matrix $D$. The first induces the problem of selecting the optimal network activity, which is tautologically a re-statement of the learning problem. The second term, the matrix $D$, defines the dynamics in the space of the internal network activities $s_k^t$ during learning. This formulation is similar to the full-FORCE algorithm [10], which is target-based but does not impose a specific pattern of spikes as the internal solution.
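The following sketch of eq. (12) makes the role of $D = R R^\top$ explicit; it reuses the constants of the earlier sketches and takes the pseudo-derivative values p as precomputed (all names are illustrative):

```python
import numpy as np

dt, tau_m, tau_ro = 1.0, 8.0, 5.0
beta_m, beta_ro = np.exp(-dt / tau_m), np.exp(-dt / tau_ro)

def target_based_update(spikes, s_hat, v_pseudo, r_bar, R, eta=0.01):
    """Target-based form of the update (sketch of eq. (12)): the output error is
    replaced by an error on the internal activity, propagated through D = R R^T.
    v_pseudo: (T, N) precomputed pseudo-derivative values p_i^t."""
    T, N = spikes.shape
    D = R @ R.T                               # (N, N) feedback matrix
    dw, e, s_bar = np.zeros((N, N)), np.zeros(N), np.zeros(N)
    for t in range(T):
        s_bar = beta_ro * s_bar + (1.0 - beta_ro) * spikes[t]
        e = beta_m * e + (1.0 - beta_m) * s_hat[t]
        delta = D @ (r_bar[t] - s_bar)        # error on the internal activity
        dw += eta * np.outer(delta * v_pseudo[t], e)
    return dw
```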

Spike coding approximation
We now want to replace the target internal activity $r_i^t$ with a target sequence of spikes $s_i^{*t}$, in order to approximate $Y_k^{*t}$ as:

$$Y_k^{*t} \simeq \sum_i R_{ik}\,\bar{s}_i^{*t}. \quad (13)$$

We stress the fact that, due to spike quantization, the equality cannot be strictly achieved, and eq. (13) is an approximation. One could simply take $s^{*}$ to be the solution of the optimization problem $s_i^{*t} = \operatorname{argmin}_{s_i^t} \sum_{k,t} \big|\, y_k^{*t} - \sum_i B_{ki}\,\bar{s}_i^t \,\big|$. The optimal encoding of a continuous trajectory in a pattern of spikes has been broadly discussed in [5]. However, the resulting pattern $s^{*}$ might describe an impossible dynamics (for example, activity that follows periods of complete network silence). For this reason we make a different choice here: $s_i^{*t}$ is the pattern of spikes expressed by the untrained network when the target output $Y_k^{*t}$ is randomly projected as an input (similarly to [10,28]). It has been demonstrated that this choice allows for fast convergence and encodes detailed information about the target output. With these additional considerations, we can now rewrite our expression for the weight update in terms of the network activity:

$$\Delta w_{ij} \simeq \eta \sum_t \Big[\sum_k D_{ik}\left(\bar{s}_k^{*t} - \bar{s}_k^t\right)\Big]\, p_i^t\, e_j^t. \quad (14)$$

In this way, a specific pattern of spikes is directly suggested to the network as the internal solution of the task. We observe that when $R$ is random and full rank, $D$ is almost diagonal, and the training of the recurrent weights reduces to learning a specific pattern of spikes [31,16,12,4,7]. In this limit the LTTS model [28] is recovered (see Fig.1C), the only difference being the presence of the pseudo-derivative. We interpret the parameter τ (the timescale of the spike filtering, see eq. (4)) as the tolerance to spike timing. In Fig.1B we sketch how, for the same spike displacement between the internal and the target activity, the error is higher when τ is lower.
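Operationally, the construction of the target spike pattern can be sketched as follows, reusing simulate() from the model sketch above (the scale σ_teach of the random projection is an assumption):

```python
import numpy as np

def make_target_spikes(w0, Y_star, sigma_teach=1.0, seed=0):
    """Target spike pattern s*: run the *untrained* network (weights w0) with
    the target output randomly projected as an extra input, as described above."""
    rng = np.random.default_rng(seed)
    T, D = Y_star.shape
    N = w0.shape[0]
    w_teach = rng.normal(0.0, sigma_teach, size=(N, D))
    I_teach = Y_star @ w_teach.T              # (T, N) teaching input
    return simulate(w0, I_teach, T, N)        # spikes of the untrained network
```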

Dimensionality of the solution space
The learning formulation of eq. (14) offers a major insight into the role played by the feedback matrix $D_{ik}$. Consider the learning problem (with fixed input and target output) where the synaptic matrix $w_{ij}$ is refined to minimize the output error (by converging to the proper internal dynamics). The learning dynamics can be easily pictured as a trajectory in which a single point is a complete history of the network activity $s_n = \{s_i^t : i = 1, \dots, N;\ t = 1, \dots, T\}$. Upon initialization, a network is located at a point $s_0$ marking its untrained spontaneous dynamics. The following point $s_1$ is the activity produced by the network after applying the learning rule defined in eq. (14), and so on. By inspecting eq. (14), one notes that a sufficient condition for halting the learning is

$$\Big|\sum_k D_{ik}\left(\bar{s}_k^{*t} - \bar{s}_k^t\right)\Big| < \epsilon \quad \forall i, t, \quad (15)$$

where $\epsilon$ is an arbitrarily small positive number. If $\epsilon$ is small enough, it is possible to write:

$$\sum_k D_{ik}\left(\bar{s}_k^{*t} - \bar{s}_k^t\right) \simeq 0 \quad \forall i, t. \quad (16)$$

In the limit of a full-rank $D$ matrix (example: the LTTS limit, where $D$ is diagonal), the only solution to eq. (15) is $\bar{s}_k^t = \bar{s}_k^{*t}$, and the learning halts only when the target $s^{*t}$ is cloned. When the rank is lower, the solution to eq. (15) is not unique, and the dimensionality of the space of possible solutions is defined by the kernel of the matrix $D$ (the collection of vectors $\lambda$ such that $D\lambda = 0$). We have: $\dim \operatorname{Ker} D = N - \operatorname{rank} D = N - D$. We ran a numerical experiment in order to confirm our theoretical predictions. We used eq. (14) to store and recall a 3D continuous trajectory: $y_k^{*t}$ is a temporal pattern composed of 3 independent continuous signals. Each target signal is specified as the superposition of the four frequencies $f \in \{1, 2, 3, 5\}$ Hz with uniformly extracted random amplitude $A \in [0.5, 2.0]$ and phase $\phi \in [0, 2\pi]$. We repeated the experiment for different values of the rank $D$, taking the feedback matrix diagonal, $D_{ik} = \delta_{ik}$ for $i = k \le D$ and zero otherwise, where $\delta_{ik}$ is the Kronecker delta (the analysis for the case of a random $R_{ik}$ provides analogous results and is reported in the supplemental material). When the rank is $N$, different replicas of the learning (different initializations of the recurrent weights) converge almost to the same internal dynamics $s_i^t$. This is reported in Fig.2A-B, where $\lambda_k$ are the principal-component variances normalized to one ($\sum_k \lambda_k = 1$). We found a monotonic relation between the dimension of the convergence space and the rank (see Fig.2C; more information on the PC analysis and the estimation of the dimensionality is in the supplemental material). This observation confirms that when the rank is very high, the solution is strongly constrained, while when the rank becomes lower, the internal solution is free to move in a subspace of possible solutions. We suggest that this measure can be used on biological data to estimate the dimensionality of the learning constraints in biological neural networks from the dimensionality of the solution space.
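The kernel-dimension argument can be checked numerically in a few lines (using the main-text convention $R \in \mathbb{R}^{N \times D}$; values are illustrative):

```python
import numpy as np

# Numerical check of dim Ker D = N - rank D for the feedback matrix D = R R^T.
N, rng = 100, np.random.default_rng(0)
for rank in (10, 50, 100):
    R = rng.normal(0.0, 1.0 / np.sqrt(rank), size=(N, rank))
    D = R @ R.T
    r = np.linalg.matrix_rank(D)
    print(rank, r, N - r)   # requested rank, numerical rank, kernel dimension
```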

Tolerance to spike timing
As discussed above, τ can be interpreted as the tolerance to precise spike timing. To investigate the role of this parameter, we considered the same store-and-recall task of a 3D trajectory described in the previous section (N = 100, T = 100). We set the maximum rank (D = N) for this experiment. In Fig.3A we report the spike error $\Delta S = \sum_{i,t} |s_i^{*t} - s_i^t|$ as a function of the iteration number for different values of the parameter τ. Only for the lower values of τ does the algorithm converge exactly to the spike pattern $s_i^{*t}$. In Fig.3B we report the $mse = \sum_{k,t} (y_k^{*t} - y_k^t)^2$ as a function of the iteration number and of the parameter τ. In Fig.3C we show the mse as a function of ΔS for different values of τ. Lower τ values are characterized by a higher slope, meaning that a change in the spike pattern expressed by the network strongly affects the error on the output $y_k^t$. This suggests a low tolerance to deviations from precise spike timing in the generated output when the parameter τ is low. The consequences of this effect in a behavioral task are investigated below (section 2D Bipedal Walker).
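For reference, the two metrics used in this section can be sketched as follows (names illustrative):

```python
import numpy as np

def spike_error(s, s_star):
    """ΔS = Σ_it |s*_i^t - s_i^t|: mismatch between produced and target spikes."""
    return np.abs(s - s_star).sum()

def output_mse(s, y_star, B, tau, dt=1.0):
    """mse of the decoded output for a given spike-filtering timescale τ."""
    beta = np.exp(-dt / tau)
    s_bar, mse = np.zeros(s.shape[1]), 0.0
    for t in range(len(s)):
        s_bar = beta * s_bar + (1.0 - beta) * s[t]   # filtered spikes, eq. (4)
        mse += ((y_star[t] - B @ s_bar) ** 2).sum()
    return mse
```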

Button-and-food task
To investigate the effect of the rank D of the feedback matrix, we designed a button-and-food task (see Fig.4A for a graphical representation), which requires both a precise trajectory and retaining the memory of past states. In this task, the agent starts at the center of the scene, which also features a button and an initially locked target (the food). The agent's task is to first push the button, so as to unlock the food, and then reach for it. We stress that, to change its spatial target from the button to the food, the agent has to remember that it has already pressed the button (the button state is not provided as an input to the network during the task). In our experiment we kept the position of the button (expressed in polar coordinates) fixed at $r_{\mathrm{btn}} = 0.2$. In Fig.4C the final reward is reported as a function of the target angle $\theta_{\mathrm{targ}}$ for different ranks (purple arrows indicate the training conditions). As expected, the reward is maximum at the training conditions. Moreover, it can be readily seen that high-rank feedback structures allow for superior performances on this task.

2D Bipedal Walker
We benchmarked our behavioral cloning learning protocol on a standard task, the 2D Bipedal Walker, provided through the OpenAI gym (https://gym.openai.com [6], MIT License). The environment and the task are sketched in Fig.5A: a bipedal agent has to learn to walk and to travel as long a distance as possible. The expert behavior is obtained by training a standard feed-forward network with PPO (proximal policy optimization [34]); in particular, we used the code provided in [pytorch_minimal_ppo], MIT License (see Fig.5C for an example of the state-action trajectories). The average reward achieved by the expert is $r_{\mathrm{exp}} \simeq 180$, while a random agent achieves $r_{\mathrm{rnd}} \simeq -120$. We performed behavioral cloning by using the learning rule in eq. (14) in a network of N = 500 neurons. We chose the maximum rank (D = N) and evaluated the performances for different values of τ (more information in the supplemental material). Fig.5B-C report the rastergram of 100 random neurons and the dynamics of the membrane potential of 3 random neurons during a task episode. For each value of τ we performed 10 independent realizations of the experiment. For each realization the target pattern $s_i^{*t}$ is computed, and the recurrent weights are trained by using eq. (14). The optimization is performed using gradient descent with a learning rate η = 1.0. In Fig.5D we report the spike error $\Delta S = \sum_{i,t} |s_i^{*t} - s_i^t|$ at the end of the training. The internal dynamics $s_i^t$ almost perfectly reproduces the target pattern of spikes $s_i^{*t}$ for τ < 0.5 ms, while the error increases for larger values. The readout timescale is fixed to $\tau_{RO} = 5\,$ms, while the readout weights are initialized to zero and their learning rate is set to $\eta_{RO} = 0.01$. Every 75 training iterations of the readout we test the network and evaluate the average reward r over 50 repetitions of the task. We then evaluate the average over the 10 realizations of the maximum r obtained in each realization. Fig.5F reports this average of the maximum reward as a function of τ. The decreasing monotonic trend suggests that learning with a specific pattern of spikes (τ → 0) enables optimal performance in this walking task. We stress that in this experiment we used a clamped version of the learning rule; in other words, we substituted $s_i^{*t}$ for $s_i^t$ in the evaluation of $\partial v_i^t/\partial w_{ij}$ in eq. (7). This choice, which is only possible when the maximum rank is considered (D = N), allows for faster convergence and better performances. The results for the non-clamped version of the learning rule are reported in the supplemental material.
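For concreteness, a sketch of how the expert trajectories can be collected, assuming the classic gym API (the environment id, episode length, and all names are assumptions):

```python
import gym
import numpy as np

def collect_expert_trajectory(expert_policy, T=1600, env_id="BipedalWalker-v3"):
    """Roll out the trained PPO expert once, collecting the (state, action)
    pairs used as behavioral-cloning targets."""
    env = gym.make(env_id)
    xs, ys = [], []
    obs = env.reset()
    for _ in range(T):
        act = expert_policy(obs)          # expert_policy: the trained PPO network
        xs.append(obs)
        ys.append(act)
        obs, _, done, _ = env.step(act)
        if done:
            break
    return np.array(xs), np.array(ys)     # states x_h^t and target actions y*_k^t
```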

Discussion
In this work, we introduced a general framework for supervised learning in recurrent spiking networks, with two main parameters: the rank D of the feedback error-propagation matrix and the tolerance τ to precise spike timing (see Fig.1C). We argue that many proposed learning rules can be seen as specific cases of our general framework (e-prop, LTTS, full-FORCE). In particular, the generalization on the rank of the feedback matrix allowed us to understand target-based approaches as emerging from error-based ones when the number of independent constraints is high. Moreover, we showed that different D values lead to different dimensionalities of the solution space. If we see learning as a trajectory in the space of internal dynamics, when the rank D is maximal, every training run converges to the same point in this space. On the other hand, when D is lower, the solution is not unique, and the possible solutions are distributed in a subspace whose dimensionality grows as the rank of the feedback matrix decreases (dim Ker D = N − D). We suggest that this finding can be used to produce experimental observables to deduce the actual structure of error propagation in the different regions of the brain. On a technological level, our approach offers a strategy to clone onto a (spiking) chip an expert behavior either previously learned via standard reinforcement learning algorithms or acquired from a human agent. Our formalism can be directly applied to train an agent to solve closed-loop tasks through a behavioral cloning approach. This allowed solving tasks that are relevant in the reinforcement learning framework by using a recurrent spiking network, a problem that has been faced successfully only by a very small number of studies [2]. Moreover, our general framework, encompassing different learning formulations, allowed us to investigate which learning method is optimal to solve a specific task. We demonstrated that a high number of constraints can be exploited to obtain better performances in a task that requires retaining a memory of the internal state for a long time (the state of the button in the button-and-food task). On the other hand, we found that a typical motor task (the 2D Bipedal Walker) strongly benefits from precise timing coding, probably due to the necessity of mastering fine movement control to achieve optimal performances. In this case, a high rank of the error propagation matrix is not really relevant. From the biological point of view, we conjecture that different brain areas might be located at different positions in the plane presented in Fig.1C.

Limitations of the study
We chose relevant but very simple tasks in order to test the performances of our model and understand its properties. However, it remains to be demonstrated whether this approach can be successfully applied to more complex tasks, e.g., tasks requiring both long-term memory and fine motor skills. It would be of interest to measure the optimal values of both the rank of the feedback matrix and τ in a more demanding task. Finally, we suggested that our framework allows for inferring the error propagation structure. However, our measure requires knowing the target internal dynamics, which is not available in experimental recordings. We plan to develop a variant of this measure that does not require such an observable. Moreover, we observe that the measure we proposed is indirect, since it is necessary to first estimate the dimensionality of the solution space and then deduce the dimensionality of the learning constraints. A future development of the theory might be a method that directly infers from the data the laws of the dynamics induced by learning in the solution space.

A.1 The spiking model
In our formalism, neurons are modeled as real-valued variables $v_j^t \in \mathbb{R}$, where the label $j \in \{1, \dots, N\}$ identifies the neuron and $t \in \{1, \dots, T\}$ is a discrete time variable. Each neuron exposes an observable state $s_j^t \in \{0, 1\}$, which represents the occurrence of a spike from neuron $j$ at time $t$. We then define the following dynamics for our model:

$$\hat{s}_j^{\,t} = \exp\left(-\Delta t/\tau_s\right)\hat{s}_j^{\,t-1} + \left(1 - \exp\left(-\Delta t/\tau_s\right)\right)s_j^{\,t-1}, \quad (1)$$

$$v_i^{t+1} = \exp\left(-\Delta t/\tau_m\right)v_i^t + \left(1 - \exp\left(-\Delta t/\tau_m\right)\right)\Big(\sum_j w_{ij}\,\hat{s}_j^{\,t} + v_{\mathrm{rest}} + I_i^t\Big) + w_{\mathrm{res}}\,s_i^t, \quad (2)$$

$$s_i^{t+1} = \Theta\!\left[v_i^{t+1} - v_{\mathrm{th}}\right]. \quad (3)$$

$\Delta t$ is the discrete time-integration step, while $\tau_s$ and $\tau_m$ are respectively the spike-filtering time constant and the membrane time constant. Each neuron is a leaky integrator of a recurrent filtered input, received via a synaptic matrix $w \in \mathbb{R}^{N \times N}$, and of an external signal $I_i^t$. $w_{\mathrm{res}} = -20$ accounts for the reset of the membrane potential after the emission of a spike. $v_{\mathrm{th}} = 0$ and $v_{\mathrm{rest}} = -4$ are the threshold and the rest membrane potential.

A.2 Error-based learning rule, complete derivation
We derive here the expression for the synaptic update (eq. (6) in the main text) obtained in the error-based framework for the minimization of an output error E. We assume a regression problem where the error is computed as:

$$E = \sum_{t,k}\left(y_k^{*t} - y_k^t\right)^2; \quad (4)$$

moreover, we assume the system output $y_k^t \in \mathbb{R}$ to be a linear readout (via the real readout matrix $B_{ki}$) of the low-pass-filtered network activity $\bar{s}_i^t$:

$$y_k^t = \sum_i B_{ki}\,\bar{s}_i^t, \quad (5)$$

where $\bar{s}_i^t$ can be evaluated iteratively as:

$$\bar{s}_i^t = \beta_{RO}\,\bar{s}_i^{t-1} + \left(1 - \beta_{RO}\right)s_i^t, \quad (6)$$

with $\beta_{RO} = \exp\left(-\Delta t/\tau_{RO}\right)$. The resulting formulation for the synaptic update $\Delta w_{ij}$ is obtained by imposing that it be proportional to the negative error gradient, with the proportionality factor η representing the learning rate of the system:

$$\Delta w_{ij} = -\eta\,\frac{dE}{dw_{ij}}. \quad (7)$$

Following a classical factorization of the error-gradient computation in recurrent networks, one starts by noticing that the time-unrolled network has a feed-forward structure with layers indexed by the time variable t and shared weights $w^{(t)}_{ij} = w_{ij}\ \forall t$. The gradient for a specific time-layer can be expressed as (we use the superscript in $w^{(t)}_{ij}$ to denote the formal dependence of the synaptic matrix on the layer of the time-unrolled network; however, the network being recurrent, we omit this trivial index in subsequent expressions):

$$\frac{dE}{dw^{(t)}_{ij}} = \frac{dE}{dv_i^t}\,\frac{\partial v_i^t}{\partial w^{(t)}_{ij}}. \quad (8)$$

The total gradient is obtained by summing the contributions from all the time-layers, yielding the following expression for the error-gradient synaptic update:

$$\frac{dE}{dw_{ij}} = \sum_t \frac{dE}{dv_i^t}\,\frac{\partial v_i^t}{\partial w_{ij}}. \quad (9)$$

Following [bellec2020], we rewrite the error total derivative by collecting all the terms that can be computed locally. The core issue is that the error $E = E\left(s^1, s^2, \dots, s^T\right)$ (with $s^t = \{s_i^t\}\ \forall i$) is a function of the complete network activity, so the influence of a spike $s_i^t$ on the subsequent network development must be backtracked in the computation of the total derivative. We aim for a recursive rewriting by noting that:

$$\frac{dE}{dv_i^t} = \frac{dE}{ds_i^t}\,\frac{\partial s_i^t}{\partial v_i^t} + \frac{dE}{dv_i^{t+1}}\,\frac{\partial v_i^{t+1}}{\partial v_i^t}. \quad (10)$$

Indeed, the second term is suited to an analogous manipulation. The recursive chain terminates at $v_i^{T+1}$, the error having no dependence on that variable. Applying this strategy yields the following expression for the error gradient:

$$\frac{dE}{dv_i^t} = \sum_{t' \ge t}\frac{dE}{ds_i^{t'}}\,\frac{\partial s_i^{t'}}{\partial v_i^{t'}}\prod_{t'' = t}^{t'-1}\frac{\partial v_i^{t''+1}}{\partial v_i^{t''}}. \quad (11)$$

This form still bears the issue of requiring the computation of future events (terms indexed by $t' > t$). However, this problem is only apparent (see [bellec2020]), as it can be solved by exchanging the summation indices and rewriting the former expression in a form that, at each time t, only involves past events (thus being physically plausible):

$$\frac{dE}{dw_{ij}} = \sum_t \frac{dE}{ds_i^{t}}\,\frac{\partial s_i^{t}}{\partial v_i^{t}}\sum_{t' \le t}\Big(\prod_{t'' = t'}^{t-1}\frac{\partial v_i^{t''+1}}{\partial v_i^{t''}}\Big)\frac{\partial v_i^{t'}}{\partial w_{ij}}. \quad (12)$$

If we recognize in the last term $\partial v_i^{t}/\partial w_{ij} \propto \hat{s}_j^{\,t}$ and note that $\partial v_i^{t+1}/\partial v_i^t = \beta_m$, where $\beta_m = \exp\left(-\Delta t/\tau_m\right)$, the second summation in the gradient is recognized as a low-pass filter of $\hat{s}_j^{\,t}$, which yields the spike response function, namely:

$$e_j^{t+1} = \beta_m\,e_j^t + \left(1 - \beta_m\right)\hat{s}_j^{\,t}. \quad (13)$$

We stress that this formulation is equivalent to eq. (7) of the main text. Inspecting the term $\partial s_i^t/\partial v_i^t$, we note that, according to eq. (3) of the main text, the spike $s_i^t$ depends in a non-differentiable way on the membrane potential $v_i^t$. This is a fundamental characteristic of spike-based systems, which is usually dealt with through the introduction of a custom, non-linear pseudo-derivative $p_i^t$ (see eq. (8) of the main text). With the previous substitutions in place, the gradient reads:

$$\frac{dE}{dw_{ij}} = \sum_t \frac{dE}{ds_i^t}\,p_i^t\,e_j^t. \quad (14)$$

Up until this point, every manipulation has yielded an exact expression. However, the computation of the total derivative of the error with respect to the neuron spike still needs to be accounted for; again we face the problem of the cascading influence of the term $s_i^t$ on the entire future network activity.

In [bellec2020] the following approximation is introduced:

$$\frac{dE}{ds_i^t} \simeq \frac{\partial E}{\partial s_i^t}, \quad (15)$$

where the symbol ∂ indicates that only the direct contributions of $s_i^t$ to E are accounted for in the derivation, thus removing the influences of the spike $s_i^t$ on subsequent network activity (i.e. the eliciting of spikes $s_j^{t^+}$ for $t^+ > t$). This approximation is mandatory for a biologically plausible learning rule, which must satisfy space-time locality. If we substitute the approximation (15) into eq. (14) (and use the explicit expression for the error E in (4) and (5)), we get:

$$\frac{dE}{dw_{ij}} \simeq -2\sum_t \sum_{t' \ge t}\sum_k \left(y_k^{*t'} - y_k^{t'}\right)B_{ki}\,\beta^{\,t'-t}\,p_i^t\,e_j^t, \quad (16)$$

where $\beta = \exp\left(-\Delta t/\tau\right)$. Again, the apparent issue of the sum over future events can be solved by an exchange of the summation indices, which yields:

$$\frac{dE}{dw_{ij}} \simeq -2\sum_t \sum_k \left(y_k^{*t} - y_k^{t}\right)B_{ki}\sum_{t' \le t}\beta^{\,t-t'}\,p_i^{t'}\,e_j^{t'}. \quad (17)$$

The previous expression for the error gradient can then be used as a synaptic update rule to improve network performances. In our experiments we have introduced an additional approximation to (17) by neglecting the following temporal filter:

$$\sum_{t' \le t}\beta^{\,t-t'}\,p_i^{t'}\,e_j^{t'} \;\to\; p_i^{t}\,e_j^{t}. \quad (18)$$

This additional approximation justifies the use of the symbol $\simeq$ in eq. (6) of the main text and yields our final expression for the synaptic update rule (where all constant factors are included in the definition of the learning rate η):

$$\Delta w_{ij} = \eta\sum_t \Big[\sum_k B_{ki}\left(y_k^{*t} - y_k^t\right)\Big]p_i^t\,e_j^t. \quad (19)$$

Indeed, we observe that this approximation does not significantly affect the learning. It is straightforward to derive the analogous learning rule for the readout weights:

$$\Delta B_{ki} = \eta_{RO}\sum_t \left(y_k^{*t} - y_k^t\right)\bar{s}_i^t. \quad (20)$$
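As a sketch, eq. (20) amounts to a single least-squares gradient step on the readout (names illustrative):

```python
import numpy as np

def readout_update(s_bar_hist, y, y_star, eta_ro=0.01):
    """Gradient step for the readout weights, eq. (20):
    ΔB_ki = η_RO Σ_t (y*_k^t - y_k^t) s̄_i^t.
    Shapes: y, y_star (T, O); s_bar_hist (T, N) -> returns (O, N)."""
    return eta_ro * (y_star - y).T @ s_bar_hist
```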

B Dimensionality of solution space
This section provides more details about the main-text section Dimensionality of the solution space and Fig.2 of the main text.

B.1 Store and recall of a 3D trajectory and training protocol
To investigate the properties of the solution space, we decided to store and recall a 3D continuous trajectory. Given a target input $x_h^t$ ($h = 1, \dots, I$, $t = 1, \dots, T$), the network should reproduce the target output $y_k^{*t}$ ($k = 1, \dots, O$, $t = 1, \dots, T$). $y_k^{*t}$ is a temporal pattern composed of 3 independent continuous signals. Each target signal is specified as the superposition of the four frequencies $f \in \{1, 2, 3, 5\}$ Hz with uniformly extracted random amplitude $A \in [0.5, 2.0]$ and phase $\phi \in [0, 2\pi]$. In this case the input, referred to as the clock signal, is defined as follows: each clock unit is active within an assigned initial time window ($x_k^t = 1$), and $x_k^t = 0$ otherwise. In order to train our recurrent spiking network, the first step is to compute the target pattern of spikes $s_i^{*t}$. This is obtained by running the untrained network with the external input $I_i^t = \sum_h w^{\mathrm{in}}_{ih}\,x_h^t + \sum_k w^{\mathrm{teach}}_{ik}\,Y_k^{*t}$; the teaching term is only used to compute the target and is not present during the test. $w^{\mathrm{in}}_{ih}$ and $w^{\mathrm{teach}}_{ik}$ are random matrices whose elements are randomly drawn from Gaussian distributions with zero mean and respective variances $\sigma_{in}$ and $\sigma_{teach}$. All the parameters are reported in Table 1.
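A sketch of the target-signal generation described above (the time step and seed are assumptions):

```python
import numpy as np

def make_3d_target(T=100, dt=1e-3, seed=0):
    """Target y*: 3 independent signals, each a superposition of f in {1,2,3,5} Hz
    components with random amplitude A in [0.5, 2.0] and phase in [0, 2*pi]."""
    rng = np.random.default_rng(seed)
    ts = np.arange(T) * dt
    y = np.zeros((T, 3))
    for k in range(3):
        for f in (1.0, 2.0, 3.0, 5.0):
            A, phi = rng.uniform(0.5, 2.0), rng.uniform(0.0, 2.0 * np.pi)
            y[:, k] += A * np.sin(2.0 * np.pi * f * ts + phi)
    return y
```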
The recurrent weights are learned with the rule of eq. (14) of the main text, while the learning rule for the readout weights is defined in eq. (20).

B.2 PC representation and dimensionality estimation
The learning dynamics can be easily pictured as a trajectory in which a single point is a complete history of the network activity $s_n = \{s_i^t : i = 1, \dots, N;\ t = 1, \dots, T\}$. For simplicity, and for visualization purposes, we defined each point of the trajectory by the vector $s_n$ whose components are $s_{n,i} := \sum_t |s_i^t - s_i^{*t}|$. Every 50 learning steps a vector $s_n$ is collected. 10 different realizations of the experiment are performed (for different initializations of the recurrent weights, which are randomly extracted from a Gaussian with zero mean and variance 2.0). This procedure provides 10 different trajectories $s_n$. The first 2 PCs of these trajectories are reported in Fig.2A-B of the main text.

B.3 Dimensionality estimation
In order to estimate the dimensionality of the solution space, we consider the difference between the activity generated by the network at the end of the learning procedure and the target sequence, $\delta s_i^t = s_i^t - s_i^{*t}$. When the sequence is perfectly cloned, $\delta s_i^t = 0$ by definition; otherwise these deviations are different from zero. In order to estimate the dimensionality of the sub-space of solutions containing $\delta s_i^t$, we first perform a PC analysis. PCA is applied to a collection of T = 100 vectors $\delta s_i^t$, each of them defined by N = 100 coordinates. As a result we obtain N principal-component variances $\lambda_k$. If only one variance is significantly different from zero, the dimensionality is approximately one, and so on. The dimensionality can thus be estimated as the participation ratio $d = \left(\sum_k \lambda_k\right)^2 / \sum_k \lambda_k^2$.
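A sketch of this estimate (the participation ratio of the PCA variances; names illustrative):

```python
import numpy as np

def dimensionality(delta_s):
    """Participation-ratio estimate d = (Σλ)²/Σλ² from the PCA variances of
    δs = s - s* (a T x N matrix of residuals)."""
    lam = np.clip(np.linalg.eigvalsh(np.cov(delta_s.T)), 0.0, None)
    return lam.sum() ** 2 / (lam ** 2).sum()
```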

B.4 Random R matrix
We repeated the experiment of the main text (reported in Fig.2C) for the case in which the matrix $R_{ik}$ is random. Its elements are randomly extracted from a Gaussian with zero mean and variance $1/\sqrt{D}$. The result, reported in Fig.S1, is analogous to what is observed in the main text.
C Button-and-food

C.1 Training details

The Button & Food task requires an agent, which starts at the center of the environment domain, to reach for a button in order to unlock its final target (the food), which sits at a separate location. For this task we used the same learning procedure as described in Appendix B.1. In particular, in Fig.4 of the main text we reported results obtained when training with the semi-clamped version of the learning rule. In the clamped version, one substitutes the target spiking activity $s_i^{*t}$ for the network activity $s_i^t$ during the evaluation of the spike response function $e_j^t$; indeed, upon convergence, learning halts precisely when $s_i^t = s_i^{*t}$. However, when D < N and $D_{ik}$ is diagonal, the target cannot effectively be enforced (or learned) for the N − D neurons κ for which $D_{\kappa\kappa} = 0$; $s_i^{*t}$ therefore replaces $s_i^t$ only for the remaining neurons, yielding the semi-clamped formulation.
In our experiments we set the initial agent position at coordinates (x, y) = (0, 0), with the button positioned at (0, −0.2). The target sits at a constant radius (R = 0.7) from the origin, while the angle θ is varied in the range [30°, 150°]. The agent is trained on the angles θ ∈ {50°, 70°, 90°, 110°, 130°} and tested on the complete range. We fix the network size and the task temporal duration to N = 300 and T = 80, respectively. The expert trajectories are straight lines connecting the agent's initial location to the button and then to the food, travelled at constant speed so as to match the travel times $T^1_{a \to b} = 30$ and $T^2_{b \to f} = 50$. Additionally, a hint signal composed of two units (H = 2) encoding the Boolean locked variable (a square signal, active when true) is injected on top of the expert trajectories (via a Gaussian random projection matrix of zero mean and variance $\sigma_{hint}$) before computing the target activity. The agent receives as input the difference of both the button's and the food's position with respect to the current agent location; the input vector is then represented as $\Delta^t = \{\Delta x_b^t, \Delta y_b^t, \Delta x_f^t, \Delta y_f^t\}$. This vector is encoded via a set of tuning curves: the domain is partitioned in each direction (x and y) with a resolution of 20 cells, each cell coding a scalar input via a Gaussian activation centered around its physical location and of unit variance (with a peak value of 1). Thus, each of the four input vector components $\Delta x_{b,f}^t$, $\Delta y_{b,f}^t$ is encoded using 20 units, yielding a total input count of I = 80 units. The agent output is the velocity vector $v_{x,y}$, encoded using O = 2 output units. The agent reward is computed as $r_{\mathrm{unlock}} = \min_t d\left(p_{\mathrm{agent}}, p_{\mathrm{target}}\right)^{-1}$ when the target is successfully unlocked (button pushed), and is fixed at $r_{\mathrm{locked}} = 1/d_{\max}$, $d_{\max} = 5$, in the locked condition. The feedback matrix $D_{ik}$ is computed as $D = R^\top R$, where R is a Gaussian random matrix of shape $\mathbb{R}^{D \times N}$, with D the predefined matrix rank, and variance $\sigma = 1/\sqrt{D}$. The learning procedure is the same as described in Appendix B.1. We trained our system for 1000 epochs (with the Adam optimizer) and repeated the experiment 100 times for each rank.
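A sketch of the tuning-curve encoding described above (the domain bounds lo and hi and the input values are assumptions):

```python
import numpy as np

def encode_tuning_curves(x, n_cells=20, lo=-1.0, hi=1.0):
    """Population encoding of one scalar input component: n_cells Gaussian
    tuning curves tile [lo, hi], each with unit variance and peak value 1
    at its preferred location."""
    centers = np.linspace(lo, hi, n_cells)
    return np.exp(-0.5 * (x - centers) ** 2)

# The four components Δx_b, Δy_b, Δx_f, Δy_f are each encoded with 20 units,
# for a total of I = 80 input units (values here are illustrative).
inputs = np.concatenate([encode_tuning_curves(d) for d in (0.0, -0.2, 0.3, 0.5)])
```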

C.2 Random R matrix
We report here results obtained with the pure non-clamped version of the rule (eq. (24)) and a random feedback matrix $R_{ik}$. In Fig.S2 we report the results of the analysis. In Fig.S2A the average reward (error bars omitted for visualization clarity) is plotted as a function of the test angle θ for the different ranks of the feedback matrix D, highlighting the superior performances of the high ranks for this particular task. This feature is confirmed in panel Fig.S2B, where the average reward (across all test conditions) is reported as a function of the rank D of the feedback matrix, again stressing how a high-rank feedback induces optimal performances for this task. This result confirms that what is reported in the main text for a diagonal feedback matrix is indeed general and holds for a random $R_{ik}$ as well. The complete set of parameters used in the Button & Food task is reported in Table 2.

D 2D Bipedal Walker

D.1 Training details
The 2D Bipedal Walker environment was provided through the OpenAI gym (https://gym.openai.com [1606.01540], MIT License). The expert behavior is obtained by training a standard feed-forward network with PPO (proximal policy optimization [schulman2017proximal]); in particular, we used the code provided in [pytorch_minimal_ppo], MIT License. The average reward achieved by the expert is $r_{\mathrm{exp}} \simeq 180$, while a random agent achieves $r_{\mathrm{rnd}} \simeq -120$. The sequence of states-actions is collected in the vectors $y_k^{*t}$, $k = 1, \dots, O$, and $x_h^t$, $h = 1, \dots, I$, $t = 1, \dots, T$. The learning procedure is the same as described in Appendix B.1. All the parameters of the experiment are reported in Table 3. For this experiment we chose the maximum rank (D = N). In this case, if the matrix $R_{ij}$ is extracted randomly (e.g., Gaussian with zero mean), the matrix $D_{ij}$ is almost diagonal; for this reason we take $D_{ij} = \delta_{ij}$. We evaluate the performances for different values of τ. The learning is divided into 2 phases. In the first, the recurrent weights are trained to reproduce the internal target dynamics $s_i^{*t}$ for 500 iterations using gradient descent. In the second phase of the training procedure, only the readout weights are trained, using eq. (20).

D.2 Non-clumped case
In the experiment reported in Fig.5 of the main text we adopted the clamped version of the learning rule. This means substituting the target spiking activity $s_i^{*t}$ for the activity produced by the network $s_i^t$ in the evaluation of the spike response function $e_j^t$. We repeated the experiment with the non-clamped version of the training. Fig.S3 reports the average of the maximum reward as a function of τ. For each value of τ we performed 10 independent realizations of the experiment. For each realization the target $s_i^{*t}$ is computed, and the recurrent weights are trained. The optimization of the recurrent weights is performed using eq. (24) through gradient descent with a learning rate η; the learning of the readout weights is performed using eq. (20) through gradient descent with a learning rate $\eta_{RO}$. Every 75 training iterations of the readout we test the network and evaluate the average reward r over 50 repetitions of the task. We then evaluate the average over the 10 realizations of the maximum r obtained in each realization.
It is apparent that in this case there exists an optimal $\tau \simeq 2.5\,$ms, allowing for a minimal training error ($\Delta S = \sum_{i,t} |s_i^{*t} - s_i^t|$, the difference between the target pattern of spikes and the generated one, is minimal; Fig.S3, left panel) and a maximum value of the reward (Fig.S3, right panel). However, the relationship between the training error and the reward is not trivial, since for high τ values we obtained a very poor training error but still a good value of the average reward.