## Abstract

The field of recurrent neural networks is over-populated by a variety of proposed learning rules and protocols. The scope of this work is to define a generalized framework, as a step towards the unification of this fragmented scenario. In the field of supervised learning, two opposite approaches stand out: error-based and target-based. This duality gave rise to a scientific debate on which learning framework is the most likely to be implemented in biological networks of neurons. Moreover, the existence of spikes raises the question of whether the coding of information is rate-based or spike-based. To face these questions, we propose a learning model with two main parameters: the rank *R* of the feedback learning matrix and the tolerance to spike timing *τ*_{⋆}. We demonstrate that a low (high) rank accounts for an error-based (target-based) learning rule, while high (low) tolerance to spike timing promotes rate-based (spike-based) coding. We show that in a store and recall task, high ranks allow for lower MSE values, while low ranks enable a faster convergence. Our framework naturally lends itself to Behavioral Cloning and allows for efficiently solving relevant closed-loop tasks, investigating what parameters are optimal to solve a specific task. We found that a high *R* is essential for tasks that require retaining memory for a long time (Button and Food). On the other hand, this is not relevant for a motor task (the 2D Bipedal Walker). In this case, we find that precise spike-based coding enables optimal performance. Finally, we show that our theoretical formulation allows for defining protocols to estimate the rank of the feedback error in biological networks. We release a PyTorch implementation of our model supporting GPU parallelization.

## Author summary

Learning in biological or artificial networks means changing the laws governing the network dynamics in order to better behave in a specific situation. However, there exists no consensus on what rules regulate learning in biological systems. To address this question, we propose a novel theoretical formulation for learning with two main parameters, the number of learning constraints (*R*) and the tolerance to spike timing (*τ*_{⋆}). We demonstrate that a low (high) rank accounts for an error-based (target-based) learning rule, while high (low) tolerance to spike timing *τ*_{⋆} promotes rate-based (spike-based) coding.

Our approach naturally lends itself to Imitation Learning (and Behavioral Cloning in particular) and we apply it to solve relevant closed-loop tasks such as the button-and-food task and the 2D Bipedal Walker. The button-and-food task is a navigation task that requires retaining a long-term memory, and it benefits from a high *R*. On the other hand, the 2D Bipedal Walker is a motor task and benefits from a low *τ*_{⋆}.

Finally, we show that our theoretical formulation suggests protocols to deduce the structure of learning feedback in biological networks.

**Citation: **Capone C, Muratore P, Paolucci PS (2022) Error-based or target-based? A unified framework for learning in recurrent spiking networks. PLoS Comput Biol 18(6): e1010221. https://doi.org/10.1371/journal.pcbi.1010221

**Editor: **Michele Migliore, National Research Council, ITALY

**Received: **December 6, 2021; **Accepted: **May 17, 2022; **Published: ** June 21, 2022

**Copyright: ** © 2022 Capone et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **The code associated to this paper is made publicly available in the following repository: https://github.com/myscience/goal. We provide two Python implementations: a pure NumPy-based version and a PyTorch implementation with GPU support.

**Funding: **This work has been supported by the European Union Horizon 2020 Research and Innovation program under the FET Flagship Human Brain Project (grant agreement SGA3 n. 945539, to P.S.P., and grant agreement SGA2 n. 785907, to P.S.P.) and by the INFN APE Parallel/Distributed Computing laboratory as salary to P.S.P. C.C. received salary from SGA3 n. 945539 and SGA2 n. 785907. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

When confronted with reality, humans learn with high sample efficiency, benefiting from the fabric of society and its abundance of experts in relevant domains. A conceptually simple and effective strategy for learning in this social context is Imitation Learning. One can conceptualize this learning strategy in the Behavioral Cloning framework, where an agent observes a near-optimal behavior (expert demonstration) and progressively improves its mimicking performance by minimizing the differences between its own and the expert’s behavior. Behavioral Cloning can be directly implemented in a supervised learning framework. In recent years, a competition between two opposite interpretations of supervised learning has emerged: error-based approaches [1–5], where the error information computed at the environment level is injected into the network and used to improve later performances, and target-based approaches [6–13], where a target for the internal activity is selected and learned. In this work, we provide a general framework, which we call GOAL (Generalized Optimization of Apprenticeship Learning), where these different approaches are reconciled and can be retrieved via a proper definition of the error propagation structure the agent receives from the environment. Target-based and error-based are particular cases of our comprehensive framework. This novel formulation, being more general, offers new insights on the importance of the feedback structure for network learning dynamics, a still under-explored degree of freedom. Moreover, we remark that spike-timing-based neural codes are experimentally suggested to be important in several brain systems [14–17]. This evidence led us to investigate the role of coding with specific patterns of spikes by introducing a parameter that defines the tolerance to precise spike timing during learning.

Although many studies have approached learning in feedforward [9, 18–22] and recurrent spiking networks [2, 3, 8, 10, 12, 23, 24], a very small number of them successfully faced real world problems and reinforcement learning tasks [3, 25]. In this work, we apply our framework to the problem of behavioral cloning in recurrent spiking networks and show how it produces valid solutions for relevant tasks (button-and-food and the 2D Bipedal Walker). From a biological point of view, we focus on a novel route opened by such a framework: the exploration of what feedback strategy is actually implemented by biological networks and in the different brain areas. We propose an experimental measure that can help elucidate the error propagation structure of biological agents, offering an initial step in a potentially fruitful insight-cloning of naturally evolved learning expertise.

## Models

### The spiking network

In our formalism, neurons are modeled as real-valued variables *v*_{j}^{t} ∈ ℝ, where the label *j* ∈ {1, …, *N*} identifies the neuron and *t* ∈ {1, …, *T*} is a discrete time variable. Each neuron exposes an observable state *s*_{j}^{t} ∈ {0, 1}, which represents the occurrence of a spike from neuron *j* at time *t*. We then define the following dynamics for our model:
(1) (2) (3)
Δ*t* is the discrete time-integration step, while *τ*_{s} and *τ*_{m} are respectively the spike-filtering time constant and the membrane time constant. Each neuron is a leaky integrator with a recurrent filtered input obtained via the synaptic matrix *w*_{ij} and an external signal. *w*_{res} = −20 accounts for the reset of the membrane potential after the emission of a spike. *v*^{th} = 0 and *v*_{rest} = −4 are the threshold and the resting membrane potential, respectively.
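As an illustration of the dynamics described above (a minimal sketch, not the released implementation), the following NumPy code steps a leaky integrate-and-fire network with exponentially filtered recurrent input. Only `w_res = -20`, `v_th = 0` and `v_rest = -4` come from the text; the exponential discretization, the update order, the function name `simulate`, and the values of `dt`, `tau_s`, `tau_m`, the network size and the input statistics are assumptions made for the example.

```python
import numpy as np

def simulate(w, inp, dt=1.0, tau_s=2.0, tau_m=8.0,
             w_res=-20.0, v_th=0.0, v_rest=-4.0):
    """Simulate a recurrent spiking network; returns spikes of shape (T, N).

    w   : (N, N) recurrent synaptic matrix
    inp : (T, N) external input signal
    """
    T, N = inp.shape
    a_s = np.exp(-dt / tau_s)   # decay factor of the spike filter
    a_m = np.exp(-dt / tau_m)   # decay factor of the membrane potential
    v = np.full(N, v_rest)      # membrane potentials, started at rest
    s = np.zeros(N)             # spikes emitted at the previous step
    s_hat = np.zeros(N)         # filtered spikes
    spikes = np.zeros((T, N))
    for t in range(T):
        s_hat = a_s * s_hat + (1.0 - a_s) * s          # filter the past spikes
        drive = w @ s_hat + inp[t] + v_rest            # recurrent + external input
        v = a_m * v + (1.0 - a_m) * drive + w_res * s  # leaky integration and reset
        s = (v > v_th).astype(float)                   # spike on threshold crossing
        spikes[t] = s
    return spikes

rng = np.random.default_rng(0)
N, T = 20, 100
w = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))  # assumed random recurrent weights
inp = rng.normal(5.0, 2.0, size=(T, N))             # assumed noisy external drive
spikes = simulate(w, inp)
```

Because `w_res` is strongly negative, a neuron that has just fired is pushed well below threshold on the next step, which is how the reset enters this sketch.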

### The supervised learning rule

We aim at training the recurrent spiking network to reproduce a desired output. In the framework of behavioral cloning, this output is the behavior of an expert agent (human, pre-trained artificial intelligence) which already knows an almost optimal solution of a task (see details in the section Application to closed-loop tasks: Behavioral cloning).

In order to train the network to reproduce at each time the desired output vector , it is necessary to minimize the loss function:
(4)
where is a linear readout of the spiking activity of the network and . is defined as:
(5)
i.e., a temporal filtering of the spikes , where Δ*t* is the temporal bin of the simulation and *τ*_{⋆} the timescale of the filtering.

It is possible to derive the learning rule by differentiating the previous error function (by following the gradient), similarly to what was done in [3]: (6) where we have used for the pseudo-derivative (similarly to [3]) and reserved for the spike response function that can be computed iteratively as (7)

In our case the pseudo-derivative, whose purpose is to replace (since *f*(⋅) is non-differentiable, see Eq (3)), is defined as , it peaks at and *δv* is a parameter defining its width. For the complete derivation, we refer to Section A in S1 Text (where we also discuss the ≃ in Eq (6)).
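A pseudo-derivative with the two stated properties (a peak at the threshold and a width set by *δv*) can be sketched as follows. The specific functional form used here, the derivative of a logistic function of the membrane potential, is an assumption made for illustration, as are the function name and default values.

```python
import math

def pseudo_derivative(v, v_th=0.0, delta_v=1.0):
    """Smooth surrogate for the derivative of the spike function.

    Assumed form: derivative of a logistic function centred at v_th
    with width delta_v. It is symmetric around v_th and peaks there.
    """
    x = (v - v_th) / delta_v
    e = math.exp(-abs(x))  # symmetric in x, and safe against overflow
    return e / (delta_v * (1.0 + e) ** 2)

peak = pseudo_derivative(0.0)   # value at threshold
side = pseudo_derivative(2.0)   # value away from threshold
```

A smaller `delta_v` makes the surrogate narrower and taller, so that only neurons close to threshold contribute to the weight update.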

## Results

In the following sections, we define a generalized learning framework by identifying two sensitive parameters: the number of constraints *R* and the sensitivity to precise temporal coding *τ*_{⋆}. We analytically show how different learning rules presented in the literature can be accounted for as specific cases of our framework. We provide a geometrical interpretation, and strengthen our statement through numerical experiments.

Finally, we test the performances of our learning rule as a function of the two main parameters (*R*, *τ*_{⋆}) on different tasks: a store and recall task (of a target 3D trajectory), a navigation task (button and food) and a motor task (2D bipedal walker).

### Theoretical results

#### Generalization of the learning framework.

As discussed above, in Eq (6) we defined as the desired output (e.g., the target behavior). However, it is possible to imagine that in both biological and artificial systems there are many more constraints, not directly related to the behavior, to be satisfied. One example is the following: it might be necessary for the network to encode an internal state which is useful to produce the behavior and to solve the task (e.g. an internal representation of the position of the agent, contextual information and so on). The encoding of this information can automatically emerge during training; however, suggesting it directly to the network might significantly facilitate the learning process. This signal is referred to as a hint in the literature [23]. For this reason, we introduce a further set of output targets , *k* = *O* + 1, …*R* and define , *k* = 1, …*R* as the collection of and . is the signal decoded from the network activity through a linear readout and it is constrained to be as similar as possible to the target (see Fig 1A). By definition, is the same as but with extra rows. See section Definition of the additional constraints for details on the choice of and .

The gradient-based minimization of the error results in the following learning rule: (8)

We propose a general framework where the dimensionality of the error feedback *R* and the sensitivity to temporal coding *τ*_{⋆} can be varied arbitrarily. (**A**) Graphical depiction of a general supervised learning framework. *R* errors (difference between the output and the target output) are projected to the network to evaluate the recurrent weights updates. (**B**) Graphical depiction of the role of the temporal sensitivity parameter *τ*_{⋆}. (**C**) Several learning rules present in the literature (e-prop, full-FORCE, LTTS) can be accounted for as specific cases of our general framework. (**D**) Target-based approaches define a specific internal solution for network dynamics (red point), while error-based solutions are distributed in a subspace of possible internal states (green line). However, not all the points on the green line are accessible (given the discrete nature of the spiking activity), and the error-based solution can be sub-optimal. (**E**) When *τ*_{⋆} is large, the accessible states are denser, and it is easier to find a good solution with an error-based approach. (**F**) The MSE for target-based and error-based solutions becomes comparable for large *τ*_{⋆} values. Points are averages over five realizations of the minimum MSE (after 10^{4} training epochs), with error bars denoting the corresponding standard deviations.

The possibility to broadcast specific local errors in biological networks has been debated for a long time [26, 27]. On the other hand, the propagation of a target appears to be more coherent with biological observations [28–31]. For this reason, we propose an alternative formulation allowing to propagate targets rather than errors [6, 27]. This can be easily done by writing the target output as: (9)

We stress here the fact that, due to the discreteness of spikes, the last equality cannot be strictly achieved, and it is only an approximation. One can simply consider *s*^{⋆t} to be the solution of the corresponding optimization problem. The optimal encoding for a continuous trajectory through a pattern of spikes has been broadly discussed in [32]. However, the pattern *s*^{⋆t} might describe an impossible dynamics (for example, activity that follows periods of complete network silence). For this reason, we make a different choice here: *s*^{⋆t} is the pattern of spikes expressed by the untrained network (where recurrent connections are all set to zero) when the target output is randomly projected as an input (similarly to [8, 9]). It has been demonstrated that this choice allows for fast convergence and encodes detailed information about the target output. With these additional considerations, we can now rewrite our expression for the weight update in terms of the network activity:
(10)
where is a novel matrix which acts recurrently on the network. The two core new terms are the matrix and . The former defines the dynamics in the space of the internal network activities during learning.

The latter provides a specific pattern of spikes, which is directly suggested to the network as the internal solution of the task. We interpret the parameter *τ*_{⋆} (the time-scale of the spike filtering, see Eq (5)) as the tolerance to spike timing of the proposed internal solution . This clarifies the use of the subscript ⋆ for the timescale *τ*_{⋆}, since it concerns the target quantities. In Fig 1B we show in a sketch that, for the same spike displacement between the spontaneous and the target activity, the error is higher when the *τ*_{⋆} is lower. However, as demonstrated in the following sections, the network dynamics only converges to for a range of parameters and *τ*_{⋆}.
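The role of *τ*_{⋆} as a tolerance to spike timing can be illustrated numerically. The sketch below filters two spike trains that differ only by a small displacement of a single spike and compares the resulting squared error for a short and a long timescale. The normalized exponential filter, the time unit (simulation steps), and all names and values are assumptions made for the example.

```python
import math

def filter_spikes(spikes, tau_star, dt=1.0):
    """Exponentially filter a binary spike train with timescale tau_star."""
    decay = math.exp(-dt / tau_star)
    state, filtered = 0.0, []
    for s in spikes:
        state = decay * state + (1.0 - decay) * s
        filtered.append(state)
    return filtered

def spike_error(train_a, train_b, tau_star):
    """Squared distance between two spike trains after filtering."""
    fa = filter_spikes(train_a, tau_star)
    fb = filter_spikes(train_b, tau_star)
    return sum((a - b) ** 2 for a, b in zip(fa, fb))

T = 50
target = [1 if t == 10 else 0 for t in range(T)]       # target spike at t = 10
spontaneous = [1 if t == 13 else 0 for t in range(T)]  # same spike, 3 steps late

err_low_tau = spike_error(target, spontaneous, tau_star=1.0)
err_high_tau = spike_error(target, spontaneous, tau_star=20.0)
```

With these settings the same displacement is penalized much more strongly for the short timescale than for the long one, which matches the qualitative picture sketched in Fig 1B.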

#### Definition of the additional constraints.

As discussed above, there are many possible choices for the additional constraints (contextual signals, firing-rate regularization and so on), and we take the following one. We first compute the target spiking pattern (as described in the section 2.2.1) and train the network readout (via standard output-error gradient descent, Adam optimizer) to reconstruct the target output. We remark that the target spiking pattern only serves to define the readout weights, and it will not be used explicitly when using the learning rule in Eq (8).

Then, we define . The first *O* rows of the matrix are taken from the matrix so that for *k* ≤ *O*. The other rows are chosen randomly from a Gaussian distribution (with the same variance of the matrix ). In summary, the generalized target is:
(11)
where the additional constraints are a random linear combination of a hypothetical internal solution . However, we demonstrate that only when the rank is high, the internal dynamics converges to .

#### Training protocol.

We recap in this section the network training procedure used for rank-feedback control (Figs 1 and 2). First, we construct the target spike-pattern from the target output signal: the output signal is randomly projected into the untrained network, alongside the regular input. The spike activity expressed by the untrained network with such input is selected as internal target . Then, we train the linear readout to reproduce the target signal from the newly constructed target spike pattern. Note how, at this stage . The output connectivity matrix is then expanded to include new constraints, resulting in . In practice, we extract the *R* − *O* new vectors from a Gaussian distribution with zero mean and variance equal to 2 std(*B*_{ki}). Finally, the expanded output connectivity matrix is adopted in the recurrent synaptic updates and Eq (8) is used for training.
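The expansion of the readout matrix described in this protocol can be sketched as below. The trained readout is replaced by a random stand-in, the target spikes by random binary activity, and the scale of the extra Gaussian rows is matched to the standard deviation of the existing readout entries; all of these, together with the names `expand_readout`, `B_tilde` and `Y_tilde`, are assumptions made for illustration.

```python
import numpy as np

def expand_readout(B, R, rng):
    """Stack R - O random Gaussian rows under a trained (O, N) readout B."""
    O, N = B.shape
    if R < O:
        raise ValueError("R must be at least the output dimension O")
    # Zero-mean rows whose scale is matched to the entries of B (assumed choice).
    extra = rng.normal(0.0, B.std(), size=(R - O, N))
    return np.vstack([B, extra])

rng = np.random.default_rng(0)
N, O, R, T = 100, 3, 10, 200
B = rng.normal(0.0, 0.1, size=(O, N))                     # stand-in for a trained readout
s_target = rng.integers(0, 2, size=(T, N)).astype(float)  # stand-in for target activity

B_tilde = expand_readout(B, R, rng)  # (R, N): first O rows are the original readout
Y_tilde = s_target @ B_tilde.T       # (T, R): generalized targets
```

The first `O` columns of `Y_tilde` reproduce the original target output, while the remaining `R - O` columns are random linear combinations of the target activity and act as additional constraints.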

#### The target-based limit.

We remark that the formulation described above is equivalent to an error-based approach whose output target is . When the rank of the matrix is comparable to the number of neurons, the matrix is almost diagonal and the learning rule reduces to: (12)

In this case, the training of recurrent weights reduces to learning a specific pattern of spikes [33–37]. In this limit, the model Learning Through Target Spikes (LTTS) [9] is recovered (see Fig 1C), with the only difference of the presence of the pseudo-derivative. These two limits are investigated numerically in the section Dimensionality of the solution space. We remark that in the formulation described in Eq (10) it is possible to change the rank of the feedback by directly changing the number of rows in the matrix .

Numerical results confirming this theoretical prediction are reported in the following section. A major advantage of the target-based limit is the implementability and plausibility of the error propagation. In the general error-based case, the activity of the neurons has to be read out, compared with the target output, and then broadcast to all neurons. This process requires time. This is reflected in the non-locality of Eq (10), due to the feedback matrix that acts recurrently on the network. To update the weights between neuron *i* and neuron *j* it is necessary to know the local errors from all other neurons. On the contrary, the rule derived in the target-based limit only requires the local error of the post-synaptic neuron.
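The difference between the non-local and the local update can be made concrete with a small sketch. It assumes that the weight update has the structure "fed-back error times pseudo-derivative times spike response", with an *N* × *N* feedback matrix `D` mixing the per-neuron errors; this structure, the array layout, and the names `weight_update` and `local_update` are assumptions made for illustration.

```python
import numpy as np

def weight_update(D, err, p, e, eta=0.01):
    """General update: dw_ij = eta * sum_t [sum_h D_ih err_h^t] p_i^t e_j^t.

    D   : (N, N) feedback matrix
    err : (T, N) per-neuron error on the filtered activity
    p   : (T, N) pseudo-derivative of each post-synaptic neuron
    e   : (T, N) spike response of each pre-synaptic neuron
    """
    fed_back = err @ D.T                  # neuron i receives errors from all neurons h
    return eta * (fed_back * p).T @ e     # (N, N)

def local_update(err, p, e, eta=0.01):
    """Target-based limit: only the error of the post-synaptic neuron is used."""
    return eta * (err * p).T @ e

rng = np.random.default_rng(0)
T, N = 30, 6
err = rng.normal(size=(T, N))
p = rng.uniform(size=(T, N))
e = rng.uniform(size=(T, N))

dw_full = weight_update(np.eye(N), err, p, e)  # diagonal feedback matrix
dw_local = local_update(err, p, e)
```

When `D` is the identity, the general update coincides with the local one; any off-diagonal entry of `D` makes the update of a synapse depend on the errors of other neurons.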

#### Interpretation of the framework.

The major strength of our formulation is the capability to encompass very different learning approaches in the same framework. We have already noted in the previous section how in the high-rank and small-*τ*_{⋆} limit, one recovers the LTTS model, where a specific pattern of spikes is learned. In the large-*τ*_{⋆} regime the precise temporal coding of spikes is blurred out (see Fig 1B), preventing the learning of a specific spike pattern. However, with a high-rank configuration a target rate-based internal solution is still identified: this is the known full-FORCE solution [8], where the learnt input currents induce an internal target activity, which is suited for the task. Loosening the learning constraints, i.e. reducing the feedback matrix rank *R*, progressively enlarges the space of internal solutions. When *R* matches the output dimensionality, we recover the known error-based approach (e.g. e-prop [3]): no internal dynamics is imposed on the system, which is only guided by the projection of the output error. Fig 1C visually represents where these models are located in the (*R*, *τ*_{⋆}) plane. Our novel general description of different learning approaches offers a new tool to better investigate their relationships. Here we provide a geometrical interpretation of our model, which intuitively explains the role of the two parameters *R* and *τ*_{⋆} in defining the network internal solution to a task. Fig 1D–1E represent the error space, i.e. the difference between the network activity and the target activity at a specific time *t* (two sample neurons are represented). As discussed in the previous section, the target-based learning rule univocally defines one solution for every *i* and *t* (represented by the red point in Fig 1D–1E). Thus, the high-rank learning rule (and the target-based one) tries to make the network dynamics converge to the red point (as represented by the black arrow in Fig 1D–1E).

On the other hand, the error-based rule defines a set of possible solutions, those on which the output error vanishes for every *k* and *t* (this can be represented as a line in the space in Fig 1D–1E, green line), so that the MSE is low. In other words, using the low-rank learning rule is equivalent to looking for the closest solution near the green line (as represented by the black arrow in Fig 1D–1E). However, not all the points on the green line are accessible to the network (given the discrete nature of the spiking activity, see Fig 1D, black crosses), and the error-based solution achieved during the learning procedure can be sub-optimal (see Fig 1D, green point, which is far from the green line). Indeed, when *τ*_{⋆} is small, the difference between the network and the target activity is essentially unfiltered, and it can assume only the values −1, 0, or 1. However, when *τ*_{⋆} is large, the signals produced by the network are filtered, and the possible values of this difference are no longer limited to −1, 0, or 1. As a result, the accessible states are denser in the space of possible internal solutions (see Fig 1E, black crosses), and it is easier to find a good solution with an error-based approach (see Fig 1E, green point, which is now closer to the green line). This theoretical prediction is confirmed in numerical experiments (see the following section).
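The enrichment of accessible states for large *τ*_{⋆} can be checked with a short sketch that counts the distinct values taken by the difference between two filtered spike trains. The normalized exponential filter, the random spike trains, the firing probability, and the two timescales are assumptions made for the example.

```python
import math
import random

def filter_spikes(spikes, tau_star, dt=1.0):
    """Exponentially filter a binary spike train with timescale tau_star."""
    decay = math.exp(-dt / tau_star)
    state, filtered = 0.0, []
    for s in spikes:
        state = decay * state + (1.0 - decay) * s
        filtered.append(state)
    return filtered

random.seed(0)
T = 200
network = [int(random.random() < 0.2) for _ in range(T)]  # stand-in network spikes
target = [int(random.random() < 0.2) for _ in range(T)]   # stand-in target spikes

def distinct_differences(tau_star):
    """Set of values (rounded) taken by the filtered activity difference."""
    a = filter_spikes(network, tau_star)
    b = filter_spikes(target, tau_star)
    return {round(x - y, 6) for x, y in zip(a, b)}

few = distinct_differences(0.01)   # nearly unfiltered: values restricted to -1, 0, 1
many = distinct_differences(10.0)  # filtered: a much richer set of values
```

For the very short timescale the difference is confined to the three discrete values, while for the long timescale it takes many intermediate values, which is the sense in which the accessible states become denser.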

### Numerical results

#### Store and recall.

To investigate the role of the and *τ*_{⋆} parameters in our learning rule (see Eqs (8) and (10)), we considered a store and recall task. The network is asked to reproduce the target trajectory when prompted with a clock-like input signal , with a random Gaussian matrix with zero mean and *σ*_{inp} variance and , *C* = 5 (see Fig 2A for a graphical representation of the task).

(**A**) We benchmarked our framework on a store and recall task. The network has to autonomously reproduce a target 3D trajectory, given a clock like input (bottom). Dashed line: target output. Solid line: network output at the end of the training for low and high rank conditions (green and blue respectively). (**B**) MSE (between the target output and the network output) as a function of training epochs, in the store and recall task of a 3D trajectory. Low rank (R = 3, green) and high rank (R = N, red) performances are compared. (**C**) Spike error as a function of training epochs, for high and low rank learning rule. (**D**) The MSE (color-coded) as a function of the rank R and the timescale *τ*_{⋆}. (**E**) The spike error (color-coded) as a function of the rank R and the timescale *τ*_{⋆}. (**F**) Convergence time (color-coded) as a function of the rank R and the timescale *τ*_{⋆}.

The target trajectory is a temporal pattern composed of *O* = 3 independent continuous signals. Each target signal is specified as the superposition of the four frequencies *f* ∈ {1, 2, 3, 5} Hz with uniformly extracted random amplitude *A* ∈ [0.5, 2.0] and phase *ϕ* ∈ [0, 2*π*]. We defined the additional constraints as described in section 2.2.2 and used the learning rule in Eq (8). Given this formulation, we can arbitrarily modulate the parameters *R* and *τ*_{⋆}.
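A target of this kind can be generated as in the following sketch. The frequencies and the amplitude and phase ranges follow the description above; the use of a cosine, the number of time steps `T`, the step `dt`, and the function name `make_target` are assumptions made for the example.

```python
import numpy as np

def make_target(T=150, dt=0.001, n_out=3, freqs=(1, 2, 3, 5), rng=None):
    """Return a (T, n_out) target: each output is a sum of four sinusoids."""
    rng = np.random.default_rng() if rng is None else rng
    t = np.arange(T) * dt                                    # time in seconds
    f = np.asarray(freqs, dtype=float)                       # frequencies in Hz
    amp = rng.uniform(0.5, 2.0, size=(n_out, len(f)))        # random amplitudes
    phase = rng.uniform(0.0, 2 * np.pi, size=(n_out, len(f)))  # random phases
    # Broadcast to (n_out, n_freqs, T), then sum over the frequency axis.
    waves = amp[:, :, None] * np.cos(
        2 * np.pi * f[None, :, None] * t[None, None, :] + phase[:, :, None]
    )
    return waves.sum(axis=1).T

y_target = make_target(rng=np.random.default_rng(0))
```

Since each of the four components has amplitude at most 2.0, every output stays within ±8.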

First, we validated the intuition (see Fig 1D and 1E) that a larger time constant *τ*_{⋆}, with the consequent enrichment of available network states, should progressively erase the difference between an error-based (low-rank) and a target-based (high-rank) learning approach. We considered the low-rank and the high-rank scenarios and trained the network until convergence for increasing values of *τ*_{⋆}. The results, collected in Fig 1F, illustrate how the difference between the two approaches shrinks for increasing *τ*_{⋆}, with the two MSE values becoming comparable. In Fig 2B and 2C we report the output-error (measured as the MSE) and the spike-error as a function of training epochs for a particular choice of the *τ*_{⋆} parameter.

Fig 2C shows that, for a high rank, the internal activity of the network converges exactly to the proposed internal target *s*^{⋆}. This confirms that learning with a high rank is equivalent to a target-based approach. On the other hand, when the rank is low, this does not happen, and the network autonomously finds an alternative internal dynamics that is different from *s*^{⋆} but still produces an output similar to the target output (as shown in Fig 2B).

Both methods achieve low output errors (Fig 2B), with the high-rank approach eventually scoring a lower MSE (the readout limit, i.e. the lowest achievable error given the pre-trained readout matrix), while the low-rank approach allows for a faster convergence. To better grasp the interplay between the rank *R* and the *τ*_{⋆} parameters, we trained several instances of the same task to explore the model behavior in the full (*R*, *τ*_{⋆}) plane. We measured the output- and spike-error (Fig 2D and 2E), plus an estimate of the convergence time T_{conv} (Fig 2F), quantified as the number of epochs needed to halve the initial output error. Only high-rank feedback achieved low spike-errors (Fig 2E), with a non-trivial dependence on the optimal *τ*_{⋆}. The LTTS algorithm was found to be the most robust in this sense, reliably achieving low spike errors for a broad range of *τ*_{⋆}. Interestingly, the MSE metric (Fig 2D) highlighted two regions of low output-error, corresponding either to pure error-based (*R* = *O*) or to high-rank feedback, each with a different optimal *τ*_{⋆}. Finally, the convergence time highlights how a low rank systematically allows for a faster convergence (Fig 2F, light region in the bottom-right part of the panel). A possible explanation for this is that, while the target-based (high-rank) solution is unique, there are many possible error-based (low-rank) solutions. For this reason, it is easier to find a close error-based solution starting from a random initial condition in the network activity space (see Fig 1D), while it requires more time to reach the target-based solution. However, training slows for low rank and low *τ*_{⋆}, with magenta-colored conditions signaling failure to reduce the output-error by half.
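The convergence-time estimate described above (epochs needed to halve the initial output error) can be computed from a recorded error curve as in this sketch. The function name and the convention of returning `None` when the error is never halved are assumptions made for the example.

```python
def convergence_time(errors):
    """First epoch at which the error is at most half of its initial value.

    Returns None when the error is never halved, mirroring the failure
    conditions mentioned in the text.
    """
    if not errors:
        return None
    threshold = errors[0] / 2.0
    for epoch, err in enumerate(errors):
        if err <= threshold:
            return epoch
    return None

fast = convergence_time([1.0, 0.8, 0.5, 0.3])  # halved at epoch index 2
stuck = convergence_time([1.0, 0.9, 0.8])      # never halved
```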

#### Dimensionality of the solution space.

The learning formulation of Eq (10) offers major insights into the role played by the feedback matrix. Consider the learning problem (with fixed input and target output) where the synaptic matrix *w*_{ij} is refined to minimize the output error (by converging to the proper internal dynamics). The learning dynamics can be easily pictured as a trajectory where a single point is a complete history of the network activity *s*_{n}, with *n* = 1, …*E*, where *E* is the total number of learning epochs. Upon initialization, a network is located at a point *s*_{0} marking its untrained spontaneous dynamics. The following point *s*_{1} is the activity produced by the network after applying the learning rule defined in Eq (10), and so on. By inspecting Eq (10) one notes that a sufficient condition for halting the learning is that the error fed back through this matrix is smaller than *ϵ*, where *ϵ* is an arbitrarily small positive number. If *ϵ* is small enough it is possible to write:
(13)
In the limit of a full-rank matrix (example: the LTTS limit where is diagonal) the only solution to Eq (13) is and the learning halts only when the target is cloned. When the rank is lower, the solution to Eq (13) is not unique, and the dimensionality of possible solutions is defined by the kernel of the matrix (the collection of vectors λ such that ). We have: . We run the store and recall experiment in order to confirm our theoretical predictions.
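The relation between the rank of the feedback and the dimensionality of the kernel follows from the rank-nullity theorem and can be verified numerically. The sketch below assumes that the *N* × *N* feedback matrix is built as the product of the transposed readout with the readout, and uses a readout made of the first *R* rows of the identity; both choices, and the function name `kernel_dimension`, are assumptions made for illustration.

```python
import numpy as np

def kernel_dimension(B):
    """Dimension of the kernel of D = B^T B for an (R, N) readout B."""
    N = B.shape[1]
    D = B.T @ B                          # (N, N) feedback matrix (assumed form)
    return N - np.linalg.matrix_rank(D)  # rank-nullity theorem

N = 8
full_rank = kernel_dimension(np.eye(N))     # R = N: only the trivial solution
low_rank = kernel_dimension(np.eye(N)[:5])  # R = 5: a 3-dimensional solution space

# Any vector supported on the unconstrained neurons is mapped to zero.
lam = np.zeros(N)
lam[-1] = 1.0
B = np.eye(N)[:5]
residual = (B.T @ B) @ lam
```

In the full-rank case the kernel is trivial, so the learning can halt only at the target; lowering the rank by one adds one free direction along which the internal solution can differ from the target.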

We repeated the experiment for different values of the rank *R*. The matrix is set to *δ*_{ik}, with *i* = 1, …*N*, where *δ*_{ik} is the Kronecker delta (the analysis for the case of a random matrix provides analogous results and is reported in Fig C in S1 Text). When the rank is *N*, different replicas of the learning (different initializations of the recurrent weights) converge almost to the same internal dynamics *s*^{⋆}. This is reported in Fig 3A (left), where a single trajectory represents the first 2 principal components (PC) of the vector . The convergence to the point (0, 0) represents the convergence of the dynamics to *s*^{⋆}. When the rank is lower (see Fig 3A, right), different realizations of the learning converge to different points, distributed on a line in the PC space. This can be generalized by investigating the dimension of the convergence space as a function of the rank. The dimension of the vector evaluated in the trained network is estimated from the principal component variances λ_{k}, normalized to one (∑_{k} λ_{k} = 1). We found a monotonic relation between the dimension of the convergence space and the rank (see Fig 3B; more information on the PC analysis and the estimation of the dimensionality is given in Section B in S1 Text). This observation confirms that when the rank is very high, the solution is strongly constrained, while when the rank becomes lower, the internal solution is free to move in a subspace of possible solutions. We suggest that this measure can be used in biological data to estimate the dimensionality of the learning constraints in biological neural networks from the dimensionality of the solution space.
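One common way to turn normalized principal component variances into an effective dimension is the participation ratio, the inverse of the sum of the squared normalized variances. The sketch below uses this estimator as an assumption; the function name and the toy data are hypothetical and serve only to illustrate the computation.

```python
import numpy as np

def participation_ratio(X):
    """Effective dimension of the rows of X (samples x features).

    Computes PC variances via SVD, normalizes them so that they sum to
    one, and returns 1 / sum(lambda_k ** 2) (assumed estimator).
    """
    Xc = X - X.mean(axis=0)                           # centre the samples
    var = np.linalg.svd(Xc, compute_uv=False) ** 2    # PC variances up to a constant
    lam = var / var.sum()                             # normalized: sum(lam) == 1
    return 1.0 / np.sum(lam ** 2)

# Points distributed on a line embedded in 3D: effective dimension close to 1.
line = np.outer(np.linspace(-1.0, 1.0, 20), [1.0, 2.0, 0.5])
# Points spread equally along two orthogonal directions: effective dimension 2.
cross = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]])

dim_line = participation_ratio(line)
dim_cross = participation_ratio(cross)
```

Applied to the end points of several training runs, an estimator of this kind grows as the solutions spread over more directions, which is the quantity compared against the rank in the text.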

(**A**) Dynamics along training epochs of the in the first two principal components for different repetition of the training with variable initial conditions. The error propagation matrix has maximum rank (, target-based limit). (**B**) Same as in (**A**), but with an error propagation matrix with rank . (**C**) Dimensionality of the solution space as a function of the rank of the error propagation matrix. (**D**) Dynamics along training epochs of the in the first two principal components, when a white noise is included in the synaptic dynamics. (**E**) Estimation of the dimensionality of the solution space, sampled thanks to fluctuations induced on the synaptic weights, as a function of the rank of the error propagation matrix.

#### Dimensionality estimation on single trial.

The dimensionality estimation described above requires the knowledge of the *s*^{⋆} and the repetition of several realizations of the training procedure. However, this information is not available in an experimental setup. For this reason, we propose an alternative approach which could be directly applied to experimental data.

We perform only one realization of the training, but we assume the presence of noise on the synaptic dynamics as follows, by adding white noise to Eq (10) and following the equation:
(14)
where *ϵ* = 0.1 and the added noise term is a normal variable. In Fig 3C we report the dynamics along training epochs of the in the first two principal components. Since *s*^{⋆} is not experimentally accessible, we replaced it with the internal dynamics of the network at the end of the training. We observe that a first phase is dominated by the learning dynamics, while the second phase is dominated by the synaptic noise. This second phase allows exploring the space of possible internal solutions, even without running the training experiment several times. By estimating the dimensionality of this sampled space (in the same way as described in the previous section) we observe a monotonic dependence between the rank of the matrix and the dimensionality of such a space (see Fig 3D). This methodology could be directly applied to data, providing an estimate of the dimensionality of the space of possible internal solutions to the same problem. We suggest that this could be directly related to the structure of the feedback during training, as demonstrated in our model.

### Application to closed-loop tasks: Behavioral cloning

We consider the general problem of an agent interacting with an environment in order to solve a specific task. This is in general formulated in terms of an association, at each time *t*, between a state vector and an action vector. The agent evaluates its current state and decides an action through a policy. Two possible and opposite strategies for learning an optimal policy are Reinforcement Learning and Imitation Learning. In the former, the agent starts by trial and error and the most successful behaviors are potentiated. In the latter, the optimal policy is learned by observing an expert which already knows a solution to the problem. Behavioral Cloning belongs to the category of Imitation Learning and its scope is to learn to reproduce a set of expert behaviors (actions) indexed by *k* = 1, …*O* (where *O* is the output dimension) given a set of states indexed by *h* = 1, …*I* (where *I* is the input dimension). Our approach is to explore the implementation of Behavioral Cloning in recurrent spiking networks.

In what follows, we assume that the action of the agent at time *t*, is evaluated by a recurrent spiking network and can be decoded through a linear readout , where . is a temporal filtering of the spikes (similarly to in Eq (1), with a time scale *τ*_{⋆}). The network is trained to reproduce the target behavior of the expert .
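The decoding step described above can be sketched as follows. This is a minimal illustration, assuming that the temporal filtering is a normalized exponential low-pass filter with time constant `tau` and a time step `dt` of 1 ms; the exact filter of Eq (1) is not reproduced here, and the names `filter_spikes`, `linear_readout` and `weights` are hypothetical.

```python
import numpy as np

def filter_spikes(spikes, tau, dt=1.0):
    """Exponentially filter a spike raster of shape (T, N).

    ASSUMPTION: s_bar[t] = beta * s_bar[t-1] + (1 - beta) * s[t], beta = exp(-dt / tau).
    """
    spikes = np.asarray(spikes, dtype=float)
    beta = np.exp(-dt / tau)
    filtered = np.zeros_like(spikes)
    state = np.zeros(spikes.shape[1])
    for t in range(spikes.shape[0]):
        state = beta * state + (1.0 - beta) * spikes[t]
        filtered[t] = state
    return filtered

def linear_readout(filtered, weights):
    """Decode actions of shape (T, O) from filtered spikes (T, N) with weights (O, N)."""
    return filtered @ weights.T

rng = np.random.default_rng(0)
T, N, O = 80, 300, 2                                 # time steps, neurons, outputs
spikes = (rng.random((T, N)) < 0.05).astype(float)   # random raster, for illustration only
weights = rng.standard_normal((O, N)) / np.sqrt(N)   # untrained readout weights
actions = linear_readout(filter_spikes(spikes, tau=10.0), weights)
print(actions.shape)
```

In the trained model the readout weights would be optimized so that `actions` matches the expert behavior; here they are random and serve only to show the shapes involved.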

#### Button-and-food task.

To investigate the effects of the rank of the feedback matrix, we design a button-and-food task (see Fig 4A for a graphical representation), which requires a precise trajectory and the ability to retain the memory of past states. In this task, the agent starts at the center of the scene, which also features a button and an initially locked target (the food). The agent's task is to first push the button so as to unlock the food and then reach for it. We stress that to change its spatial target from the button to the food, the agent has to remember that it already pressed the button (the button state is not provided as an input to the network during the task). In our experiment we kept the position of the button (expressed in polar coordinates) fixed at *r*_{btn} = 0.2, *θ*_{btn} = −90° for all conditions, while the food position had *r*_{food} = 0.7 and variable *θ*_{food} ∈ [30°, 150°]. The agent learns via observations of a collection of expert behaviors, which we indicate via the food positions . The expert behavior is a trajectory which reaches the button and then the food in straight lines (*T* = 80). The network receives as input (*I* = 80 input units) the vertical and horizontal differences of both the button’s and the food’s positions with respect to the agent location ( respectively). These quantities are encoded through a set of tuning curves. Each of the Δ_{i} values is encoded by 20 input units with different Gaussian activation functions. The agent output is the velocity vector *v*_{x,y} (*O* = 2 output units). We used *η* = *η*_{RO} = 0.01 (with the Adam optimizer) and *τ*_{RO} = 10ms. Agent performance is measured by defining a reward function *r* that accounts for the importance of pushing the button before taking the food:
where is the button-state indicator variable that is zero when the button is locked and one otherwise, the are the agent and target position vectors and *d*(⋅, ⋅) is the standard Euclidean distance. We repeated training for different values of the rank of the feedback matrix , computed from (with *δ*_{ik} the Kronecker delta, the analysis for the case random provides analogous results and is reported in Section C.2 in S1 Text), in a network of *N* = 300 neurons, and compared the overall performances (more information in Section C.2 in S1 Text). Fig 4B and 4C report the rastergram for 100 random neurons and the dynamics of the membrane potential for 3 random neurons during a task episode. In Fig 4D we report an example of the action trajectories (*v*_{x}, *v*_{y}, red and green respectively), showing the target ones and the ones reproduced by the network (dashed and solid lines respectively). In Fig 4E we report the agent training trajectories, color-coded for the final reward and the rewards obtained by the network after the behavioral cloning. Indeed, all the training conditions () show good convergence. In Fig 4F the final reward is reported as a function of the target angle *θ*_{food} for different ranks (ranks are color-coded using the same scheme as Fig 4G and purple arrows indicate the training conditions). As expected, the reward is maximum at the training conditions. Moreover, it can be readily seen that high-rank feedback structures allow for superior performance in this task. Finally, in Fig 4G the average reward across all target conditions is reported as a function of the rank , further highlighting the benefits of a high-rank feedback structure for this task.
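The task geometry and input encoding described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the expert is taken to move at constant speed along the two straight segments, and the Gaussian tuning curves are taken to have evenly spaced centers in [−1, 1] with a width equal to the center spacing. Neither choice is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def polar_to_xy(r, theta_deg):
    """Convert polar coordinates (radius, angle in degrees) to a 2D position."""
    theta = np.deg2rad(theta_deg)
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def expert_trajectory(theta_food_deg, T=80, r_btn=0.2, theta_btn_deg=-90.0, r_food=0.7):
    """Straight-line path: center -> button -> food, in T velocity steps.

    ASSUMPTION: constant speed, so steps are split in proportion to segment length.
    """
    start = np.zeros(2)
    button = polar_to_xy(r_btn, theta_btn_deg)
    food = polar_to_xy(r_food, theta_food_deg)
    d1 = np.linalg.norm(button - start)
    d2 = np.linalg.norm(food - button)
    t1 = max(1, int(round(T * d1 / (d1 + d2))))
    leg1 = np.linspace(start, button, t1 + 1)
    leg2 = np.linspace(button, food, T - t1 + 1)[1:]   # drop the duplicated button point
    positions = np.vstack([leg1, leg2])                # shape (T + 1, 2)
    velocities = np.diff(positions, axis=0)            # shape (T, 2), the target actions
    return positions, velocities, button, food

def encode_gaussian(deltas, n_units=20, lo=-1.0, hi=1.0):
    """Encode each value with n_units Gaussian tuning curves (assumed centers and width)."""
    deltas = np.atleast_1d(np.asarray(deltas, dtype=float))
    centers = np.linspace(lo, hi, n_units)
    width = centers[1] - centers[0]
    activity = np.exp(-0.5 * ((deltas[:, None] - centers[None, :]) / width) ** 2)
    return activity.ravel()

positions, velocities, button, food = expert_trajectory(theta_food_deg=90.0)
agent = positions[0]
# Four input quantities: (dx, dy) to the button and (dx, dy) to the food.
deltas = np.concatenate([button - agent, food - agent])
inputs = encode_gaussian(deltas)
print(velocities.shape, inputs.shape)
```

With four encoded quantities and 20 units each, the input vector has the *I* = 80 components stated above, and the velocity sequence has the *O* = 2 components over *T* = 80 steps.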

(**A**) Sketch of the task. An agent starts at the center of the environment domain (left) and is asked to reach a target. The target is initially “locked”. The agent must unlock the target by pushing a button (middle) placed behind it and then reach for the target (right). (**B**) Rasterplot of the activity of a random sample of 100 neurons across the 80 time units of a task episode. (**C**) Temporal dynamics of the membrane potential of three example units. (**D**) Target *v*_{x,y} (dashed lines), the velocity direction in the bi-dimensional plane, and the one reproduced by the network after the behavioral cloning (solid lines). (**E**) Example trajectories produced by a trained agent for different target locations. Purple arrows depict the positions of the food for the observed expert behaviors. (**F**) Final reward obtained by a trained agent as a function of the target position (measured by the angle *θ* with a fixed radius of *r* = 0.7 as measured from the agent starting position). Individual lines are average values over 10 repetitions. Colors code for different ranks of the error propagation matrix. (**G**) Average over all the target positions of the final reward as a function of the rank. Error bars represent the standard deviation of the mean.

#### 2D Bipedal Walker.

We benchmarked our behavioral cloning learning protocol on the standard 2D Bipedal Walker task provided through the OpenAI Gym (https://gym.openai.com [38], MIT License). The environment and the task are sketched in Fig 5A: a bipedal agent has to learn to walk and to travel as long a distance as possible. The expert behavior is obtained by training a standard feed-forward network with PPO (proximal policy optimization [39], in particular we used the code provided in [40], MIT License). The sequence of states-actions is collected in the vectors , *k* = 1, …*O*, , *h* = 1, …*I*, *t* = 1, …*T*, with *T* = 400, *O* = 4, *I* = 14 (we excluded the LIDAR inputs, see Fig 5C for an example of the states-actions trajectories). The average reward obtained by the expert is 〈*r*〉_{exp} ≃ 180 while a random agent achieves 〈*r*〉_{rnd} ≃ −120. We performed behavioral cloning by using the learning rule in Eq (10) in a network of *N* = 500 neurons. We chose the maximum rank () and evaluated the performances for different values of *τ*_{⋆} (more information in Section D in S1 Text). Fig 5B and 5C (bottom) report the rastergram for 100 random neurons and the dynamics of the membrane potential for 3 random neurons during a task episode. For each value of *τ*_{⋆} we performed 10 independent realizations of the experiment. For each realization, the is computed, and the recurrent weights are trained by using Eq (10). The optimization is performed using gradient ascent and a learning rate *η* = 1.0. In Fig 5D we report the spike error at the end of the training. The internal dynamics almost perfectly reproduces the target pattern of spikes for *τ*_{⋆} < 0.5ms, while the error increases for larger values. The readout time-scale is fixed to *τ*_{RO} = 5ms while the readout weights are initialized to zero and the learning rate is set to *η*_{RO} = 0.01. Every 75 training iterations of the readout we test the network and evaluate the average reward 〈*r*〉 over 50 repetitions of the task.
We then evaluate the maximum 〈*r*〉 obtained for each episode (and average it over the 10 realizations). Fig 5E reports the average of the maximum reward as a function of *τ*_{⋆}. The monotonically decreasing trend suggests that learning with a specific pattern of spikes (*τ*_{⋆} → 0) enables optimal performance in this walking task. We stress that in this experiment we used a clamped version of the learning rule. In other words, we substituted to in the evaluation of in Eq (7). This choice, which is only possible when the maximum rank is considered (), allows for faster convergence and better performance. The results for the non-clamped version of the learning rule are reported in Section D.2 in S1 Text.
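The aggregation of test rewards described above can be sketched as follows. This is a minimal illustration, assuming the rewards are stored in an array indexed by realization, test checkpoint and task repetition; the array layout and the function name `summarize_rewards` are hypothetical, and the numbers are synthetic.

```python
import numpy as np

def summarize_rewards(rewards):
    """Aggregate rewards of shape (n_realizations, n_checkpoints, n_repetitions).

    1. average over repetitions at each test checkpoint (the average reward <r>),
    2. take the maximum over checkpoints within each realization,
    3. report mean and standard error across realizations.
    """
    rewards = np.asarray(rewards, dtype=float)
    mean_per_checkpoint = rewards.mean(axis=2)
    best_per_realization = mean_per_checkpoint.max(axis=1)
    mean = best_per_realization.mean()
    sem = best_per_realization.std(ddof=1) / np.sqrt(best_per_realization.size)
    return mean, sem

# Synthetic example: 2 realizations, 3 checkpoints, 4 repetitions.
toy_rewards = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
toy_mean, toy_sem = summarize_rewards(toy_rewards)
print(toy_mean, toy_sem)
```

In the setting of this section the array would have 10 realizations and 50 repetitions per checkpoint, with one checkpoint every 75 readout training iterations.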

(**A**) Representation of the 2D Bipedal Walker environment. The task is to successfully control the bipedal locomotion of the agent; the reward is measured as the distance travelled along the horizontal direction. The agent receives a state vector containing several measurements such as joint positions, velocities and LIDAR for environment sensing, and outputs the torque vector for the four leg joints. (**B**) Rasterplot of the activity of a random sample of 100 neurons across *T* = 100 time units of a task episode. (**C**) Temporal dynamics of a subset of the core input state variables, action vector and spike dynamics. Top panels report respectively: *v*_{x,y}, the velocity vector in the bi-dimensional plane, the angles of the two leg joints with colors matching the scheme of panel **A**, and the action vector *a*^{t} containing the torque *τ*_{hip,knee} for the two joints of the left leg. (Bottom) Temporal dynamics of the membrane potential for three randomly sampled neurons. (**D**) Average spike error Δ*S* as a function of *τ*_{⋆}. Error bars represent the standard deviation of the mean. (**E**) Average final episode reward as a function of *τ*_{⋆}. Error bars represent the standard error.

## Discussion

Despite experimental, theoretical, and computational progress, neuroscience is still a relatively young field of study. A sign of this is the fragmented panorama of different theories and models proposed in the literature. In recent years, theoretical neuroscientists have formulated new frameworks attempting to provide more general explanations of aspects concerning intelligence and learning [41, 42]. In this work we contribute to this generalization effort by providing a general framework that is capable of accounting for different learning approaches by modulating two key parameters, the rank of the feedback error propagation and the tolerance to precise spike timing *τ*_{⋆} (see Fig 1C).

We argue that many proposed learning rules can be seen as specific cases of our general framework (e-prop, LTTS, full-FORCE). In particular, the generalization on the rank of the feedback matrix allowed us to understand the target-based approaches as emerging from error-based ones when the number of independent constraints is high. Moreover, we understood that different values lead to different dimensionality of the solution space. If we see learning as a trajectory in the space of internal dynamics, when the rank is maximum, every training run converges to the same point in this space. On the other hand, when the is lower, the solution is not unique, and the possible solutions are distributed in a subspace whose dimensionality is inversely proportional to the rank of the feedback matrix. We suggest that this finding can be used to produce experimental observables to deduce the actual structure of error propagation in the different regions of the brain. On a technological level, our approach offers a strategy to clone on a (spiking) chip an expert behavior either previously learned via standard reinforcement learning algorithms or acquired from a human agent. Our formalism can be directly applied to train an agent to solve closed-loop tasks through a behavioral cloning approach. This allowed solving tasks that are relevant in the reinforcement learning framework by using a recurrent spiking network, a problem that has been tackled successfully only by a very small number of studies [3]. Moreover, our general framework, encompassing different learning formulations, allowed us to investigate what learning method is optimal to solve a specific task. We demonstrated that a high number of constraints can be exploited to obtain better performance in a task in which it was required to retain a memory of the internal state for a long time (the state of the button in the button-and-food task).
On the other hand, we found that a typical motor task (the 2D Bipedal Walker) strongly benefits from precise timing coding, which is probably due to the need to master fine movement control to achieve optimal performance. In this case, a high rank in the error propagation matrix is not relevant. From the biological point of view, we conjecture that different brain areas might be located in different positions in the plane presented in Fig 1C.

### Limitations of the study

We chose relevant but very simple tasks in order to test the performance of our model and understand its properties. However, it is very important to demonstrate whether this approach can be successfully applied to more complex tasks, e.g. tasks requiring both long-term memory and fine motor skills. It would be of interest to measure the optimal values of both the rank of the feedback matrix and *τ*_{⋆} in a more demanding task. Finally, we suggested that our framework allows for inferring the error propagation structure. However, we observe that the measure we proposed is indirect, since it is necessary to estimate the dimensionality of the solution space first and then deduce the dimensionality of the learning constraints. A future development of the theory might be to formulate a method that directly infers from the data the laws of the dynamics in the solution space induced by learning.

## References

- 1. Sacramento J, Ponte Costa R, Bengio Y, Senn W. Dendritic cortical microcircuits approximate the backpropagation algorithm. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in Neural Information Processing Systems 31. Curran Associates, Inc.; 2018. p. 8721–8732.
- 2. Nicola W, Clopath C. Supervised learning in spiking neural networks with FORCE training. Nature communications. 2017;8(1):2208. pmid:29263361
- 3. Bellec G, Scherr F, Subramoney A, Hajek E, Salaj D, Legenstein R, et al. A solution to the learning dilemma for recurrent networks of spiking neurons. Nature communications. 2020;11(1):1–15.
- 4. Bellec G, Salaj D, Subramoney A, Legenstein R, Maass W. Long short-term memory and learning-to-learn in networks of spiking neurons. arXiv preprint arXiv:1803.09574. 2018.
- 5. Kreutzer E, Petrovici MA, Senn W. Natural gradient learning for spiking neurons. In: Proceedings of the Neuro-inspired Computational Elements Workshop; 2020. p. 1–3.
- 6. Meulemans A, Carzaniga FS, Suykens JA, Sacramento J, Grewe BF. A theoretical framework for target propagation. arXiv preprint arXiv:2006.14331. 2020.
- 7. Lee DH, Zhang S, Fischer A, Bengio Y. Difference target propagation. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer; 2015. p. 498–515.
- 8. DePasquale B, Cueva CJ, Rajan K, Abbott L, et al. full-FORCE: A target-based method for training recurrent networks. PloS one. 2018;13(2):e0191527. pmid:29415041
- 9. Muratore P, Capone C, Paolucci PS. Target spike patterns enable efficient and biologically plausible learning for complex temporal tasks. PloS one. 2021;16(2):e0247014. pmid:33592040
- 10. Capone C, Pastorelli E, Golosio B, Paolucci PS. Sleep-like slow oscillations improve visual classification through synaptic homeostasis and memory association in a thalamo-cortical model. Scientific Reports. 2019;9(1):8990. pmid:31222151
- 11. Golosio B, De Luca C, Capone C, Pastorelli E, Stegel G, Tiddia G, et al. Thalamo-cortical spiking model of incremental learning combining perception, context and NREM-sleep. PLoS Computational Biology. 2021;17(6):e1009045. pmid:34181642
- 12. Capone C, Lupo C, Muratore P, Paolucci PS. Burst-dependent plasticity and dendritic amplification support target-based learning and hierarchical imitation learning. arXiv preprint arXiv:2201.11717. 2022.
- 13. Urbanczik R, Senn W. Learning by the dendritic prediction of somatic spiking. Neuron. 2014;81(3):521–528. pmid:24507189
- 14. Carr C, Konishi M. A circuit for detection of interaural time differences in the brain stem of the barn owl. Journal of Neuroscience. 1990;10(10):3227–3246. pmid:2213141
- 15. Johansson RS, Birznieks I. First spikes in ensembles of human tactile afferents code complex spatial fingertip events. Nature neuroscience. 2004;7(2):170. pmid:14730306
- 16. Panzeri S, Petersen RS, Schultz SR, Lebedev M, Diamond ME. The role of spike timing in the coding of stimulus location in rat somatosensory cortex. Neuron. 2001;29(3):769–777. pmid:11301035
- 17. Gollisch T, Meister M. Rapid neural coding in the retina with relative spike latencies. science. 2008;319(5866):1108–1111. pmid:18292344
- 18. Memmesheimer RM, Rubin R, Ölveczky BP, Sompolinsky H. Learning precisely timed spikes. Neuron. 2014;82(4):925–938. pmid:24768299
- 19. Diehl P, Cook M. Unsupervised learning of digit recognition using spike-timing-dependent plasticity. Frontiers in Computational Neuroscience. 2015;9:99. pmid:26941637
- 20. Lillicrap TP, Cownden D, Tweed DB, Akerman CJ. Random synaptic feedback weights support error backpropagation for deep learning. Nature communications. 2016;7(1):1–10. pmid:27824044
- 21. Zenke F, Ganguli S. Superspike: Supervised learning in multilayer spiking neural networks. Neural computation. 2018;30(6):1514–1541. pmid:29652587
- 22. Mozafari M, Ganjtabesh M, Nowzari-Dalini A, Thorpe SJ, Masquelier T. Bio-inspired digit recognition using reward-modulated spike-timing-dependent plasticity in deep convolutional networks. Pattern Recognition. 2019;94:87–95.
- 23. Ingrosso A, Abbott L. Training dynamically balanced excitatory-inhibitory networks. PloS one. 2019;14(8):e0220547. pmid:31393909
- 24. Capone C, Paolucci PS. Towards biologically plausible Dreaming and Planning. arXiv preprint arXiv:2205.10044. 2022.
- 25. Traub M, Legenstein R, Otte S. Many-Joint Robot Arm Control with Recurrent Spiking Neural Networks. arXiv preprint arXiv:2104.04064. 2021.
- 26. Roelfsema PR, Ooyen Av. Attention-gated reinforcement learning of internal representations for classification. Neural computation. 2005;17(10):2176–2214. pmid:16105222
- 27. Manchev N, Spratling MW. Target Propagation in Recurrent Neural Networks. Journal of Machine Learning Research. 2020;21(7):1–33.
- 28. Knudsen EI. Supervised learning in the brain. Journal of Neuroscience. 1994;14(7):3985–3997. pmid:8027757
- 29. Miall RC, Wolpert DM. Forward models for physiological motor control. Neural networks. 1996;9(8):1265–1279. pmid:12662535
- 30. Spratling M. Cortical region interactions and the functional role of apical dendrites. Behavioral and cognitive neuroscience reviews. 2002;1(3):219–228. pmid:17715594
- 31. Larkum ME. A cellular mechanism for cortical associations: an organizing principle for the cerebral cortex. Trends in Neurosciences. 2013;36:141–151. pmid:23273272
- 32. Brendel W, Bourdoukan R, Vertechi P, Machens CK, Denéve S. Learning to represent signals spike by spike. PLoS computational biology. 2020;16(3):e1007692. pmid:32176682
- 33. Pfister JP, Toyoizumi T, Barber D, Gerstner W. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural computation. 2006;18(6):1318–1348. pmid:16764506
- 34. Jimenez Rezende D, Gerstner W. Stochastic variational learning in recurrent spiking networks. Frontiers in Computational Neuroscience. 2014;8:38. pmid:24772078
- 35. Gardner B, Grüning A. Supervised learning in spiking neural networks for precise temporal encoding. PloS one. 2016;11(8):e0161335. pmid:27532262
- 36. Brea J, Senn W, Pfister JP. Matching recall and storage in sequence learning with spiking neural networks. Journal of neuroscience. 2013;33(23):9565–9575. pmid:23739954
- 37. Capone C, Gigante G, Del Giudice P. Spontaneous activity emerging from an inferred network model captures complex spatio-temporal dynamics of spike data. Scientific reports. 2018;8(1):1–12. pmid:30451957
- 38. Brockman G, Cheung V, Pettersson L, Schneider J, Schulman J, Tang J, et al. OpenAI Gym; 2016.
- 39. Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347. 2017.
- 40. Barhate N. Minimal PyTorch Implementation of Proximal Policy Optimization; 2021. https://github.com/nikhilbarhate99/PPO-PyTorch.
- 41. Modirshanechi A, Brea J, Gerstner W. Surprise: a unified theory and experimental predictions. bioRxiv. 2021.
- 42. Haider P, Ellenberger B, Kriener L, Jordan J, Senn W, Petrovici M. Latent Equilibrium: Arbitrarily fast computation with arbitrarily slow neurons. Advances in Neural Information Processing Systems. 2021;34.