## Abstract

Animals learn to make predictions, such as associating the sound of a bell with upcoming feeding or predicting a movement that a motor command is eliciting. How predictions are realized on the neuronal level and what plasticity rule underlies their learning is not well understood. Here we propose a biologically plausible synaptic plasticity rule to learn predictions on a single neuron level on a timescale of seconds. The learning rule allows a spiking two-compartment neuron to match its current firing rate to its own expected future discounted firing rate. For instance, if an originally neutral event is repeatedly followed by an event that elevates the firing rate of a neuron, the originally neutral event will eventually also elevate the neuron’s firing rate. The plasticity rule is a form of spike timing dependent plasticity in which a presynaptic spike followed by a postsynaptic spike leads to potentiation. Even if the plasticity window has a width of 20 milliseconds, associations on the time scale of seconds can be learned. We illustrate prospective coding with three examples: learning to predict a time varying input, learning to predict the next stimulus in a delayed paired-associate task and learning with a recurrent network to reproduce a temporally compressed version of a sequence. We discuss the potential role of the learning mechanism in classical trace conditioning. In the special case that the signal to be predicted encodes reward, the neuron learns to predict the discounted future reward and learning is closely related to the temporal difference learning algorithm TD(*λ*).

## Author Summary

Sensory inputs are often predictable. Lightning is followed by thunder, a falling object causes noise when hitting the ground, our skin gets wet when we jump into the water. Humans learn regularities like these without effort. Learned predictions allow us to cover our ears in anticipation of thunder or to close our eyes just before an object hits the ground and breaks into pieces. What changes in the brain when new predictions are learned? In this article, we present a mathematical model and computer simulations of the idea that the activity of a single neuron represents expected future events. Such a prospective coding can be learned in a neuron that receives input from the memory trace of a first event (e.g. lightning) and also input from the second event (e.g. thunder). Synaptic input connections from the memory trace are potentiated such that the spiking activity ramps up towards the onset of the second event. This deviates from the classical Hebbian learning that merely associates two events that are coincident in time. Learning in our model associates a current event with future events.

**Citation:** Brea J, Gaál AT, Urbanczik R, Senn W (2016) Prospective Coding by Spiking Neurons. PLoS Comput Biol 12(6): e1005003. https://doi.org/10.1371/journal.pcbi.1005003

**Editor:** Peter E. Latham, UCL, UNITED KINGDOM

**Received:** June 25, 2015; **Accepted:** June 1, 2016; **Published:** June 24, 2016

**Copyright:** © 2016 Brea et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper. Additionally the source code of the simulations is publicly available at https://github.com/jbrea/prospectiveCoding.

**Funding:** The work has been supported by the Swiss National Science Foundation www.snf.ch (personal grant no. 310030L-156863 of WS). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Animals can learn to predict upcoming stimuli. In delayed paired-associate tasks, animals learn to respond to pairs of stimuli (e.g. images A1-B1 and A2-B2) separated by a delay. These tasks can be solved by either keeping a memory of the first stimulus (A1 or A2) during the delay period (retrospective coding) or anticipating the second stimulus (B1 or B2) during the delay period (prospective coding). Monkeys seem to use both coding schemes [1]. Recordings in the prefrontal cortex of monkeys performing a delayed paired-associate task revealed single neurons with decreasing firing rate in response to a specific first stimulus (A1 or A2) and other neurons with ramping activity in trials where a specific second stimulus (B1 or B2) is anticipated [1, 2]. Thus, the firing rate of a neuron may encode not only past and current events, but also prospective events.

Learning to anticipate a future stimulus can also be observed in classical trace conditioning, where a conditioned stimulus (CS, e.g. sound of a bell) is followed after a delay by an unconditioned stimulus (US, e.g. a sausage) that causes a response R (e.g. salivation) [3, 4]. After several repetitions of this protocol, the CS can elicit the response R even before the onset of the US.

A common experimental finding in these examples is the slowly ramping neuronal activity prior to the predicted event. In an experiment where mice choose to lick left or right in response to a tactile cue, the neural activity in the anterior lateral motor cortex ramps up in the waiting period before the response [5]. This activity pattern implements prospective coding as it indicates whether the animal will lick left or right. Serotonergic neurons in the dorsal raphe nucleus of mice show an activity ramp in a delay period between a predictive odor cue and the availability of a sucrose reward [6]. In rats that navigate a maze towards the learned position of a chocolate milk reward, the activity of striatal neurons increases while the rat approaches the reward position [7, 8]. In visual delayed paired associate tasks in which monkeys are trained to select a specific choice object that is associated with a previously shown cue object, increasing activity in the delay period was measured for neurons in the prefrontal cortex [1, 9, 10] and in the inferior temporal cortex [2, 11].

It is unclear how prospective coding emerges. The cue and the associated predictable event are typically separated by an interval of some seconds. On the other hand, synaptic plasticity, which is presumably involved in learning new associations, typically requires presynaptic and postsynaptic activity to coincide in a much shorter interval. Some tens of milliseconds is, for example, the size of the ‘plasticity window’ in spike-timing dependent plasticity; no synaptic change occurs if the presynaptic and postsynaptic spikes are separated by more than the size of this plasticity window [12, 13]. This mismatch between the behavioral and the neuronal timescales raises the question of how a neuronal system can learn to make predictions more than a second ahead. There are also plasticity mechanisms that can correlate pre- and postsynaptic spiking events that are separated by seconds [14, 15]. Yet, assuming many simultaneously active afferents, it remains unclear how the behaviourally relevant pair of pre- and postsynaptic spikes can be selected out of hundreds of behaviourally irrelevant pairs.

In normative models of synaptic plasticity, the shape of the causal part of the plasticity window matches the shape of the postsynaptic potential (PSP), if the objective is to reproduce precise spike timings [16–18]. However, if the objective is to reproduce future activity, this specific learning rule is insufficient. Yet, as we demonstrate in this article, the same plasticity rule with only a slightly wider window also allows for learning a prospective code. With this mechanism, it is possible to learn an activity ramp towards a specific event in time, or to learn to predict a time-varying signal or a sequence of activities well ahead in time. In a 2-compartment neuron model, this mechanism leads to the dendritic prediction of *future* somatic spiking. The mechanism stands in contrast to the work of Urbanczik & Senn, where the current somatic spiking is predicted [18]. Despite this fundamental difference, the plasticity rules only differ in the width of the potentiation part of the plasticity window.

## Results

### Schematic description of the learning mechanism

Before defining the learning rule in detail, we provide an intuitive description. In a neuron with both static synapses (green connection in Fig 1A and 1B) and plastic synapses (blue in Fig 1A and 1B), we propose a learning mechanism for the plastic synapses that relies on two basic ingredients: spike-timing dependent synaptic potentiation and balancing synaptic depression. The synaptic connections are strengthened if a presynaptic spike is followed by a postsynaptic spike within a ‘plasticity window of potentiation’ (red in Fig 1A and 1B). The size of this plasticity window turns out to have a strong influence on the timing of spikes that are caused by strengthened dendritic synapses. If the plasticity window has the same shape as a postsynaptic potential (PSP), learned spikes are fired at roughly the same time as target spikes [16–18]. But if the plasticity window is slightly longer than the postsynaptic potential, learned spikes tend to be fired earlier than target spikes. More precisely, because of the slightly wider plasticity window of potentiation, presynaptic spikes may elicit postsynaptic spikes through newly strengthened connections (thick blue arrow in Fig 1B) even before the onset of the input through static synapses. These earlier postsynaptic spikes in turn allow the inputs from presynaptic neurons that spike even earlier to be strengthened. We refer to this as the bootstrapping effect of predicting the own predictions. As a result, a postsynaptic activity induced by the input through static synapses will be preceded by an activity ramp produced by appropriately tuned dendritic input. The neuron learns a prospective code that predicts an upcoming event.

**Fig 1.** **A** The signal to be predicted (target input) originates from the green neuron and depolarizes the black neuron (gray trace) such that it spikes (black lines). The synaptic connection between a blue neuron and the black neuron is strengthened if pre- and postsynaptic spikes lie within the red plasticity window of potentiation, which is slightly broader than a typical postsynaptic potential. **B** Due to the strengthened connection (red circle), the black neuron spikes even before the target input arrives. Since earlier presynaptic spikes now also lie within the potentiating plasticity window, the activity of the black neuron will be anticipated earlier, giving rise to prospective coding. **C** A spiking neuron receives input through plastic dendritic synapses with strengths *w*_{i} and an input *I* through static (i.e. non-plastic) synapses. The somatic membrane potential *U* is well approximated by the sum of attenuated dendritic input and attenuated somatic input *U**.

### The 2-compartment neuron model

We consider a 2-compartment neuron model that captures important functional details of spiking neurons and is well suited for mathematical analysis [18]. In this model (Fig 1C), a dendritic compartment receives input through plastic synapses with strength *w*. The voltage *U* of the somatic compartment is coupled to the dendritic voltage *V*_{w} and receives additional input *I* through static synapses,
$$C\,\frac{dU}{dt} = -g_L\,U + g_D\,(V_w - U) + I \tag{1}$$
where *g*_{L} is the leak conductance, *g*_{D} is the coupling conductance between soma and dendrite and *C* is the somatic capacitance. The dendritic potential *V*_{w} is given by a weighted sum of presynaptic inputs, i.e.
$$V_w(t) = \sum_i w_i\,\mathrm{PSP}_i(t) \tag{2}$$
with plastic synaptic weights *w*_{i} and postsynaptic potentials PSP_{i} that model the depolarization of the postsynaptic membrane potential due to the arrival of presynaptic spikes at synapse *i*. Each PSP_{i} is determined by the set of spike arrival times at synapse *i* and the spike response kernel *κ*. Spiking of the postsynaptic neuron is modeled as an inhomogeneous Poisson process with rate *φ*(*U*).
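The dendritic potential and the Poisson spike generation can be sketched in a few lines of Python. This is only an illustration: the double-exponential kernel `kappa` and its time constants, the rectified-linear rate function, the form PSP_i(t) = Σ_s κ(t − s), and all numerical values are assumptions made for this example and are not the parameters of the paper. For brevity the rate is computed from *V*_{w} directly, as if there were no somatic input.

```python
import numpy as np

def kappa(t, tau_rise=2.0, tau_decay=10.0):
    """Hypothetical spike response kernel: causal difference of exponentials (ms)."""
    t = np.asarray(t, dtype=float)
    tp = np.clip(t, 0.0, None)  # keeps the exponentials bounded for t < 0
    return np.where(t >= 0.0, np.exp(-tp / tau_decay) - np.exp(-tp / tau_rise), 0.0)

def psp_traces(spike_times, t_grid):
    """PSP_i(t): sum of kappa(t - s) over the spike arrival times s at synapse i."""
    return np.array([
        kappa(t_grid[:, None] - np.asarray(s, dtype=float)[None, :]).sum(axis=1)
        for s in spike_times
    ])

def poisson_spikes(rate_hz, dt_ms, rng):
    """Bernoulli approximation of an inhomogeneous Poisson process on a time grid."""
    p_spike = np.clip(rate_hz * dt_ms * 1e-3, 0.0, 1.0)
    return rng.random(p_spike.shape) < p_spike

dt = 1.0                                     # ms
t_grid = np.arange(0.0, 200.0, dt)
spike_times = [[20.0], [50.0, 120.0], []]    # one list of arrival times per synapse
w = np.array([0.5, 1.0, 2.0])                # illustrative synaptic weights

psp = psp_traces(spike_times, t_grid)        # shape (3, 200)
V_w = w @ psp                                # weighted sum of PSPs (Eq 2)
rate = 100.0 * np.maximum(V_w, 0.0)          # hypothetical rate function, in Hz
spikes = poisson_spikes(rate, dt, np.random.default_rng(0))
```

The Bernoulli approximation is adequate as long as the rate times the bin width stays well below one.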

We model the input with time varying excitatory and inhibitory conductances *g*_{E} and *g*_{I} proximal to the soma such that
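$$I(t) = g_E(t)\,(E_E - U) + g_I(t)\,(E_I - U) \tag{3}$$
as proposed by Urbanczik & Senn [18], where $E_E$ and $E_I$ denote the excitatory and inhibitory reversal potentials.

For large total conductance and slowly varying input, the somatic membrane potential *U*(*t*) is well approximated (see Methods) by its steady state solution
$$U(t) \approx \lambda(t)\,V_w^*(t) + U^*(t) \tag{4}$$
where we introduced the attenuated dendritic potential
$$V_w^*(t) = \frac{g_D}{g_L + g_D}\,V_w(t) \tag{5}$$
the attenuated somatic input
$$U^*(t) = \frac{g_E(t)\,E_E + g_I(t)\,E_I}{g_{\mathrm{tot}}(t)} \tag{6}$$
and the ‘nudging’ factor
$$\lambda(t) = \frac{g_L + g_D}{g_{\mathrm{tot}}(t)} \tag{7}$$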
with *g*_{tot}(*t*) = *g*_{L} + *g*_{D} + *g*_{E}(*t*) + *g*_{I}(*t*), to be in accordance with Urbanczik & Senn [18]. The nudging factor *λ*(*t*) ∈ (0, 1] is close to 1 for small somatic input and equal to 1 if *g*_{E}(*t*) + *g*_{I}(*t*) = 0.
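The decomposition of the steady state into an attenuated dendritic part and an attenuated somatic part can be checked numerically. The sketch below assumes the conductance-based somatic dynamics of Urbanczik & Senn [18], i.e. leak, dendritic coupling, and excitatory and inhibitory conductances with reversal potentials `E_E` and `E_I`. The function names and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def decomposed_steady_state(V_w, g_E, g_I, g_L=1.0, g_D=2.0, E_E=4.0, E_I=-1.0):
    """Steady state written as lam * V_star + U_star (all values illustrative)."""
    g_tot = g_L + g_D + g_E + g_I
    lam = (g_L + g_D) / g_tot                    # nudging factor
    V_star = g_D / (g_L + g_D) * V_w             # attenuated dendritic potential
    U_star = (g_E * E_E + g_I * E_I) / g_tot     # attenuated somatic input
    return lam * V_star + U_star, lam

def direct_steady_state(V_w, g_E, g_I, g_L=1.0, g_D=2.0, E_E=4.0, E_I=-1.0):
    """Solve 0 = -g_L*U + g_D*(V_w - U) + g_E*(E_E - U) + g_I*(E_I - U) for U."""
    return (g_D * V_w + g_E * E_E + g_I * E_I) / (g_L + g_D + g_E + g_I)

U_nudged, lam_nudged = decomposed_steady_state(V_w=1.5, g_E=0.3, g_I=0.1)
U_free, lam_free = decomposed_steady_state(V_w=1.5, g_E=0.0, g_I=0.0)
```

With vanishing somatic conductances the nudging factor is exactly 1 and the soma simply follows the attenuated dendritic potential.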

### Learning as dendritic prediction of the neuron’s future discounted firing rate

The plasticity rule we consider for the dendritic synapses can be seen as differential Hebbian in the sense that both the potentiation and depression term are a product of a post- and presynaptic term. The strength of synapse *i* is assumed to change continuously according to the dynamics
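$$\dot{w}_i = \eta\left[\alpha\,\varphi(U)\,\widetilde{\mathrm{PSP}}_i - \varphi(V_w^*)\,\mathrm{PSP}_i\right] \tag{8}$$
where
$$\widetilde{\mathrm{PSP}}_i(t) = \frac{1}{\tau}\int_{-\infty}^{t} e^{-(t-s)/\tau}\,\mathrm{PSP}_i(s)\,ds \tag{9}$$
is the low-pass filtered postsynaptic potential at synapse *i*, *φ*(*U*) and *φ*(*V**_{w}) are the instantaneous firing rates based on the somatic potential and the attenuated dendritic potential, respectively, and *η* is the learning rate. The factor of potentiation *α* that scales the potentiation term is positive but smaller than the inverse of the largest nudging factor 1/max_{t} *λ*(*t*) to prevent the unbounded growth of synaptic strengths.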
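A single forward-Euler step of such a rule might look as follows. This is a sketch under stated assumptions, not the simulation code of the paper: it takes the weight change to be *η* times (*α* × somatic rate × low-pass filtered PSP − dendritic rate × PSP), implements the low-pass filter as a first-order filter with time constant `tau`, and uses the hypothetical function name `plasticity_step` and illustrative parameter values.

```python
import numpy as np

def plasticity_step(w, psp, psp_lp, rate_soma, rate_dend, dt,
                    eta=1e-3, alpha=1.0, tau=20.0):
    """One Euler step of the assumed rule; returns updated weights and filtered PSPs."""
    # first-order low-pass filter: tau * d(psp_lp)/dt = psp - psp_lp
    psp_lp = psp_lp + dt / tau * (psp - psp_lp)
    # potentiation pairs the somatic rate with the filtered (i.e. earlier) PSP,
    # depression pairs the dendritic rate with the unfiltered PSP
    dw = eta * (alpha * rate_soma * psp_lp - rate_dend * psp)
    return w + dt * dw, psp_lp

psp = np.array([1.0, 0.0])       # only synapse 0 currently has a PSP
zeros = np.zeros(2)

# somatic activity without dendritic prediction: synapse 0 is potentiated
w_pot, lp = plasticity_step(zeros, psp, zeros, rate_soma=10.0, rate_dend=0.0, dt=1.0)
# dendritic prediction without somatic activity: synapse 0 is depressed
w_dep, _ = plasticity_step(zeros, psp, zeros, rate_soma=0.0, rate_dend=10.0, dt=1.0)
```

Because the filtered PSP outlasts the PSP itself, a postsynaptic rate increase that follows a presynaptic spike by more than the PSP duration can still potentiate the synapse.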

Under the assumption of a periodic environment, rich dendritic input dynamics, constant nudging factor *λ* and linear *φ* (Methods), the weight dynamics in Eq 8 leads to prospective coding by making the dendritic rate approach the expected future discounted somatic input rate, i.e.
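$$\varphi\bigl(V_w^*(t)\bigr) \approx \frac{\alpha}{\tau}\int_t^{\infty} e^{-(s-t)/\tau_{\mathrm{eff}}}\,\varphi\bigl(U^*(s)\bigr)\,ds \tag{10}$$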
where the effective discount time constant *τ*_{eff} is given by
$$\tau_{\mathrm{eff}} = \frac{\tau}{1-\alpha\lambda} \tag{11}$$
Depending on the factor of potentiation *α* and the nudging factor *λ*, the effective time constant *τ*_{eff} can be much larger than the biophysical time constant *τ* of low-pass filtering and match behavioral timescales of seconds. In particular, if the somatic input is strong and hence *λ* close to 0 (close to ‘clamping’), the effective discount time constant is short, *τ*_{eff} ≈ *τ*. But when nudging is weak (*λ* close to 1), the synapses on the dendrite learn to predict their self-generated somatic firing rate and the effective discount time constant is extended up to *τ*/(1 − *α*). The case of weak nudging is also the case when the neuron’s somatic firing rate is roughly determined by the dendritic input, *U* ≈ *V**_{w}, see Eq 4. In particular, if after learning the somatic input is transiently silenced, the neuron’s firing rate *φ*(*U*(*t*)), according to Eq 10, represents the discounted future rate of the somatic input *U**(*t*) applied during the previous learning period, even if this was only slightly nudging the somatic potential *U*(*t*) itself.
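How a short filtering time constant can produce a long discount time constant can be sketched heuristically. The following is not the derivation given in the Methods but a short consistency argument. It assumes a linear *φ* with *φ*(0) = 0, a constant *λ*, a normalized exponential low-pass filter with time constant *τ*, and dendritic input rich enough that the average weight change vanishes separately for every time *t*. With the shorthand $f(t) = \varphi(V_w^*(t))$ and $r(t) = \varphi(U^*(t))$, so that $\varphi(U) = \lambda f + r$ by Eq 4:

$$
\begin{aligned}
f(t) &= \frac{\alpha}{\tau}\int_t^{\infty} e^{-(s-t)/\tau}\,\bigl[\lambda f(s) + r(s)\bigr]\,ds
&& \text{(potentiation balances depression)}\\
\tau\,\dot f(t) &= (1-\alpha\lambda)\,f(t) - \alpha\,r(t)
&& \text{(differentiate with respect to } t\text{)}\\
f(t) &= \frac{\alpha}{\tau}\int_t^{\infty} e^{-(s-t)(1-\alpha\lambda)/\tau}\,r(s)\,ds
&& \text{(solution that stays bounded for } \alpha\lambda < 1\text{)}
\end{aligned}
$$

The self-referential term $\lambda f$ in the first line is what stretches the time constant: the exponent in the last line decays with $\tau/(1-\alpha\lambda)$ instead of $\tau$.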

Periodic inputs are unrealistic in a natural setting. But a similar result holds also in more general settings, where a neuron is occasionally exposed to correlated dendritic and somatic inputs. In this more general stochastic setting we derive the main result under the assumption that dendritic and somatic inputs depend on the state of a stationary latent Markov chain *X*_{0},*X*_{1}, …. The dependence on a stationary latent Markov chain assures that the neuron is occasionally exposed to correlated dendritic and somatic inputs. The main result in this setting is (cf. Eq 48)
(12)
where *γ*_{eff} is a large discount factor that leads to a similar discount behavior as in the time-continuous case, if *t* = *kδ*.

It is important to note that in the stochastic case the dendritic rate is only informative about *expected* future somatic inputs. Metaphorically speaking, a neuron can learn to predict the expected win in a lottery, but obviously it cannot learn to predict single lottery draws.

### The bootstrapping effect of predicting the own predictions

In the limit *τ* → 0 we find that $\widetilde{\mathrm{PSP}}_i \to \mathrm{PSP}_i$, and with *α* = 1 we recover the learning rule of Urbanczik & Senn [18]. This rule adapts the dendritic synapses such that the dendritic input matches the somatic input (Fig 2B). On the other hand, the learning rule with a slightly larger potentiation window leads to dendritic input that ramps up long before the onset of somatic input (Fig 2C).

**Fig 2.** **A**-**B** For orthogonal input patterns (exactly one presynaptic spike arrives at each synapse during 2 s) (A) and a somatic input after 1800 ms (*g*_{E} = 15 nS during green shading in B and C, *g*_{E} = 0 otherwise), the learned postsynaptic firing rate has a similar time course as the somatic input if the potentiation window equals the PSP, $\widetilde{\mathrm{PSP}}_i = \mathrm{PSP}_i$ (B, lines from light gray to black: postsynaptic firing rate after 100, …, 1000 training sessions). **C** If the PSP is low-pass filtered (with *τ* = 9 ms), the learned postsynaptic firing rate ramps up with an effective time constant of *τ*_{eff} = 600 ms towards the onset of the somatic input. The theoretical result is in good agreement with the simulation (dashed red line: computed by Eq 10). During training, the 2 s long pattern of dendritic and somatic inputs is periodically repeated.

By looking at Eqs 4 and 8 we can now obtain a better intuition for the bootstrapping effect of predicting the own predictions. If at the beginning of learning all synaptic weights *w*_{i} are zero, the dendritic potential *V*_{w} is at rest (= 0) all the time and the somatic membrane potential *U*(*t*) follows the somatic input *U**(*t*) (see Eq 4). In this case, the learning rule in Eq 8 contains only the potentiation term
$$\dot{w}_i = \eta\,\alpha\,\varphi(U^*)\,\widetilde{\mathrm{PSP}}_i \tag{13}$$
In the example in Fig 2C, the somatic input *U** and consequently *φ*(*U**) is non-zero only after 1800 ms. Therefore, synapse *i* is potentiated only if a presynaptic spike arrives shortly before the onset of the somatic input. The next time a presynaptic spike arrives at synapse *i*, the somatic membrane potential is depolarized by the dendritic input already before the onset of the somatic input and the learning rule contains at this moment (e.g. at 1780 ms in Fig 2C) the terms
$$\eta\left[\alpha\,\varphi(\lambda V_w^*)\,\widetilde{\mathrm{PSP}}_i - \varphi(V_w^*)\,\mathrm{PSP}_i\right] \tag{14}$$
These terms would cancel each other in the case of Urbanczik & Senn [18] where *α* = *λ* = 1 and $\widetilde{\mathrm{PSP}}_i = \mathrm{PSP}_i$. But if $\widetilde{\mathrm{PSP}}_i$ is the low-pass filtered version of the postsynaptic potential (as in Fig 2C) they do not cancel. Instead, synapses are potentiated if a presynaptic spike arrives shortly before the somatic potential is depolarized by dendritic input through already potentiated synapses. The consequence of this bootstrapping effect appears in Fig 2C in the gray curves. After 100 training sessions, the dendritic input starts to rise around 1200 ms, but synapses with earlier presynaptic spikes are not yet strengthened. With each further training session the dendritic input rises earlier.
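The bootstrapping effect can be reproduced in a strongly simplified, rate-based setting. The sketch below is not the spiking simulation of Fig 2. It assumes orthogonal inputs (one synapse per 1 ms time bin), a linear *φ*, a constant *λ*, a non-periodic trial, and a batch update per training session. For orthogonal inputs, pairing the somatic rate with the low-pass filtered PSP is equivalent, summed over a trial, to pairing each PSP with the exponentially discounted future somatic rate, which is what `future_lowpass` computes. All names and parameter values are illustrative.

```python
import numpy as np

def future_lowpass(x, q):
    """Discounted future average: y[i] = (1 - q) * sum_k q**k * x[i + k]."""
    y = np.empty_like(x)
    acc = 0.0
    for i in range(len(x) - 1, -1, -1):
        acc = q * acc + (1.0 - q) * x[i]
        y[i] = acc
    return y

def train(r, n_sessions, alpha, lam, tau, dt, eta=1.0):
    """f[i]: dendritic rate in time bin i; r[i]: rate set by the somatic input."""
    q = np.exp(-dt / tau)
    f = np.zeros_like(r)
    for _ in range(n_sessions):
        somatic_rate = lam * f + r    # linear phi applied to U = lam * V* + U*
        f = f + eta * (alpha * future_lowpass(somatic_rate, q) - f)
    return f

dt, tau, alpha, lam = 1.0, 20.0, 1.0, 0.9     # ms; illustrative values
t = np.arange(0.0, 2000.0, dt)
r = np.where(t >= 1800.0, 1.0, 0.0)           # somatic input in the last 200 ms only

f_early = train(r, 10, alpha, lam, tau, dt)
f_late = train(r, 300, alpha, lam, tau, dt)

def ramp_onset(f, threshold=0.05):
    """First time at which the dendritic rate exceeds the threshold."""
    return t[np.argmax(f > threshold)]

tau_eff = tau / (1.0 - alpha * lam)           # predicted effective time constant
i0, i1 = 1000, 1700                           # fit window before somatic onset
tau_fit = (t[i1] - t[i0]) / np.log(f_late[i1] / f_late[i0])
print(ramp_onset(f_early), ramp_onset(f_late), tau_eff, tau_fit)
```

After the first session the ramp decays with the short time constant `tau`. Each further session lets the neuron predict its own earlier predictions, so the onset of the ramp moves to earlier times until the ramp approaches the long time constant `tau_eff`, up to a small discretization error.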

In the example in Fig 2C, the dendritic and the somatic inputs are deterministic periodic functions. Therefore we can directly compare the simulation to the theoretical results of the previous section. For the interval without somatic input (0–1800 ms), where *U* = *V**_{w}, we find a good agreement (dashed red and thick black line in Fig 2C). Small differences are to be expected, because in the theoretical derivations a constant nudging factor *λ* is assumed and the steady-state solution of the somatic membrane potential dynamics is used (see Eq 4). The dendritic rate *φ*(*V**_{w}) is only slightly below the somatic rate *φ*(*U*) in the interval with somatic input (1800–2000 ms), because the somatic input is small.

### Dependence on the dendritic input structure

The input pattern in Fig 2A is a particularly simple example of a deterministic, periodic pattern with rich enough structure. Enough structure to learn a prospective code exists also in sufficiently many randomly generated (frozen) spike trains that are deterministically repeated, if there is always at least one presynaptic spike within the duration of a *PSP* and the probability of repeating a nearly identical presynaptic spike pattern is low (see Fig 3A). We did not systematically search for the minimal number of required dendritic synapses. But for the example in Fig 3A we found empirically that a few hundred synapses are necessary. If the presynaptic firing frequency is only 2 Hz, we found that 1000 presynaptic neurons are enough to learn the ramp in 100 trials, whenever the learning rate is larger than in the 20 Hz case. At the end of learning, the time course of the somatic potential matches the one of the previous example (black lines in Figs 2C and 3A). But during learning, the time course of the somatic potential is different in the two examples (gray lines in Figs 2C and 3A). This is a consequence of the influence of correlations in the dendritic input. For the frozen spike trains, the presynaptic auto-correlation is non-vanishing for all *s* and *i*. This causes the average firing rate to increase early during learning (Fig 3A; gray lines in interval 0–1500 ms in contrast to gray lines in the same interval in Fig 2C).

**Fig 3.** **A** Learning succeeds with deterministically repeated spike trains (20 frozen spike trains out of 500, *g*_{E} as in Fig 2). **B** Learning succeeds with stochastic spiking, if the spiking rate is variable (blue shading: spiking rate, blue ticks: one sample). **C** The amplitude of the ramp is smaller, if the training event occurs only with 50% probability. **D** If the spiking rate is constant during long intervals, the input is not sufficient to learn a smooth ramp. A stepwise ramp is learned instead (*g*_{E} = 8 nS during green shading).

In the examples given so far, the dendritic and the somatic inputs are deterministic, but deterministic repetitions of the exact same spike trains are unrealistic. In Fig 3B we consider the more realistic case of random spiking. In each trial, the spikes are sampled from an inhomogeneous Poisson process, with periodically repeating rates. The resulting activity ramp is noisier but in good agreement with the theoretical result. It is important that the rates of the Poisson process have sufficiently rich structure. In Fig 3D the firing rate of the Poisson process is kept constant for one third of the trial. In this case, the temporal structure is not sufficiently rich to learn a smooth ramp and a stepwise activity ramp is learned instead.

In Fig 3C, the target event occurs only with a 50% chance, i.e. the somatic input is given only in half the trials. This results in an activity ramp with smaller amplitude, which is consistent with the theoretical finding that the dendritic rate depends linearly on the average somatic input rate (see Eq 12).

### Delayed paired-associate task

Prospective coding in neurons of the prefrontal cortex was observed in an experiment with monkeys performing a delayed paired-associate task [1]. In this experiment, monkeys learned to associate a visual sample to a visual target presented one second later. Our learning rule allows for learning a prospective code in such a task.

During training, sample A1 is always followed by target B1 after a delay of 1 s, and sample A2 is followed by target B2 (Fig 4A). In the simulation we assume that the sample (first stimulus) leaves a memory trace in the form of a spatio-temporal activity pattern that projects through dendritic synapses, while the target (second stimulus) drives somatic synapses (Fig 4B). In order to have sufficiently rich presynaptic activity (cf. Fig 3B), the memory trace of the sample is modeled by an inhomogeneous Poisson process with sample dependent rate trajectories (Fig 4C), i.e. during the presentation of the first stimulus the rate trajectory of each neuron approaches a previously chosen template trajectory that depends on the sample (see Methods). These memory traces are inspired by liquid state machines (see Discussion). If a neuron receives strong somatic input only in the presence of a specific target (neurons 1 and 2 in Fig 4B), its firing rate ramps up exclusively in anticipation of this target (neurons 1 and 2 in Fig 4D). In contrast to such a ‘grandmother-cell coding’ (one neuron for one target), a set of neurons could encode the target in a distributed manner, where the target is identified by the overall activity pattern and single neurons respond differently to different target stimuli. Such a distributed code can be learned with neurons that receive somatic input of target-specific strengths (neuron 3 in Fig 4B; B1 stronger than B2). After learning, the amplitude of the activity ramp reflects this target specificity (neuron 3 in Fig 4D).

**Fig 4.** **A** In the simulation, stimuli A1 and A2 are repeatedly followed by stimuli B1 and B2, respectively, with a delay of 1 s. The two pairs are chosen randomly with equal probabilities. Intertrial intervals are chosen at random uniformly between 3 s and 10 s. **B** The first stimulus (A1 or A2) activates a recurrent network of 2000 neurons representing a short-term memory (STM). The dynamics of the recurrent network is modeled by a stochastic process (see Methods). The STM is read out by 3 neurons that encode in a distributed manner the second stimulus (setting *g*_{E} = 5 nS in neuron 1 and neuron 3 during the B1 presentation, and *g*_{E} = 5 nS in neuron 2 and *g*_{E} = 2.5 nS in neuron 3 during the B2 presentation). **C** The time course of the firing rates of neurons in the recurrent short-term memory network depends on the first stimulus (dark blue: A1; gold: A2; spike trains of a specific STM neuron during 4 A1 trials and its estimated rates for 4 A1 and 4 A2 trials). **D** After learning, the firing rate of neuron 1 ramps up after stimulus A1 (gold trace), but not after stimulus A2 (blue trace). The opposite holds for neuron 2. Since neuron 3 receives more somatic input when B1 is present, the firing rate of neuron 3 ramps up to a larger value after A1 than after A2.

### Prospective coding of time series

In Figs 2 to 4 the somatic target input was silent most of the time and active only during a short interval. This simple time course of the somatic input is, however, not a requirement and learning also converges for more complex trajectories of somatic input. In general, a time varying input through (static) somatic synapses induces plasticity that advances the postsynaptic firing rate *φ*(*U*(*t*)) relative to the firing rate *φ*(*U**(*t*)) determined by the somatic input alone. Fig 5A shows an example with an advancement of roughly 50 ms that has been achieved with a shorter time window (∼20 ms) for synaptic potentiation. As in Fig 3A, the dendritic input was a periodically repeated random spike train that could also be replaced by stochastic spiking with time dependent firing rates as in Fig 3B.

**Fig 5.** **A** Learning leads to an advancement of the postsynaptic firing rate. The dendritic input consists of spike trains of 2000 neurons (bottom; 20 shown). The somatic input is given by *g*_{E}(*t*) = 6(1 − sin(*ωt*) sin(2*ωt*) cos(4*ωt*)) nS, with *ω* = 2*π*/(2000 ms). **B** The correlation of the firing rate curves in A peaks at *t*_{peak} = 52 ms. **C** The advancement increases with the potentiation factor *α*. **D** With increasing potentiation factor *αλ* the effective discount time constant *τ*_{eff} becomes much larger than *τ* (Eq 11).

Since the learning rule converges to a point where the dendritic input is proportional to the future discounted somatic input (Eq 10), the advanced sequence (black in Fig 5A) is not simply a forward shifted version of the somatic input (green in Fig 5A). This becomes clearly apparent at the center of the figure, where the somatic input is symmetric around 1000 ms, but the advanced sequence is decaying, because the somatic input has a strong dip around 1100 ms. Despite this, the advancement can be characterized by the peak time of the correlation function between *φ*(*U*(*t*)) and *φ*(*U**(*t*)), which, like the effective discount time constant *τ*_{eff}, diverges with increasing potentiation factor *α* (Fig 5B–5D).
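Measuring the advancement as the peak lag of a cross-correlation can be sketched as follows. The function name `advancement` is hypothetical, and a circular cross-correlation computed via FFT is assumed because the signals are periodic. Only for the purpose of checking the function, the "learned" curve is constructed as an exactly shifted copy of the target waveform, which, as noted above, is not what the learning rule actually produces.

```python
import numpy as np

def advancement(rate_learned, rate_target, dt):
    """Lag of the circular cross-correlation peak; positive if rate_learned leads."""
    a = rate_target - rate_target.mean()
    b = rate_learned - rate_learned.mean()
    # c[k] = sum_n a[n + k] * b[n], evaluated with periodic boundary conditions
    xcorr = np.fft.irfft(np.fft.rfft(a) * np.conj(np.fft.rfft(b)), n=len(a))
    k = int(np.argmax(xcorr))
    if k > len(a) // 2:       # map lags beyond half a period to negative values
        k -= len(a)
    return k * dt

dt = 1.0                                  # ms
t = np.arange(0.0, 2000.0, dt)
omega = 2.0 * np.pi / 2000.0
# waveform of the somatic conductance in Fig 5A, used here as a stand-in rate curve
target = 6.0 * (1.0 - np.sin(omega * t) * np.sin(2 * omega * t) * np.cos(4 * omega * t))
learned = np.roll(target, -52)            # learned[n] = target[n + 52]: leads by 52 ms

peak_lag = advancement(learned, target, dt)
```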

Time series prediction is a fundamental operation of the brain that is, for instance, involved in motor planning. In our context, the activity time course that has to be reproduced may be provided by proprioceptive feedback from muscles as somatic input *U** to neurons in the primary motor cortex [19]. This feedback can be weak, delayed and sparse. The dendritic input *V**, in turn, may be conveyed by a higher visual area or a premotor planning area. This dendritic input learns to predict the discounted future firing rate caused by the somatic input, and hence learns to produce the muscle activity that feeds back again as a delayed proprioceptive signal.

### Prospective coding in a recurrent neural network

Lastly, we consider a recurrently connected network of 200 neurons that receive external input only at the soma and no external input at the dendrites. The input at the dendrites is given by the output spikes of the network neurons, where we consider all-to-all connectivity (Fig 6A). In contrast to the examples in Figs 2 to 5, there is no external control to assure the richness of the dendritic input and there are no guarantees that learning converges in the sense of Eq 10. Still, we observe the interesting result that learning changes synaptic strengths to allow fast replay of slow experienced sequences.

**Fig 6.** **A** The dendritic input consists of the spike trains of other neurons within the same network (for clarity only one axon is completely drawn; in the simulation we used all-to-all connections). **B** Groups of 50 neurons receive sequential somatic input (*g*_{E} = 20 nS during green shading) of duration 100 ms. After repeated stimulation, the firing rate increases even before the somatic input arrives (see for example neurons 51–100 in the first 100 ms). Before 800 ms the last two of 300 training repetitions are shown. Afterwards, no somatic input is provided except for a brief stimulation (after 2400 ms), after which the sequence is autonomously replayed at a faster speed.

For sequentially and periodically repeated stimulations on a slow timescale (green shading in Fig 6), the recurrent dendritic connections between subsequently stimulated groups of neurons are strengthened. After 300 repetitions of the same sequence, a brief initial stimulation is sufficient to evoke an activity sequence that has the same ordering as the original sequence (Fig 6B after 2400 ms). However, the replay dynamics can be much faster than the dynamics of the stimulation. Replay depends on the internal dynamics of the network, notably the time constants of the *PSP* and the membrane time constant. Due to prospective coding, the sequence becomes advanced in time while repeatedly presenting the stimuli, and due to the recurrent connectivity the advanced sequence can be recalled with a brief stimulation of the first group of neurons (Fig 6B). Note that there is no need to explicitly distinguish between a training and recall session. Recall differs from training only in the somatic input, which consists of a brief activation of the first group of neurons during recall and slow, sequential activation during training. The learning rule is active all the time.

### Relation to TD(*λ*)

The proposed learning mechanism of prospective coding is related to a well studied version of temporal difference (TD) learning. Using our notation for a stochastic and time discrete setting, the goal in TD learning is to estimate a value function
(15)
where 0 < *γ*_{TD} < 1 is a discount factor and the expectation is taken over the Markov chain *X*_{0},*X*_{1}, …. We assume that this value function can be approximated by a linear function of the form
(16)
where *φ* is linear. In TD(*λ*) with linear function approximation, the weights *w* evolve according to the learning rule [20–22]
(17)
with learning rate *η*, an eligibility trace with decay parameter 0 ≤ *λ*_{TD} ≤ 1, and delta error
(18)

This delta error is zero on average if the approximation is equal to the value function in Eq 15. Furthermore, converges to under the learning rule of TD(*λ*) in Eq 17 [20]. The discrete time version of our learning rule (Eq 42), implemented in the 2-compartment model, converges to Eq 12, which is identical to the value function in Eq 15 if *γ*_{TD} = *γ*_{eff}. Therefore, this form of TD(*λ*) and our learning mechanism converge to the same value. Both methods also use an eligibility trace, and the two traces are the same if *λ*_{TD} *γ*_{TD} = *γ*, i.e. *λ*_{TD} = *γ*/*γ*_{eff}. But despite the convergence to the same point and the use of the same eligibility trace, learning in general moves along different trajectories under this form of TD(*λ*) and under the learning mechanism we propose.
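For readers less familiar with TD(*λ*), the following minimal Python sketch implements the textbook form of the algorithm with one-hot features on a deterministic cyclic chain. It is an illustration only and not the simulation code of this study: the function names, the reward vector and all parameter values are hypothetical, and the delta error is taken in its standard form *r*(*X*_{t}) + *γ*_{TD} *V*(*X*_{t+1}) − *V*(*X*_{t}), which we assume corresponds to Eq 18.

```python
def true_values(rewards, gamma):
    """Discounted sum of future rewards on a deterministic cycle.

    V(s) = sum_{t>=0} gamma^t * r(s + t), with states visited cyclically.
    One lap contributes sum_{k<n} gamma^k r(s+k); each later lap is scaled by gamma^n.
    """
    n = len(rewards)
    return [
        sum(gamma ** k * rewards[(s + k) % n] for k in range(n)) / (1.0 - gamma ** n)
        for s in range(n)
    ]


def td_lambda_cycle(rewards, gamma, lam, eta, steps):
    """Textbook TD(lambda) with accumulating traces and one-hot state features.

    All argument names are illustrative (hypothetical), not taken from the paper.
    """
    n = len(rewards)
    w = [0.0] * n        # one weight per state (linear function approximation)
    trace = [0.0] * n    # eligibility trace, decays with factor gamma * lam
    state = 0
    for _ in range(steps):
        nxt = (state + 1) % n
        # Decay the trace and mark the currently visited state.
        trace = [gamma * lam * e for e in trace]
        trace[state] += 1.0
        # Standard TD delta error for the transition state -> nxt.
        delta = rewards[state] + gamma * w[nxt] - w[state]
        # Every weight is updated in proportion to its eligibility.
        w = [wi + eta * delta * ei for wi, ei in zip(w, trace)]
        state = nxt
    return w


if __name__ == "__main__":
    rewards = [0.0, 0.0, 0.0, 0.0, 1.0]  # hypothetical: reward only in the last state
    print(true_values(rewards, 0.9))
    print(td_lambda_cycle(rewards, gamma=0.9, lam=0.5, eta=0.05, steps=20000))
```

In this toy setting the delta error vanishes for every transition once the weights equal the discounted future reward, so the learned weights ramp up towards the rewarded state in the same way as the value function.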

So far we compared the learning mechanism of prospective coding to the plain TD(*λ*) that has access to the *PSP* and *U**. If only access to *U* = *λV**+*U** is available, it is also possible to combine TD(*λ*) with the bootstrapping effect of predicting the own predictions by implementing a variant of TD(*λ*) in the dendritic compartment of the 2-compartment model. If the delta error is defined as
(19)
one can show that the learning rule in Eq 8 is almost identical to the TD learning rule in Eq 17 with *λ*_{TD} = 1 (Methods). In this case, the weights during learning move along similar trajectories, irrespective of whether this form of TD(1) or our learning rule is used. If this form of TD(1) were not implemented in the 2-compartment model, i.e. if the first term in the delta error in Eq 19 were replaced by *φ*(*U**(*X*_{t})), the time constant of future discounting would be *γ* instead of *γ*_{eff}. But since the first term in the delta error in Eq 19 depends on the full somatic potential *U* = *λV** + *U**, the bootstrapping effect of predicting the own predictions applies and the large time constant *γ*_{eff} arises.

## Discussion

As a simple and biologically plausible explanation for how animals can learn to predict future events, we have proposed a local plasticity mechanism that leads to prospective coding in spiking neurons, i.e. the plastic synapses change such that the neuron’s current firing rate depends on its expected, future discounted firing rate.

Our model proposes a partial solution to the problem of learning associations on a behavioral timescale without using slow intrinsic processes. Even with a plasticity window that is only slightly larger than the duration of a postsynaptic potential, the effective time constant of discounting the expected future firing rate can be on the order of seconds, thanks to the bootstrapping effect of predicting the own predictions. This effect arises because inputs that are already predictive influence the activity of the neuron. This is captured by the 2-compartment model of Urbanczik & Senn [18], where the output depends on both the dendritic (student) and the somatic (target) input.

For clarity, we presented the model with target input through static (i.e. non-plastic) somatic synapses, and in the examples of ramping activity in Figs 2 and 3 the somatic input was non-zero only during a short period. This simple form of the target input is not a requirement. First, the learning mechanism also applies to arbitrary time courses of the somatic input, as we show in the example of time series prediction in Fig 5, where an advanced and smoothed version of a complex somatic input is learned. Second, the somatic synapses do not need to be static. However, they should change more slowly than the dendritic synapses so that the plasticity timescales remain separated. And third, the target input could also arrive at another dendritic branch instead of the soma (see generality of the results in Methods).

We focused solely on learning temporal associations and neglected important aspects of learning in animals. However, the proposed learning mechanism can easily be extended to include, for example, a weighting based on behavioral relevance. In the delayed paired-associate task, our model learns the associations between sample and target irrespective of the behavioral relevance of this association. In animal training, however, reward or punishment is crucial; for example the monkeys in the study of Rainer et al. [1] received juice rewards. The learning rate in our learning mechanism is a free parameter that could incorporate a weighting by behavioral relevance. Biophysically, a neuromodulator like dopamine could implement this modulation of the learning rate. It is also possible to postpone the weight update in Eq 8 and use reward modulated eligibility traces instead (see e.g. [23–25] for theory and [15, 26] for experiments).

The proposed learning mechanism could also be involved in classical trace conditioning, where the first stimulus (CS) is separated from the second stimulus (US) and the response (R) by a delay period, similar to the situation in the delayed paired-associate task. Let us assume that neuron 1 in Fig 4 is involved in initiating response R (e.g. salivation). If the unconditioned stimulus causes somatic input to this neuron and a memory trace of the conditioned stimulus arrives at the dendritic synapses, our learning mechanism would lead to ramping activity and salivation prior to the onset of the unconditioned stimulus that originally triggered the salivation. To our knowledge, there is no conclusive experimental data to support or discard the hypothesis that prospective coding is involved in classical trace conditioning. In the cited studies on ramping activity [1, 2, 6–11], the animals were actively engaged in a task (operant conditioning). It is unlikely, however, that the ramping activity is merely a side-effect of movement preparation, since Rainer et al. [1] found it to be stimulus-specific but not action-specific.

In our model of delayed paired-associate tasks, activity ramps rely on temporally structured input from short-term memory neurons. The usage of these short-term memory neurons is motivated by the observation that hippocampal activity is needed to overcome the temporal discontiguity in trace conditioning [4, 27]. We modeled the dynamics of the recurrent short-term memory network with a stochastic process. The parameter choice of this stochastic process is inspired by the widespread experimental observation that stimulus onset quenches the neural variability [28, 29]. It should also be possible to model the memory traces with “dynamical attractors” in recurrent networks of rate neurons [30] or with long and stable transient dynamics in balanced networks [31]. Since these memory traces are not the main focus of this study we generated them in a simpler way with the stochastic process, which still feels more natural than the delay-line like traces used in a study on trace conditioning [32].

In recurrent neural networks the learning rule of prospective coding allows fast replay of slow input sequences (Fig 6). Fast replay could be valuable for planning, where it is important to quickly assess the likely successors of a given state. The same fast replay of a previously induced slower activity sequence was also observed in the rat primary visual cortex [33] and it is as well studied as compressed hippocampal replay of a spatial sequence [34]. In rats these replay events can be observed minutes or hours after the spatial experience. In contrast, the simple form of the plasticity rule in Eq 8 does not have any consolidation properties and ongoing pre- and postsynaptic activity would quickly change the learned weight patterns and thus overwrite the memories. It is, however, straightforward to extend the plasticity model by a consolidation mechanism. In the three-state consolidation model of Ziegler et al. [35, 36], early long-term potentiation (LTP) is induced by a triplet rule [37]. Replacing the triplet rule by the plasticity rule in Eq 8 would endow the learning rule of prospective coding with a consolidation mechanism. Such a consolidation mechanism would allow sequences to be replayed a long time after the training session.

Aiming at a better understanding of biological implementations of prediction learning, our model allows us to speculate about physiological realizations of the model variables. Similar to previously proposed plasticity rules [16, 18], our learning mechanism depends on the postsynaptic firing rate *φ*(*U*), a function of the dendritic potential , the postsynaptic potential *PSP* and, as a new ingredient compared to previous propositions [16, 18], a low-pass filtered version of the postsynaptic potential . A plasticity window that is slightly larger than the duration of a postsynaptic potential is in agreement with experimentally measured plasticity window sizes [13, 38]. In particular, an increased level of dopamine was observed to expand the effective time window of potentiation to at least ∼45 ms [38]. Importantly, even with a plasticity window on this timescale, predictions can be learned on a timescale of seconds due to the bootstrapping effect of predicting the own predictions.

We have shown that the proposed learning mechanism is closely related to temporal difference learning with eligibility traces TD(*λ*). As discussed in the previous paragraph, a local biological implementation of our learning rule seems straightforward. In contrast, it seems more challenging to locally implement the delta error of TD learning. Potjans et al. and Kolodziejski et al. propose a local implementation that depends either on differential Hebbian plasticity [39] or on two postsynaptic activity traces with different time constants to approximate the difference in the delta error [40]. Both methods require a gating mechanism that allows plasticity only shortly after the onset of a new state and they require transition intervals between states of fixed duration. Furthermore, “state neurons” are only highly active when the agent is in a certain state, which requires the segmentation of the sensory input stream into discrete states. The learning rule we propose does not require these strong assumptions.

Frémaux et al. [41] speculate about a non-local implementation of TD learning with spiking neurons, where the TD error is represented by the firing rate of dopaminergic neurons that receive input from three groups of neurons that encode reward, value function and derivative of the value function. In the simulations, however, Frémaux et al. did not use the proposed network implementation of the TD error and they mention that it remains to be seen whether such a circuit can effectively be used to compute a useful TD error. A non-local implementation of the TD error appears compelling in an actor-critic setting, since the actor and the critic can be learned with the same TD signal. However, if the task is to predict more than a scalar quantity like reward, it seems inefficient to use a non-local implementation of the TD error for each quantity to be predicted. Already in our simple example of prospective coding in a recurrent neural network, four TD error networks would be needed in such a non-local implementation.

Generally, associating temporally separated events requires some memory of the first event until the second event is present. Possible neural implementations of this memory rely on long spiking activity traces or on long synaptic eligibility traces. Our model of the delayed paired-associate task relies on long spiking activity traces. The short-term memory network can be seen as a liquid state machine [42] or echo state machine [43] and the ramping activity is learned as a readout from these activity traces. Alternatively, the activity trace could be represented by slowly, exponentially decaying spiking activity after strong stimulation of a cell [44]. This proposition, however, fails to explain the experimentally observed activity ramps prior to predictable events [1, 2, 6–11].

The origin of the ramping activity observed in experiments is not yet fully understood. An alternative to our proposition can be found in recurrent neural network dynamics, where slowly ramping or decaying activity arises with appropriately tuned synaptic weights [2, 25]. In a reinforcement learning setting the time constant of the ramp can be learned by adjusting the recurrent weights with reward modulated Hebbian plasticity [45]. Data analysis of recordings in the macaque lateral intraparietal area revealed yet another candidate explanation: single neuron activity profiles could follow a step-like time course, while the averaged activity is a ramp, if the steps occur at different points in time [46].

Despite the formal link of our prospective coding algorithm to TD learning, the learning we consider is purely supervised on the level of the neuron. Yet, the same learning rule can also be used to explain conditioning experiments. Instead of the multiplicative modulation by a global reward signal, the reward signal could directly nudge the somatic compartment of the neurons and act as a teaching signal. But the learning rule would also allow for combining the somatic nudging signal with an additional modulatory factor, and nudging and modulatory signals could even be sparse and interleaved. For instance, the rule may explain the simultaneous shaping of predictive motor circuitries by sensory feedback and reward [5]. Fluctuating somatic inputs may cause behavioral variations and feedback signals may gate dendritic plasticity such that only rewarded fluctuations act as a target signal for prospective coding. It is also possible to adapt the somatic input connections directly with reinforcement learning, and a ramping activity could arise from learning a prospective code with stimulus-dependent dendritic input.

Since reward is an intrinsic component in animal training, we have acquired advanced knowledge about the neuronal bases of reward prediction. But predictions are not restricted to reward, and predicting the identity of stimuli yields more versatile information. We speculate that prospective coding is more abundant than previously thought and, as we showed, it could easily be implemented on the level of an individual neuron. This view is also consistent with the recently observed future-predicting encoding in the retina [47]. To this end, a potentiation window slightly larger than a PSP, together with the bootstrapping effect of predicting the own predictions, is a parsimonious mechanism for learning prospective codes by neurons. A characteristic of these neurons is that their current firing rate matches their own expected future discounted rate.

## Methods

### Parameters of the neuron model

The spike response kernel *κ* in Eq 2 is given by
(20)
with Heaviside function *H*(*t*) = 0 if *t* < 0 and *H*(*t*) = 1 otherwise, *τ*_{m} = 10 ms, *τ*_{s} = 10/3 ms and .

We set the somatic capacitance *C* = 1 nF, the leak conductance *g*_{L} = 100 nS, the coupling conductance *g*_{D} = 1.8 μS, and the excitatory and inhibitory reversal potentials *E*_{E} = 14/3 and *E*_{I} = -1/3, respectively. The description of the excitatory conductance *g*_{E}(*t*) is given in the figure captions. The inhibitory nudging conductance *g*_{I}(*t*) was equal to 0 except for simulations with in Fig 2, where *g*_{I}(*t*) = 4*g*_{E}(*t*). The resting potential is 0 for both the dendritic potential *V*_{w} and the somatic potential *U*. If one takes our unitless resting potential of 0 to correspond to -70 mV, and a potential of 1 to correspond to -55 mV, the above choices for *E*_{E} and *E*_{I} correspond to reversal potentials of 0 mV (excitation) and -75 mV (inhibition).
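The conversion between unitless potentials and millivolts stated above is a linear map, which the following short Python sketch makes explicit. The function name `to_millivolts` and its argument names are hypothetical; only the two anchor points (0 at -70 mV, 1 at -55 mV) and the reversal potentials are taken from the text.

```python
def to_millivolts(u, zero_mv=-70.0, one_mv=-55.0):
    """Map a unitless potential u to mV, with u = 0 at zero_mv and u = 1 at one_mv."""
    return zero_mv + (one_mv - zero_mv) * u


E_E = 14.0 / 3.0   # unitless excitatory reversal potential (from the text)
E_I = -1.0 / 3.0   # unitless inhibitory reversal potential (from the text)

if __name__ == "__main__":
    # Up to floating point rounding these evaluate to about 0 mV and -75 mV.
    print(to_millivolts(E_E))
    print(to_millivolts(E_I))
```

One unit of the unitless potential thus corresponds to 15 mV.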

The instantaneous firing rate of the neuron is assumed to depend on the somatic membrane potential through a function *φ*(*U*), which in the simulations has the form
(21)
with *φ*_{max} = 0.06 kHz. In simulations with spiking, the firing rate multiplied by the simulation time step serves as instantaneous rate of an inhomogeneous Bernoulli process.
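Spike generation from an instantaneous rate can be sketched as follows: in each time step a spike is emitted with probability equal to the rate times the step size. This is a minimal illustration and not the C code of the simulations; the function `sample_spike_train`, its arguments and the seed handling are hypothetical, and the rate is passed in as an arbitrary function of time because the specific form of *φ* in Eq 21 is not reproduced here.

```python
import random


def sample_spike_train(rate_khz, dt_ms, n_steps, seed=0):
    """Sample an inhomogeneous Bernoulli spike train.

    rate_khz: function mapping time (ms) to a firing rate in kHz (hypothetical interface).
    dt_ms:    simulation time step in ms, so rate * dt is a dimensionless probability.
    Returns a list of 0/1 entries, one per time step.
    """
    rng = random.Random(seed)
    spikes = []
    for step in range(n_steps):
        p_spike = min(1.0, rate_khz(step * dt_ms) * dt_ms)
        spikes.append(1 if rng.random() < p_spike else 0)
    return spikes


if __name__ == "__main__":
    # Constant rate at the maximal value 0.06 kHz with a 0.1 ms step:
    # the spike probability per step is 0.006.
    train = sample_spike_train(lambda t: 0.06, dt_ms=0.1, n_steps=100000)
    print(sum(train), "spikes in 10 s")
```

With a step size of 0.1 ms and rates up to 0.06 kHz, the spike probability per step stays below 1%, so at most one spike per step is a reasonable approximation.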

### Steady state solution of the somatic potential dynamics

For slowly enough changing *I*_{tot}(*t*) and *g*_{tot}(*t*), *U*(*t*) is well approximated by *I*_{tot}(*t*)/*g*_{tot}(*t*). To see this, we use the ansatz *U*(*t*) = *I*_{tot}(*t*)/*g*_{tot}(*t*) + *ϵ*(*t*) in Eq 1 and find
(22)
which leads to the conclusion that the error *ϵ* is small if during at least an interval of approximate duration .

Under these assumptions we write
(23)
where we introduce the ‘nudging’ factor , the attenuated dendritic potential , and the attenuated somatic input .
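The quality of the steady state approximation can be checked numerically. The sketch below assumes that the somatic dynamics can be written in the compact form *C* d*U*/d*t* = *I*_{tot}(*t*) − *g*_{tot}(*t*) *U*(*t*); this is our reading of the ansatz above, not a quotation of Eq 1. The function names, the sinusoidal test input and the constant total conductance of 1.9 μS are hypothetical choices for illustration.

```python
import math


def max_tracking_error(i_tot, g_tot, c_nf=1.0, dt_ms=0.1, t_end_ms=2000.0):
    """Integrate C dU/dt = I_tot - g_tot * U with the Euler forward method.

    Units: ms, nF and uS, so that C / g_tot is a time in ms.
    Returns the largest deviation |U - I_tot / g_tot| along the trajectory.
    """
    u = i_tot(0.0) / g_tot(0.0)          # start on the steady state solution
    max_err = 0.0
    for k in range(int(round(t_end_ms / dt_ms))):
        t = k * dt_ms
        u += dt_ms / c_nf * (i_tot(t) - g_tot(t) * u)
        t_next = t + dt_ms
        max_err = max(max_err, abs(u - i_tot(t_next) / g_tot(t_next)))
    return max_err


def sinusoidal_input(period_ms, g=1.9):
    """Hypothetical input whose steady state oscillates between 0 and 1."""
    return lambda t: g * (0.5 + 0.5 * math.sin(2.0 * math.pi * t / period_ms))


G_TOT = lambda t: 1.9                     # constant total conductance in uS (assumption)
slow_error = max_tracking_error(sinusoidal_input(1000.0), G_TOT)
fast_error = max_tracking_error(sinusoidal_input(2.0), G_TOT)

if __name__ == "__main__":
    print("slow input:", slow_error)
    print("fast input:", fast_error)
```

For input that varies on a timescale of a second the deviation from *I*_{tot}/*g*_{tot} stays small, whereas input that varies on the timescale of *C*/*g*_{tot} is tracked poorly, in line with the condition stated above.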

### Generality of the results

Our main results are robust to variations of the model. For example, the target input *I* could be given by the input through static synapses on another dendritic branch instead of synapses at the soma, i.e. *I*(*t*) = *g*_{D}(*V*_{s}(*t*) − *U*). In this case, the nudging factor becomes and is constant in time.

Modifying the depression term of the learning rule has an effect on the effective time scale *τ*_{eff}, but large effective time constants are achievable in any case. If the depression term in Eq 8 were replaced by , the effective time constant would be *τ*_{eff} ≈ *τ*/(1 − *α*), i.e. *τ*_{eff} would be independent of *λ* but still diverge when *α* → 1. Similarly, for a depression term given by −*φ*(*V*_{w})PSP_{i}, the effective time constant would be *τ*_{eff} ≈ *τ*/(1 − *λ*_{2} *α*), with *λ*_{2} = *g*_{D}/*g*_{tot}.

In the current formulation of the learning rule (Eq 8), the postsynaptic term appears as the instantaneous firing rate *φ*(*U*). But this rate could also be replaced by a postsynaptic sample spike train *S*(*t*) that averages out to this same rate, 〈*S*(*t*)〉 = *φ*(*U*(*t*)). Since this sampling slows down learning, we ran our simulations with the rate-based form of Eq 8.

### Dynamics of short-term memory neurons in the delayed paired-associate task

For each STM neuron *i* we first choose template rate trajectories for stimulus A1 and for stimulus A2 by sampling from a mean-zero Ornstein-Uhlenbeck process
(24)
where *W* is a Wiener process, 1/*θ*_{1} = 1000 ms, *σ*_{1} = 1 and *s* ∈ {1, 2}. Actual rate trajectories *r*_{i}(*t*) were sampled from a process with trial dependent mean and time dependent variance, i.e.
(25)
where 1/*θ*_{2} = 100 ms,
(26)
and *σ*^{2}(*t*) = 1 if *t* < 0 s or *t* > 3 s, *σ*^{2}(*t*) = 0.1 otherwise. This ensures that in each trial the rate trajectories approach the template trajectories during the presentation of the sample. In between trials, the rate trajectories are independent of the template trajectories. Spike times are determined by sampling from an inhomogeneous Bernoulli process with rate *φ*(*r*_{i}(*t*))Δ*t*, where Δ*t* is the simulation time step.
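Template trajectories of this kind can be generated with an Euler-Maruyama discretization of a mean-zero Ornstein-Uhlenbeck process. The sketch below assumes the standard form d*r* = −*θ* *r* d*t* + *σ* d*W*; the exact scaling of the noise term in Eq 24 may differ, and the function `sample_ou`, its arguments and the seed are hypothetical.

```python
import math
import random


def sample_ou(theta, sigma, dt, n_steps, r0=0.0, seed=0):
    """Euler-Maruyama samples of dr = -theta * r * dt + sigma * dW.

    theta: inverse correlation time (1/ms), e.g. 1/1000 for the templates.
    dt:    time step in ms. Returns a list with n_steps + 1 values.
    """
    rng = random.Random(seed)
    r = r0
    trajectory = [r]
    for _ in range(n_steps):
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)  # Wiener increment
        r += -theta * r * dt + noise
        trajectory.append(r)
    return trajectory


if __name__ == "__main__":
    # Template with 1/theta = 1000 ms and sigma = 1, sampled for 3 s at 0.1 ms.
    template = sample_ou(theta=1.0 / 1000.0, sigma=1.0, dt=0.1, n_steps=30000)
    print(template[-1])
```

The trial-dependent process in Eq 25 could be sketched in the same way by adding a drift towards the template and a time dependent noise amplitude.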

### Simulation details

The differential equations were integrated with the Euler forward method with step size 0.1 ms. We chose the learning rate *η* = 0.5 in all simulations except for the simulations in Fig 2B and 2C, where *η* = 50, since the presynaptic firing rate is low. All simulations are written in C. The plots are generated with Mathematica. The source code is publicly available at https://github.com/jbrea/prospectiveCoding.

### Stationary point of learning for periodic environments

We assume a stationary environment and rich dendritic input dynamics, such that the dendritic inputs can potentially be predictive of the somatic input. There are different ways to model stationarity of the environment. One way is to restrict the inputs to depend on a stationary latent Markov chain. We consider this case in detail in the next section. Here, to present the main ideas in a mathematically simple form, we look at the artificial case, where stationarity enters through deterministic and periodic functions PSP_{i}(*t*) and *U**(*t*) with period *T*. Under this assumption, learning is at a stationary point when the changes of the weights in Eq 8 integrated over one period vanish, i.e.
(27)
Using the definition of in Eq 9 we find
(28) (29)
where Eq 29 is obtained by changing the order of integration, changing the integration variable *t* to *t* + *s* and using , which holds for any *T*-periodic function *f*(*t*). The puzzling transition from an integral that depends on the past values of PSP_{i} in Eq 28 to an integral that depends on the future values of *U* in Eq 29 is a result of the assumed stationarity of the environment, which here is expressed in the periodicity of the functions PSP_{i}(*t*) and *U**(*t*). Eq 29 holds for all synapses *i*, if
(30)
Strictly, Eq 30 follows from Eq 29 only if the inputs PSP_{i}(*t*) span the space of square integrable, *T*-periodic functions. In actual implementations the number of synapses is limited, but we find empirically that Eq 30 holds approximately at the stationary point if, loosely speaking, the inputs PSP_{i}(*t*) at individual synapses are sufficiently rich and different from each other.
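The step from an integral over past inputs to an integral over future rates relies only on periodicity, and it can be verified numerically in discrete time. In the sketch below, a periodic sequence `rate` stands in for *φ*(*U*(*t*)) and `psp` for PSP_{i}(*t*); the exponential kernel, the sequence length and all names are hypothetical choices for illustration and are not taken from Eq 9.

```python
import math


def past_filtered(x, kernel):
    """Low-pass filter over the past: y[t] = sum_s kernel[s] * x[t - s] (periodic x)."""
    n = len(x)
    return [sum(k * x[(t - s) % n] for s, k in enumerate(kernel)) for t in range(n)]


def future_filtered(x, kernel):
    """Discounted sum over the future: y[t] = sum_s kernel[s] * x[t + s] (periodic x)."""
    n = len(x)
    return [sum(k * x[(t + s) % n] for s, k in enumerate(kernel)) for t in range(n)]


def period_sum(a, b):
    """Sum of the product a[t] * b[t] over one period."""
    return sum(ai * bi for ai, bi in zip(a, b))


N = 50                                                # period in time steps
rate = [1.0 + math.sin(2.0 * math.pi * t / N) for t in range(N)]
psp = [math.exp(-t / 5.0) for t in range(N)]          # PSP-like bump at period onset
kernel = [math.exp(-s / 20.0) for s in range(400)]    # exponential kernel (assumption)

# Rate times filtered past PSP, summed over one period ...
lhs = period_sum(rate, past_filtered(psp, kernel))
# ... equals PSP times discounted future rate, summed over one period.
rhs = period_sum(psp, future_filtered(rate, kernel))

if __name__ == "__main__":
    print(lhs, rhs)
```

The two sums agree up to floating point rounding because shifting the summation index by *s* maps one period onto another full period.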

The right-hand side of Eq 30 also depends on the dendritic potential , since the membrane potential *U* depends both on the dendritic input and the somatic input *U** (see Eq 4). Assuming a linear transfer function *φ*, Eq 30 becomes
(31)
With a Fourier transform and assuming a constant nudging factor *λ* we can solve this equation for .

The Fourier coefficients , , of the *T*-periodic function are given by
(32) (33) (34)
where, in the first line, we changed the order of integration, changed the variable *t* to *t* − *s* and used the periodicity of the integrand to obtain . In the second line we introduced the Fourier coefficients .

With and *g*(*t*) = *φ*(*U**(*t*)) we rewrite Eq 31 as
(35)
and Fourier transform both sides to obtain
(36)
Solving for leads to
(37) (38)
This equation has the same structure as Eq 34. With the inverse Fourier transform and assuming *αλ* < 1 we find Eq 10, i.e.
(39)
where .

### Convergence of learning in stationary stochastic environments

We formalize the notion of a stationary environment by introducing a stationary latent Markov chain and restricting the dendritic input PSP_{i}(*t*) = PSP_{i}(*X*_{t}) and the somatic input *U**(*t*) = *U**(*X*_{t}) to depend on the state *X*_{t} of the Markov chain. An alternative way to formalize the notion of stationarity would be to define stationary dynamics of the dendritic inputs and define the correlation between dendritic and somatic input. As it is always possible to reformulate the stationary dendritic input dynamics and the correlation between dendritic and somatic input in terms of a stationary latent Markov chain—with potentially large state space—we stick to the description with a latent Markov chain.

Formally, for time , states *X*_{t} in a finite set evolve according to a stationary, irreducible Markov chain with transition probabilities *T*(*s*_{i},*s*_{j}) = *Pr*(*X*_{t+1} = *s*_{j}|*X*_{t} = *s*_{i}) and stationary distribution *π*(*s*_{i}) = *Pr*(*X*_{t} = *s*_{i}).

Note that the case of deterministic periodic input is readily formulated in terms of a stationary latent Markov chain that cycles deterministically through the state space, e.g. *T*(*s*_{i}, *s*_{j}) = 1 if *j* = *i* + 1, or if *i* = *N* and *j* = 1, and *T*(*s*_{i}, *s*_{j}) = 0 otherwise. Functions that depend only on the state of the Markov chain are thus cyclic with period *N*, e.g. PSP_{i}(*X*_{t}) = PSP_{i}(*X*_{t+N}).

In order to switch to matrix notation in the rest of this section, we introduce the following terms:

- Discounting operator , with transition matrix *T* and discount factor *γ* ∈ [0, 1).
- Postsynaptic potentials **b**_{i} = (PSP_{i}(*s*_{1}), …, PSP_{i}(*s*_{N}))′ for each synapse *i*.
- Matrix of postsynaptic potentials *B* = [**b**_{1} **b**_{2} ⋯ **b**_{S}], where *S* is the number of synapses.
- Postsynaptic firing rates **r**_{U} = (*φ*(*U*(*s*_{1})), …, *φ*(*U*(*s*_{N})))′.
- Dendritic rates .
- Somatic input rates **r**_{I} = (*φ*(*U**(*s*_{1})), …, *φ*(*U**(*s*_{N})))′.
- Expected future discounted firing rate .
- Expected low-pass filtered postsynaptic potential .

In the following we sketch the proof for the equivalents of Eqs 30, 12 and 11 in the Markov chain setting. We will make use of the following basic facts about conditional expectations:
(40) (41)
where *t* > 0, *T*^{t} denotes the matrix power of *T*, Π = diag(*π*(*s*_{1}),*π*(*s*_{2}), …, *π*(*s*_{N})) is the diagonal “stationary distribution matrix”, **f** = (*f*(*s*_{1}),*f*(*s*_{2}), …, *f*(*s*_{N}))′ is a column vector and the row vector **f**′ is the transpose of **f**.

#### 1. At the fixed point of learning the dendritic rate is proportional to the expected future discounted firing rate (cf. Eq 30).

In discrete time the learning rule in Eq 8 becomes
(42)
with . While Δ*w*_{t,i} is a stochastic variable in general, we will discuss in the following only the corresponding ordinary differential equation (ODE) of the mean
(43)
where is the expected low-pass filtered postsynaptic potential. This ODE has the same fixed point and convergence behavior as the dynamics in Eq 42 under mild assumptions [48].

As in Eqs 28 and 29, we are going to show now that we can rewrite the dynamics of the mean synaptic weight in terms of the future discounted firing rate *F*(*x*) instead of the expected low-pass filtered postsynaptic potential , i.e.
(44)

This result is a consequence of the assumed stationarity of the Markov chain. It can be found by focusing on the potentiation term in the learning rule in Eq 43 and using Eq 41 and the notation introduced in the last paragraph, in particular,
(45)
which leads to
(46)
Using this equality in Eq 43 leads to Eq 44.

Assuming a trivial kernel for *B*Π, i.e. *B*Π**x** = **0** ⇔ **x** = **0**, we find by looking at Eq 44 that
(47)
which is analogous to the statement in Eq 30. The assumption of a trivial kernel of *B*Π implies that the map PSP(*s*_{i}) from the state space of the latent Markov chain to dendritic inputs is one-to-one. This assumption is analogous to the statement that the dendritic inputs PSP_{i}(*t*) at individual synapses are sufficiently rich and different from each other (see Results after Eq 30).

#### 2. At the fixed point of learning the dendritic rate is proportional to the expected future discounted somatic input rate, but with a longer discount time constant (cf. Eq 12).

Since the future discounted firing rate *F*(*x*) depends on , it is not trivial to solve Eq 47 for . As in Eqs 31 and 10, however, we can show for a linear *φ* and constant *λ*(*s*_{i}) = *λ* that
(48)
with . Indeed, assuming linear *φ* we can rewrite Eq 47 in vector notation and solve for **r**_{V} to obtain
(49) (50)
where we have assumed that *λα* < 1 − *γ* such that the series converges. Powers of *A* evaluate to
(51)
and thus we can rewrite Eq 50 to get,
(52)
which proves the claim in Eq 48. The effective discount factor *γ*_{eff} can be much closer to 1 than *γ*, which corresponds to a much longer effective time constant of discounting. In fact, for *α* → (1 − *γ*)/*λ* we find *γ*_{eff} → 1.
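The emergence of a larger effective discount factor can be checked numerically on a small Markov chain. The sketch below rests on three assumptions that are not quoted from the equations of this section: the discounting operator is taken to be *A* = Σ_{t≥0} *γ*^{t} *T*^{t}, the fixed point is taken to be **r**_{V} = *αA* **r**_{U}, and for linear *φ* we use **r**_{U} = *λ* **r**_{V} + **r**_{I}, following *U* = *λV** + *U**. Under these assumptions one obtains *γ*_{eff} = *γ*/(1 − *αλ*), which is consistent with the convergence condition *λα* < 1 − *γ* and with the limit *γ*_{eff} → 1 stated above. All function names and parameter values are hypothetical.

```python
def mat_vec(T, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(T[i][j] * x[j] for j in range(len(x))) for i in range(len(T))]


def discounted_sum(T, gamma, x, n_iter=400):
    """Approximate sum_{t>=0} gamma^t T^t x by iterating y <- x + gamma * T y."""
    y = list(x)
    for _ in range(n_iter):
        y = [xi + gamma * tyi for xi, tyi in zip(x, mat_vec(T, y))]
    return y


def fixed_point_rate(T, gamma, alpha, lam, r_I, n_iter=60):
    """Iterate r_V <- alpha * A (lam * r_V + r_I) until it settles (assumed fixed point)."""
    r_V = [0.0] * len(r_I)
    for _ in range(n_iter):
        r_U = [lam * v + i for v, i in zip(r_V, r_I)]
        r_V = [alpha * f for f in discounted_sum(T, gamma, r_U)]
    return r_V


def closed_form_rate(T, gamma, alpha, lam, r_I):
    """Same rate expressed as a discounted sum of r_I alone, with gamma_eff."""
    gamma_eff = gamma / (1.0 - alpha * lam)
    scale = alpha / (1.0 - alpha * lam)
    return [scale * f for f in discounted_sum(T, gamma_eff, r_I)]


# Hypothetical 4-state chain (rows sum to 1) with somatic input in the last state only.
T = [[0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0, 0.0]]
r_I = [0.0, 0.0, 0.0, 1.0]
GAMMA, ALPHA, LAM = 0.8, 0.2, 0.5      # satisfies LAM * ALPHA < 1 - GAMMA

iterated = fixed_point_rate(T, GAMMA, ALPHA, LAM, r_I)
closed = closed_form_rate(T, GAMMA, ALPHA, LAM, r_I)

if __name__ == "__main__":
    print(iterated)
    print(closed)
```

With these values *γ* = 0.8 turns into *γ*_{eff} = 0.8/0.9 ≈ 0.89, so the dendritic rate discounts the somatic input over a longer horizon than the plasticity rule itself does.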

#### Remarks.

- For affine *φ*(*u*) = *a*⋅*u* + *c* the equivalent of Eq 50 is
(53)
In a first order approximation, the stationary **r**_{V} for a non-linear *φ* is thus a translated version of the stationary solution for a linear *φ*.
- For input **r**_{I} = 0 and linear *φ* we find that at the stationary point of learning **r**_{V} = 0. Thus we expect that learned weights decay again, once input **r**_{I} is removed.

#### 3. Convergence of learning.

For linear *φ*(*u*) = *ϕ* ⋅ *u* with *ϕ* > 0, we have **r**_{V} = *ϕB* **w** and thus the learning rule in Eq 43 can be written in vector notation as
(54)
where we introduced the diagonal ‘nudging matrix’ Λ = diag(*λ*(*s*_{1}),…,*λ*(*s*_{N})).

With **w*** = −*X*^{+} **c** + (1 − *X*^{+} *X*)**w** the orthogonal projection of **w** onto *W** = {**w**|*X* **w** = −**c**}, where *X*^{+} denotes the Moore-Penrose pseudoinverse of *X*, we are going to show that is a Lyapunov function of the dynamics in Eq 54. With **y** = **w** − **w*** and , the temporal evolution of *L* is given by
where we defined the scalar product 〈**x**,**y**〉 = **x**′ Π **y** and the inequality follows since both *A* and Λ are contracting maps, i.e. and therefore
(55)
Λ is contracting because it is diagonal with entries between 0 and 1, and *A* is contracting because
(56)
where we used the facts that 0 ≤ *A*_{ij} < 1 and the row sums of *A* are equal to and therefore .

### Relation between the learning rule in Eq 42 and TD(1)

For *λ*_{TD} = 1 and therefore , we can rewrite Eq 17 by expanding the delta error in Eq 19 and using the identity to find
(57)
(58)
With small parameter updates in each time step, the terms in Eq 58 approximately cancel each other when summing over subsequent terms: contributes and contributes . What remains are the terms in Eq 57, which resemble the terms in the learning rule in Eq 42.

## Acknowledgments

We thank Christian Pozzorini, Laureline Logiaco and Kristin Völk for valuable comments on the manuscript.

## Author Contributions

Conceived and designed the experiments: RU WS JB. Performed the experiments: JB. Analyzed the data: JB WS. Wrote the paper: JB WS ATG. Methods: JB ATG RU WS.

## References

- 1. Rainer G, Rao SC, Miller EK (1999) Prospective Coding for Objects in Primate Prefrontal Cortex. The Journal of Neuroscience 19: 5493–5505. pmid:10377358
- 2. Reutimann J, Yakovlev V, Fusi S, Senn W (2004) Climbing Neuronal Activity as an Event-Based Cortical Representation of Time. Journal of Neuroscience 24: 3295–3303. pmid:15056709
- 3. Pavlov IP (1927) Conditioned reflexes: an investigation of the physiological activity of the cerebral cortex. Oxford Univ. Press.
- 4. Bangasser DA, Waxler DE, Santollo J, Shors TJ (2006) Trace conditioning and the hippocampus: the importance of contiguity. The Journal of Neuroscience 26: 8702–6. pmid:16928858
- 5. Li N, Chen TW, Guo ZV, Gerfen CR, Svoboda K (2015) A motor cortex circuit for motor planning and movement. Nature 519: 51–56.
- 6. Liu Z, Zhou J, Li Y, Hu F, Lu Y, et al. (2014) Dorsal raphe neurons signal reward through 5-HT and glutamate. Neuron 81: 1360–74. pmid:24656254
- 7. Howe MW, Tierney PL, Sandberg SG, Phillips PEM, Graybiel AM (2013) Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature 500: 575–9. pmid:23913271
- 8. van der Meer MAA, Redish AD (2011) Theta phase precession in rat ventral striatum links place and reward information. The Journal of Neuroscience 31: 2843–2854. pmid:21414906
- 9. Quintana J, Fuster JM (1999) From perception to action: Temporal integrative functions of prefrontal and parietal neurons. Cerebral Cortex 9: 213–221. pmid:10355901
- 10. Miller EK, Erickson CA, Desimone R (1996) Neural mechanisms of visual working memory in prefrontal cortex of the macaque. The Journal of Neuroscience 16: 5154–5167. pmid:8756444
- 11. Sakai K, Miyashita Y (1991) Neural organization for the long-term memory of paired associates. Nature 354: 152–155. pmid:1944594
- 12. Markram H, Lübke J, Frotscher M, Sakmann B (1997) Regulation of synaptic efficacy by coincidence of postsynaptic APs and EPSPs. Science 275: 213–5. pmid:8985014
- 13. Bi GQ, Poo M (2001) Synaptic modification by correlated activity: Hebb’s postulate revisited. Annu Rev Neurosci 24: 139–66. pmid:11283308
- 14. Cichon J, Gan W (2015) Branch-specific dendritic Ca2+ spikes cause persistent synaptic plasticity. Nature 520: 180–5 pmid:25822789
- 15. Yagishita S, Hayashi-Takagi A, Ellis-Davies GC, Urakubo H, Ishii S, et al. (2014) A critical time window for dopamine actions on the structural plasticity of dendritic spines. Science 345: 1616–1620. pmid:25258080
- 16. Pfister JP, Toyoizumi T, Barber D, Gerstner W (2006) Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Computation 18: 1318–1348. pmid:16764506
- 17. Brea J, Senn W, Pfister JP (2013) Matching recall and storage in sequence learning with spiking neural networks. The Journal of Neuroscience 33: 9565–75. pmid:23739954
- 18. Urbanczik R, Senn W (2014) Learning by the Dendritic Prediction of Somatic Spiking. Neuron 81: 521–528. pmid:24507189
- 19. Wise S, Boussaoud D, Johnson P, Caminiti R (1997) Premotor and parietal cortex: corticocortical connectivity and combinatorial computations. Annu Rev Neurosci 20: 25–42. pmid:9056706
- 20. Jaakkola T, Jordan MI, Singh S (1994) On the Convergence of Stochastic Iterative Dynamic Programming Algorithms. Neural Computation 6: 1185–1201.
- 21. Dayan P (1992) The convergence of TD(*λ*) for general *λ*. Machine Learning 8: 341–362.
- 22. Sutton RS (1988) Learning to Predict by the Methods of Temporal Differences. Machine Learning 3: 9–44.
- 23. Urbanczik R, Senn W (2009) Reinforcement learning in populations of spiking neurons. Nature Neuroscience 12: 250–2. pmid:19219040
- 24. Friedrich J, Urbanczik R, Senn W (2011) Spatio-temporal credit assignment in neuronal population learning. PLoS Computational Biology 7: e1002092. pmid:21738460
- 25. Gavornik J, Shuler M, Loewenstein Y, Bear M, Shouval HZ (2009) Learning reward timing in cortex through reward dependent expression of synaptic plasticity. Supporting Material. Proceedings of the National Academy of Sciences 106: 6826.
- 26. He K, Huertas M, Hong S, Tie X, Hell J, et al. (2015) Distinct Eligibility Traces for LTP and LTD in Cortical Synapses. Neuron.
- 27. Solomon PR, Vander Schaaf ER, Thompson RF, Weisz DJ (1986) Hippocampus and trace conditioning of the rabbit’s classically conditioned nictitating membrane response. Behavioral Neuroscience 100: 729–744. pmid:3778636
- 28. Churchland MM, Yu BM, Cunningham JP, Sugrue LP, Cohen MR, et al. (2010) Stimulus onset quenches neural variability: a widespread cortical phenomenon. Nature Neuroscience 13: 369–78. pmid:20173745
- 29. Churchland MM, Abbott LF (2012) Two layers of neural variability. Nature Neuroscience 15: 1472–1474. pmid:23103992
- 30. Laje R, Buonomano DV (2013) Robust timing and motor patterns by taming chaos in recurrent neural networks. Nature Neuroscience 16: 925–933. pmid:23708144
- 31. Hennequin G, Vogels T, Gerstner W (2014) Optimal Control of Transient Dynamics in Balanced Networks Supports Generation of Complex Movements. Neuron 82: 1394–1406. pmid:24945778
- 32. Ludvig EA, Sutton RS, Verbeek E, Kehoe EJ (2009) A computational model of hippocampal function in trace conditioning. In: Advances in Neural Information Processing Systems 21, Curran Associates, Inc. pp. 993–1000.
- 33. Xu S, Jiang W, Poo MM, Dan Y (2012) Activity recall in a visual cortical ensemble. Nature Neuroscience 15: 449–455. pmid:22267160
- 34. Davidson TJ, Kloosterman F, Wilson MA (2009) Hippocampal Replay of Extended Experience. Neuron 63: 497–507. pmid:19709631
- 35. Ziegler L, Zenke F, Kastner DB, Gerstner W (2015) Synaptic Consolidation: From Synapses to Behavioral Modeling. Journal of Neuroscience 35: 1319–1334. pmid:25609644
- 36. Clopath C, Ziegler L, Vasilaki E, Büsing L, Gerstner W (2008) Tag-Trigger-Consolidation: A Model of Early and Late Long-Term-Potentiation and Depression. PLoS Computational Biology 4: e1000248. pmid:19112486
- 37. Pfister JP, Gerstner W (2006) Triplets of spikes in a model of spike timing-dependent plasticity. The Journal of Neuroscience 26: 9673–82. pmid:16988038
- 38. Zhang JC, Lau PM, Bi GQ (2009) Gain in sensitivity and loss in temporal contrast of STDP by dopaminergic modulation at hippocampal synapses. Proceedings of the National Academy of Sciences of the United States of America 106: 13028–13033. pmid:19620735
- 39. Kolodziejski C, Porr B, Wörgötter F (2009) On the asymptotic equivalence between differential Hebbian and temporal difference learning. Neural Computation 21: 1173–1202. pmid:19018698
- 40. Potjans W, Morrison A, Diesmann M (2009) A Spiking Neural Network Model of an Actor-Critic Agent. Neural Computation 21: 301–339. pmid:19196231
- 41. Frémaux N, Sprekeler H, Gerstner W (2013) Reinforcement learning using a continuous time actor-critic framework with spiking neurons. PLoS Computational Biology 9: e1003024. pmid:23592970
- 42. Maass W, Natschläger T, Markram H (2002) Real-time computing without stable states: a new framework for neural computation based on perturbations. Neural Computation 14: 2531–60. pmid:12433288
- 43. Jaeger H, Haas H (2004) Harnessing nonlinearity: predicting chaotic systems and saving energy in wireless communication. Science 304: 78–80. pmid:15064413
- 44. Drew PJ, Abbott LF (2006) Extending the effects of spike-timing-dependent plasticity to behavioral timescales. Proceedings of the National Academy of Sciences of the United States of America 103: 8876–81. pmid:16731625
- 45. Huertas MA, Shuler MGH, Shouval HZ (2015) A Simple Network Architecture Accounts for Diverse Reward Time Responses in Primary Visual Cortex. The Journal of Neuroscience 35: 12659–12672.
- 46. Latimer KW, Yates JL, Meister MLR, Huk AC, Pillow JW (2015) Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science 349: 184–187. pmid:26160947
- 47. Palmer SE, Marre O, Berry MJ, Bialek W (2015) Predictive information in a sensory population. Proceedings of the National Academy of Sciences of the United States of America 112: 6908–13. pmid:26038544
- 48. Kushner HJ, Yin GG (2003) Stochastic Approximation and Recursive Algorithms and Applications, volume 35. Springer Science & Business Media.