An Imperfect Dopaminergic Error Signal Can Drive Temporal-Difference Learning

An open problem in the field of computational neuroscience is how to link synaptic plasticity to system-level learning. A promising framework in this context is temporal-difference (TD) learning. Experimental evidence that supports the hypothesis that the mammalian brain performs temporal-difference learning includes the resemblance of the phasic activity of the midbrain dopaminergic neurons to the TD error and the discovery that cortico-striatal synaptic plasticity is modulated by dopamine. However, as the phasic dopaminergic signal does not reproduce all the properties of the theoretical TD error, it is unclear whether it is capable of driving behavior adaptation in complex tasks. Here, we present a spiking temporal-difference learning model based on the actor-critic architecture. The model dynamically generates a dopaminergic signal with realistic firing rates and exploits this signal to modulate the plasticity of synapses as a third factor. The predictions of our proposed plasticity dynamics are in good agreement with experimental results with respect to dopamine, pre- and post-synaptic activity. An analytical mapping from the parameters of our proposed plasticity dynamics to those of the classical discrete-time TD algorithm reveals that the biological constraints of the dopaminergic signal entail a modified TD algorithm with self-adapting learning parameters and an adapting offset. We show that the neuronal network is able to learn a task with sparse positive rewards as fast as the corresponding classical discrete-time TD algorithm. However, the performance of the neuronal network is impaired with respect to the traditional algorithm on a task with both positive and negative rewards and breaks down entirely on a task with purely negative rewards. Our model demonstrates that the asymmetry of a realistic dopaminergic signal enables TD learning when learning is driven by positive rewards but not when driven by negative rewards.

If λ(t) is constant for t > t_0 we get:

For appropriately chosen time constants of the pre-synaptic efficacy and activity traces, the plasticity of the synapse is only significant when the agent has recently exited state s_n and negligible otherwise. Assuming the transition occurs at t = 0, the net change in the mean synaptic weight of state s_n is:

where τ_asp is the period for which the action neurons are suppressed so that they do not fire and τ_ph is the duration of the phasic activity after a state transition.
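As a minimal sketch of this net weight change, assuming a multiplicative three-factor rule in which the dopaminergic rate relative to its baseline D_b (defined below) gates the product of the pre-synaptic traces and the post-synaptic activity trace; the constant κ_pl and the exact factorisation are assumptions, not taken from the original derivation:

```latex
% Hedged sketch of the net weight change over the window [0, tau_asp].
% The three-factor form and the constant \kappa_{pl} are assumptions.
\[
\Delta \bar{w}_{s_n} \;\approx\; \kappa_{\mathrm{pl}}
  \int_{0}^{\tau_{\mathrm{asp}}}
    \bigl(\lambda_{\mathrm{d}}(t) - D_{\mathrm{b}}\bigr)\,
    \bar{\varepsilon}_{s_n}(t)\,\bar{\lambda}_{s_n}(t)\,\bar{\lambda}_{\mathrm{STR}}(t)\,
  \mathrm{d}t
\]
```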
To calculate Eq. (S2-3) we need to determine expressions for λ_sn(t), ε_sn(t), λ_d(t) and λ_STR(t). The rate of the state neurons representing s_n is λ(s) whilst the agent is in s_n. When the agent leaves s_n, the neurons are no longer strongly stimulated by the environment and so the rate drops to approximately 0. Assuming a transition out of s_n at t = 0, the mean efficacy trace and the mean pre-synaptic activity trace are:

The dopaminergic rate λ_d(t) is simply the constant baseline activity D_b, except during the phasic activity of duration τ_ph.
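As a minimal sketch, assuming the pre-synaptic traces simply relax exponentially after the transition at t = 0 (the time constants τ_ε and τ_λ are assumed names, not taken from the original), and writing the phasic dopaminergic rate as λ_ph as in the next paragraph:

```latex
% Hedged sketch: exponential relaxation of the pre-synaptic traces after the
% transition at t = 0, and the piecewise-constant dopaminergic rate.
% \tau_\varepsilon and \tau_\lambda are assumed time-constant names.
\[
\bar{\varepsilon}_{s_n}(t) \approx \bar{\varepsilon}_{s_n}(0)\, e^{-t/\tau_{\varepsilon}},
\qquad
\bar{\lambda}_{s_n}(t) \approx \lambda(s)\, e^{-t/\tau_{\lambda}},
\qquad
\lambda_{\mathrm{d}}(t) =
  \begin{cases}
    \lambda_{\mathrm{ph}} & 0 \le t < \tau_{\mathrm{ph}},\\[2pt]
    D_{\mathrm{b}}        & \text{otherwise.}
  \end{cases}
\]
```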
The phasic firing rate λ_ph after a transition from a state s_n to a state s_{n+1} is a function of the weight difference of the two corresponding states. We assume this firing rate to be constant for a particular state transition. For the sake of simplicity we consider the case that the phasic activity starts at t = 0. From Eq. (S2-2) follows:

The post-synaptic activity λ_STR(t) depends on the input from the currently active state, i.e. s_n whilst the agent is in s_n, and s_{n+1} after the transition at t = 0. Therefore, for t ∈ [0, τ_asp] the mean post-synaptic activity trace is given by:

The mean synaptic weight change is:

One major difference between the traditional TD error and the dopaminergic signal is that the dopaminergic firing rate λ_ph depends non-linearly on successive reward estimates, ∆w, whereas the TD error is a linear function of successive value functions (see Fig. 4 in the main text). However, it is possible to approximate the non-linear function for a given reward signal piecewise in ∆w by a linear function:

To compare the synaptic weight change to the value function update we map the value function to the units of synaptic weights:

where I_r is the reward amplitude (the direct current applied to the dopamine neurons) for the parameters chosen in our simulation.
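A minimal sketch of this piecewise linearisation, in terms of the slope m_d and offset c_d introduced below (the linearisation intervals themselves are not specified here):

```latex
% Piecewise-linear approximation of the phasic rate as a function of the weight
% difference \Delta w. The slope m_d and offset c_d depend on the range of
% \Delta w and on the reward current I_r (see the table of values below).
\[
\lambda_{\mathrm{ph}}(\Delta w) \;\approx\; m_{\mathrm{d}}\,\Delta w + c_{\mathrm{d}}
\qquad \text{within a given interval of } \Delta w .
\]
```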
Because m_d and c_d depend on the range of ∆w and on the direct current I_r applied to the dopamine neurons, the weight update δw can be interpreted as a TD(0) value-function update with self-adapting learning parameters and a self-adapting offset that depend on the current weight change and reward.
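As a minimal sketch of this interpretation (the functional forms of the learning rate α and offset b, and the use of the standard discount factor γ, are assumptions, not taken from the original derivation):

```latex
% Hedged sketch: the network's weight update read as a discrete-time TD(0)
% value-function update whose learning rate and offset adapt with the current
% weight change \Delta w and the reward current I_r.
\[
\delta w \;\approx\; \alpha(\Delta w, I_{\mathrm{r}})\,
  \bigl[\, r + \gamma\, V(s_{n+1}) - V(s_n) \,\bigr]
  \;+\; b(\Delta w, I_{\mathrm{r}})
\]
```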
The values of m_d and c_d for the parameters chosen in our simulations for a direct current of I_r = 600 pA and I_r = 0 pA are summarized in Table S2-1.

Policy mapping
The probability of choosing a certain action in a certain state is given by the probability that the actor neuron encoding the action fires first in response to the input from the cortical neurons representing the state. This probability depends on the mean strength of the synapses connecting the cortical 'state' neurons to the actor neuron in comparison to the mean strength of the synapses connecting the state neurons to the other competing actor neurons. A mapping of synaptic weights to probabilities for a similar architecture was derived in [1]. To obtain the mapping, first spike time distributions are measured as a function of synaptic weight and fitted with a gamma probability density function f(t | κ, θ) = t^{κ−1} e^{−t/θ} / (θ^κ Γ(κ)), where Γ is the gamma function. The probability that an actor neuron p with mean synaptic weights w_p fires before an actor neuron q with mean synaptic weights w_q is given by:

Here, γ(t, κ) is the regularized lower incomplete gamma function, γ(t, κ) = 1/Γ(κ) ∫_0^t x^{κ−1} e^{−x} dx. The policy defined by the probabilities of the respective action neurons firing first is not identical to the Gibbs softmax method used to select actions in the discrete-time algorithmic implementation (see introduction). However, the precise nonlinear function used to select actions is not critical; any selection mechanism that predominantly selects the most preferred action but occasionally selects a less preferred action would be expected to generate a similar policy.
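As a minimal sketch of how this first-spike probability can be evaluated numerically (not the authors' code; the gamma parameters below are purely illustrative and would in practice come from the fits to the measured first-spike-time distributions):

```python
# Hedged sketch: probability that actor neuron p fires before actor neuron q,
# given independent gamma-distributed first-spike times fitted to the measured
# distributions. Parameter values are illustrative, not taken from the paper.
import numpy as np
from scipy import stats, integrate

def prob_p_fires_first(kappa_p, theta_p, kappa_q, theta_q):
    """P(T_p < T_q) for independent T_p ~ Gamma(kappa_p, theta_p),
    T_q ~ Gamma(kappa_q, theta_q)."""
    # Integrate f_p(t) * P(T_q > t) over t >= 0.
    integrand = lambda t: (stats.gamma.pdf(t, a=kappa_p, scale=theta_p)
                           * stats.gamma.sf(t, a=kappa_q, scale=theta_q))
    value, _ = integrate.quad(integrand, 0.0, np.inf)
    return value

# Illustrative: equal shapes, p has the shorter mean first-spike time,
# so the result must exceed 0.5.
print(prob_p_fires_first(kappa_p=3.0, theta_p=5.0, kappa_q=3.0, theta_q=8.0))
```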

Discrete-time simulation
In order to have a reference for the learning performance of the neuronal network in the grid-world task, we implemented the corresponding classical discrete-time TD algorithm. The discrete-time algorithmic implementation selects actions by the Gibbs softmax method (see introduction). As this nonlinear function cannot be mapped to the neuronal action selection process on the basis of first spike time probabilities, we arbitrarily set the learning parameter for the policy update to β = 0.3. The learning behaviour is not particularly sensitive to the choice of β in the range [0.1, 0.5]; for higher values the learning is less stable, leading to a worse equilibrium performance (data not shown). Similarly to the neuronal implementation, we restrict the maximal and minimal probabilities of selecting an action by restricting the maximal and minimal values of the action preferences p to the range [1, 5.8]. This results in a maximum probability of choosing an action of 97.59%, as for the neuronal implementation, and a minimal probability of 0.27%, compared to the value of 2.82% in the neuronal implementation. If this constraint is relaxed, the discrete-time algorithmic implementation achieves a slightly better equilibrium performance.
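As a minimal sketch of this clipped Gibbs softmax action selection (not the authors' code; assuming four available actions, one per grid-world direction, which reproduces the quoted extreme probabilities of about 97.59% and 0.27%):

```python
# Hedged sketch: Gibbs softmax action selection with action preferences
# clipped to [1, 5.8], as described above.
import numpy as np

P_MIN, P_MAX = 1.0, 5.8  # bounds on the action preferences p

def softmax_policy(preferences):
    """Return action-selection probabilities under the Gibbs softmax method."""
    p = np.clip(np.asarray(preferences, dtype=float), P_MIN, P_MAX)
    e = np.exp(p - p.max())        # subtract the maximum for numerical stability
    return e / e.sum()

# One preference at the upper bound, three at the lower bound:
print(softmax_policy([5.8, 1.0, 1.0, 1.0]))  # top action ~0.9759
# One preference at the lower bound, three at the upper bound:
print(softmax_policy([1.0, 5.8, 5.8, 5.8]))  # least-preferred action ~0.0027
```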