
Figure 1.

Scheme of reward-modulated STDP according to Equations 1–4.

(A) Eligibility function fc(t), which scales the contribution of a pre/post spike pair (with the second spike at time 0) to the eligibility trace c(t) at time t. (B) Contributions of a pre-before-post spike pair (red) and a post-before-pre spike pair (green) to the eligibility trace c(t) (black), which is the sum of the red and green curves. According to Equation 1, the change of the synaptic weight w is proportional to the product of c(t) and a reward signal d(t).
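The update rule described above can be condensed into a short sketch: the eligibility trace c(t) accumulates contributions fc from spike pairings and decays exponentially, and the weight changes in proportion to c(t)·d(t) (Equation 1). All parameter values and names below are illustrative placeholders, not the paper's.

```python
import numpy as np

# Hypothetical parameters for illustration (not taken from the paper)
A_PLUS, A_MINUS = 1.0, -1.0    # STDP amplitudes (pre-before-post / post-before-pre)
TAU_PLUS = TAU_MINUS = 0.02    # STDP time constants (s)
TAU_C = 0.5                    # decay time constant of the eligibility trace (s)
DT = 0.001                     # simulation time step (s)
ETA = 0.01                     # learning rate

def f_c(delta_t):
    """Eligibility function: contribution of a spike pair with pre-post
    delay delta_t (positive when the presynaptic spike comes first)."""
    if delta_t >= 0:
        return A_PLUS * np.exp(-delta_t / TAU_PLUS)
    return A_MINUS * np.exp(delta_t / TAU_MINUS)

class RewardModulatedSynapse:
    """One synapse under reward-modulated STDP: dw/dt = ETA * c(t) * d(t)."""

    def __init__(self, w=0.5):
        self.c = 0.0   # eligibility trace c(t)
        self.w = w     # synaptic weight

    def step(self, pair_delays, d):
        """Advance one time step DT.
        pair_delays: pre-post delays (s) of spike pairings occurring now.
        d: current value of the reward signal d(t)."""
        self.c += sum(f_c(s) for s in pair_delays)  # pairings feed the trace
        self.w += ETA * self.c * d * DT             # reward gates the update
        self.c *= np.exp(-DT / TAU_C)               # trace decays in between
        return self.w
```

A pre-before-post pairing followed by positive reward increases the weight; the same trace combined with negative reward decreases it.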


Figure 2.

Differential reinforcement of two neurons, denoted A and B, within a simulated network of 4000 neurons, corresponding to the experimental results shown in Figure 9 of [17] and Figure 1 of [19].

(A) The spike response of 100 randomly chosen neurons at the beginning of the simulation (20 sec–23 sec, left plot), and in the middle of the simulation just before the switch of the reward policy (597 sec–600 sec, right plot). The firing times of the first reinforced neuron A are marked by blue crosses, and those of the second reinforced neuron B by green crosses. (B) The dashed vertical line marks the switch of the reinforcements at t = 10 min. The firing rate of neuron A (blue line) increases while it is positively reinforced in the first half of the simulation and decreases in the second half, when its spiking is negatively reinforced. The firing rate of neuron B (green line) decreases during the negative reinforcement in the first half and increases during the positive reinforcement in the second half of the simulation. The average firing rate of 20 other randomly chosen neurons (dashed line) remains unchanged. (C) Evolution of the average weight of excitatory synapses to the rewarded neurons A and B (blue and green lines, respectively), and of the average weight of 1744 randomly chosen excitatory synapses to other neurons in the circuit (dashed line).


Figure 3.

Setup of the model for the experiment by Fetz and Baker [17].

(A) Schema of the model: The activity of a single neuron in the circuit determines the amount of reward delivered to all synapses between excitatory neurons in the circuit. (B) The reward signal d(t) in response to a spike train (shown at the top) of an arbitrarily selected neuron from a recurrently connected circuit of 4000 neurons. The level of the reward signal d(t) follows the firing rate of the spike train. (C) The eligibility function fc(s) (black curve, left axis), the reward kernel εr(s) delayed by 200 ms (red curve, right axis), and the product of these two functions (blue curve, right axis) as used in our computer experiment. The integral of fc(s+dr)εr(s) is positive, as required by Equation 10 to achieve a positive learning rate for the synapses to the selected neuron.
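The positivity condition illustrated in panel (C) can be checked numerically. The kernel shapes below are illustrative assumptions (alpha-function-like curves), not the forms used in the paper; only the 200 ms reward delay is taken from the caption.

```python
import numpy as np

D_R = 0.2                        # reward delay of 200 ms (from the caption)
DS = 1e-4                        # integration step (s)
s = np.arange(-1.0, 2.0, DS)     # integration grid

def f_c(t):
    """Illustrative eligibility function: a positive alpha-shaped lobe for
    t >= 0 plus a small negative lobe for t < 0 (shapes are assumptions)."""
    pos = np.where(t >= 0, (t / 0.2) * np.exp(1 - t / 0.2), 0.0)
    neg = np.where(t < 0, -0.3 * np.exp(t / 0.05), 0.0)
    return pos + neg

def eps_r(t):
    """Illustrative reward kernel: a brief positive alpha-shaped bump."""
    return np.where(t >= 0, (t / 0.05) * np.exp(1 - t / 0.05), 0.0)

# Condition from Equation 10: the overlap integral of f_c(s + d_r) with
# eps_r(s) must be positive to yield a positive effective learning rate
# for the synapses onto the reinforced neuron.
overlap = np.sum(f_c(s + D_R) * eps_r(s)) * DS
```

With these shapes the delayed eligibility function is still positive wherever the reward kernel has mass, so the overlap integral comes out positive.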


Figure 4.

Simulation of the experiment by Fetz and Baker [17] for the case where an arbitrarily selected neuron triggers global rewards when it increases its firing rate.

(A) Spike response of 100 randomly chosen neurons within the recurrent network of 4000 neurons at the beginning of the simulation (20 sec–23 sec, left plot), and at the end of the simulation (the last 3 seconds, right plot). The firing times of the reinforced neuron are marked by blue crosses. (B) The firing rate of the positively rewarded neuron (blue line) increases, while the average firing rate of 20 other randomly chosen neurons (dashed line) remains unchanged. (C) Evolution of the average weight of excitatory synapses to the reinforced neuron (blue line), and of the average weight of 1663 randomly chosen excitatory synapses to other neurons in the circuit (dashed line). (D) Spike trains of the reinforced neuron before and after learning. (E) Histogram of the time differences between presynaptic and postsynaptic spikes (bin size 0.5 ms), averaged over all excitatory synapses to the reinforced neuron. The black curve represents the histogram values for positive time differences (when the presynaptic spike precedes the postsynaptic spike), and the red curve represents the histogram for negative time differences.
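The histogram in panel (E) can be sketched as follows. Only the 0.5 ms bin size comes from the caption; the window width and the helper's name and interface are my own.

```python
import numpy as np

def pre_post_histogram(pre_spikes, post_spikes, window=0.05, bin_size=0.0005):
    """Histogram of time differences t_post - t_pre (in s) over all spike
    pairs within +/- window. Positive bins: the presynaptic spike precedes
    the postsynaptic spike; negative bins: the reverse order.
    The 0.5 ms bin size matches the caption; the window is an assumption."""
    diffs = [t_post - t_pre
             for t_post in post_spikes
             for t_pre in pre_spikes
             if -window <= t_post - t_pre <= window]
    bins = np.arange(-window, window + bin_size, bin_size)
    counts, edges = np.histogram(diffs, bins=bins)
    return counts, edges
```

Averaging such per-synapse histograms over all excitatory synapses to the reinforced neuron would then give the black (positive side) and red (negative side) curves.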


Figure 5.

Evolution of the dynamics of a recurrent network of 4000 LIF neurons during application of reward-modulated STDP.

(A) Distribution of the synaptic weights of excitatory synapses to 50 randomly chosen non-reinforced neurons, plotted for 4 different periods of simulated biological time. The weights are averaged over 10 samples within these periods. The colors of the curves and the corresponding intervals are as follows: red (300–360 sec), green (600–660 sec), blue (900–960 sec), magenta (1140–1200 sec). (B) The distribution of average firing rates of the non-reinforced excitatory neurons in the circuit, plotted for the same time periods as in (A). The colors of the curves are the same as in (A). The distribution of the firing rates of the neurons in the circuit remains unchanged during the simulation, which covers 20 minutes of biological time. (C) Cross-correlogram of the spiking activity in the circuit, averaged over 200 pairs of non-reinforced neurons and over 60 s, with a bin size of 0.2 ms, for the period between 300 and 360 seconds of simulated biological time. It is calculated as the cross-covariance divided by the square root of the product of variances. (D) As in (C), but for the period between 1140 and 1200 seconds. (Separate plots of (B), (C), and (D) for two types of excitatory neurons that received different amounts of noise currents are given in Figure S1 and Figure S2.)
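The normalization used in panel (C), cross-covariance divided by the square root of the product of variances, can be sketched for binned spike trains. The function name and interface below are hypothetical.

```python
import numpy as np

def cross_correlogram(x, y, max_lag):
    """Normalized cross-correlogram of two binned spike trains x and y:
    for each lag (in bins), cross-covariance / sqrt(var(x) * var(y)),
    as described for panel (C). Assumes max_lag << len(x)."""
    x = np.asarray(x, float) - np.mean(x)   # subtract means -> covariance
    y = np.asarray(y, float) - np.mean(y)
    denom = np.sqrt(x.var() * y.var())
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    cc = np.empty(len(lags))
    for i, lag in enumerate(lags):
        if lag >= 0:
            cc[i] = np.mean(x[:n - lag] * y[lag:]) / denom
        else:
            cc[i] = np.mean(x[-lag:] * y[:n + lag]) / denom
    return lags, cc
```

For identical trains the value at lag 0 is 1 by construction; averaging such curves over many neuron pairs and 60 s windows yields the plotted correlogram.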


Figure 6.

Setup for reinforcement learning of spike times.

(A) Architecture. The trained neuron receives n input spike trains. The neuron μ* receives the same inputs plus additional inputs not accessible to the trained neuron. The reward is determined by the timing differences between the action potentials of the trained neuron and those of the neuron μ*. (B) A reward kernel with an optimal offset from the origin of tκ = −6.6 ms. The optimal offset for this kernel was calculated with respect to the parameters of computer simulation 1 in Table 1. The reward is positive if the neuron spikes around the target spike or somewhat later, and negative if the neuron spikes much too early.


Figure 7.

Results for reinforcement learning of exact spike times through reward-modulated STDP.

(A) Synaptic weight changes of the trained LIF neuron, for 5 different runs of the experiment. The curves show the average of the synaptic weights that should converge to (dashed lines), and the average of the synaptic weights that should converge to (solid lines), with a different color for each simulation run. (B) Comparison of the output of the trained neuron before (top trace) and after learning (bottom trace). The same input spike trains and the same noise inputs were used before and after the 2-hour training. The second trace from the top shows those spike times S* which are rewarded; the third trace shows the realizable part of S* (i.e., those spikes which the trained neuron could potentially learn to reproduce, since the neuron μ* produces them without its 10 extra spike inputs). The close match between the third and fourth traces shows that the trained neuron performs very well. (C) Evolution of the spike correlation between the spike train of the trained neuron and the realizable part of the target spike train S*. (D) The angle (in radians) between the weight vector w of the trained neuron and the weight vector w* of the neuron μ* during the simulation. (E) Synaptic weights at the beginning of the simulation are marked with ×, and at the end of the simulation with •, for each plastic synapse of the trained neuron. (F) Evolution of the synaptic weights w/wmax during the simulation (we had chosen for i<50, for i≥50).
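The quantity plotted in panel (D) is the ordinary angle between the two weight vectors; a minimal helper (the name is mine) makes the computation explicit.

```python
import numpy as np

def weight_angle(w, w_star):
    """Angle in radians between the trained neuron's weight vector w and
    the target neuron's weight vector w*, as plotted in panel (D)."""
    w, w_star = np.asarray(w, float), np.asarray(w_star, float)
    cos = np.dot(w, w_star) / (np.linalg.norm(w) * np.linalg.norm(w_star))
    # clip guards against tiny floating-point excursions outside [-1, 1]
    return np.arccos(np.clip(cos, -1.0, 1.0))
```

An angle approaching 0 indicates that learning has aligned w with w* up to a scale factor.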


Figure 8.

Test of the validity of the analytically derived conditions 13–15, which relate the parameters required for successful learning with reward-modulated STDP.

Predicted average weight changes (black bars) calculated from Equation 22 match in sign and magnitude the actual average weight changes (gray bars) in computer simulations, for 6 different experiments with different parameter settings (see Table 1). (A) Weight changes for synapses with . (B) Weight changes for synapses with . Four cases where conditions 13–15 are not fulfilled are shaded in light gray. In all four of these cases the weights move in the opposite direction, i.e., a direction that decreases the reward.


Table 1.

Parameter values used for computer simulation 3 (see Figure 8).


Figure 9.

Training a LIF neuron to classify purely temporal presynaptic firing patterns: a positive reward is given for firing of the neuron in response to a temporal presynaptic firing pattern P, and a negative reward for firing in response to another temporal pattern N.

(A) The spike response of the neuron for individual trials, during 500 training trials where pattern P is presented. Only the spikes from every 4th trial are plotted. (B) As in (A), but in response to pattern N. (C) The membrane potential Vm(t) of the neuron during a trial where pattern P is presented, before (blue curve) and after training (red curve), with the firing threshold removed. The variance of the membrane potential increases during learning, as predicted by the theory. (D) As in (C), but for pattern N. The variance of the membrane potential for pattern N decreases during learning, as predicted by the theory. (E) The membrane potential Vm(t) of the neuron (including action potentials) during a trial where pattern P is presented, before (blue curve) and after training (red curve). The number of spikes increases. (F) As in (E), but for trials where pattern N is given as input. The number of spikes decreases. (G) Average number of output spikes per trial before learning, in response to pattern P (gray bars) and pattern N (black bars), for 6 experiments with different randomly generated patterns P and N and different random initial synaptic weights of the neuron. (H) As in (G), for the same experiments, but after learning. The average number of spikes per trial increases after training for pattern P, and decreases for pattern N.


Figure 10.

A LIF neuron is trained through reward-modulated STDP to discriminate, as a “readout neuron”, responses of generic cortical microcircuits to utterances of different spoken digits.

(A) Circuit response to an utterance of digit “one” (spike trains of 200 out of 540 neurons in the circuit are shown). The response within the time period from 100 to 200 ms (marked in gray) is used as a reference in the subsequent 3 panels. (B) The circuit response from (A) (black) for the period between 100 and 200 ms, and the circuit response to an utterance of digit “two” (red). (C) The circuit spike response from (A) (black) and a circuit response to another utterance of digit “one” (red), also shown for the period between 100 and 200 ms. (D) The circuit spike response from (A) (black), and another circuit response to the same utterance in another trial (red). The responses differ due to the presence of noise in the circuit. (E) Spike response of the LIF readout neuron for different trials during learning, for trials where utterances of digit “two” (left plot) and digit “one” (right plot) are presented as circuit inputs. The spikes from every 4th trial are plotted. (F) Average number of spikes in the response of the readout neuron during training, in response to digit “one” (blue) and digit “two” (green). The number of spikes was averaged over 40 trials. (G) The membrane potential Vm(t) of the neuron during a trial where an input pattern corresponding to an utterance of digit “two” is presented, before (blue curve) and after training (red curve), with the firing threshold removed. (H) As in (G), but for an input pattern corresponding to an utterance of digit “one”. The variance of the membrane potential increases during learning for utterances of the rewarded digit, and decreases for the non-rewarded digit.


Table 2.

Mean values of the U, D, and F parameters in the model from [37] for the short-term dynamics of synapses, depending on the type of the presynaptic and postsynaptic neuron (excitatory or inhibitory).


Table 3.

Specific parameter values for the cortical microcircuits in computer simulations 1 and 5.


Table 4.

Specific parameter values for the trained (readout) neurons in computer simulations 2, 4, and 5.
