Fig 1.
The organization of the basal ganglia.
Circles denote neural populations in the areas indicated by labels next to them, where D1 and D2 correspond to striatal neurons expressing D1 and D2 receptors respectively, STN stands for the subthalamic nucleus, GPe for the external segment of globus pallidus, and Output for the output nuclei of the basal ganglia, i.e. the internal segment of globus pallidus and the substantia nigra pars reticulata. Arrows and lines ending with circles denote excitatory and inhibitory connections respectively.
Fig 2.
Qualitative description of learning payoffs and costs.
(a) Operant conditioning chamber setup: a rat obtains a food pellet by pressing a lever. (b) Diagrams of changes in the weights G and N associated with lever-pressing at each stage of the experience presented in panel (a). In all diagrams, the black circles represent the cortical neurons selective for the state (being in the operant box), and the green and red circles represent the Go and No-Go populations of striatal neurons, respectively, selective for the action (pressing the lever). The thickness of the arrows linking the circles represents the connection strength between the respective neuron populations. The blue shading in the background indicates the strength of the immediate reinforcement, with a colour intensity proportional to the magnitude of reinforcement.
Fig 3.
The incremental construction of the learning rules.
(a)–(c) The different stages in the construction of the learning rules. All panels feature a mathematical formulation of the rules at the given stage and a simulation of these rules. The reinforcements in those simulations, indicated by black dots, alternate between a fixed payoff of magnitude 20 and a fixed cost of −20. The Go weights G are depicted in green, the negative No-Go weights −N are depicted in red. The parameters used in the simulations were α = 0.300, ϵ = 0.443 and λ = 0.093. (d)–(f) Definition, visualization and properties of the nonlinear function fϵ.
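The simulated dynamics described in this caption can be sketched in a few lines of Python. The update rules and the form of fϵ used here are assumptions made only for this illustration; the exact equations are constructed in the main text:

```python
# Hypothetical sketch of the Go/No-Go learning dynamics. The rule forms
# below (decay term lam*G, asymmetric f_eps) are assumed for illustration,
# not the paper's exact equations.
def f_eps(x, eps):
    """Assumed asymmetric nonlinearity: passes positive inputs unchanged,
    scales negative inputs by eps."""
    return x if x >= 0 else eps * x

def simulate(deltas, alpha=0.3, eps=0.443, lam=0.093):
    G, N = 0.0, 0.0
    trace = []
    for d in deltas:
        G += alpha * (f_eps(d, eps) - lam * G)    # Go weights driven by +delta
        N += alpha * (f_eps(-d, eps) - lam * N)   # No-Go weights driven by -delta
        trace.append((G, N))
    return trace

# Prediction errors alternating between a payoff of 20 and a cost of -20,
# as in the simulated panels.
trace = simulate([20 if t % 2 == 0 else -20 for t in range(30)])
```

Plotting G in green and −N in red over the 30 trials gives the kind of oscillating, gradually converging traces shown in the panels, under these assumed rules.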
Fig 4.
The relation of reinforcement statistics to payoff and cost.
The graph shows a representative reinforcement distribution over the magnitude r of all received reinforcements. The parts of the distribution that indicate negative reinforcements (costs) are coloured red, while the parts that indicate positive reinforcements (payoffs) are coloured green. The mean q and the mean spread s are indicated above the distribution, while the mean cost −n and the mean payoff p are indicated below it.
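The statistics in this figure can be checked numerically. The definitions of p, n, q and s below are assumptions made for this sketch (the paper's exact definitions may differ); under these definitions, q = p − n and s = p + n hold identically:

```python
# Numerical sketch of the reinforcement statistics. The definitions of
# p, n, q, s below are assumed for this illustration.
import random

random.seed(0)
# A sample reinforcement distribution: a mixture of payoffs and costs.
rs = [random.gauss(10, 3) for _ in range(5000)] + \
     [random.gauss(-6, 2) for _ in range(5000)]

p = sum(max(r, 0.0) for r in rs) / len(rs)   # mean payoff (positive part)
n = sum(max(-r, 0.0) for r in rs) / len(rs)  # mean cost (negative part)
q = sum(rs) / len(rs)                        # mean reinforcement
s = sum(abs(r) for r in rs) / len(rs)        # mean absolute reinforcement

# Under these definitions, q = p - n and s = p + n hold identically.
assert abs(q - (p - n)) < 1e-9
assert abs(s - (p + n)) < 1e-9
```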
Fig 5.
Simulations of learning with predictable and stochastic reinforcements.
In all graphs, the collective strength G of the Go weights is depicted in green, while the negative collective strength −N of the No-Go weights is depicted in red. The received reinforcements are indicated by solid black dots in the panels on the left, and by transparent black dots in the panels on the right. Each simulation shows how G and N change due to the reception of 30 prediction errors. Panel (a) contains a simulation based on predictable, alternating reinforcements; it also lists the parameter values used for the simulations. Panels (b) to (d) show both single and averaged simulations of stochastic reinforcements: on the left, we show a single learning sequence, with reinforcements sampled from different distributions; on the right, we show averages over many such sequences. There, the mean weights are depicted as green and red lines, while the shaded green and red areas around these lines indicate one standard deviation of G and N respectively. The bars to the right of the averaged learning curves indicate the mean and mean spread of the respective reinforcement distributions.
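The averaged curves and the one-standard-deviation bands can be illustrated with a short simulation over many stochastic learning sequences. As above, the update rules and fϵ are assumed forms for illustration only:

```python
# Averaging many learning sequences with stochastic reinforcements.
# The update rules and f_eps are assumed forms, not the paper's exact
# equations.
import random, statistics

random.seed(1)

def f_eps(x, eps=0.443):
    return x if x >= 0 else eps * x

def run(rewards, alpha=0.3, eps=0.443, lam=0.093):
    G = N = 0.0
    for r in rewards:
        G += alpha * (f_eps(r, eps) - lam * G)
        N += alpha * (f_eps(-r, eps) - lam * N)
    return G, N

# 500 independent sequences of 30 reinforcements each, sampled from a
# hypothetical zero-mean distribution.
n_runs, n_trials = 500, 30
finals = [run([random.gauss(0, 10) for _ in range(n_trials)])
          for _ in range(n_runs)]
mean_G = statistics.mean(g for g, _ in finals)
mean_N = statistics.mean(n for _, n in finals)
sd_G = statistics.stdev(g for g, _ in finals)   # width of the shaded band
```

The mean trajectories correspond to the green and red lines in the right-hand panels, and sd_G to the width of the shaded area at the final trial.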
Fig 6.
Effects of D2 blocking on the willingness to exert effort.
(a) Schematic illustration of the experimental setup. (b) Action selection in the control state. Green and red circles on the left denote striatal Go and No-Go neurons associated with pressing the lever, while the green and red circles on the right denote the neurons associated with approaching the free food. The strengths of the synaptic connections, which result from simulated learning, are indicated by the thickness of the arrows and by the labels. The parameters used for the simulations were obtained through a fit of the model to the experimental data. The blue circle represents a population of dopaminergic neurons, and its shading indicates the level of activity. (c) Action selection in the dopamine-depleted state. The notation is the same as in panel (b), with the thick red circles indicating enhanced activity in the No-Go population, which results from blocked dopaminergic inhibition (symbolised by the smaller inhibitory projections of the dopamine neurons).
Fig 7.
fI-curves of D1- and D2-expressing neurons at different levels of receptor activation.
(a) fI-curves of a D1-expressing pyramidal neuron, replotted from [23]. The blue points are recorded from a neuron at a higher level of D1 receptor activation (e.g. with dopamine present); the black points are recorded at a lower level of receptor activation (e.g. without dopamine). Smooth curves have been obtained from the data through LOESS regression to serve as visual guides (black and blue lines). (b) fI-curves of a D2-expressing neuron, replotted from [24]. The blue points are recorded from a neuron at a higher level of D2 receptor activation (e.g. in the presence of the D2 agonist quinpirole); the black points are recorded from a neuron in the control group at a lower level of D2 activation (e.g. in the absence of the agonist). As in panel (a), LOESS curves (black and blue lines) have been added as visual guides.
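Visual-guide curves of the kind mentioned here can be produced with a small local-regression routine. The smoother below (tricube-weighted local linear fits) and the data it runs on are illustrative only, not the authors' fitting code or the recorded fI data:

```python
# Minimal LOESS-style smoother: a locally weighted linear regression with
# tricube weights. The fI data below are hypothetical.
import random

def loess(xs, ys, x0, frac=0.5):
    """Locally weighted linear fit evaluated at x0."""
    n = len(xs)
    k = max(2, int(frac * n))
    # bandwidth: distance to the k-th nearest neighbour of x0
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    w = [(1 - min(abs(x - x0) / h, 1) ** 3) ** 3 for x in xs]  # tricube
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs)) or 1e-12
    return my + (num / den) * (x0 - mx)

random.seed(0)
# Hypothetical fI data: firing rate rises roughly linearly above a current
# threshold, plus recording noise.
currents = [i * 10.0 for i in range(20)]                    # injected current (pA)
rates = [max(0.0, 0.3 * (I - 50)) + random.gauss(0, 2) for I in currents]

smooth = [loess(currents, rates, I) for I in currents]      # visual-guide curve
```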
Fig 8.
Frequency of choosing pellets (dark blue) and lab chow (light blue) in control and D2-blocked states.
The top displays (a) and (b) correspond to a condition with free pellets, while the bottom displays (c) and (d) correspond to a condition where pressing a lever was required to obtain a pellet. The left displays (a) and (c) re-plot experimental data. The values in (a) were taken from Figure 1 in the paper by Salamone et al. [22]: pellet consumption was 15.5g and 15.7g in the control and D2-blocked states, while chow consumption was 0.2g and 0.8g respectively. The values in (c) were taken from Figure 4 in [22]: pellet consumption was 7.2g and 2.1g in the control and D2-blocked states, while chow consumption was 3.9g and 7g respectively. The right displays (b) and (d) show the results of simulations. The parameters used to simulate learning were α = 0.1, ϵ = 0.6327 and λ = 0.0204.
Fig 9.
Actor-only in comparison with actor-critic learning.
The columns labeled with ‘action value 1’ (panels a and d) and ‘action value 2’ (panels b and e) show the simulated evolution of the collective synaptic weights G and N of the actor network over 30 successive trials. The first row (panels a and b) shows the evolution of the actor network in an actor-only architecture, while the second row (panels d and e) shows the evolution of the actor in an actor-critic architecture. The weights G are drawn as solid green lines, the negative weights −N are drawn as solid red lines. The reinforcements obtained by choosing the respective actions are indicated by black dots. For the actor-critic simulations (second row), we additionally provide the evolution of the state value in panel c. There, the state value Vcritic is represented by a solid purple line. The expected reinforcements of both actions are indicated by dashed horizontal lines. The parameter settings used in these simulations were α = 0.4, ϵ = 0.519, λ = 0.1013 and β = 0.9. The same set of parameters was used for both the actor-only and the actor-critic model.
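The contrast between the two architectures can be sketched with a simplified single-weight simulation. The teaching-signal forms (actor-only: δ = r; actor-critic: δ = r − V) and the reading of β as the critic's learning rate are assumptions made for this sketch:

```python
# Simplified sketch (single Go weight) contrasting the teaching signals of
# the actor-only and actor-critic architectures. Assumed forms, not the
# paper's exact equations:
#   actor-only:   delta = r
#   actor-critic: delta = r - V, with the critic V <- V + beta * (r - V)
def f_eps(x, eps=0.519):
    return x if x >= 0 else eps * x

alpha, lam, beta = 0.4, 0.1013, 0.9
rewards = [10.0] * 30               # hypothetical constant reinforcement

# Actor-only: the raw reinforcement drives plasticity on every trial.
G_only = 0.0
for r in rewards:
    G_only += alpha * (f_eps(r) - lam * G_only)

# Actor-critic: the prediction error shrinks as V converges, so the actor
# weight stops growing and decays toward zero.
G_ac = V = 0.0
for r in rewards:
    delta = r - V
    G_ac += alpha * (f_eps(delta) - lam * G_ac)
    V += beta * delta               # beta assumed to be the critic's rate
```

Under these assumptions the actor-only weight keeps tracking the reinforcement, while the actor-critic weight decays once the critic has converged.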
Fig 10.
Relationship of learning rules to synaptic plasticity and receptor properties.
(a) Instantaneous reinforcement r when an action with effort n is selected to obtain payoff p. (b) Cortico-striatal weights before the action, after performing the action, and after obtaining the payoff. Green and red circles correspond to striatal Go and No-Go neurons, and the thickness of the lines indicates the strength of synaptic connections. The intensity of the blue background indicates the dopaminergic teaching signal at different moments of time. (c) The average excitatory post-synaptic potential (EPSP) in striatal neurons produced by cortical stimulation as a function of time in the experiment reported in [11]. The vertical black lines indicate the time when synaptic plasticity was induced by successive stimulation of cortical and striatal neurons. The amplitude of EPSPs is normalized to the baseline before the stimulation, indicated by the horizontal dashed lines. The green and red dots indicate the EPSPs of Go and No-Go neurons respectively. Displays with white background show the data from experiments with rat models of Parkinson’s disease, while the displays with blue background show the data from experiments in the presence of corresponding dopamine receptor agonists. The four displays re-plot the data from Figures 3E, 3B, 3F and 1H in [11]. (d) Changes in dopamine receptor occupancy. The green and red curves show the probabilities of D1 and D2 receptor occupancies in a biophysical model [30]. The two dashed blue lines in each panel indicate the levels of dopamine in dorsal (60 nM) and ventral (85 nM) striatum estimated on the basis of spontaneous firing of dopaminergic neurons using the biophysical model [32]. Displays with white and blue backgrounds illustrate changes in receptor occupancy when the level of dopamine is reduced or increased respectively.
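The qualitative difference between the D1 and D2 occupancy curves can be illustrated with simple equilibrium binding. The dissociation constants below are hypothetical round numbers chosen only to reflect the higher dopamine affinity of D2 receptors; the actual occupancy curves come from the biophysical model [30]:

```python
# Illustrative receptor-occupancy curves using simple equilibrium binding,
# occupancy = d / (d + K). The dissociation constants are hypothetical.
def occupancy(d, K):
    """Fraction of receptors bound at dopamine concentration d (nM)."""
    return d / (d + K)

K_D1 = 1000.0   # hypothetical low-affinity constant for D1 (nM)
K_D2 = 10.0     # hypothetical high-affinity constant for D2 (nM)

for d in (60.0, 85.0):   # estimated baseline dopamine, dorsal and ventral
    print(f"dopamine {d:>4} nM: D1 occupancy {occupancy(d, K_D1):.2f}, "
          f"D2 occupancy {occupancy(d, K_D2):.2f}")
```

With constants like these, D2 occupancy sits near saturation at baseline dopamine while D1 occupancy is low, so reductions and increases in dopamine move the two occupancies very differently.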
Fig 11.
Changes in the weights G of the Go neurons and N of the No-Go neurons, and in the value V of the critic, in the OpAL model over the course of simulations.
(a) The purple line represents the evolving critic weight. The experienced reinforcements are indicated by black dots. (b) The actor weights, represented by a green and a red line respectively, were initialized to G = N = 1. Again, the black dots indicate the received reinforcements. The simulation was run with learning rate α = 0.3.
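The simulation described in this caption can be sketched as follows. The update equations are the commonly published three-factor OpAL form (a critic prediction error gating Hebbian, activity-dependent actor updates); the reinforcement stream is hypothetical:

```python
# Sketch of an OpAL-style simulation with a single learning rate alpha,
# as in the caption. The reinforcement stream is hypothetical.
import random

random.seed(2)
alpha = 0.3
V = 0.0
G = N = 1.0                      # actor weights initialized to 1

for _ in range(100):
    r = random.gauss(1.0, 0.5)   # hypothetical reinforcements around 1
    delta = r - V                # critic prediction error
    V += alpha * delta           # critic update
    G += alpha * G * delta       # Go actor potentiated by positive errors
    N += alpha * N * (-delta)    # No-Go actor weakened by positive errors
```

Because the actor updates are multiplicative in G and N, the weights evolve in opposite directions from their common starting value of 1 as the critic's prediction error fluctuates.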