Predictive reward-prediction errors of climbing fiber inputs integrate modular reinforcement learning with supervised learning

doi:10.1371/journal.pcbi.1012899

Fig 1.

Q-learning model of licking behavior during a Go/No-go auditory discrimination task.

A: schematic of the Q-learning model. B: the averaged lick rate for the four cue-response conditions (blue, red, orange and magenta for HIT, FA, CR and MISS trials, respectively; see also inset for color codes). Solid and dashed vertical lines indicate the cue onset and response window (1 s after cue), respectively. Light cyan shading represents the window for possible reward delivery (0.5-2 s after cue). C-D: fraction correct for Go cue (blue) and fraction incorrect for No-go cue (red) for experimental data (solid lines) and the Q-learning model (dashed lines), averaged for 17 mice in 7 training sessions (C) and for individual mice (D). Vertical bars in C show standard errors. E: hyperparameter values of the Q-learning model estimated for individual mice (1-17); from left to right and top to bottom: learning rate α, initial Q values for Go and No-go cues (q₁ and q₂, respectively), temperature τ, and punishment value for FA trials ξ. F-G: evolution of Q (F) and δQ (G) of a representative mouse for the four state-action combinations (HIT, FA, CR, MISS) during the course of learning.
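The caption names a learning rate α, initial Q values q₁ and q₂, a temperature τ and an FA punishment ξ. The following is a minimal sketch of how such parameters could combine in a tabular Q-learning agent. The reward coding (1 for HIT, -ξ for FA, 0 for CR and MISS), the two-action softmax, the use of q₁ and q₂ as initial values of the lick action only, and all function names are assumptions made for illustration; they are not taken from the paper's Methods.

```python
import math
import random

def lick_probability(q_lick, q_nolick, tau):
    """Two-action softmax: probability of licking at temperature tau."""
    return 1.0 / (1.0 + math.exp(-(q_lick - q_nolick) / tau))

def outcome(cue, lick, xi):
    """Assumed reward coding: HIT = 1, FA = -xi, CR and MISS = 0."""
    if cue == "go":
        return ("HIT", 1.0) if lick else ("MISS", 0.0)
    return ("FA", -xi) if lick else ("CR", 0.0)

def simulate(n_trials, alpha, q1, q2, tau, xi, seed=0):
    """Run a hypothetical session and return final Q values and per-trial (label, delta_q)."""
    rng = random.Random(seed)
    # Assumption: q1 and q2 initialise the lick action for Go and No-go cues;
    # the no-lick action starts at 0.
    q = {("go", True): q1, ("go", False): 0.0,
         ("nogo", True): q2, ("nogo", False): 0.0}
    history = []
    for _ in range(n_trials):
        cue = rng.choice(["go", "nogo"])
        p = lick_probability(q[(cue, True)], q[(cue, False)], tau)
        lick = rng.random() < p
        label, reward = outcome(cue, lick, xi)
        delta_q = reward - q[(cue, lick)]   # reward-prediction error
        q[(cue, lick)] += alpha * delta_q   # update only the visited state-action pair
        history.append((label, delta_q))
    return q, history
```

Under these assumptions, the Q value for licking to the Go cue rises toward 1 while the Q value for licking to the No-go cue falls, so HIT and CR trials become more frequent over the session.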

Fig 2.

CS activities to cues and their correlations with reward and sensorimotor variables.

A: Panels show PSTHs of CSs in 8 aldolase-C zones (7+ to 4b-, columns) in the four cue-response conditions (rows). Blue, green and red traces are for the 1st, 2nd and 3rd learning stages, respectively. The horizontal lines indicate cue onset. Dashed horizontal lines represent the boundary between lateral and medial parts of the left Crus II. B: Bars show the variable-importance-in-prediction (VIP) scores of 10 reinforcement-learning and sensorimotor-control variables (from left to right: R, Q, δQ, Go ✕ ELick, No-go ✕ ELick, Go ✕ RLick, No-go ✕ RLick, Go ✕ LLick, No-go ✕ LLick and latency fluctuation) for spiking activity of neurons in 8 aldolase-C zones. Dashed lines indicate VIP score = 1, which is considered a threshold of importance. See the inset for color codes of the 10 explanatory variables.

Fig 3.

Tensor-component analysis (TCA) and computation of tensor scores on a trial-by-trial basis.

A: TCA was conducted for PSTHs in 4 cue-response conditions of n=6,445 neurons, and the resulting four tensor components (TC1-4) explained more than 50% of variance. B: for the i-th single neuron, its activity for the r-th TC (y^r) in the particular j-th trial was computed by filtering spike timings with the temporal profile of the r-th TC and multiplying by the corresponding coefficients of the i-th neuron and of the cue-response condition c. C-D: PSTHs (C) of two representative neurons, which have the highest coefficients of TC1 and TC2, respectively, and their TC1 and TC2 scores, respectively, computed for all trials in their corresponding sessions (D). E: Heatmaps show TC1-4 scores averaged for all neurons in each of the eight zones, separately for the four cue-response conditions.
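Panel B describes the trial-wise score as spike timings filtered by the temporal profile of a component and scaled by the neuron and condition coefficients. The sketch below spells out one reading of that description. It assumes the temporal profile is sampled on a uniform time grid and that "filtering" means summing the profile value at each spike time; the function and parameter names are hypothetical.

```python
def tc_score(spike_times, profile, t_start, dt, neuron_coef, condition_coef):
    """Trial-wise score of one neuron for one tensor component.

    spike_times    : spike times (s) in one trial, relative to cue onset
    profile        : temporal factor of the component, one value per bin of
                     width dt, with the first bin starting at t_start
    neuron_coef    : this neuron's coefficient for the component
    condition_coef : coefficient of the trial's cue-response condition
    """
    filtered = 0.0
    for t in spike_times:
        idx = int((t - t_start) // dt)   # index of the bin containing the spike
        if 0 <= idx < len(profile):      # spikes outside the profile window are ignored
            filtered += profile[idx]
    return neuron_coef * condition_coef * filtered
```

For example, with a three-bin profile `[0.0, 1.0, 0.5]` and `dt = 0.1`, spikes at 0.05, 0.15 and 0.25 s contribute 0.0, 1.0 and 0.5 before scaling by the two coefficients.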

Fig 4.

Sparse canonical-correlation analysis (sCCA) of TC scores with reinforcement-learning and sensorimotor-control variables.

A: Bars show coefficients of reinforcement-learning and sensorimotor-control variables corresponding to TC1-4 scores. B-F: scatter plots of trials showing correlations of TC1 with Q (B), TC1 with δQ (C), TC2 with δQ (D), TC3 with R (E) and TC4 with No-go ✕ ELick (F). Black and gray lines indicate regression between variables when using all trials and trials of the cue-response condition with which each TC is primarily associated, i.e., TC1-HIT, TC2-FA and TC4-CR, respectively. Panel E shows a boxplot in which gray lines indicate the median and the bottom and top edges of the box indicate the 25th and 75th percentiles, respectively. All correlations in B-F are significant (p < 0.0001). The color convention of trials is the same as in Fig 1. The inset of A shows color codes of the 7 reward and sensorimotor variables selected by sCCA from the original 10.

Fig 5.

The conceptual model for modular reinforcement learning of the Go/No-go task and bidirectional plasticity at the PF-PC synapses.

A: The state (s) and action (a) are conveyed to the critic by upstream networks and efference copies from the actors, respectively. The critic then computes a temporal-difference (TD) prediction error δQ by comparing the observed reward-penalty value R-P with the predicted Q value. The prediction error δQ is used to update the state-action dependent reward prediction in the critic as well as the policy of the actors (red arrows). In the Go/No-go task, subsets of Purkinje cells act as context-dependent actors for Go (gray shade) and No-go (blue shade) cues separately. Here, we postulated that the two neuronal populations acquire the necessary motor commands by utilizing the negative reward-prediction error δQ, relayed by their CF inputs, in a supervised learning framework. B: bidirectional PF-PC plasticity may occur depending on the magnitude of CSs. Consequently, modulation of SSs was in the same direction as the change of CS activities during learning (black arrows). Note that CSs of TC1 and TC2 neurons were negatively correlated with reward-prediction errors in Go (blue line) and No-go (red line) trials, respectively. Horizontal dashed lines indicate the threshold, which determines LTD or LTP at the PF-PC synapses. MFs – mossy fibers, PFs – parallel fibers, CF – climbing fiber, GrCs – granule cells, PCs – Purkinje cells, LTP – long-term potentiation, LTD – long-term depression, CS – complex spike, SS – simple spike.
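Panel B states that a threshold on CS magnitude decides whether LTD or LTP occurs at the PF-PC synapse. A minimal sketch of such a rule is given below. The linear dependence on the distance from threshold, the learning rate `eta` and the function name are assumptions for illustration only; the caption specifies just the sign of the change.

```python
def pf_pc_weight_change(cs_rate, threshold, eta=0.01):
    """Signed PF-PC weight change under a threshold rule.

    Returns a negative value (LTD) when CS activity exceeds the threshold and
    a positive value (LTP) when it falls below it. The linear form and eta are
    placeholders, not values from the paper.
    """
    return -eta * (cs_rate - threshold)
```

With this form, CS activity exactly at threshold leaves the synaptic weight unchanged, and the size of the change grows with the distance from threshold.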

Fig 6.

Spiking neural network model of the cerebellum with 5,000 neurons in Go/No-go tasks.

A: The model consists of two groups of neurons in the PC–CN–IO circuitry, corresponding to TC1 & TC3 (TC_Go: PC_Go–CN_Go–IO_Go) and TC2 & TC4 (TC_Nogo: PC_Nogo–CN_Nogo–IO_Nogo), respectively. Sensory inputs to PC_Go and PC_Nogo were transmitted via mossy fibers (MFs) to granule cells for Go (GrC_Go) and No-go (GrC_Nogo), respectively. Note that the two neuronal groups received shared mossy fiber input, which is represented by equal connection of GrC_Go and GrC_Nogo to both PC_Go and PC_Nogo. In this model, LTP and LTD are assumed to occur at PF-PC synapses of TC_Go and TC_Nogo when IO firing is lower and higher than the threshold, respectively. For each group, PCs, CN and IO, designated by green, yellow and blue discs, contained 100 simulated neurons each, and we prepared 2000 GrCs for both Go (GrC_Go) and No-go (GrC_Nogo) cues. B: The lick rate is modeled as a sigmoid function of the combined firing rates of CN_Go and CN_Nogo neurons, with the maximum lick rate (rate_max) set at 6 Hz. C: The error rates of Go and No-go trials, defined by the difference between the target lick rate (rate_max for Go and 0 for No-go trials) and the actual lick rate, are transformed into the rate of Poisson spike generator inputs Err_Go and Err_Nogo to IO_Go and IO_Nogo neurons, respectively. This reproduces the established negative correlations between δQ and CSs in Go trials for TC_Go (blue region) and No-go trials for TC_Nogo (red region). D: A lattice structure with 10 × 10 IO neurons for each of TC_Go and TC_Nogo is modeled, where the effective coupling strength between neurons is proportional to their relative distance. In each trial, the effective coupling strength was determined by the firing rate of CN neurons (see Methods for details).
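Panels B and C describe the lick rate as a sigmoid of the combined CN firing rates with a ceiling of 6 Hz, and the error as the difference between target and actual lick rate. The sketch below illustrates those two steps only. The weighted-sum combination of CN_Go and CN_Nogo, the weights, the gain and the midpoint are all hypothetical placeholders; the caption refers to the Methods for the actual definitions, and the transformation of the error into a Poisson input rate is not reproduced here.

```python
import math

RATE_MAX = 6.0  # Hz, maximum lick rate stated in the caption

def lick_rate(cn_go, cn_nogo, w_go=1.0, w_nogo=1.0, gain=0.1, midpoint=0.0):
    """Sigmoid lick rate bounded by RATE_MAX.

    Assumption: the CN rates are combined as a weighted sum; w_go, w_nogo,
    gain and midpoint are illustrative values, not parameters from the paper.
    """
    drive = w_go * cn_go + w_nogo * cn_nogo
    return RATE_MAX / (1.0 + math.exp(-gain * (drive - midpoint)))

def lick_error(cue, rate):
    """Target minus actual lick rate: target is RATE_MAX for Go, 0 for No-go."""
    target = RATE_MAX if cue == "go" else 0.0
    return target - rate
```

In this reading, a Go trial with a lick rate below 6 Hz yields a positive error and a No-go trial with any licking yields a negative error, each of which would then drive the corresponding IO population.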

Fig 7.

Licking behaviors and neural firings of the model.

A: lick rate in 500 randomly generated trials, distinguished by Go (blue dots) and No-go (red dots) cues. Each dot represents a single trial. The right panel presents mean ± std of lick rates in the first 100 trials (open bars, first stage) and the last 100 trials (filled bars, last stage). B: firing rates of IO_Go and IO_Nogo neurons in the first and last stages of the trials. C: bidirectional changes in the weight of PF-PC synapses for TC_Go (cyan trace) and TC_Nogo (magenta trace) throughout the learning process. D: effective coupling between IO_Go (left) and IO_Nogo (right) neurons in individual trials. E-F: raster plots of IO_Go (upper panel) and IO_Nogo (lower panel) neurons in the first (E) and last (F) stages. Vertical dashed lines indicate trial (Go vs No-go) boundaries. Asterisks in A-B indicate the significance level of the t-tests between the first and last stages: n.s., p > 0.05; ****, p < 0.0001.

Table 1.

Parameters of LIF neuron model.

Table 2.

Synaptic connections.
