Predictive reward-prediction errors of climbing fiber inputs integrate modular reinforcement learning with supervised learning
Fig 5
The conceptual model for modular reinforcement learning of the Go/No-go task and bidirectional plasticity at the PF-PC synapses.
A: The state (s) and action (a) are conveyed to the critic by upstream networks and by efference copies from the actors, respectively. The critic then computes a temporal-difference (TD) prediction error δQ by comparing the observed reward-penalty value R-P with the predicted Q value. The prediction error δQ is used to update the state-action-dependent reward prediction in the critic as well as the policy of the actors (red arrows). In the Go/No-go task, subsets of Purkinje cells act as context-dependent actors for Go (gray shade) and No-go (blue shade) cues separately. Here, we postulated that the two neuronal populations acquire the necessary motor commands by using the negative reward-prediction error δQ, relayed by their CF inputs, in a supervised learning framework. B: Bidirectional PF-PC plasticity may occur depending on the magnitude of CSs. Consequently, SSs were modulated in the same direction as the change in CS activity during learning (black arrows). Note that CSs of TC1 and TC2 neurons were negatively correlated with reward-prediction errors in Go (blue line) and No-go (red line) trials, respectively. Horizontal dashed lines indicate the threshold, which determines whether LTD or LTP occurs at the PF-PC synapses. MFs – mossy fibers, PFs – parallel fibers, CF – climbing fiber, GrCs – granule cells, PCs – Purkinje cells, LTP – long-term potentiation, LTD – long-term depression, CS – complex spike, SS – simple spike.
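The critic computation described in panel A can be illustrated with a minimal sketch. It assumes a single-step trial, so that the prediction error reduces to δQ = (R - P) - Q(s, a) with no discounted next-state term. The learning rate, the state and action labels, and the function names are hypothetical choices made for illustration and are not taken from the study.

```python
from collections import defaultdict

ALPHA = 0.1  # hypothetical learning rate (assumption, not from the source)


def td_error(reward, penalty, q_pred):
    """Prediction error: observed reward-penalty value R-P minus predicted Q."""
    return (reward - penalty) - q_pred


def update_q(q_table, state, action, reward, penalty, alpha=ALPHA):
    """Update the critic's state-action value in place and return delta_Q."""
    delta_q = td_error(reward, penalty, q_table[(state, action)])
    q_table[(state, action)] += alpha * delta_q
    return delta_q


# Hypothetical state/action labels; all Q values start at 0.
q_table = defaultdict(float)

# Rewarded response to a Go cue gives a positive prediction error.
delta_go = update_q(q_table, "go_cue", "respond", reward=1.0, penalty=0.0)

# Penalized response to a No-go cue gives a negative prediction error.
delta_nogo = update_q(q_table, "nogo_cue", "respond", reward=0.0, penalty=1.0)
```

In this sketch the same δQ that updates the critic would also be passed to the actors. How that signal maps onto CS magnitude and onto LTD or LTP at the PF-PC synapses is left out, because the caption does not give the quantitative rule.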