Meta-Reinforcement Learning reconciles surprise, value, and control in the anterior cingulate cortex

doi:10.1371/journal.pcbi.1013025

Fig 1.

Description of the RML model.

a: Anatomo-functional mapping of RML modules. b: Schematic representation of the interactions among RML modules. The agent consists of the RML and a task-specific model that provides the RML with the functions necessary to execute a task. The RML optimizes the task-specific model via the LC output (which can be interpreted as the cognitive control signal).


Fig 2.

Interactions between the RML and external modules in different tasks.

a: RML-DDM interaction during the speeded decision-making task and the foraging task. The RML received input from the environment (about rewards and environmental states) and controlled a task-specific module (the DDM), which helped in task execution. δv: difference in the expected value of the two options (different fractals for the speeded decision-making task, “forage” or “engage” for the foraging task). The absolute value of δv determined the drift rate, while its sign determined the drift direction (up or down). The LC output modulated the decision boundaries, influencing the decision time. In the foraging task only, the LC output additionally influenced the bias of the DDM towards the “engage” option (the human propensity for “engaging” [17,35]). This bias was set to 0 for all trials in the speeded decision-making task. b: RML-cRNN interaction during the execution of the verbal WM task. The LC output modulated the gain of the neural units in the articulatory process layer, improving word retention in WM.
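The DDM coupling described in panel a can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `simulate_ddm`, the linear mapping from the control signal (`boost`) to the boundary, the `gain` and `noise_sd` parameters, and the treatment of the bias as a starting-point offset are all assumptions made for illustration. Only the qualitative relations come from the caption: |δv| sets the drift rate, the sign of δv sets the drift direction, and the LC output widens or narrows the decision boundaries.

```python
import random


def simulate_ddm(delta_v, boost, bias=0.0, gain=1.0, noise_sd=1.0,
                 dt=0.01, max_steps=10_000, rng=None):
    """Simulate one drift-diffusion trial (illustrative sketch only).

    delta_v : value difference; its magnitude sets the drift rate and
              its sign sets the drift direction (up or down).
    boost   : stand-in for the LC output; here it widens the decision
              boundary linearly (assumed mapping, not from the paper).
    bias    : starting-point offset towards the upper boundary
              (assumed form of the bias; 0 means unbiased).
    """
    rng = rng or random.Random()
    boundary = 1.0 + boost      # assumed: more control -> wider bounds
    drift = gain * delta_v      # sign gives direction, magnitude gives rate
    x = bias                    # accumulated evidence
    for step in range(1, max_steps + 1):
        # Euler-Maruyama update of the diffusion process
        x += drift * dt + noise_sd * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        if x >= boundary:
            return "upper", step * dt
        if x <= -boundary:
            return "lower", step * dt
    return "none", max_steps * dt   # no boundary reached in time
```

Calling `simulate_ddm(delta_v=2.0, boost=0.0)` and `simulate_ddm(delta_v=2.0, boost=1.0)` repeatedly with the same seed sequence should show longer decision times at the higher boost, which is the qualitative effect the caption attributes to the LC output.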


Fig 3.

Speeded decision-making task simulation.

a: Task layout. The RML was presented with two possible choices (fractal images). Each image was assigned a hidden reward value ranging from 2 to 7 (bottom fractals list). The agent’s task was to choose the image with the higher reward as quickly as possible. We tested the RML on 36 different combinations of reward values for the two images. After the RML made its choice, it received the reward associated with the selected image. b: The fMRI results from Vassena et al. [32] (adapted). The dACC activity is shown by the black line, while the grey dashed line shows the best-fitting quartic function for these data. c: The dACC activity as simulated by the RML (black line), and the best-fitting quartic function (blue dashed line). This activity is the sum of the value (panel d) and the cognitive control (panel e) components. d: The value component of the RML activity. e: Cognitive control signal (RML boost) as a function of the value difference between stimuli. f: RML surprise-related activity. g: Activation clusters within the MPFC (based on data extracted from the figures in Vassena et al. [32]). Blue: dACC activation as a mixture of value and cognitive control, RML prediction in panel c; Red: vMPFC value-based activation, RML prediction in panel d; Green: mid-cingulate activation relative to average surprise, RML prediction in panel f.
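The count of 36 combinations in panel a is consistent with one simple reading, sketched below: six integer reward values (2 to 7) assigned independently to the two images, giving 6 × 6 ordered pairs. This reading is an assumption, since the caption does not state whether the values are integers or whether equal-value pairs are included. The variable names are hypothetical.

```python
from collections import Counter
from itertools import product

# Assumed: integer reward values 2..7, one per image, order matters.
values = range(2, 8)
pairs = list(product(values, repeat=2))

# Value difference per pair, the quantity plotted on the x-axis of
# panels b-f under this reading.
diff_counts = Counter(a - b for a, b in pairs)

print(len(pairs))                   # number of stimulus pairs
print(sorted(diff_counts.items()))  # how many pairs share each difference
```

Under this assumption, small differences are represented by more pairs than large ones, which matters when averaging activity per value difference.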


Fig 4.

Verbal WM task simulation.

a. Setup of the verbal WM task. During each trial, the RML was exposed to 1, 4, 6 or 8 words, generating four different difficulty levels. After a delay of 10 s, the model was presented with a target word that matched one of the memorized words in 50% of trials. The model’s goal was to indicate whether the target word matched one of the words presented before. b. fMRI results from Engström et al. [31] showing dACC activity (red cluster in subpanel) as a function of WM load (difficulty levels 1-4); the dashed line shows the linear fit. c. Boost signal from the RML as a function of WM load; the dashed line shows the linear fit. d-e. RML surprise and expected value (average of MPFC_Act and MPFC_Boost) as a function of WM load. Dashed lines show linear fits.


Fig 5.

Foraging task simulation.

a. Single-trial schematic. Each trial began with a compound cue (e.g., green and blue) indicating the proposed bandit and its source environment. Choosing “engage” led to the next trial state, where the indicated bandit was played. Choosing “forage” led to a waiting state (foraging cost), then back to the initial state with a new, randomly selected bandit from the same environment (the green cue changes accordingly). b-c. fMRI results from Shenhav et al. [17], showing the dACC activity as a function of the value of the “forage” choice (b) and as a function of the similarity between the “forage” and “engage” options, i.e., choice difficulty (c). d. RML boost signal as a function of foraging value. The subplot shows the simulated activity of the whole MPFC sector of the RML, computed as the combination of boost and value signals. e. RML boost signal as a function of choice difficulty. As in d, the subplot shows the simulated activity of the whole MPFC, computed as the combination of boost and value signals.


Fig 6.

Anatomical overlap of all the dACC fMRI clusters from the three experimental paradigms we simulated in this work, all plotted on a T1 image at X = 0.

a. The cluster for the speeded decision-making task, based on data extracted from the figures in Vassena et al. [32]. b. The cluster for the verbal working memory task, based on data extracted from the figures in Engström et al. [31]. c. The cluster for the foraging task, based on data extracted from the figures in Kolling et al. [35]. d. The cluster for the foraging task, based on data extracted from the figures in Shenhav et al. [17]. e. The overlap between the four clusters. The different colours indicate the number of overlapping clusters.
