Meta-Reinforcement Learning reconciles surprise, value, and control in the anterior cingulate cortex
Fig 5
a. Single-trial schema. Each trial began with a compound cue (e.g., green and blue) indicating the proposed bandit and its source environment. Choosing “engage” led to the next trial state, in which the indicated bandit was played. Choosing “forage” led to a waiting state (the foraging cost) and then back to the initial state, with a new bandit randomly selected from the same environment (the green cue changes accordingly). b-c. fMRI results from [17], showing dACC activity as a function of the value of the “forage” choice (b) and as a function of the similarity between the “forage” and “engage” options, i.e., choice difficulty (c). d. RML boost signal as a function of foraging value. The subplot shows the simulated activity of the whole MPFC sector of the RML, computed as the combination of boost and value signals. e. RML boost signal as a function of choice difficulty. As in d, the subplot shows the simulated activity of the whole MPFC, computed as the combination of boost and value signals.
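The trial structure in panel a can be read as a small state machine. The following is a minimal sketch of that loop only, not the RML model or the task code used in [17]. All names are hypothetical: a bandit is represented as a (reward probability, reward magnitude) pair, an environment as a list of bandits, the foraging cost as a scalar penalty per waiting state, and `max_forages` is an added safeguard that forces an "engage" choice so the loop always terminates.

```python
import random


def run_trial(environment, policy, forage_cost=1.0, rng=None, max_forages=100):
    """Simulate one trial of the engage/forage schema (illustrative sketch).

    environment: list of (reward_probability, reward_magnitude) bandits (assumed format).
    policy: callable(bandit, environment) returning "engage" or "forage".
    Returns (net_reward, number_of_forage_choices).
    """
    rng = rng or random.Random()
    total_cost = 0.0
    n_forages = 0

    # Initial state: a bandit from this environment is proposed (compound cue).
    bandit = rng.choice(environment)

    # "Forage": pay the waiting cost, then return to the initial state
    # with a new bandit drawn at random from the same environment.
    while n_forages < max_forages and policy(bandit, environment) == "forage":
        total_cost += forage_cost
        n_forages += 1
        bandit = rng.choice(environment)

    # "Engage": play the currently proposed bandit.
    p_reward, magnitude = bandit
    reward = magnitude if rng.random() < p_reward else 0.0
    return reward - total_cost, n_forages


def always_engage(bandit, environment):
    """Trivial example policy: accept whatever bandit is proposed."""
    return "engage"
```

For example, `run_trial([(0.8, 2.0), (0.2, 10.0)], always_engage)` plays the first proposed bandit without foraging. A value-based policy would instead compare the proposed bandit with the expected value of the environment minus the foraging cost.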