Neural responses in macaque prefrontal cortex are linked to strategic exploration

doi:10.1371/journal.pbio.3001985

Fig 1.

Task and model.

(A) During the task, we manipulated whether the information could be used in the future by including both long and short horizon sequences. In both trial types, monkeys initially received four samples (“observations”) from the unknown underlying reward distributions. In short horizon trials, they then made a one-off decision between the two options presented on screen (“choice”). In long horizon trials, they could make four consecutive choices between the two options, whose reward distributions remained fixed. On the first choice (highlighted), the information content was equivalent between short and long horizon trials (same number of observations), whereas the information context was different (learning and updating are only beneficial in the long horizon trials). (B) Example short and long horizon trials. The monkeys first received some information about the reward distributions associated with choosing the left and right option. The length of the orange bar indicates the number of drops of juice they could have received (0–10 drops). The horizon length of the trial is indicated by the size of the grey area below the four initial samples. The monkeys then made one (short horizon) or four (long horizon) subsequent choices. As monkeys progressed through the four choices, more information about the distributions was revealed. Displayed here is a partial information trial where only information about the chosen option is revealed. (C) Ideal model observer for the options of the example trial shown in B (color code corresponds to the side of the option). The distributions show the probability of observing the next outcome for each option. The expected value corresponds to the peak of the distribution and the uncertainty to the variance. Thick lines correspond to post-outcome estimates and thin lines to pre-outcome estimates (from the previous trial). (D) We also modulated the contingency between choice and information by including different feedback conditions. In the partial feedback condition, monkeys only received feedback for the chosen option. In contrast, in the complete feedback condition, they received feedback about both options after active choices (not in the observation phase). (E) Example partial and complete feedback trials (both short horizon). Here, the observation phase shown in (B) is broken up into the components the monkeys saw on screen during the experiment. Initially, the samples were displayed on screen, but a red circle in the center indicated that the monkeys could not yet respond. After a delay, the circle disappeared, and the monkeys could choose an option. After they responded, the chosen side was highlighted (red outline). After another delay, the outcome was revealed. In the partial feedback condition (top), only the outcome for the chosen option was revealed. In contrast, in the complete feedback condition (bottom), both outcomes were revealed. After another delay, the reward for the chosen option was delivered in both conditions.
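
The ideal observer in panel (C) summarizes each option by an expected value and an uncertainty that are updated after every outcome. The sketch below illustrates that idea under assumptions that are not taken from the paper: outcomes are treated as Gaussian with a known noise standard deviation (`noise_sd`, a hypothetical parameter) and a flat prior on the mean, so the predictive distribution for the next outcome is Gaussian with the sample mean as its peak. The authors' actual model may differ.

```python
from statistics import fmean


def predictive_summary(samples, noise_sd=2.0):
    """Return (expected_value, predictive_variance) for the next outcome.

    Assumption (not from the paper): Gaussian outcomes with known noise_sd
    and a flat prior on the mean. The posterior over the mean is then
    Normal(sample mean, noise_sd**2 / n), and the predictive distribution
    adds the outcome noise on top of that.
    """
    if not samples:
        raise ValueError("at least one observation is required")
    n = len(samples)
    expected_value = fmean(samples)
    predictive_variance = noise_sd ** 2 * (1.0 + 1.0 / n)
    return expected_value, predictive_variance


# Hypothetical example: four observed outcomes (drops of juice) for one
# option, then a fifth outcome revealed after a choice.
observed = [4, 6, 5, 5]
ev_before, var_before = predictive_summary(observed)
ev_after, var_after = predictive_summary(observed + [8])
```

In this sketch, each additional outcome shifts the expected value toward the new sample and shrinks the predictive variance, which mirrors the move from the thin (pre-outcome) to the thick (post-outcome) lines in the panel.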

Fig 2.

First choice.

(A) In our experimental design, on the first choice of a horizon, directed exploration is only sensible in long horizon trials in the partial feedback condition. This is because in short horizon trials, the information gained by exploring is of no use for subsequent choices, so a rational decision-maker would only choose based on the expected value of the options. Moreover, in the complete feedback condition, all information is obtained regardless of which option is chosen, so an ideal observer would again always choose the option with the highest expected value. (B) The proportion of trials in which the monkeys chose the option with the higher expected value is above chance level (0.5) across both feedback conditions and horizons. Mean across sessions (partial feedback: 41 sessions, complete feedback: 40 sessions). (C) Monkeys’ choices are sensitive to nuanced differences in expected value. Mean across all sessions (81 sessions). (D) According to the logistic regression model predicting monkeys’ first choices in a horizon (see main text and methods for details), monkeys’ first choices are less driven by expected value in the partial than in the complete feedback condition. Within the partial feedback condition, they are less driven by expected value in long than in short horizon trials. No such difference was found in the complete feedback condition. This is evidence that monkeys deliberately modulate their exploration behavior to explore more on partial feedback long horizon trials, where exploration is sensible (see (A)). Error bars indicate standard error of the mean in B and C and standard deviation in D. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.
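
To make the interpretation of panel (D) concrete, the sketch below shows how the weight on expected value in a logistic choice rule translates into choice probabilities. It is a minimal illustration, not the regression model used in the paper: the function name, the single-predictor form, and the weight values are all hypothetical.

```python
import math


def p_choose_higher_ev(ev_difference, ev_weight):
    """Probability of choosing the higher expected value option.

    Hypothetical single-predictor logistic rule: ev_difference is the
    expected value advantage of the better option (in drops of juice) and
    ev_weight is the regression weight on that difference.
    """
    return 1.0 / (1.0 + math.exp(-ev_weight * ev_difference))


# Illustrative weights only: a larger weight stands in for a condition in
# which choices are strongly driven by expected value, a smaller weight for
# a condition with more exploration.
strongly_value_driven = p_choose_higher_ev(ev_difference=2.0, ev_weight=1.0)
more_exploratory = p_choose_higher_ev(ev_difference=2.0, ev_weight=0.3)
```

With the same expected value difference, the smaller weight gives a probability closer to 0.5. That is the sense in which a reduced expected value weight in long horizon partial feedback trials can be read as increased exploration.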

Fig 3.

Behavioral update.

(A) As monkeys progressed through the long horizon, they were more likely to choose the option with the higher expected reward in both the partial and complete feedback conditions. Mean across sessions (partial feedback: 41 sessions, complete feedback: 40 sessions). (B) Monkeys were sensitive to changes in the expected value compared to the baseline expected value they experienced during the observation phase both for the chosen option (mean across all sessions (81 sessions)) and (C) the unchosen option (mean across all complete feedback sessions (40 sessions)). (D) Monkeys were also more likely to repeat their choice as they progressed through the long horizon. Mean across sessions (partial feedback: 41 sessions, complete feedback: 40 sessions). (E) Results of the single logistic regression model predicting second, third, and fourth choices in the long horizon. In both the partial and complete feedback conditions, monkeys were sensitive to the expected value at observation but more so in the complete than in the partial feedback condition (left). Monkeys tended to repeat previous choices in both conditions but more so in the partial than in the complete feedback condition (center left). In both conditions, monkeys were sensitive to the change in expected value compared to the observation phase with no significant difference between conditions (center right). In the complete feedback condition, monkeys were also sensitive to the change compared to baseline of the additional information they received (right). Error bars represent standard error of the mean in A-D and standard deviation in E. *p < 0.05, **p < 0.01, and ***p < 0.001. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572.

Fig 4.

First choice neural results.

(A) When combining partial and complete feedback sessions, we found clusters showing differential activity between long horizon and short horizon trials in the pgACC, the dlPFC, and the lateral OFC. Cluster p < 0.05, cluster-forming threshold of z > 2.3. (B) We placed ROIs (in yellow) in the overlap of the functional cluster and anatomical region and extracted t-statistics for the difference between long horizon and short horizon. Mean across sessions (partial feedback: 40 sessions, complete feedback: 34 sessions). (C) We looked for differences in how the contingency between choice and information (complete vs. partial feedback) modulates the initial information that was presented before first choices. Within our VOI, we found clusters of activity in MCC both for the main effect of feedback type and a greater sensitivity to expected value in the complete feedback condition. We also found a cluster of activity in dlPFC for a greater sensitivity to expected value in the complete feedback condition. (D) We placed an ROI (in yellow) in the part of MCC that is activated by the main effect of feedback type and extracted the t-statistics of the regressor for every session. We found that the effect we observed in the VOI is driven by increased activity in the complete feedback condition, whereas there is no activity in the partial feedback condition. Mean across sessions (partial feedback: 40 sessions, complete feedback: 34 sessions). (E) We also placed ROIs (in yellow) in the parts of MCC and dlPFC where we found significant clusters in the VOI for the interaction of feedback type and expected value and extracted the t-statistics for the expected value regressor of every session. Plotting these regressors separately for feedback type reveals that both MCC and dlPFC were more active when an option with high expected value was chosen in the complete feedback condition, whereas they were more active when an option with low expected value was chosen in the partial feedback condition. Mean across sessions (partial feedback: 40 sessions, complete feedback: 34 sessions). Error bars represent standard error of the mean. *p < 0.05. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. dlPFC, dorsolateral prefrontal cortex; MCC, anterior and mid-cingulate cortex; OFC, orbitofrontal cortex; pgACC, pregenual anterior cingulate cortex; ROI, region of interest; VOI, volume of interest.

Fig 5.

Prediction error neural results.

(A) In complete feedback sessions only, we found clusters for inverted prediction error activity in the central part of OFC (area 13), extending into lOFC (area 47/12o). We also found inverted prediction error activity in the cOFC (area 13) and mOFC (area 14) for the unchosen, counterfactual reward. (B) Brain–behavior correlational analysis between the prediction error signal in the mOFC (t-statistic) and session-specific t-statistic of the behavioral effect of the change in expected value on choices (estimated with a separate GLM for each session). (C) We placed ROIs (in yellow) in the overlap of the functional cluster modulated by the magnitude of the chosen outcome and anatomical region. We extracted t-statistics for reward and expectation, both of the chosen and unchosen option. Prediction error activity should evoke both a reward and an expectation response with opposite signs. We did not find evidence for outcome expectation of the chosen option. Mean across complete feedback sessions (34 sessions). (D) When defining the ROIs (in yellow) according to the response to the magnitude of the unchosen outcome, we found evidence for a classic reward prediction error and a counterfactual prediction error about the unchosen option both in cOFC and mOFC: We observed activity related to both the obtained and the unobtained reward, and also activity related to both the chosen and unchosen outcome expectation. Mean across complete feedback sessions (34 sessions). Error bars represent standard error of the mean. *p < 0.05, **p < 0.01. Data and code to reproduce the figure can be found at https://doi.org/10.5281/zenodo.7464572. cOFC, central OFC; GLM, general linear model; OFC, orbitofrontal cortex; lOFC, lateral OFC; mOFC, medial OFC; ROI, region of interest.
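
The statement in panel (C) that prediction error activity should evoke reward and expectation responses with opposite signs follows from the standard definition of a prediction error. The short derivation below spells this out. The symbols are introduced here for illustration and are not taken from the paper: $r$ is the outcome, $\hat{r}$ the expectation, $y$ the ROI signal, and $\beta_\delta$ the assumed sensitivity of that signal to the prediction error.

```latex
\begin{align}
  \delta &= r - \hat{r} \\
  y &= \beta_\delta\,\delta + \varepsilon
     = \beta_\delta\, r - \beta_\delta\, \hat{r} + \varepsilon
\end{align}
% If r and \hat{r} are entered as separate regressors,
%   y = \beta_r\, r + \beta_{\hat{r}}\, \hat{r} + \varepsilon,
% then a region coding \delta should show
%   \beta_r = \beta_\delta  and  \beta_{\hat{r}} = -\beta_\delta .
```

Under this reading, an inverted prediction error signal corresponds to $\beta_\delta < 0$, that is, a negative reward weight paired with a positive expectation weight. A reward response without a matching expectation response of opposite sign would not be sufficient evidence for prediction error coding.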

Table 1.

Timings.
