Fig 1.
A. Schematic decision-outcome representations in two variants of the bandit task. In the single bandit task, one decision (d1) is followed by one outcome (o1). In the multiple-bandits task, two decisions (d1, d2) are followed by two outcomes (o1, o2). White circles represent the different states of the task. Gray and colored boxes indicate the true causal structure, called the decision-outcome mapping. Colored arrows indicate the correct or incorrect policy, where correctness refers to the match between the causal structure and an agent’s representation. Outcomes are considered relevant if they belong to the correct representation and irrelevant if they belong to the incorrect representation. B. Graphical representation of the multiple-bandits task, as implemented in the study. P and Q define the outcome value associated with each action. Over the course of a block, these values are subject to independent Gaussian random walks, as depicted in the boxes below. C and D. Trial representation of the multiple-bandits task in Experiments 1 and 2. E and F. Trial representation of the transfer task in Experiments 1 and 2. For the transfer task, stimuli from the different decisions were mixed and participants were instructed to always choose the stimuli associated with a specific color (e.g., blue was associated with the star and the circle).
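The independent Gaussian random walks described for panel B can be illustrated with a minimal sketch. The starting value, step standard deviation, block length, and the absence of bounds are assumptions chosen for illustration only, and the function name is hypothetical; none of these are taken from the study.

```python
import random


def gaussian_random_walk(n_trials, start, step_sd, rng):
    """Return n_trials values, each equal to the previous one plus Gaussian noise."""
    values = [start]
    for _ in range(n_trials - 1):
        # Assumed: zero-mean Gaussian step, no reflecting bounds.
        values.append(values[-1] + rng.gauss(0.0, step_sd))
    return values


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Two walks drawn from the same generator are statistically independent.
p_values = gaussian_random_walk(100, 0.0, 1.0, rng)
q_values = gaussian_random_walk(100, 0.0, 1.0, rng)
```

Drawing both walks from one seeded generator keeps the example reproducible while leaving the increments of P and Q independent of each other.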
Fig 2.
Behavioral results for Experiment 1 (left) and Experiment 2 (right). A,B. Stay probabilities following different combinations of relevant and irrelevant outcomes. Relevant outcomes indicate the correct decision-outcome mapping, whereas irrelevant outcomes indicate the incorrect decision-outcome mapping. As the decision-outcome mapping is hidden from participants, stay probabilities for each decision are classified for both possible mappings (see Fig 1A). For example, suppose decision 1 results in a win (blue), whereas decision 2 results in a loss (yellow). For decision 1, the relevant outcome (blue) is a win and the irrelevant outcome (yellow) is a loss. Conversely, for decision 2, the relevant outcome (yellow) is a loss and the irrelevant outcome (blue) is a win. C,D. Correlation between performance (probability correct) in the multiple-bandits task and the subsequent transfer task. Circles indicate participants. The black line indicates the regression line between both variables. E,F. Correlation between implicit credit assignment (stay probability of relevant win/irrelevant loss minus relevant loss/irrelevant win) in the multiple-bandits task and the subsequent transfer task.
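The implicit credit assignment measure in panels E and F is a difference of two stay probabilities, which can be sketched as below. The trial representation (a dictionary with `relevant`, `irrelevant`, and `stay` fields) and the function names are hypothetical and chosen only for illustration; the example trials are made up.

```python
def stay_probability(trials, relevant, irrelevant):
    """Proportion of trials with the given outcome combination on which the choice was repeated."""
    stays = [t["stay"] for t in trials
             if t["relevant"] == relevant and t["irrelevant"] == irrelevant]
    if not stays:
        return float("nan")  # no trials with this outcome combination
    return sum(stays) / len(stays)


def implicit_credit_assignment(trials):
    """Stay probability after relevant win/irrelevant loss minus relevant loss/irrelevant win."""
    return (stay_probability(trials, "win", "loss")
            - stay_probability(trials, "loss", "win"))


# Made-up example trials (not data from the study).
example_trials = [
    {"relevant": "win", "irrelevant": "loss", "stay": True},
    {"relevant": "win", "irrelevant": "loss", "stay": True},
    {"relevant": "loss", "irrelevant": "win", "stay": True},
    {"relevant": "loss", "irrelevant": "win", "stay": False},
]
ica = implicit_credit_assignment(example_trials)
```

A positive value means choices were repeated more often after a relevant win than after an irrelevant win, which is the pattern expected when credit is assigned to the correct mapping.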
Fig 3.
Simulated data for the validation of the surprise minimization model.
A. Choice behavior when action selection is fully driven by the first-level policy representing the correct mapping between decisions and outcomes. B. Choice behavior when action selection is fully driven by the lower-level loop representing the incorrect mapping. C. Choice behavior when action selection is arbitrated between policies by the inference process of the surprise minimization model. D. Distribution of surprise signals, calculated as the absolute prediction errors for both the correct and the incorrect policy. E. Illustration of the evidence accumulation process. Surprise is calculated for both the correct (green) and incorrect (red) mapping (top panel). The evidence signal is calculated as the difference between these two surprise signals (middle panel). Accumulation of evidence and development of the arbitration weight (logit⁻¹(ω)) over the course of a block (bottom panel). Starting in a state of uncertainty (0.5), the inference process gradually and robustly establishes the correct arbitration weight, leading to increasingly optimal credit assignment.
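The evidence accumulation illustrated in panel E can be sketched as follows. This is a minimal illustration under stated assumptions, not the model's specification: evidence is summed without a learning rate or decay, ω starts at 0 (so the arbitration weight starts at 0.5), and the sign convention (incorrect minus correct surprise, so that positive evidence favors the correct mapping) is assumed. Function names and the example surprise values are hypothetical.

```python
import math


def inverse_logit(x):
    """Map accumulated evidence onto a weight between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def arbitration_weights(surprise_correct, surprise_incorrect):
    """Return the arbitration weight before the first trial and after each trial."""
    omega = 0.0  # assumed starting point: no evidence, weight of 0.5
    weights = [inverse_logit(omega)]
    for s_c, s_i in zip(surprise_correct, surprise_incorrect):
        # Assumed sign convention: lower surprise under the correct
        # mapping yields positive evidence for that mapping.
        evidence = s_i - s_c
        omega += evidence  # assumed: plain summation of evidence
        weights.append(inverse_logit(omega))
    return weights


# Made-up absolute prediction errors for three trials.
weights = arbitration_weights([0.1, 0.2, 0.1], [0.5, 0.6, 0.4])
```

With consistently lower surprise under the correct mapping, the weight rises monotonically from 0.5 toward 1, mirroring the trajectory described for the bottom panel.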
Table 1.
Model comparison.
Fig 4.
Linear regression between the estimated parameters of the surprise minimization model (regressors) and the behavioral measures (criteria).
A. Regression values for Experiment 1. B. Regression values for Experiment 2. Implicit credit assignment is the difference in stay probabilities between relevant win/irrelevant loss and relevant loss/irrelevant win.
Fig 5.
Model-based regression analyses.
A. Mean regression (beta) values for prediction errors for the correct and incorrect policies, separately for the feedback-locked data (left panel) and response-locked data (right panel) at posterior electrode sites. B. Mean regression (beta) values for the evidence signal, calculated as the difference between correct and incorrect surprise, and for the arbitration weight, calculated as the inverse logit-transformed accumulated evidence signal. Gray bars below indicate the time windows that were considered for cluster-based permutation testing. Colored bars indicate time windows with significant positive and negative effects. The posterior cluster was defined by electrodes Pz, P3, P4, CP1, CP2, PO3, and PO4, as indicated in the central inlay. Topographies show the significant cluster for the correct policy and the incorrect policy. Black diamonds indicate significant clusters.