Information uncertainty influences learning strategy from sequentially delayed rewards

doi:10.1371/journal.pcbi.1013879

Fig 1.

Experimental design and random walks.

A. Example trial sequence in the conjoint condition. Participants choose between two objects, then receive feedback. Here, a purple three-line stimulus offers an immediate reward of ‘1’, while a red cross provides a delayed reward of ‘4’ after two trials. B. The disjoint condition presents the same sequence of events but with dissociable feedback. C. The three fixed random-walk reward patterns for all eight stimuli across trials. Each stimulus is linked to either an immediate (solid lines) or a delayed (dashed lines) reward. The starting reward values are 4, 1, -1, or -4, and they drift gradually throughout the task. Participants were randomly assigned to one of the three random walks at the beginning of each session (and hence could experience different random walks in the conjoint and disjoint conditions) and encountered a random sequence of unique pairs (28 in total) in each session. Schematic elements designed in Google Slides per Google’s copyright agreement.
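The stimulus set described in panel C can be sketched as follows. This is only an illustration: the caption does not give the drift process, step size, or number of trials, so the Gaussian step, `step_sd`, `n_trials`, and the seeds are assumptions, as is the pairing of each starting value with each reward timing to obtain eight stimuli. The one quantity taken directly from the caption is the count of unique pairs, since 8 stimuli yield 8 × 7 / 2 = 28 pairs.

```python
import itertools
import random

START_VALUES = [4, 1, -1, -4]        # starting reward values from the caption
TIMINGS = ["immediate", "delayed"]   # each stimulus has one reward timing


def random_walk(start, n_trials, step_sd=0.5, seed=0):
    """Return a drifting reward sequence beginning at `start`.

    The Gaussian step and its standard deviation are assumptions made
    for illustration; the caption only says values drift gradually.
    """
    rng = random.Random(seed)
    values = [float(start)]
    for _ in range(n_trials - 1):
        values.append(values[-1] + rng.gauss(0.0, step_sd))
    return values


# Assumed construction of the eight stimuli: timing x starting value.
stimuli = list(itertools.product(TIMINGS, START_VALUES))

# Every unordered pair of distinct stimuli.
pairs = list(itertools.combinations(stimuli, 2))

# One hypothetical walk per stimulus (100 trials is an arbitrary choice).
walks = {
    stim: random_walk(stim[1], n_trials=100, seed=i)
    for i, stim in enumerate(stimuli)
}

print(len(stimuli), "stimuli,", len(pairs), "unique pairs")
```

In the study the three random walks were fixed in advance, so a closer analogue would be to generate three such sets once and reuse them across participants.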

Table 1.

Descriptive statistics for behavioral performance.

Fig 2.

Graphical representation of task conditions, learning models, and value functions.

A. Differences in feedback presentation based on the participant’s condition and the outcome used to generate the prediction error for immediate feedback. B. Example sequence of two-alternative forced-choice trials and the participant’s selection (darker arrow) of object 2 (Obj 2) in the current trial (red), the previous trial (T-1), and two trials back (T-2). The colors correspond to the tabular model, which updates the immediate choice (+1) and the T-2 choice (+4), each generating a prediction error to update the value function. C. Temporal sequence of credit assignment (shown as a blue heatmap). The tabular model skips assigning credit to the previous state (S2). The ellipsis signifies that credit assignment can extend beyond the three depicted states. The extent to which past states are assigned credit in each model depends on the free parameter lambda: for the eligibility model, higher lambda values mean that credit extends further back in time (less decay), while for the tabular model, higher lambda values mean less discounting of the T-2 state. D-E. Value functions for the tabular model (D), which involves separate, independent updates for the immediate and delayed chosen options, and for the eligibility trace model (E), which uses a single prediction error for updates. S: State, Obj: Object, I: Immediate, D: Delay.
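The contrast between the two update rules can be sketched in a few lines. This is a minimal illustration built only from the caption, not the authors' equations: the function names, the use of `alpha` as a learning rate, the multiplicative role of `lam`, resetting the chosen option's trace to 1, and treating the single outcome in the eligibility update as the sum of the displayed rewards are all assumptions.

```python
def tabular_update(values, chosen_now, chosen_t2, r_immediate, r_delayed,
                   alpha, lam):
    """Two separate prediction errors (assumed form).

    The current choice is updated from the immediate reward, and the
    choice made two trials back from the delayed reward, scaled by lam.
    The previous trial's choice receives no credit.
    """
    values = dict(values)
    pe_now = r_immediate - values[chosen_now]
    values[chosen_now] += alpha * pe_now
    pe_t2 = r_delayed - values[chosen_t2]
    values[chosen_t2] += alpha * lam * pe_t2
    return values


def eligibility_update(values, traces, chosen_now, reward, alpha, lam):
    """One prediction error shared across options via decaying traces
    (assumed form)."""
    traces = {k: lam * e for k, e in traces.items()}   # decay older credit
    traces[chosen_now] = 1.0                           # full credit for now
    pe = reward - values[chosen_now]
    values = {k: v + alpha * pe * traces[k] for k, v in values.items()}
    return values, traces


# Hypothetical example using the rewards in panel B (+1 immediate, +4 delayed).
start = {"a": 0.0, "b": 0.0}
tab = tabular_update(start, "a", "b", r_immediate=1, r_delayed=4,
                     alpha=0.5, lam=0.5)
elig, new_traces = eligibility_update(start, {"a": 0.0, "b": 1.0}, "a",
                                      reward=1 + 4, alpha=0.5, lam=0.5)
print(tab)    # {'a': 0.5, 'b': 1.0}
print(elig)   # {'a': 2.5, 'b': 1.25}
```

With a higher `lam`, the tabular sketch discounts the T-2 update less and the eligibility sketch spreads credit further back, which matches the qualitative description in panel C.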

Fig 3.

Behavioral signatures of learning and model predictions.

(A) Logical structure of a delayed choice that reappears after three trials, with the response categorized as either ‘stay’ (S) or ‘switch’ (W). The diagram includes feedback valence for the immediate and two-trials-forward choices. Correctly using two-trials-forward information (illustrated with a curved arrow) suggests staying with the initial choice. (B) The probability of maintaining (i.e., staying on) a delayed choice three trials later was modeled using multilevel logistic regression, as a function of condition (conjoint or disjoint), time (immediate, 0F, or two-trials forward, 2F), and reward (positive, +, or negative, -), as well as their interactions (equation 1). This regression was run on data from participants (red), as well as on data generated by the tabular model (green) and by the eligibility model (blue) for comparison. The logistic regression’s estimated marginal means are displayed with 95% confidence intervals for each condition. Text labels indicate where participants’ confidence intervals overlap with those of the eligibility model [E,-], the tabular model [-,T], or both [E,T]. (C-E) Collapsed two-way interactions for each combination of the three variables: time*reward collapsed across condition (C), condition*time collapsed across reward (D), and reward*condition collapsed across time (E). Bars represent marginal means; error bars represent 95% confidence intervals.

Fig 4.

Model prediction of trial-by-trial choice.

(A) Participants’ average propensity to choose left on three example pairs from one of the reward random walks, selected to show a clean reversal from left to right choice. Each panel title identifies the reward contingency (I = Immediate, D = Delayed) with the starting value in parentheses (e.g., 4 = starting value 4, -1 = starting value -1). The y-axis is the probability of selecting the left option named in the title; for example, in I(4)-D(1) the left choice is the immediate option with starting value 4. The grey dotted line tracks when the optimal choice shifts from left (1.00) to equivalent (0.50) and then to right (0.00). Choices of participants in each condition (disjoint, conjoint) and stage (1st, 2nd) should therefore follow this line; participant data are shown on top. Lighter and darker shades of a color belong to the same group. (B) The same analysis performed on choice data generated by the hybrid model, using the best-fitting participant parameters. (C) Predictive accuracy (i.e., percentage of model-predicted choices matching participant choices) for each model and each pair type: delayed vs. delayed (Del v Del), immediate vs. immediate (Imm v Imm), and immediate vs. delayed (Mixed), shown as boxplots across participants’ best-fitting parameters. (D) Fixed effects from a logistic regression predicting choice on each trial from tabular predictions, eligibility predictions, and their interactions with condition (equation 2). The results align with our hypothesis that the tabular model predicts behavior better in the disjoint than in the conjoint condition (significant disjoint * tabular interaction), whereas the eligibility model showed only a non-significant effect in the opposite direction (disjoint * eligibility interaction). Dots and associated numbers represent the odds ratio for each effect; horizontal error bars represent 95% confidence intervals. *** p < .001.
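The predictive accuracy in panel C, defined as the percentage of model-predicted choices that match participant choices, can be computed per pair type as in the sketch below. The function name, the tuple layout, and the example trials are hypothetical and made up for illustration; they are not data from the study.

```python
from collections import defaultdict


def predictive_accuracy(trials):
    """Percentage of trials where the model's predicted choice equals the
    participant's choice, grouped by pair type.

    `trials` is an iterable of (pair_type, predicted, observed) tuples.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pair_type, predicted, observed in trials:
        totals[pair_type] += 1
        hits[pair_type] += int(predicted == observed)
    return {k: 100.0 * hits[k] / totals[k] for k in totals}


# Made-up trials, for illustration only.
example_trials = [
    ("Mixed", "left", "left"),
    ("Mixed", "left", "right"),
    ("Imm v Imm", "right", "right"),
    ("Del v Del", "left", "left"),
]
accuracy = predictive_accuracy(example_trials)
print(accuracy)  # {'Mixed': 50.0, 'Imm v Imm': 100.0, 'Del v Del': 100.0}
```

Running this once per participant and model would give the per-participant values that a boxplot like panel C summarizes.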

Table 2.

Average model-fitting metrics.

Table 3.

Differences in model parameters by condition, for each strategy.

Fig 5.

Effect of condition and stage order group on computational model parameters.

A mixed-effects linear regression model was applied to each RL parameter from its yoked independent model, predicting the parameter value from condition, stage order group, and their interaction (equation 3). Specifically, this regression was performed on the decision weight (Beta) for tabular (A) and eligibility (B), the decay rate (Lambda) for tabular (C) and eligibility (D), and the learning rate (Alpha) for tabular (E) and eligibility (F). The results display estimated marginal means, marked with arrows pointing towards the participants’ final condition (Conjoint or Disjoint), for the two groups defined by the order of stages: from conjoint stage 1 to disjoint stage 2 (C → D, black) and from disjoint stage 1 to conjoint stage 2 (D → C, grey). Shaded areas represent 95% confidence intervals. Individual dots and lines show parameter estimates for each participant. The top left corner of each panel reports the p-value associated with the main effect of condition (C), the main effect of stage (S), or their interaction (C * S).
