Fig 1.
A schematic of the two-stage task (panel A) and an example of a random walk used to generate the true expected value of each of the four second-stage bandits (panel B). At the first stage, participants chose between two options (represented by abstract fractal images) that determined which second-stage state was presented, via fixed transition probabilities of 70% (‘common’) or 30% (‘rare’). At the second stage, participants again chose between two bandits, which led to receipt of a reward (£0 or £1 play pounds). Note that the second stage included two pairs of bandits; the composition of each pair was fixed, but the value of each bandit drifted slowly and independently. More specifically, the rewards associated with the second-stage bandits were subject to random walks and thus had to be continually learned by participants.
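The task structure described in this caption can be sketched in a few lines of Python. This is a minimal illustration only: the caption does not give the random-walk parameters, so the step size `sd`, the bounds `lo` and `hi`, and the function names `drift_reward_probs` and `second_stage_state` are all assumptions introduced here. Only the 70%/30% transition probabilities and the four independently drifting bandits come from the caption.

```python
import random


def drift_reward_probs(n_trials, n_bandits=4, sd=0.025, lo=0.25, hi=0.75, seed=0):
    """Return one row of reward probabilities per trial.

    Each bandit follows its own Gaussian random walk. The step size and
    the reflecting bounds are assumed values, not taken from the paper.
    """
    rng = random.Random(seed)
    probs = [rng.uniform(lo, hi) for _ in range(n_bandits)]
    history = []
    for _ in range(n_trials):
        history.append(list(probs))
        for i in range(n_bandits):
            p = probs[i] + rng.gauss(0.0, sd)
            # Reflect at the bounds, then clamp as a safeguard.
            if p > hi:
                p = 2 * hi - p
            if p < lo:
                p = 2 * lo - p
            probs[i] = min(max(p, lo), hi)
    return history


def second_stage_state(first_choice, rng, p_common=0.7):
    """Map a first-stage choice (0 or 1) to a second-stage state (0 or 1).

    A common transition (probability p_common) leads to the state with the
    same index as the choice; a rare transition leads to the other state.
    """
    common = rng.random() < p_common
    state = first_choice if common else 1 - first_choice
    return state, common
```

Calling `drift_reward_probs(200)` yields 200 rows of four probabilities, one column per second-stage bandit, which corresponds to the kind of trajectory shown in panel B.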
Fig 2.
Examining the relationship between model-agnostic scores (MB-I(choice), MB-II(RT)) and the w-parameter.
To obtain these plots, we gradually increased the w-parameter from 0 to 1 in steps of .1, each time simulating 200 experiments of 5,000 trials each using the DDM-RL model (for each experiment, all other parameters were sampled randomly and uniformly from pre-defined ranges: α1/2 [0,1], λ [0,1], w [0,1], p [0,.5], b1/2 [1,10], a1/2 [1,3], τ1/2 [.01,.5]). (A) For each w value, we averaged the MB-I(choice) and MB-II(RT) scores (see Eqs 11 and 13) across the 200 experiments. We then standardized the eleven mean scores, separately for each MB score. Results showed a strong relationship between the w-parameter and both model-agnostic measures. (B) Here we illustrate how deployment of model-based strategies at the first stage affects MB-II(RT) via systematic effects on second-stage value discrimination. Specifically, Panel B presents the averaged ΔQ-value (max–min Q-value) for the second-stage state the agent visited. Results confirmed that higher w-parameter values lead to higher/lower value discriminability (ΔQ-value) after common/uncommon transitions, respectively. Notably, in the DDM-RL model, ΔQ-values are directly and positively associated with drift rates and hence contribute to faster RTs (see Eq 8). This result illustrates why a higher w-parameter is associated with quicker/slower RT2 after common/uncommon transitions, respectively. (C/D) To further demonstrate how deployment of model-based strategies at the first stage leads to systematic value differences at the second stage, we labelled in each trial the best and worst state (the state that included the highest Q-value of the four available second-stage bandits, and the alternative state). Panel C shows that across all simulations the best state was associated with higher value discriminability (higher ΔQ-value), regardless of the w-parameter. Panel D further shows that a higher w-parameter is associated with a higher probability of visiting the best state by means of common transitions (see Eq 3).
Therefore, Panels C and D illustrate why a higher w-parameter leads to higher value discriminability after common transitions, as shown in Panel B.
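The two quantities used in Panels B to D can be written down directly from their definitions in this caption. The sketch below is illustrative: the data layout (`q_values` as one pair of Q-values per second-stage state) and the function names `delta_q` and `best_state` are assumptions made here, not the authors' code.

```python
def delta_q(q_values, state):
    """Value discriminability: max minus min Q-value in the visited state."""
    pair = q_values[state]
    return max(pair) - min(pair)


def best_state(q_values):
    """Index of the state containing the highest Q-value of all bandits."""
    return max(range(len(q_values)), key=lambda s: max(q_values[s]))


# Hypothetical Q-values: state 0 holds the single best bandit.
q_values = [[0.75, 0.25], [0.5, 0.375]]
visited = 0
print(delta_q(q_values, visited), best_state(q_values) == visited)
```

In this hypothetical example the best state also has the larger ΔQ-value, which is the pattern Panel C reports across simulations.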
Fig 3.
(A/B/C) Scatterplots showing the relationship between the three hierarchical model-based estimates obtained from empirical data (scores were averaged across baseline and follow-up).
Table 1.
Correlation estimates describing the relationship between the different model-based estimates.
Fig 4.
(A/B) Scatter plots of true versus recovered w-parameter (estimating the model-based/model-free trade-off). Results show a stronger correlation for the DDM-RL model (panel B; modeling choice and RT, r = .9) than for an RL (choice-only) model previously reported in the literature (panel A, r = .62).
Table 2.
Spearman's correlation estimating the relationship between the true and recovered parameters.
Table 3.
Psychometric properties for model-based estimates.
Fig 5.
Internal consistency estimates for MB-I(choice) and MB-II(RT).
In all panels, the x-axis represents the number of trials in the analysis, and the y-axis the Pearson’s correlation (corrected using the Spearman-Brown formula) between the scores calculated for odd and even trials. (A) Internal stability for MB-I(choice) obtained from simulated data of the RL vs. DDM-RL models. Results suggest that reliability reached the criterion with fewer trials for the DDM-RL model than for the RL model. (B) Internal stability for MB-II(RT) obtained from simulated data of the DDM-RL model. (C/D) Internal stability for MB-I(choice) and MB-II(RT) calculated from empirical data (follow-up only). (E) Internal consistency in empirical data for the four conditions that make up MB-I(choice) (CR: common-rewarded, CU: common-unrewarded, UR: uncommon-rewarded, UU: uncommon-unrewarded; see Eqs 9–11). Ribbons represent 95% CIs. The horizontal line represents the .7 criterion for internal stability.
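The odd/even split-half procedure with Spearman-Brown correction described in this caption can be sketched as follows. This is a minimal illustration under stated assumptions: each participant's score is taken to be a simple mean over trial-level values, whereas the paper's MB-I(choice) and MB-II(RT) scores are defined by its own equations, and the function names are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson's correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_brown(r):
    """Step a half-test correlation up to the full test length."""
    return 2 * r / (1 + r)


def split_half_reliability(trials_by_participant, score_fn=mean):
    """Correlate odd-trial and even-trial scores across participants.

    trials_by_participant: one list of trial-level values per participant.
    score_fn: stand-in for the actual score computation (assumed: mean).
    """
    odd = [score_fn(trials[0::2]) for trials in trials_by_participant]
    even = [score_fn(trials[1::2]) for trials in trials_by_participant]
    return spearman_brown(pearson(odd, even))
```

To trace a curve like those in the figure, one would call `split_half_reliability` repeatedly on the first n trials of each participant for increasing n and compare the result against the .7 criterion.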
Table 4.
Statistical power (percentage of studies that rejected the null hypothesis, given that an effect exists) for a between-group design (control vs. experiment).
Table values show the chance of finding a statistically significant between-group effect as a function of true effect size, sample size, and number of trials in the experiment.
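Power defined as the percentage of simulated studies that reject the null hypothesis can be estimated with a short Monte Carlo loop. The sketch below rests on several assumptions that are not taken from the paper: scores are drawn from unit-variance normal distributions, the test is a two-sided z-approximation to a two-sample test, and the number of trials per experiment (which would affect score reliability) is not modelled. The function name `simulated_power` is hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def simulated_power(effect_size, n_per_group, n_studies=2000, alpha=0.05, seed=0):
    """Percentage of simulated two-group studies that reject the null.

    effect_size is the standardized mean difference between the groups.
    Uses a normal approximation to the test statistic (an assumption).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        experiment = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        se = (variance(control) / n_per_group
              + variance(experiment) / n_per_group) ** 0.5
        z = (mean(experiment) - mean(control)) / se
        if abs(z) > z_crit:
            rejections += 1
    return 100.0 * rejections / n_studies
```

Evaluating `simulated_power` over a grid of effect sizes and sample sizes produces a table of the same shape as the one described here, minus the trial-number dimension.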