Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules
Figure 5
Predictions of the context-sensitive model in choice tasks.
(A) Two-choice task. At decision node N (of value VN) the agent can choose either action A (which gives a larger or more probable reward) or action B (which gives a smaller or less probable reward). The same value σVN is carried over to the outcome of whichever action is chosen (curved arrows). (B) Mean frequency of choosing action A in the two-choice task of panel 5A (Psel(A)) vs. the probability that action A is rewarded (Prew(A)) for different values of σ (see the text). For each value of Prew(A), four values of σ were used (0, 0.1, 0.2, and 0.3). Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3 and r = 1, together with the theoretical prediction (dashed line). For σ = 0, the model is the standard TD model. Choice preference does not depend on the value of σ. (C) 4-armed bandit task. At decision node N the agent can choose among 4 possible actions, each rewarding the agent according to a predefined probability distribution. The same value σVN is carried over to the outcome of whichever action is chosen. (D) Mean frequency of choosing each of the four alternative actions of the 4-armed bandit task of panel 5C for different values of σ (same values as in panel 5B). Each choice was rewarded according to a Gaussian distribution truncated to exclude negative values, with means μ = 0.25, 0.5, 0.75, and 1 (one per action) and standard deviation 0.25. Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3, together with the theoretical prediction (dashed line). Choice frequencies do not depend on the value of σ.
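The σ-independence reported in panels B and D can be illustrated with a minimal sketch. It assumes a softmax choice rule with inverse temperature β (the caption gives β = 3 but does not state the choice rule), takes the nominal means μ of panel D as the action values (ignoring the small shift caused by truncation), and uses a hypothetical node value `v_n`. Under these assumptions, adding the same term σVN to every action value leaves the choice probabilities unchanged; the same holds if the term is subtracted.

```python
import math

def softmax_probs(values, beta):
    """Softmax choice probabilities (assumed choice rule, not confirmed by the caption)."""
    m = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

beta = 3.0                        # inverse temperature, as in the caption
means = [0.25, 0.5, 0.75, 1.0]    # nominal mean rewards of the four actions (panel D)
v_n = 0.8                         # hypothetical value of decision node N (assumption)
sigmas = [0.0, 0.1, 0.2, 0.3]     # sigma values used in the figure

# Shift every action value by the same context term sigma * v_n.
probs_by_sigma = {
    s: softmax_probs([q + s * v_n for q in means], beta) for s in sigmas
}

for s, probs in probs_by_sigma.items():
    print(s, [round(p, 4) for p in probs])
```

The printed rows are identical for every σ because softmax depends only on differences between action values, so a common additive term cancels.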