Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules

doi:10.1371/journal.pcbi.1000131


Figure 5

Predictions of the context-sensitive model in choice tasks.

(A) Two-choice task. At decision node N (of value V_N) the agent can choose either action A (which gives a larger or more probable reward) or action B (which gives a smaller or less probable reward). The same value σV_N is carried over to whichever outcome follows the choice (curved arrows). (B) Mean frequency of choosing action A in the two-choice task of panel 5A (P_sel(A)) vs. the probability that action A is rewarded (P_rew(A)) for different values of σ (see the text). For each value of P_rew(A), four values of σ were used (0, 0.1, 0.2, and 0.3). Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3 and r = 1, together with the theoretical prediction (dashed line). For σ = 0, the model reduces to the standard TD model. Choice preference does not depend on the value of σ. (C) 4-armed bandit task. At decision node N the agent can choose among 4 possible actions, each rewarding the agent according to a predefined probability distribution. The same value σV_N is carried over to whichever outcome follows the choice. (D) Mean frequency of choosing each of the four alternative actions in the 4-armed bandit task of panel 5C for different values of σ (same values as in panel 5B). Each choice was rewarded according to a Gaussian distribution truncated at negative values, with mean μ = 0.25, 0.5, 0.75, or 1 and standard deviation 0.25. Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3, together with the theoretical prediction (dashed line). Choice frequencies do not depend on the value of σ.
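
One way to see why choice frequencies might be insensitive to σ is sketched below. It is not the authors' simulation code. It assumes that choices follow a softmax rule with inverse temperature β, and that the carried-over term σV_N shifts the value of every available action by the same amount; both are assumptions made for illustration and are not stated in the caption. The untruncated means of the four arms in panel D stand in for learned action values (truncation at zero raises the true means slightly), and `v_n` is a hypothetical value for node N. Under these assumptions the common shift cancels in the softmax ratio, so the choice probabilities are identical for every σ.

```python
import math


def softmax_probs(values, beta):
    """Softmax (Boltzmann) choice probabilities for a list of action values."""
    m = max(values)  # subtract the maximum for numerical stability
    weights = [math.exp(beta * (v - m)) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


beta = 3.0                         # inverse temperature, as in the caption
arm_values = [0.25, 0.5, 0.75, 1]  # untruncated arm means (stand-in for learned values)
sigmas = [0.0, 0.1, 0.2, 0.3]      # values of sigma used in panels B and D
v_n = 0.8                          # hypothetical value of decision node N (assumption)

# sigma = 0 corresponds to no carried-over term.
baseline = softmax_probs(arm_values, beta)

# Assumption: sigma * v_n is subtracted equally from every action value.
shifted = {
    s: softmax_probs([q - s * v_n for q in arm_values], beta)
    for s in sigmas
}

for s in sigmas:
    max_diff = max(abs(p - b) for p, b in zip(shifted[s], baseline))
    print(f"sigma={s}: largest deviation from baseline = {max_diff:.2e}")
```

The same cancellation holds for the two-choice task of panel A, since it only requires that the offset be identical across the actions available at node N.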

doi: https://doi.org/10.1371/journal.pcbi.1000131.g005