Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules
Figure 7
Model for the general Markov Decision Process (MDP).
(A) Policy for the general MDP. In the fragment of the MDP shown, the agent is in state i and must decide (1) whether to leave the state (with probability P(m|i)), and (2) which state to go to if it does leave (weighting each choice with probability P(i→j|m)). Decision 1 depends on the motivational value of the current state; decision 2 depends on the relative values of the possible arrival states, or choices. Both the motivational and the choice values are learned with the TD method described in the main text. If the agent is not motivated to perform the trial, it will find itself in the same state one time step later (curved arrow). If the agent is sufficiently motivated to perform the trial correctly, it proceeds to make a choice. In the figure, this situation is represented by the curved shaded region from which the arrows to the possible choices reach out. In the general case, the transition probability Pij is the product of the probabilities P(m|i) and P(i→j|m). (B) Policy in the reward schedule task. In this case, P(i→j|m) = 1 because there is no choice and j can only be the next schedule state (in this example, i = 1/2, j = 2/2). Thus, Pij = P(m|i). (C) Policy in the choice task when considering only correct trials. In this case, P(m|i) is set to 1 and thus Pij = P(i→j|m).
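The factorization in panels A to C can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `transition_probs`, the state labels, and the numerical probabilities are hypothetical, and the probability of remaining in state i is assumed to be 1 − P(m|i), following the curved arrow in panel A.

```python
def transition_probs(state, p_motivated, choice_probs):
    """Return {next_state: probability} for one state of the MDP.

    p_motivated  -- P(m|i), probability of leaving the current state
    choice_probs -- {j: P(i->j|m)}, must sum to 1 over arrival states
    """
    if not 0.0 <= p_motivated <= 1.0:
        raise ValueError("P(m|i) must lie in [0, 1]")
    if abs(sum(choice_probs.values()) - 1.0) > 1e-9:
        raise ValueError("P(i->j|m) must sum to 1 over arrival states")

    # General case (panel A): Pij = P(m|i) * P(i->j|m).
    probs = {j: p_motivated * p for j, p in choice_probs.items()}
    # Assumption: an unmotivated agent stays in the same state.
    probs[state] = probs.get(state, 0.0) + (1.0 - p_motivated)
    return probs


# Panel A: general MDP with two possible choices (illustrative numbers).
general = transition_probs("i", 0.8, {"j1": 0.25, "j2": 0.75})

# Panel B: reward schedule task, a single next state, so P(i->j|m) = 1.
schedule = transition_probs("1/2", 0.8, {"2/2": 1.0})

# Panel C: choice task on correct trials only, so P(m|i) = 1.
choice = transition_probs("i", 1.0, {"j1": 0.25, "j2": 0.75})
```

In the schedule case the transition probability to the next schedule state reduces to P(m|i), and in the choice case each transition probability reduces to P(i→j|m), matching panels B and C.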