Modeling the Violation of Reward Maximization and Invariance in Reinforcement Schedules

doi:10.1371/journal.pcbi.1000131

Figure 1.

Behavioral paradigm used in the reward schedule task.

(A) Color discrimination task. Each trial begins with the monkey touching a bar. A visual cue (horizontal black bar) appears immediately. Four hundred milliseconds later a red dot (WAIT signal) appears in the center of the cue. After a random interval of 500–1500 ms the dot turns green (GO signal). The monkey is required to release the touch-bar between 200 and 800 ms after the green dot appears, in which case the dot turns blue (OK signal) and a drop of water is delivered 250 to 350 ms later. If the monkey fails to release the bar within the 200–800 ms interval after the GO signal, an error is registered and no water is delivered. An anticipatory bar release (<200 ms) is also counted as an error. (Red, green and blue dots are enlarged for the purpose of illustration.) (B) 2-trial schedule. Each trial is a color discrimination task as in panel A, with cues of different brightness for different trials (see Materials and Methods for details). In the 2-trial schedule, completion of the first trial is not rewarded and is followed by the second trial after an inter-trial interval (ITI) of 1–2 seconds. An error at any point during a trial causes the trial to be aborted and restarted after the ITI. The same applies to schedules of any length. Schedules of different lengths are randomly interleaved. Note that after an error, the schedule resumes from the current trial and not from the first trial of the schedule.
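The trial rules in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' task-control code: the function names, the outcome labels, and the choice to treat both ends of the 200–800 ms window as valid are assumptions.

```python
from typing import Optional, Tuple

# Valid release window after the GO signal, in ms (values from the caption).
# Treating both boundaries as inclusive is an assumption.
GO_WINDOW_MS = (200, 800)


def classify_release(release_ms: Optional[float]) -> str:
    """Classify a bar release time measured from GO onset.

    `release_ms` is None when the bar was never released.
    """
    lo, hi = GO_WINDOW_MS
    if release_ms is None or release_ms > hi:
        return "error_late"
    if release_ms < lo:
        return "error_early"  # anticipatory release
    return "correct"


def next_trial(tau: int, s: int, correct: bool) -> Optional[Tuple[int, int]]:
    """Return the next schedule state (tau, s), or None when the schedule ends.

    After an error the same trial is repeated; the schedule does not
    restart from its first trial.
    """
    if not correct:
        return (tau, s)
    if tau < s:
        return (tau + 1, s)
    return None  # rewarded trial completed; a new schedule is drawn


print(classify_release(450))       # a release inside the window
print(next_trial(1, 2, False))     # an error repeats trial 1 of 2
```

In a full simulation, a `None` return would be followed by drawing a new schedule length at random, since schedules of different lengths are interleaved.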

Figure 2.

Monkeys' behavior in the reward schedule task.

(A–B) Error rates as a function of schedule state for two monkeys, for both valid (circles) and random (“x”) cues. Each schedule state is labeled by the fraction τ/s, where τ stands for the current trial and s for the current schedule length. Maximal schedule length was 3 for monkey A and 4 for monkey B. In both monkeys, error rates with valid cues differ significantly across schedule states (χ² test, p<10⁻¹⁰). In monkey A, the error rate in state 1/2 is larger than in state 2/3 (Marascuilo test for multiple comparisons, p<0.005); in monkey B, the error rate in state 2/3 is larger than in state 3/4, and the error rate in state 1/2 is larger than in state 2/3 (Marascuilo test, p<0.05). Original data from refs. [25] (A) and [44] (B). (C) Scatter plot of the difference in error rates between states 1/2 and 2/3 (E(1/2)−E(2/3)) vs. the maximal error rate for all 24 monkeys. Filled circles mark positive differences E(1/2)−E(2/3) that are significant (Marascuilo test for multiple comparisons, p<0.05). (D) Error rates (dots) for the 12 monkeys corresponding to the filled circles in panel C. Thick grey lines: medians.

Figure 3.

Models.

(A) Diagrammatic representation of the basic model for 3- and 2-trial schedules. (B) General pattern of error rates predicted by the basic model. For trials with the same reward proximity (pre-reward number, preRN, plotted in the same color) the model predicts equal error rates. (C) Diagrammatic representation of the context-sensitive model for the 3-trial schedule. (D) General pattern of error rates predicted by the context-sensitive model. For trials with the same reward proximity (plotted in the same color) the model predicts smaller error rates in longer schedules.
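To make the grouping by reward proximity concrete, the sketch below lists the schedule states τ/s that share a pre-reward number. It assumes that preRN is the number of trials remaining before the rewarded trial, preRN = s − τ; the caption does not state this formula, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def states_by_prern(max_len: int) -> Dict[int, List[str]]:
    """Group schedule states "tau/s" by pre-reward number.

    Assumption: preRN = s - tau, i.e. the number of trials still to be
    completed before the rewarded trial of the schedule.
    """
    groups: Dict[int, List[str]] = defaultdict(list)
    for s in range(1, max_len + 1):          # schedule length
        for tau in range(1, s + 1):          # current trial within the schedule
            groups[s - tau].append(f"{tau}/{s}")
    return dict(groups)


for prern, states in sorted(states_by_prern(3).items()):
    print(prern, states)
```

Under this assumption, states 1/2 and 2/3 fall in the same group: the basic model predicts equal error rates for them, whereas the context-sensitive model predicts a smaller error rate in 2/3, the state from the longer schedule.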

Figure 4.

Predictions of the context-sensitive model in the reward schedule task.

(A–B) Theoretical error rates predicted by the context-sensitive model (black) for both valid (circles) and random (“x”) cues. The model parameters were tuned to match the experimental error rates of Figure 2A and 2B, respectively, using least-squares minimization as described in Materials and Methods. The experimental data from Figure 2 are reproduced in grey for comparison. Parameters for Monkey A (B): β = 3.6 (3.2), σ = 0.3 (0.8), γ = 0.4 (0.3). (C) Error rate (Equation 2) as a function of schedule state values (full curve) for the model of panel B. Black dots are the actual values of valid cues in the standard model (i.e., with σ = 0; see Equation 6); larger dots are the mean values of valid (black) and random (grey) cues. The inset shows the predicted mean values of valid (black) and random (grey) cues for paradigms with 2 to 10 schedules (basic model). Larger dots correspond to the case of the main figure (4 schedules). (D) Linear regression of the median error rates with valid cues against the median error rates with random cues for the 13 monkeys tested in both conditions (r² = 0.69, p<0.0005).

Figure 5.

Predictions of the context-sensitive model in choice tasks.

(A) Two-choice task. At decision node N (of value V_N) the agent can choose either action A (which gives a larger or more probable reward) or action B (smaller or less probable reward). The same value σV_N is carried over to whichever outcome is chosen (curved arrows). (B) Mean frequency of choosing action A in the two-choice task of panel 5A (P_sel(A)) vs. the probability that action A is rewarded (P_rew(A)) for different values of σ (see the text). For each value of P_rew(A), four values of σ were used (0, 0.1, 0.2, and 0.3). Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3 and r = 1, together with the theoretical prediction (dashed line). For σ = 0, the model is the standard TD model. Choice preference does not depend on the value of σ. (C) 4-armed bandit task. At decision node N the agent can choose among 4 possible actions, each rewarding the agent according to a predefined probability distribution. The same value σV_N is carried over to whichever outcome is chosen. (D) Mean frequency of choosing each of the four alternative actions of the 4-armed bandit task of panel 5C for different values of σ (same values as in panel 5B). Each choice was rewarded according to a Gaussian distribution truncated at negative values, with mean μ = 0.25, 0.5, 0.75, 1 and standard deviation 0.25. Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3, together with the theoretical prediction (dashed line). Choice frequencies do not depend on the value of σ.

Figure 6.

Predictions of the context-sensitive model in choice-schedule tasks.

(A) Description of the choice-schedule task with 2-trial schedules. At decision node N (of value V_N) the agent can choose either the immediate-reward schedule A (which gives a larger reward, R, sooner and a smaller reward, r, later) or the delayed-reward schedule B (smaller reward sooner and larger reward later). The same value σV_N is carried over to whichever outcome is chosen, but the following trials in each schedule modify the value of A or B differently (curved grey arrows, shown for schedule A only; see the text for details). (B) Mean frequency of choosing the immediate-reward schedule (schedule A) in the task of panel 6A predicted by the model as a function of σ. Shown are means (dots) and standard deviations (error bars) over 20 simulations with β = 3, γ = 0.55, R = 1 and r = 0.5. Dashed line: theoretical prediction according to the equation with V_sch.A−V_sch.B = (1−γσ)⁻¹(1−γ)(R−r). A positive value of σ enhances the existing preference for the immediate-reward schedule. (C) Choice-schedule task between two 3-trial schedules, a generalization of the task in panel 6A. (D) Mean frequency of choosing the immediate-reward schedule (schedule A) in the task of panel 6C predicted by the model as a function of σ. Shown are means (dots) and standard deviations (error bars) over 20 simulations with the same parameters as in 6B. Dashed line: theoretical prediction according to the equation with V_sch.A−V_sch.B = (1−2γσ)⁻¹(1−γ−γ²−γσ)(R−r). Dotted line: indifference point P_sel(sch.A) = 0.5, i.e., the situation where the agent has no preference for either schedule. For σ larger than ≈0.268, choice preference is reversed and the delayed-reward schedule is chosen more often.
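The two value differences quoted in this caption can be checked numerically. The sketch below evaluates them with the caption's parameters (γ = 0.55, R = 1, r = 0.5) and solves for the σ at which the 3-trial difference vanishes. The function names are hypothetical, and the expressions are only meaningful while the denominators stay positive (γσ < 1 for the 2-trial case, 2γσ < 1 for the 3-trial case).

```python
def value_diff_2trial(sigma: float, gamma: float = 0.55,
                      R: float = 1.0, r: float = 0.5) -> float:
    """V_sch.A - V_sch.B for the 2-trial task (panel B formula)."""
    return (1 - gamma) * (R - r) / (1 - gamma * sigma)


def value_diff_3trial(sigma: float, gamma: float = 0.55,
                      R: float = 1.0, r: float = 0.5) -> float:
    """V_sch.A - V_sch.B for the 3-trial task (panel D formula)."""
    return (1 - gamma - gamma**2 - gamma * sigma) * (R - r) / (1 - 2 * gamma * sigma)


def indifference_sigma(gamma: float = 0.55) -> float:
    """Sigma at which the 3-trial numerator vanishes:
    1 - gamma - gamma^2 - gamma*sigma = 0."""
    return (1 - gamma - gamma**2) / gamma


for sigma in (0.0, 0.1, 0.2, 0.3):
    print(sigma, value_diff_2trial(sigma), value_diff_3trial(sigma))
print("indifference sigma:", indifference_sigma())
```

With γ = 0.55 the 3-trial difference changes sign at σ = 0.1475/0.55 ≈ 0.268, matching the reversal point stated in the caption, while the 2-trial difference stays positive and grows with σ.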

Figure 7.

Model for the general Markov Decision Process (MDP).

(A) Policy for the general MDP. In the fragment of MDP shown, the agent is in state i and must decide (1) whether to leave the state (with probability P(m|i)), and (2) which state to go to in case of a positive decision (weighting each choice with probability P(i→j|m)). Decision 1 depends on the motivational value of the current state; decision 2 depends on the relative values of the possible arrival states, or choices. Both the motivational and the choice values are learned with the TD method of the main text. If the agent is not motivated to perform the trial, it will find itself in the same state one time step later (curved arrow). If the agent is sufficiently motivated to perform the trial correctly, it proceeds to make a choice. In the figure, this situation is represented by the curved shaded region from which the arrows to the possible choices reach out. In the general case, the transition probability P_ij is the product of the probabilities P(m|i) and P(i→j|m). (B) Policy in the reward schedule task. In this case, P(i→j|m) = 1 because there is no choice and j can only be the next schedule state (in this example, i = 1/2, j = 2/2). Thus, P_ij = P(m|i). (C) Policy in the choice task when considering only correct trials. In this case, P(m|i) is set to 1 and thus P_ij = P(i→j|m).
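The factorization P_ij = P(m|i)·P(i→j|m) and its two special cases can be illustrated as follows. This is a sketch under stated assumptions, not the model's actual policy: the caption does not give the functional form of either factor, so P(m|i) is passed in as a number and P(i→j|m) is taken to be a softmax over choice values with inverse temperature β. All function and variable names are hypothetical.

```python
import math
from typing import Dict, Tuple


def choice_probs(values: Dict[str, float], beta: float = 3.0) -> Dict[str, float]:
    """P(i->j|m) as a softmax over arrival-state values (assumed form)."""
    vmax = max(values.values())  # subtract the max for numerical stability
    weights = {j: math.exp(beta * (v - vmax)) for j, v in values.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


def transition_probs(p_motivated: float, values: Dict[str, float],
                     beta: float = 3.0) -> Tuple[float, Dict[str, float]]:
    """Return (probability of staying in state i, {j: P_ij}).

    P_ij = P(m|i) * P(i->j|m); an unmotivated agent stays in state i
    with probability 1 - P(m|i).
    """
    p_choice = choice_probs(values, beta)
    p_ij = {j: p_motivated * p for j, p in p_choice.items()}
    return 1.0 - p_motivated, p_ij


# Reward schedule task: a single successor, so P_ij = P(m|i).
print(transition_probs(0.8, {"2/2": 1.0}))
# Choice task on correct trials: P(m|i) = 1, so P_ij = P(i->j|m).
print(transition_probs(1.0, {"A": 1.0, "B": 0.5}))
```

The stay probability is returned separately from the dictionary so that an arrival state can never collide with the label of the current state.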
