Intrinsic rewards explain context-sensitive valuation in reinforcement learning

doi:10.1371/journal.pbio.3002201

Fig 1.

Top: The same outcome (getting chocolate ice cream) can lead to very different feelings of reward depending on the alternatives available at the time of choice. When chocolate is the best available option, it feels rewarding to get that flavor of ice cream, but when a better flavor (pistachio) is available, the feeling of reward for chocolate is dampened. This phenomenon may be explained through the intrinsic enhancement of absolute rewards based on goal achievement or failure (purple) or through a range adaptation mechanism (teal). In this situation, the 2 models make similar predictions but capture different cognitive processes. Bottom left: In a RL task, different outcomes (1 and 0.1) may feel similarly rewarding within their contexts when compared to a baseline (0) despite having different numeric values. Bottom right: The same outcome (0) may feel different in different contexts, where it is compared to different outcomes (1 and −1). RL, reinforcement learning.
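
The two accounts named in this caption can be illustrated with a minimal sketch applied to the numeric examples from the bottom panels. The function names, the `weight` parameter, and the exact functional forms below are assumptions made for illustration; the caption does not specify the models' equations, so this should not be read as the authors' implementation.

```python
def range_adapted(outcome, context_outcomes):
    """Rescale an outcome to [0, 1] relative to the range of its context.

    Assumed form: (outcome - min) / (max - min).
    """
    lo, hi = min(context_outcomes), max(context_outcomes)
    if hi == lo:
        return 0.5  # arbitrary convention for a degenerate range (assumption)
    return (outcome - lo) / (hi - lo)


def intrinsically_enhanced(outcome, context_outcomes, weight=0.5):
    """Mix the absolute outcome with a binary goal-achievement signal.

    Assumed form: weight * goal_achieved + (1 - weight) * outcome, where the
    goal counts as achieved when the outcome is the best one in its context.
    """
    goal_achieved = 1.0 if outcome == max(context_outcomes) else 0.0
    return weight * goal_achieved + (1.0 - weight) * outcome


# Bottom left: outcomes 1 and 0.1, each compared to a baseline of 0.
left_range = [range_adapted(1, [0, 1]), range_adapted(0.1, [0, 0.1])]

# Bottom right: the same outcome (0) compared to 1 in one context, -1 in another.
right_range = [range_adapted(0, [0, 1]), range_adapted(0, [-1, 0])]
right_intrinsic = [
    intrinsically_enhanced(0, [0, 1]),
    intrinsically_enhanced(0, [-1, 0]),
]
```

Under these assumptions, both outcomes in the bottom-left example map to the same range-adapted value, while the outcome 0 in the bottom-right example receives a higher value in the context where it is the better option, under either transformation.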

Table 1.

Summary information for each of the data sets used for data analysis and/or modeling.

Fig 2.

(A) Model validation by experimental condition and context with parameters extracted through Laplacian estimation [30], showing the simulated performance yielded by the intrinsically enhanced model (in purple) and the range adaptation model (in teal). Participants’ data are shown as gray bars. Contexts 1–4 refer to the learning phase and contexts 5–8 to the test phase. Overall, both the intrinsically enhanced model and the range adaptation model captured participants’ behavior relatively well, with the former outperforming the latter. Abbreviations: the first letter in each triplet indicates whether feedback was partial (P) or complete (C) during learning; the second letter indicates whether feedback in the test phase was partial (P), complete (C), or not provided (N); the third letter indicates whether the experimental design was interleaved (I) or blocked (B). Error bars indicate the SEM. (B) Model responsibilities overall and across experimental conditions. Data underlying this figure are available at https://github.com/hrl-team/range. Computational modeling scripts to produce the illustrated results are available at https://osf.io/sfnc9/.
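
The three-letter condition codes defined in this caption can be decoded mechanically. The helper below is a small sketch; the function name, the dictionary keys, and the example code `"PNI"` are hypothetical and chosen only for illustration, not taken from the data set.

```python
# Letter meanings as given in the caption.
LEARNING_FEEDBACK = {"P": "partial", "C": "complete"}
TEST_FEEDBACK = {"P": "partial", "C": "complete", "N": "not provided"}
DESIGN = {"I": "interleaved", "B": "blocked"}


def decode_condition(code):
    """Translate a triplet such as 'PNI' into a readable description."""
    if len(code) != 3:
        raise ValueError(f"expected a three-letter code, got {code!r}")
    learning, test, design = code.upper()
    try:
        return {
            "learning_feedback": LEARNING_FEEDBACK[learning],
            "test_feedback": TEST_FEEDBACK[test],
            "design": DESIGN[design],
        }
    except KeyError as exc:
        raise ValueError(f"unrecognized letter in code {code!r}") from exc


example = decode_condition("PNI")
```

Here `example` describes partial feedback during learning, no feedback in the test phase, and an interleaved design.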

Fig 3.

(A) Model validation by context with parameters extracted through Laplacian estimation, showing the simulated performance yielded by the intrinsically enhanced model (in purple) and the range adaptation model (in teal), overlaid with the data (gray bars). The intrinsically enhanced model outperformed the range model in capturing participants’ behavior in the test phase, although both expressed the key behavioral pattern displayed by participants. Error bars indicate the SEM. (B) Model responsibilities across participants. Data underlying this figure are available at https://github.com/sophiebavard/Magnitude/. Computational modeling scripts to produce the illustrated results are available at https://osf.io/sfnc9/.

Fig 4.

(A) Model validation by context with parameters extracted through Laplacian estimation, showing the simulated performance yielded by the intrinsically enhanced model (in purple) and the range adaptation model (in teal), overlaid with the data (gray bars). The intrinsically enhanced model outperformed the range model in capturing participants’ behavior in the test phase, although both expressed the key behavioral pattern displayed by participants. Error bars indicate the SEM. (B) Model responsibilities across participants. Data underlying this figure are available at https://osf.io/8zx2b/. Computational modeling scripts to produce the illustrated results are available at https://osf.io/sfnc9/.

Fig 5.

Predictions made by the unbiased, intrinsically enhanced, and range^z models.

(A) The unbiased model correctly predicts lower choice rates for high-value and mid-value options in wide than in narrow contexts (upper gray box), but incorrectly predicts similar choice rates for the option with value 50 regardless of context (lower gray box). (B) The intrinsically enhanced model captures all behavioral patterns found in participants’ data [13]. It correctly predicts lower choice rates for high-value and mid-value options in wide than in narrow contexts (upper purple box) and correctly predicts higher choice rates for the option with value 50 in the trinary narrow context than in the trinary wide context (lower purple box). It also predicts that choice rates for mid-value options will be closer to those of low-value options than high-value options (lower purple box). (C) The range^z model correctly predicts higher choice rates for the option with value 50 in the trinary narrow context than in the trinary wide context (lower teal box), but incorrectly predicts similar choice rates for high-value options in the narrow and wide trinary contexts (upper teal box). Simulation scripts used to produce this figure are available at https://osf.io/sfnc9/.

Fig 6.

Left: Task structure. Participants viewed the available options, indicated their choice with a mouse click, and then viewed each option’s outcome, with the outcome of their chosen option highlighted. Right: Experimental design. Both context 1 (top row) and context 2 (bottom row) contained 3 options, with mean values of 14, 50, and 86. The contexts differed in the frequency with which different combinations of within-context stimuli were presented during the learning phase (gray shaded area). In particular, while option M₁ (EV = 50) was presented 20 times with option L₁ (EV = 14), option M₂ (EV = 50) was presented 20 times with option H₂ (EV = 86). Intuitively, this made M₁ a more frequent intrinsically rewarding outcome than M₂. The 2 contexts were otherwise matched.
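
The intuition behind this design can be sketched by counting how often each mid-value option is the better option in the pairings named above. This is a simplified illustration: it compares expected values rather than sampled outcomes, it covers only the two pairings stated in the caption, and the function and variable names are hypothetical.

```python
# Expected values of the low, mid, and high options, as given in the caption.
EXPECTED_VALUE = {"L": 14, "M": 50, "H": 86}


def fraction_best(option, pairings):
    """Fraction of paired trials in which `option` has the higher expected value.

    `pairings` is a list of (opponent, n_trials) tuples.
    """
    total = sum(n for _, n in pairings)
    wins = sum(
        n for opponent, n in pairings
        if EXPECTED_VALUE[option] > EXPECTED_VALUE[opponent]
    )
    return wins / total


# Context 1: M1 is shown 20 times with L1.
m1_best = fraction_best("M", [("L", 20)])

# Context 2: M2 is shown 20 times with H2.
m2_best = fraction_best("M", [("H", 20)])
```

In these pairings the mid-value option is always the better option in context 1 and never the better option in context 2, even though its expected value (50) is identical in both.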

Fig 7.

Ex ante model predictions based on simulations, for the intrinsically enhanced (A) and range^z (B) models. Contexts 1 and 2 are shown in dark and light gray, respectively. The core prediction that differentiates intrinsically enhanced and range models is that participants will have a bias in favor of the middle option from context 1, compared to the middle option from context 2 (compare the purple and teal boxes). Simulation scripts used to produce this figure are available at https://osf.io/sfnc9/.

Fig 8.

Behavioral results and computational modeling support the intrinsically enhanced model.

(A) During the test phase, the mid-value option of context 1 (darker gray) was chosen more often than the mid-value option of context 2 (lighter gray), a pattern that was also evident in the behavior of the intrinsically enhanced model, but not in that of the range^z model. (B) Difference in test phase choice rates between stimulus M₁ and M₂. (C) When the 2 mid-value options were pitted against each other, participants preferred the one from context 1. When either was pitted against a low-value option, participants selected the mid-value option from context 1 more often than the mid-value option from context 2. When either was pitted against a high-value option, participants selected the high-value option from context 1 less often than the high-value option from context 2. The dotted line indicates chance level (0.5). (D) Difference between M₁ and M₂ in the proportion of times the option was chosen when compared to either L₁ or L₂. All these behavioral signatures were preregistered and were predicted by the intrinsically enhanced model, but not by the range adaptation model. (E) Participants explicitly reported the mid-value option of context 1 as having a higher value than the mid-value option of context 2. (F) Differences in explicit ratings between option M₁ and M₂. (G) Model fitting favors the intrinsically enhanced model over the range^z model, as evidenced by higher responsibility across participants for the former than for the latter. Data and analysis scripts underlying this figure are available at https://osf.io/sfnc9/.
