Humans combine value learning and hypothesis testing strategically in multi-dimensional probabilistic reward learning

doi:10.1371/journal.pcbi.1010699

Fig 1.

The “build your own icon” task.

Participants built stimuli by selecting a feature in zero to three dimensions (marked by black squares). After they hit “Done”, the stimulus appeared on the screen, with a feature chosen at random for any dimension in which the participant did not make a selection (in this example, the circle was randomly determined). Reward feedback was then shown.

Table 1.

The reward probability of a stimulus in each game type (1D, 2D, and 3D-relevant games) was determined by the number of rewarding features in the stimulus.

Each row corresponds to one game type. Across all game types, the reward probabilities were 20% if the stimulus contained no rewarding features, 80% if it contained all rewarding features, and linear interpolations between 20% and 80% if it contained a subset of rewarding features. For example, in a 3D-relevant game, if the stimulus contained two of the three rewarding features, the reward probability for that trial would be 60%. These probabilities guarantee that a participant who performs randomly would have 40% probability of obtaining a reward across all game types. This can be seen by calculating, for each game type, the chance of randomly choosing a certain number of rewarding features, multiplied by the corresponding reward probability. Equal chance probability across game types ensured that chance behavior would not be informative about the number of relevant dimensions in unknown games.
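The chance-level calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the value of three features per dimension is an assumption that is not stated in this caption.

```python
from math import comb

P_MIN, P_MAX = 0.2, 0.8      # reward probability with none / all rewarding features
N_FEATURES_PER_DIM = 3       # ASSUMPTION: three features per dimension


def reward_probability(n_rewarding, n_relevant):
    """Linear interpolation between P_MIN and P_MAX."""
    return P_MIN + (P_MAX - P_MIN) * n_rewarding / n_relevant


def chance_reward_probability(n_relevant, n_features=N_FEATURES_PER_DIM):
    """Expected reward probability when every feature is chosen uniformly at random."""
    p_hit = 1 / n_features   # chance of landing on the rewarding feature in one dimension
    return sum(
        comb(n_relevant, k) * p_hit**k * (1 - p_hit) ** (n_relevant - k)
        * reward_probability(k, n_relevant)
        for k in range(n_relevant + 1)
    )


if __name__ == "__main__":
    for n_relevant in (1, 2, 3):
        print(f"{n_relevant}D-relevant: chance = {chance_reward_probability(n_relevant):.2f}")
```

Under this assumption, the number of rewarding features in a random stimulus is binomial, and the expected reward probability is the same for all three game types.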

Fig 2.

Participants’ behavior in the “build your own icon” task.

(A, B): Performance and choices over the course of a game, by game type. (A) Participants’ average probability of reward (based on the number of rewarding features in their configured stimuli) over the course of 1D-, 2D- and 3D-relevant games (left, middle and right columns). Red and blue curves represent “known” and “unknown” conditions, respectively. For all game types, the chance reward probability is 0.4 and the maximum reward probability is 0.8. Shading (ribbons around the lines) represents ±1 s.e.m. across participants. ** p < .01. For grouping of these learning curves by task complexity, see S1 Fig. (B) Same as in (A), but for the number of features selected. (C, D): Responses to post-game questions regarding the rewarding features in each game condition. (C) Average number of correctly identified rewarding features; (D) Average number of false positive responses, i.e., falsely identifying an irrelevant dimension as relevant. *** p < .001. Error bars represent ±1 s.e.m. across participants.

Fig 3.

A diagram of the serial hypothesis testing models.

Fig 4.

Model comparison supports both reinforcement learning (RL) and serial hypothesis testing (SHT) strategies.

(A) Geometric average likelihood per trial for each model (i.e., the total log likelihood divided by the number of trials and then exponentiated). Higher values indicate better model fits. Dashed lines indicate chance. Error bars represent ±1 s.e.m. across participants. (B, C) Simulation of the best-fitting value-based SHT model: the same learning curves as in Fig 2, but for data simulated from the model.
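The per-trial measure in (A) can be written as a short Python function. This is a sketch for one participant under the definition given in the caption; the function name and the input format (a list of per-trial log likelihoods) are assumptions, not taken from the paper.

```python
import math


def geometric_mean_likelihood(trial_log_likelihoods):
    """Exponentiate the mean per-trial log likelihood.

    `trial_log_likelihoods` is a hypothetical input: one natural-log
    likelihood per trial, as assigned by a fitted model.
    """
    n_trials = len(trial_log_likelihoods)
    if n_trials == 0:
        raise ValueError("at least one trial is required")
    return math.exp(sum(trial_log_likelihoods) / n_trials)


if __name__ == "__main__":
    # Illustrative values only: two trials predicted with probability 0.5 and 0.125.
    example = [math.log(0.5), math.log(0.125)]
    print(geometric_mean_likelihood(example))
```

Working in log space avoids numerical underflow when many small per-trial probabilities are multiplied, and the result stays on the same scale as a single-trial choice probability, so it can be compared directly with the chance level.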

Fig 5.

Strategic balance of two learning mechanisms.

(A) The contribution of serial hypothesis testing (SHT) was inversely correlated with reaction time, such that participants who responded faster used SHT to a greater extent. (B) The contribution of reinforcement learning (RL) was correlated with average reward rate: participants for whom adding the RL component improved the model fit to a greater extent earned more rewards on the task, on average. Each dot represents one participant. (C, D) Contribution of RL and SHT for each game type. The contribution of each component was measured as the difference in likelihood per trial between the hybrid value-based SHT model and the model that lacks that component (for the SHT contribution, the feature RL with decay model; for the RL contribution, the random-switch SHT model). Error bars represent ±1 s.e.m. across participants.
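The contribution measure in (C, D) amounts to two subtractions per participant. The sketch below is illustrative only: the function and argument names are hypothetical, and the numbers in the usage example are made up rather than taken from the paper.

```python
def component_contributions(hybrid, rl_only, sht_only):
    """Difference in likelihood per trial between the hybrid model and each reduced model.

    hybrid   : likelihood per trial of the hybrid value-based SHT model
    rl_only  : likelihood per trial of the feature RL with decay model (no SHT)
    sht_only : likelihood per trial of the random-switch SHT model (no RL)
    """
    return {
        "SHT": hybrid - rl_only,   # what adding SHT to the RL-only model gains
        "RL": hybrid - sht_only,   # what adding RL to the SHT-only model gains
    }


if __name__ == "__main__":
    # Made-up values for a single hypothetical participant.
    print(component_contributions(hybrid=0.30, rl_only=0.25, sht_only=0.27))
```

A larger value means that removing the component hurts the fit more for that participant, which is the quantity plotted against reaction time and reward rate in (A) and (B).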
