The pursuit of happiness: A reinforcement learning perspective on habituation and comparisons

doi:10.1371/journal.pcbi.1010316

Fig 5

Results of the multi-armed bandit experiments.

(a) 10-armed bandit simulation where the means of the 9 sub-optimal arms are drawn from a uniform distribution on the range [−1, 0.9]. The graph plots how frequently the agents select the best arm over their lifetimes. The ‘Fixed Compare’ agent learns faster than the ‘Objective only’ agent and selects the optimal action at a higher rate, especially early in training. The ‘Dynamic Compare’ agent selects the optimal action at a higher rate than both of these agents throughout its lifetime. (b) Bandit task where the arm means are very close to each other. Here, the comparison-based agents and the ‘Objective only’ agent select the optimal action at a similar rate throughout their lifetimes (and the UCB agent selects the optimal action at a higher rate). (c) Average subjective reward of the agents in the previous bandit task. Compared to the ‘Objective only’ and UCB agents, the comparison-based agents experience lower subjective rewards (due to their aspiration level). This seems needless, since comparisons do not help the agents make better choices in this task. (d) Non-stationary bandit task where the reward distribution changes abruptly during the agent’s lifetime. Compared to the ‘Objective only’ agent, the comparison-based agents select the optimal action at a higher frequency, especially after step 2500, i.e., after the environment changes. (e) Non-stationary bandit task where the reward distribution changes constantly during the agent’s lifetime. Early in training, the ‘Fixed Compare’ agent selects the optimal action at a relatively good rate, but it is then comfortably outperformed by the other agents. The rising aspirations of the ‘Dynamic Compare’ agent allow it to adapt to the changes in the environment, and it selects the optimal action at a very high rate throughout its lifetime. (f) Despite accumulating high objective rewards, the subjective rewards experienced by the ‘Dynamic Compare’ agent keep decreasing due to its constantly increasing aspiration.

doi: https://doi.org/10.1371/journal.pcbi.1010316.g005
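
For readers who want to experiment with the setup in panel (a), the following is a minimal sketch, not the authors' implementation. Only the number of arms and the [−1, 0.9] range for the sub-optimal arm means come from the caption. Everything else is an assumption made for illustration: the optimal arm's mean (`best_mean=1.0`), Gaussian reward noise, ε-greedy action selection, the learning rate, and a subjective reward defined simply as the objective reward minus a fixed `aspiration`. The function names `make_bandit` and `run_agent` are hypothetical.

```python
import numpy as np


def make_bandit(rng, n_arms=10, best_mean=1.0, low=-1.0, high=0.9):
    """Return arm means: sub-optimal means ~ Uniform[low, high], one optimal arm.

    best_mean is an assumed value; the caption only gives the sub-optimal range.
    """
    means = rng.uniform(low, high, size=n_arms)
    means[rng.integers(n_arms)] = best_mean  # overwrite one arm as the optimal arm
    return means


def run_agent(means, rng, steps=1000, epsilon=0.1, alpha=0.1,
              aspiration=0.0, noise_std=1.0):
    """Epsilon-greedy agent that learns from a subjective reward.

    Assumption: subjective reward = objective reward - aspiration.
    aspiration=0.0 corresponds to learning from the objective reward alone.
    """
    n_arms = len(means)
    q = np.zeros(n_arms)                      # value estimates, initialised at 0
    best = int(np.argmax(means))              # index of the optimal arm
    picked_best = np.zeros(steps, dtype=bool)
    subjective = np.zeros(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(n_arms))     # explore uniformly at random
        else:
            a = int(np.argmax(q))             # exploit (ties go to the lowest index)
        r = rng.normal(means[a], noise_std)   # objective reward
        s = r - aspiration                    # subjective reward after comparison
        q[a] += alpha * (s - q[a])            # incremental value update
        picked_best[t] = (a == best)
        subjective[t] = s
    return picked_best, subjective


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = make_bandit(rng)
    for asp in (0.0, 0.5):
        picked, subj = run_agent(means, rng, aspiration=asp)
        print(f"aspiration={asp}: optimal-arm rate={picked.mean():.2f}, "
              f"mean subjective reward={subj.mean():.2f}")
```

Under these assumptions, a positive aspiration makes mediocre arms look disappointing relative to the zero-initialised estimates of untried arms, which pushes the greedy choice toward arms that have not yet been sampled. This is one plausible reading of why a comparison-based agent might find the best arm sooner, at the cost of lower subjective reward.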