Reward-predictive representations generalize across tasks in reinforcement learning

doi:10.1371/journal.pcbi.1008317


Fig 7

Transferring state representations influences learning speed on the maze curriculum.

(A) Performance comparison of each learning algorithm that uses Q-learning to obtain an optimal policy. The reward-predictive model identifies two state abstractions and re-uses them in tasks 3 through 5, resulting in faster learning than the reward-maximizing model. (B) Performance comparison of each learning algorithm that uses SF-learning to obtain an optimal policy. As in (A), the reward-predictive model identifies two state abstractions and re-uses them in tasks 3 through 5. Re-using previously learned SFs across tasks (orange curve) degrades performance. (A, B) Each experiment was repeated ten times and the average across all repeats is plotted. The shaded areas indicate the standard error of the mean. For each experiment, different learning rates and hyper-parameter settings were tested, and the settings resulting in the lowest average episode length are plotted. Supporting S3 Text describes the tested implementation and hyper-parameters in detail. (C, D) Plot of the posterior distribution as a function of training episode. The orange rectangle indicates tasks in which the agent used the identity abstraction to learn a new state representation, which was added to the belief set after 200 episodes of learning.
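The averaging described for panels (A, B) can be illustrated with a minimal sketch that computes, for each episode, the mean episode length across repeats and its standard error. The function name `mean_and_sem`, the input layout (one list of episode lengths per repeat), and the use of the sample standard deviation are assumptions made for illustration; they are not taken from the paper or from S3 Text.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(episode_lengths):
    """Per-episode mean and standard error across repeats.

    episode_lengths: list of repeats, each a list with one episode
    length per training episode (hypothetical layout). All repeats
    must have the same number of episodes, and there must be at
    least two repeats.
    """
    n_repeats = len(episode_lengths)
    means, sems = [], []
    # zip(*...) groups the values of the same episode across repeats.
    for values in zip(*episode_lengths):
        means.append(mean(values))
        # Assumption: sample standard deviation (n - 1 denominator).
        sems.append(stdev(values) / sqrt(n_repeats))
    return means, sems


# Toy data: three repeats of two episodes each (not from the paper).
means, sems = mean_and_sem([[10, 4], [14, 4], [12, 4]])
print(means, sems)
```

In a plot of this kind, the curve would follow `means` and the shaded band would span `means[i] - sems[i]` to `means[i] + sems[i]` for each episode `i`.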

doi: https://doi.org/10.1371/journal.pcbi.1008317.g007