Reward-predictive representations generalize across tasks in reinforcement learning
Fig 1
State-abstraction examples, adapted from [9].
(A) The column world task is a 3 × 3 grid world where an agent can move up (↑), down (↓), left (←), or right (→). A reward of +1 is given when the right column is entered from the centre column by selecting the action “move right” (→). (B) A reward-predictive state representation generalizes across columns (but not rows) and compresses the 3 × 3 grid world into a 3 × 1 grid world with three latent states labelled ϕ1, ϕ2, and ϕ3. In this compressed task, only the transition from the centre orange state ϕ2 to the right green state ϕ3 is rewarded. (C) A reward-maximizing state representation compresses all states into a single latent state. In the 3 × 3 grid, there are three of the nine locations at which an agent can receive a reward by selecting the action “move right” (→). If states are averaged uniformly to construct the one-state compressed task, then the “move right” action receives an expected reward of 1/3 and all other actions receive no reward. In this case, an optimal policy can still be found using the compressed task, but accurate reward predictions are no longer possible.
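The contrast between the two abstractions can be made concrete with a short sketch. The following Python snippet, a minimal illustration of the Fig 1 example and not code from the paper, encodes the 3 × 3 column world and two candidate state abstractions (the names `phi_columns`, `phi_single`, and the (row, col) indexing are assumptions made here for illustration). It checks that the column abstraction assigns identical one-step rewards to every state in a latent class, and that uniform averaging under the one-state abstraction yields the 1/3 expected reward described in panel (C).

```python
# Minimal sketch of the Fig 1 column world (illustrative; not the paper's code).
# States are (row, col) pairs with col 0 = left, col 1 = centre, col 2 = right.

ACTIONS = ["up", "down", "left", "right"]

def step(state, action):
    """Deterministic transitions in the 3 x 3 grid; moves off-grid are blocked."""
    row, col = state
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, 2)
    elif action == "left":
        col = max(col - 1, 0)
    elif action == "right":
        col = min(col + 1, 2)
    return (row, col)

def reward(state, action):
    """+1 only when 'right' is selected in the centre column (panel A)."""
    return 1.0 if (state[1] == 1 and action == "right") else 0.0

# Reward-predictive abstraction (panel B): states collapse by column.
phi_columns = lambda state: state[1]   # three latent states: 0, 1, 2

# Reward-maximizing abstraction (panel C): all states collapse to one.
phi_single = lambda state: 0           # a single latent state

states = [(r, c) for r in range(3) for c in range(3)]

# Under phi_columns, every state in a latent class earns the same reward
# for every action, so one-step rewards remain exactly predictable.
for latent in range(3):
    members = [s for s in states if phi_columns(s) == latent]
    per_action = {a: {reward(s, a) for s in members} for a in ACTIONS}
    assert all(len(v) == 1 for v in per_action.values())

# Under phi_single, averaging uniformly over all nine states gives an
# expected reward of 3/9 = 1/3 for 'right' and 0 for every other action.
avg = {a: sum(reward(s, a) for s in states) / len(states) for a in ACTIONS}
print(avg)  # {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.333...}
```

The assertions pass for `phi_columns` because the reward depends only on the column, which is exactly why the 3 × 1 compressed task in panel (B) supports accurate reward prediction, whereas `phi_single` preserves only the ranking of actions (move right is best), which suffices for an optimal policy but not for predicting rewards.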