Reward-predictive representations generalize across tasks in reinforcement learning
Fig 1
State-abstraction examples, adapted from [9].
(A) The column world task is a 3 × 3 grid world where an agent can move up (↑), down (↓), left (←), or right (→). A reward of +1 is given when the right column is entered from the centre column by selecting the action “move right” (→). (B) A reward-predictive state representation generalizes across columns (but not rows) and compresses the 3 × 3 grid world into a 3 × 1 grid world with three latent states labelled ϕ1, ϕ2, and ϕ3. In this compressed task, only the transition from the centre orange state ϕ2 to the right green state ϕ3 is rewarded. (C) A reward-maximizing state representation compresses all states into a single latent state. In the 3 × 3 grid, there are three of the nine locations at which an agent can receive a reward by selecting the action “move right” (→). If states are averaged uniformly to construct the one-state compressed task, then the “move right” action receives an expected reward of 1/3 and all other actions receive no reward. In this case, an optimal policy can still be found using the compressed task, but accurate reward predictions are no longer possible.
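The contrast between the two abstractions can be made concrete with a short sketch. The following Python snippet, a minimal illustration of the Fig 1 example and not code from the paper, encodes the 3 × 3 column world and two candidate state abstractions (the names `phi_columns`, `phi_single`, and the (row, col) indexing are assumptions made here for illustration). It checks that the column abstraction assigns identical one-step rewards to every state in a latent class, and that uniform averaging under the one-state abstraction yields the 1/3 expected reward described in panel (C).

```python
# Minimal sketch of the Fig 1 column world (illustrative; not the paper's code).
# States are (row, col) pairs with col 0 = left, col 1 = centre, col 2 = right.

ACTIONS = ["up", "down", "left", "right"]

def step(state, action):
    """Deterministic transitions in the 3 x 3 grid; moves off-grid are blocked."""
    row, col = state
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, 2)
    elif action == "left":
        col = max(col - 1, 0)
    elif action == "right":
        col = min(col + 1, 2)
    return (row, col)

def reward(state, action):
    """+1 only when 'right' is selected in the centre column (panel A)."""
    return 1.0 if (state[1] == 1 and action == "right") else 0.0

# Reward-predictive abstraction (panel B): states collapse by column.
phi_columns = lambda state: state[1]   # three latent states: 0, 1, 2

# Reward-maximizing abstraction (panel C): all states collapse to one.
phi_single = lambda state: 0           # a single latent state

states = [(r, c) for r in range(3) for c in range(3)]

# Under phi_columns, every state in a latent class earns the same reward
# for every action, so one-step rewards remain exactly predictable.
for latent in range(3):
    members = [s for s in states if phi_columns(s) == latent]
    per_action = {a: {reward(s, a) for s in members} for a in ACTIONS}
    assert all(len(v) == 1 for v in per_action.values())

# Under phi_single, averaging uniformly over all nine states gives an
# expected reward of 3/9 = 1/3 for 'right' and 0 for every other action.
avg = {a: sum(reward(s, a) for s in states) / len(states) for a in ACTIONS}
print(avg)  # {'up': 0.0, 'down': 0.0, 'left': 0.0, 'right': 0.333...}
```

The assertions pass for `phi_columns` because the reward depends only on the column, which is exactly why the 3 × 1 compressed task in panel (B) supports accurate reward prediction, whereas `phi_single` preserves only the ranking of actions (move right is best), which suffices for an optimal policy but not for predicting rewards.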