Reward-predictive representations generalize across tasks in reinforcement learning
Fig 4
Transfer curriculum with multiple state abstractions.
(A) A curriculum of transfer tasks is generated by first constructing a three-state MDP. At each state, only one action causes a transition to a different state. Only one state-to-state transition is rewarded; the optimal policy is to select the correct action needed to cycle between the node states. (B) To generate a sequence of abstract MDPs, the action labels and the transition generating positive reward are randomly permuted (similar to the Diabolical Rooms Problem [3]). Two hidden state abstractions ϕ_A and ϕ_B were randomly selected to “inflate” each abstract MDP to a nine-state problem. One state abstraction was used with a frequency of 75% and the other with a frequency of 25%. The resulting MDP sequence M_1, …, M_20 was presented to the agent without any information about which state abstraction was used to construct each task.
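The construction described in (A) and (B) can be summarized in a short sketch. The helper names (make_abstract_mdp, make_state_abstraction, inflate), the tabular data layout, and the random seeding below are illustrative assumptions rather than the authors' implementation; the sketch only mirrors the procedure in the caption: build the three-state cycle task, randomly permute action labels and the rewarded transition, and inflate each task to nine states with one of two fixed hidden state abstractions (used 75% and 25% of the time) to produce a sequence of 20 tasks.

```python
# Minimal sketch of the curriculum-generation procedure (panels A and B).
# All function names and data layouts are assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

N_ABSTRACT = 3       # three-state cycle MDP (panel A)
N_ACTIONS = 3        # one "correct" action per state
STATES_PER_NODE = 3  # each abstract state is inflated to 3 ground states (panel B)

def make_abstract_mdp():
    """Three-state cycle: in each state exactly one action advances the cycle;
    exactly one of those cycle transitions yields positive reward."""
    correct_action = rng.permutation(N_ACTIONS)   # randomly permuted action labels
    rewarded_state = rng.integers(N_ABSTRACT)     # which cycle transition pays off
    T = np.zeros((N_ABSTRACT, N_ACTIONS), dtype=int)
    R = np.zeros((N_ABSTRACT, N_ACTIONS))
    for s in range(N_ABSTRACT):
        for a in range(N_ACTIONS):
            # only the correct action moves to the next state; others self-loop
            T[s, a] = (s + 1) % N_ABSTRACT if a == correct_action[s] else s
        R[s, correct_action[s]] = 1.0 if s == rewarded_state else 0.0
    return T, R

def make_state_abstraction():
    """Random map from 9 ground states onto 3 abstract states,
    with 3 ground states per abstract state."""
    ground = rng.permutation(N_ABSTRACT * STATES_PER_NODE)
    phi = np.empty(N_ABSTRACT * STATES_PER_NODE, dtype=int)
    for s in range(N_ABSTRACT):
        phi[ground[s * STATES_PER_NODE:(s + 1) * STATES_PER_NODE]] = s
    return phi  # phi[ground_state] -> abstract_state

def inflate(T, R, phi):
    """Lift an abstract MDP to a nine-state MDP; transitions out of a ground
    state land on a random ground state of the abstract successor."""
    n_ground = len(phi)
    T_g = np.zeros((n_ground, N_ACTIONS), dtype=int)
    R_g = np.zeros((n_ground, N_ACTIONS))
    for s_g in range(n_ground):
        for a in range(N_ACTIONS):
            successors = np.flatnonzero(phi == T[phi[s_g], a])
            T_g[s_g, a] = rng.choice(successors)
            R_g[s_g, a] = R[phi[s_g], a]
    return T_g, R_g

# Two hidden abstractions; phi_A is used for 75% of tasks, phi_B for 25%.
phi_A, phi_B = make_state_abstraction(), make_state_abstraction()
tasks = []
for _ in range(20):
    phi = phi_A if rng.random() < 0.75 else phi_B
    tasks.append(inflate(*make_abstract_mdp(), phi))
```

Because the abstraction used for each task is never revealed, an agent can only exploit the shared nine-to-three structure by inferring which of the two latent state abstractions best explains the observed rewards.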