Reward-predictive representations generalize across tasks in reinforcement learning
Fig 4
Transfer curriculum with multiple state abstractions.
(A) A curriculum of transfer tasks is generated by first constructing a three-state MDP. At each state, only one action causes a transition to a different state. Only one state-to-state transition is rewarded; the optimal policy is to select the correct action needed to cycle between the node states. (B) To generate a sequence of abstract MDPs, the action labels and the transition generating positive reward are randomly permuted (similar to the Diabolical Rooms Problem [3]). Two hidden state abstractions ϕ_A and ϕ_B were randomly selected to “inflate” each abstract MDP to a nine-state problem. One state abstraction was used with a frequency of 75% and the other with a frequency of 25%. The resulting MDP sequence M_1, …, M_20 was presented to the agent without any information about which state abstraction was used to construct each task.
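The construction described in (A) and (B) can be summarized in a short sketch. The helper names (make_abstract_mdp, make_state_abstraction, inflate), the tabular data layout, and the random seeding below are illustrative assumptions rather than the authors' implementation; the sketch only mirrors the procedure in the caption: build the three-state cycle task, randomly permute action labels and the rewarded transition, and inflate each task to nine states with one of two fixed hidden state abstractions (used 75% and 25% of the time) to produce a sequence of 20 tasks.

```python
# Minimal sketch of the curriculum-generation procedure (panels A and B).
# All function names and data layouts are assumptions, not the paper's code.
import numpy as np

rng = np.random.default_rng(0)

N_ABSTRACT = 3       # three-state cycle MDP (panel A)
N_ACTIONS = 3        # one "correct" action per state
STATES_PER_NODE = 3  # each abstract state is inflated to 3 ground states (panel B)

def make_abstract_mdp():
    """Three-state cycle: in each state exactly one action advances the cycle;
    exactly one of those cycle transitions yields positive reward."""
    correct_action = rng.permutation(N_ACTIONS)   # randomly permuted action labels
    rewarded_state = rng.integers(N_ABSTRACT)     # which cycle transition pays off
    T = np.zeros((N_ABSTRACT, N_ACTIONS), dtype=int)
    R = np.zeros((N_ABSTRACT, N_ACTIONS))
    for s in range(N_ABSTRACT):
        for a in range(N_ACTIONS):
            # only the correct action moves to the next state; others self-loop
            T[s, a] = (s + 1) % N_ABSTRACT if a == correct_action[s] else s
        R[s, correct_action[s]] = 1.0 if s == rewarded_state else 0.0
    return T, R

def make_state_abstraction():
    """Random map from 9 ground states onto 3 abstract states,
    with 3 ground states per abstract state."""
    ground = rng.permutation(N_ABSTRACT * STATES_PER_NODE)
    phi = np.empty(N_ABSTRACT * STATES_PER_NODE, dtype=int)
    for s in range(N_ABSTRACT):
        phi[ground[s * STATES_PER_NODE:(s + 1) * STATES_PER_NODE]] = s
    return phi  # phi[ground_state] -> abstract_state

def inflate(T, R, phi):
    """Lift an abstract MDP to a nine-state MDP; transitions out of a ground
    state land on a random ground state of the abstract successor."""
    n_ground = len(phi)
    T_g = np.zeros((n_ground, N_ACTIONS), dtype=int)
    R_g = np.zeros((n_ground, N_ACTIONS))
    for s_g in range(n_ground):
        for a in range(N_ACTIONS):
            successors = np.flatnonzero(phi == T[phi[s_g], a])
            T_g[s_g, a] = rng.choice(successors)
            R_g[s_g, a] = R[phi[s_g], a]
    return T_g, R_g

# Two hidden abstractions; phi_A is used for 75% of tasks, phi_B for 25%.
phi_A, phi_B = make_state_abstraction(), make_state_abstraction()
tasks = []
for _ in range(20):
    phi = phi_A if rng.random() < 0.75 else phi_B
    tasks.append(inflate(*make_abstract_mdp(), phi))
```

Because the abstraction used for each task is never revealed, an agent can only exploit the shared nine-to-three structure by inferring which of the two latent state abstractions best explains the observed rewards.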