Reward-predictive representations generalize across tasks in reinforcement learning
Fig 3
Minimizing reward-sequence prediction errors identifies state abstractions amenable to “deep transfer”.
For each task set (A, B, C), all possible state abstractions were enumerated using Algorithm U [24] to obtain a ground-truth distribution over the hypothesis space.
In each grid-world task (A, C), the agent can move up, down, left, or right to an adjacent grid cell. If the agent attempts to move off the grid or across one of the black barriers in (C), it remains at its current grid position. State abstractions were scored by compressing an MDP using the state abstraction of interest [6] and solving the compressed MDP to obtain a policy. The total-reward score was computed by running this policy 20 times for 10 time steps in the MDP from a randomly selected start state. The reward-sequence error was computed by selecting 20 random start states and then performing a random walk for 10 time steps from each. (D, E, F) The histograms report averages over all repeats and transfer MDPs for all state abstractions that are possible in a nine-state MDP. (G, H, I) The histograms report averages over all repeats and transfer MDPs for all state abstractions that compress nine states into three latent states. For each histogram, Welch’s t-test was performed to compute the p-value for the difference in mean total reward.
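A state abstraction over a finite state space is equivalent to a partition of the states into latent states, which is why a set-partition enumerator such as Algorithm U [24] applies here. The following is a minimal sketch of such an enumeration; it uses a simple recursive generator rather than Knuth’s Algorithm U itself, and the names are illustrative rather than taken from the paper’s code.

```python
from typing import Iterator, List

def partitions(items: List[int]) -> Iterator[List[List[int]]]:
    """Enumerate all set partitions of `items`.

    Each partition corresponds to one state abstraction: states that
    share a block are mapped to the same latent state.
    """
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for sub in partitions(rest):
        # Place `first` into each existing block in turn...
        for i in range(len(sub)):
            yield sub[:i] + [[first] + sub[i]] + sub[i + 1:]
        # ...or open a new block containing only `first`.
        yield [[first]] + sub

states = list(range(9))                       # the nine grid-world states
all_abstractions = list(partitions(states))
three_latent = [p for p in all_abstractions if len(p) == 3]

print(len(all_abstractions))  # 21147 = Bell(9): every possible abstraction (D-F)
print(len(three_latent))      # 3025 = S(9, 3): abstractions with three latent states (G-I)
```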
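The evaluation protocol (20 rollouts of 10 time steps for the total-reward score, and 20 random walks of 10 time steps for the reward-sequence error) can be illustrated with a small simulation. The sketch below uses a randomly generated tabular MDP, a random policy, and a constant placeholder reward prediction purely as stand-ins for the compressed-MDP solution and the abstract reward model described above; none of these stand-ins come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in tabular MDP with 9 states and 4 actions (up/down/left/right):
# P[s, a] is a distribution over next states, R[s, a] the reward.
n_states, n_actions = 9, 4
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def rollout(policy, start, horizon=10):
    """Simulate one trajectory and return its reward sequence."""
    s, rewards = start, []
    for _ in range(horizon):
        a = policy(s)
        rewards.append(R[s, a])
        s = rng.choice(n_states, p=P[s, a])
    return np.array(rewards)

# Total-reward score: 20 rollouts of 10 steps, each from a random start state.
random_policy = lambda s: rng.integers(n_actions)   # stand-in for the computed policy
starts = rng.integers(n_states, size=20)
total_reward = np.mean([rollout(random_policy, s).sum() for s in starts])

# Reward-sequence error: compare rewards observed along 20 random walks with
# the rewards predicted by the abstract model (here a constant placeholder).
predicted = np.full(10, R.mean())
seq_errors = [np.abs(rollout(random_policy, s) - predicted).mean()
              for s in rng.integers(n_states, size=20)]
reward_seq_error = float(np.mean(seq_errors))

print(total_reward, reward_seq_error)
```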
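Welch’s t-test is the two-sample t-test that does not assume equal variances; in SciPy it is obtained from `scipy.stats.ttest_ind` with `equal_var=False`. A minimal sketch with synthetic total-reward samples (not the paper’s data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic total-reward samples for two groups of state abstractions,
# e.g. low vs. high reward-sequence error.
rewards_group_a = rng.normal(loc=8.0, scale=1.5, size=200)
rewards_group_b = rng.normal(loc=6.5, scale=2.0, size=200)

# Welch's t-test for the difference in mean total reward.
t_stat, p_value = stats.ttest_ind(rewards_group_a, rewards_group_b,
                                  equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.3g}")
```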