Reward-predictive representations generalize across tasks in reinforcement learning

doi:10.1371/journal.pcbi.1008317

Fig 1.

State-abstraction examples, adapted from [9].

(A) The column world task is a 3 × 3 grid world where an agent can move up (↑), down (↓), left (←), or right (→). A reward of +1 is given when the right column is entered from the centre column by selecting the action “move right” (→). (B) A reward-predictive state representation generalizes across columns (but not rows) and compresses the 3 × 3 grid world into a 3 × 1 grid world with three latent states labelled with ϕ₁, ϕ₂, and ϕ₃. In this compressed task, only the transition moving from the centre orange state ϕ₂ to the right green state ϕ₃ is rewarded. (C) A reward-maximizing state representation compresses all states into one latent state. In the 3 × 3 grid, there are three out of nine locations where an agent can receive a reward by selecting the action “move right” (→). If states are averaged uniformly to construct the one-state compressed task, then the “move right” action receives an average reward of 1/3 and all other actions are not rewarded. In this case, an optimal policy can still be found using the compressed task, but accurate reward predictions are not possible.
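The column-world example can be reproduced in a few lines. The following is a minimal sketch, not the authors' implementation: the function names `step`, `predicts_rewards`, and `one_state_rewards` are hypothetical, and the consistency check used here (states sharing a latent state must agree on the reward and on the next latent state for every action) is assumed as a sufficient condition for reward prediction in this deterministic task.

```python
from fractions import Fraction

ACTIONS = ("up", "down", "left", "right")
STATES = [(row, col) for row in range(3) for col in range(3)]


def step(state, action):
    """Deterministic column-world dynamics; a state is a (row, col) pair."""
    row, col = state
    if action == "up":
        row = max(row - 1, 0)
    elif action == "down":
        row = min(row + 1, 2)
    elif action == "left":
        col = max(col - 1, 0)
    else:  # "right"
        col = min(col + 1, 2)
    # +1 only when the right column is entered from the centre column.
    reward = 1 if (state[1] == 1 and action == "right") else 0
    return (row, col), reward


def predicts_rewards(phi):
    """Check that states with the same latent state agree on the reward and
    on the next latent state for every action (assumed sufficient check)."""
    for action in ACTIONS:
        seen = {}
        for state in STATES:
            next_state, reward = step(state, action)
            outcome = (phi(next_state), reward)
            if seen.setdefault(phi(state), outcome) != outcome:
                return False
    return True


def one_state_rewards():
    """Uniformly average each action's reward over all nine states."""
    return {
        action: Fraction(sum(step(s, action)[1] for s in STATES), len(STATES))
        for action in ACTIONS
    }


print(predicts_rewards(lambda s: s[1]))  # one latent state per column
print(predicts_rewards(lambda s: 0))     # all states merged into one
print(one_state_rewards()["right"])      # averaged reward for "move right"
```

Under these assumptions the column abstraction passes the check, the one-state abstraction fails it, and the averaged reward for “move right” is 3/9 = 1/3, matching panel (C).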

Fig 2.

Transferring state abstractions between MDPs.

(A) In both grid-world tasks, the agent can move up (↑), down (↓), left (←), or right (→) and is rewarded when a reward column is entered. The black line indicates a barrier the agent cannot pass. Task A and Task B differ in their rewards and transitions, because a different column is rewarded and the barrier is placed at different locations. (B) A reward-predictive state representation generalizes across different columns and the corresponding SFs are plotted below in (D). (D) Each row in the shown matrix plots visualizes the entries of a three-dimensional SF vector. Similar to the example in Fig 1, a reward-predictive state abstraction merges each column into one latent state, as indicated by the colouring. In both tasks, reward sequences can be predicted using the compressed representation for any arbitrary start state and action sequence, similar to Fig 1B. In this case the agent simply needs to learn a different policy for Task B using the same compressed representation. In contrast, the matrix plots in the bottom panels illustrate that SFs are different in each task and cannot be immediately reused in this example (because SFs are computed for the optimal policy, which is different in each task [14]). Note that states that belong to the same column have equal SF weights (as indicated by the coloured boxes). LSFMs construct a reward-predictive state representation by merging states with equal SFs into the same state partition. This algorithm is described in supporting S3 Text and prior work [9]. (C) One possible reward-maximizing state abstraction may generalize across all states. While it is possible to learn or compute the optimal policy using this state abstraction in Task A (i.e., always go right), this state abstraction cannot be used to learn the optimal policy in Task B, in which the column position is needed to know whether to go left or right.
This example illustrates that reward-predictive state representations are suitable for re-use across tasks that vary in rewards and transitions. While reward-maximizing state abstractions may compress a task further than reward-predictive state abstractions, reward-maximizing state abstractions may also simplify a task to an extent that renders them specific to a single task.
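The merging step mentioned in the caption (states with equal SFs fall into the same partition) can be illustrated as follows. This is only a sketch of that one step, not the LSFM algorithm from S3 Text: the function name `merge_equal_sfs`, the tolerance `tol`, and the SF values in the example table are all made up for illustration.

```python
def merge_equal_sfs(sf_table, tol=1e-6):
    """Give the same partition index to states whose SF vectors agree
    entry-wise within `tol` (hypothetical tolerance)."""
    representatives = []  # one SF vector per partition found so far
    assignment = []       # partition index for each state
    for sf in sf_table:
        for index, rep in enumerate(representatives):
            if all(abs(x - y) <= tol for x, y in zip(sf, rep)):
                assignment.append(index)
                break
        else:
            # No existing partition matches: open a new one.
            representatives.append(sf)
            assignment.append(len(representatives) - 1)
    return assignment


# Illustrative (made-up) three-dimensional SF vectors for a 3 x 3 grid in
# row-major order; states in the same column share the same vector.
left, centre, right = [1.0, 0.5, 0.2], [0.3, 1.0, 0.5], [0.1, 0.3, 1.0]
sf_table = [left, centre, right] * 3

partition = merge_equal_sfs(sf_table)
print(partition)
```

With these illustrative values, each column of the grid ends up in its own partition, which mirrors the colouring described in panels (B) and (D).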

Fig 3.

Minimizing reward-sequence prediction errors identifies state abstractions amenable for “deep transfer”.

For each task set (A, B, C), all possible state abstractions were enumerated using Algorithm U [24] to obtain a ground truth distribution over the hypothesis space. In each grid-world task (A, C) the agent can transition up, down, left, or right to move to an adjacent grid cell. If the agent attempts to transition off the grid or across one of the black barriers in (C), then the agent remains at its current grid position. State abstractions were scored by compressing an MDP using the state abstraction of interest [6]. The total reward score was computed by running the computed policy 20 times for 10 time steps in the MDP from a randomly selected start state. The reward-sequence error was computed by selecting 20 random start states and then performing a random walk for 10 time steps. (D, E, F) The histograms report averages over all repeats and transfer MDPs for all state abstractions that are possible in a nine-state MDP. (G, H, I) The histograms report averages over all repeats and transfer MDPs for all state abstractions that compress nine states into three latent states. For each histogram, Welch’s t-test was performed to compute a p-value for the difference in mean total reward.
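To give a sense of the size of this hypothesis space, the sketch below enumerates every partition of nine states. It uses a plain recursive generator over restricted growth strings rather than Algorithm U itself, and it assumes that "compress nine states into three latent states" means partitions with exactly three blocks; the function name is hypothetical.

```python
def restricted_growth_strings(n):
    """Yield every partition of n items as a tuple `a`, where a[i] is the
    block label of item i, a[0] == 0, and a[i] <= max(a[:i]) + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")

    def extend(prefix, max_label):
        if len(prefix) == n:
            yield tuple(prefix)
            return
        # The next item may join any existing block or open one new block.
        for label in range(max_label + 2):
            prefix.append(label)
            yield from extend(prefix, max(max_label, label))
            prefix.pop()

    yield from extend([0], 0)


all_abstractions = list(restricted_growth_strings(9))
three_state = [a for a in all_abstractions if max(a) == 2]

print(len(all_abstractions))  # 21147, the ninth Bell number
print(len(three_state))       # 3025, the Stirling number S(9, 3)
```

Each tuple can be read as a state abstraction that maps state i to latent state a[i], so exhaustive scoring of all candidates is feasible at this scale.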

Fig 4.

Transfer with multiple state abstractions curriculum.

(A) A curriculum of transfer tasks is generated by first constructing the three-state MDP. At each state, only one action causes a transition to a different state. Only one state-to-state transition is rewarded; the optimal policy is to select the correct action needed to cycle between the node states. (B) To generate a sequence of abstract MDPs, the action labels are randomly permuted as well as the transitions generating positive reward (similar to the Diabolical Rooms Problem [3]). Two hidden state abstractions ϕ_A and ϕ_B were randomly selected to “inflate” each abstract MDP to a nine-state problem. One state abstraction was used with a frequency of 75% and the other with a frequency of 25%. The resulting MDP sequence M₁, …, M₂₀ was presented to the agent, without any information about which state abstraction was used to construct the task sequence.

Fig 5.

Results for transfer with multiple state abstractions experiment.

(A, D) Plot of how different α and β model parameters influence the average size of the belief set after training. (B, E) Performance of each model (average total reward per MDP) for different α and β model parameters. After observing the transition and reward tables of a task M_t in the task sequence, the average total reward was obtained by first computing a compressed abstract MDP for each abstraction and then solving each compressed MDP using value iteration, as described in supporting S1 Text. The resulting mixture policy was then tested in the task M_t for 10 time steps while logging the sum of all obtained reward. If β = ∞ the agent obtains an optimal total reward level when using either loss function for ten time steps in each MDP. (C, F) Plot of the average count for the most frequently used state abstraction. As described in Fig 4, one of two possible “hidden” state abstractions, ϕ_A and ϕ_B, was embedded into each MDP. Each task sequence consists of 20 MDPs and on average 15 out of these 20 MDPs had the state abstraction ϕ_A embedded and the remaining MDPs had the state abstraction ϕ_B embedded. The white bar labelled “Ground Truth” plots the ground-truth frequency of the “hidden” state abstraction ϕ_A. If the non-parametric Bayesian model correctly detects which state abstraction to use in which task, then the average highest count will not be significantly different from the white ground truth bar. In total, 100 different task sequences, each consisting of 20 MDPs, were tested and all plots show averages across these 100 repeats (the standard error of measure is indicated by the shaded area and variations are very low if not visible).
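The "compress, then solve with value iteration" step can be sketched as below. The exact procedure is given in S1 Text and is not reproduced here; this sketch assumes uniform averaging over the states in each block (as in the Fig 1 example), a discount factor of 0.9, and a tiny made-up four-state task. The names `compress` and `value_iteration` and the table layout `T[a][s][s2]`, `R[a][s]` are hypothetical.

```python
def compress(T, R, phi, n_latent):
    """Build a compressed MDP by uniformly averaging transition and reward
    tables over the states mapped to each latent state (assumed weighting)."""
    n_states = len(phi)
    sizes = [phi.count(i) for i in range(n_latent)]
    T_c = [[[0.0] * n_latent for _ in range(n_latent)] for _ in T]
    R_c = [[0.0] * n_latent for _ in T]
    for a in range(len(T)):
        for s in range(n_states):
            i = phi[s]
            R_c[a][i] += R[a][s] / sizes[i]
            for s2 in range(n_states):
                T_c[a][i][phi[s2]] += T[a][s][s2] / sizes[i]
    return T_c, R_c


def value_iteration(T, R, gamma=0.9, n_sweeps=200):
    """Tabular value iteration; gamma and n_sweeps are illustrative choices."""
    n_actions, n_states = len(T), len(R[0])

    def q_value(V, s, a):
        return R[a][s] + gamma * sum(T[a][s][s2] * V[s2] for s2 in range(n_states))

    V = [0.0] * n_states
    for _ in range(n_sweeps):
        V = [max(q_value(V, s, a) for a in range(n_actions)) for s in range(n_states)]
    policy = [max(range(n_actions), key=lambda a: q_value(V, s, a))
              for s in range(n_states)]
    return V, policy


# Made-up task: action 0 stays put, action 1 jumps to the other pair of
# states; jumping out of states 0 and 1 is rewarded with +1.
T = [
    [[1.0 if s2 == s else 0.0 for s2 in range(4)] for s in range(4)],
    [[1.0 if s2 == (s + 2) % 4 else 0.0 for s2 in range(4)] for s in range(4)],
]
R = [[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
phi = [0, 0, 1, 1]  # states 0, 1 -> latent 0; states 2, 3 -> latent 1

T_c, R_c = compress(T, R, phi, n_latent=2)
V, policy = value_iteration(T_c, R_c)
print(policy, V)
```

In this toy task the two-state compressed MDP preserves the dynamics exactly, so the policy computed on it (always take action 1) is also optimal in the four-state task.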

Fig 6.

Maze curriculum.

Maze A and Maze B are augmented with an irrelevant state variable to construct a five-task curriculum. In each maze, the agent starts at the blue grid cell and can move up, down, left, or right to collect a reward at the green goal cell. The black lines indicate barriers the agent cannot pass. Once the green goal cell is reached, the episode finishes and another episode is started. (These rewarding goal cells are absorbing states.) Transitions are probabilistic and succeed in the desired direction with probability 0.95; otherwise the agent remains at its current grid cell. The agent cannot transition off the grid map or through a barrier. A five-task curriculum is constructed by augmenting the state space either with a “light” or “dark” colour bit (first, third, and fourth task), or the right half of the maze is augmented with the colour red, green, or blue (second and fifth task).
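The transition rule described in the caption can be sketched as a sampling function. This is an illustration under stated assumptions, not the authors' code: the grid size, the barrier encoding (a set of unordered cell pairs), and the name `maze_step` are hypothetical; only the 0.95 success probability and the stay-in-place behaviour come from the caption.

```python
import random

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def maze_step(cell, action, n_rows, n_cols, barriers, rng, p_success=0.95):
    """Sample the next (row, col) cell. The move succeeds with probability
    `p_success`; moves off the grid or across a barrier never succeed."""
    d_row, d_col = MOVES[action]
    target = (cell[0] + d_row, cell[1] + d_col)
    off_grid = not (0 <= target[0] < n_rows and 0 <= target[1] < n_cols)
    blocked = frozenset((cell, target)) in barriers
    if off_grid or blocked or rng.random() >= p_success:
        return cell  # the agent remains at its current grid cell
    return target


rng = random.Random(0)
# Hypothetical barrier between cells (0, 0) and (0, 1) on a 3 x 3 grid.
barriers = {frozenset({(0, 0), (0, 1)})}

samples = [maze_step((1, 0), "right", 3, 3, barriers, rng) for _ in range(10000)]
success_rate = samples.count((1, 1)) / len(samples)
print(success_rate)  # close to 0.95
```

Augmenting the state with an irrelevant colour variable, as in the five-task curriculum, would leave this rule unchanged, because the colour does not influence where the agent moves.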

Fig 7.

Transferring state representations influences learning speed on the maze curriculum.

(A) Performance comparison of each learning algorithm that uses Q-learning to obtain an optimal policy. The reward-predictive model identifies two state abstractions and re-uses them in tasks 3 through 5, resulting in faster learning than the reward-maximizing model. (B) Performance comparison of each learning algorithm that uses SF-learning to obtain an optimal policy. Similar to (A), the reward-predictive model identifies two state abstractions and re-uses them in tasks 3 through 5. Re-using previously learned SFs across tasks (orange curve) degrades performance. (A, B) Each experiment was repeated ten times and the average across all repeats was plotted. The shaded areas indicate the standard errors of measure. For each experiment, different learning rates and hyper-parameter settings were tested and the settings resulting in the lowest average episode length are plotted. Supporting S3 Text describes the tested implementation and hyper-parameters in detail. (C, D) Plot of the posterior distribution as a function of training episode. The orange rectangle indicates tasks in which the agent used the identity abstraction to learn a new state representation that was added into the belief set after 200 episodes of learning.

Fig 8.

Guitar-playing example.

(A): Guitar-Scale task for scale C-D-E-F-G-A-B. The fret board is translated into a bit matrix, where each entry corresponds to one circle. Because the note “C” can be played at multiple fret-board locations (orange circles), each location is mapped to the same latent state. The bottom schematic illustrates how the guitar-scale MDP is constructed for one octave: Starting at the start state (black dot), the agent progresses through different fret-board configurations by selecting which note to play next. Note that the illustrated state sequence is repeated three times, once for each octave. (The schematic illustrates only one chain to simplify the presentation.) Which octave is played is determined at random and the transition from the start state (black dot) into one of the fret boards that correspond to the note “C” is non-deterministic. This assumption allows us to reduce the action space from 60 fret-board positions to 12 notes (A, A#, B, C, C#, D, D#, E, F, F#, G, G#). For each correct transition, a reward of zero is given, and for each incorrect transition a reward of −1 is given. The last fret board (a fret board corresponding to the note “B” in this example) is an absorbing state. (B): Total reward for each algorithm after first learning an optimal policy for Scale 1 (C-D-E-F-G-A-B) and then learning an optimal policy for Scale 2 (A-B-C-D-E-F-G). Each algorithm was simulated in each task for 100 episodes and each simulation was repeated ten times. The supporting S3 Text provides a detailed description of all hyper-parameters. (C): Reward per episode plot of one repeat for both the SF transfer and reward-predictive model. For the first 50 episodes, which are spent in scale task 1, both algorithms converge to an optimal reward level equally fast and learn to play the scale correctly. A recording of the optimal scale sequence is provided in supporting S1 Audio File.
On scale task 2 (episodes 51 and onward), the reward-predictive model can re-use a previously learned state abstraction and converge to an optimal policy faster than the SF transfer algorithm. After only ten episodes in scale task 2, the reward-predictive model has learned how to play the scale correctly (please refer to supporting S2 Audio File) while the SF transfer algorithm has not yet converged to an optimal policy and does not play the scale correctly (please refer to supporting S3 Audio File).
