Reward-predictive representations generalize across tasks in reinforcement learning
Fig 8
(A): Guitar-Scale task for the scale C-D-E-F-G-A-B. The fretboard is encoded as a bit matrix, where each entry corresponds to one circle. Because the note "C" can be played at multiple fretboard locations (orange circles), all of these locations are mapped to the same latent state. The bottom schematic illustrates how the guitar-scale MDP is constructed for one octave: starting from the start state (black dot), the agent progresses through different fretboard configurations by selecting which note to play next. The illustrated state sequence is repeated three times, once per octave; the schematic shows only one chain to simplify the presentation. Which octave is played is determined at random, so the transition from the start state (black dot) into one of the fretboard configurations corresponding to the note "C" is non-deterministic. This construction reduces the action space from 60 fretboard positions to the 12 notes (A, A#, B, C, C#, D, D#, E, F, F#, G, G#). Each correct transition yields a reward of zero and each incorrect transition a reward of −1. The last fretboard (the one corresponding to the note "B" in this example) is an absorbing state. A minimal code sketch of this MDP construction is provided after the caption.

(B): Total reward for each algorithm after first learning an optimal policy for Scale 1 (C-D-E-F-G-A-B) and then learning an optimal policy for Scale 2 (A-B-C-D-E-F-G). Each algorithm was simulated in each task for 100 episodes, and each simulation was repeated ten times. The supporting S3 Text provides a detailed description of all hyper-parameters.

(C): Reward per episode for one repeat of both the SF transfer and the reward-predictive model. During the first 50 episodes, which are spent in Scale 1, both algorithms converge to an optimal reward level equally quickly and learn to play the scale correctly. A recording of the optimal scale sequence is provided in supporting S1 Audio File. In Scale 2 (episode 51 onward), the reward-predictive model re-uses a previously learned state abstraction and converges to an optimal policy faster than the SF transfer algorithm. After only ten episodes in Scale 2, the reward-predictive model has learned to play the scale correctly (please refer to supporting S2 Audio File), while the SF transfer algorithm has not yet converged to an optimal policy and does not play the scale correctly (please refer to supporting S3 Audio File).
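To make the MDP construction in panel (A) concrete, the following is a minimal Python sketch written under the latent-state abstraction described above, in which all fretboard positions that sound the same note are collapsed into one state. The class name "GuitarScaleMDP", its "reset" and "step" methods, the decision to leave the state unchanged after an incorrect note (the caption only specifies the rewards: 0 for correct, −1 for incorrect), and the treatment of the random octave choice as affecting neither transitions nor rewards are all illustrative assumptions, not the paper's implementation.

    import random

    # The 12 actions, one per note (A, A#, B, C, C#, D, D#, E, F, F#, G, G#).
    NOTES = ["A", "A#", "B", "C", "C#", "D", "D#", "E", "F", "F#", "G", "G#"]

    class GuitarScaleMDP:
        # Sketch of the guitar-scale MDP from Fig 8A. States are positions
        # along the target scale plus a start state; actions are note names.
        # All fretboard positions for the same note map to one latent state.

        def __init__(self, scale):
            self.scale = scale  # e.g. ["C", "D", "E", "F", "G", "A", "B"]
            self.reset()

        def reset(self):
            self.pos = 0  # start state (black dot)
            # The octave is chosen at random; under the latent-state
            # abstraction assumed here it changes neither transitions
            # nor rewards, so it is recorded but otherwise unused.
            self.octave = random.randrange(3)
            return self.pos

        def step(self, note):
            # Returns (next_state, reward, done) for playing the given note.
            # Assumption: an incorrect note leaves the state unchanged.
            if note == self.scale[self.pos]:
                reward = 0   # correct transition
                self.pos += 1
            else:
                reward = -1  # incorrect transition
            done = self.pos == len(self.scale)  # last fretboard is absorbing
            return self.pos, reward, done

    # Usage: playing Scale 1 (C-D-E-F-G-A-B) correctly yields total reward 0.
    env = GuitarScaleMDP(["C", "D", "E", "F", "G", "A", "B"])
    env.reset()
    total = sum(env.step(note)[1] for note in env.scale)
    print(total)  # -> 0

Under this sketch, an optimal policy earns a total reward of 0 per episode, which matches the optimal reward level that both algorithms reach in panel (C).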