Accounting for sensitivity of latent learning to behavioral statistics with successor representations
Fig 9
Evaluation of the policies learned during the pre-exposure phase.
Q-values for targeted and continuous pre-exposure in A) the gridworld environment and B) the Tolman maze, computed by multiplying the learned successor features (SF) by the ground-truth reward function. Compared to continuous pre-exposure, targeted pre-exposure yields higher values for states that are more distant from the goal. The green edges show the action with the highest Q-value in each state. C) The probability of selecting the optimal action at the end of the pre-exposure phase in the Tolman maze, averaged over 30 simulations. Targeted pre-exposure tends to select the optimal action more frequently than continuous pre-exposure. Dead-end states, where only one action is viable, have an action selection probability of 1 (see Methods) and are omitted here for clarity.
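As a rough illustration of how Q-values of this kind can be read out from a successor representation, the sketch below multiplies a tabular SF array by a reward vector and takes the highest-valued action per state. It is not the paper's implementation. The three-state chain, the discount factor, the "always move right" policy used to build the SF, and all names (`next_state`, `sf`, `reward`, `q_values`) are assumptions made for this example, as is the convention that reward is attached to the occupied state.

```python
import numpy as np

# Hypothetical toy setup (not from the paper): a 3-state chain with the goal at state 2.
GAMMA = 0.9
N_STATES, N_ACTIONS = 3, 2  # action 0 = left, action 1 = right


def next_state(s, a):
    """Deterministic transition; moves are clipped at both ends of the chain."""
    step = -1 if a == 0 else 1
    return min(max(s + step, 0), N_STATES - 1)


# State-to-state transition matrix under an assumed "always move right" policy.
P_pi = np.zeros((N_STATES, N_STATES))
for s in range(N_STATES):
    P_pi[s, next_state(s, 1)] = 1.0

# Closed-form state SR for that policy: M = (I - gamma * P)^-1.
M_pi = np.linalg.inv(np.eye(N_STATES) - GAMMA * P_pi)

# State-action SF: occupancy of the current state plus discounted future occupancy.
sf = np.zeros((N_STATES, N_ACTIONS, N_STATES))
for s in range(N_STATES):
    for a in range(N_ACTIONS):
        sf[s, a] = np.eye(N_STATES)[s] + GAMMA * M_pi[next_state(s, a)]

# Ground-truth reward function: reward of 1 at the goal state only.
reward = np.array([0.0, 0.0, 1.0])

# Q-values are the SF multiplied by the reward vector: Q[s, a] = sum_s' sf[s, a, s'] * r[s'].
q_values = sf @ reward

# Highest-valued action in each state (the analogue of the green edges in panels A and B).
greedy = q_values.argmax(axis=1)
```

In this toy chain every state prefers moving right, toward the goal. A learned SF would replace the closed-form `sf` array, while the readout `sf @ reward` would stay the same.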