Accounting for sensitivity of latent learning to behavioral statistics with successor representations

doi:10.1371/journal.pcbi.1014131

Fig 1.

Simulation environments.

A) Spatial setup of the environments. The gridworld is an arena for both one-hot encoding (top-left) and image-based (bottom-left) representations. The Tolman maze [3] is either a 72-state environment (top-right) with one-hot encoding representations, or a 30-state environment with image representations (bottom-right). In the Tolman maze, virtual doors (dotted lines), if activated, can restrict the agent’s movement to only one direction (along the arrow). The starting and goal states are indicated by s and g, respectively. The states marked with i are the additional starting locations during the continuous pre-exposure, while m is the incorrect goal location in the mistargeted pre-exposure. The locations of these distinct states shown in the one-hot encoding setting are the same in the image-based environments. B) In image-based learning, a pre-trained VGG16 network [21] encodes an RGB image into a state feature that is then used as the input to the deep SR networks. C) Cosine similarity between each pair of state features in the gridworld. Features of nearby locations have high similarity, whereas features of distant locations diverge. In this and comparable plots, the states are stacked to a 100-dimensional vector to facilitate plotting the similarity between all states.
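The pairwise comparison in panel C can be illustrated with a minimal sketch. It assumes each state is represented by a plain list of floats; the toy feature vectors and the function names `cosine_similarity` and `similarity_matrix` are hypothetical and are not taken from the article's code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(features):
    """Pairwise cosine similarity for a list of state feature vectors."""
    return [[cosine_similarity(f, g) for g in features] for f in features]

# Hypothetical toy features for three states (not VGG16 outputs).
features = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
sim = similarity_matrix(features)
```

In this toy case, the first and third states are orthogonal (similarity 0), while the second state is partially similar to both, which mirrors the qualitative pattern of high similarity for nearby locations and low similarity for distant ones.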

Fig 2.

Learning curves demonstrating latent learning.

Left column: gridworld. Right column: Tolman maze. A) Latent learning agents (blue curve) with targeted pre-exposure to the environment (trials –50 to –1) consistently reach the goal faster than direct learning agents (orange dashed curve), after a reward is introduced from trial 0 in both one-hot encoding (top row) and image-based representations (bottom row). Curves represent the averages over 30 simulations, and the shaded area the S.E.M. B) The success rate of reaching the goal location in each trial. Latent learning agents are mostly unsuccessful on the spatial task during the pre-exposure phase, but after the introduction of rewards they reach the goal state more consistently than the direct learning agents do.
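As an illustration of how an averaged learning curve with an S.E.M. band could be computed from repeated simulations, here is a minimal sketch. It assumes the S.E.M. is the sample standard deviation divided by the square root of the number of simulations; the data layout (one list of per-trial values per simulation) and the function names are hypothetical.

```python
import math

def mean_and_sem(values):
    """Return the mean and standard error of the mean of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    # Sample variance with Bessel's correction (n - 1 in the denominator).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)

def learning_curve(runs):
    """runs[i][t] is the value for simulation i in trial t.

    Returns one (mean, sem) pair per trial, computed across simulations.
    """
    n_trials = len(runs[0])
    return [mean_and_sem([run[t] for run in runs]) for t in range(n_trials)]

# Hypothetical steps-to-goal for three simulations over two trials.
runs = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
curve = learning_curve(runs)
```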

Fig 3.

Agent learns local connectivity of the maze during latent learning.

Deep SR matrices derived from one-hot encoding for all states are shown for the A) gridworld for the right action and B) Tolman maze for the up action, averaged over 30 simulations. Rows and columns correspond to the state indices. Each row of panels represents the successor transitions of a particular state to every state in the environment. Trials 0 and 50 correspond to before and after the learning phase, respectively. At the end of the learning phase, both agents exhibit similar transition patterns that reflect a movement from the start to the goal location in both mazes.

Fig 4.

Successor representations of individual states reveal predictive encoding.

Shown are visualizations of one row of the deep SR matrix (one-hot encoding, targeted pre-exposure), mapped onto the environment structure, for the state indicated by the orange cross in the A) gridworld and B) Tolman maze. Trial –50 corresponds to the random initialization. After the pre-exposure to the environment (trial 0), the future states that the agent is expected to visit are mostly near the current state, whereas by the end of learning (trial 50), the future occupancy of states has evolved into a path leading to the goal.
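To make the notion of a row of the SR concrete, the following sketch learns a tabular successor representation with a temporal-difference update on a three-state chain. This is a simplified stand-in, not the deep SR network used in the article: the tabular form, the chain environment, the learning rate, and the discount factor are all assumptions chosen for illustration.

```python
def sr_td_update(M, s, s_next, alpha, gamma):
    """One TD update of row s of a tabular SR matrix M (list of lists).

    Target for entry (s, j): indicator(s == j) + gamma * M[s_next][j].
    """
    for j in range(len(M)):
        target = (1.0 if s == j else 0.0) + gamma * M[s_next][j]
        M[s][j] += alpha * (target - M[s][j])

# Hypothetical deterministic chain 0 -> 1 -> 2, with state 2 absorbing.
next_state = {0: 1, 1: 2, 2: 2}
n_states, alpha, gamma = 3, 0.5, 0.9
M = [[0.0] * n_states for _ in range(n_states)]

for _ in range(500):  # repeated sweeps over all transitions
    for s in range(n_states):
        sr_td_update(M, s, next_state[s], alpha, gamma)

# Row 0 is the discounted expected future occupancy when starting in state 0.
row_start = M[0]
```

After training, the row for state 0 places weight on states 1 and 2, that is, on the path the agent will follow. This is the tabular analogue of the path-like occupancy described for trial 50.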

Fig 5.

Agent generalizes only image-based state representation to novel inputs.

Bottom: Deep SR matrix of individual states trained with image features and mapped onto the environment (direct learning, 20 simulations), shown for a location that was not used during training (orange x). Note the clear representation of a trajectory toward the goal. s and g mark the start and goal states, respectively, that were used during training. Top: When one-hot encodings are used for state representations, the deep SR matrix exhibits no directional bias for the novel location toward the goal location.

Fig 6.

Learned state-action values in the one-hot encoding setup.

For the A) gridworld environment and B) Tolman maze, the color of each arrow represents the Q value of the corresponding action in that state. For direct learning agents, both the SF network and the reward weight vector are randomly initialized, resulting in a stochastic distribution of Q values at trial 0. Latent learning agents have encoded the transition structure between neighboring states and a proper reward function (, for all s) by the end of the pre-exposure phase. This leads to a more uniform distribution of Q values, which are nevertheless sensitive to random fluctuations, e.g., there are high Q values near the starting state in the gridworld, where they should not be high. Following learning with reward, both agents converge toward similar Q values, with higher values for actions leading toward the goal.

Fig 7.

Experimental task design influences latent learning.

Comparison of latent learning agents with three pre-exposure designs (targeted, continuous, mistargeted) to the direct learning agent. Lines indicate averages across 30 simulations and shaded regions the S.E.M. A) Performance in the gridworld for agents using deep SR based on one-hot encodings (first row), deep SR based on image representations (second row), and image-based Dyna-DQN (third row). B) Same as in A) for the Tolman maze.

Fig 8.

The structure of deep SR accounts for performance differences in the three different pre-exposure paradigms.

A) Left: Well-learned SF for the direct learning agent in the gridworld. Left-center: After targeted pre-exposure, the SF structure is similar to what the direct learning agent learns from rewards, particularly in states near the goal (shown in inset). Right-center: After continuous pre-exposure, the structure is more dissimilar to that of the direct learning agent. Right: After mistargeted pre-exposure, the structure is similar to that of the targeted pre-exposure, but for the mistargeted location. B) Same as in A) for the Tolman maze without the mistargeted pre-exposure simulations. C) For the gridworld, the cosine similarity between the SF vector from the latent learning agents at the end of the pre-exposure phase and that from the direct learning agent at the task’s conclusion. Targeted pre-exposure results in superior scores in states close to the goal (last state), while mistargeted pre-exposure shows lower scores in states close to the mistargeted state. Error bars are S.E.M. D) Same as in C) for the Tolman maze.

Fig 9.

Evaluation of the policies learned during the pre-exposure phase.

Q values for targeted and continuous pre-exposure in the A) gridworld environment, and B) Tolman maze, computed from the learned SF multiplied with the ground-truth reward function. Targeted pre-exposure drives elevated values for states that are more distant from the goal, compared to continuous pre-exposure. The green edges show the action with the highest Q-value in that state. C) The probability of selecting the optimal action at the conclusion of the pre-exposure phase in the Tolman maze, averaged over 30 simulations. Targeted pre-exposure tends to lead to the optimal actions more frequently than continuous pre-exposure. Dead-end states, where only one action is viable, exhibit an action selection probability of 1 (see Methods) and are omitted here for clarity.
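The computation of Q values from successor features and a reward function can be sketched as a dot product per action. This assumes the standard linear form Q(s, a) = ψ(s, a) · w, with one SF vector per action and a one-hot reward vector marking the goal; the numbers, action labels, and function names below are hypothetical.

```python
def q_values(sf_per_action, w):
    """Q value per action as the dot product of its SF vector with reward weights w."""
    return [sum(p * wi for p, wi in zip(psi, w)) for psi in sf_per_action]

def greedy_action(q):
    """Index of the action with the highest Q value (first one on ties)."""
    return max(range(len(q)), key=q.__getitem__)

# Hypothetical three-state example: the goal is state 2, so the
# ground-truth reward vector is one-hot at index 2.
w = [0.0, 0.0, 1.0]

# Hypothetical SF vectors for one state, one per action (0 = left, 1 = right).
sf_per_action = [
    [1.0, 0.5, 0.1],  # "left" rarely leads to the goal
    [0.2, 0.9, 0.8],  # "right" often leads to the goal
]

q = q_values(sf_per_action, w)
best = greedy_action(q)
```

With a one-hot reward vector, each Q value reduces to the expected discounted occupancy of the goal state under that action, which is why an SF learned without rewards can already rank actions once the goal is specified.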

Fig 10.

Targeted pre-exposure drives more optimal action selection than continuous pre-exposure in the Tolman maze.

Left and middle: Thin lines represent the optimality trace of a single simulation. For each trial, optimality was quantified as the ratio of optimal actions to the total number of actions taken across all states that transition to the goal, i.e., excluding dead-end states. Thick lines represent the average over 30 simulations. Right: The agents’ optimality ratios show that targeted pre-exposure consistently outperforms continuous pre-exposure in all trials. Average over 30 simulations, shaded regions indicate S.E.M.
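The optimality measure described above can be sketched as follows. The sketch assumes a trial is recorded as a list of (state, action) pairs and that the set of optimal actions per state is known in advance; the data structures, state labels, and the function name `optimality_ratio` are hypothetical.

```python
def optimality_ratio(trajectory, optimal_actions, dead_ends=frozenset()):
    """Fraction of optimal actions among all actions taken in non-dead-end states.

    trajectory: list of (state, action) pairs from one trial.
    optimal_actions: dict mapping each state to a set of optimal actions.
    dead_ends: states excluded from the count.
    Returns None if no counted action was taken.
    """
    counted = [(s, a) for s, a in trajectory if s not in dead_ends]
    if not counted:
        return None
    hits = sum(1 for s, a in counted if a in optimal_actions[s])
    return hits / len(counted)

# Hypothetical trial: state 2 is a dead end and is therefore ignored.
trajectory = [(0, "right"), (1, "up"), (2, "left"), (1, "right")]
optimal_actions = {0: {"right"}, 1: {"right"}, 2: {"left"}}
ratio = optimality_ratio(trajectory, optimal_actions, dead_ends={2})
```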

Fig 11.

Latent learning observed with different RL exploration strategies.

Evaluation of the softmax policy for exploration in the A) gridworld and B) Tolman maze (average of 30 simulations, error bars are S.E.M.). Latent learning agents consistently exhibit faster learning during the learning phase, compared to direct learning agents. The subtle performance differences between targeted, continuous, and mistargeted pre-exposures persist even as the exploration policy changes. C) Restricting the agent’s movement in the Tolman maze during the pre-exposure and learning phases, by introducing doors that are only passable in one direction, significantly influences the performance of the DSR (left) and Dyna-DQN (right) agents (average over 30 simulations, with a maximum of 500 steps per trial, error bars are S.E.M.). Performance is improved when doors are active for the DSR agents and even more so for the Dyna-DQN agent. Legend labels for agents with pre-exposure indicate the door conditions during pre-exposure and learning phases, e.g., “No doors/Doors” indicates that doors were not active during pre-exposure, but were during learning.
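A softmax exploration policy converts Q values into action probabilities. The sketch below uses the common Boltzmann form with a temperature parameter; the temperature values and the function name `softmax_policy` are assumptions, since the caption does not state the parameterization used in the simulations.

```python
import math

def softmax_policy(q, temperature=1.0):
    """Action probabilities proportional to exp(Q / temperature)."""
    q_max = max(q)  # subtract the maximum for numerical stability
    exps = [math.exp((x - q_max) / temperature) for x in q]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical Q values for four actions in one state.
q = [0.1, 0.8, 0.1, 0.1]
probs = softmax_policy(q, temperature=0.5)
uniform = softmax_policy([0.3, 0.3, 0.3, 0.3])
```

Lower temperatures concentrate probability on the highest-valued action, whereas equal Q values yield uniform random exploration.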
