Goal-directed navigation in humans and deep reinforcement learning agents relies on an adaptive mix of vector-based and transition-based strategies

doi:10.1371/journal.pbio.3003296

Fig 1.

Task design and experimental set-up.

A: underlying structure of the 8 × 8 grid, unseen by participants. Every state is represented by an image of an object, and these objects and their positions change on every trial. B: schematic diagram of the ‘map reading’ phase of each trial. Participants see a top–down view of the grid with objects obscured and successively click on blue squares to reveal ‘landmark’ objects at the location. After 16 clicks have been completed, a yellow square appears. Clicking on the yellow square reveals the ‘goal’ object for the trial. C: schematic diagram of the navigation phase of each trial. Participants start in a random, previously unobserved location and are tasked with navigating to the ‘goal’ object they had just learnt about (displayed at the top). They can navigate in two ways. First, they could choose a direction to travel in by clicking on the corresponding arrow (highlighted yellow). This is analogous to using a ‘vector-based’ strategy. Alternatively, they could choose an adjacent state to travel to by clicking on one of the associated images (displayed in a random order; highlighted blue). This corresponds to using a ‘transition-based’ navigation strategy. Both response methods were equivalent in that they both only allowed participants to move to the four adjacent states, but setting up the response methods in this way allowed us to determine if participants were focusing more on the direction they were travelling in, or the identity of the next state they would be transitioning to.
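
The two response methods described in panel C can be made concrete with a small simulation. The following is a minimal sketch rather than the authors' task code: the function names, the choice to keep the goal distinct from the 16 landmarks, and the rule that a move off the grid leaves the position unchanged are all assumptions made for illustration.

```python
import random

GRID_SIZE = 8          # 8 x 8 grid, as in the caption
N_LANDMARKS = 16       # landmark objects revealed during map reading

# The four direction responses and their (row, col) offsets.
DIRECTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_trial(rng):
    """Sample one trial: a fresh object per state, landmarks, a goal and a start."""
    cells = [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)]
    object_ids = list(range(len(cells)))
    rng.shuffle(object_ids)                      # objects change on every trial
    objects = dict(zip(cells, object_ids))
    # Assumption: the goal is drawn separately from the 16 landmarks.
    revealed = rng.sample(cells, N_LANDMARKS + 1)
    landmarks, goal = revealed[:-1], revealed[-1]
    unseen = [cell for cell in cells if cell not in revealed]
    start = rng.choice(unseen)                   # start is previously unobserved
    return {"objects": objects, "landmarks": landmarks, "goal": goal, "pos": start}


def neighbours(pos):
    """Map each direction to the adjacent state, dropping moves off the grid."""
    out = {}
    for name, (dr, dc) in DIRECTIONS.items():
        r, c = pos[0] + dr, pos[1] + dc
        if 0 <= r < GRID_SIZE and 0 <= c < GRID_SIZE:
            out[name] = (r, c)
    return out


def step_direction(trial, direction):
    """Vector-based response: pick a direction, whatever object lies there."""
    # Assumption: a move off the grid leaves the position unchanged.
    trial["pos"] = neighbours(trial["pos"]).get(direction, trial["pos"])
    return trial["pos"]


def step_state(trial, object_id):
    """Transition-based response: pick the image of an adjacent state."""
    for cell in neighbours(trial["pos"]).values():
        if trial["objects"][cell] == object_id:
            trial["pos"] = cell
            break
    return trial["pos"]
```

Both functions can only move the agent to one of the four adjacent states, which mirrors the equivalence of the two response methods noted in the caption. They differ only in what the caller must know: a direction, or the identity of the next object.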

Fig 2.

Human participants benefit from freely arbitrating between vector- and transition-based strategies.

A: Performance for each participant across the different conditions of Experiment 1. Y-axis represents number of steps taken to reach a goal on a logarithmic scale, while x-axis represents the different conditions. Each dot represents an individual participant’s performance on each condition, and the dashes represent the mean performance across all participants, with error bars representing the 95% CI. B: Relationship between the proportion of steps made using direction responses (x-axis) and the number of steps taken to reach a goal (y-axis; represented on a logarithmic scale) for each participant. The different colors represent different types of environments, while the lines represent the best-fitting quadratic curve. C: Selected sample participant trajectories on the task. Participants’ trajectories progress from the darker squares to the lighter squares. A red circle indicates the location of a landmark, and a red square indicates an obstacle in the cluttered condition. A yellow square indicates the goal for the trial. A cross indicates when participants used a state response to get to the state. In these trajectories, participants use direction responses most of the time but use state responses to get to a landmark or goal. D: Participants’ use of direction responses (y-axis) as a function of destination type (i.e., goal, landmark, or non-landmark; x-axis) and whether the state had been visited before (color of bar). Error bars represent the 95% CI. E: Performance for each participant across the different numbers of landmarks in Experiment 2. Y-axis represents number of steps taken to reach a goal on a logarithmic scale, while x-axis represents the different numbers of landmarks. Each dot represents an individual participant’s performance on each condition, and the dashes represent the mean performance across all participants, with error bars representing the 95% CI.
Data and code underlying this figure are available at https://osf.io/w39d5/ and https://github.com/denis-lan/navigation-strategies, respectively.
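
The best-fitting quadratic curve in panel B can be illustrated with an ordinary least-squares fit. This is a sketch under stated assumptions, not the authors' analysis: it assumes the curve is fitted to log10-transformed step counts, the function name `fit_quadratic` is hypothetical, and the five data points are synthetic numbers invented for the example, not values from the study.

```python
import math


def fit_quadratic(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c via the normal equations."""
    # Power sums S[k] = sum(x**k) and moments T[k] = sum(y * x**k).
    S = [sum(x ** k for x in xs) for k in range(5)]
    T = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented 3 x 3 system for the unknowns (a, b, c).
    A = [
        [S[4], S[3], S[2], T[2]],
        [S[3], S[2], S[1], T[1]],
        [S[2], S[1], S[0], T[0]],
    ]
    # Gaussian elimination with partial pivoting.
    for i in range(3):
        pivot = max(range(i, 3), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(i + 1, 3):
            factor = A[r][i] / A[i][i]
            A[r] = [v - factor * p for v, p in zip(A[r], A[i])]
    # Back substitution.
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(A[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (A[i][3] - known) / A[i][i]
    return tuple(coef)  # (a, b, c)


# Synthetic, illustrative data only (not taken from the study):
# proportion of direction responses and mean steps to goal per participant.
prop_direction = [0.0, 0.25, 0.5, 0.75, 1.0]
steps_to_goal = [40.0, 18.0, 12.0, 16.0, 35.0]
log_steps = [math.log10(s) for s in steps_to_goal]

a, b, c = fit_quadratic(prop_direction, log_steps)
vertex = -b / (2 * a)   # mixing proportion at which the fitted curve is lowest
```

A positive quadratic coefficient `a` with a vertex strictly between 0 and 1 would correspond to a U-shaped relationship, in which an intermediate mix of direction and state responses needs the fewest steps.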

Fig 3.

Deep RL model meta-trained for few-shot navigation recapitulates key features of human behavior.

A: architecture of the deep reinforcement learning network. The network consisted of an LSTM with separate policy and value heads. B: model performance across the different conditions. Y-axis represents number of steps taken to goal on a logarithmic scale, while x-axis represents the different conditions. Each dot represents an individual model’s performance on each condition, and the dashes represent the mean performance across all models, with error bars representing the 95% CI. C: scatterplot showing the correspondence between model performance (x-axis, as measured by number of steps to goal, represented on a logarithmic scale) and human performance (y-axis) for each condition. The colors of the scatter points represent different action conditions, while the shapes represent the type of environment. Error bars represent the 95% confidence interval. D: Performance (as measured by number of steps to goal; on y-axis, on a logarithmic scale) and proportion of steps using vectors (x-axis) of the models (colored dots) superimposed on scatter plot of human performance (translucent dots) and best-fit quadratic curve for the relationship between the proportion of steps made using vector-based responses and performance in humans. E: Models’ use of vector-based responses (y-axis) as a function of destination type (i.e., goal, landmark, or non-landmark; x-axis) and whether the state had been visited before (color of bar). Each dot represents the behavior of an individual model, with error bars representing the 95% CI. Data and code underlying this figure are available at https://osf.io/w39d5/ and https://github.com/denis-lan/navigation-strategies, respectively.
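
The architecture in panel A, an LSTM whose hidden state feeds separate policy and value heads, can be sketched as a single forward step. This is an illustrative sketch only: the layer sizes, the choice of 8 actions (assumed here to be four direction and four state responses), the omission of bias terms, and all function names are assumptions, and the published model may differ in every one of these details.

```python
import math
import random

# Hypothetical sizes chosen for illustration, not the paper's values.
N_INPUT, N_HIDDEN, N_ACTIONS = 6, 8, 8


def init_matrix(rows, cols, rng):
    """Small random weights; biases are omitted to keep the sketch short."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(rng):
    """One weight matrix per LSTM gate, plus the two output heads."""
    gates = {g: init_matrix(N_HIDDEN, N_INPUT + N_HIDDEN, rng) for g in "ifog"}
    return {
        "gates": gates,
        "policy": init_matrix(N_ACTIONS, N_HIDDEN, rng),  # action logits
        "value": init_matrix(1, N_HIDDEN, rng),           # scalar state value
    }


def forward(params, x, h, c):
    """One time step: LSTM update, then policy and value read out from h."""
    z = x + h                                             # concatenate input and hidden state
    i = [sigmoid(v) for v in matvec(params["gates"]["i"], z)]    # input gate
    f = [sigmoid(v) for v in matvec(params["gates"]["f"], z)]    # forget gate
    o = [sigmoid(v) for v in matvec(params["gates"]["o"], z)]    # output gate
    g = [math.tanh(v) for v in matvec(params["gates"]["g"], z)]  # candidate values
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    logits = matvec(params["policy"], h)
    peak = max(logits)                                    # subtract max for stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    policy = [e / total for e in exps]                    # softmax over actions
    value = matvec(params["value"], h)[0]
    return policy, value, h, c
```

The point of the two heads is that both read the same recurrent state: the policy head turns it into a distribution over actions, while the value head turns it into a single estimate of expected return.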

Fig 4.

Models spontaneously develop separate modules for ‘vector’- and ‘transition’-based strategies.

A: Example heatmaps for a unit with a ‘landmark’, ‘spatial’, or ‘conjunctive’ response pattern in two different environments. Red circles on the heatmaps denote the presence of a landmark at that location. ‘Spatial’ units respond stably to certain regions of space across environments regardless of landmark configuration, ‘landmark’ units respond to all landmarks across environments, while ‘conjunctive’ units respond to landmarks differently across different environments. B: Stacked bar plots showing the proportion of response pattern types across all units or in the ‘vector’, ‘transition’, or ‘unspecialized’ clusters. ‘Vector’ units are more likely to have spatial responses, while ‘transition’ units are more likely to have conjunctive or landmark responses. C: Scatter plot showing the R² value for the correlation between the activation of each LSTM unit’s cell state and the output values of either the ‘direction’ or ‘state’ actions in the policy network. Values are normalized to be between 0 and 1. The 20 units that explained the most variance in ‘direction’ or ‘state’ actions were designated as the ‘vector’ and ‘transition’ units, respectively, while the 20 units that explained the least variance in either type of action were designated as ‘unspecialized’ units. D: Performance deficit of lesioned models (as measured by excess number of steps taken to get to the goal compared to an intact model) on the ‘both’, ‘directions-only’, and ‘states-only’ conditions. Each dot represents one of the 20 trained models, the line represents the mean, and the error bar represents the 95% confidence interval. Lesioning ‘vector’ units leads to deficits in the ‘directions-only’ condition, while lesioning ‘transition’ units leads to deficits in the ‘states-only’ condition. Lesioning either ‘vector’ or ‘transition’ units leads to deficits in the ‘both’ condition.
E: Change in use of ‘direction’ actions after lesions to the ‘vector’, ‘transition’ and ‘unspecialized’ units, compared to the unlesioned models. Each dot represents one of the 20 trained models, the line represents the mean, and the error bar represents the 95% confidence interval. Lesioning ‘vector’ units leads to a decrease in use of ‘direction’ actions and lesioning ‘transition’ units leads to a decrease in use of ‘state’ actions. F: Decoding error on held-out time steps for current and goal locations (as measured by Euclidean distance) for the ‘vector’, ‘transition’ and ‘unspecialized’ units. Current and goal locations are both best decodable from ‘vector’ units. G: Decoding error on held-out time steps for whether the agent is currently adjacent to a landmark or a goal. Goal and landmark adjacency are both best decodable from ‘transition’ units. H: First three principal components for the PCA on the cell state activations of ‘vector’ units. Each dot represents the centroid of the PCs for each location in the grid. Red and blue dots represent the PCs before and after a landmark is encountered, respectively. The representations of ‘vector’ units faithfully reflect spatial structure after a landmark is encountered. PCA results are shown for one representative model. I: First three principal components for the PCA on the cell state activations of ‘transition’ units. Each dot represents the centroid of the PCs for each location in the grid. Purple and green dots represent the PCs for non-landmarks and landmarks, respectively. The representations of ‘transition’ units seem to separate landmarks and non-landmarks without apparent spatial structure. Data and code underlying this figure are available at https://osf.io/w39d5/ and https://github.com/denis-lan/navigation-strategies, respectively.
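
The unit-selection step in panel C and the lesions in panels D and E can be sketched as follows. This is a hedged illustration, not the authors' pipeline: the function names are hypothetical, the policy output is assumed to be summarised as one value per time step, and a lesion is assumed to mean zeroing the selected units' cell-state entries, which the caption does not specify.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0                      # a constant trace explains no variance
    return sxy ** 2 / (sxx * syy)


def rank_units(cell_states, action_output, k):
    """Return the k units whose cell state best explains the action output.

    cell_states: list over time steps, each a list of per-unit activations.
    action_output: list over time steps, one summary value per step
                   (an assumed summary of the 'direction' or 'state' outputs).
    """
    n_units = len(cell_states[0])
    scores = []
    for u in range(n_units):
        trace = [step[u] for step in cell_states]
        scores.append(r_squared(trace, action_output))
    order = sorted(range(n_units), key=lambda u: scores[u], reverse=True)
    return order[:k], scores


def lesion(cell_state, units):
    """Silence the chosen units by zeroing their cell-state entries (assumed)."""
    silenced = set(units)
    return [0.0 if u in silenced else v for u, v in enumerate(cell_state)]
```

Under these assumptions, calling `rank_units` once with the ‘direction’ outputs and once with the ‘state’ outputs, each with `k=20`, would give candidate ‘vector’ and ‘transition’ units, and applying `lesion` at every time step would approximate the lesioned models.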

Fig 5.

Participants successfully choose landmarks that are beneficial for few-shot navigation.

A: Performance on the second day in the free-sampling and forced-sampling groups as measured by number of steps taken to get to a goal (y-axis, represented on a logarithmic scale). Each dot represents an individual participant’s mean performance across the task and the dash represents mean performance across all participants, with the error bars representing the 95% CI. B: Average distance from all states to their nearest landmarks for participants in the free-sampling group. The dotted line represents the mean distance expected by chance. C: Average distance from landmarks to the center for participants in the free-sampling group. The dotted line represents the mean distance expected by chance. D: Mean error on probe trials for participants in the free-sampling and forced-sampling conditions. E: Results from a principal component analysis (PCA) conducted on each participant’s number of samples on each location in the 8 × 8 grid. The loadings for each of the locations in the 8 × 8 grid on the three sampling-related principal components are shown in the first row. The second row shows the relationship between each component (x-axes) and navigation performance, as measured by number of steps taken to reach a goal (y-axes, represented on a logarithmic scale). Each dot represents an individual participant, and the line represents the best-fitting line. The third row shows the same for the deep RL model, with each dot representing the model’s performance when tested on the landmarks sampled by an individual human participant. F: Model performance, as measured by steps taken to goal (y-axis, represented on a logarithmic scale) when tested on the freely selected landmarks chosen by the free-sampling group or the randomly chosen landmarks that the forced-sampling group was exposed to. Each dot represents the model’s performance when tested on the landmarks a single human participant sampled or was exposed to.
G: Correlation between participants’ performance (as measured by number of steps taken to reach the goal, represented on a logarithmic scale) and the model’s performance when it is tested on the landmarks chosen by each participant. Participants’ performance on the navigation phase was significantly associated with the model’s performance when it was tested on the landmarks they chose. Data and code underlying this figure are available at https://osf.io/w39d5/ and https://github.com/denis-lan/navigation-strategies, respectively.
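
The landmark-coverage measures in panels B and C, together with their chance levels, can be sketched with a short Monte Carlo simulation. This is an illustrative sketch under assumptions the caption does not settle: distances are taken to be Manhattan distances on the grid, chance is taken to mean uniformly random landmark placement, and the function names and the number of random draws are hypothetical.

```python
import random

GRID_SIZE = 8


def all_cells():
    return [(r, c) for r in range(GRID_SIZE) for c in range(GRID_SIZE)]


def mean_nearest_landmark_distance(landmarks):
    """Average, over all states, of the distance to the closest landmark."""
    total = 0.0
    for r, c in all_cells():
        # Assumption: Manhattan distance; the caption does not name the metric.
        total += min(abs(r - lr) + abs(c - lc) for lr, lc in landmarks)
    return total / GRID_SIZE ** 2


def mean_distance_to_centre(landmarks):
    """Average distance from each landmark to the centre of the grid."""
    mid = (GRID_SIZE - 1) / 2
    return sum(abs(r - mid) + abs(c - mid) for r, c in landmarks) / len(landmarks)


def chance_level(metric, n_landmarks, n_draws, rng):
    """Monte Carlo estimate of a metric under uniformly random landmark choice."""
    cells = all_cells()
    draws = [metric(rng.sample(cells, n_landmarks)) for _ in range(n_draws)]
    return sum(draws) / n_draws
```

Comparing a participant's chosen landmarks against `chance_level` for the same number of landmarks indicates whether their sampling covered the grid more evenly, or sat closer to the centre, than random selection would.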

Table 1.

Summary of the input vector received by deep RL agents, showing the information provided to the agents and the length of each piece of encoded information.

Table 2.

Hyperparameters used for model training.
