Model-based spatial navigation in the hippocampus-ventral striatum circuit: A computational analysis

doi:10.1371/journal.pcbi.1006316

Fig 1.

Spatial navigation in rodents: functional organization, scenario, and overall architecture of the model.

(A) Structures in the rodent brain that are involved in goal-directed navigation. HC and VS constitute the essential structures of the putative model-based control system, which supports goal-directed behavior. AT and MEC provide input to the model-based control system. Output of the HC-VS circuitry may reach the cortex/mPFC via VP and MD. (B) The Y-maze used here and in [19]. Each room contains 3 goal locations, which can be cued with a light with different probabilities in three different phases (separate panels) according to the color legend. See the main text for more explanation. (C) Sample grid cells providing systematic (spatial) information about the environment. (D) Architecture of our biologically inspired model-based reinforcement-learning model for spatial navigation, which combines non-parametric clustering of the input signal P_c(s|x), a state-transition model P_M(s′|s,a), and a state-value model P_V(r|g,s) with a lookahead prediction mechanism for learning and control. See the Methods for details. Abbreviations: AT, anterior thalamus; HC, hippocampus; MD, mediodorsal thalamus; MEC, medial entorhinal cortex; mPFC, medial prefrontal cortex; VP, ventral pallidum; VS, ventral striatum.

Fig 2.

Benefits of forward sweeps for action selection and control.

(A,B) Learning performance (accuracy and length of the agent path to the goal). (C) Length of sweeps used for control / action selection. (D) Decision (un)certainty during control / action selection.

Fig 3.

Behavioral results of the simulations.

(A) Preferences of the MB-RL agent after Cue Conditioning and after Contextual Conditioning (i.e., CX test). In the CX test, both the agent and the animals show a preference for visiting targets in the room that was previously most rewarded. (B) Sample trajectories of the agents during these tests (after Cue Conditioning, blue; after Context Conditioning, or CX test, red).

Fig 4.

Neural representation of state transitions in the state-state model.

Latent states developed by the state-transition models averaged across all 10 learners after the Cue Conditioning phase (A); and the changes due to Contextual Conditioning, i.e., the differences between probabilities before and after Contextual Conditioning (B). Each image from the transition model P_M(s′|s,a) encodes the greatest likelihoods P_M(xy(s′)|xy(s), a) across all head-directions and actions to step from the location xy(s) of a given latent state (place) s to nearby places xy(s′) located within the range of an(y) action from the location of the current place, following any of the available actions, i.e., the (probabilistic) location of the successors of every state. The locations xy(s) and xy(s′) of s and s′ are decoded using an inverse of the function providing input to the Dirichlet model. Note that, as expected, the decoding procedure is not perfect—hence the gaps in the maps.

Fig 5.

Neural representation of value in the state-value model.

Latent states developed by the state-value model averaged across all 10 learners, after the Cue Conditioning phase (A); how they change after Contextual Conditioning, i.e., the differences between probabilities before and after Contextual Conditioning (B); and the activity of three sample vStr neurons drawn from the experiment in [19] (C). The images in (A, B) represent the greatest value P_V(r = 1|g,xy(s)) across all head-directions attributed to a given spatial position xy(s) for a given target g. Each image represents one target, and its location in the plot corresponds to its location in the Y-maze. The central image represents the combined value function across all targets. It is rotationally symmetric after Cue Conditioning, during which the delivered reward was itself rotationally symmetric (as is the activity of some of the rodent vStr neurons; see insets in C), and becomes asymmetric during the Context Conditioning phase. The spatial positions xy(s) are decoded from the latent states s using an inverse of the function providing input from the grid cells.

Fig 6.

Changes in the state-state and state-value models after contextual conditioning.

This figure shows the changes that the Contextual Conditioning procedure produces in the probability values of states in the state-state (transition) model (black) and in the state-value model, where states correspond to the most rewarded (red) or less rewarded (blue) rooms. For clarity, we only show the changes of states having a probability greater than 0.05 (for the state-state model) or 0.5 (for the state-value function). This choice of thresholds is motivated by the fact that, in the state-value function, we are interested in verifying changes in the states carrying significant value information (e.g., those regarding the goal states or their neighbors), not in the many states that have a low probability value in all situations (see Fig 5).

Fig 7.

Analysis of sweeps.

Length of sweeps during control / action selection, for each target (separate Y-maze images) and each spatial location (dots in the Y-mazes). Sweep length is color-coded. (A) Sweep length after Cue Conditioning. (B) Change of sweep length after Context Conditioning.

Fig 8.

Shallow control mechanism.

The mechanism exploits the state-transition and state-value models and local maximization to predict the expected value of each action primitive aⁱ and select the most valuable one. For each action primitive, the mechanism first finds the latent state that would most likely be reached by applying that action and then finds, among those states, the one predicted to be most valuable. The action leading to that state is selected: a_t = argmax_a P_V(r = 1|g,argmax_s′P_M(s′|s_t,a)).
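
The selection rule above can be illustrated with a minimal sketch. It assumes the two models are available as plain nested lists; the names `shallow_action`, `P_M`, and `P_V` and the toy numbers are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the shallow control rule (assumed data layout):
#   P_M[s][a][s_next] = P_M(s'|s,a)   state-transition model
#   P_V[g][s]         = P_V(r=1|g,s)  state-value model

def shallow_action(P_M, P_V, s_t, g):
    """Return argmax_a P_V(r=1 | g, argmax_s' P_M(s'|s_t,a))."""
    best_action, best_value = None, float("-inf")
    for a, successors in enumerate(P_M[s_t]):
        # Most likely latent state reached by applying action a in s_t.
        s_next = max(range(len(successors)), key=successors.__getitem__)
        value = P_V[g][s_next]
        if value > best_value:
            best_action, best_value = a, value
    return best_action

# Toy model (hypothetical numbers): 3 latent states, 2 actions, 1 target.
P_M = [
    [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7]],  # from state 0
    [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]],  # from state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # from state 2
]
P_V = [[0.0, 0.2, 0.9]]                  # values for target g = 0

chosen = shallow_action(P_M, P_V, s_t=0, g=0)
```

In this toy case action 0 most likely leads to state 1 (value 0.2) and action 1 to state 2 (value 0.9), so `chosen` is 1.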

Fig 9.

Controller using look-ahead prediction (or forward sweeps).

The mechanism includes k sweeps, one for each available action primitive, each consisting of l steps. At each step j, the mechanism first predicts the next latent state of each sweep and then accumulates the predicted value for that state. The first transition of the i-th sweep departs from the current latent state s_t and applies action primitive aⁱ, while the following transitions recursively depart from the state predicted in the previous step and use any action that maximizes the predicted reward. Finally, the mechanism selects the action that maximizes the cumulative predicted value.
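
A minimal sketch of such a look-ahead controller follows. The caption does not state the exact accumulation formula, so the plain (optionally discounted) sum used here, the `gamma` parameter, the nested-list layout, and all function names are assumptions made for illustration.

```python
# Sketch of look-ahead control with forward sweeps (assumed data layout):
#   P_M[s][a][s_next] = P_M(s'|s,a),  P_V[g][s] = P_V(r=1|g,s)

def most_likely_successor(P_M, s, a):
    probs = P_M[s][a]
    return max(range(len(probs)), key=probs.__getitem__)

def sweep_values(P_M, P_V, s_t, g, depth, gamma=1.0):
    """Cumulative predicted value of one sweep per action primitive."""
    n_actions = len(P_M[s_t])
    totals = []
    for first_action in range(n_actions):
        # First transition: apply the sweep's own action primitive.
        s = most_likely_successor(P_M, s_t, first_action)
        total = P_V[g][s]
        for j in range(1, depth):
            # Later transitions: depart from the predicted state and take
            # whichever action leads to the most valuable predicted state.
            candidates = [most_likely_successor(P_M, s, a)
                          for a in range(n_actions)]
            s = max(candidates, key=lambda nxt: P_V[g][nxt])
            total += (gamma ** j) * P_V[g][s]  # assumed accumulation rule
        totals.append(total)
    return totals

def sweep_action(P_M, P_V, s_t, g, depth, gamma=1.0):
    totals = sweep_values(P_M, P_V, s_t, g, depth, gamma)
    return max(range(len(totals)), key=totals.__getitem__)

def one_hot(n, i):
    return [1.0 if k == i else 0.0 for k in range(n)]

# Toy deterministic model (hypothetical): 5 states, 2 actions, 1 target.
successor = [[1, 2], [3, 3], [4, 2], [3, 3], [4, 4]]  # successor[s][a]
P_M = [[one_hot(5, nxt) for nxt in row] for row in successor]
P_V = [[0.0, 0.3, 0.2, 0.1, 0.9]]

one_step = sweep_action(P_M, P_V, s_t=0, g=0, depth=1)
two_step = sweep_action(P_M, P_V, s_t=0, g=0, depth=2)
```

With a one-step sweep the controller prefers action 0 (value 0.3 versus 0.2); with two steps it prefers action 1, because that branch reaches the high-value state 4. This is the kind of case in which deeper sweeps change the decision.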

Fig 10.

Information-driven adaptive sweep-depth.

At each depth j, the discriminative certainty d_j is calculated for the two currently most valuable sweeps. Sweep depth increases until the selection certainty exceeds a given threshold: d_j > d_thr.
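
The stopping rule can be sketched as follows. The caption does not define d_j, so the normalized difference between the two best cumulative values used below is only a stand-in; `adaptive_depth`, `step_values`, and the toy numbers are likewise assumptions.

```python
# Sketch of information-driven adaptive sweep depth.
# step_values[i][j] is the predicted value at step j of sweep i
# (hypothetical input; in the model it would come from the forward sweeps).

def adaptive_depth(step_values, d_thr):
    """Deepen all sweeps until the top-two certainty exceeds d_thr.

    Returns (best_sweep, depth_used, certainty). Stops at the maximum
    available depth if the threshold is never exceeded.
    """
    k = len(step_values)
    max_depth = min(len(v) for v in step_values)
    if k < 2 or max_depth < 1:
        raise ValueError("need at least two sweeps with one step each")
    totals = [0.0] * k
    for j in range(max_depth):
        for i in range(k):
            totals[i] += step_values[i][j]
        ranked = sorted(range(k), key=totals.__getitem__, reverse=True)
        best, second = ranked[0], ranked[1]
        denom = totals[best] + totals[second]
        # Assumed certainty measure: normalized gap between the top two.
        certainty = (totals[best] - totals[second]) / denom if denom > 0 else 0.0
        if certainty > d_thr:
            break
    return best, j + 1, certainty

# Toy example: the two sweeps are hard to separate after one step
# but clearly different after two.
step_values = [[0.3, 0.1, 0.1], [0.2, 0.9, 0.9]]
best_sweep, depth_used, certainty = adaptive_depth(step_values, d_thr=0.3)
```

Here the certainty after one step is 0.2, below the threshold, so the sweeps are extended to depth 2, where sweep 1 wins with a certainty of about 0.47.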

Fig 11.

Learning the latent state-space, the state-transition and state-value models.

Given (1) the last input x_t−1 and latent state s_t−1, (2) the performed action a_t−1, and (3) the observed new input x_t, reward r_t, and inferred latent state s_t, learning consists of (5) adjusting the categorization model to make it more congruent with the state-transition model and updating the conjugate priors of the state-transition and state-value models to accommodate the internal perception of the experienced behavioral evidence (see the text for details). Notably, the update of the state-value conjugate prior is a Bayesian analog of TD-learning that uses the predicted discounted future value accumulated in (4) a forward sweep.
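
A minimal sketch of conjugate-prior updates of this kind is given below. It assumes Dirichlet pseudo-counts for the transition model and Beta pseudo-counts for the value model, with a TD-like target built from the reward and a discounted sweep value. The caption does not spell out these specifics, so the class, the parameter names, the target formula, and `gamma` are all assumptions, and the adjustment of the categorization model is not covered.

```python
# Sketch of conjugate-prior learning (assumed parameterization).

class ConjugateModels:
    def __init__(self, n_states, n_actions, n_goals, prior=1.0):
        # Dirichlet pseudo-counts over successors for each (state, action).
        self.trans = [[[prior] * n_states for _ in range(n_actions)]
                      for _ in range(n_states)]
        # Beta pseudo-counts [reward, no reward] for each (goal, state).
        self.value = [[[prior, prior] for _ in range(n_states)]
                      for _ in range(n_goals)]

    def update_transition(self, s_prev, a_prev, s_new):
        # One observed transition adds one pseudo-count.
        self.trans[s_prev][a_prev][s_new] += 1.0

    def p_transition(self, s_prev, a_prev):
        counts = self.trans[s_prev][a_prev]
        total = sum(counts)
        return [c / total for c in counts]  # posterior mean

    def update_value(self, g, s, reward, future_value, gamma=0.9):
        # TD-like soft target in [0, 1]: immediate reward plus the
        # discounted value predicted by a forward sweep (assumed form).
        target = max(0.0, min(1.0, reward + gamma * future_value))
        pair = self.value[g][s]
        pair[0] += target
        pair[1] += 1.0 - target

    def p_value(self, g, s):
        a, b = self.value[g][s]
        return a / (a + b)  # posterior mean of P(r=1|g,s)

models = ConjugateModels(n_states=3, n_actions=2, n_goals=1)
models.update_transition(s_prev=0, a_prev=1, s_new=2)
models.update_value(g=0, s=0, reward=1.0, future_value=0.0)
p_next = models.p_transition(0, 1)
p_reward = models.p_value(0, 0)
```

Starting from uniform priors, one observed transition 0 → 2 under action 1 moves the posterior mean to [0.25, 0.25, 0.5], and one rewarded visit raises the value estimate of state 0 from 1/2 to 2/3.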
