Conceived and designed the experiments: NJG NDD. Performed the experiments: NJG. Analyzed the data: NJG. Contributed reagents/materials/analysis tools: NJG NDD. Wrote the paper: NJG NDD.

The authors have declared that no competing interests exist.

Reinforcement learning (RL) provides an influential characterization of the brain's mechanisms for learning to make advantageous choices. An important problem, though, is how complex tasks can be represented in a way that enables efficient learning. We consider this problem through the lens of spatial navigation, examining how two of the brain's location representations—hippocampal place cells and entorhinal grid cells—are adapted to serve as basis functions for approximating value over space for RL. Although much previous work has focused on these systems' roles in combining upstream sensory cues to track location, revisiting these representations with a focus on how they support this downstream decision function offers complementary insights into their characteristics. Rather than localization, the key problem in learning is generalization between past and present situations, which may not match perfectly. Accordingly, although neural populations collectively offer a precise representation of position, our simulations of navigational tasks verify the suggestion that RL gains efficiency from the more diffuse tuning of individual neurons, which allows learning about rewards to generalize over longer distances given fewer training experiences. However, work on generalization in RL suggests the underlying representation should respect the environment's layout. In particular, although it is often assumed that neurons track location in Euclidean coordinates (that a place cell's activity declines “as the crow flies” away from its peak), the relevant metric for value is geodesic: the distance along a path, around any obstacles. We formalize this intuition and present simulations showing how Euclidean, but not geodesic, representations can interfere with RL by generalizing inappropriately across barriers. 
Our proposal that place and grid responses should be modulated by geodesic distances suggests novel predictions about how obstacles should affect spatial firing fields, which provides a new viewpoint on data concerning both spatial codes.

The central problem of learning is

The rodent brain contains at least two representations of spatial location. Hippocampal place cells fire when a rat passes through a confined, roughly concentric, region of space

Here we investigate the appropriateness of the brain's spatial codes for learning value functions, guided by the influential use of RL models across many varieties of decision problems in computational neuroscience

Importantly, this exercise views the brain's spatial codes less as a representation for location per se, and instead as basis sets for approximating other functions across space. In particular, most RL models work by learning to represent a

Thus, it is intuitive (and our simulations, below, verify) that low-frequency basis functions can speed up spatial RL by allowing experience about rewards to generalize over larger distances. However, we argue that considering generalization in the RL setting suggests a crucial and underappreciated refinement of this idea: in general, value functions are

We formalize this idea in a model of grid and place cell responses. The model and its simulations suggest novel predictions about how grid cell and place cell firing fields should behave in the presence of obstacles and other navigational constraints: in effect, these should locally warp the geometry of the representation. These predictions offer a new perspective on existing results, such as the unidirectionality of place fields on the linear track

Pyramidal neurons in the rat hippocampus have long been known to have firing fields in localized areas of space

Grid cell neurons in dorsomedial entorhinal cortex, a principal input to the hippocampus, have firing fields whose hallmark is a regular triangular lattice

The discovery of grid cells spurred a great deal of computational modeling, mostly targeted at understanding their

A great deal of modeling work in neuroscience and psychology concerns the brain's mechanisms for RL, founded on the observation that dopaminergic neurons in the primate midbrain appear to carry a reward prediction error signal as used in temporal-difference (TD) RL algorithms

Here, we revisit this architecture, focusing on the role of both the hippocampal and entorhinal spatial codes as bases for building the value function, in order to connect neural observations to work in RL on advantageous representations for value function approximation

First, we used TD(λ) learning in three simple environments (

(A) Each column displays the gridworld configuration: individual squares are discrete states, thick black lines are walls, and the star indicates the goal state, with a reward of 1. (B) Each column shows performance, measured as the mean number of steps to goal over 10,000 runs, for the environment in the corresponding column in A. The width of each line spans at least the 95% confidence interval on the mean (range 3.9–4.4 steps). Within a given gridworld, the differently colored lines represent different basis sets: black for tabular, blue for grid cells, and red for place cells.

As

In each figure A–C, the column titles indicate the representation used to learn the value functions for a given gridworld configuration, and each row corresponds to an environment. White lines are walls, discrete squares indicate states, and the gray scale from dark to light indicates low to high value, respectively. To ease comparison between spatial representations within a given gridworld, the image brightness was normalized with respect to the optimal value function. (A) Snapshot of value representation after 15 learning trials. (B) Snapshot of value representation after 25 learning trials. (C) Snapshot of value representation after 50 learning trials. Notice that for both grid cells and place cells, the value representation bleeds across walls, indicated by red arrows where the estimated value is too low (relative to ground truth) on the side of a wall nearer a reward or too high on the far side.

While this flaw does not notably degrade performance in these simple tasks, it can be detrimental when fine navigational precision is required. To demonstrate this, we tested the models in three environments that required the agent to navigate narrow halls or openings, and thus learn precise state value representations (

(A) Each column displays the gridworld configuration whereby individual squares are discrete states, thick black lines are walls, and the star indicates the goal state with reward of 1. (B) Each column shows performance measured as the mean number of steps to goal, over 10,000 runs for the environment in the corresponding column in A. The width of each line occupies at least the 95% confidence interval on the means (range 3.2–4.5 steps). Notice that the collapse of learning, present in the Euclidean grid cells (labeled

In general, as can be seen directly in the recursive definition of the value function, (Equation 1 in

These considerations suggest that for efficacious representation of value functions over state space, the brain should adopt basis functions that are smooth along geodesic rather than Euclidean distances. In the open field there should be no difference between geodesic and Euclidean representations, since these metrics coincide there. However, if an environment has barriers, then Euclidean and geodesic firing fields will differ. The effect of such a difference should be to introduce geometric distortion into geodesic firing fields near obstacles, where geodesic and Euclidean metrics differ. Such a distortion can be characterized (and indeed implemented) by mapping the original Euclidean vector coordinates through an additional transform that accounts for geodesic distance. However, in the present work our goal is to investigate the brain's spatial representations through the lens of their downstream computations; thus, in contrast to much work on the hippocampal system

In particular, we modeled how basis functions would appear in environments with barriers, if they followed a geodesic metric, by evaluating Euclidean grid or place fields (characterized by spatial grids or Gaussians) over a new set of x–y coordinates, chosen such that their pairwise Euclidean distances approximated the states' geodesic distances (see

(A) Geodesic coordinates for different environments. (B) Single grid-cell using respective geodesic coordinates. Each grid cell generated using the same spacing, orientation, and relative spatial phase. (C) Single place-cell using respective geodesic coordinates. Each place cell generated using the same mean and variance.

We tested the geodesic bases in the environments that stressed the importance of along-path generalization (

In each figure A–C, the column titles indicate the representation used to learn the value functions for a given gridworld configuration (denoted by row). White lines are walls, discrete squares indicate states, and the gray scale from dark to light indicates low to high value, respectively. To ease comparison between spatial representations within a given gridworld, the image brightness was normalized with respect to the optimal value function. (A) Snapshot of value representation after 25 learning trials. (B) Snapshot of value representation after 25 learning trials. (C) Snapshot of value representation after 50 learning trials. In contrast to Euclidean bases, the geodesic representation does

The foregoing simulations suggest that to support efficient navigation, the brain's spatial representations should generalize according to a geodesic rather than a Euclidean metric. Of course, these two representations coincide in the open field, where most studies have been conducted. However, we believe our model's predictions are consistent with a number of studies where researchers recorded neurophysiological activity while rats foraged in environments containing barriers. Here we compare our model to examples from three studies

Skaggs & McNaughton

(A) Data adapted and replotted from

In another study

Muller & Kubie

Similar results were also seen in a recent study of how place cell firing fields changed when mazes were reconfigured

(A) Two example environments used in

Finally, Derdikman et al.

Derdikman et al.

Although researchers widely assume that reinforcement learning methods such as temporal difference learning subserve learned action selection in the brain

The present study extends this idea to consider such generalization in light of work on efficient representation in machine learning

By contrast, since our argument is primarily one about learning efficiency (which is difficult to quantify behaviorally, since it is affected by many factors), our model does not make categorical behavioral predictions. Our simulations (

The concept of geodesic generalization provides a formal perspective on spatial representation which is different from, but complementary to, much other work in this area. Whereas much experimental and theoretical work on the hippocampal formation concerns essentially sensory-side questions (how place or grid cells combine different sorts of inputs to produce their instantaneous representations, or to learn them over time), we attempt to isolate the output-side question of how the resulting representations serve downstream learning functions. To this end, we do not address the input-side question of how the hypothesized distorted spatial representations are themselves produced from more elementary inputs. We only assume, abstractly, that the basis functions are computed on the fly from a learned map of the barriers in the environment. In sparse environments such maps could easily be learned from observation in a single trial, and may implicate the “border cells” of entorhinal cortex

More generally, unlike idealized RL models

The behavior of the entorhinal representation also raises interesting questions about the relationship between input- and output-side considerations. To start, it is often assumed that the place code is built up by linear combinations of grid cell inputs, e.g. by a sort of inverse Fourier transform

In this respect, the recent results of Derdikman et al.

Our simulations also demonstrate that the grid representation itself is a suitable basis for value function learning, even without an intermediate place cell representation. On one level, these results serve to underline the generality of our points about geometry and generalization, using a rather different basis. More speculatively, they point to the possibility that the grid representation might actually serve such a role in the brain, echoing other work on the usefulness of this Fourier-like basis for representing arbitrary functions

Finally, although for simplicity and concreteness we have focused on the principles of value function generalization in the context of a particular task (spatial navigation) and algorithm (TD(λ) learning), many of the same considerations apply more generally. First, across domains in computational neuroscience, (temporally) smooth basis functions have also been suggested to improve generalization in learning about events separated in time rather than space

Second, across algorithms, TD-like learning mechanisms also likely interact with additional ones in the brain, and the core considerations we elucidate about efficient generalization due to appropriate state space representations crosscut these distinctions. For instance, value functions may also be updated using replay of previously experienced trajectories (e.g., during sleep)

Another issue arises when considering the present model in light of model-based RL. One of the hallmarks of model-based planning (and the behavioral phenomena that Tolman

We simulate value function learning in a gridworld spatial navigation task in order to compare linear function approximation over several different spatial basis sets

To simplify notation, we omit the dependence of these quantities on the action policy throughout. The model learns approximations to these values by learning a set of N linear weights w_{1…N} for basis functions φ_{1…N}(s)

We use a simple temporal-difference algorithm with eligibility traces e_{i} for the weights w_{i}

This is just the version of the familiar TD(λ) rule for linear value function approximation, with free parameters α (learning rate),

We tested the model in 20-by-20 (M = 400 states) gridworlds in which the agent could move in any of the four cardinal directions, unless a wall blocked such a movement. Agents were started at a random location (i.e. state) at each trial, and had to reach the goal state, which was the only state with a reward,

For simplicity, as described above the agent learns the value function over states and uses this to guide actions toward the goal, rather than directly learning the full Q-function over states and actions. This is because, in a spatial gridworld task, the state-action-state transition model is transparent, so we assume the agent evaluates the value

The agent chooses actions according to a softmax policy, i.e. ^{d/c}
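A softmax choice rule over the values of the states reachable from the current position can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `softmax_policy`, the inverse-temperature parameter `beta`, and the example values are all assumptions introduced here.

```python
import numpy as np

def softmax_policy(successor_values, beta=1.0):
    """Return choice probabilities over actions.

    successor_values: estimated value of the state each action leads to.
    beta: hypothetical inverse temperature (larger = greedier choices).
    """
    z = beta * np.asarray(successor_values, dtype=float)
    z = z - z.max()            # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Example: four cardinal moves with illustrative successor values.
probs = softmax_policy([0.1, 0.5, 0.5, 0.0], beta=5.0)
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)   # sample one action index
```

Higher-valued successors are chosen more often, but every available move keeps a nonzero probability, which preserves exploration early in learning.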

We compare the model's learning using several different linear basis sets. Each basis is an M (states)×N (basis functions) matrix, with each column φ_{i} normalized by its L_{2} norm. This ensures that the learning rate parameter

The tabular basis is the M-by-M identity matrix, with one function corresponding uniquely to each state. It is easy to verify that, with the identity basis, the value prediction and update equations (Equations 2 and 3, respectively) reduce to standard TD(λ) learning. In other words, the tabular basis is 1 at the current state and 0 for all other states, so the learned weights correspond directly to the values learned through standard TD(λ).
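The reduction to the tabular case can be illustrated with a short sketch of TD(λ) for a linear value estimate. This is a generic textbook form written under stated assumptions, not a transcription of the paper's Equations 2 and 3: the names `w`, `e`, `phi_s`, the accumulating trace, the treatment of the goal as terminal, and the parameter values are all choices made here for illustration.

```python
import numpy as np

def td_lambda_step(w, e, phi_s, phi_next, reward, alpha, gamma, lam,
                   terminal=False):
    """One TD(lambda) update for the linear estimate V(s) = phi(s) . w."""
    v_s = phi_s @ w
    v_next = 0.0 if terminal else phi_next @ w
    delta = reward + gamma * v_next - v_s   # TD prediction error
    e = gamma * lam * e + phi_s             # accumulating eligibility trace
    w = w + alpha * delta * e               # credit recently active bases
    return w, e

# Tabular (identity) basis on a 4-state chain 0 -> 1 -> 2 -> 3 (goal).
M = 4
Phi = np.eye(M)
w = np.zeros(M)
e = np.zeros(M)
for s in range(M - 1):
    s_next = s + 1
    at_goal = s_next == M - 1
    w, e = td_lambda_step(w, e, Phi[s], Phi[s_next],
                          reward=1.0 if at_goal else 0.0,
                          alpha=0.5, gamma=0.9, lam=0.8, terminal=at_goal)
```

With the identity basis each weight is simply the value of one state, so after this single pass `w` holds the per-state values that tabular TD(λ) would produce; replacing `Phi` with a smoother basis makes the same update spread credit to nearby states.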

We used isotropic 2D Gaussian basis functions at different standard deviations to model a multiscale place cell basis. Such a representation ignores the possibility that individual basis functions have multiple fields
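A multiscale Gaussian basis of this kind can be sketched as below. The specific centers (every fourth state), the three standard deviations, and the scaling of each column to unit L2 norm are assumptions made for the example; the function name `place_cell_basis` is hypothetical.

```python
import numpy as np

def place_cell_basis(coords, centers, sigmas):
    """Isotropic 2D Gaussians evaluated at every state.

    coords:  (M, 2) x-y coordinates of the states.
    centers: (K, 2) field centers.
    sigmas:  iterable of standard deviations (one scale per value).
    Returns an M x (K * len(sigmas)) matrix with unit-norm columns.
    """
    diff = coords[:, None, :] - centers[None, :, :]
    d2 = (diff ** 2).sum(axis=2)                        # squared distances
    cols = [np.exp(-d2 / (2.0 * s ** 2)) for s in sigmas]
    Phi = np.hstack(cols)
    return Phi / np.linalg.norm(Phi, axis=0, keepdims=True)

# 20-by-20 gridworld: one x-y pair per state.
side = 20
xs, ys = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
Phi_place = place_cell_basis(coords, centers=coords[::4],
                             sigmas=(1.0, 2.0, 4.0))
```

Because the function only depends on the coordinates it is given, the same code yields a differently shaped field when it is passed transformed coordinates instead of the raw grid positions.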

We used the sum of three 2D spatial cosine waves to model a hexagonal grid cell-like basis, akin to previous models of grid cell responses

Here, each basis function i is evaluated at the position of the state and is parameterized by three phases p_{i,j}, an orientation θ_{i}, and a spacing λ_{i}. Together, the grid orientation and spacing determine the three vectors onto which the planar cosine waves are projected: the orientation θ_{i} sets the directions of these vectors, and the spacing λ_{i} controls the space between simulated firing fields. For the waves to interfere constructively and produce a grid pattern, the three phases p_{i,1}, p_{i,2}, and p_{i,3} must be related to one another, so that only two of them are free.

We produced a basis set of 400 grid cell-like functions, using all combinations of four orientations θ_{i} (0, π/12, π/6, and π/4), four node spacings λ_{i} (4/(3_{i,1} and p_{i,2} each between 0 and 2π. We also included a constant function, for a total of 401 bases. Finally, we ensured that the basis functions were non-negative (directly representable with firing rates) by adding an appropriate constant
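One common way to build such a function is sketched below. Because the exact formula is not recoverable here, the details are assumptions: wave vectors 60° apart, a wave number of 4π/(√3·spacing), and a third phase fixed at `p2 - p1` so that the three waves peak together. The function name and the example parameter values are hypothetical.

```python
import numpy as np

def grid_cell_basis_function(coords, theta, spacing, p1, p2):
    """Sum of three planar cosine waves, shifted to be non-negative.

    coords:  (M, 2) x-y coordinates.
    theta:   grid orientation in radians.
    spacing: distance between neighbouring firing fields.
    p1, p2:  the two free spatial phases.
    """
    k = 4.0 * np.pi / (np.sqrt(3.0) * spacing)          # assumed wave number
    angles = theta + np.array([0.0, np.pi / 3, 2 * np.pi / 3])
    wave_vectors = k * np.column_stack([np.cos(angles), np.sin(angles)])
    # The third wave vector equals the second minus the first, so the
    # phases must obey the same relation for all three waves to align.
    phases = np.array([p1, p2, p2 - p1])
    g = np.cos(coords @ wave_vectors.T + phases).sum(axis=1)
    return g - g.min()                                  # non-negative rates

side = 20
xs, ys = np.meshgrid(np.arange(side), np.arange(side), indexing="ij")
coords = np.column_stack([xs.ravel(), ys.ravel()]).astype(float)
phi_grid = grid_cell_basis_function(coords, theta=np.pi / 12, spacing=5.0,
                                    p1=0.3, p2=1.1)
```

Looping this function over every combination of orientation, spacing, and phase pair, and appending a constant column, would give a basis matrix of the kind described above.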

To modify basis sets to respect the wall layout of a particular grid task, the Euclidean x–y coordinates for each state were transformed such that their pairwise distances approximately reflected geodesic distances (i.e. distances along paths that respect boundaries) in the gridworld. The basis functions were then evaluated at these transformed coordinates. Specifically, coordinates were transformed in a manner analogous to the ISOMAP algorithm

Next, we estimated a set of Euclidean coordinates (i.e., an x–y pair for each of the M states) whose Euclidean inter-state distances approximated the geodesic distance matrix. This was accomplished by applying non-classical multidimensional scaling (Matlab, mdscale) to the dissimilarity matrix, using Sammon's nonlinear stress criterion
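The two steps (geodesic distances, then an embedding that approximates them) can be sketched as follows. This is a simplified stand-in rather than the procedure described above: it uses breadth-first search with unit 4-neighbour steps, so open-field distances are city-block rather than exactly Euclidean, and it swaps in classical (eigendecomposition) multidimensional scaling in place of Matlab's `mdscale` with Sammon's criterion. Walls are represented as a hypothetical set of blocked state pairs.

```python
from collections import deque
import numpy as np

def geodesic_distances(side, blocked):
    """All-pairs shortest path lengths on a side-by-side grid.

    blocked: set of frozenset({s, t}) pairs that a wall separates.
    """
    M = side * side
    D = np.full((M, M), np.inf)
    for start in range(M):
        D[start, start] = 0.0
        queue = deque([start])
        while queue:                       # breadth-first search
            s = queue.popleft()
            r, c = divmod(s, side)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < side and 0 <= cc < side:
                    t = rr * side + cc
                    if (frozenset((s, t)) not in blocked
                            and np.isinf(D[start, t])):
                        D[start, t] = D[start, s] + 1.0
                        queue.append(t)
    return D

def classical_mds(D, dims=2):
    """Coordinates whose Euclidean distances approximate D."""
    M = D.shape[0]
    J = np.eye(M) - np.ones((M, M)) / M    # centering matrix
    B = -0.5 * J @ (D ** 2) @ J
    vals, vecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:dims]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0.0))

# 6-by-6 grid with a wall between columns 2 and 3, open only in row 5.
side = 6
blocked = {frozenset((r * side + 2, r * side + 3)) for r in range(5)}
D = geodesic_distances(side, blocked)
geo_coords = classical_mds(D)
```

States 2 and 3 are adjacent in Euclidean terms but 11 steps apart along the shortest path around the wall; evaluating a place or grid function at `geo_coords` rather than at the raw grid coordinates is what warps its field near the barrier.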

We computed this transformation once for each environment, producing a static basis set over which to perform reinforcement learning. Realistically, the animal would have to learn the state transition function (i.e., the location of barriers) in order to compute the basis, and the firing fields would be expected to change as this state transition model was learned. However, since in our environments obstacles are sparse and observable from a distance, the true transition matrix (and the basis implied) should be entirely learned during the first trial in any of our environments.

Ground-truth value functions were computed for the optimal policy using dynamic programming over a tabular basis.
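A dynamic programming computation of this kind can be sketched with value iteration. The conventions here are assumptions for illustration: deterministic moves, a goal state whose value is fixed at its reward of 1, a hypothetical discount factor `gamma`, and walls given as a set of blocked state pairs.

```python
import numpy as np

def optimal_values(side, goal, blocked=frozenset(), gamma=0.9, tol=1e-12):
    """Tabular optimal state values on a side-by-side gridworld."""
    M = side * side
    neighbours = []
    for s in range(M):
        r, c = divmod(s, side)
        reachable = []
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if 0 <= rr < side and 0 <= cc < side:
                t = rr * side + cc
                if frozenset((s, t)) not in blocked:
                    reachable.append(t)
        neighbours.append(reachable)

    V = np.zeros(M)
    V[goal] = 1.0                          # reward of 1 at the goal
    while True:
        V_new = V.copy()
        for s in range(M):
            if s != goal and neighbours[s]:
                # Best move: step to the highest-valued reachable state.
                V_new[s] = gamma * max(V[t] for t in neighbours[s])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_star = optimal_values(side=5, goal=24)
```

Under these conventions each state's value is `gamma` raised to its geodesic distance from the goal, which makes explicit why the optimal value function is smooth along paths rather than across walls.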