Prefrontal cortex creates novel navigation sequences from hippocampal place-cell replay with spatial reward propagation

As rats learn to search for multiple sources of food or water in a complex environment, they generate increasingly efficient trajectories between reward sites, across multiple trials. This optimization capacity has been characterized in the Traveling Salesrat Problem (TSP) (de Jong et al 2011). Such spatial navigation capacity involves the replay of hippocampal place-cells during awake states, generating small sequences of spatially related place-cell activity that we call “snippets”. These snippets occur primarily during sharp-wave-ripple (SWR) events. Here we focus on the role of replay during the awake state, as the animal is learning across multiple trials. We hypothesize that snippet replay generates synthetic data that can substantially expand and restructure the experience available to make PFC learning more optimal. We developed a model of snippet generation that is modulated by reward, propagated in the forward and reverse directions. This implements a form of spatial credit assignment for reinforcement learning. We use a biologically motivated computational framework known as ‘reservoir computing’ to model PFC in sequence learning, in which large pools of prewired neural elements process information dynamically through reverberations. This PFC model is ideal for consolidating snippets into larger spatial sequences that may later be recalled by subsets of the original sequences. Our simulation experiments provide neurophysiological explanations for two pertinent observations related to navigation. Reward modulation allows the system to reject non-optimal segments of experienced trajectories, and reverse replay allows the system to “learn” trajectories that it has not physically experienced, both of which significantly contribute to the TSP behavior.
Author Summary As rats search for multiple sources of food in a complex environment, they generate increasingly efficient trajectories between reward sites, across multiple trials, characterized in the Traveling Salesrat Problem (TSP). This likely involves the coordinated replay of place-cell “snippets” between successive trials. We hypothesize that “snippets” can be used by the prefrontal cortex (PFC) to implement a form of reward-modulated reinforcement learning. Our simulation experiments provide neurophysiological explanations for two pertinent observations related to navigation. Reward modulation allows the system to reject non-optimal segments of experienced trajectories, and reverse replay allows the system to “learn” trajectories that it has not physically experienced, both of which significantly contribute to the TSP behavior.

Introduction

Spatial navigation in the rat involves the replay of place-cell subsequences (snippets) in the hippocampus, during both awake and sleep states, during sharp-wave-ripple (SWR) events (Carr et al 2011). As rats search for multiple rewarded sites, they generate increasingly efficient trajectories across trials, a capacity characterized in the Traveling Salesrat Problem (TSP). While it appears likely that replay contributes to this learning behavior, the underlying neurophysiological mechanisms remain to be understood.

One obvious advantage of replay would be to provide extra training examples to otherwise slow reinforcement learning systems. This approach has been previously exploited with good results (Johnson & Redish 2005). We will go beyond this by prioritizing replay based on a spatial gradient of reward proximity that is built up during replay.
We hypothesize (a) that snippet replay allows recurrent dynamics in prefrontal cortex (PFC) to consolidate snippet representations into novel efficient sequences, by rejecting other sequences that are less robustly coded in the input, and (b) that a form of reward-modulated replay in hippocampus implements a simple and efficient form of reinforcement learning to achieve this (Singer & Frank 2009).

INSERT Figure 1 HERE
An example of the behavior in question is illustrated in Figure 1. Panel A illustrates the optimal path linking the 5 feeders (ABCDE) in red. Panels B-D illustrate navigation trajectories that contain subsequences of the optimal path (in red), as well as non-optimal subsequences (in blue). In the framework of reward-modulated replay, snippets from the efficient subsequences in panels B-D will be replayed more frequently, and will lead the system to autonomously generate the optimal sequence as illustrated in panel A. We thus require a sequence learning system that can re-assemble the target sequences from these replayed snippets. For this we choose a biologically inspired recurrent network model of prefrontal cortex (Dominey 1995, Enel et al 2016) that we believe will be able to integrate snippets from examples of non-optimal trajectories and to synthesize an optimal path.

For sequence learning, recurrent networks provide inherent sensitivity to serial and temporal structure. Modification of recurrent connections requires methods for unwinding the recurrent connections in time, which limits the full dynamics of the recurrent system over extended time. To avoid this temporal cut-off, and the space and time complexity required in the calculation of credit assignment to recurrent connections, we used the framework of reservoir computing, in which input and recurrent connections are fixed and learning-related plasticity occurs outside the reservoir network (Dominey 1995). The readout connections from the recurrent network learn the statistical structure of the data that the system is trained on, which places requirements on the mechanism that trains the model. We test the hypothesis that the structure of snippet replay from the hippocampus will provide the PFC with constraints that can be integrated in order to contribute to solving the TSP problem.
Two principal physical and neurophysiological properties of navigation and replay are exploited by the model and contribute to the system's ability to converge onto an acceptable solution to the TSP. First, during navigation between baited food wells in the TSP task, non-optimal trajectories […]

The principal concept is that TSP behavior can be characterized as illustrated in Figure 1, where a system that is exposed to trajectories that contain elements of the efficient path can extract and concatenate these subsequences in order to generate the efficient trajectory.

Place-cells
The modeled rat navigates in a closed space of 2x2 meters where it can move freely in all directions within a limited range (±110° left and right of straight ahead), and encodes locations using hippocampus place-cell activity. A given location p = (x, y) is associated with a place-cell activation pattern by a set of 2D Gaussian place-fields:

f_i(p) = exp(−K · ‖p − c_i‖² / r_i²),  with K = −log(θ)

Where:
- i is the index of the place-cell
- f_i(p) is the mean firing rate of the i-th place-cell
- c_i is the (x, y) coordinate of the center of the i-th place-cell
- K = −log(θ) is a constant that constrains the highest activations of the place-cell to be mostly contained in a circle of radius r_i, centered on c_i
- r_i is the radius of the i-th place-field
- θ is the radius threshold which controls the spatial selectivity of the place-cell

Parameter θ is a manner of defining the variance of the 2D Gaussian surface via the distance-to-center parameter r_i. We model a uniform grid of 16x16 Gaussian place-fields of equal size (mimicking dorsal hippocampus). In Figure 2 the spatial positions and extents of the place fields of several place-cells are represented in panel A by red circles. The degree of red transparency represents the mean firing rate.
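As an illustrative sketch, this place-cell coding can be written in a few lines (the function and variable names are ours, and the default radius and threshold values are illustrative assumptions, not the values used in the experiments):

```python
import numpy as np

def place_cell_activity(p, centers, r=0.25, theta=0.1):
    """Mean firing rates of place-cells for a 2-D location p.

    Each cell responds with a 2-D Gaussian of its place-field center,
    with K = -log(theta) so that activity equals theta at distance r
    (a reconstruction of the radial-basis coding described in the text).
    """
    p = np.asarray(p, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d2 = np.sum((centers - p) ** 2, axis=1)   # squared distance to each center
    k = -np.log(theta)                        # spatial selectivity constant
    return np.exp(-k * d2 / r ** 2)           # values in (0, 1], 1 at the center

# a uniform 16x16 grid of place-fields over the 2 m x 2 m arena
xs = np.linspace(0.0, 2.0, 16)
centers = np.array([(a, b) for a in xs for b in xs])
u = place_cell_activity((1.0, 1.0), centers)  # one input vector u(t)
```

Each location thus maps to a 256-dimensional activation vector with values between 0 and 1, as required for the input vector u(t) below.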

INSERT Figure 2 HERE
A mean firing rate close to one will result in a bright circle if the location p is close to the place-field center c_i of place-cell i. For a more distant place-field center of place-cell j, the mean firing rate will be lower and the red circle representing it will be dimmer. Thus at each time step, the place-cell coding that corresponds to a particular point in a trajectory is defined as the projection of this (x, y) point through radial basis functions (i.e. the Gaussian place-field spatial responses). Each coordinate of the input vector u(t) represents the mean firing rate of a hippocampus place-cell, and its value lies between 0 and 1. Figure 2 represents in panel B the ABCED trajectory (p_1 → p_T) and the corresponding place-cell mean firing rate raster.

Hippocampus replay

The hippocampus replay observed during SWR complexes in the active rest phase (between two trials in a given configuration of baited food wells) is modeled by generating condensed (time-compressed) subsequences of place-cell activation patterns (snippets) that are then replayed at random so as to constitute a training set. The sampling distribution for drawing a random place-cell activation pattern might be uniform or modulated by new or rewarding experience as described in (Carr et al 2011). Ambrose et al (2016) showed that during SWR sequences, place-cell activations occur in reverse order at the end of a run. In particular, we model a random replay that is biased by reward. We will demonstrate an innovative method for spatial propagation of reward during replay that yields a computationally simple form of reinforcement learning. We define a snippet as the concatenation of a pattern of successive place-cell activations:

s(t, N) = (u(t), u(t+1), …, u(t+N−1))
Where:
- N is the number of place-cell activations.

We define a time budget, noted B, that corresponds to the duration of a replay episode (experimentally, typically 70-100 ms). A replay episode is a set of snippets whose cumulative duration fits within B. […] The propagated reward estimate is updated as:

r̂_t ← (1 − α) · r̂_t + α · (r_{t−1} + γ · r̂_{t−1})    (7)

This is a convex combination of the current estimate of the reward information r̂_t at the next time step and the instantaneous reward information r_{t−1} + γ · r̂_{t−1}, based on the previously observed reward signal r_{t−1} and the delayed previous reward estimate r̂_{t−1}. Equation (7) implements a form of temporal difference learning. It is sufficient to define a coarse reward signal as: […]

The snippet generation procedure is simply the repetition of steps (a) and (c) of procedure (8), with r̂ used instead of r, until the sum of snippet durations overflows the fixed budget duration B. These snippets will serve as inputs to the reservoir model of PFC described next. As illustrated in Figure 2 D and E, the replay is biased by proximity to reward, which has spatially propagated.

Reservoir model of prefrontal cortex

The recurrent network generates dynamic state trajectories that allow overlapping snippets to have overlapping state trajectories. This property will favor consolidation of a whole sequence from its snippet parts. At each time-step, the network is updated according to the following schema. The hippocampus place-cells project into the reservoir through feed-forward synaptic connections, noted W_in. The projection operation is a simple matrix-vector product; hence the input projection through feed-forward synaptic connections is defined by W_in · u(t), where:
- W_in is a fixed connectivity matrix whose values do not depend on time.

The same sign convention as in equation (9) applies for the recurrent connectivity matrix W_rec. Self-connections (i.e. w_ii, with i ∈ 1…N) are forced to zero.
W_rec is also fixed, and its values do not depend on time. In this article, we will consider a contiguous assembly of neurons that share the same time constant. The inverse of the time constant is called the leak rate, noted h. By choosing Euler's forward method for solving equation (12), the membrane potential is computed recursively by the equation:

x(t) = (1 − h) · x(t−1) + h · (W_in · u(t) + W_rec · f(t−1))

This is a convex combination between the instantaneous contributions of afferent neurons and the neuron's history. The influence of the history is partially controlled by the leak rate. A high leak rate will result in a responsive reservoir with a very limited temporal memory. A low leak rate will result in a slowly varying network whose activation values depend more on the global temporal structure of the input

sequence.
Finally, the mean firing rate of a reservoir neuron is given by:

f(t) = σ(x(t) + b)

Where:
- σ is the non-linear activation function of the reservoir neurons
- b is a bias that will act as a threshold for the neuron's activation function.

We choose σ ≡ tanh, the hyperbolic tangent activation function, with a zero bias. Negative firing rate values represent the inhibitory/excitatory connection type in conjunction with the sign of the synaptic weight. Only the product of the mean firing rate of the afferent neuron and its associated synaptic weight is seen by the leaky integrator neuron. See S1 High dimensional processing in the reservoir for more details on interpreting activity in the reservoir.
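Putting the update equations together, a single reservoir time-step might be sketched as follows (a sketch under stated assumptions: the reservoir size and weight statistics are illustrative, not the values used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(42)
N_in, N_res = 256, 100                    # place-cells, reservoir units (illustrative)
W_in = rng.normal(0.0, 0.1, (N_res, N_in))                     # fixed input weights
W_rec = rng.normal(0.0, 1.0 / np.sqrt(N_res), (N_res, N_res))  # fixed recurrent weights
np.fill_diagonal(W_rec, 0.0)              # self-connections forced to zero
h = 0.3                                   # leak rate = inverse time constant

def reservoir_step(x_prev, u):
    """One Euler-forward update of the leaky-integrator membrane potentials,
    followed by the tanh firing-rate nonlinearity (zero bias)."""
    x = (1.0 - h) * x_prev + h * (W_in @ u + W_rec @ np.tanh(x_prev))
    return x, np.tanh(x)

x = np.zeros(N_res)
for t in range(5):                        # drive with a constant input pattern
    x, f = reservoir_step(x, np.ones(N_in) * 0.1)
```

The leak rate h interpolates between a responsive reservoir (h near 1) and one dominated by its own history (h near 0), as described above.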

Learning in Modifiable PFC Connections to Readout
Based on the rich activity patterns in the reservoir, it is possible to decode the reservoir's state in a supervised manner in order to produce the desired output as a function of the input sequence. This decoding is provided by the readout layer and the matrix of modifiable synaptic weights linking the reservoir to the readout layer, noted W_out and represented by dashed lines in Figure 3. The readout activation pattern y(t) is given by the equation:

y(t) = σ(W_out · f(t) + b)
Where:
- σ is the non-linear activation function of the readout neurons
- b is a bias that will act as a threshold for the neuron's activation function

We choose σ ≡ tanh, the hyperbolic tangent activation function, with a zero bias. Notice that the update algorithm described above is a very particular procedure inherited from feedforward neural networks. We chose to use it because it is computationally efficient and

deterministic.
Once the neural network states are updated, the readout synaptic weights are updated using a stochastic gradient descent algorithm. By deriving the Widrow-Hoff delta rule (Widrow & Hoff 1960) for hyperbolic tangent readout neurons, we have the following update equation:

W_out ← W_out + η · [(y*(t) − y(t)) ⊙ (1 − y(t)²)] · f(t)ᵀ    (16)

Where:
- η is a small positive constant called the learning rate
- y*(t) is the desired readout activation pattern

In this study, we will focus on the prediction of the next place-cell activation pattern:

y*(t) = u(t+1)

This readout is considered to take place in the striatum, as part of a cortico-striatal learning system. This is consistent with data indicating that while hippocampus codes future paths, the striatum codes actual location (van der Meer et al 2010).
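The readout and its delta-rule update might be sketched as follows (again a sketch: the (1 − y²) factor is the tanh derivative from the chain rule, and the sizes and learning rate are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
N_res, N_out = 100, 256                   # reservoir units, place-cells (illustrative)
W_out = np.zeros((N_out, N_res))          # modifiable readout weights
eta = 0.05                                # learning rate

def readout_update(f, target):
    """Compute the readout y = tanh(W_out f) and apply one stochastic
    gradient step of the Widrow-Hoff rule derived for tanh units."""
    global W_out
    y = np.tanh(W_out @ f)
    delta = (target - y) * (1.0 - y ** 2)  # error times tanh'(potential)
    W_out += eta * np.outer(delta, f)
    return y

f = rng.uniform(-1, 1, N_res)             # a fixed reservoir state, for illustration
target = np.zeros(N_out); target[0] = 0.9 # e.g. the next place-cell pattern u(t+1)
for _ in range(200):
    y = readout_update(f, target)
```

Repeated presentations of the same state/target pair drive the corresponding readout unit toward the target value, which is the online behavior the training procedure relies on.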

Training
After each trial, the model is trained using a dataset that is generated online by the snippet replay mechanism described above in the paragraph on Hippocampus replay. The readout synaptic weights are also learned online by using the learning rule described in the Learning in Modifiable PFC Connections to Readout section. The model does not receive any form of feedback from the environment, and it learns place-cell activation sequences based only on random replay of snippets. Between each sequence of the training set (snippets in our case), the states of the reservoir and readout are set to small random uniform values centered on zero. This models a time between the replays of two snippets that is sufficiently long to induce states in the neural network that are not correlated with the previous stimulus. This has the same effect as simulating a longer time after each snippet, without the computational cost associated with this extra simulation time.

After training, the model is evaluated against the complete place-cell activation sequence that it should produce. This sequence is called the target sequence. The model's ability to generate a place-cell activation sequence is then evaluated by injecting the output prediction of the next place-cell activation pattern as the input at the next step. In this iterative procedure, the system should autonomously reproduce the trained sequence pattern of place-cell activations.

Predicted place-cell activation values might be noisy, and the reinjection of even small amounts of noise in this autonomous generation procedure can lead to divergence. We thus employ a procedure that determines the location coded by the place-cell activation vector output, and reconstructs a proper place-cell activation vector coding this location. We call this denoising procedure the spatial filter, as referred to in Figure 3. We model the rat action as 'reaching the most probable nearby location'.
Since only the prediction of the next place-cell activation pattern is available, we need to estimate the most probable point p*(t+1) = (x*(t+1), y*(t+1)). From a Bayesian point of view, we need to determine the most probable next location p(t+1), given the current location p(t) and the predicted place-cell activation pattern y(t). We can state our problem as:

p*(t+1) = argmax_{p(t+1)} P(p(t+1) | p(t), y(t)) + ν

Where:
- ν is a noise function sampling a uniform distribution U(0, σ)

This noise term is useful at least in degenerate cases, when a zero place-cell activation prediction generates an invalid location coding. It is also used for biasing the generation procedure and exploring other branches of the possible trajectories the model can generate, as described in section Evaluating Behavior with Random walk. The system is then moved to this new location p*, and a new noise/interference-free place-cell activation pattern is generated by the place-field model. We refer to this place-cell prediction/de-noising method as the spatial filter, which emulates a sensory-motor loop for the navigating rat in this study.

Trajectories are superimposed and summed, resulting in a two-dimensional histogram representing the space occupied by trajectories. Figure 4 shows an example of random walk trajectories, illustrating the model's ability to autonomously generate a long and complex sequence when learning without snippets.
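A minimal sketch of the spatial filter, assuming candidate locations are scored by the match (dot product) between the predicted pattern and the clean pattern each nearby grid location would generate; this scoring rule, the step limit, and all names are our assumptions, not the paper's exact Bayesian estimator:

```python
import numpy as np

def activity(p, centers, r=0.25, theta=0.1):
    """Clean place-cell pattern for a 2-D location p (Gaussian place-fields)."""
    d2 = np.sum((np.asarray(centers, float) - np.asarray(p, float)) ** 2, axis=1)
    return np.exp(np.log(theta) * d2 / r ** 2)

def spatial_filter(y_pred, p_curr, centers, max_step=0.25, noise=0.0, rng=None):
    """Decode the most probable nearby location from a noisy predicted
    place-cell pattern, then re-encode that location noise-free."""
    rng = rng or np.random.default_rng()
    centers = np.asarray(centers, float)
    # candidate next locations: grid points within reach of the current position
    near = np.linalg.norm(centers - np.asarray(p_curr, float), axis=1) <= max_step
    cand = centers[near]
    # likelihood proxy: match between the prediction and each candidate's clean code
    scores = np.array([activity(c, centers) @ y_pred for c in cand])
    scores = scores + rng.uniform(0.0, noise, len(cand))   # optional noise term nu
    p_star = cand[int(np.argmax(scores))]
    return p_star, activity(p_star, centers)               # de-noised pattern
```

Re-encoding the decoded location closes the sensory-motor loop: the reservoir always receives a valid place-cell vector, even when its raw prediction was noisy.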

INSERT Figure 4 HERE
In cases where small errors in the readout are reinjected as input, they can be amplified, causing the trajectory to diverge. It is possible to overcome this difficulty by providing as input the expected position at each time step instead of the predicted position. The error/distance measurement can still be made, and will quantify the diverging prediction, while allowing the trajectory generation to continue. This method is called non-autonomous generation, and it evaluates only the ability of a model to predict the next place-cell activation pattern, given an input sequence of place-cell activations.

Comparing produced and ideal sequences using Discrete Fréchet distance

The joint PFC-HIPP model can be evaluated by comparing an expected place-cell firing pattern with its prediction by the readout layer. At each time step, an error metric is computed and then averaged over the duration of the expected firing rate sequence. The simplest measure is the mean square error. This is the error that the learning rule described in equation (16) minimizes. Although the model output is a place-cell coding, what is of interest is the corresponding spatial trajectory. A useful measurement in the context of comparing spatial trajectories is the discrete Fréchet distance. It is a measure of similarity between two curves that takes into account the location and ordering of the points along the curve. We use the discrete Fréchet distance applied to polygonal curves as initially described in Eiter and Mannila (1994), defined as:

δ_dF(A, B) = min_{α,β} max_t d(A(α(t)), B(β(t)))
Where d(·, ·) is the Euclidean distance, n is the number of steps of curve A, m is the number of steps of curve B, and α, β are reparametrizations of the curves A and B. Parameterization of this measure is described in more detail in S1 Frechet distance parameters.
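The discrete Fréchet distance admits a simple dynamic-programming implementation, following the coupling-distance formulation of Eiter and Mannila (1994):

```python
import numpy as np

def discrete_frechet(A, B):
    """Discrete Fréchet distance between polygonal curves A and B,
    computed by the Eiter-Mannila dynamic program over point couplings."""
    A, B = np.asarray(A, float), np.asarray(B, float)
    n, m = len(A), len(B)
    d = lambda i, j: np.linalg.norm(A[i] - B[j])   # Euclidean point distance
    ca = np.zeros((n, m))
    ca[0, 0] = d(0, 0)
    for i in range(1, n):                          # first column: forced coupling
        ca[i, 0] = max(ca[i - 1, 0], d(i, 0))
    for j in range(1, m):                          # first row: forced coupling
        ca[0, j] = max(ca[0, j - 1], d(0, j))
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]),
                           d(i, j))
    return ca[n - 1, m - 1]
```

Unlike a pointwise mean square error, this measure stays meaningful when the produced and ideal trajectories advance at slightly different speeds, since the coupling may dwell on points of either curve.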

For robustness purposes, results are based on a population of neural networks rather than a single instance. The population size is usually 1000 for evaluating a condition, and the metrics described above are aggregated by computing their mean μ(·) and standard deviation σ(·). For convenience, we define a custom score function, associated with a batch of coherent measurements, that combines these two statistics. Results having a low mean and standard deviation will be reported as a low score, whilst other possible configurations will result in a higher score. We choose this method rather than the Z-score, which penalizes low standard deviations. We first established that the model displays standard sequence learning capabilities (e.g. illustrated in Figure 4) and studied parameter sensitivity (see S1 Basic Sequence learning and parameter search), and then addressed consolidation from replay.

Consolidation from snippet replay
The model is able to learn and generate navigation sequences from place-cell activation patterns. The important question is whether a sequence can be learned by the same model when it is trained on randomly presented snippets, instead of the continuous sequence.

INSERT Figure 5 HERE
In this experiment, no reward is used, and thus each snippet has an equal chance of being replayed. The only free parameter is the snippet size. In order to analyze the reservoir response, we collect the state-trajectories of reservoir neurons when exposed to snippets. Recall that the internal state of the reservoir is driven by the external inputs and by recurrent internal dynamics; thus the reservoir adopts a dynamical state-trajectory when presented with an input sequence. Such a trajectory is visualized in Figure 5D. This is a 2D (low-dimensional) visualization, via PCA, of the high-dimensional state transitions realized by the 1000-neuron reservoir as the input sequence corresponding to ABCDE is presented. Panels A-C illustrate the trajectories that the reservoir state traverses as it is exposed to an increasing number of randomly selected snippets generated for the same ABCDE sequence. We observe that as snippets are presented, the corresponding reservoir state-trajectories start roughly from the same point, because of the random initial state of the reservoir before each snippet is replayed. The trajectories then evolve and partially overlap with the state-trajectory produced by the complete sequence. In other words, snippets quickly drive the reservoir state from an initial random activation (corresponding to the grey area at the center of each panel) onto their corresponding locations in the reservoir activation state-trajectory of the complete sequence. Replaying snippets at random thus has no negative impact, because the reservoir states overlap when snippet trajectories overlap. Thus, we see that the state trajectories traversed by driving the reservoir with snippets overlap those of the original intact sequence. See further details of sequence learning by snippet replay in S1 Sequence complexity effects on consolidation.
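This kind of state-trajectory visualization can be reproduced with a standard SVD-based PCA projection; a minimal sketch on synthetic states (the random-walk matrix below stands in for recorded reservoir activity):

```python
import numpy as np

def pca_project(states, k=2):
    """Project state vectors (rows) onto their first k principal components."""
    X = np.asarray(states, float)
    X = X - X.mean(axis=0)                 # center before PCA
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T                    # (T, k) low-dimensional trajectory

# synthetic stand-in for reservoir states collected along a sequence
rng = np.random.default_rng(7)
T, N = 50, 100
states = np.cumsum(rng.normal(size=(T, N)), axis=0)  # smooth-ish trajectory
traj2d = pca_project(states)
```

Plotting the rows of `traj2d` in presentation order gives the 2D state-trajectory curves shown in Figure 5-style panels.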

Longer paths are rejected
Here we examine how using reward proximity to modulate snippet replay probability distributions (as described in the hippocampal replay section) allows the rejection of longer, inefficient paths between rewarded targets. In this experiment, 1000 copies of the model are run 10 times each. Each is exposed to the reward-modulated replay of two sequences, ABC and ABD, having a common prefix AB, as illustrated in Figure 6. The random replay is not uniform and takes into account the reward associated with a baited feeder when food was consumed. Snippets close to a reward have more chance of being replayed, and thus of being consolidated into a trajectory.
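The reward-modulated replay might be sketched as follows (a sketch under assumptions: the time budget is counted in activation patterns rather than milliseconds, `r_hat` stands for the propagated per-position reward estimate, and `reverse_rate` implements mixed forward/reverse replay; these names and simplifications are ours):

```python
import numpy as np

def replay_episode(sequence, n, budget, r_hat=None, reverse_rate=0.0, rng=None):
    """Draw snippets of n successive place-cell activation patterns until
    the time budget (counted in patterns, for simplicity) is exhausted.

    sequence : (T, N_cells) array of place-cell activation patterns
    r_hat    : optional per-position replay weights (propagated reward);
               None gives uniform sampling, as in the unrewarded condition.
    """
    rng = rng or np.random.default_rng()
    T = len(sequence)
    starts = np.arange(T - n + 1)
    if r_hat is None:
        p = np.ones(len(starts))
    else:
        p = np.asarray(r_hat, float)[starts] + 1e-12  # bias toward reward proximity
    p = p / p.sum()
    snippets, used = [], 0
    while used + n <= budget:
        s = sequence[rng.choice(starts, p=p):][:n]
        if rng.random() < reverse_rate:               # reverse-order replay
            s = s[::-1]
        snippets.append(s)
        used += n
    return snippets
```

With `r_hat` peaked near rewarded positions, snippets overlapping the reward are drawn more often, which is the bias this experiment exploits.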

INSERT Figure 6 HERE
Panel A in Figure 6 illustrates the distribution of snippets selected from the two sequences, ABC in pink and ABD in blue. At the crucial point of choice at location B, the distribution of snippets for sequence ABC largely outnumbers those for sequence ABD. This is due to the propagation of rewards respectively from points C and D. By design, rewards propagated from a more proximal location will have a greater influence on snippet generation. Panel C shows the 2D histogram of autonomously generated sequences when the model is primed with the initial sequence prefix starting at point A. We observe a complete preference for the shorter sequence ABC, illustrated in panel E.

The snippet generation model described above takes into account both the location of rewards and the magnitude of rewards. Panel B illustrates the distribution of snippets allocated to paths ABC and ABD when a 10x stronger reward is presented at location D. This strong reward dominates the snippet generation and produces a distribution that strongly favors the trajectory towards location D, despite its greater distance. Panel F illustrates the error measures for model reconstruction of the two sequences and confirms this observation. This suggests an interesting interaction between distance and reward magnitude. For both conditions, distances to the expected sequence have been measured for every trajectory generated (10,000 for ABC and 10,000 for ABD). A Kruskal-Wallis test then confirms (p-value ≈ 0) for both cases that trajectories generated autonomously are significantly more accurate for the expected trajectories (i.e. ABC when rewards are equal, and ABD when the reward at D is 10x).

Novel efficient sequence creation
Based on the previously demonstrated dynamic properties, we determined that when rewards of equal magnitude are used, the model favors shorter trajectories between rewards. We will now test the model's ability to exploit this capability in order to generate a novel and efficient trajectory from trajectories that contain sub-paths of the efficient trajectory. That is, we determine whether the model can assemble the efficient subsequences together, and reject the longer inefficient subsequences, in order to generate a globally efficient trajectory. Figure 1 (Panel A) illustrates the desired trajectory that should be created without direct experience, after experience with the three trajectories in panels B-D that each contain part of the optimal trajectory (red), which will be used to train the model.

INSERT Figure 7 HERE
The reward-biased replay is based on the following trajectories: (1) ABCED, which contains the ABC part of the ABCDE target sequence; (2) EBCDA, which contains the BCD part; and (3) BACDE, which contains the CDE part. Figure 7A illustrates how the hippocampal replay model generates distributions of snippets that significantly favor the representation of the efficient subsequences of each of the three training sequences. This is revealed as the three successive peaks of snippet distributions on the time histogram for the blue (ABCED) sequence, favoring its initial part ABC; the yellow (EBCDA) sequence, favoring its middle part BCD; and the pink (BACDE) sequence, favoring its final part CDE. When observing each of the three color-coded snippet distributions corresponding to each of the three sequences, we see that each sequence is favored (with high replay density) precisely where it is most efficient. Thus, based on this distribution of snippets that is biased towards the efficient subsequences, the reservoir should be able to extract the efficient sequence.

This is shown in panel B, which illustrates the autonomously generated sequences for 1000 instances of the model executed 10 times each. The spatial histogram reveals that the model is able to extract and concatenate the efficient subsequences to create the optimal path, though it was never seen in its entirety in the input. Panel C illustrates the significant differences in performance between the favored efficient sequence vs. the three that contain non-efficient subsequences. A Kruskal-Wallis test confirms these significant differences in reconstruction error for the efficient vs. non-efficient sequences (maximum p = 5.9605e-08).

Reverse replay
In (Carr et al 2011), hippocampus replay during SWR is characterized by the activation order of the place-cells, which occurs in both backward and forward directions. We hypothesize that reverse replay allows the rat to explore a trajectory in one direction but consolidate it in both directions. This means that an actual trajectory, and its unexplored reverse version, can equally contribute to new behavior. Thus fewer actual trajectories are required for gathering information for solving the TSP problem. A systematic treatment of this effect on learning can be seen in S1 Analysis of different degrees of reverse replay.

We now investigate how reverse replay can be exploited in a recombination task where some sequences are experienced in the forward direction, and others in the reverse direction, with respect to the order of the sequence to be generated. We use the same setup as described above for novel sequence generation, but we invert the direction of sequence EBCDA in the training set. Without EBCDA, the model is not exposed to sub-trajectories linking feeders B to C and C to D, and the recombination cannot occur. We then introduce a partial reverse replay, which allows snippets to be played in forward and reverse order. This allows the reservoir to access segments BC and CD (even though they are not present in the forward version of the experienced trajectory).

INSERT Figure 8 HERE

Figure 8 illustrates the histogram of sequence performance for 10,000 runs of the model (1000 models run 10 times each) on this novel sequence generation task, with and without 50% reverse replay. We observe a significant shift towards reduced errors (i.e. towards the left) in the presence of reverse

replay.
We then examine a more realistic situation based on the observation of spontaneous creation of "shortcuts" described in (Gupta et al 2010). The model is exposed to a random replay of snippets extracted from two trajectories having different directions (clockwise, CW, and counter-clockwise, CCW). The system thus experiences different parts of the maze in different directions. We examine whether the use of reverse replay can allow the system to generate novel shortcuts.

INSERT Figure 9 HERE
The left and right trajectories used for training are illustrated in Figure 9A and B. In A, the system starts at MS, heads up and to the left at T2 (counter-clockwise), and terminates back at MS. In B, it heads up and to the right (clockwise), again terminating at MS. Possible shortcuts can take place at the end of a trajectory at MS, as the system continues on to complete the whole outer circuit rather than stopping at MS. We can also test for shortcuts that traverse the top part of the maze by starting at MS and heading left or right and following the outer circuit in the CW or CCW direction, thus yielding 4 possible shortcuts. The model is trained with snippets from the sequences in A and B using different reverse replay rates, and evaluated in non-autonomous mode with sequences representing the 4 possible types of shortcut. Figure 9C shows that with no reverse replay, when attempting the CCW path, there is low error until the system enters the zone that has only been learned in the CW direction. In panel D, with 50% reverse replay, this error is reduced and the system can perform the shortcut without having experienced the right-hand part in the correct direction. Thus, in the right-hand part of the maze, it is as if the system had experienced this already in the CCW direction; in reality this has never occurred, but is simulated by the reverse replay. This illustrates the utility of mixed forward and reverse replay. Panel E illustrates the difficulty when 100% reverse replay is used. Figure 9F […]

We next considered data from rats trying to optimize spatial navigation in the TSP task. In the prototypical TSP behavior, in a given configuration of baited wells, on successive trials the rat traverses different efficient subsequences of the overall efficient sequence, and then finally puts it all together and generates the efficient sequence.
This suggests that as partial data about the efficient sequence are successively accumulated, the system performance will successively improve. To explore this, the model is trained on navigation trajectories that were generated by rats in the TSP task. We selected data from configurations where the rats found the optimal path after first traversing subsequences of that path in previous trials. Interestingly, these data contain examples where the previous informative trials include traversal of part of the optimal sequence in either the forward or reverse direction, and sometimes both (see S Rat navigation data). We trained the model with random replay of combinations of informative trials, where informative trials are successively added, in order to evaluate the ability of the model to successively accumulate information. For each combination of informative trials, the random replay is evaluated with 0%, 25%, 50%, 75% and 100% reverse replay rates, in order to assess the joint effect of random replay and the combination of informative trials. The model is then evaluated in non-autonomous mode with the target sequences, which consist of a set of trajectories linking the baited feeders in the correct order. An idealized sequence is added to the target sequence set because trajectories generated by the rat might contain edges that do not correspond to the shortest distance between two vertices. The agent's moves are restricted to a circle with a 10 cm radius.

Hippocampal replay has been shown to have a positive influence on learning (Johnson & Redish 2005), and here we go beyond this by further exploiting reward structure in the replay.

In the behavior of interest, rats are observed to converge quickly to a near-optimal path linking 5 baited food wells in a 151 cm radius open arena. During their successive approximation to the optimal path, the rats often traversed segments of the optimal trajectory, as well as non-optimal segments.
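The evaluation grid described above, every combination of informative trials crossed with every reverse replay rate, can be organized as a simple sweep. `train_fn` and `eval_fn` are hypothetical placeholders standing in for the model's training and non-autonomous evaluation procedures; only the grid structure is taken from the text.

```python
def sweep_reverse_replay(train_fn, eval_fn, trial_combos,
                         rates=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Cross each combination of informative trials with each reverse
    replay rate, and record one evaluation score per cell of the grid."""
    results = {}
    for combo in trial_combos:
        for rate in rates:
            model = train_fn(combo, reverse_rate=rate)
            results[(tuple(combo), rate)] = eval_fn(model)
    return results
```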
Observing this behavior, we conjectured the existence of neural mechanisms that would allow the optimal segments to be reinforced and the non-optimal segments to be rejected, thus leading to the production of the overall near-optimal trajectory. The overall mechanism we propose can be decomposed into two distinct neural systems. The first is a replay mechanism that favors the representation of snippets that occurred on these optimal segments, and that in contrast gives reduced representation to snippets that correspond to non-optimal trajectory segments. Here we demonstrate a simple but powerful method based on spatial reward propagation that implements this mechanism. Interestingly, this characterization of replay is broadly consistent with the effects of reward on replay observed in behaving animals (Ambrose et al 2016).

The second neural system required to achieve this integrative performance is a sequence learning system that can integrate multiple subsequences (i.e. snippets) into a consolidated representation, taking into consideration the probability distributions of replay so as to favor more frequently replayed snippets. Here we considered a well-characterized model of sequence learning, based on recurrent connections in prefrontal cortex, that is well suited to meet the sequence learning requirements.

Replay mechanism: Replay is modeled using a procedure that randomly selects a subset of place-cells coding part of a sequence, and outputs this snippet while taking into account the proximity of this snippet to a future reward. Each time a reward is encountered, it is taken into consideration in generating the snippet, and reward value is propagated backwards along the sequence, thus implementing a form of spatio-temporal credit assignment. This can be seen in Figures 2, 6 and 7, which illustrate the snippet probability densities.
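The backward reward propagation just described can be sketched minimally as follows, assuming an exponential decay of reward value with distance backwards along the trajectory. The decay constant and function names are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

def snippet_weights(reward_times, T, decay=0.9):
    """Propagate reward value backwards along a trajectory of T steps.

    reward_times: time indices at which a reward was encountered.
    Returns a normalized per-time-step weight: positions closer to a
    future reward receive exponentially more weight, implementing a
    simple form of spatio-temporal credit assignment."""
    w = np.zeros(T)
    for t_r in reward_times:
        steps_before = t_r - np.arange(T)
        mask = steps_before >= 0           # only steps preceding the reward
        w[mask] += decay ** steps_before[mask]
    return w / w.sum()

def sample_snippet(trajectory, weights, length=5, p_reverse=0.5, rng=None):
    """Draw one snippet whose start index is biased toward rewarded
    segments, optionally reversing it to model reverse replay."""
    rng = rng or np.random.default_rng()
    starts = weights[:len(trajectory) - length]
    start = rng.choice(len(starts), p=starts / starts.sum())
    snippet = trajectory[start:start + length]
    if rng.random() < p_reverse:
        snippet = snippet[::-1]
    return snippet
```

Snippets drawn this way over-represent trajectory segments that lead to reward, which is the property the sequence learner exploits.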
The replay mechanism also implements a second feature observed in animal data, which is a tendency to replay snippets in reverse order. These two features of the replay mechanism shape the distribution of replayed snippets based on reward information. This is a form of spatio-temporal credit assignment that allows the system to take advantage of the reservoir network's ability to combine multiple snippets into a whole sequence. First, we showed that it is possible to consolidate multiple sequences featuring parts of the same underlying optimal sequence into one efficient sequence and to generate it autonomously. Second, when the snippet replay likelihood is learned, a non-zero reverse replay rate allows the prefrontal cortex to be exposed to sequences of place-cell activations in both forward and reverse directions. This results in sequence learning in both directions while having experienced a place-cell activation sequence in one direction only. These results can be tested experimentally by recording place-cell activity during SWRs in the task.

Conclusions and limitations: The model we studied here is able to mimic the rat's ability to find good approximations to the traveling salesperson problem by taking advantage of recent rewarding experiences to update a trajectory-generating model using awake hippocampal replay. We showed that reverse replay allows the agent to reduce the TSP task complexity by considering an undirected graph, where feeders are vertices and trajectories are edges, instead of a directed graph. In this case, autonomous sequence generation is no longer possible, but the information available in each prediction of the prefrontal cortex contains the expected locations. This allows the building of a navigation policy taking into account the salient actions suggested by the prefrontal cortex predictions, which are learned from hippocampal replay.
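As a rough sketch of the reservoir-computing component discussed above, an echo state network with a fixed random recurrent pool and a trained linear readout can learn next-step prediction over place-cell activation vectors. The sizes, ridge regression readout, and constants here are illustrative assumptions, not the paper's parameters.

```python
import numpy as np

class Reservoir:
    """Minimal echo-state-network sketch of the PFC sequence learner.

    A fixed random recurrent pool receives place-cell snippets; only the
    linear readout (reservoir state -> predicted next place-cell vector)
    is trained, here by ridge regression over all (state, target) pairs."""

    def __init__(self, n_in, n_res=300, spectral_radius=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(0, 1, (n_res, n_in))
        W = rng.normal(0, 1, (n_res, n_res))
        # scale recurrent weights for stable reverberating dynamics
        W *= spectral_radius / np.max(np.abs(np.linalg.eigvals(W)))
        self.W = W
        self.W_out = None

    def run(self, inputs):
        """Collect reservoir states while driving the pool with inputs."""
        x = np.zeros(self.W.shape[0])
        states = []
        for u in inputs:
            x = np.tanh(self.W @ x + self.W_in @ u)
            states.append(x)
        return np.array(states)

    def train(self, snippets, ridge=1e-3):
        """Fit the readout to predict the next place-cell vector."""
        X, Y = [], []
        for s in snippets:                 # each snippet: (T, n_in) array
            X.append(self.run(s[:-1]))
            Y.append(s[1:])
        X, Y = np.vstack(X), np.vstack(Y)
        self.W_out = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]),
                                     X.T @ Y)

    def predict(self, inputs):
        return self.run(inputs) @ self.W_out
```

Training on many overlapping snippets lets the readout stitch them into one consolidated sequence, which is the consolidation property used in the experiments.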

B depicts the ABCED trajectory, the two randomly selected reward sites 1 and 2, and a snippet randomly drawn between 1 and 2. The snippet length is 5. Panel C represents the raster of the place-cell activation along the ABCED trajectory.

The time indices where feeders A, B (reward site 1), C, D and E (reward site 2) are encountered during the ABCED trajectory are tagged above the