Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons

doi:10.1371/journal.pcbi.1003024


Figure 4

Maze navigation learning task.

A: The maze consists of a square enclosure with a circular goal area (green) in the center. A U-shaped obstacle (red) makes the task harder by forcing turns on trajectories from three of the four possible starting locations (crosses). B: Color-coded trajectories of an example TD-LTP agent during the first 75 simulated trials. Early trials (blue) are spent exploring the maze and the obstacles, while later trials (green to red) exploit stereotypical behavior. C: Value map (color map) and policy (vector field) represented by the synaptic weights of the agent of panel B after 2000 simulated seconds. D: Goal-reaching latency of agents using different learning rules. Latencies of simulated agents per learning rule are binned by 5 trials (trials 1–5, trials 6–10, etc.). The solid line shows the median of the latencies for each trial bin, and the shaded area represents the 25th to 75th percentiles. For the R-max rule, these all coincide with the time limit after which a trial was interrupted if the goal had not been reached. The R-max agents were simulated without a critic (see main text).
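The binning described for panel D can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function name `binned_latency_stats`, the input layout (one list of per-trial latencies per agent), and the choice to pool all agents' latencies within a bin before taking percentiles are assumptions. Trials that hit the time limit are assumed to be recorded with the time limit as their latency.

```python
from statistics import median, quantiles


def binned_latency_stats(latencies, bin_size=5):
    """Median and interquartile range of latencies per trial bin.

    latencies: list of per-agent lists, latencies[agent][trial], in seconds
               (hypothetical layout, assumed for this sketch).
    Returns one (median, q25, q75) tuple per complete bin of `bin_size` trials.
    """
    n_trials = min(len(row) for row in latencies)
    stats = []
    # Step through complete bins only: trials 1-5, 6-10, and so on.
    for start in range(0, n_trials - bin_size + 1, bin_size):
        # Assumption: pool every agent's latencies that fall in this bin.
        pooled = [x for row in latencies for x in row[start:start + bin_size]]
        # Quartiles with linear interpolation over the pooled sample.
        q25, _, q75 = quantiles(pooled, n=4, method="inclusive")
        stats.append((median(pooled), q25, q75))
    return stats


# Two toy agents with ten trials each, giving two bins of five trials.
example = [
    [10, 20, 30, 40, 50, 1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50, 1, 2, 3, 4, 5],
]
stats = binned_latency_stats(example)
```

Under this pooling assumption, a learning rule whose agents never reach the goal would yield a median and both quartiles equal to the time limit in every bin, which matches the flat curve the caption describes for R-max.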

doi: https://doi.org/10.1371/journal.pcbi.1003024.g004