Reinforcement Learning Using a Continuous Time Actor-Critic Framework with Spiking Neurons
Figure 6
A: Cartpole swing-up problem (schematic). The cart slides on a rail of length 5, while the pole of length 1 rotates around its axis, subject to gravity. The state of the system is characterized by the cart's position and velocity and the pole's angle and angular velocity, while the control variable is the force exerted on the cart. The agent receives a reward proportional to the height of the pole's tip. B: Cumulative number of “successful” trials as a function of total trials. A successful trial is defined as a trial in which the pole was kept upright for more than 10 s within the maximum trial length. The black line shows the median, and the shaded area the quartiles, of the performance of 20 TD-LTP agents, pooled in bins of 10 trials. The blue line shows the number of successful trials for a single agent. C: Average reward in a given trial. The average reward rate obtained during each trial is shown versus the trial number. After a rapid rise (inset, vertical axis same as main plot), the reward rises on a much slower timescale as the agents learn the finer control needed to keep the pole upright. The line and the area represent the median and the quartiles, as in B. D: Example agent behavior after 4000 trials. The three diagrams show the same agent recovering from three unstable initial conditions (top: pole sideways; center: rightward speed near rail edge; bottom: small angle near rail edge).