Fig 1.
Overview of the dynamic routing network concept.
(A) Gap junctions allow ions to flow from one neuron to another, and can be represented as asymmetric resistors that connect states (B). (C) State nodes (numbered circles) are connected to each other and to ground via gap junctions. A state node is pulled high when the corresponding world state (upper triangle stars) occurs. The yellow highlight marks a current flow from the present state 1 to a target state 3 (connected to ground) via states 2 and 4. This current flow activates action nodes (rectangles). The connections between state nodes, and the connections from state-node connections to action nodes, are plastic.
Fig 2.
(A) There are 25 locations in the environment, numbered here from 0 to 24. (B) Because the car can only move between adjacent locations not separated by obstacles, the possible transitions between locations are constrained. (C) After training, our model has learned the possible transitions between locations. The conductance strength and direction between nodes are shown by arrow width and direction. State nodes with high potential (e.g. the current state in the example, 16) are shown in yellow, and state nodes with low potential (e.g. the goal state, 20) in purple. The resulting current strength is shown by the brightness of the arrow; in this case current flows from 16 to 20, guiding the car to take this route.
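The routing mechanism in panel (C) can be sketched numerically: pin the current state to a high potential, ground the goal state, solve Kirchhoff's equations over the learned conductance matrix for the remaining node potentials, and move towards the neighbour receiving the strongest outgoing current. This is a minimal illustrative sketch, not the paper's implementation: function names are ours, and the conductance matrix is symmetrised for the linear solve even though the paper's gap-junction conductances are asymmetric.

```python
import numpy as np

def node_potentials(G, source, goal, v_high=1.0):
    """Solve the resistive network for node potentials, with the current
    state pinned to v_high and the goal state grounded at 0."""
    n = G.shape[0]
    S = (G + G.T) / 2.0                 # symmetrise conductances for the solve
    L = np.diag(S.sum(axis=1)) - S      # graph Laplacian
    fixed = [source, goal]
    v_fixed = np.array([v_high, 0.0])
    free = [i for i in range(n) if i not in fixed]
    v = np.zeros(n)
    v[source] = v_high
    if free:
        # Kirchhoff's current law at free nodes: L_ff v_f = -L_fb v_b
        A = L[np.ix_(free, free)]
        b = -L[np.ix_(free, fixed)] @ v_fixed
        v[free] = np.linalg.solve(A, b)
    return v

def best_next_state(G, v, state):
    """Pick the neighbour receiving the strongest outgoing current
    (Ohm's law: I = conductance * positive potential drop)."""
    currents = G[state] * np.clip(v[state] - v, 0.0, None)
    return int(np.argmax(currents))

# Toy 4-node chain 0-1-2-3 with unit conductances: starting at 0 with
# goal 3, the current flow points towards node 1.
G = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    G[i, j] = G[j, i] = 1.0
v = node_potentials(G, source=0, goal=3)
print(best_next_state(G, v, 0))  # 1
```

On the chain the interior potentials fall linearly (v = 1, 2/3, 1/3, 0), so following the strongest current at each node traces the route to the goal, mirroring the 16-to-20 flow in panel (C).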
Fig 3.
The learned state network of the full taxi domain task.
This creates four disconnected graphs (one for each destination, which is unique to an episode) each consisting of four subgraphs, for four possible locations of the passenger (the fifth location of the passenger, at their destination, ends the episode, so is not included). Each subgraph represents the topology of locations in the environment.
Fig 4.
Total reward during each episode in the Taxi-v3 task for the dynamic routing model, using an infinite step limit per episode.
The reward here includes the negative reward per step, although only the positive reward at the final step is used to train the model. Blue line: episode reward. Yellow line: 100-episode average reward. (A) Episode reward per episode. (B) Episode reward (same data as A) plotted against the steps making up each episode (episodes differ in duration) to show how reward changes with time. Inset plots show zoomed-in regions (rescaled y-axis) of the outer plots, showing that the reward level stabilises around -5.
Fig 5.
Result of using Q-learning with a training configuration similar to that used for our model, i.e., a maximum of 100000 steps per episode and sparse reward.
Blue line: episode reward. Yellow line: 100-episode average reward. The left panel shows the reward per episode and the right panel the reward per step. Note that the y-axis scale differs from that of Fig 4. The average episode reward shows that Q-learning's performance decreased in the early episodes of training and failed to converge.
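For reference, the baseline above is standard tabular Q-learning trained under the same sparse-reward regime. A minimal sketch of one update, with illustrative hyperparameters (not the paper's exact values), shows why sparse reward hurts: the TD error is zero on every step until a successful terminal step, so almost no updates carry information early in training.

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

# Toy table: 5 states x 4 actions, initialised to zero.
Q = [[0.0] * 4 for _ in range(5)]

# Sparse reward: intermediate steps give r = 0, so the TD error (and the
# update) is zero while the table is still all zeros.
q_update(Q, 0, 1, 0.0, 1)
# Terminal step with Taxi-v3's +20 drop-off reward: the only informative update.
q_update(Q, 3, 2, 20.0, 4)
print(Q[0][1], Q[3][2])  # 0.0 2.0
```

Only states adjacent to a successful terminal step ever change at first; value must then propagate backwards one episode at a time, which is slow when episodes can run for up to 100000 steps.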
Fig 6.
Number of steps per episode in the Taxi-v3 task. (A) Number of steps with the dynamic routing model.
The inset figure is a zoomed-in version of the outer plot, showing convergence to around 20 steps. (B) Number of steps taken per episode by Q-learning in the same training configuration as Fig 5. Q-learning does not converge.
Fig 7.
(A) Locations are marked with yellow dots and numbered. The corresponding Voronoi diagram is shown in blue, and the corresponding Delaunay triangulation in yellow. (B) A maze generated by removing randomly selected walls from the Voronoi diagram.
Fig 8.
(A) The rendered image of the task. The region coloured red is the goal location. (B) The learned state network for the Voronoi world task. The darker the edge colour, the stronger the connection. Connections with weights below a threshold are not shown.
Fig 9.
Cumulative episode reward in the Voronoi world task for the dynamic routing model.
A 10000-step limit per episode was used. Only the reward at the final step is fed to the model. Blue line: episode reward. Yellow line: 100-episode average reward. (A) Reward per episode; (B) reward per step. The y-axes are linear between -10 and 10, and logarithmic outside this range.
Fig 10.
Result of using Q-learning with a similar training configuration to solve the Voronoi world.
That is, a maximum of 10000 steps per episode and sparse reward. Q-learning did not converge in this training configuration. (A) Reward per episode; (B) reward per step.
Fig 11.
An example Petri dish (“AON”).
A represents amylacetate, O represents 1-octanol, M represents a mix of odours, and N denotes no reinforcer.
Table 1.
The odours and reinforcers in Petri dishes.
Table 2.
Action nodes.
Table 3.
The activated state node given a perception.
Table 4.
The sequence of Petri dishes used during training.
Note that each protocol pairs one odour with either reward or punishment and the other with no reinforcer, alternating three times between these conditions.
Table 5.
Protocol for training and testing.
Fig 12.
Boxplots of the maggot learning index at each step of testing.
(A) The learning index with our model in the task. (B) The results from [39]. FN: trained with fructose and tested without reinforcer. FF: trained and tested with fructose.
Fig 13.
The state network with a group of state nodes (0, 1, 2) activated and another group of state nodes (8, 9) set as a target.
Fig 14.
(A) A minimal state network: state node i and state node j have different potentials (note that 'potential' here refers to potential in a resistive network, not the membrane potential of a neuron). The potential difference Vi,j causes a current Ii,j from i to j if there is a connection from i to j and i has a higher potential than j. The weight of the connection wi,j is its conductance, and the current Ii,j follows Ohm's law. (B) A minimal circuit including an action node: an action node k is influenced by the state-node connection to which it is attached. The influence can be described as a function of the potential difference Vi,j, the current Ii,j, and the weight of the connection to the action node wi,j,k. See text for details.
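The minimal circuit in Fig 14 can be written out in a few lines. This is an illustrative sketch only: the function names are ours, and the action-node influence is modelled as the connection current scaled by wi,j,k, one plausible form of the function described in the text, not the paper's exact rule.

```python
def current(v_i, v_j, w_ij):
    """Ohm's law through a directed connection of conductance w_ij:
    current flows from i to j only when i is at the higher potential,
    since the connections (gap junctions) are asymmetric."""
    v_diff = v_i - v_j
    return w_ij * v_diff if v_diff > 0 else 0.0

def action_influence(v_i, v_j, w_ij, w_ijk):
    """Influence on action node k attached to connection (i, j), sketched
    here as the connection current scaled by the state-to-action weight."""
    return w_ijk * current(v_i, v_j, w_ij)

# Node i at potential 1.0, node j grounded, conductance 0.5:
print(current(1.0, 0.0, 0.5))                 # 0.5
# No reverse current when j is the higher-potential node:
print(current(0.0, 1.0, 0.5))                 # 0.0
# Action node k attached to the (i, j) connection with weight 2.0:
print(action_influence(1.0, 0.0, 0.5, 2.0))   # 1.0
```

The asymmetry (zero reverse current) is what lets the same network carry directed flow from the present state to the target, as in Fig 1C.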
Table 6.
Parameters used for each task.