Fig 1.

A: Teaching with our OCL framework can be visualized using a difficulty landscape (here, parameterized by two skill axes), which quantifies the student’s success probability at each difficulty level.

A student assigned an extremely difficult task will not learn, since they are unlikely to succeed and thus receive little reinforcement. The teacher’s purpose is to adaptively assign tasks (shown in yellow) based on the student’s progress and the difficulty landscape. A successful curriculum charts a short path through the landscape, lowering the difficulty while promoting quick learning. B: Tasks from a pre-defined set are ordered based on their difficulty, as measured by the success probability of a naive agent. An autonomous teacher decides the student’s task () based on the student’s transcript of successes and failures (represented here as 1s and 0s, respectively) on previously assigned tasks. C: We apply our OCL framework to three biologically relevant goal-oriented tasks involving delayed rewards: a generic sequence learning task, an odor-guided trail tracking task, and a plume-tracking task involving localization to a target based on sparse cues.


Fig 2.

A: The sequence learning setup.

In the full task, the student is required to take a sequence of N correct actions to get reward. In intermediate levels of the task, the reward is delivered if the student takes correct actions. is the innate bias of the student to take the correct action at the ith step, prior to training. We assume for all i unless otherwise specified. B: The incremental teacher (INC) fails once . C: The q values (in grayscale) for the correct action at each step, shown for (top) and (bottom). The red line shows the assigned task level. Note the striped dynamics in the top row caused by alternating reinforcement and extinction. In the bottom row, ε is too small, forcing learning to stall. D: Time series of q values for actions at the first (solid black) and third (dashed gray) steps for the two examples shown in panel C.
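The all-or-nothing reinforcement described here can be sketched as a minimal episode loop. Everything below (the logistic-free success probability, the names `run_episode`, `eps_bias`, `alpha`) is an illustrative assumption, not the paper's exact student model:

```python
import random

def run_episode(q, level, eps_bias=0.25, alpha=0.1):
    """One episode of the sequence task at difficulty `level`.

    q[i] is the student's value for the correct action at step i.
    The success probability at each step is approximated crudely as
    an innate bias plus the learned q value (an illustrative form,
    not the paper's model). Reward 1 is delivered only if all `level`
    actions are correct, and every visited step's q value is then
    reinforced (or extinguished) toward the delayed outcome.
    """
    correct = True
    visited = []
    for i in range(level):
        p_correct = min(1.0, eps_bias + q[i])
        visited.append(i)
        if random.random() >= p_correct:
            correct = False
            break
    reward = 1.0 if correct else 0.0
    for i in visited:  # delayed, all-or-nothing update
        q[i] += alpha * (reward - q[i])
    return reward
```

Under this sketch, assigning the full task directly stalls learning (the joint success probability is tiny), which is the failure mode the curriculum is designed to avoid.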


Fig 3.

A: An overview of the POMCP teacher, which cycles between inferring the student’s q values, innate bias and learning rate based on the transcript and planning using a Monte Carlo tree search.

B: The adaptive heuristic (ADP), which employs a simple decision rule to stay at, increment, or decrement the current difficulty based on the estimated success rate (computed using an exponential moving average over past transcripts). C: POMCP and ADP are comparable and significantly outperform other algorithms [39] when the task is non-trivial (low ε), including when INC fails (). Here N = 10. Note that planning using POMCP is intractable when . Barplot means are estimated from 10 repeats. D, E: POMCP and ADP adaptively alternate between difficulty levels, thereby preventing catastrophic extinction. Note the drop in difficulty levels after significant extinction in both cases. Here .
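The ADP decision rule admits a compact sketch. The thresholds and EMA weight below are illustrative placeholders, not the paper's tuned values:

```python
def adp_teacher(success_ema, level, n_levels, up=0.8, down=0.3):
    """ADP-style decision rule (sketch, illustrative thresholds).

    Increment the difficulty when the smoothed success rate is high,
    decrement it when low, otherwise stay at the current level.
    """
    if success_ema > up and level < n_levels - 1:
        return level + 1
    if success_ema < down and level > 0:
        return level - 1
    return level

def update_ema(ema, outcome, beta=0.9):
    """Exponential moving average over the 0/1 transcript of outcomes."""
    return beta * ema + (1 - beta) * outcome
```

The drop-after-extinction behavior in panels D and E corresponds to the decrement branch firing once the smoothed success rate falls below the lower threshold.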


Fig 4.

Deep reinforcement learning agents trained using a curriculum solve navigation tasks with delayed rewards.

A: The trail tracking paradigm. A sample trajectory of a trained agent navigating a randomly sampled odor trail. The colors show odor concentration. The inset shows the egocentric visuospatial input received by the network, where the agent’s location is in red and odor detections are in green. B: Sample trails from the six difficulty levels. C: ADP outperforms INC and RAND (each teacher-student interaction is a step). The agent does not learn the task without a curriculum. Results are plotted from 5 repeats. D: The success rate of the agent in finding the target over training (black dashed line) for INC and ADP. The curriculum is shown in red. Note the significant forgetting shown by the student trained using the INC approach compared to ADP. E-G: As in panels A-D for a localization task. The agent is required to navigate towards a source which emits Poisson-distributed cues whose detection probability decreases with distance from the source (colored in green on a log scale). Results are plotted from 15 repeats.
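The sparse-cue emission in the localization task can be sketched as thinned Poisson detections. The exponential decay of the mean cue count with distance and the parameters `rate0` and `length_scale` are assumptions for illustration, not the paper's emission model:

```python
import math
import random

def detection_prob(distance, rate0=1.0, length_scale=5.0):
    """Probability of detecting at least one cue in a time step.

    The mean cue count is assumed to decay exponentially with distance
    from the source (illustrative form); for N ~ Poisson(mean),
    P(N >= 1) = 1 - exp(-mean).
    """
    mean_cues = rate0 * math.exp(-distance / length_scale)
    return 1.0 - math.exp(-mean_cues)

def cue_detected(agent_pos, source_pos):
    """Sample a binary cue detection given agent and source positions."""
    return random.random() < detection_prob(math.dist(agent_pos, source_pos))
```

The key property for the curriculum is monotonicity: detections become vanishingly rare far from the source, which is what makes the full task's reward signal sparse and delayed.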


Fig 5.

Algorithms for designing continuous curricula. A: Decision tree showing the continuous version of ADP, which includes actions that “grow” and “shrink” the increments between continuously parameterized difficulty levels.

See the text for more details of the task in the continuous setting. B: ADP significantly outperforms INC when the task is difficult (low ε). Barplot means are estimated from 10 repeats. C,D: The q values plotted as in Fig 3D, 3E. Similar to the discrete setting, INC shows catastrophic extinction and never learns the task for sufficiently small ε. Continuous ADP first decreases increment size and smoothly increases the difficulty level while balancing reinforcement and extinction.
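The continuous variant can be sketched by letting the teacher adjust the increment itself alongside the difficulty. The thresholds, growth/shrink factors, and the coupling of "grow" to advancing and "shrink" to retreating are illustrative choices, not the paper's exact decision tree:

```python
def continuous_adp(success_ema, level, step,
                   up=0.8, down=0.3, grow=1.5, shrink=0.5,
                   min_step=1e-3, max_level=1.0):
    """Continuous ADP sketch (illustrative parameters).

    On a high smoothed success rate, advance the difficulty and grow
    the increment; on a low rate, retreat and shrink the increment so
    the difficulty can be raised smoothly. Returns (level, step).
    """
    if success_ema > up:
        return min(level + step, max_level), step * grow
    if success_ema < down:
        return max(level - step, 0.0), max(step * shrink, min_step)
    return level, step
```

Shrinking the increment after failures is what lets this sketch first reduce step size and then ramp the difficulty smoothly, mirroring the reinforcement/extinction balance described for panels C and D.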
