Adaptive algorithms for shaping behavior
Fig 3
A: An overview of the POMCP teacher, which alternates between two steps: inferring the student’s q values, innate bias, and learning rate from the transcript, and planning with a Monte Carlo tree search.
B: The adaptive heuristic (ADP), which employs a simple decision rule to stay at, increment, or decrement the current difficulty based on the estimated success rate (computed using an exponential moving average over past transcripts).
C: POMCP and ADP perform comparably and significantly outperform other algorithms [39] when the task is non-trivial (low ε), including when INC fails (
). Here N = 10. Note that planning using POMCP is intractable when
. Bar plot means are estimated from 10 repeats.
D,E: POMCP and ADP adaptively alternate between difficulty levels, thereby preventing catastrophic extinction. Note the drop in difficulty levels after significant extinction in both cases. Here
.
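The decision rule described in panel B can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `AdaptiveTeacher`, the smoothing factor `alpha`, the thresholds `upper` and `lower`, and the initial success-rate estimate are all hypothetical values chosen for illustration. The caption specifies only that the success rate is tracked with an exponential moving average and that the teacher stays, increments, or decrements the difficulty.

```python
from dataclasses import dataclass


@dataclass
class AdaptiveTeacher:
    """Stay/increment/decrement rule driven by an EMA of the success rate.

    All default values below are assumptions, not taken from the paper.
    """
    n_levels: int               # number of difficulty levels (N in the caption)
    alpha: float = 0.2          # EMA smoothing factor (assumed)
    upper: float = 0.8          # raise difficulty above this rate (assumed)
    lower: float = 0.4          # lower difficulty below this rate (assumed)
    level: int = 1              # current difficulty, in 1..n_levels
    success_rate: float = 0.5   # initial EMA estimate (assumed)

    def update(self, success: bool) -> int:
        """Fold one outcome into the EMA, then choose the next difficulty."""
        self.success_rate = (
            (1 - self.alpha) * self.success_rate + self.alpha * float(success)
        )
        if self.success_rate > self.upper and self.level < self.n_levels:
            self.level += 1     # student is doing well: make the task harder
        elif self.success_rate < self.lower and self.level > 1:
            self.level -= 1     # student is struggling: back off
        return self.level       # otherwise stay at the current level
```

Calling `update` once per episode with the student's outcome returns the difficulty for the next episode. A run of failures lowers the estimate and eventually the level, which is the back-off behavior the caption describes after extinction.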