Environmental uncertainty and the advantage of impulsive choice strategies
Fig 5
Explore-Exploit bandit task with novelty.
A) Example sequence of trials in the Explore-Exploit task. In this example, each option (A, B, C) is a picture with an underlying reward rate. The agent or participant must learn the values of the three options through experience of choosing the options and receiving or not receiving a reward on each trial. In this example, the agent has learned the approximate values of the options over the course of multiple trials (not all are shown) and then a novel option is substituted for one of the options (option A). The novel option substitution rate (penv) affects the number of trials the agent has to learn about an option. When penv is high, it is harder for the agent to learn about the underlying values of the options. B) State space representation of the Explore-Exploit task. Each option can be represented with a separate subtree. Thus, for an example sequence of choices such as C, A, B, A, A, B, the agent progresses through one step of the tree for C, three steps for A, and two steps for B. The agent progresses to an upper branch or lower branch depending on whether the choice was rewarded. Rewards are shown for this example sequence as 0s or 1s. Thus, the agent progresses through the uppermost branches of the subtree for option A, as it was rewarded all three times it was chosen. The introduction of a novel option causes the position in the subtree for that option to reset. When the novel option is presented at the end of this sequence and replaces option A, the agent jumps back to the start for that option, as reward history no longer represents the underlying value of the novel option. C) Bar plots of average raw reward (left) and average selection of the novel option upon first appearance (right) in low- and high-uncertainty environments. On the left, average reward for the non-impulsive (black) and impulsive (red) agents at penv values of 0.04 and 0.20 is shown.
On the right, average selection of the novel option upon first appearance is shown for the same values of penv. *** indicates p<0.0001. Error bars on plots represent s.e.m. across iterations. Error bars above plots represent the standard deviation of the differences between the mean values for the non-impulsive and impulsive agents. D) Average % of maximum possible reward and average novel option choice behavior across a range of novel option substitution rates for both the non-impulsive (black) and impulsive (red) agents. On the left, the plot of average reward shows that when the novel option substitution rate (penv) is low, the non-impulsive agent collects more reward than the impulsive agent, but when penv is high (greater than 0.1), the impulsive agent collects more reward than the non-impulsive agent. The plot of novel choice behavior shows that for all novel option substitution rates tested, the non-impulsive agent selects the novel option significantly more often than the impulsive agent on the first trial it appears. Error bars above the graphs represent the standard deviation of the differences between the mean values for non-impulsive and impulsive agents. E) Average % of maximum reward and choice behavior for a range of discount factors and pagent = 0.08. F) Average % of maximum reward and choice behavior for discount factors shown in (C) and (D) with pagent = 0.04, 0.08, 0.16. Image credit: Wikimedia Commons (scene images).
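The task mechanics described in panels (A) and (B) can be illustrated with a minimal simulation sketch. It is not the authors' model: the function name `run_task`, the uniform draw of reward rates, and the greedy posterior-mean policy are all assumptions made for illustration, and the sketch does not implement the non-impulsive or impulsive agents. It shows only how a substitution rate `p_env` resets an option's position in its subtree, tracked here as a pair (times chosen, times rewarded).

```python
import random


def run_task(p_env, n_trials=1000, n_options=3, seed=0):
    """Simulate a three-option bandit with novel-option substitution.

    Hypothetical sketch: reward rates are drawn uniformly from [0, 1) and
    the policy is a placeholder, not either agent from the figure.
    """
    rng = random.Random(seed)
    reward_rates = [rng.random() for _ in range(n_options)]
    # Per-option state: [times chosen, times rewarded]. This pair plays the
    # role of the agent's position in that option's subtree.
    state = [[0, 0] for _ in range(n_options)]
    total_reward = 0
    n_substitutions = 0

    for _ in range(n_trials):
        # With probability p_env, a novel option replaces one existing option.
        if rng.random() < p_env:
            k = rng.randrange(n_options)
            reward_rates[k] = rng.random()  # new underlying reward rate
            state[k] = [0, 0]               # reward history no longer applies
            n_substitutions += 1

        # Placeholder policy (assumption): choose the option with the highest
        # posterior mean reward under a Beta(1, 1) prior.
        values = [(r + 1) / (n + 2) for n, r in state]
        choice = max(range(n_options), key=lambda i: values[i])

        reward = 1 if rng.random() < reward_rates[choice] else 0
        state[choice][0] += 1
        state[choice][1] += reward
        total_reward += reward

    return total_reward, n_substitutions, state


if __name__ == "__main__":
    for p_env in (0.04, 0.20):
        total, subs, _ = run_task(p_env)
        print(f"p_env={p_env}: reward={total}, substitutions={subs}")
```

A higher `p_env` leaves fewer trials between resets, so each option's state stays close to the root of its subtree and the agent has less evidence about its value.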