Fig 1.
Agent and Environment Interactions in Reinforcement Learning (RL) under the Markov Decision Process (MDP) framework.
A schematic of how an agent interacts with the environment and learns to maximize rewards in an MDP framework. An agent selects actions, At, which lead to changes in state, St, and rewards, Rt, where t indicates the trial or time point. The agent’s internal model of the environment and its weighting of future rewards, or discount factor, γ, affect the actions taken. The stability of the environment is captured by the transition probabilities to future states, p(St+1|St, At), as well as the probability of receiving reward, p(Rt); these also affect reward outcomes. An example reward distribution tree for a binomial bandit is shown on the left for one option in a choice task. As an agent selects the option that gives a probabilistic reward, it traverses the tree based on outcomes. Each node in the tree represents a choice point where that option was chosen. The node shape and shading indicate whether a node represents a unique state. Circular nodes are unique; other shapes or shading indicate duplicate states that can be reached by multiple choice paths. Although MDPs are independent of time and history, these factors often affect decision-making behavior. Each upper branch in the state space tree represents a rewarded choice, and each lower branch represents an unrewarded choice. Thus, the numbers at each node give the number of rewards received and the number of times the option has been chosen, which together determine the posterior over the option’s reward probability. Traversal through this tree leads to the accumulation of evidence for whether an option is highly rewarding or not rewarding, which in turn affects the agent’s future actions. Image credit: Wikimedia Commons (bust image); Openclipart.org (map image).
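To make the loop in Fig 1 concrete, the sketch below is a minimal, illustrative agent-environment interaction on a binomial bandit, where each option's state is its running (rewards, choices) tally, i.e. a node in the tree shown on the left of the figure. The function name, the greedy choice rule, and the reward probabilities are assumptions for illustration, not the model used in the paper; the discount factor γ only comes into play once future states are explicitly evaluated, as in the tasks below.

```python
import random

def run_bandit(reward_probs, n_trials=100, seed=0):
    """Greedy agent on a binomial bandit. Each option's state is its
    (rewards, choices) tally, i.e. a node in the Fig 1 tree."""
    rng = random.Random(seed)
    counts = {a: [0, 0] for a in reward_probs}          # [rewards, choices]
    total_reward = 0.0
    for t in range(n_trials):
        # Value estimate: posterior mean reward under a uniform Beta prior.
        values = {a: (r + 1) / (c + 2) for a, (r, c) in counts.items()}
        action = max(values, key=values.get)             # A_t
        rewarded = rng.random() < reward_probs[action]   # R_t ~ p(R_t)
        counts[action][0] += int(rewarded)               # upper branch if rewarded
        counts[action][1] += 1                           # S_{t+1}: updated tally
        total_reward += rewarded
    return total_reward, counts

print(run_bandit({"A": 0.7, "B": 0.4}))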
Fig 2.
Temporal Discounting task and performance of impulsive and non-impulsive agents in different task environments.
A) Task schematic for the Temporal Discounting task. Participants or agents are given a series of questions with two offers, one for a small immediate reward and the other for a larger, delayed reward. B) The state space tree for one pair of options in the task. The agent starts on the far left with a choice between the immediate reward and the delayed reward. If the immediate reward is chosen, the agent proceeds along the upper branch to the immediate reward state (sIR) and always collects the immediate reward. If the agent chooses the delayed reward, the agent proceeds along the lower branch towards the delayed reward state (sDR). Along this branch is a sequence of intermediate transition states (sb), through which the agent progresses with probability δ. At each transition state, the agent may instead move to a terminal, non-rewarding state (sa) with probability 1-δ. The number of transition states is defined by the delay to the larger reward. C) Average reward collected and choice behavior across simulated trials in certain and uncertain task environments for impulsive and non-impulsive agents in the Temporal Discounting task. “High certainty” is when δenv>δagent and “low certainty” is when δenv<δagent. The non-impulsive agent (black) has a discount factor of γ = 0.99 and the impulsive agent (red) has a discount factor of γ = 0.6. Left: Average reward collected for the two agents. Right: Average proportion of trials in which an agent selected the larger, delayed option. Error bars are s.e.m. across 10 iterations of 100 trials using variable reward sizes and delays. *** indicates p<0.0001, paired t-test. D) Difference in average reward across a range of δenv and δagent values. The heatmap shows domains where the non-impulsive agent performs better (more blue), where the impulsive agent performs better (more red), and where there are only marginal differences between the two agents (white). The value shown in each box of the heatmap is the difference in average reward between the two agents. The white boxes indicate the task regimes shown in Fig 2C. See S1 Fig for other discount factors. Image credit: Openclipart.org (coins image, money image).
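Under one plausible reading of the state space in (B), the agent's value for the delayed option collapses to a closed form: it must survive D transition states, each reached with probability δ and discounted by γ, so it chooses the delayed option only when roughly (γ·δagent)^D·Rdelayed exceeds Rimmediate, while the reward actually collected depends on δenv. The Python sketch below illustrates this with hypothetical reward sizes and delays; it is not the paper's code.

```python
import random

def temporal_discounting_trial(r_small, r_large, delay,
                               gamma_agent, delta_agent, delta_env, rng):
    """One Temporal Discounting trial under the state space sketched in (B)."""
    # Subjective value of the delayed option: survive `delay` transition states,
    # each reached with probability delta_agent and discounted by gamma_agent.
    v_delayed = (gamma_agent * delta_agent) ** delay * r_large
    if v_delayed <= r_small:
        return r_small                        # take the small immediate reward
    for _ in range(delay):                    # attempt the delayed branch
        if rng.random() >= delta_env:         # fell into the terminal state s_a
            return 0.0
    return r_large

rng = random.Random(0)
for gamma in (0.6, 0.99):                     # impulsive vs. non-impulsive agent
    avg = sum(temporal_discounting_trial(5.0, 20.0, 10, gamma, 0.9, 0.95, rng)
              for _ in range(1000)) / 1000
    print(f"gamma = {gamma}: average reward = {avg:.2f}")
```

With these illustrative numbers, the impulsive agent always takes the small immediate reward, while the non-impulsive agent attempts the delayed branch and, on average, collects more when δenv is high.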
Fig 3.
Information Sampling (Beads) Task and examples of agent performance in high and low certainty environments.
A) Task schematic for the Beads task. In this task, the objective is to correctly guess the majority color of beads (e.g. orange or blue) in the urn. The participant or agent has the option to draw one bead at a time (for a cost, e.g. $0.10) to accumulate evidence. The agent’s goal is to accumulate sufficient evidence to make a confident guess without incurring the maximum draw cost. Once the maximum number of draws is reached, the agent is forced to guess a color. An agent receives a reward for a correct guess (e.g. $10) or a cost for an incorrect guess (e.g. -$12). B) The state space tree for the Beads task up to 3 draws. Each node represents the number of orange and blue beads that have been drawn thus far and a decision point, where the agent can either draw again, guess orange, or guess blue. If the agent draws another bead, it stochastically transitions to the next state according to a binomial probability. At the start of the tree, the variance of the distribution over the majority probability is highest, and it decreases as the number of draws increases. Note that states with the same number of orange and blue beads after 3 draws are the same state; we draw repeated states as separate nodes for clarity. Repeated states are indicated by the shape of the node: circular nodes are unique, and nodes of other shapes indicate repeated states. C) Two example bead draw sequences in certain and uncertain task environments and the behavior of impulsive and non-impulsive agents. On the left, a sequence of 20 draws is shown from a set of task parameters that creates an environment with high certainty that the majority color is orange (qagent = 0.55, qenv = 0.7, Cdraw = 0.10, Rcorrect = 10, Rincorrect = -12). The plot shows the action values for guessing orange and guessing blue, which are identical for both agents. The plot also shows the corresponding action values for drawing a bead for the non-impulsive agent (black) and the impulsive agent (red). Because the agents always select the action with the largest value on each time step, an agent only guesses a color when the action value for guessing blue or orange surpasses the action value for drawing another bead. In the case on the left, the non-impulsive agent guesses orange correctly after 11 draws (black arrow), whereas the impulsive agent guesses blue incorrectly after the first draw (green arrow). In the uncertain case (right), the task parameters create an environment with low certainty about the majority color (qagent = 0.55, qenv = 0.54, Cdraw = 0.10, Rcorrect = 10, Rincorrect = -12). The same traces for the action values are shown. The non-impulsive agent draws until it is forced to guess and incurs the maximum draw cost (black arrow). The impulsive agent guesses correctly after 5 draws (green arrow). Below each plot of action values are the corresponding truncated state space trees, showing traversal through the state space for the example bead sequences. Only the top half of each state space tree is expanded, through the first 10 bead draws.
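The action value traces in (C) can be reproduced in outline by backward induction (dynamic programming) over belief states defined by the bead counts. The sketch below is a minimal Python illustration under assumed parameter names and defaults (qagent, γ, Cdraw, Rcorrect, Rincorrect, and a 20-draw horizon); it is not the paper's implementation.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def beads_action_values(n_orange, n_blue, gamma=0.99, q_agent=0.55,
                        c_draw=0.10, r_correct=10.0, r_incorrect=-12.0,
                        max_draws=20):
    """Return (Q_guess_orange, Q_guess_blue, Q_draw) for one belief state."""
    # Posterior that the majority color is orange (uniform prior over the
    # two majority hypotheses).
    like_orange = q_agent ** n_orange * (1 - q_agent) ** n_blue
    like_blue = q_agent ** n_blue * (1 - q_agent) ** n_orange
    p_orange = like_orange / (like_orange + like_blue)

    q_guess_orange = p_orange * r_correct + (1 - p_orange) * r_incorrect
    q_guess_blue = (1 - p_orange) * r_correct + p_orange * r_incorrect

    if n_orange + n_blue >= max_draws:            # forced to guess at the horizon
        return q_guess_orange, q_guess_blue, float("-inf")

    # Predictive probability that the next bead drawn is orange.
    p_next_orange = p_orange * q_agent + (1 - p_orange) * (1 - q_agent)

    def best(n_o, n_b):
        return max(beads_action_values(n_o, n_b, gamma, q_agent, c_draw,
                                       r_correct, r_incorrect, max_draws))

    # Drawing pays the cost now and discounts the best future value by gamma,
    # so a smaller gamma (impulsive agent) lowers Q_draw relative to guessing.
    q_draw = -c_draw + gamma * (p_next_orange * best(n_orange + 1, n_blue)
                                + (1 - p_next_orange) * best(n_orange, n_blue + 1))
    return q_guess_orange, q_guess_blue, q_draw
```

Evaluating the same belief state with γ = 0.99 versus γ = 0.55 shows how a lower discount factor shrinks the value of drawing relative to the value of guessing, which is what produces the earlier guesses in (C).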
Fig 4.
Performance and choice behavior of impulsive vs. non-impulsive agents in the Information Sampling (Beads) Task.
A) Left: Average reward collected across simulated trials in certain and uncertain task environments for impulsive and non-impulsive agents in the Beads task. The non-impulsive agent (black) has a discount factor of γ = 0.99 and the impulsive agent (red) has a discount factor of γ = 0.55. Error bars are s.e.m. across 100 iterations of 100 trials using bead sequences from two different sets of task parameters (qenv = 0.75 for the certain environment, qenv = 0.54 for the uncertain environment, qagent = 0.55, Cdraw = 0.1). Right: Average number of bead draws before guessing a color for each model in each task environment. In both task environments, the impulsive agent (red) draws a similar number of beads, but significantly fewer than the non-impulsive agent (black). *** indicates p<0.0001, ** indicates p<0.001. B) Model performance across a range of parameter values. Each panel is a heatmap showing the difference in average reward for a pair of non-impulsive and impulsive agents, indicated by the discount factors on the far left. Each column contains heatmaps for one value of the expected majority fraction of beads, qagent; each row contains heatmaps for one pair of discount factors (impulsive and non-impulsive). The x-axis of each heatmap is the draw cost and the y-axis is the difference between the model input qagent and the majority fraction used to generate the bead draws, qenv. The color of each heatmap cell indicates whether the impulsive or non-impulsive agent collected more reward: more blue values indicate the non-impulsive agent collected more average reward, and more red values indicate the impulsive agent collected more reward. As qagent increases (left to right), the domain in which the non-impulsive agent performs better expands. The white boxes in the heatmap in the top left panel highlight the data used to create the bar plots in Fig 4A (left). All heatmaps were generated using Rcorrect = 10, Rincorrect = -12. See S1 Fig for (B) with Rcorrect = 10, Rincorrect = -10.
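As a rough illustration of how a single cell of the heatmaps in (B) could be estimated, the sketch below builds on the hypothetical beads_action_values function from the Fig 3 sketch: bead sequences are generated with the environment's qenv (the majority color is orange by construction), each agent acts greedily on action values computed with its own discount factor and the assumed qagent, and the mean rewards are differenced. Parameter values are illustrative, not the paper's.

```python
import random

def simulate_beads_trial(gamma, q_env=0.75, c_draw=0.10,
                         r_correct=10.0, r_incorrect=-12.0):
    """Run one beads trial with a greedy policy over beads_action_values
    (defined in the Fig 3 sketch); the majority color is orange."""
    n_orange = n_blue = 0
    while True:
        g_orange, g_blue, draw = beads_action_values(n_orange, n_blue, gamma)
        if max(g_orange, g_blue) >= draw:          # guess once it beats drawing
            reward = r_correct if g_orange >= g_blue else r_incorrect
            return reward - c_draw * (n_orange + n_blue)
        if random.random() < q_env:                # next bead is orange
            n_orange += 1
        else:
            n_blue += 1

random.seed(1)
mean_patient = sum(simulate_beads_trial(0.99) for _ in range(1000)) / 1000
mean_impulsive = sum(simulate_beads_trial(0.55) for _ in range(1000)) / 1000
print(f"non-impulsive minus impulsive average reward: "
      f"{mean_patient - mean_impulsive:.2f}")
```

Repeating this for a grid of draw costs and qenv - qagent values would populate one heatmap in (B).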
Fig 5.
Explore-Exploit bandit task with novelty.
A) Example sequence of trials in the Explore-Exploit task. In this example, each option (A, B, C) is a picture with an underlying reward rate. The agent or participant must learn the values of the three options through experience, by choosing the options and receiving or not receiving a reward on each trial. In this example, the agent has learned the approximate values of the options over the course of multiple trials (not all are shown), and then a novel option is substituted for one of the options (option A). The novel option substitution rate (penv) affects the number of trials the agent has to learn about an option. When penv is high, it is harder for the agent to learn the underlying values of the options. B) State space representation of the Explore-Exploit task. Each option can be represented with a separate subtree. Thus, for an example sequence of choices such as C, A, B, A, A, B, the agent progresses through one step of the tree for C, three steps for A, and two steps for B. The agent progresses to an upper or lower branch depending on whether the choice was rewarded. Rewards are shown for this example sequence as 0s or 1s. Thus, the agent progresses through the uppermost branches of the subtree for option A, as it was rewarded all three times it was chosen. The introduction of a novel option causes the position in the subtree for that option to reset. When the novel option is presented at the end of this sequence and replaces option A, the agent jumps back to the start of that option’s subtree, as the reward history no longer represents the underlying value of the novel option. C) Bar plots of average raw reward (left) and average selection of the novel option upon first appearance (right) in low and high certainty environments. On the left, average reward for the non-impulsive (black) and impulsive (red) agents at penv values of 0.04 and 0.20 is shown. On the right, average selection of the novel option upon first appearance is shown for the same values of penv. *** indicates p<0.0001. Error bars on the plots are s.e.m. across iterations. Error bars above the plots represent the standard deviation of the differences between the mean values for the non-impulsive and impulsive agents. D) Average % of maximum possible reward and average novel option choice behavior across a range of novel option substitution rates for the non-impulsive (black) and impulsive (red) agents. On the left, the plot of average reward shows that when the novel option substitution rate (penv) is low, the non-impulsive agent collects more reward than the impulsive agent, but when penv is high (greater than 0.1), the impulsive agent collects more reward than the non-impulsive agent. The plot of novel choice behavior shows that, for all novel option substitution rates tested, the non-impulsive agent selects the novel option significantly more often than the impulsive agent on the first trial on which it appears. Error bars above the graphs represent the standard deviation of the differences between the mean values for the non-impulsive and impulsive agents. E) Average % of maximum reward and choice behavior for a range of discount factors with pagent = 0.08. F) Average % of maximum reward and choice behavior for the discount factors shown in (C) and (D) with pagent = 0.04, 0.08, 0.16. Image credit: Wikimedia Commons (scene images).
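A minimal sketch of the environment dynamics described in (A) and (B) is given below: three options with latent reward rates, a per-trial probability penv of replacing one option with a novel one, and a reset of that option's (rewards, choices) tally when this happens. The greedy chooser and all parameter values are placeholders for illustration; the paper's agents instead evaluate the option subtrees forward, discounting future reward by γ.

```python
import random

def run_novelty_bandit(n_trials=300, p_env=0.08, seed=0):
    """Sketch of the Explore-Exploit task with novel option substitution."""
    rng = random.Random(seed)
    rates = [rng.random() for _ in range(3)]      # latent reward rate per option
    tally = [[0, 0] for _ in range(3)]            # [rewards, choices] per option
    total_reward = 0.0
    for t in range(n_trials):
        if rng.random() < p_env:                  # novel option substitution
            i = rng.randrange(3)
            rates[i] = rng.random()
            tally[i] = [0, 0]                     # jump back to that subtree's root
        # Placeholder greedy choice on posterior mean reward (uniform Beta prior).
        values = [(r + 1) / (c + 2) for r, c in tally]
        choice = max(range(3), key=lambda j: values[j])
        rewarded = rng.random() < rates[choice]
        tally[choice][0] += int(rewarded)         # upper branch if rewarded
        tally[choice][1] += 1
        total_reward += rewarded
    return total_reward / n_trials

print(run_novelty_bandit(p_env=0.04), run_novelty_bandit(p_env=0.20))
```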