Fig 1.

Environment design.

(a) The two-dimensional gridworld environment used in Experiment 1. (b) To study the properties of the optimal reward, we made several modifications to the gridworld environment. Top row: In the one-time learning environment, the agent could choose to stay in the food location indefinitely after reaching it. In the lifetime learning environment, the agent was teleported to a random location in the gridworld as soon as it reached the food state. Middle row: In the stationary environment, the food remained in the same location throughout the agent’s lifetime. In the non-stationary environment, the food changed its location during the agent’s lifetime. Bottom row: We used a gridworld of size 7 × 7 to simulate a dense reward setting. To simulate a sparse reward setting, we increased the size of the gridworld to 13 × 13.

Table 1.

Categories of reward functions.

The reward functions we consider can be classified into seven categories. First is ‘Objective only’, where the reward function depends only on the first component, w1. Similarly, ‘Expect only’ is the function that depends only on the second component, w2, and ‘Compare only’ is the function that depends solely on the third component, w3. Then, we have the functions that are a combination of two components—‘Objective+Expect’, ‘Objective+Compare’, and ‘Expect+Compare’. Finally, we have the reward function, ‘All’, that depends on all three components.
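
The seven categories are the non-empty subsets of the three components. A minimal sketch that enumerates them is given below; the function name and the boolean-mask representation are illustrative assumptions, and the caption only states which of w1, w2, and w3 each category depends on.

```python
from itertools import combinations

# Component names as used in the caption; position i corresponds to weight w(i+1).
COMPONENTS = ("Objective", "Expect", "Compare")


def reward_categories():
    """Map each category name to a mask saying which of (w1, w2, w3) it depends on.

    Hypothetical helper: the naming follows Table 1, the mask layout is an assumption.
    """
    categories = {}
    n = len(COMPONENTS)
    for size in range(1, n + 1):
        for subset in combinations(range(n), size):
            if size == 1:
                name = f"{COMPONENTS[subset[0]]} only"
            elif size == n:
                name = "All"
            else:
                name = "+".join(COMPONENTS[i] for i in subset)
            categories[name] = tuple(i in subset for i in range(n))
    return categories


if __name__ == "__main__":
    for name, mask in reward_categories().items():
        print(name, mask)
```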

Fig 2.

Comparison improves learning in simple dense, stationary environments.

(a) Mean cumulative objective reward attained by the different agents in a distribution of 7 × 7 stationary environments requiring one-time learning (lifetime = 2500 steps). Here, relative comparisons significantly improve performance and the ‘Compare only’ agent obtains the highest cumulative objective reward. (b) Average visit counts of the ‘Objective only’, ‘Expect only’, and ‘Compare only’ agents (darker color represents higher visit counts). Compared to the ‘Objective only’ and the ‘Expect only’ agents, the ‘Compare only’ agent spends very little time visiting the non-food states in the world. (c) Simulation of the agents’ behavior in a simple 4-state environment. Because of its aspiration level, the ‘Compare only’ agent assigns a negative value to any non-food state it visits, which encourages it to visit novel states in the environment. This allows the agent to find the food location very quickly. The ‘Objective only’ and ‘Expect only’ agents primarily rely on random exploration and find the food location more slowly (α = 0.1 for all agents). (d) Mean cumulative objective reward attained by the different agents in a distribution of 7 × 7 stationary environments requiring lifetime learning (lifetime = 12500 steps). The ‘Compare only’ agent again obtains the highest cumulative objective reward. (e) Time course of the cumulative objective reward attained by the different agents in the 7 × 7 environment requiring lifetime learning (left) and the time course of the cumulative subjective reward experienced by the different agents (right). (f) Left: Mean cumulative objective reward attained by the ‘Compare only’ agent as a function of its aspiration level in the lifetime learning environment. The performance of the agent drops if the aspiration level is set too high or too low (the optimal aspiration level is marked in yellow). Right: Mean cumulative subjective reward of the agent as a function of its aspiration level (the agent with the optimal aspiration level is marked in yellow).
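
The effect described in panel (c) can be illustrated with a single value update. The sketch below assumes a tabular Q-learning update and a ‘Compare only’ reward of the form objective reward minus aspiration level; neither the exact reward function, the discount factor, nor the aspiration value of 0.5 is given in the caption, so all three are assumptions. Only α = 0.1 comes from the caption.

```python
ALPHA = 0.1  # learning rate stated in the caption
GAMMA = 0.9  # hypothetical discount factor (not stated in the caption)


def subjective_reward(objective_reward, aspiration):
    """Assumed 'Compare only' reward: objective reward minus an aspiration level."""
    return objective_reward - aspiration


def q_update(q, reward, best_next_q):
    """One tabular Q-learning update for a single state-action pair."""
    return q + ALPHA * (reward + GAMMA * best_next_q - q)


# First visit to a non-food state (objective reward 0); all estimates start at 0.
q_objective = q_update(0.0, subjective_reward(0.0, aspiration=0.0), best_next_q=0.0)
q_compare = q_update(0.0, subjective_reward(0.0, aspiration=0.5), best_next_q=0.0)

if __name__ == "__main__":
    print("no aspiration:", q_objective)
    print("with aspiration:", q_compare)
```

Under these assumptions the estimate stays at zero without an aspiration level, so visited and unvisited states look identical to a greedy agent. With a positive aspiration level the visited state drops below zero, and any unvisited state (still at zero) becomes the greedy choice.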

Fig 3.

Prior expectation and comparison make an agent robust to changes in the environment.

(a) Mean cumulative objective reward attained by the agents in a distribution of 7 × 7 non-stationary environments (lifetime = 5000 steps). Both prior expectation and relative comparisons are helpful in dealing with non-stationarity. (b) Agents’ behavior in a simple 4-state non-stationary environment. By step = 50, the ‘Objective only’ agent assigns a considerably higher value to the food state than the ‘Compare only’ and ‘Expect+Compare’ agents do. At step = 51, when the food changes its location, the ‘Objective only’ agent receives a subjective reward of 0 at state s4 and takes a long time to lower the value of this state. Even by step = 100, it has not discovered the new food location. In contrast, after the food changes location, the ‘Compare only’ and ‘Expect+Compare’ agents receive large negative subjective rewards at state s4, which reduces their value estimate of s4 very quickly. This encourages them to visit other states and enables them to discover the new food location very quickly. (c) Graph showing how the value of the food state changes as a function of the visit counts for the different agents. While the state value converges for all three agents, the ‘Objective only’ agent ends up assigning a very high value to the food state because it receives a subjective reward = 1 at each visit. The ‘Compare only’ and ‘Expect only’ agents receive lower subjective rewards and hence the converged state value is considerably lower for these agents. (d) Average reward rate of the various agents during their lifetime on the 7 × 7 gridworld environment (the food changes its location every 1250 steps). (e) Left: Mean cumulative objective reward attained by the ‘Expect only’ agent as a function of the w2 value in the non-stationary environment. The performance of the agent drops if the weight is too high or too low (the optimal w2 value is marked in yellow). Right: Mean cumulative subjective reward of the ‘Expect only’ agent as a function of the w2 value.
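
The convergence in panel (c) and the slow unlearning in panel (b) can be reproduced in a simplified setting. The sketch below assumes the agent revisits one state repeatedly and receives the same subjective reward each time (a self-loop), with a TD(0) update. The learning rate, discount factor, and the reward values of 0.5 and −0.5 are hypothetical and are not taken from the figure.

```python
def repeated_visits(value, reward, n_visits, alpha=0.1, gamma=0.9):
    """Apply n_visits self-loop TD(0) updates: V <- V + alpha * (reward + gamma * V - V)."""
    for _ in range(n_visits):
        value += alpha * (reward + gamma * value - value)
    return value


def fixed_point(reward, gamma=0.9):
    """Value the update converges to when the reward is constant: reward / (1 - gamma)."""
    return reward / (1.0 - gamma)


# Before the food moves: a subjective reward of 1 versus a lower (assumed) one of 0.5.
v_objective = repeated_visits(0.0, 1.0, 5000)   # approaches fixed_point(1.0)
v_compare = repeated_visits(0.0, 0.5, 5000)     # approaches fixed_point(0.5)

# After the food moves: reward 0 for the first agent, an assumed -0.5 for the second.
v_objective_after = repeated_visits(v_objective, 0.0, 50)
v_compare_after = repeated_visits(v_compare, -0.5, 50)

if __name__ == "__main__":
    print(v_objective, v_compare)
    print(v_objective_after, v_compare_after)
```

In this simplified setting, the agent with the larger subjective reward converges to a larger value, and a reward of 0 after the change only lets that value decay slowly toward zero. A negative subjective reward pulls the estimate down from a lower starting point and toward a negative target, so it falls below the other agent's estimate after the same number of visits.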

Fig 4.

Relative comparisons can lead to undesirable behavior in sparsely rewarded environments.

(a) Mean cumulative objective reward attained by the agents in a distribution of 13 × 13 stationary environments requiring lifetime learning (lifetime = 12500 steps). While the ‘Compare only’ agent performs relatively well, it is significantly outperformed by the ‘Expect+Compare’ and the ‘All’ agents. (b) Visualization of the visit counts and the learnt policy of the ‘Compare only’ agent for states near the food state. The agent does not visit the food state as often as it visits some of the nearby non-rewarding states (highlighted in yellow). The agent’s learnt policy suggests that it has developed a form of aversion to the food state, as it takes a needlessly long route to reach the food state. (c) Graph showing how the value of the starting state (which provides an objective reward of 0) changes as a function of the visit counts for the ‘Compare only’ and ‘Expect+Compare’ agents. As the ‘Compare only’ agent keeps re-visiting the starting state, it keeps assigning a lower value to this state (due to its aspiration level). On the other hand, due to prior expectations, the ‘Expect+Compare’ agent prevents this value from becoming too negative. (d) Development and prevention of aversion in the simple 4-state environment (the agent is teleported to s1 after reaching the food state). Each interaction shows the agent’s current estimate of the best action to take at each state and its estimated Q-value of taking that action at that state. Here, the aspiration level of the agents is deliberately set to be very high. The ‘Compare only’ agent develops an aversion to the food state (at steps 60 and 80) whereas the ‘Expect+Compare’ agent does not exhibit any aversion behavior. (e) Visualization of the visit counts of the ‘Compare only’ and the ‘Expect+Compare’ agents (darker shade represents greater visit counts). At the 6000th timestep, the visit counts of the two agents are comparable. At the 8000th timestep, the ‘Compare only’ agent develops aversions and visits states near the food state more often than it visits the food state.

Fig 5.

Results of the multi-armed bandit experiments.

(a) 10-armed bandit simulation where the means of the 9 sub-optimal arms are drawn from a uniform distribution on the range [−1, 0.9]. The graph plots how frequently the agents select the best arm in their lifetimes. The ‘Fixed compare’ agent learns faster than the ‘Objective’ agent and selects the optimal action at a higher rate, especially early during training. The ‘Dynamic compare’ agent selects the optimal action at a higher rate throughout its lifetime compared to these two agents. (b) Bandit task where the arms are very close to each other. Here, the comparison-based agents and the ‘Objective only’ agent select the optimal action at a similar rate throughout their lifetime (and the UCB agent selects the optimal action at a higher rate). (c) Plot of the average subjective reward of the agents in the previous bandit task. Compared to the ‘Objective only’ and the UCB agents, the comparison-based agents experience lower subjective rewards (due to their aspiration level). This seems needless since comparisons do not help the agents make better choices. (d) Non-stationary bandit task where the reward distribution changes abruptly during the agent’s lifetime. Compared to the ‘Objective only’ agent, the comparison-based agents select the optimal action at a higher frequency, especially after step = 2500, i.e., when the environment changes. (e) Non-stationary bandit task where the reward distribution changes constantly during the agent’s lifetime. Early during training, the ‘Fixed compare’ agent selects the optimal action at a relatively good rate but it is then comfortably outperformed by the other agents. The rising aspirations of the ‘Dynamic compare’ agent allow it to adapt to the changes in the environment and it selects the optimal action at a very high rate throughout its lifetime. (f) Despite accumulating high objective rewards, the subjective rewards experienced by the ‘Dynamic compare’ agent keep decreasing due to its constantly increasing aspiration.
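
To make the setup in panel (a) concrete, the sketch below simulates a 10-armed bandit with a greedy agent whose subjective reward is the objective reward minus a fixed aspiration level. Only the number of arms and the range [−1, 0.9] for the sub-optimal means come from the caption. The best-arm mean of 1.0, the reward form, the greedy policy, the step size, and the noise model are assumptions, and the function `run_bandit` is hypothetical rather than the agents used in the experiments.

```python
import random


def run_bandit(aspiration, steps=1000, n_arms=10, alpha=0.1, noise_std=1.0, seed=0):
    """Return the fraction of steps on which a greedy agent pulls the best arm.

    Subjective reward is assumed to be (objective reward - aspiration).
    """
    rng = random.Random(seed)
    # Sub-optimal means follow the caption; the best-arm mean of 1.0 is an assumption.
    means = [rng.uniform(-1.0, 0.9) for _ in range(n_arms - 1)] + [1.0]
    best = n_arms - 1
    q = [0.0] * n_arms  # value estimates start at zero
    best_picks = 0
    for _ in range(steps):
        # Greedy choice; ties go to the lowest arm index.
        arm = max(range(n_arms), key=lambda a: q[a])
        reward = means[arm] + noise_std * rng.gauss(0.0, 1.0)
        q[arm] += alpha * ((reward - aspiration) - q[arm])
        best_picks += int(arm == best)
    return best_picks / steps


if __name__ == "__main__":
    print("aspiration 0.0:", run_bandit(aspiration=0.0))
    print("aspiration 1.0:", run_bandit(aspiration=1.0))
```

In this sketch, subtracting a fixed aspiration from every reward while starting the estimates at zero is equivalent to starting the estimates optimistically at the aspiration level, which is one way to see why a purely greedy agent ends up trying untested arms. With `noise_std=0.0` and an aspiration of 1.0, every sub-optimal arm is abandoned after a single pull.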
