Emergence of belief-like representations through reinforcement learning

doi:10.1371/journal.pcbi.1011067

Fig 1.

Associative learning tasks with probabilistic rewards and hidden states.

A. Trial structure in Starkweather et al. (2017) [7]. Each trial consisted of a variable delay (the intertrial interval, or ITI), followed by an odor, a second delay (the interstimulus interval, or ISI), and a potential subsequent reward. Reward times were sampled from a discretized Gaussian ranging from 1.2 to 2.8 s (see Materials and methods). B-C. In both versions of the task, there were two underlying states: the ITI and the ISI. In Task 1, every trial was rewarded. As a result, an odor always indicated a transition from the ITI to the ISI, while a reward always indicated a transition from the ISI to the ITI. In Task 2, rewards were omitted on 10% of trials; as a result, an odor did not reveal whether the state had transitioned to the ISI.
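
The trial structure in panel A can be sketched as a small sampler. This is a minimal illustration and not the authors' code. The 0.2 s spacing is inferred from the nine reward times mentioned in Fig 3, and the Gaussian mean and standard deviation (2.0 s and 0.5 s), along with the name `sample_trial`, are assumptions made only for this example.

```python
import numpy as np

# Nine candidate reward times spanning 1.2-2.8 s (0.2 s spacing is inferred
# from the nine reward times mentioned in Fig 3).
REWARD_TIMES = np.linspace(1.2, 2.8, 9)

def sample_trial(rng, p_omission=0.0, mean=2.0, sd=0.5):
    """Return the ISI duration in seconds, or None for an omission trial.

    mean and sd are illustrative assumptions, not values from the paper.
    """
    if rng.random() < p_omission:
        return None  # odor presented, but no reward follows
    # Discretized Gaussian: evaluate the density on the grid and normalize.
    weights = np.exp(-0.5 * ((REWARD_TIMES - mean) / sd) ** 2)
    probs = weights / weights.sum()
    return float(rng.choice(REWARD_TIMES, p=probs))

rng = np.random.default_rng(0)
task1 = [sample_trial(rng, p_omission=0.0) for _ in range(1000)]  # always rewarded
task2 = [sample_trial(rng, p_omission=0.1) for _ in range(1000)]  # 10% omissions
```

In Task 2 the `None` trials are the ones where an odor is observed but the hidden state never leaves the ITI.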

Fig 2.

Observations, model representations, value estimates, and reward prediction errors (RPEs) during Task 2.

A. State transitions and observation probabilities in Task 2. Each macro-state (ISI or ITI) is composed of micro-states denoting elapsed time; this allows for probabilistic reward times and minimum dwell times in the ISI and ITI, respectively. B. Observations emitted by Task 2 during two example trials. Note that omission trials are indicated only implicitly as the absence of a reward observation. C. Example representations (b_t, z_t) and value estimates of two models (Belief model, left; Value RNN, right) for estimating value in partially observable environments, after training. D. After training, both models exhibit similar RPEs.
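
For readers unfamiliar with the quantities in panels C-D, the following is a minimal sketch of a standard hidden Markov model belief update and a temporal-difference RPE. The toy two-state transition and observation matrices, the discount factor, and the function names are illustrative assumptions; they are not the task's actual micro-state model.

```python
import numpy as np

def update_belief(b, T, O, obs):
    """One step of Bayesian filtering.

    b   : (S,) current belief over hidden states
    T   : (S, S) transition matrix, T[i, j] = P(s'=j | s=i)
    O   : (S, K) observation matrix, O[j, k] = P(o=k | s'=j)
    obs : index of the observation just received
    """
    unnorm = O[:, obs] * (T.T @ b)  # predict, then weight by the likelihood
    return unnorm / unnorm.sum()

def td_rpe(r, v_now, v_next, gamma=0.9):
    """Temporal-difference reward prediction error (gamma is assumed)."""
    return r + gamma * v_next - v_now

# Toy numbers (assumptions): states are (ITI, ISI), observations are
# (nothing, odor). They are chosen so that an odor is ambiguous evidence.
T = np.array([[0.8, 0.2],
              [0.3, 0.7]])
O = np.array([[0.9, 0.1],
              [0.0, 1.0]])
b0 = np.array([1.0, 0.0])            # certain of being in the ITI
b1 = update_belief(b0, T, O, obs=1)  # belief after observing an odor
```

With these toy numbers the belief after an odor places most, but not all, of its mass on the ISI, which mirrors the ambiguity that omission trials introduce in Task 2.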

Fig 3.

RPEs of the Value RNN resemble both mouse dopamine activity and the Belief model.

A. Average phasic dopamine activity in the ventral tegmental area (VTA) recorded from mice trained in each task separately. Black traces indicate trial-averaged RPEs relative to an odor observed at time 0, prior to reward; colored traces indicate the RPEs following each of nine possible reward times. RPEs exhibit opposite dependence on reward time across tasks. Reproduced from Starkweather et al. (2017) [7]. B-C. Average RPEs of the Belief model and an example Value RNN, respectively. Same conventions as panel A. D. Mean squared error (MSE) between the RPEs of the Value RNN and Belief model, before and after training each Value RNN. Small dots depict the MSE of each of N = 12 Task 1 RNNs and N = 12 Task 2 RNNs, and circles depict the median across RNNs.

Fig 4.

Value RNN activity readout was correlated with beliefs and could be used to decode hidden states.

A. Example observations, states, beliefs, and Value RNN activity from the same Task 2 trials shown in Fig 2. States and beliefs are colored as in Fig 2, with black indicating ITI microstates, and other colors indicating ISI microstates. Note that the states following the second odor observation remain in the ITI (black) because the second trial is an omission trial. Bottom traces depict the linear transformation of the RNN activity that comes closest to matching the beliefs. Total variance explained (R²) is calculated on held-out trials. B. Total variance of beliefs explained (R²), on held-out trials, using different trained and untrained Value RNNs, in both tasks. Same conventions as Fig 3D. C. In purple, the cross-validated log-likelihood of linear decoders trained to estimate true states using RNN activity. Same conventions as Fig 3D. Black circle indicates the log-likelihood when using the beliefs as the decoded state estimate (i.e., no decoder is “trained”).
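
The linear readout in panels A-B can be sketched as an ordinary least-squares fit evaluated on held-out data. This is an illustration under assumptions: the pooled definition of total R² used here, the bias term, the train/test split, and the synthetic arrays standing in for RNN activity and beliefs are not taken from the paper.

```python
import numpy as np

def fit_linear_readout(Z, B):
    """Least-squares affine map from activity Z (T, N) to beliefs B (T, S)."""
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])  # append a bias column
    W, *_ = np.linalg.lstsq(Z1, B, rcond=None)
    return W

def total_r2(Z, B, W):
    """Variance explained, pooled over all belief dimensions (assumed form)."""
    Z1 = np.hstack([Z, np.ones((Z.shape[0], 1))])
    resid = B - Z1 @ W
    total = B - B.mean(axis=0, keepdims=True)
    return 1.0 - (resid ** 2).sum() / (total ** 2).sum()

# Synthetic stand-ins for RNN activity and beliefs (not real data).
rng = np.random.default_rng(0)
Z = rng.normal(size=(400, 10))
B = Z @ rng.normal(size=(10, 3)) + 0.1 * rng.normal(size=(400, 3))

train, test = slice(0, 300), slice(300, 400)
W = fit_linear_readout(Z[train], B[train])  # fit on training time steps
r2_heldout = total_r2(Z[test], B[test], W)  # evaluate on held-out time steps
```

Fitting on one set of trials and scoring on another prevents a high-dimensional activity vector from inflating R² through overfitting.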

Fig 5.

Value RNN dynamics resembled belief dynamics in each task.

A. Dynamics of beliefs in Task 1 (top) and Task 2 (bottom). Black arrows indicate transitions between states in the absence of observations (⌀) as a function of elapsed time, t, following an odor observation. ‘X’ indicates an unconstrained duration, and a dashed arrow indicates a transition that happens only when ‘X’ is finite. B. RNN activity at each time step (small black dots with connected lines) during an example trial in a 2D subspace identified using PCA, for two example networks trained on Task 1 (top) and Task 2 (bottom). The putative ITI fixed point is indicated as a purple circle. Vectors indicate the response to odor (black) and reward (red). Activity during an omission trial is shown in cyan, though note that omission trials were present in training data only for Task 2. C-D. Average normalized distance of each model’s activity from its fixed point following an odor (panel C) or reward (panel D) observation, over time. To allow distances to be compared across models, each model’s distances were normalized by the maximum distance following each observation. E. Difference between each RNN’s odor memory and reward memory, for Untrained RNNs and Value RNNs trained on each task. An RNN’s odor memory is defined as the number of time steps after an odor that it takes the RNN’s activity to return to its ITI fixed point (see panel C); reward memory is defined similarly (see panel D). Same conventions as Fig 3D.
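
The distance and memory measures in panels C-E can be sketched as follows. This is a minimal illustration: the return threshold `tol`, the function names, and the toy decaying trajectory are assumptions, and the paper's exact return criterion may differ.

```python
import numpy as np

def normalized_distance(activity, fixed_point):
    """Distance of each time step's activity (T, N) from the fixed point (N,),
    scaled by the maximum distance in the window."""
    d = np.linalg.norm(activity - fixed_point, axis=1)
    return d / d.max()

def memory_steps(norm_dist, tol=0.05):
    """First time step from which the normalized distance stays at or below
    tol; returns len(norm_dist) if it never settles. tol is assumed."""
    above = np.flatnonzero(norm_dist > tol)
    return 0 if above.size == 0 else int(above[-1]) + 1

# Toy trajectory: activity is kicked away from the fixed point by an
# observation and then decays back geometrically.
fixed_point = np.zeros(2)
t = np.arange(30)
activity = np.outer(0.7 ** t, np.array([1.0, -0.5]))

dist = normalized_distance(activity, fixed_point)
steps = memory_steps(dist)
```

Applying `memory_steps` separately to the post-odor and post-reward windows and subtracting the two counts gives the kind of difference plotted in panel E.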

Fig 6.

Value RNNs with larger capacity had more belief-like representations.

A. Error between the RPEs of the Value RNN and Untrained RNN relative to the RPEs of the Belief model (“RPE MSE”; see Fig 3D) during Task 2, as a function of the number of units in the RNN. Each dot indicates the error for a single Value RNN. Circles indicate the median across the N = 12 Value RNNs (dark purple) and N = 12 Untrained RNNs (light purple) with the same number of units. Remaining panels use the same conventions. B. Total variance explained (R²) of beliefs on held-out trials (see Fig 4B). C. Cross-validated log-likelihood of the state decoder using each RNN’s activity to estimate the true state (see Fig 4C). D. Difference between each RNN’s odor memory and reward memory (see Fig 5E).

Fig 7.

Value RNNs trained on Babayan et al. (2018) [10] reproduce Belief RPEs and learn belief-like representations.

A. Task environment of Babayan et al. (2018) [10]. Each trial consists of an odor and a subsequent reward. The reward amount depends on the block identity, which is resampled uniformly every five trials. B. Average phasic dopamine activity in the VTA of mice trained on the task at the time of odor (left) and reward (right) delivery. Activity is shown separately as a function of the trial index within the block (x-axis) and the current/previous block identity (colors). Reproduced from Babayan et al. (2018) [10]. C. Average RPEs of the Belief model (dashed lines) and an example Value RNN (solid lines). Same conventions as panel B. D. Total variance of beliefs explained (R²) using a linear transformation of model activity. Same conventions as Fig 4B. E. Cross-validated log-likelihood of linear decoders trained to estimate true states using RNN activity. Same conventions as Fig 4C. F. Dynamics of beliefs in the absence of observations. Same conventions as Fig 5A. G. Trajectories of an example Value RNN’s activity, in the 2D subspace identified using PCA, during an example trial from Block 1 (left) and Block 2 (right). These two dimensions explained 68% of the total variance in the Value RNN’s activity across trials. Putative ITI states indicated as purple circles. Same conventions as Fig 5B.
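
The 2D projection and the fraction of variance it explains (panel G) can be computed with PCA. The sketch below uses a singular value decomposition on synthetic data. The function name and the random activity matrix are assumptions for illustration, and the figure of 68% is not reproduced here.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X (T, N) onto its top principal components.

    Returns the projected data and the fraction of total variance captured.
    """
    Xc = X - X.mean(axis=0, keepdims=True)  # center each unit's activity
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = s ** 2                            # proportional to PC variances
    frac = var[:n_components].sum() / var.sum()
    return Xc @ Vt[:n_components].T, frac

# Synthetic stand-in for RNN activity: 200 time steps, 50 units with
# unequal variances (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)) * np.linspace(3.0, 0.1, 50)
proj, frac = pca_project(X)
```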

Fig 8.

Untrained RNNs can be used to estimate value, read out beliefs, or decode hidden states, but do not resemble belief dynamics.

A. Time-varying activations of 20 example units in response to an odor input, in an untrained RNN with 50 units, initialized with a gain of 0.9 (see Materials and methods). B. Same as panel A, but for an initialization gain of 1.9. C. RPE MSE (see Fig 3D) as a function of initialization gain, after training the value weights of each Value ESN (echo state network) to estimate value during Starkweather Task 2. Circles depict the median across N = 12 Value ESNs initialized with the same gain. Dashed line indicates the median across Task 2 Value RNNs with the same number of units. Same conventions for panels D-G. D. Belief R² (see Fig 4B) as a function of initialization gain. E. Cross-validated log-likelihood of state decoders (see Fig 4C) as a function of initialization gain. F-G. Number of time steps it took each Value ESN’s activity to return to its fixed point following an odor (panel F) or reward (panel G) observation, as a function of the initialization gain. H. Difference between each model’s odor memory and reward memory (see Fig 5E), for Value ESNs initialized with a gain of 1.9 (red) and Value RNNs (purple); same conventions as Fig 3D.
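
The effect of initialization gain in panels A-B can be illustrated with a random, untrained recurrent network. This is a sketch under assumptions: the tanh nonlinearity, the weight scaling of gain divided by the square root of the number of units, the single-step input pulse, and the function name are conventional choices made for this example and may not match the initialization described in Materials and methods.

```python
import numpy as np

def run_untrained_rnn(gain, n_units=50, n_steps=100, seed=0):
    """Simulate a random tanh RNN after a single input pulse.

    Recurrent weights are drawn from N(0, gain^2 / n_units); this scaling is
    a common convention and an assumption here.
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=gain / np.sqrt(n_units), size=(n_units, n_units))
    w_in = rng.normal(size=n_units)
    z = np.tanh(w_in)  # response to a one-step "odor" pulse
    history = [z]
    for _ in range(n_steps - 1):
        z = np.tanh(W @ z)  # no further input; recurrent dynamics only
        history.append(z)
    return np.array(history)

low = run_untrained_rnn(gain=0.9)   # same random draw, smaller gain
high = run_untrained_rnn(gain=1.9)  # same random draw, larger gain
```

In networks of this kind, a gain below 1 typically lets the response to the pulse fade, while a gain well above 1 typically sustains activity, which is why the gain controls how long an untrained network can retain a trace of an observation.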
