A recurrent neural network framework for flexible and adaptive decision making based on sequence learning

doi:10.1371/journal.pcbi.1008342

Fig 1.

Sequence learning and decision making.

a. Pavlovian learning. Different cues predict different reward outcomes. b. Instrumental learning. Different actions lead to different reward outcomes. c. An example match-to-sample task. Moving the lever leftward after a pair of matching cues leads to a reward. d. Reversal learning. Two options are presented. The left choice is initially rewarded, but the reward switches to the rightward choice in the second trial. Note that the contingencies between events within each trial and those between events across trials are both essential for learning. e. In more complicated decision making, the contingencies can be between many different types of sensory, action, and reward events distributed across time. Black brackets indicate contingencies that exist.

Fig 2.

The network framework and Task 1: Probabilistic reasoning.

a. The framework diagram. The network has three layers: the input layer, the hidden layer, and the output layer. The input layer receives the input sequences of sensory, action, and reward events. The hidden layer has 128 gated recurrent units. The output layer units mirror the input layer units and represent the prediction of future events. The diagram illustrates the particular input and output units for Task 1. b. Task 1: the reaction-time version of the probabilistic reasoning task. The subject fixates on a central point and views a series of shapes before responding by moving the eyes toward one of the two choice targets in the periphery. Each shape confers information regarding which target will be rewarded. The optimal strategy is to integrate the information and to make a choice when the integrated information hits a bound. The inset shows the sampling distributions.
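The three-layer layout described in panel a can be illustrated with a minimal NumPy sketch. It is written under stated assumptions and is not the authors' implementation: the class name `TinyGRU`, the 20 input/output units in the usage lines, the weight initialization, and the sigmoid readout are all hypothetical. Only the 128 gated recurrent units and the output layer mirroring the input layer come from the caption. Training is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyGRU:
    """Input layer -> GRU hidden layer -> output layer of the same size as the input."""

    def __init__(self, n_io, n_hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.1  # arbitrary initialization scale (assumption)
        # Index 0: update gate, 1: reset gate, 2: candidate state
        self.W = rng.normal(0.0, scale, (3, n_hidden, n_io))
        self.U = rng.normal(0.0, scale, (3, n_hidden, n_hidden))
        self.b = np.zeros((3, n_hidden))
        # Readout back to the event space, so outputs mirror inputs
        self.W_out = rng.normal(0.0, scale, (n_io, n_hidden))
        self.b_out = np.zeros(n_io)
        self.n_hidden = n_hidden

    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        h_cand = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        h_new = (1.0 - z) * h + z * h_cand
        y = sigmoid(self.W_out @ h_new + self.b_out)  # prediction of the next events
        return h_new, y

    def run(self, xs):
        """Feed a sequence of event vectors; return one prediction per time step."""
        h = np.zeros(self.n_hidden)
        preds = []
        for x in xs:
            h, y = self.step(x, h)
            preds.append(y)
        return np.array(preds)

# Usage: 5 time steps of a hypothetical 20-unit event vector
net = TinyGRU(n_io=20)
predictions = net.run(np.zeros((5, 20)))
```

Each row of `predictions` has the same length as an input vector, which is what lets the output be read as a prediction of upcoming sensory, action, and reward events.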

Fig 3.

Task 1: Behavioral analyses.

a. The psychometric curve. The model more often chooses the target supported by the accumulated evidence. The black curve is the logistic regression fit. b. The leverage of each shape on choice revealed by the logistic regression is consistent with its assigned logLR. c. Reaction time. The bars show the distribution of reaction time, quantified by the number of observed shapes (right y-axis). Green and red indicate the left and right choices, respectively. The lines indicate the mean total logLR (left y-axis) at the decision time, grouped by reaction time. Trials with only 1 shape or more than 16 shapes comprise less than 0.1% of the total trials and are excluded from the plot. d. The leverage of the first 3, the second and third from the last, and the middle shapes on the choice. Only trials with more than 6 shapes are included in the analysis. No significant differences are found between any pair of the coefficients of shape regressors (two-tailed t-test with Bonferroni correction). The error bars in all panels indicate SE across runs. Some error bars are smaller than the data points and not visible.
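A psychometric fit of the kind shown in panel a can be sketched as a one-regressor logistic regression of choice on total logLR. The example below is illustrative only: the data are simulated, the slope of 2.0 is made up, and fitting by Newton's method (`fit_logistic`, a hypothetical helper) is simply one convenient choice, not necessarily the procedure used for the figure.

```python
import numpy as np

def fit_logistic(x, y, n_iter=25):
    """Fit P(y=1) = sigmoid(b0 + b1 * x) by Newton's method."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                        # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated trials (assumption): choice probability depends on total logLR
rng = np.random.default_rng(0)
total_logLR = rng.uniform(-2.0, 2.0, size=5000)
true_slope = 2.0
p_choice = 1.0 / (1.0 + np.exp(-true_slope * total_logLR))
choice = (rng.random(5000) < p_choice).astype(float)

b0, b1 = fit_logistic(total_logLR, choice)
```

Extending `X` with one count column per shape would give per-shape coefficients comparable to the leverage analysis in panel b.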

Fig 4.

Task 1: Model training with limited dataset.

Same conventions as in Fig 3. The training dataset contains only 1000 unique sequences. a. The psychometric curve. b. The leverage of each shape on choice. c. Reaction time distribution (bars, right y-axis) and the mean total logLR (lines, left y-axis) at the decision time. Green and red indicate the left and right choices, respectively. d. The leverage of the first 3, the second and third from the last, and the middle shapes on the choice. The error bars in all panels indicate SE across runs. Some error bars are smaller than the data points and not visible.

Fig 5.

Task 1: Network units’ responses.

a. An example unit that prefers the left target. Its activity increases when the evidence supporting the left target grows and decreases when it drops. The unit’s responses converge when the network chooses its preferred target. The trials are grouped into quartiles by the total logLR in each epoch. The colors indicate the quartiles, and the error bars indicate the SE across trials. b. Population responses of the units that are selective to the total logLR. The trials are grouped based on the total logLR supporting each unit’s preferred target in each epoch. The error bars in panels b, c, and d indicate the SE across units. c. Urgency units. Their activities ramp up (upper panel) or down (lower panel) regardless of choice. d. Network unit response variability. The units’ response variability increases initially (blue curve) but decreases before the choice, more so when the preferred target is chosen (black) than when the non-preferred target is chosen (grey). Only the trials with more than five shapes are included in panels a, b, and d.

Fig 6.

Task 1: Which and when units.

a. The connection weights between the eye movement output units and the when units (upper panel) and the which units (lower panel). b. The reaction time of the network choice when the outputs of different groups of units are inactivated. c. Lesions to the when and which units affect choices differently. The blue bars indicate the proportion of correct trials. The orange bars indicate the proportion of trials in which the choice is consistent with the sign of the accumulated evidence at the time of choice. The green bars indicate the percentage of trials in which the model chooses the left target. d. Speed-accuracy tradeoff. We suppress the outputs of different proportions of +when/-when units (see Methods). As more +when units’ outputs are suppressed, the model’s reaction time (black curve, right y-axis) increases along with the accuracy (blue curve, left y-axis). However, the proportion of trials in which the choices are consistent with the evidence (orange curve, left y-axis) stays the same except for the extreme cases. e. The maximum flow (upper panel) and the inverse of the geodesic distance (lower panel) between different unit groups. The smaller maximum flow and the larger geodesic distance between when/which units and other units suggest relatively tight connections between the when and which units. ※ indicates a significant difference (p<0.05, two-tailed t-test with Bonferroni correction). The error bars in all panels indicate the SE across runs.
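The inactivation experiments in panels b to d can be pictured as masking the outputs of a chosen group of hidden units. The sketch below is one plausible reading, not the procedure from the Methods: the helper names `lesion_mask` and `readout`, the random selection of which units to silence, and the choice to mask only the readout are all assumptions.

```python
import numpy as np

def lesion_mask(n_hidden, unit_idx, proportion=1.0, seed=0):
    """Return a 0/1 mask that silences a random `proportion` of the units in `unit_idx`."""
    rng = np.random.default_rng(seed)
    unit_idx = np.asarray(unit_idx)
    n_silenced = int(round(proportion * len(unit_idx)))
    silenced = rng.choice(unit_idx, size=n_silenced, replace=False)
    mask = np.ones(n_hidden)
    mask[silenced] = 0.0
    return mask

def readout(W_out, h, mask):
    """Linear readout in which silenced hidden units contribute nothing."""
    return W_out @ (h * mask)

# Usage: silence half of a hypothetical group of 10 units out of 128
mask = lesion_mask(128, range(10), proportion=0.5)
```

Sweeping `proportion` from 0 to 1 for one unit group and re-running the task would produce a curve analogous to the speed-accuracy tradeoff in panel d.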

Fig 7.

Task 1: Sensory predictions.

a. The normalized subthreshold activities of 10 shape output units. We show the shape output units’ activities at the time step immediately before each shape onset for all epochs in all trials. The sum of the activities of all shape output units is normalized to 1. Data are divided into 10 groups by the total logLR before the shape onset, which is indicated by the color. b. The Kullback-Leibler (KL) divergence between the normalized subthreshold activities (as shown in Fig 7a) and the sampling distributions (shown in the Fig 2b inset). Data are grouped by the total logLR. The error bars indicate the SE across runs.
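The comparison in panel b can be sketched as a discrete KL divergence between two normalized vectors. The caption does not state the direction of the divergence, so the argument order below is an assumption, as are the function name `kl_divergence` and the small `eps` used to avoid division by zero.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for two non-negative vectors, each normalized to sum to 1."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Usage with made-up two-outcome distributions
d = kl_divergence([0.5, 0.5], [0.9, 0.1])
```

A value near zero means the normalized output activities closely match the distribution from which the shapes are sampled.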

Fig 8.

Task 2: Multisensory integration.

a. The psychometric curve. The model is trained with the unimodal conditions and tested with both the unimodal (green: visual, orange: vestibular) and the bimodal (black) conditions. Each data point represents the proportion of the left choice for a given motion direction. The model performs better and shows a steeper psychometric curve for the bimodal condition. b. The performance thresholds. The model’s thresholds (blue) are compared against the thresholds calculated with the optimal Bayesian inference (orange). The thresholds under the bimodal condition are significantly lower than those under either unimodal condition. The differences between the thresholds of the network and the thresholds calculated with Bayesian inference are not significant (two-tailed t-test with Bonferroni correction, p-value threshold = 0.05). The error bars indicate the SE across runs.
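For reference, the standard prediction for optimal integration of two independent Gaussian cues can be written in a few lines. This is the textbook cue-combination formula; whether the Bayesian thresholds in panel b were computed in exactly this way is an assumption, and the function name `optimal_bimodal_threshold` is hypothetical.

```python
import math

def optimal_bimodal_threshold(sigma_visual, sigma_vestibular):
    """Predicted bimodal threshold when two independent Gaussian cues are combined optimally."""
    var_v = sigma_visual ** 2
    var_s = sigma_vestibular ** 2
    return math.sqrt(var_v * var_s / (var_v + var_s))

# Usage with made-up unimodal thresholds
predicted = optimal_bimodal_threshold(3.0, 4.0)  # 2.4
```

The predicted value is always below the smaller of the two unimodal thresholds, which is the signature tested in panel b.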

Fig 9.

Task 3: Post-decision wagering.

a. The proportion of trials in which the model chooses the sure target. The color indicates the motion strength. The frequency of the sure target choices decreases with the motion strength and the duration of motion viewing. b. The accuracy. Solid lines are the trials with the sure target, and dashed lines are the trials without the sure target. The accuracy is higher when the sure target is available but not chosen. The error bars in panels a and b are SE across runs. c. Activities of choice-selective units. The responses are aligned to the movie onset (left) and the choice (right). The color of each line denotes the choice and motion direction. The dashed lines are trials in which the model chooses the sure target. The units have an intermediate activity level in trials in which the sure target is chosen. The error bars are SE across units with choice selectivity in all runs.

Fig 10.

Task 4: Two-step decision task.

a. The two-step task. The thick and thin lines denote the common and rare transitions, respectively. The contingencies indicated by the dashed lines are reversed across blocks. b. The switching behavior. Trials are aligned to the block switch (trial 0). The performance first drops to below the chance level but then gradually recovers. c. The probability of repeating the previous choice. The stay probabilities of the subsequent trials are higher for the CR and the RU trials than for the RR and the CU trials. d. Trial history effects. The choice in the current trial is affected by the trial types in the previous trials. Solid dots indicate significant effects (Bonferroni correction, p < 0.05). e. Factors affecting the choices. ※ indicates significance (p<0.01). f. Units in the hidden layer encode the difference between the estimated Q-values of the two actions. Grey lines represent the predictions based on the units’ activities in each run, and the black line is the average across runs. g. The response difference between the two choice output units is correlated with the difference between the estimated Q-values of the two actions.
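The stay-probability analysis in panel c can be sketched as follows. The labels are read here as common/rare transition crossed with rewarded/unrewarded outcome (CR, CU, RR, RU), which is the usual convention for the two-step task but is not spelled out in the caption. The function name `stay_probabilities` and the toy trial data are hypothetical.

```python
import numpy as np

def stay_probabilities(choices, common, rewarded):
    """P(repeat previous choice), split by the previous trial's transition and outcome."""
    choices = np.asarray(choices)
    common = np.asarray(common, dtype=bool)
    rewarded = np.asarray(rewarded, dtype=bool)
    stay = choices[1:] == choices[:-1]          # did trial t+1 repeat trial t's choice?
    prev_common, prev_rewarded = common[:-1], rewarded[:-1]
    labels = {"CR": (True, True), "CU": (True, False),
              "RR": (False, True), "RU": (False, False)}
    result = {}
    for name, (is_common, is_rewarded) in labels.items():
        sel = (prev_common == is_common) & (prev_rewarded == is_rewarded)
        result[name] = float(stay[sel].mean()) if sel.any() else float("nan")
    return result

# Usage on five made-up trials
probs = stay_probabilities(choices=[0, 0, 1, 1, 0],
                           common=[1, 0, 1, 0, 1],
                           rewarded=[1, 0, 0, 1, 0])
```

Higher stay probabilities after CR and RU trials than after RR and CU trials, as the caption reports, are the pattern usually taken as evidence that choices take the transition structure into account.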

Fig 11.

Task 4: Model testing with different block size.

Same conventions as in Fig 10. Models are trained with 50-trial blocks but tested with 70-trial blocks. a. The switching behavior. Trials are aligned to the block switch (trial 0). b. The probability of repeating the previous choice. c. Trial history effects. Solid dots indicate significant effects (Bonferroni correction, p < 0.05). d. Factors affecting the choices. ※ indicates significance (p<0.01).
