Credit Assignment during Movement Reinforcement Learning

doi:10.1371/journal.pone.0055352

Figure 1.

The experimental setup.

A): A cartoon illustration of the setup. Participants sat at a desk and drew movement trajectories on the horizontal desktop with a hand-held stylus. Hand displacement was recorded by a robot, and feedback was provided on a computer monitor placed on the desk. B): A graphical representation of how trajectories varied in both direction (α) and curvature (β). C): Learning progress in matching trajectories to a hidden target trajectory within a session of 25 trials.

Figure 2.

The implementation of the Bayesian model.

The implementation assumes a two-dimensional probability map that is updated iteratively, trial by trial. The map is a 200×200 matrix that codes the probability of each α-β combination. The values in the matrix are normalized so that the probabilities on the map sum to 1. The pink cross denotes the current target direction and curvature. The gray dot denotes the best solution before the current trial t, and the black dot denotes the best solution after the current trial. The map from the previous trial is degraded by memory decay and then serves as the prior for the current trial. The prediction error, that is, the difference between the reward predicted from the direction and curvature used in the current trial and the reward actually received, serves as the likelihood distribution used to update the probability map. Combining the prior and the likelihood yields the posterior distribution. Learning is evident in that the best solution of the posterior lies closer to the target solution than the best solution of the prior. The data are from a typical trial (the 4th trial in a 25-trial learning sequence) from a single participant.
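The update cycle described in this caption (decay, likelihood from prediction error, normalization) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the caption does not specify the form of the reward function, the decay, or the likelihood, so the Gaussian reward in `predicted_reward`, the mixing-with-uniform decay in `decay`, the Gaussian likelihood width `noise`, the α-β range of [-1, 1], and all function names are assumptions made for this example.

```python
import math

N = 200  # grid resolution per axis, matching the 200x200 map in the caption


def make_axis(lo=-1.0, hi=1.0, n=N):
    # Assumed parameter range; the caption does not give the units or limits.
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def predicted_reward(action, target, width=0.5):
    # Hypothetical reward: falls off as a Gaussian with the distance between
    # the executed (alpha, beta) and a candidate target (alpha, beta).
    da, db = action[0] - target[0], action[1] - target[1]
    return math.exp(-(da * da + db * db) / (2 * width ** 2))


def normalize(grid):
    total = sum(map(sum, grid))
    return [[p / total for p in row] for row in grid]


def decay(grid, rate=0.1):
    # Assumed form of memory decay: blend the old map with a uniform map.
    u = 1.0 / (len(grid) * len(grid[0]))
    return [[(1 - rate) * p + rate * u for p in row] for row in grid]


def update(prior, action, reward, axis, noise=0.1):
    # Likelihood of each candidate target from its reward prediction error.
    post = []
    for i, a in enumerate(axis):
        row = []
        for j, b in enumerate(axis):
            err = predicted_reward(action, (a, b)) - reward
            row.append(prior[i][j] * math.exp(-err * err / (2 * noise ** 2)))
        post.append(row)
    return normalize(post)


def best_solution(grid, axis):
    # Most probable (alpha, beta) cell on the map.
    _, i, j = max((p, i, j) for i, row in enumerate(grid) for j, p in enumerate(row))
    return axis[i], axis[j]


# Synthetic demonstration: three trials against a made-up hidden target.
axis = make_axis()
belief = [[1.0 / (N * N)] * N for _ in range(N)]
target = (0.3, -0.2)
for action in [(0.0, 0.0), (0.6, 0.4), (-0.5, 0.5)]:
    reward = predicted_reward(action, target)
    belief = update(decay(belief), action, reward, axis)
estimate = best_solution(belief, axis)
```

A single reward only constrains the target to a ring of α-β combinations that would have produced the same reward, which is why the best solution in this sketch sharpens only after several differently placed trials.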

Figure 3.

Learning data from typical participants from the α-reward condition (upper panels, A) and the β-reward condition (lower panels, B).

A): Five individual trial trajectories (blue), along with their corresponding hidden target trajectory (red), are shown in separate panels from left to right. The rightmost panel displays the absolute error in trajectory direction (α) and curvature (β). The direction of the targets is learned earlier than the curvature. B): Panels follow the same format as the upper panels. Curvature learning occurs early, but the errors in the direction of the targets remain high throughout the session.

Figure 4.

Learning-related changes.

A): The monetary rewards, averaged over all target trajectories, are plotted as a function of trial number. The two shaded curves show the mean ± SEM over participants, separately for the two reward conditions. B): The learning curves of the two trajectory properties, direction (α) and curvature (β), are plotted in black and red, respectively. The green and gray lines denote their corresponding exponential fits. The results from the two reward conditions are presented in two separate subplots. C): The same learning curves from B) are re-plotted in the α-β space. The arrows indicate the directions of change, and their sizes are proportional to the magnitude of the changes.
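As an illustration of how an exponential fit to a learning curve like those in panel B might be obtained, the sketch below fits a decaying exponential with an offset. The caption does not state the functional form or fitting method, so the model error ≈ a·exp(−t/τ) + c, the scan over candidate τ values, the function name `fit_exponential`, and the synthetic data are all assumptions for this example.

```python
import math


def fit_exponential(trials, errors, taus=None):
    """Fit errors ~ a * exp(-t / tau) + c.

    tau is scanned over a grid; for each tau, a and c follow from
    ordinary least squares in closed form.
    """
    if taus is None:
        taus = [0.5 + 0.1 * k for k in range(500)]  # candidate time constants (trials)
    n = len(trials)
    best = None
    for tau in taus:
        x = [math.exp(-t / tau) for t in trials]
        mx, my = sum(x) / n, sum(errors) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, errors))
        a = sxy / sxx          # amplitude of the decaying component
        c = my - a * mx        # asymptotic error level
        sse = sum((a * xi + c - yi) ** 2 for xi, yi in zip(x, errors))
        if best is None or sse < best[0]:
            best = (sse, a, tau, c)
    _, a, tau, c = best
    return a, tau, c


# Synthetic, noise-free learning curve over a 25-trial session (made-up values).
trials = list(range(1, 26))
errors = [30.0 * math.exp(-t / 5.0) + 4.0 for t in trials]
a, tau, c = fit_exponential(trials, errors)
```

Scanning τ keeps the example dependency-free; with real, noisy data a nonlinear least-squares routine would be the more usual choice.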

Figure 5.

Quiver plot of average changes in movement properties with an α- (A) and a β- (B) reward function.

The x- and y-axes denote the absolute errors in the two movement properties, respectively. The background temperature plots display the reward functions, with the maximum achievable reward centered at zero α-β error. The higher the temperature, the higher the reward. White vectors are the average parameter changes from one trial to the next across all target trajectories and all participants. Black vectors represent the Bayesian model's predictions.

Figure 6.

Average learning curves over each of 10 target trajectories are shown for the α-reward condition (A) and the β-reward condition (B) separately.

The learning curves, displayed in separate panels, are based on α error, β error, and the actual monetary rewards.
