Dynamic Integration of Value Information into a Common Probability Currency as a Theory for Flexible Decision Making

doi:10.1371/journal.pcbi.1004402

Fig 1.

Encoding the order of policies in sequential movements.

A: Probability distribution of the time to arrive at vertex j starting from the original state at time t = 0 and visiting all the preceding vertices. Each color codes the segments and the vertices of the pentagon as shown in the right inset. The pentagon is copied counterclockwise (as indicated by the arrow) starting from the purple vertex at t = 0. The gray trajectories illustrate examples from the 100 reaches generated to estimate the probability distribution of the time to arrive at vertex k given that we started from vertex k − 1. B: Probability distribution P(vertex = j|x_t), which describes the probability of copying the segment defined by the two successive vertices j − 1 and j at state x_t. This probability distribution is estimated at time t = 0; on arriving at the next vertex, we condition on completion and re-evaluate P(vertex = j|x_t) for the next vertices.

Fig 2.

The architectural organization of the theory.

It consists of multiple stochastic optimal control schemes, each attached to a particular goal currently presented in the field. We illustrate the architecture of the theory using the hypothetical scenario of a soccer game, in which the player in possession of the ball is presented with 3 alternative options (i.e., 3 teammates) located at different distances from the current state x_t. In such a situation, the control schemes related to these options are triggered and generate 3 action plans (u₁ = π₁(x_t), u₂ = π₂(x_t) and u₃ = π₃(x_t)) to pursue each of the individual options. At each time t, the desirabilities of each policy in terms of action cost and good value are computed separately, then combined into an overall desirability. The action cost of each policy is the cost-to-go of the remaining actions that would occur if the policy were followed from the current state x_t to the target. These action costs are converted into a relative desirability that characterizes the probability that implementing this policy will have the lowest cost relative to the alternative policies. Similarly, the good value attached to each policy is evaluated in the goods-space and is converted into a relative desirability that characterizes the probability that implementing that policy (i.e., selecting goal i) will result in the highest reward compared to the alternative options, from the current state x_t. These two desirabilities are combined to give what we call the “relative-desirability” value, which reflects the degree to which the individual policy π_i is desirable to follow, at the given time and state, with respect to the other available policies. The overall policy that the player follows is a time-varying weighted mixture of the individual policies, using the desirability value as the weighting factor. Because relative desirability is time- and state-dependent, the weighted mixture of policies produces a range of behavior from “winner-take-all” (i.e., pass the ball) to “spatial averaging” (i.e., keep the ball and delay your decision).
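The weighting scheme described in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the softmax transformation of costs and values, the product rule for combining the two desirabilities, and the names `relative_desirability`, `mixed_control`, `beta_cost` and `beta_value` are all assumptions chosen only to show how per-policy weights could turn into a single mixed command.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def relative_desirability(action_costs, good_values, beta_cost=1.0, beta_value=1.0):
    """Combine action costs and good values into one weight per policy.

    Assumption: a softmax stands in for "probability of lowest cost" and
    "probability of highest reward"; the two are multiplied and renormalized.
    """
    p_cost = softmax(-beta_cost * np.asarray(action_costs, dtype=float))
    p_value = softmax(beta_value * np.asarray(good_values, dtype=float))
    combined = p_cost * p_value
    return combined / combined.sum()


def mixed_control(controls, weights):
    """Weighted mixture of the individual control signals u_i = pi_i(x_t)."""
    controls = np.asarray(controls, dtype=float)  # shape (n_policies, control_dim)
    return weights @ controls


# Three options (e.g., three teammates) with hypothetical costs and equal values.
costs = [1.0, 2.0, 4.0]
values = [1.0, 1.0, 1.0]
controls = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

w = relative_desirability(costs, values)
u = mixed_control(controls, w)
```

In this sketch, equal costs and equal values give equal weights, so the mixed command is the average of the individual commands (the “spatial averaging” end of the range). As one option's weight approaches 1, the mixed command approaches that option's own command (the “winner-take-all” end).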

Fig 3.

Relative desirability function in reaching movements with multiple potential targets.

A: The method used to visualize the relative desirability function of two competing reaching policies (see results section for more details). B: Heat map of the log-transformed action cost for reaching the left target (gray circle) starting from different states. Red and blue regions correspond to high and low cost states, respectively. The black arrows describe the average hand velocity at a given state. C: Similar to panel B, but for the right target. D: Heat map of the relative desirability function at different states for reaching to the right target, when both targets provide the same amount of reward with equal probability. E: Similar to D, but for a scenario in which the right target provides the same amount of reward as the left one, but with 4 times higher probability. F: Similar to D, but for a scenario in which the mean reward provided by the right target is 4 times higher than the one provided by the left target.

Fig 4.

Rapid reaching movements in tasks with competing targets.

The top row illustrates experimental results in rapid reaching tasks with multiple potential targets [12, 22, 23] (images are reproduced with permission of the authors). When the target position is known prior to movement onset, reaches are made directly to that target (black and green traces in A); otherwise, reaches aim to an intermediate location before correcting in-flight to the cued target (red and blue traces in A). The competition between the two reaching policies, which results in spatial averaging movements, is biased by the spatial distribution of the targets (B), by recent trial history (C) and by the number of targets presented in each visual field (D). The bottom row (E-H) illustrates the simulated reaching movements generated in tasks with multiple potential targets. Each bottom panel corresponds to the reaching condition shown in the top panel above it.

Fig 5.

Saccadic movements in tasks with competing targets.

A: Simulated saccadic movements for pairs of targets with 30° (gray traces) and 90° (black traces) target separation. B: The method used to visualize the relative desirability function of two competing saccadic policies (see results section for more details). C: Heat map of the relative desirability function at different states for a saccade to the left target, at a 30° target separation. Red and blue regions correspond to high and low desirability states, respectively. Black traces correspond to averaged trajectories in single-target trials. Notice the strong competition between the two saccadic policies (greenish areas). D: Similar to panel C, but for 90° target separation. In this case, targets are located in areas with no competition between the two policies (red and blue regions). E: Examples of saccadic movements (left column) with the corresponding time course of the relative desirability of the two policies (right column). The first two rows illustrate characteristic examples from 30° target separation, in which competition results primarily in saccade averaging (top panel) and less frequently in correct movements (middle panel). The bottom row shows a characteristic example from 90° target separation, in which the competition is resolved almost immediately after saccadic onset, producing almost no errors. F: Percentage of simulated averaging saccades for different degrees of target separation (red line). The green, blue and cyan lines describe the percentage of averaging saccades performed by 3 monkeys [24].

Fig 6.

Sequential movements.

A: Examples of simulated trajectories for continuously copying a pentagon. B: Time course of the relative desirability values of the 5 individual policies (i.e., 5 segments) in a successful trial of copying a pentagon. The line colors correspond to the segments of the pentagon as shown in the top panel. The shape was copied counterclockwise (as indicated by the arrow) starting from the gray vertex. Each of the horizontal discontinuous lines indicates the completion time of copying the current segment. Notice that the desirability of the current segment peaks immediately after the start of drawing that segment and then falls gradually, whereas the desirability of the following segment starts rising while the current segment is still being copied. As a result, consecutive segments compete for action selection, frequently producing error trials, as illustrated in panel C. Finally, panels D and E depict examples of simulated trajectories for continuously copying an equilateral triangle and a square, respectively, counterclockwise starting from the bottom right vertex.
