Fig 1.
AR-RL and performance-gated deliberation.
(A) Task setting. Left: The within-trial state, St, evolves over trial time t in successive trials indexed by k. The decision ‘A’ is reported at the decision time (red cross), determining the trial reward, Rk, and the trial duration, Tk. Right: Sketch of cumulative reward versus cumulative duration. The context-conditioned reward rate (slope of red line) varies with alternating context (labelled 1 and 2) around the average reward, ρ (dashed line). (B) Decision rules based on the opportunity costs of commitment and deliberation. The AR-RL rule (black ‘x’) finds the t that minimizes their sum. The PGD rule (black cross) finds the tdec at which they intersect. (C) Schematic diagram of each algorithm’s dependencies. PGD computes a decision time directly from the two opportunity costs, while AR-RL uses both to first estimate a value function, whose maximum specifies the decision time. (D) Loss (error in performance relative to the optimal policy, (ρ* − ρ)/ρ*) over learning time in a patch-leaving task (AR-RL: brown, PGD: black). The arrow indicates when the state labels were randomly permuted.
Fig 2.
Non-stationary opportunity cost.
(A) Top: Dynamics of trial performance (blue), shown with its distribution, together with the dynamics of the context-conditioned averages of performance (orange) and the effectively stationary average performance (purple). Bottom: These are decomposed into a hierarchy by filtering the reward history on trial, context, and long timescales, respectively. (B) Two hypothetical forms for the context-specific trial opportunity cost. Top: Trial-unaware cost, in which context varies the slope around ρ. Bottom: Trial-aware cost, in which context variation enters through a bias (cf. Eq 5).
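The filtering hierarchy in panel (A) can be sketched as follows, assuming a first-order exponential (leaky) filter applied to the per-trial reward history. The caption does not specify the filter form, so the update rule, the function name `filter_rewards`, the timescale values, and the block structure below are all illustrative assumptions.

```python
def filter_rewards(rewards, tau):
    """Exponentially filter a reward sequence with timescale tau (in trials)."""
    if tau < 1:
        raise ValueError("tau must be at least one trial")
    avg = rewards[0]  # initialise at the first observation (assumption)
    filtered = []
    for r in rewards:
        avg += (r - avg) / tau  # leaky-average update
        filtered.append(avg)
    return filtered


# Hypothetical reward history: one high-reward block followed by one
# low-reward block, 300 trials each.
history = [1.0] * 300 + [0.0] * 300

tau_context = 30    # hypothetical context timescale (trials)
tau_long = 3000     # hypothetical long timescale (trials)

context_avg = filter_rewards(history, tau_context)
long_avg = filter_rewards(history, tau_long)
```

With these toy settings the context-timescale average tracks each block, while the long-timescale average barely moves across the switch, which is the separation the hierarchy relies on.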
Fig 3.
PGD agent performs the tokens task for periodic context switching.
(A) A tokens task trial. Left: Tokens jump from a center to a peripheral region (gray circles). Right: The tokens difference, Nt, evolves as a random walk that accelerates according to α (here 3/4) post-decision time, tdec. The trial duration is T, which includes an inter-trial interval. (B) Decision dynamics in cost space obtained from evidence dynamics in (A). Commitment cost trajectories (gray lattice; thick gray: trial-averaged) start at and end at 0. Trajectory from (A) shown in black. tdec (black cross) is determined by the crossing of the commitment and deliberation cost. (C) Incentive strength switches between two values every 300 trials. (D) Expected rewards filtered on τlong (purple) and τcontext (green). Black dashed lines from bottom to top are ρα=1/4, ρ, and ρα=3/4.
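The trial structure in panel (A) can be sketched as a simple simulation. The sketch assumes unit time steps before the decision, post-decision steps shortened to a fraction 1 − α of the pre-decision step, and unbiased ±1 token jumps. The number of tokens, the inter-trial interval, the decision step, and the function name `simulate_tokens_trial` are hypothetical choices that the caption does not state.

```python
import random


def simulate_tokens_trial(n_tokens=15, t_dec=6, alpha=0.75, iti=1.0, seed=0):
    """Simulate one trial; return (times, token_differences, trial_duration)."""
    rng = random.Random(seed)
    t, n = 0.0, 0
    times, diffs = [0.0], [0]
    for step in range(1, n_tokens + 1):
        # Jumps up to the decision take one time unit; later jumps are
        # sped up so that each takes (1 - alpha) units (assumption).
        dt = 1.0 if step <= t_dec else 1.0 - alpha
        t += dt
        n += rng.choice((-1, 1))  # token moves to one side or the other
        times.append(t)
        diffs.append(n)
    # Trial duration includes the inter-trial interval.
    return times, diffs, t + iti


times, diffs, duration = simulate_tokens_trial()
```

Under these assumptions the trial duration is tdec + (1 − α)(n_tokens − tdec) + ITI, so a larger α shortens trials that end with an early decision.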
Fig 4.
PGD model fit to NHP behaviour for non-stationary α-dynamics reported in Ref. [25].
(A) Block length sequence used in the experiment. (B, C) Decision times (dots) aligned on the context-switching event type (fast-to-slow in gray; slow-to-fast in color) and averaged. Shaded regions are the standard error bounds of the models’ average decision times. (D) Error evaluated on a plane cut through the parameter space, with the remaining two parameters held at their best-fitting values (gray area indicates timescales within an order of magnitude of the end of the experiment). The first 10 contours are shown, in increments of 0.01 in error from the minimum (circle marker). Colors refer to subject, as in (B) and (C). (E, F) Same for two further plane cuts, each with the remaining two parameters held at their best-fitting values.
Fig 5.
Context-conditioned analysis of PGD and comparison to AR-RL models.
(A) Reward rate as a function of incentive strength, α. The AR-RL solution with no augmented cost (c = 0) interpolates between the wait-for-certainty strategy (brown) and the one-and-done strategy (red). Also shown are the slow and fast context-conditioned reward rates for the two primates (blue and orange circles) and for the PGD model fitted to them (crosses). For reference, we show the mean ± standard deviation of a forthcoming dataset of 32 humans. Reward rates for the human and non-human primates lie squarely between those of the best (black dashed) and uniformly random (gray) strategies. (B, C) The distribution over trials of differences in decision times between model and data, |Δtdec| = |tdec,data − tdec,model|, conditioned on slow and fast block contexts. Solid lines are for PGD. Dotted lines are for the AR-RL solution using the cost rate, c*, with the lowest mean error. The residual sum of squares (RSS) for each model/block combination is displayed. (D-G) Interpolated state-conditioned survival probabilities, P(tdec = t|Nt, t), over slow (D, F) and fast (E, G) blocks. White dotted lines show the P(tdec = t|Nt, t) = 0.5 contour. (H, I) State-conditioned decision time frequencies (cross size) from AR-RL optimal decision boundaries across different values of the cost rate, c (colored crosses), for slow (H) and fast (I) conditions. Only samples with Nt < 0 and Nt > 0, respectively, are shown. For comparison, the reflected axes show the state-conditioned decision time frequencies of the data as gray crosses.
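The per-trial error measure in panels (B, C) can be written out as a short sketch. It assumes that data and model decision times are paired trial by trial, and it uses the standard definition of the residual sum of squares. The function name `decision_time_errors` and the example values are hypothetical.

```python
def decision_time_errors(t_data, t_model):
    """Return per-trial |t_data - t_model| and the residual sum of squares."""
    if len(t_data) != len(t_model):
        raise ValueError("data and model must cover the same trials")
    abs_diff = [abs(d - m) for d, m in zip(t_data, t_model)]
    rss = sum((d - m) ** 2 for d, m in zip(t_data, t_model))
    return abs_diff, rss


# Hypothetical decision times (in token steps) for three trials.
abs_diff, rss = decision_time_errors([5, 7, 9], [6, 7, 7])
```

Computing the two quantities separately for slow and fast blocks gives one distribution and one RSS value per model/block combination.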
Fig 6.
Comparing neural urgency and collapsing decision boundaries.
(A) Top: Rising belief (blue) meets a collapsing decision boundary (black dashed) in belief space. Middle: Falling commitment cost (blue) meets rising deliberation cost (black dashed) in cost space. Bottom: Belief/commitment cost is encoded (blue) into a low-dimensional neural manifold, with the addition of an urgency signal (orange) (cf. Fig 8 in [8]). The decision (red circle) is taken when the sum passes a fixed threshold (black dashed). (B) Deliberation cost maps onto the urgency signal extracted from the zero-evidence-conditioned, cell-averaged firing rate in PMd (200 ms time steps).