Reframing dopamine: A controlled controller at the limbic-motor interface

Pavlovian influences notoriously interfere with operant behaviour. Evidence suggests this interference sometimes coincides with the release of the neuromodulator dopamine in the nucleus accumbens. Suppressing such interference is one of the targets of cognitive control. Here, using the examples of active avoidance and omission behaviour, we examine the possibility that direct manipulation of the dopamine signal is an instrument of control itself. In particular, when instrumental and Pavlovian influences come into conflict, dopamine levels might be affected by the controlled deployment of a reframing mechanism that recasts the prospect of possible punishment as an opportunity to approach safety, and the prospect of future reward in terms of a possible loss of that reward. We operationalize this reframing mechanism and fit the resulting model to rodent behaviour from two paradigmatic experiments in which accumbens dopamine release was also measured. We show that in addition to matching animals’ behaviour, the model predicts dopamine transients that capture some key features of observed dopamine release at the time of discriminative cues, supporting the idea that modulation of this neuromodulator is amongst the repertoire of cognitive control strategies.


Introduction
Evolution has endowed animals with behavioural tendencies such as approaching and engaging with sources and predictors of food, and freezing or withdrawing from sources and predictors of punishment [1]. Such 'Pavlovian' inductive biases [2] provide an effective way to obviate the need for learning in situations that are common and occasionally critical [3], and exert a powerful influence on behaviour [4,5]. However, they can also lead to counterproductive Pavlovian-instrumental conflict ('Pavlovian misbehaviour' [2]) if animals must act vigorously to avoid predicted punishment, or withhold actions to gain potential reward [2,6-9]. These biases then need to be suppressed or supplanted, in what can be interpreted as a form of cognitive control.
One contributor to the Pavlovian-instrumental conflict may be the neuromodulator dopamine (DA), in a clash between its dual roles in positive reinforcement and motivational vigour [10-12]. Evidence from canonical versions of these conflict paradigms suggests that DA in the core of the nucleus accumbens (NAc), at least when performance is successful, follows its motivational, rather than its reinforcement, role, with enhanced DA concentrations being observed during active avoidance [13,14] and suppressed DA concentrations when behavioural suppression is required to gain reward [15].
Partly inspired by the two-factor theory of active avoidance [16,17], Boureau and Dayan [18] suggested that such DA dynamics might be conceptualized as arising from a shift of the origin in a valence-action space. In the active avoidance case, a shift to a negative valence corresponding to expected punishment means that a neutral outcome (avoidance) appears positive; enhanced release of DA associated with the prospect of safety could then play a role in energizing the necessary active avoidance response. Conversely, when behavioural suppression is required to gain reward, a shift of the origin to the associated positive valence means that a neutral outcome (no reward) appears negative; suppression of DA release associated with the prospect of this loss may promote behavioural inhibition.
Subsequent work [19,20] elaborated on this suggestion in relation to active avoidance, but did not provide a process account of the reframing required to turn a situation that is, at best, neutral into one that appears positive. Furthermore, an account of the opposite reframing (to turn a situation that is, at worst, neutral into one that appears negative) has been lacking.
In the current work, we address these shortcomings via a modelling framework that characterizes the putative reframing operations of [18], and associated effects on DA signalling, as resulting from internal cognitive control actions. As modelling targets, we focus on two recent experimental studies in rodents, both involving measurement of NAc DA release: a study by Gentry et al. [21] involving active avoidance, and a study by Syed et al. [15] involving behavioural suppression to obtain reward. After briefly outlining the main idea of the model, we describe these experiments and their principal findings, and show how our model may account for certain critical features of the observed dynamics of DA release associated with cue and control (leaving out the outcome). We also consider the important issue of how the putative reframing mechanism could remain stable given the plasticity typically associated with DA release.

Model
In instrumental or operant conditioning [1], animals learn to make responses given particular sensory stimuli. These responses are based, at least initially [22], on the outcomes that are contingent on those responses: animals typically prefer responses that lead to greater rewards or lesser punishments, and avoid responses that lead to greater punishments or lesser rewards. Conversely, in Pavlovian or classical conditioning, animals learn the predictive relationship between sensory stimuli and affectively important outcomes, and then those stimuli come to elicit a set of automatic, conditioned, responses irrespective of the contingency between those responses and the outcomes [1]. Pavlovian responses include approach and engagement for appetitive predictors, and withdrawal and inhibition for aversive predictors.
The involuntary nature of conditioned responses implies that difficulties can arise in situations such as active avoidance (in which animals avoid a punishment only if they act in a short time after a predictive cue) and omission schedules (in which animals will receive a reward following a cue only if they do not act). This is because the instrumental requirement (acting or withholding, respectively) is directly opposed by the Pavlovian conditioned response (freezing to the punishment predictor, or engaging to the reward predictor) [2,23,24].
In a simple case in which stimuli s are potentially associated with the emission ('Go') or withholding ('NoGo') of active responses, this was operationalized in [25] by state-action values (Q-values [26]) Q(s, Go) and Q(s, NoGo), which respectively capture the long-run benefit of responding or not, being additively corrupted by a quantity ωV(s) proportional to the predicted affective value of the state.
First consider the case of the omission schedule. In this case, provided that performance is at least somewhat successful, s will predict some reward, and so V(s) > 0. Then, assuming that ω > 0, the Pavlovian factor will boost the propensity to act ('Go'), and so may in some cases interfere with the very behaviour (suppression of action) that led to success in the first place.
Noting (i) that the TD error [27]
$$\delta_t = r_t + V(s_{t+1}) - V(s_t) \tag{1}$$
typically associated with state $s_{t+1}$ is just the value of that state, $V(s_{t+1})$, if there is no extrinsic reward at that moment ($r_t = 0$) and no precise prior expectation ($V(s_t) = 0$), and (ii) the dopaminergic realization of this TD error [10,28-30], then one contributor to this Pavlovian effect might be the incentive salience-associated release of DA [31]; this would energize action [32], perhaps via its action on direct and indirect pathways in the striatum [12]. Conversely, in the case of active avoidance, if the animal is at least partially incompetent and so receives some shocks, $V(s) < 0$. In this case, the Pavlovian factor, $\omega V(s)$, will suppress the propensity to act (i.e., favours 'NoGo'), e.g., by promoting freezing or withdrawal. It is notably less clear that this arises only from below-baseline DA at $V(s_{t+1}) < 0$ as opposed, for instance, to the activity of a separate opponent system [18,33] or a non-dopaminergic boosting of the indirect pathway in the striatum [12,34].
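To make the additive Pavlovian corruption concrete, the following minimal Python sketch adds ωV(s) to the 'Go' propensity and passes the difference through a logistic choice rule. The logistic rule, the inverse temperature `beta`, the choice to bias only the 'Go' propensity, and all numerical values are illustrative assumptions, not part of the model in [25].

```python
import math

def go_probability(q_go, q_nogo, v_state, omega, beta=1.0):
    """P(Go) when a Pavlovian term omega * V(s) is added to the Go propensity.

    Assumption: logistic (two-option softmax) choice rule with inverse
    temperature beta; the Pavlovian bias is applied to 'Go' only.
    """
    w_go = q_go + omega * v_state   # instrumental value plus Pavlovian bias
    w_nogo = q_nogo
    return 1.0 / (1.0 + math.exp(-beta * (w_go - w_nogo)))

# Omission schedule (hypothetical values): withholding is instrumentally
# better, but V(s) > 0 pushes the animal towards acting.
p_go_omission = go_probability(q_go=0.0, q_nogo=1.0, v_state=2.0, omega=1.0)
p_go_omission_unbiased = go_probability(q_go=0.0, q_nogo=1.0, v_state=2.0, omega=0.0)

# Active avoidance (hypothetical values): acting is instrumentally better,
# but V(s) < 0 pushes the animal towards withholding.
p_go_avoidance = go_probability(q_go=1.0, q_nogo=0.0, v_state=-2.0, omega=1.0)
```

In both hypothetical cases the Pavlovian term flips the preferred response away from the instrumentally better one, which is the conflict described above.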
Our central conceit is that the coupling of DA with action provides both the opportunity and need for a form of cognitive control in which DA release is manipulated by a reframing of values. This generalizes the suggestion [18,20] that the origin of the valence axis of the affective circumplex can be adjusted. Such control might fully determine a particular trajectory for DA release; however, we explore a more limited construct in which it induces new, counterfactual [35], states associated with default expectations, with an effect on dopamine concentrations associated with the discriminative cues in these experiments. This induction then influences DA.
We provide the essence of the model in the results below; full modelling details can be found in the Methods.

Active avoidance
A particular example involving active avoidance is provided by Gentry et al. [14], who used fast-scan cyclic voltammetry (FSCV) to examine DA release in the core of the NAc during performance of a mixed-valence task. In one class of trials, rats heard a tone telling them that they had to press a lever within a 10 s response window to avoid a shock (Fig 1a). Half the animals often struggled to respond actively in time (Fig 1b and 1c), and showed higher rates of freezing, a typical example of Pavlovian-instrumental conflict. However, across the population, on trials when they did press, cue-elicited DA release was similar just after shock or reward cues (Fig 1d), notably being stimulated rather than suppressed.
For convenience, we write $s_{\mathrm{pre}}$ for the state before the tone that indicates trial type, with $V(s_{\mathrm{pre}}) \simeq 0$ (from long, subjectively uncertain, inter-trial intervals), and $s_\chi$, $\chi \in \{\mathrm{rew}, \mathrm{neu}, \mathrm{shk}\}$, for the different states entered depending on the tone. Then, for shock trials and imperfect avoidance, $V(s_{\mathrm{shk}}) < 0$. Thus, specifying the terms in Eq (1), and noting that $r_t = 0$ at the time the stimulus is presented, we would have
$$\delta_{\mathrm{shk}} = V(s_{\mathrm{shk}}) - V(s_{\mathrm{pre}}) \simeq V(s_{\mathrm{shk}}) < 0, \tag{2}$$
promoting Pavlovian inhibition; as positive feedback, this would make avoidance harder, thus making $V(s_{\mathrm{shk}})$ more negative and exacerbating the problem.
Our assumption is that for shock trials, the deployment of cognitive control instills a counterfactual state $s_{\mathrm{fail}}$ that substitutes for $s_{\mathrm{pre}}$, with $V(s_{\mathrm{fail}}) \ll 0$ quantifying the full explicit cost of the shock. Then,
$$\delta_{\mathrm{shk}} = V(s_{\mathrm{shk}}) - V(s_{\mathrm{fail}}) > 0, \tag{3}$$
promoting Pavlovian action. This relocation of the origin of the affective circumplex to the negative affective value associated with presumed failure and thus the shock (Fig 1g) harmonizes Pavlovian and instrumental control in the service of active responding, and would explain the positive ("damned if you don't") DA transient for successful avoidance in [14].
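The sign flip produced by the reframing can be checked with a few lines of Python. This is only a sketch: the TD error is taken in the undiscounted form described for Eq (1), and the three state values are hypothetical numbers chosen for illustration.

```python
def td_error(reward, v_next, v_prev):
    """TD error in the form used for Eq (1): delta = r + V(next) - V(prev)."""
    return reward + v_next - v_prev

# Hypothetical values, for illustration only.
V_PRE = 0.0     # pre-cue state, V(s_pre) ~ 0
V_SHK = -4.0    # shock-cue state under imperfect avoidance, V(s_shk) < 0
V_FAIL = -10.0  # counterfactual failure state, V(s_fail) << 0

# No control: the cue is evaluated against the pre-cue state.
delta_no_control = td_error(0.0, V_SHK, V_PRE)

# Control: the counterfactual state s_fail substitutes for s_pre.
delta_control = td_error(0.0, V_SHK, V_FAIL)
```

With these numbers the same shock cue yields a negative error (-4) without control and a positive error (+6) with control, because imperfect avoidance is still better than certain failure.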
To test this, we fitted a model (see Methods) that incorporates Pavlovian influences, via an effective value of ω, and a probability of employing control to the animals' behaviour (Fig 1b and 1c). Averaging over the resulting mixture of differential TD errors for no-control vs. control shock trials (Fig 1e) then indeed implies a net-positive DA signal on trials where animals successfully avoid shock (Fig 1f), assuming that the TD signal is conveyed by DA transients (noting that the modelled concentrations are arbitrarily scaled in the figure). The predicted suppression of DA release for poor avoiders on failed avoidance trials (Fig 1h) would be consistent with such failures of control, and with observations [36,37] that enhanced or suppressed DA release given a warning cue predicts successful or failed active avoidance. Of course, given small ω, successful avoidance could be achieved without control.
We briefly note some discrepancies between the data and model behaviour. Firstly, the relative magnitude of DA release on poor-avoidance neutral trials in the model is lower than is observed in the data, a characteristic also noted in [20]. One possibility is that there is partial confounding of cues, leading to a degree of generalization in the DA response [38,39]. By contrast, we assumed perfect knowledge of the relationship between cues and trial types. Secondly, the model appears to predict greater cue-evoked DA release (on average) during successful avoidance for poor avoidance sessions (cf. One thing to note here, however, is that the plots in Fig 1d depend on splitting sessions according to a particular operationalization of good vs. poor avoidance (in [21], this was based on the relative performance in neutral and shock trials, rather than on performance in shock trials alone); such categorization may have obscured relationships that would be apparent by instead considering avoidance on a continuum. Indeed, an intriguing observation made by Gentry et al. [21] was of a significant negative correlation between the rate of successful avoidance and the magnitude of shock cue-evoked DA release on avoidance trials; i.e., the worse the avoidance performance, the larger the DA release in response to the shock cue on avoidance trials. Given the relative paucity of data, we were not able to fit individual animals or sessions, but we certainly find this correlation suggestive.
Finally, we only set out to model the DA transients associated with the cue (i.e., the 5 s period between cue onset and lever insertion, ending at the shaded region in Fig 1d). However, Fig 1d suggests that even within this limited time window, cue-evoked DA release on press trials is prolonged for the reward cue relative to the shock cue. As discussed in our previous work [20], one possibility is that this arises from incomplete 'predicting away' of the rewarding outcome on those trials, something which we have not addressed here. Indeed, relatively persistent DA release in response to cues and outcomes that are (in principle) perfectly predictable appears to be quite common in rodent experiments (e.g., [40-42]). We also note the intriguing hint of a second positive DA transient on poor avoidance neutral trials around 7-8 s that presumably coincides with lever retraction, suggesting that the lever may have acquired net negative valence.

Go/No-Go
While Pavlovian influences may take the form of counterproductive behavioural inhibition in the case of active avoidance, they may also appear as unhelpful behavioural activation when suppression would be preferable. Syed et al. [15] used a Go/No-Go task in which one of four auditory cues indicated whether rats had to leave a nose-poke ('Go') and execute an active response (press a lever twice) or stay in the nose-poke until the tone turned off ('No-Go') in order to get a small or large reward (Fig 2a). Animals were reliably successful on Go large-reward (GL) trials, but less so on No-Go large-reward (NGL) trials (Fig 2b). We attribute this to Pavlovian misbehaviour caused by the prospect of a large reward, consistent with the faster ultimate reaction time on successful large-reward trials in both Go and No-Go conditions (Fig 2c). Mirroring the case of active avoidance [14], on successful NGL trials, after a minor peak, there was a suppression of DA below baseline during the No-Go period (followed by a rise at movement initiation), despite the prospect of large reward (Fig 2d); by contrast, on successful GL trials, there was a marked early increase.
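Again, we write $s_{\mathrm{pre}}$ for the state before the disambiguating cue, with $V(s_{\mathrm{pre}}) \simeq 0$, and $s_\chi$, $\chi \in \{\mathrm{gs}, \mathrm{gl}, \mathrm{ngs}, \mathrm{ngl}\}$, for the states inspired by the respective cues. Partial success on NGL trials, and thus large rewards, would make $V(s_{\mathrm{ngl}}) > 0$, with
$$\delta_{\mathrm{ngl}} = V(s_{\mathrm{ngl}}) - V(s_{\mathrm{pre}}) \simeq V(s_{\mathrm{ngl}}) > 0, \tag{4}$$
promoting Pavlovian action, No-Go failure, and ultimately a decrease in $V(s_{\mathrm{ngl}})$, lessening the misbehaviour. This slow negative feedback could even lead to oscillations. In this case, we consider cognitive control as instilling a counterfactual state $s_{\mathrm{succ}}$ with $V(s_{\mathrm{succ}}) \gg 0$ quantifying the full value of succeeding in the No-Go requirement. Then
$$\delta_{\mathrm{ngl}} = V(s_{\mathrm{ngl}}) - V(s_{\mathrm{succ}}) < 0. \tag{5}$$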

Stability of reframing
An important remaining problem with the proposed reframing is the apparent absence of learning. For instance, if the DA signal of Eq (3) is positive, why does normal plasticity, associated with conventional TD learning, not zero out this egregious prediction error? One possibility is that downstream systems might be informed directly about the counterfactual status of the reframing, and so avoid untoward plasticity. One could only speculate as to how this information could flow and take effect.
A second possibility is that cortico-striatal plasticity is confined to precise temporal windows, occasioned for instance by the activity of tonically active cholinergic neurons [43-46]. This window could be explicitly closed as part of the operation of control and so avoid the undesired plasticity.
Third, there might instead be an active mechanism associated with opponency [18,20,33] (see also [12,34]). That is, rather than
$$\Delta V(s_{\mathrm{fail}}) \propto \delta_{\mathrm{shk}},$$
which would adjust $V(s_{\mathrm{fail}})$ towards 0 until $\delta_{\mathrm{shk}} = 0$, cancelling out the reframing mechanism, we would have
$$\Delta V(s_{\mathrm{fail}}) \propto \delta_{\mathrm{shk}} - \bar{\delta}_{\mathrm{shk}}$$
for an opponent prediction error $\bar{\delta}_{\mathrm{shk}}$. Then, $\Delta V(s_{\mathrm{fail}}) = 0$ when
$$\delta_{\mathrm{shk}} = \bar{\delta}_{\mathrm{shk}},$$
rendering reframing stable. The same argument can be made for $s_{\mathrm{succ}}$ in the No-Go case. This last perspective elucidates other cases with apparently non-zero asymptotic DA. Thus, the evidence from tasks demanding substantial work from subjects is that DA release does not inversely covary (at least strongly) with demands on vigour [47], but that compromising DA (e.g., by selective lesions [11]) compromises the willingness of subjects to overcome substantial effort costs in their active responding. If we imagine that those effort costs are conveyed by opponent terms such as $\bar{\delta}_{\mathrm{shk}}$, then the net influence on action in the striatum would depend on $\delta_{\mathrm{shk}} - \bar{\delta}_{\mathrm{shk}}$, which would evidently be compromised by DA deficits.
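The opponency argument can be illustrated with a short simulation. The sketch below is built on assumptions that are not specified in the text: a delta-rule update of V(s_fail) with a hypothetical learning rate, a constant opponent signal, and illustrative state values.

```python
def simulate_v_fail(v_shk, v_fail, opponent, rate=0.1, steps=200):
    """Iterate V(s_fail) += rate * (delta - opponent), delta = V(s_shk) - V(s_fail).

    Assumptions: simple delta rule, fixed V(s_shk), constant opponent error.
    """
    for _ in range(steps):
        delta = v_shk - v_fail
        v_fail += rate * (delta - opponent)
    return v_fail

# Hypothetical values, for illustration only.
V_SHK, V_FAIL = -4.0, -10.0

# Conventional TD learning (no opponent): V(s_fail) drifts until delta = 0,
# so the reframing is learned away.
v_fail_plain = simulate_v_fail(V_SHK, V_FAIL, opponent=0.0)

# Opponent error equal to the dopaminergic error: no net change in V(s_fail),
# so the positive delta persists.
v_fail_opponent = simulate_v_fail(V_SHK, V_FAIL, opponent=V_SHK - V_FAIL)
```

Under these assumptions the unopposed update erases the 6-unit prediction error, whereas the matched opponent leaves V(s_fail) at its reframed value.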
The opponent might also help resolve a tension in our model of active avoidance between the apparent consistency of good-avoiders' behaviour with negligible Pavlovian influence (i.e., $\omega \approx 0$) and the putative origin of positive DA on shock trials in the deployment of control.
Instead of having no effect, as at present in the model, control might also be necessary for the good avoiders to overcome any possible Pavlovian misbehaviour arising from the opponent.Of course, control might also influence the opponent [48], making for a rich palette of possible interactions.
Key areas for future work include modelling the cost [49], learning [50], and anatomical realization (putatively involving the anterior cingulate cortex [51]) of cognitive control, along with the likely (and recursive) mesocortical DA influence over its prefrontal operation [52,53]; addressing the ultimate habitization, at least for the avoidance case, of the relevant action and obviation of cognitive control [54]; encompassing the known spatial [55,56] and temporal [57] heterogeneity in DA release; capturing the fuller temporal extent of the DA signal rather than just the cue-associated response, including its persistence until the time of reward (cf. Fig 1d); explaining the effects of pharmacological manipulation in the Go/No-Go task [58,59]; and incorporating the dorsal striatum, with its suggested focus on the instrumental components of control, and its own dopaminergically-impacted bias in favour of action [12]. Even more generally, the counterfactual states that are instantiated through the medium of cognitive control could have implications beyond DA, for instance engaging default, Pavlovian, policies that are more specific than just activation or inhibition.
In sum, we have suggested a neurocomputational architecture in which simple rules coupling action and valence are subject to a form of cognitive control whose mode of action exploits this very coupling.

General model
Both tasks are modelled in essentially the same way (Fig 3). First, an 'internal' decision is made when a cue arrives, at state $s_{\mathrm{cue}}$, about whether to apply self-control ($C = 1$) or not ($C = 0$). There then follows an 'external' decision about the physical action at state $s_1$ or $s_0$, as appropriate. It is ultimately the physical action that determines success (to $s_{\mathrm{succ}}$) or failure (to $s_{\mathrm{fail}}$) for the current trial. Following the inter-trial interval (ITI), and any additional time penalty for failure, the next trial begins.
The recurrent nature of the tasks, and the fact that faster responses on the current trial can generally increase the rate of rewards (or punishments) by hastening opportunities to earn future outcomes in subsequent trials, mean that it is natural to employ an average-reward framework for the analysis. As shown in previous work (e.g., [19,60]), the average reward rate ρ then becomes an important factor in promoting vigour by effectively determining the opportunity cost of time.
We denote trial type by χ. For simplicity, we assume that the initial, internal decision is made only on the trial types of most interest, i.e., on shock trials (χ = shk) in the mixed-valence task, and on No-Go large-reward (NGL) trials (χ = ngl) in the Go/No-Go task. Thus, we assume that the default choice is always $C = 0$, only possibly deviating on these particular trial types.
The choice of whether to apply self-control or not is modelled in a very simple way. For shock and NGL trials, we simply assume that there is a fixed probability $\kappa_\chi$ of applying self-control, with this probability being fit to summary measures of the data as described below ('Model fitting').
As described in the main text, the importance of this internal choice is its effect on the cue-elicited temporal difference (TD) error. For $C = 0$, we assume that $s_0$ is in essence an extension of the cue state, and so there is really no change to the TD error elicited by cue onset, i.e., By contrast, for $C = 1$, we assume that this is transformed to where $b_\chi$ is the trial-specific baseline/control signal that implements the putative reframing. As argued in the main text, we assume that $b_\chi$ is precisely the disutility of the shock that will be experienced on a shock trial if the animal fails to press, or is the utility of the large reward that stands to be lost on a NGL trial if the animal fails to maintain fixation.
In describing the models in greater detail below, we make use of the following common notation:
• $R_\chi(s)$: immediate expected utility in current state $s$ and trial type $\chi$.
• $T_\chi(s)$ / $T_\chi(s, a, \tau)$: expected time until the next state from current state $s$ and trial type $\chi$, potentially also conditioned on action $(a, \tau)$.

Mixed-valence task
In this case [14], the trial types are $\chi \in \{\mathrm{rew}, \mathrm{neu}, \mathrm{shk}\}$, and we assume that the animal's external choice is between press and other, where other can be thought of as some alternative activity that the animal may choose to engage in (e.g., grooming) and which may itself be rewarding, but will mean that the animal fails to press on the current trial.
In the experiment of [14], there was a 5 s interval between cue onset and the insertion of the response lever; in the model, for simplicity, we assume that the external choice is made at the time of the cue, and that implementation of that choice only begins at lever insertion. Following our previous work [20], we assume that successfully pressing on a reward trial delivers positive utility $r_{\mathrm{rew}} = 4$, while failing to press on a shock trial leads to delivery of a punishing shock with disutility $r_{\mathrm{shk}} = -10$. Briefly (noting that we do not seek to capture the DA concentration quantitatively), the unpleasantness of the shock is assumed to be greater in magnitude than the pleasantness of the reward based on the (dopaminergic) evidence that neutral trials had relatively positive value for poor avoiders, in spite of an average rate of successful avoidance of around 50%; roughly speaking, for this to hold under the model, the magnitude of the disutility of shock would need to be more than twice the utility of the reward, so that the average reward rate is negative and (therefore) the differential value associated with neutral trials is positive (see [20] for more detail).
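The "more than twice" threshold can be verified with a rough calculation. The sketch below makes several simplifying assumptions that are not stated in the text: the three trial types are equally frequent, reward trials are always completed, and effort costs and timing are ignored, so only the sign of the mean outcome per trial is examined.

```python
def mean_outcome_per_trial(r_rew, r_shk, p_avoid, p_press_reward=1.0):
    """Average utility per trial over reward, neutral and shock trials.

    Assumptions: equiprobable trial types, neutral trials yield nothing,
    and no effort or time costs are included.
    """
    reward_part = p_press_reward * r_rew
    neutral_part = 0.0
    shock_part = (1.0 - p_avoid) * r_shk
    return (reward_part + neutral_part + shock_part) / 3.0

# Utilities used in the model, with ~50% successful avoidance.
with_model_values = mean_outcome_per_trial(r_rew=4.0, r_shk=-10.0, p_avoid=0.5)

# Shock exactly twice as large as the reward: the mean outcome is zero.
at_threshold = mean_outcome_per_trial(r_rew=4.0, r_shk=-8.0, p_avoid=0.5)
```

With 50% avoidance the shock contributes half its disutility on average, so the mean outcome is negative only when the shock magnitude exceeds twice the reward; the model values (4 and -10) satisfy this.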
If the animal chooses press, we assume it also chooses a latency $\tau_{\min} \leq \tau \leq \tau_{\max}$ for press completion, with $\tau_{\min} = 0.5$ s (for rough consistency with the data, although animals can certainly act more swiftly) and $\tau_{\max} = 20$ s. The latter may seem implausibly long, but it also means we do not exclude the possibility that the animal chose to press but did not manage to complete it (e.g., due to freezing) before the $\tau_D = 10$ s deadline. The (differential) value of pressing with latency $\tau$ is then where $c_p \geq 0$ is the assumed (hyperbolic) vigour cost associated with choosing to press at latency $\tau$ (cf. [60]); $\rho$ is the average reward rate; and $s \in \{s_0, s_1\}$ (noting that the instrumental values are the same for these states).
For other, we simply assume, again for $s \in \{s_0, s_1\}$, that for some fixed utility rate $r_O \geq 0$. We assume that both instrumental and Pavlovian factors can affect speed of pressing. For presentational purposes in the main text, we refer to a single parameter $\omega$ that mediates the Pavlovian influence. However, in the model, we introduce a set of weights $\{w\}$, all $w \geq 0$, that allow instrumental and Pavlovian factors to be balanced, but permitting Pavlovian influences on decisions about action vs. latency, and for positive vs. negative TD errors to differ (cf. [20]). This does not alter the central argument, and is a simple stand-in for the complexities of how these influences are mediated. In particular, we assume the distribution of pressing latencies to follow where $w^{\tau}_{i}$ is an instrumental weight, and $(w^{\tau+}_{v}, w^{\tau-}_{v})$ are Pavlovian weights that modulate the effect of positive and negative TD errors, respectively. This means that a positive TD error $[\delta_\chi(s)]^+$ will tend to speed up responding (since the value of a shorter $\tau$ will be penalized less by the term $-w^{\tau+}_{v}[\delta_\chi(s)]^+ \tau$, which is negative), while a negative TD error $[\delta_\chi(s)]^-$ will tend to slow responding down (since the value of a longer $\tau$ will be boosted more by the term $-w^{\tau-}_{v}[\delta_\chi(s)]^- \tau$, which is positive).
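The direction of these Pavlovian effects on latency can be sketched numerically. The code below assumes a softmax over a discrete grid of candidate latencies, with the instrumental value weighted by one parameter and the two Pavlovian terms entering linearly in τ, as described above. The softmax form, the grid, the flat instrumental values and the unit weights are all illustrative assumptions; the function and argument names are hypothetical.

```python
import math

def latency_distribution(taus, q_press, delta, w_i, w_v_pos, w_v_neg):
    """Softmax over candidate latencies with Pavlovian terms linear in tau.

    Assumed score: w_i * Q(tau) - w_v_pos * [delta]^+ * tau - w_v_neg * [delta]^- * tau
    """
    d_pos = max(delta, 0.0)   # [delta]^+ (non-negative part)
    d_neg = min(delta, 0.0)   # [delta]^- (non-positive part)
    scores = [w_i * q - w_v_pos * d_pos * t - w_v_neg * d_neg * t
              for t, q in zip(taus, q_press)]
    top = max(scores)                             # subtract max for stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def mean_latency(taus, probs):
    """Expected latency under a discrete distribution."""
    return sum(t * p for t, p in zip(taus, probs))

taus = [0.5 + 0.5 * k for k in range(40)]   # 0.5 s to 20 s in 0.5 s steps
flat_q = [0.0] * len(taus)                  # hypothetical: no instrumental preference

mean_zero = mean_latency(taus, latency_distribution(taus, flat_q, 0.0, 1.0, 1.0, 1.0))
mean_pos = mean_latency(taus, latency_distribution(taus, flat_q, 1.0, 1.0, 1.0, 1.0))
mean_neg = mean_latency(taus, latency_distribution(taus, flat_q, -1.0, 1.0, 1.0, 1.0))
```

With instrumental values held flat, a positive TD error shifts probability mass towards short latencies and a negative TD error towards long ones, matching the verbal description.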
The overall instrumental value of pressing is then assumed to be where, again, $w_i$ is an instrumental weight and $(w_{p+}, w_{p-})$ are Pavlovian weights for positive and negative TD errors; and $\Delta Q_\chi(s) \equiv Q_\chi(s, \mathrm{press}) - Q_\chi(s, \mathrm{other})$. Thus, a positive TD error will tend to increase task engagement by boosting the probability of pressing, while a negative TD error will tend to decrease task engagement by instead boosting the probability of choosing other. Note that there are two possible reasons for not pressing in the model. One is choosing press but failing to complete the press in time, for example because of a tendency to freeze (which we model implicitly via Eq 14), while the other is by choosing other. The latter choice could be purely instrumental (i.e., the reward and effort associated with the lever press, including savings on opportunity cost of time, is not sufficiently better than the alternative) or also involve Pavlovian factors (e.g., it makes instrumental sense to press the lever, but a negative TD error promotes disengagement via a form of disappointment or frustration, via Eq 16).
The values of success and failure states are respectively where $R_{\mathrm{rew}}(s_{\mathrm{succ}}) = r_{\mathrm{rew}} = 4$ for a successful reward trial, and zero otherwise; $R_{\mathrm{shk}}(s_{\mathrm{fail}}) = r_{\mathrm{shk}} = -10$ for a failed shock trial, and zero otherwise; and $T_\chi(s_{\mathrm{succ}}) = T_\chi(s_{\mathrm{fail}}) = 20$ s, $\forall \chi$, is the ITI.

Go/No-Go task
In this case [15], the task demands a slightly different choice structure. As in the mixed-valence case, there is an initial internal choice about self-control. However, we then assume that the next immediate choice facing the animal is when to leave the nose-poke. That is, the animal simply chooses a time $\tau$ to leave the nose-poke. The trial types are $\chi \in \{\mathrm{gs}, \mathrm{gl}, \mathrm{ngs}, \mathrm{ngl}\}$; the value of leaving at time $\tau$ for both $s \in \{s_0, s_1\}$ is assumed to be where $c_f \geq 0$ is a cost rate assumed to be associated with maintaining fixation in the nose-poke (e.g., reflecting a decreased ability to monitor for danger); $c_l \geq 0$ is a cost rate associated with the vigour of leaving (i.e., shorter latencies are assumed more energetically demanding); $V^{\mathrm{out}}_\chi(\tau)$ is the value on Go trials of having exited the nose-poke at time $\tau$; and $P(\mathrm{succ} \mid \tau) = U(1.7\ \mathrm{s}, 1.9\ \mathrm{s})$ is the probability of success on No-Go trials of exiting the nose-poke at time $\tau$ (i.e., the probability that the tone has turned off before exiting; see [15]). We should note here that even if the animal chooses to leave at a particular time $\tau$ and the tone turns off before this time, we assume the animal will still in fact exit at time $\tau$ and incur the full costs $c_f \tau + c_l/\tau - \rho\tau$. In other words, we assume that the unfolding of the animal's leaving is non-interruptible. If we assumed that the animal could be interrupted by the tone's turning off and pay only a fractional cost of what it actually chose (cf. [19]), then it could make sense here to simply choose the slowest possible leaving time (which would have the lowest expected cost and, if implemented, would never result in leaving too early). Under the current formulation, the animal needs to balance, on the one hand, the vigour cost and possibility of failure if it leaves quickly and, on the other, the fixation cost and opportunity cost of time if it leaves slowly.
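One reading of the success probability on No-Go trials is that the tone offset time is uniformly distributed between 1.7 s and 1.9 s, so that the probability of the tone having ended by exit time τ is the corresponding cumulative distribution. The sketch below implements that reading; treating the stated uniform as a distribution over offset times, and the function name itself, are assumptions.

```python
def p_success_nogo(tau, t_min=1.7, t_max=1.9):
    """P(tone already off at exit time tau), tone offset ~ Uniform(t_min, t_max)."""
    if tau <= t_min:
        return 0.0   # leaving before any possible offset always fails
    if tau >= t_max:
        return 1.0   # leaving after the latest offset always succeeds
    return (tau - t_min) / (t_max - t_min)

# Example exit times (s): too early, mid-window, and safely late.
examples = [p_success_nogo(t) for t in (1.0, 1.8, 2.5)]
```

Under this reading, leaving later than 1.9 s gains no further success probability while continuing to accrue fixation and opportunity costs, which is the trade-off described above.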
Just as with pressing latencies in the mixed-valence case (cf. Eq 14), we assume that the distribution of leaving times is influenced by both instrumental and Pavlovian factors, so that a positive TD error promotes leaving earlier, while a negative TD error promotes leaving later.
While the choice of leaving latency determines success or failure on No-Go trials (success is signalled by reward delivery, and failure by the turning on of the houselight; see [15]), we assume that on Go trials, an additional choice is required. That is, on exiting the nose-poke, the animal additionally chooses, as in the mixed-valence task, whether to subsequently press the lever or perform some other activity. Again, we can consider the value of pressing at different latencies, though this now depends on the time at which the animal exited the nose-poke; letting $t$ denote the time that has elapsed since cue onset, the value of choosing to press at latency $\tau$ at that time (for $\chi \in \{\mathrm{gs}, \mathrm{gl}\}$) is assumed to be where, again, $c_p \geq 0$ is the (vigour) cost rate associated with pressing. The instrumental value of choosing other is assumed to be Note that we therefore assume for simplicity that the choice between press and other on exiting the nose-poke is purely instrumentally-governed. The value of having left the nose-poke at time $t$ on a Go trial is then
$$V^{\mathrm{out}}_\chi(t) = P_t(\mathrm{press})\, Q_\chi(t, \mathrm{press}) + [1 - P_t(\mathrm{press})]\, Q_\chi(t, \mathrm{other}). \tag{26}$$
The values of success states are
$$V_\chi(s_{\mathrm{succ}}) = \begin{cases} r_S - \rho \tau_I, & \text{for small-reward trials;} \\ r_L - \rho \tau_I, & \text{for large-reward trials,} \end{cases}$$
where $\tau_I = 5$ s is the ITI. For fail states, we have for all trial types where $\tau_P = 5$ s is the penalty timeout.
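The two expressions that survive intact here, the post-exit value on Go trials and the value of success states, are simple enough to state directly in code. In the sketch below the function names and all numerical inputs are hypothetical; only the 5 s ITI is taken from the text.

```python
def value_after_exit(p_press, q_press, q_other):
    """Post-exit value on Go trials: expectation over the press/other choice."""
    return p_press * q_press + (1.0 - p_press) * q_other

def value_success(reward, rho, tau_iti=5.0):
    """Success-state value: reward minus the opportunity cost of the ITI."""
    return reward - rho * tau_iti

# Hypothetical inputs, for illustration only.
v_out = value_after_exit(p_press=0.8, q_press=2.0, q_other=0.5)
v_succ_large = value_success(reward=4.0, rho=0.2)
v_succ_small = value_success(reward=1.0, rho=0.2)
```

The ITI enters only through the opportunity-cost term, so a higher average reward rate lowers the value of every success state by the same amount.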

Dopamine
As described in the main text, we assume that the Pavlovian influence operates via the TD prediction error $\delta_\chi(s)$. While a rather large body of evidence supports the idea that this quantity is signalled by the activity of midbrain DA neurons (see citations in main text), there is also evidence that DA activity may not be its sole representational substrate. Indeed, one interesting feature of the results in [14] is the apparent absence of any immediate dip in DA in response to the shock cue (perhaps before control is deployed) on successful avoidance trials.
We emphasize that we seek to model only DA transients associated with the task cues that differentiate between trial types, and not the later release which is contemporaneous with movements and/or delivery of outcomes. We have therefore focused on particular epochs proximal to cue onset in each experiment: the first 5 s following cue onset, and prior to lever insertion, in the mixed-valence task; and the first 3 s following cue onset in the Go/No-Go task. We shaded off the unmodelled times in Figs 1 and 2. Accounting for the full trajectory of DA release over a trial would require a more complex model, possibly encompassing factors that we have considered elsewhere [61].
We follow [33] in assuming that DA indeed only signals part of the full TD error, and principally signals transitions that are better than expected. In particular, similar (but not identical) to [33], we assume that the dopaminergic component $\delta^{\mathrm{DA}}_\chi$ is given by […] with α = 0.8 throughout, for simplicity. In [33] and later work [18], it was explicitly suggested that punishers and their predictors would engage non-dopaminergic substrates, notably serotonin. However, current views of the landscape of interactions between dopamine and serotonin point to many complexities (see [62]), and this can at most be a part of the overall picture.

We additionally need to consider how the quantity $\delta^{\mathrm{DA}}_\chi$, putatively represented by the firing of DA neurons, is reflected in changes in DA release measured in the accumbens (NAc). Here, we simply assume that this term is convolved with an alpha function (cf. [61]), with time constant ξ = 0.7 s, so that changes in DA concentration relative to baseline are given by […]

To allow for the possibility that it may take some non-trivial amount of time for control/reframing to be applied (as possibly hinted by the results of Syed et al. [15] on successful NGL trials), for trials on which this is the case (i.e., $C = 1$), we assume that we initially have $\delta^{\mathrm{DA}}_\chi(s_0)$ at cue onset, but that this is followed by $\delta^{\mathrm{DA}}_\chi(s_1)$ after a short, subsecond delay (we arbitrarily set this to $\tau_{\mathrm{delay}} = 250$ ms).
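The convolution step can be illustrated with a minimal sketch. It assumes a peak-normalised alpha function, $(t/\xi)\,e^{1 - t/\xi}$, and treats each dopaminergic prediction error as an impulse; both choices are assumptions made for illustration, as are the function names and the example amplitudes, and none of them is taken from the fitted model.

```python
import numpy as np

def alpha_kernel(t, xi=0.7):
    """Alpha function (t/xi) * exp(1 - t/xi); zero for t <= 0, peak of 1 at t = xi.
    The peak normalisation is an assumption made for this sketch."""
    t = np.clip(np.asarray(t, dtype=float), 0.0, None)
    return (t / xi) * np.exp(1.0 - t / xi)

def da_trace(events, dt=0.01, t_max=5.0, xi=0.7):
    """Sum of alpha kernels, one per (onset time in s, amplitude) impulse."""
    t = np.arange(0.0, t_max, dt)
    trace = np.zeros_like(t)
    for onset, amplitude in events:
        trace += amplitude * alpha_kernel(t - onset, xi)
    return t, trace

# Hypothetical amplitudes standing in for the dopaminergic errors at s_0 and s_1.
delta_s0, delta_s1 = 1.0, -0.5
tau_delay = 0.25  # s, the delay before control/reframing takes effect

# Trial without control: a single impulse at cue onset.
t, no_control = da_trace([(0.0, delta_s0)])
# Trial with control: the initial impulse, then a second one after tau_delay.
t, with_control = da_trace([(0.0, delta_s0), (tau_delay, delta_s1)])
```

In this sketch the two traces are identical until the delay has elapsed, after which the second impulse pulls the controlled trace away from the uncontrolled one.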
Note that to assess model-derived DA responses on success vs. failed trials separately, we compute the posterior probability of having employed control given success (since success and failure could occur whether or not control was employed): […] and the probability of having employed control given failure, […] Thus, the average DA signal on successful trials is a mixture of those on which control was employed (with inferred probability $P_\chi(C = 1 \mid \mathrm{success})$) and those on which it was not (with probability $P_\chi(C = 0 \mid \mathrm{success}) = 1 - P_\chi(C = 1 \mid \mathrm{success})$). Analogous reasoning allows for derivation of the average DA signal on failed trials.
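The posterior and mixture computation can be sketched as a direct application of Bayes' rule. The sketch assumes the prior probability of control is κ (the probability of deploying control/reframing) and that the success probabilities with and without control are available from the model. The function names and example numbers are hypothetical.

```python
def posterior_control(kappa, p_succ_control, p_succ_no_control):
    """Return P(C=1 | success) and P(C=1 | failure) by Bayes' rule.
    Assumes the prior P(C=1) equals kappa and that both outcomes have
    non-zero probability."""
    p_succ = kappa * p_succ_control + (1.0 - kappa) * p_succ_no_control
    p_fail = 1.0 - p_succ
    post_given_success = kappa * p_succ_control / p_succ
    post_given_failure = kappa * (1.0 - p_succ_control) / p_fail
    return post_given_success, post_given_failure

def mixture_da(posterior, da_control, da_no_control):
    """Average DA signal as a posterior-weighted mixture of the two trial kinds."""
    return posterior * da_control + (1.0 - posterior) * da_no_control

# Hypothetical values: control deployed on half of trials, and far more
# likely to yield success when deployed.
post_succ, post_fail = posterior_control(kappa=0.5,
                                         p_succ_control=0.9,
                                         p_succ_no_control=0.3)
avg_da_success = mixture_da(post_succ, da_control=-0.2, da_no_control=1.0)
```

With these illustrative numbers, successful trials are dominated by trials on which control was employed, and failed trials by those on which it was not.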

Model-fitting
For a given set of parameters x, the self-consistent set of differential state values and associated behaviour can be found using value iteration [63]. For each task, we fitted parameters (see Tables 1 and 2) to minimize the difference between animal and model behaviour. In particular, we defined an error function […] where $e^{\mathrm{succ}}_\chi$, $e^{\mathrm{rtc}}_\chi$, and $e^{\mathrm{rte}}_\chi$ are the differences between the average performance of the animals and the model (for expected success rates, reaction times for correct trials, and reaction times for error trials, respectively); $\sigma^{\mathrm{rtc}}_\chi$ and $\sigma^{\mathrm{rte}}_\chi$ are the standard errors from the experimental data (so that the larger the standard error, the less importance we place on precise fitting of that measurement); $w_{\mathrm{rtc}} \geq 0$ and $w_{\mathrm{rte}} \geq 0$ are weights that determine the relative importance we give to fitting the reaction times compared to the success rates (in practice, we set these to the same value: for the mixed-valence task, we fixed $w_{\mathrm{rtc}} = w_{\mathrm{rte}} = 0.25$; for Go/No-Go, we fixed $w_{\mathrm{rtc}} = w_{\mathrm{rte}} = 0.018$); and $w_{\mathrm{reg}}$ is an (L1) regularization weight that determines how strongly we wish to discourage large values in the parameters x (we set $w_{\mathrm{reg}} = 0.01$ when fitting good avoidance in the mixed-valence task, since the instrumental weights can grow arbitrarily large in this case, and otherwise set it to zero). The term $e_{\mathrm{DA}}$ is an error term that we used only when fitting data from the mixed-valence task, and is specifically the absolute difference between peak DA for food and shock trials produced by the model; this was to better reproduce the striking similarity between food- and shock-trial DA transients observed in the data (we set $w_{\mathrm{DA}} = 0.1$ for both good and poor avoiders).

This latter point deserves amplification. For the good avoiders, behaviour is consistent with a purely instrumental policy, with Pavlovian weights ≈ 0 (see Table 1). This implies that behaviour in the model does not vary with $\delta_\chi$ (or therefore the value of κ): the model reliably presses on shock trials in any case. κ is only identifiable when additionally considering how the model's implied DA pattern compares with the data, hence the fitting weight $w_{\mathrm{DA}}$ above. Why though would control be at all necessary if behaviour were purely instrumental, since there would then apparently be no danger of Pavlovian misbehaviour? As mentioned in the main text, one possibility is that an opponent signal might also influence behaviour, which controlled dopamine release would be required to overcome. Indeed, note that good-avoiders still freeze in response to shock cues (cf. Figure 4i of [14]). We leave these subtleties to future work.
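The way the fitting terms described in this section might combine can be sketched as follows. The exact functional form of the error function is not reproduced in this text, so the sketch assumes squared errors summed over trial types, reaction-time errors scaled by their standard errors, an L1 penalty on the parameters, and an additive DA term. The function name `fit_error` and all example values are hypothetical.

```python
import numpy as np

def fit_error(e_succ, e_rtc, e_rte, s_rtc, s_rte, x,
              w_rtc=0.25, w_rte=0.25, w_reg=0.0, w_da=0.0, e_da=0.0):
    """Weighted fitting error (assumed form, not the published equation).

    e_succ, e_rtc, e_rte: per-trial-type differences between animals and model.
    s_rtc, s_rte: per-trial-type standard errors of the reaction-time data.
    x: parameter vector, penalised with an L1 norm scaled by w_reg.
    e_da: absolute difference between model peak DA on food and shock trials.
    """
    e_succ, e_rtc, e_rte = (np.asarray(v, dtype=float) for v in (e_succ, e_rtc, e_rte))
    s_rtc, s_rte = np.asarray(s_rtc, dtype=float), np.asarray(s_rte, dtype=float)
    behaviour = (np.sum(e_succ ** 2)
                 + w_rtc * np.sum((e_rtc / s_rtc) ** 2)   # larger SEM, less influence
                 + w_rte * np.sum((e_rte / s_rte) ** 2))
    penalty = w_reg * np.sum(np.abs(x))                    # L1 regularization
    return behaviour + penalty + w_da * e_da

# Hypothetical single-trial-type example using the mixed-valence weights.
err = fit_error(e_succ=[0.1], e_rtc=[0.2], e_rte=[0.4],
                s_rtc=[0.1], s_rte=[0.2], x=[1.0, -2.0],
                w_rtc=0.25, w_rte=0.25, w_reg=0.01, w_da=0.1, e_da=0.5)
```

A function of this kind could be passed to a generic optimizer, with each evaluation requiring a fresh value-iteration solve for the candidate parameters.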

Data analysis
The data from [15] were downloaded from https://data.mrc.ox.ac.uk. Following [15], data were smoothed using a 0.5 s moving window and baselined by subtracting the average signal during the 0.5 s period before cue onset. As a basic test of the hypothesis that the cue-evoked DA response would be greater on failed No-Go large-reward (NGL) trials than on successful NGL trials, we integrated the DA signal over the 1 s period immediately following cue onset,

Fig 1. Mixed-valence task of Gentry et al. [14]. (a) On each trial, a tone indicated whether a lever press within 10 s would yield reward, have no consequence, or avoid a scheduled shock. (b,c) Good-avoiders (upper) pressed often and quickly; poor-avoiders (lower) pressed less often and more slowly on shock and neutral trials (* indicates significance, p < .05). Filled circles indicate model fit. (d) Average cue-aligned (nanomolar) NAc DA release (±SEM) on press trials for each trial type (vertical dashed lines indicate cue onset at 0 s and lever insertion at 5 s). Our focus is on DA release arising in response to the tone; the shaded region covers lever insertion and subsequent events. (e) Cue-evoked TD errors predicted by model (arbitrary units). (f) Average DA release predicted by model on press trials (arbitrarily scaled). (g) Putative shifting of origin leftwards in the affective circumplex to promote DA release and approach to safety in the active avoidance case (adapted from [18]). (h) Predicted DA release for press vs. no-press shock trials. (Figures b-d adapted from [14].) https://doi.org/10.1371/journal.pcbi.1011569.g001

[…] Fig 1h), something which is not evident in Fig 1d.

Fig 2. Go/No-Go task of Syed et al. [15]. (a) On each trial, a tone indicated whether the animal should leave the nose-poke and press a lever ('Go') to gain a small (GS) or large (GL) reward; or remain in the nose-poke until the tone turns off ('No-Go') to gain a small (NGS) or large (NGL) reward. On No-Go trials, the duration of the tone was randomly jittered between 1.7 and 1.9 s on each trial. (b,c) Average success rates and RTs (±SEM) for each trial type; the latter are split further into successful (s) and failed (f) trials (* indicates significance, p < .05). Filled circles indicate model fit. (d) Average cue-aligned change in DA (±SEM) for each trial type on successful trials; triangles indicate mean RTs. Again, our focus is on DA release associated with the cue, and in this case we simply focus on the first 3 s following cue onset (i.e., before the shaded region). (e) Putative shifting of origin rightwards in the affective circumplex to suppress action in light of predicted reward (adapted from [18]). (f) Cue-evoked TD errors predicted by model. (g) Average DA predicted by model on success trials. Inset: average DA predicted by model on successful vs. failed NGL trials. (Figures b-d adapted from [15].)
https://doi.org/10.1371/journal.pcbi.1011569.g002

[…] again harmonizing Pavlovian and instrumental control, this time by facilitating inaction. This amounts to moving the origin of the affective circumplex to the positive value associated with presumed NGL success (Fig 2e), switching the sign of the TD error (Fig 2f) and leading to suppression of DA release. The brief initial increase on successful NGL trials (Fig 2g) might arise before control is exerted, something we would need a finer timescale model to examine. Here, control failure, associated with failed NGL trials, should lead to enhanced DA release (Fig 2g, inset). Apparent trends in this direction were not, however, found to be significant (S1 Fig), though the relatively small sample size and large variability in the voltammetry signal may obscure such differences.

Table 1. Model parameters used for mixed-valence task and best-fitting free parameter values for good- and poor-avoiders.
$\{r_{\mathrm{rew}}, r_{\mathrm{neu}}, r_{\mathrm{shk}}\}$: utilities of outcomes on press trials for each trial type. $r_O$: utility rate associated with action other. $c_p$: vigour cost of pressing. $b_{\mathrm{shk}}$: baseline/control signal optionally applied on shock trials. α: mixture weight determining relationship between full TD error and its dopaminergic component. $\tau_{\mathrm{delay}}$: putative delay associated with application of control/reframing. $\{w_i, w^{\tau}_i\}$: instrumental weights. $\{w_{p+}, w_{p-}, w^{\tau+}_v, w^{\tau-}_v\}$: Pavlovian weights. κ: probability of deploying control/reframing. https://doi.org/10.1371/journal.pcbi.1011569.t001

Table 2. Model parameters used for Go/No-Go task and best-fitting free parameter values.
$\{r_S, r_L\}$: utilities of small and large rewards. $b_{\mathrm{ngl}}$: baseline/control signal optionally applied on NGL trials. α: mixture weight determining relationship between full TD error and its dopaminergic component. $\tau_{\mathrm{delay}}$: putative delay associated with application of control/reframing. $w^{\tau}_i$: instrumental weight. $\{w^{\tau+}_v, w^{\tau-}_v\}$: Pavlovian weights. $\{\beta_\tau, \beta\}$: inverse temperatures of softmax functions. κ: probability of deploying control/reframing. $r_O$: utility rate associated with action other. $\{c_f, c_l, c_p\}$: costs respectively associated with maintaining fixation, leaving, and pressing. https://doi.org/10.1371/journal.pcbi.1011569.t002