Learning the payoffs and costs of actions

A set of sub-cortical nuclei called basal ganglia is critical for learning the values of actions. The basal ganglia include two pathways, which have been associated with approach and avoid behavior respectively and are differentially modulated by dopamine projections from the midbrain. Inspired by the influential opponent actor learning model, we demonstrate that, under certain circumstances, these pathways may represent learned estimates of the positive and negative consequences (payoffs and costs) of individual actions. In the model, the level of dopamine activity encodes the motivational state and controls to what extent payoffs and costs enter the overall evaluation of actions. We show that a set of previously proposed plasticity rules is suitable to extract payoffs and costs from a prediction error signal if they occur at different moments in time. For those plasticity rules, successful learning requires differential effects of positive and negative outcome prediction errors on the two pathways and a weak decay of synaptic weights over trials. We also confirm through simulations that the model reproduces drug-induced changes of willingness to work, as observed in classical experiments with the D2-antagonist haloperidol.

Introduction
What guides rational behavior in a complex environment? Certainly, knowledge of the typical payoffs and costs of acting a certain way is critical for successful action selection. Those payoffs and costs depend not only on the action that is carried out, but also on the environmental state, henceforth referred to as the 'situation'. If payoffs and costs are represented separately in the animal's brain, they can be weighted depending on the animal's motivational (i.e. internal) state, which can vary independently of the environmental situation. For example, consider the action 'harvesting fruit from a tree' in the situation 'close to a fruit-bearing tree'. It has a payoff connected with the nutrients in the fruit, but also costs related to the effort, the risk of pain and the exposure to predators associated with climbing a tree. The nutrients in the fruit are only valuable for the animal if it is hungry, i.e. if it is in the appropriate internal state. So, when it is hungry, the payoffs of climbing a tree which was identified as fruit-bearing should be weighted more than the costs, to ensure that the animal searches for food. By contrast, when the animal is not hungry at all, the payoffs should be weighted less than the costs, to make sure that it does not climb the tree without necessity. In summary, the payoffs and costs of a specific action ('climbing a nearby tree') carried out in a certain environmental situation ('near a fruit-bearing tree') should be weighed against each other according to the motivational state ('hunger') to correctly assess the action's utility.
In all vertebrates, an important role in this process of action evaluation and selection is played by a set of subcortical structures called the basal ganglia [1]. The basal ganglia are organized into two main pathways shown schematically in green and red in Fig 1. The Go or direct pathway is related to the initiation of movements, while activation of the No-Go or indirect pathway results in targeted movement inhibition [2]. These two pathways include two separate populations of striatal neurons expressing different dopaminergic receptors [3]. The striatal Go neurons express D1 receptors and are excited by dopamine, while the striatal No-Go neurons express D2 receptors and are inhibited by dopamine [4]. Thus dopamine changes the balance between the two pathways and promotes action initiation over inhibition.
The competition between Go and No-Go pathways during action selection and the role of dopaminergic modulation are the subject of many interpretations and models, e.g. [5][6][7]. In particular, the Opponent Actor Learning (OpAL) hypothesis suggests that the Go and No-Go neurons specialise in encoding the values of actions with positive or negative consequences respectively [8]. We extend the OpAL hypothesis further by proposing that for each individual action, the direct and indirect pathway separately encode the learned positive and negative consequences. As the dopaminergic neurons modulate the Go and No-Go neurons in opposite ways, dopamine controls the extent to which positive and negative consequences affect the activity in the thalamus, through the output of the basal ganglia [8]. For example, when motivation is high, the dopaminergic neurons will excite the Go neurons and inhibit the No-Go neurons. Consequently, positive action values will influence the decision more than negative action values. By contrast, when motivation is low, the Go neurons tend to be excited to a smaller degree, but the No-Go neurons will be released from inhibition, such that negative values are weighted more strongly.
Much research has also focused on how the synapses of Go and No-Go neurons are modified by experience. A systematic investigation revealed that bursts of activity of dopaminergic neurons encode outcome prediction errors, which measure the difference between outcome (typically rewards) obtained and expected [9,10]. Note that we use the phrases 'outcome prediction error' and 'reinforcement' instead of the more common 'reward prediction error' and 'reward' respectively. This use of language emphasizes that in our theory, the feedback signal is informative of both positive and negative events and that not only rewards but any outcome will be compared with predictions. That perspective is well supported by experimental results; see Discussion for a review of evidence for negative prediction errors (e.g. pauses in dopaminergic firing) caused by negative experiences.
Such bursts of dopaminergic activity produce distinct changes in the synaptic weights of Go and No-Go neurons [11]. Several computational models have attempted to describe the learning process of the synapses of Go and No-Go neurons [12][13][14][15]. Among these models, the OpAL model provided simple and analytically tractable rules describing the changes in weights of Go and No-Go neurons as a function of outcome prediction errors [8]. However, no one has so far examined how the basal ganglia might estimate payoff and cost if both are associated with the same action.
The goal of this paper is to show how the Go and No-Go neurons can learn the payoffs and costs of individual actions through local synaptic plasticity rules. We argue that the payoffs and costs of individual actions are not necessarily correlated (for instance, two actions might have comparable benefits, but very different costs), and strive to construct a model that is able to represent those independent dimensions of reinforcement for every single action. Ultimately, we confront the resulting model with experimental results.
Instead of constructing a new set of learning rules from scratch, we will employ the theory of striatal learning described in [16], which has been shown to account for diverse observations. That theory was originally developed to explain how the mean and the spread of the reinforcement signal could be learned by the basal ganglia network. In this article, we will prove that if the weights of Go and No-Go neurons change according to these rules, they can eventually represent payoff and cost. In summary, we show that a set of learning rules, originally constructed to estimate statistical properties of the reinforcement signal, can be reinterpreted as rules to estimate payoffs and costs. We thus extend both the interpretation of the striatal pathways of Collins and Frank and the striatal learning rules of Mikhael and Bogacz to ultimately obtain a consistent theory of learning the payoffs and costs of actions.
According to the experimental and modeling work mentioned above, dopaminergic activity encodes both information about motivational state and the outcome prediction error. However, if the dopaminergic neurons carried both signals, the striatal neurons would need a way to decode each signal and react appropriately, i.e. change their activity according to the motivation signal, and change the synaptic weights according to the prediction error. The prominent suggestion that motivation might be encoded in the average or tonic dopamine level, and outcome prediction errors in the burst or phasic activity [17], is hotly debated; it seems to be contradicted by the observation of fast-changing dopaminergic activity that encodes motivation [18][19][20]. Note, though, that these apparently divergent views could potentially be reconciled; see e.g. [21]. In any case, the motivation and teaching signals could both be provided by other means. For example, the activity of striatal cholinergic neurons may indicate what the dopaminergic neurons encode at a given moment [20]. In this paper, we assume that striatal neurons can read out both motivation and teaching signals encoded by dopaminergic neurons, and we leave the details of the mechanisms by which they can be distinguished to future work.

Results
Inspired by the OpAL model [8], we assume that synaptic weights within the Go pathway encode positive consequences of actions, that is the positive reinforcement caused by food, drink or other appetitive stimuli obtained through actions. More precisely, we claim that the typical payoff of a particular action a in a particular situation s is encoded in the strength of the connections from the cortical neurons selective for the situation to the striatal Go neurons selective for the action. We denote these weights by G(s, a) (see Fig 1), and propose that after learning, the weights G represent the mean payoff for an action. Mathematically, the collective strength of the weights G corresponds to a single, non-negative number. The negative consequences, on the other hand, are encoded in the synaptic connections of striatal No-Go neurons. Negative consequences should be understood as the negative reinforcement induced by aversive stimuli such as pain, effort or disgust. We denote their weights by N(s, a), and propose that after learning, they represent the mean cost of an action. Just as with G, we mathematically represent the collective strength of the weights N by a single, non-negative number.
To learn the positive and negative consequences of actions respectively, the striatal neurons can take advantage of the fact that these consequences typically occur at different moments in time. Let us consider a situation in which an animal performs an action that involves an effort in order to obtain a reward: Fig 2a sketches a task in which a rat is given the opportunity to press a lever in order to obtain a food pellet. Due to the effort, the instantaneous reinforcement during the course of this action is negative at first, while pressing the lever. Then, it turns positive at the time the payoff is received. Fig 2b sketches the resulting changes in the synaptic weights. The leftmost display shows the initial weights. While making an effort to perform an action, the outcome prediction error is negative. As in previous models [8,12], we assume that the negative prediction error results in a strengthening of N (compare the red arrows in the middle and the left displays in Fig 2b). This allows the weights N to encode negative consequences. Later, reception of the payoff causes a positive prediction error, which strengthens G. This leads the weights G to encode the positive consequences. Here, we assumed that, at baseline dopamine level, positive prediction errors trigger more plasticity in the Go pathway than in the No-Go pathway, while negative prediction errors affect the No-Go pathway more than the Go pathway. In the Discussion, we will review data suggesting that the properties of D1 and D2 receptors allow this assumption. Generally, if an experience involves both positive and negative consequences, both weights are increased during the experience (compare the right and the left displays in Fig 2b).
To mathematically implement these ideas, we need to model the weights of the Go pathway G(s, a), the weights of the No-Go pathway N(s, a), and the prediction error. The outcome prediction error, which we denote by δ, quantifies the difference between the reinforcement r received after executing action a in situation s and the expected reinforcement. If r is negative, we shall speak of cost, and when r is positive, we shall speak of payoff or reward. The expected reinforcement, on the other hand, directly corresponds to the expected payoffs and costs, which, according to our theory, are represented by the synaptic weights G and N. We take the expected reinforcement to be the average over the expected payoff and the expected cost. Altogether, we model the outcome prediction error as
$$\delta = r - \tfrac{1}{2}\left(G(s,a) - N(s,a)\right). \qquad (1)$$
It should be clarified that this definition of the prediction error differs from the one in the original model [16], in that we introduced here a factor 1/2. This factor allows G and N to converge to the exact payoffs and costs, and not to values merely proportional to payoffs and costs, and hence increases the clarity of the exposition. However, since value cannot be measured directly, the overall scaling of values through this factor is not observable, but a mere convention.
Equipped with the quantities δ, G and N, we can formulate our theory of learning payoff and cost. To present the theory, we simply describe how the collective connection strengths G(s, a) and N(s, a) change when a prediction error δ is received after executing action a in situation s; we use ΔG(s, a) and ΔN(s, a) to denote the changes in the relevant connection strengths. Note that an update is only applied as computed if the resulting weight remains non-negative; if an update would render a weight negative, that weight is set to zero instead. In all other cases, we follow Mikhael and Bogacz [16] in prescribing
$$\Delta G(s,a) = \alpha f_\epsilon(\delta) - \lambda G(s,a), \qquad (2)$$
$$\Delta N(s,a) = \alpha f_\epsilon(-\delta) - \lambda N(s,a), \qquad (3)$$
where α is the learning rate, ε is the slope parameter and λ the decay rate. The slope parameter ε controls the strength of the nonlinearity exhibited by the function f_ε, which we introduce in Fig 3d and 3e. The nonlinearity of the function f_ε accounts for the fact that positive and negative prediction errors affect the weights differently. From here on, we drop the dependency of G and N on a and s to simplify notation. The dependency is still implicitly assumed unless stated otherwise. There is a normative intuition for each term in the rules Eqs 2 and 3. These intuitions are most easily gained by following through a couple of steps to reconstruct the rules from scratch.

Fig 2 (caption, partially recovered). In all diagrams, the black circles represent the cortical neurons selective for the state (being in the operant box), and the green and red circles represent the Go and No-Go populations of striatal neurons, respectively, selective for the action (pressing the lever). The thickness of the arrows linking the circles represents the connection strength between the respective neuron populations. The blue shading in the background indicates the strength of the immediate reinforcement, with a colour intensity proportional to the magnitude of reinforcement. https://doi.org/10.1371/journal.pcbi.1006285.g002
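As an illustration, a single application of the learning rules can be sketched in Python. This is a minimal sketch under stated assumptions, not the implementation used for the figures: the function names (`f_eps`, `prediction_error`, `update_weights`), the starting weights and the reinforcement magnitudes are hypothetical; the prediction error is assumed to be δ = r − (G − N)/2, f_ε is assumed to be the identity for positive arguments and linear with slope ε otherwise, and the Go rule is assumed to mirror Eq 3 with f_ε(δ) in place of f_ε(−δ).

```python
def f_eps(delta, eps):
    """Piecewise-linear transformation: identity for delta > 0, slope eps otherwise."""
    return delta if delta > 0 else eps * delta


def prediction_error(r, G, N):
    """Outcome prediction error: received reinforcement minus (G - N) / 2 (assumed form)."""
    return r - 0.5 * (G - N)


def update_weights(G, N, r, alpha=0.3, eps=0.443, lam=0.093):
    """Apply one update of the Go weight G and the No-Go weight N for reinforcement r."""
    delta = prediction_error(r, G, N)
    G_new = G + alpha * f_eps(delta, eps) - lam * G
    N_new = N + alpha * f_eps(-delta, eps) - lam * N
    # A weight that would turn negative is set to zero instead.
    return max(G_new, 0.0), max(N_new, 0.0)


# One lever press with illustrative magnitudes: a cost (r = -1), then a payoff (r = +2).
G, N = 0.5, 0.5                       # hypothetical starting weights
G, N = update_weights(G, N, r=-1.0)   # negative prediction error: N grows, G shrinks
G, N = update_weights(G, N, r=2.0)    # positive prediction error: G grows, N shrinks
```

The default parameter values are the ones quoted in the text for the simulations in Fig 3; any other positive values with ε < 1 could be substituted.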
We may start by observing that several models of learning in Go and No-Go neurons assume the effect of the prediction error on G to be opposite to its effect on N [7,8]. We thus propose that ΔG and ΔN might simply be proportional to the prediction error and its negative, respectively. To see whether this proposal works, we formulate it mathematically and simulate the learning of an alternating sequence of costs −n and payoffs p. Fig 3a shows both the mathematical formulation and the simulation. There is a problem: the strengthening of N due to negative prediction error, caused by the cost, is always immediately reversed by the following positive prediction error caused by the payoff. The same is true for the changes in G. As illustrated by the simulation, there is no net effect of learning.
To overcome this problem, we proceed by damping the impact of negative prediction errors (which are usually caused by costs) on G, and the impact of positive prediction errors on N, by introducing a nonlinear transformation of the prediction errors. This transformation is given in the form of a piecewise-linear function f_ε, defined and depicted in panels d and e of Fig 3. The transformation leaves positive prediction errors invariant (f_ε(δ) is just the identity for δ > 0) but reduces the impact of negative prediction errors by scaling them down (for δ < 0, f_ε(δ) is linear with slope ε < 1). Hence, f_ε introduces a pathway-specific imbalance between learning from positive prediction errors and learning from negative prediction errors (which, as we point out in the Discussion, is in accordance with the properties of dopaminergic receptors on these pathways). For the construction at hand, it is also logical, since costs should not alter the estimate G of the payoffs and vice versa. Such damping can be achieved by replacing the simple proportionality to δ in the first proposal by a nonlinear dependence, mediated by the functions depicted in Fig 3e. We update our mathematical formulation accordingly, and again simulate the effects of the previously used reinforcement sequence; both these steps are illustrated in Fig 3b. The simulation shows that, while producing the appropriate tendencies, these rules cause unconstrained, ongoing strengthening of both connections. Such dynamics are neither biologically plausible nor useful to infer the actual payoff and cost.
Finally, to stop unconstrained strengthening and stabilize the weights, we balance growth with decay. Adding decay terms to the mathematical formulation of the rules yields their final form Eqs 2 and 3. The simulation in Fig 3c suggests that the construction was successful: the final version of the rules allows the weights to converge to p and n respectively.
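The three construction steps can be compared side by side in a short simulation. The sketch below is an assumption-laden illustration, not a reproduction of Fig 3: the variant labels (`"linear"`, `"nonlinear"`, `"full"`), the magnitudes p = 2 and n = 1, the initial weights of 1 and the number of trials are all hypothetical choices, and the prediction error is assumed to be δ = r − (G − N)/2.

```python
def f_eps(delta, eps):
    """Identity for positive prediction errors, slope eps for negative ones."""
    return delta if delta > 0 else eps * delta


def simulate(variant, p=2.0, n=1.0, trials=200,
             alpha=0.3, eps=0.443, lam=0.093, w0=1.0):
    """Train on an alternating sequence of cost -n and payoff p.

    Returns the weights (G, N) after the last payoff.
    """
    G, N = w0, w0
    for _ in range(trials):
        for r in (-n, p):
            delta = r - 0.5 * (G - N)
            if variant == "linear":        # first proposal: no nonlinearity, no decay
                dG, dN = alpha * delta, -alpha * delta
            elif variant == "nonlinear":   # second proposal: nonlinearity, no decay
                dG, dN = alpha * f_eps(delta, eps), alpha * f_eps(-delta, eps)
            elif variant == "full":        # final rules: nonlinearity and decay
                dG = alpha * f_eps(delta, eps) - lam * G
                dN = alpha * f_eps(-delta, eps) - lam * N
            else:
                raise ValueError(f"unknown variant: {variant}")
            # Weights that would turn negative are set to zero.
            G, N = max(G + dG, 0.0), max(N + dN, 0.0)
    return G, N


results = {v: simulate(v) for v in ("linear", "nonlinear", "full")}
```

Under these settings, the linear variant leaves the sum G + N at its initial value, the nonlinear variant without decay lets both weights grow without bound, and only the full rules settle near p and n.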

Mathematical analysis
After providing an intuitive understanding of the learning rules and their mathematical formulation, we proceed to a more rigorous analytical treatment. We saw the potential of Mikhael and Bogacz' [16] rules to learn payoffs and costs. Appropriate choice of parameters is key to unlock that potential, and we shall now investigate how that choice must be made. In particular, we will derive certain relations between parameters that must be satisfied for payoff and cost to be learned.
Originally, the rules Eqs 2 and 3 were meant to describe learning of reinforcement statistics. Mikhael and Bogacz [16] showed that after learning, particular combinations of G and N will encode the mean ER and the mean spread E|R − ER| of the received reinforcements. For further reference, we denote these important statistics by q ≔ ER and s ≔ E|R − q|.
How are the mean and the mean spread of received reinforcements related to payoff and cost? Consider the reinforcement statistics of an action that reliably requires effort (corresponding to negative reinforcement) to produce a payoff (which corresponds to positive reinforcement). Assume that these reinforcements are clearly negative and positive respectively, such that one will not be confused for the other even in the presence of noise. Repeat that action multiple times, and record all received reinforcements, the costs as well as the payoffs. Finally, analyze how all these received reinforcements are distributed. If an effort was required to earn the payoff, the distribution of reinforcements will turn out bimodal, as schematically shown in Fig 4. It features two peaks, one centered around the mean payoff p, and one centered around the mean cost −n. Fig 4 also shows the mean q and the mean spread s of that distribution. We observe that payoffs and costs are both exactly one mean spread s away from the center q of the distribution: the payoff above, and the cost below. This implies that there is, at least in this representative case, a strong connection between payoffs and costs and the reinforcement statistics:
$$p = q + s, \qquad n = -q + s.$$
This connection allows us to set up conditions for the result of learning: if G and N are to represent payoff and cost, they must approach q + s and −q + s respectively. Equivalently, we can ask for 1/2(G − N) and 1/2(G + N) to approach q and s in the course of learning.
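The relation between payoff, cost, mean and mean spread can be checked numerically on a small bimodal sample. The following sketch uses hypothetical values (p = 2, n = 1, symmetric jitter of ±0.1 around each peak) and a hypothetical helper name `reinforcement_statistics`; it only illustrates the statistical argument and is not taken from the original analysis.

```python
from statistics import mean


def reinforcement_statistics(rewards):
    """Return the mean q and the mean spread s = E|R - q| of a list of reinforcements."""
    q = mean(rewards)
    s = mean(abs(r - q) for r in rewards)
    return q, s


p, n = 2.0, 1.0                       # illustrative mean payoff and mean cost
jitter = (-0.1, 0.1)                  # symmetric noise around each peak
# Equal numbers of payoffs (around +p) and costs (around -n).
rewards = [p + j for j in jitter] + [-n + j for j in jitter]

q, s = reinforcement_statistics(rewards)
payoff_estimate = q + s               # should recover p
cost_estimate = -q + s                # should recover n
```

The check relies on the two peaks being well separated and equally frequent, as assumed in the text; if payoffs and costs overlapped, q ± s would no longer coincide with the peak centers.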
After revealing the link between reinforcement statistics and payoff and cost, we are ready to derive the relations necessary to learn the latter. To that end, we first determine the connection strengths G and N that result from training on stochastic reinforcements. Such uncertain reinforcements are sampled at random from a fixed distribution. Then, we implement the newly identified conditions, demanding for 1/2(G − N) to approximate q and 1/2(G + N) to approximate s after training is finished. From these conditions, we will be able to derive the desired parameter relations.
Working through these steps is simpler after changing variables from G and N to Q ≔ 1/2(G − N) and S ≔ 1/2(G + N) right away. We saw that the new variables Q and S have a clear computational interpretation: if learning goes as planned, Q and S track the mean q and the mean spread s of the experienced reinforcement. To determine how Q and S change due to prediction errors δ, we simply add and subtract the update rules Eqs 2 and 3. Certain convenient properties of the nonlinear function f_ε help to further simplify the resulting equations:
$$f_\epsilon(\delta) - f_\epsilon(-\delta) = (1+\epsilon)\,\delta, \qquad f_\epsilon(\delta) + f_\epsilon(-\delta) = (1-\epsilon)\,|\delta|.$$
Exploiting these properties, we obtain
$$\Delta Q = \alpha_Q\,\delta - \lambda Q, \qquad (6)$$
$$\Delta S = \alpha_S\,|\delta| - \lambda S. \qquad (7)$$
Here, for brevity of notation, we introduced the effective learning rates α_Q = α(1 + ε)/2 and α_S = α(1 − ε)/2. Note that the changes of Q and S are proportional either to the prediction error itself or to its absolute value, in contrast to the changes of G and N.
Now, let us determine the strengths of the weights G and N, or equivalently of the variables Q and S, after many encounters with an action. When learning the reinforcements of a previously unknown action, Q and S typically change a lot during the first trials. These changes then get smaller and smaller as more experience is integrated: the learning curve plateaus. After enough trials, Q and S stop changing systematically, and start to merely fluctuate about some constant values, which we denote by Q* and S* and refer to as equilibrium points. In mathematical terms, directed learning stops when we may expect Q and S to remain unchanged by another trial, i.e. when E(ΔQ) = E(ΔS) = 0. If that stage is reached, the equilibrium points can be inferred by computing the mean value of the fluctuating variables: Q* = EQ and S* = ES. With these identities and the learning rules Eqs 6 and 7, we can determine the equilibrium points Q* and S*:
$$0 = \alpha_Q\,(q - Q^*) - \lambda Q^*, \qquad (8)$$
$$0 = \alpha_S\,E|R - Q| - \lambda S^*. \qquad (9)$$
To solve these equations, we shall make the additional assumption that the fluctuations of Q about Q* are small. This assumption is justified whenever α is sufficiently small, and allows us to approximate E|R − Q| ≈ E|R − Q*|. Collecting all those intermediate results, we may solve Eqs 8 and 9 for the equilibrium points. The solutions read
$$Q^* = c_Q\,q, \qquad (10)$$
$$S^* \approx c_S\,E|R - c_Q\,q|, \qquad (11)$$
with c_Q = α_Q/(α_Q + λ) and c_S = α_S/λ. Those are the approximate values of Q and S after learning. Next, we need to implement the conditions we inferred from Fig 4. Thanks to our choice of variables, this simply amounts to requiring that Q converge to the mean reinforcement q, and S to the mean spread s, i.e. requiring Q* = q and S* = s. Inserting the approximate values from Eqs 10 and 11 produced by the learning rules, we obtain
$$c_Q\,q = q, \qquad (12)$$
$$c_S\,E|R - c_Q\,q| = s. \qquad (13)$$
These equations are central to this publication. Their left-hand side represents the result of learning according to Mikhael and Bogacz' [16] rules. Their right-hand side specifies what needs to be learned if G and N really represented payoffs and costs, as Collins and Frank hypothesized [8]. Equating the left-hand and the right-hand side amounts to merging both theories. It allows us to determine how the parameters would be related if both theories were exactly true: for Eqs 12 and 13 to hold, α, λ and ε must take values such that c_Q = 1 and c_S = 1.
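The change of variables can be verified numerically. The sketch below compares the updates of Q and S obtained by adding and subtracting the G and N rules with the compact forms ΔQ = α_Q δ − λQ and ΔS = α_S |δ| − λS, and evaluates c_Q and c_S. It assumes δ = r − (G − N)/2 and ignores the clipping of weights at zero; the function names and test values are hypothetical.

```python
def f_eps(delta, eps):
    """Identity for positive arguments, slope eps otherwise."""
    return delta if delta > 0 else eps * delta


def delta_GN(G, N, r, alpha, eps, lam):
    """Weight changes of G and N for one reinforcement r (clipping at zero ignored)."""
    delta = r - 0.5 * (G - N)
    dG = alpha * f_eps(delta, eps) - lam * G
    dN = alpha * f_eps(-delta, eps) - lam * N
    return dG, dN


def delta_QS(G, N, r, alpha, eps, lam):
    """The same update expressed in Q = (G - N)/2 and S = (G + N)/2."""
    Q, S = 0.5 * (G - N), 0.5 * (G + N)
    delta = r - Q
    alpha_Q = alpha * (1 + eps) / 2   # effective learning rate of Q
    alpha_S = alpha * (1 - eps) / 2   # effective learning rate of S
    return alpha_Q * delta - lam * Q, alpha_S * abs(delta) - lam * S


def equilibrium_coefficients(alpha, eps, lam):
    """Return (c_Q, c_S) as defined in the text."""
    alpha_Q = alpha * (1 + eps) / 2
    alpha_S = alpha * (1 - eps) / 2
    return alpha_Q / (alpha_Q + lam), alpha_S / lam


c_Q, c_S = equilibrium_coefficients(alpha=0.3, eps=0.443, lam=0.093)
```

For the parameter values quoted for Fig 3, this gives c_Q close to 0.7 and c_S close to 0.9.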
This result raises several questions: Is it at all possible to satisfy the derived conditions? What do the conditions mean with respect to the parameters α, λ and ε? And finally, is there a practical way to determine sets of parameters α, λ and ε which, at least approximately, satisfy the conditions? We discuss each of these questions in the following paragraphs.
Firstly, is it possible to satisfy c_Q = 1 and c_S = 1 exactly? Examining the definition c_Q = α_Q/(α_Q + λ) quickly reveals that letting c_Q → 1 would amount to letting λ → 0. To see why this is the case, consider that c_Q → 1 amounts to λ/α_Q → 0. However, α_Q is an effective learning rate, and so must take values smaller than one. Thus, we really need to let λ → 0. Now, we derived above that after learning, S will fluctuate about its equilibrium point S* ≈ c_S E|R − c_Q q| with c_S = α_S/λ. In order to keep the equilibrium point S* finite as λ → 0, we would therefore be forced to have α_S → 0 also. This, though, would pose a real problem: α_S is the effective learning rate for S, so having it vanish would imply stopping learning in S altogether. We must conclude that strict satisfaction of the constraints c_Q = 1 and c_S = 1 is not compatible with non-vanishing learning rates that lead to a finite equilibrium. Specifically, c_Q = 1 can only ever hold approximately if the spread s is to be learned in finite time. Nevertheless, no such problem arises when c_S is set to 1 exactly.
Now, what do the constraints c_Q ≈ 1 and c_S = 1 mean in terms of the parameters α, λ and ε? In the previous paragraph, we saw that c_Q ≈ 1 is equivalent to λ/α_Q ≈ 0. Since both λ (a decay constant) and α_Q (an effective learning rate) are inherently positive, we may rewrite this as
$$\alpha(1+\epsilon) \gg 2\lambda. \qquad (14)$$
The other condition, c_S = 1, is easily translated analogously. We need only use the definitions of c_S and α_S to obtain
$$\alpha(1-\epsilon) = 2\lambda. \qquad (15)$$
Eqs 14 and 15 constitute the exact relations between the parameters α, λ and ε that need to hold for payoffs and costs to be estimated accurately. They cannot be further simplified, but we may use them to gain some more insight into the required magnitudes of the individual parameters: by substituting 2λ according to Eq 15 on the right-hand side of Eq 14, one obtains a condition of the form 1 − ε ≪ 1 + ε. Now, given that the intended range for ε is [0, 1], one quickly reaches the conclusion that ε ≈ 1. Reinserting this into Eq 14 yields λ ≪ α. In conclusion, we found that it is necessary (though not sufficient) for accurate learning of payoffs and costs to maintain a small but non-vanishing nonlinearity in the transmission of the prediction error signal (ε close to, but smaller than, one), as well as a non-vanishing decay rate λ, which is much smaller than the learning rate α.
Finally, how can such parameters α, λ and ε practically be determined? To implement the conditions c_Q ≈ 1 and c_S = 1, one can for instance express λ and ε in terms of α, c_Q and c_S. It is tedious, but without conceptual difficulty, to invert the definitions of c_Q and c_S in order to yield
$$\epsilon = \frac{c_Q - c_S\,(1-c_Q)}{c_Q + c_S\,(1-c_Q)}, \qquad \lambda = \frac{\alpha\,(1-c_Q)}{c_Q + c_S\,(1-c_Q)}. \qquad (16)$$
Then, one chooses α freely at one's convenience, and c_Q and c_S close (or, in the case of c_S, equal) to one. Importantly, c_Q must be chosen smaller than one to result in a positive λ. From these choices, one finally obtains ε and λ to work with the chosen α. Our simulations suggest that even values such as c_Q = 0.7 and c_S = 0.9, in combination with a learning rate of, say, α = 0.3, are close enough to one to allow reasonably accurate estimations of payoff and cost. This can be seen in Fig 3: the simulations shown there used those exact settings, which equivalently means that ε = 0.443 and λ = 0.093.
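The parameter choice described above can be sketched as a small helper. The closed-form expressions in the code were obtained here by inverting c_Q = α_Q/(α_Q + λ) and c_S = α_S/λ with α_Q = α(1 + ε)/2 and α_S = α(1 − ε)/2; they should be checked against the original publication, and the function name `parameters_from_targets` is hypothetical.

```python
def parameters_from_targets(alpha, c_Q, c_S):
    """Return (eps, lam) that realize the chosen c_Q and c_S for learning rate alpha.

    Derived by inverting c_Q = alpha_Q / (alpha_Q + lam) and c_S = alpha_S / lam,
    with alpha_Q = alpha * (1 + eps) / 2 and alpha_S = alpha * (1 - eps) / 2.
    """
    if not 0.0 < c_Q < 1.0:
        raise ValueError("c_Q must lie strictly between 0 and 1 for a positive decay rate")
    if c_S <= 0.0:
        raise ValueError("c_S must be positive")
    denom = c_Q + c_S * (1.0 - c_Q)
    eps = (c_Q - c_S * (1.0 - c_Q)) / denom
    lam = alpha * (1.0 - c_Q) / denom
    return eps, lam


# Settings quoted in the text for the simulations in Fig 3.
eps, lam = parameters_from_targets(alpha=0.3, c_Q=0.7, c_S=0.9)
```

With α = 0.3, c_Q = 0.7 and c_S = 0.9, the helper returns values that round to ε = 0.443 and λ = 0.093, in agreement with the figures quoted in the text.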
In summary, we used a statistical argument (the connection between payoffs and costs and the reinforcement statistics) to determine conditions under which payoffs and costs can be learned with the update rules Eqs 2 and 3.

Deterministic reinforcement sequences
In the preceding section, we derived relations that are necessary for successful learning of payoff and cost. If rewards are awarded stochastically, those relations are also sufficient for successful learning. But what happens to the weights G and N if the received reinforcements follow a strong pattern? Assume, for instance, that an action reliably yields a fixed cost −n followed by a fixed payoff p. Under which additional conditions do G and N then still reflect the magnitudes of payoff and cost after learning?
To answer that question, we must again determine the connection strengths that result from experiencing the action time and again. Now, we do not have to rely on a probabilistic treatment: when the pattern of the reinforcements is fully known, it is possible to determine the evolution of G and N exactly. As in the previous section, we will concentrate on the result of learning rather than on its dynamics. Here, this amounts to determining the fixed points of the learning rules. These fixed points are simply those values of G and N (or equivalently of the alternative variables Q and S we defined above) that are invariant under the updates caused by the action. We denote the fixed points by G* and N*, or Q* and S*. During learning, the variables converge to their respective fixed points and cease to change notably once they arrive in their vicinity.
First, we focus on determining the fixed point of Q. Note that each encounter with the action yields two updates of Q: one due to the cost and one due to the payoff. Mathematically, we can formulate this as
$$Q_{\text{after action}} = Q_{\text{before action}} + (\Delta Q)_{\text{cost}} + (\Delta Q)_{\text{payoff}}.$$
To find Q*, demand that these successive updates have no net effect on Q: if Q_after action equals Q_before action, then Q_before action can rightfully be called a fixed point. If this is so, the two updates must have canceled each other:
$$(\Delta Q)_{\text{cost}} + (\Delta Q)_{\text{payoff}} = 0. \qquad (17)$$
This condition, in combination with the update rules Eqs 2 and 3, allows us to determine Q* in terms of p, n and the parameters α, ε and λ. First, we use the update rule Eq 6 for Q to write (ΔQ)_cost as
$$(\Delta Q)_{\text{cost}} = \alpha_Q\,(r_{\text{cost}} - Q_{\text{before action}}) - \lambda Q_{\text{before action}} = \alpha_Q\,(-n - Q_{\text{before action}}) - \lambda Q_{\text{before action}}.$$
Then, one uses the rule again to write (ΔQ)_payoff as
$$(\Delta Q)_{\text{payoff}} = \alpha_Q\,(r_{\text{payoff}} - Q_{\text{after cost}}) - \lambda Q_{\text{after cost}} = \alpha_Q\,(p - Q_{\text{after cost}}) - \lambda Q_{\text{after cost}}$$
$$= \alpha_Q\,\big(p - (Q_{\text{before action}} + (\Delta Q)_{\text{cost}})\big) - \lambda\,\big(Q_{\text{before action}} + (\Delta Q)_{\text{cost}}\big).$$
Finally, one substitutes (ΔQ)_cost from above into this expression, and then inserts (ΔQ)_cost and (ΔQ)_payoff into Eq 17. Solving the equation for Q_before action, which in the case of Eq 17 is identical to Q*, yields
$$Q^* = \frac{\alpha_Q\,\big[\,p - (1 - \alpha_Q - \lambda)\,n\,\big]}{(\alpha_Q + \lambda)\,(2 - \alpha_Q - \lambda)}, \qquad (18)$$
where α_Q = α(1 + ε)/2. Now, recall that the definition of Q in terms of G and N is Q = 1/2(G − N), and that the true payoff and cost in this model are p and n. If G and N represented the true payoffs and costs after learning, it must be true that G* ≈ p and N* ≈ n, and thereby
$$\frac{\alpha_Q\,\big[\,p - (1 - \alpha_Q - \lambda)\,n\,\big]}{(\alpha_Q + \lambda)\,(2 - \alpha_Q - \lambda)} \approx \tfrac{1}{2}\,(p - n). \qquad (19)$$
Just as Eqs 12 and 13, this equation is an interface between the results of Mikhael and Bogacz' [16] update rules on the left-hand side and the requirement that Go and No-Go weights encode payoffs and costs on the right-hand side. For both sides to agree, we must have
$$\alpha_Q + \lambda \ll 1. \qquad (20)$$
This is a novel condition for learning the correct magnitudes of payoffs and costs from a deterministic reinforcement pattern. The definition of α_Q and the previously derived conditions in Eqs 14 and 15 may be used to transform this novel condition into the simpler form α ≪ 1.
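The fixed point of Q under an alternating cost and payoff can be checked by brute-force iteration. In the sketch below, the closed-form expression in `q_fixed_point` was obtained here by carrying out the substitution steps described in the text with ΔQ = α_Q(r − Q) − λQ; it is an assumption to be checked against the original publication, as are the function names and the example values.

```python
def q_fixed_point(p, n, alpha_Q, lam):
    """Closed-form value of Q before each action once learning has converged.

    Obtained by requiring that the cost update and the payoff update cancel.
    """
    x = alpha_Q + lam
    return alpha_Q * (p - (1.0 - x) * n) / (x * (2.0 - x))


def iterate_q(p, n, alpha_Q, lam, trials=500, Q=0.0):
    """Apply the Q update repeatedly to the alternating sequence (-n, p)."""
    for _ in range(trials):
        for r in (-n, p):
            Q += alpha_Q * (r - Q) - lam * Q
    return Q


# Illustrative values: alpha = 0.3, eps = 0.443, lam = 0.093, p = 2, n = 1.
alpha, eps, lam = 0.3, 0.443, 0.093
alpha_Q = alpha * (1 + eps) / 2
Q_closed = q_fixed_point(2.0, 1.0, alpha_Q, lam)
Q_iterated = iterate_q(2.0, 1.0, alpha_Q, lam)
```

Comparing `Q_closed` with (p − n)/2 for several learning rates shows how the bias shrinks as α_Q + λ becomes small, provided λ also stays small relative to α_Q.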
Next, we repeat the same analysis for S. Since we search for additional conditions on the parameters, we are free to use the original conditions in Eqs 14 and 15 to simplify our calculations. The only complication we encounter is the appearance of Q in the update rules of S, which we resolve by substituting Q with Q*, acknowledging that the fixed points of S and Q depend on each other. We arrive at
$$S^* \approx \tfrac{1}{2}\,(p + n). \qquad (21)$$
Again, using the definition S = 1/2(G + N) allows comparing the result of learning with the strengths required to represent payoffs and costs. We immediately find that G* ≈ p and N* ≈ n already hold. Thus, Eq 20 is the only additional condition for successful learning of payoff and cost from reinforcements that follow a strong pattern.
From the results presented in this section, we conclude that the learning rules Eqs 2 and 3 facilitate learning of the magnitudes of fixed payoffs and costs that occur reliably one after the other. However, we also saw that this is only true if Eq 20 holds in addition to the conditions that we derived in the previous section.
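The fixed-point argument above can be checked numerically. The following is a minimal sketch, assuming only the Q update quoted in the derivation (a learning term with rate α_Q plus a decay with rate λ) and a strictly alternating sequence of cost −n and payoff p. The function names and parameter values are illustrative and do not come from the original simulations; the closed-form expression is our own algebraic solution of the cancellation condition Eq 17.

```python
def q_update(q, r, alpha_q, lam):
    """One application of the Q rule: learning towards r, plus weight decay."""
    return q + alpha_q * (r - q) - lam * q


def simulate_before_action_value(p, n, alpha_q, lam, trials=2000, q0=0.0):
    """Iterate cost-then-payoff updates and return Q just before the next action."""
    q = q0
    for _ in range(trials):
        q = q_update(q, -n, alpha_q, lam)  # update due to the cost
        q = q_update(q, p, alpha_q, lam)   # update due to the payoff
    return q


def closed_form_fixed_point(p, n, alpha_q, lam):
    """Solution of (dQ)_cost + (dQ)_payoff = 0 for Q before the action."""
    k = alpha_q + lam
    return alpha_q * (p - (1.0 - k) * n) / (k * (2.0 - k))


if __name__ == "__main__":
    p, n = 2.0, 1.0  # illustrative payoff and cost
    for alpha_q, lam in [(0.1, 0.001), (0.01, 0.0001)]:
        sim = simulate_before_action_value(p, n, alpha_q, lam, trials=20000)
        exact = closed_form_fixed_point(p, n, alpha_q, lam)
        print(alpha_q, lam, sim, exact, 0.5 * (p - n))
```

With a smaller learning rate (and a decay that stays small relative to it), the fixed point moves closer to the target value 1/2(p − n), which illustrates why a small learning rate is required in the deterministic case.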

Summary of analytic results
The analysis above revealed the conditions under which the striatal plasticity rules Eqs 2 and 3 could learn the magnitudes of the payoffs and costs of actions. We identified the conditions in two different paradigms: first, we investigated learning from purely stochastic reinforcements sampled from a fixed distribution. Then, we considered a deterministic pattern of reinforcements. We obtained two key results:
• Consider a reinforcement distribution, obtained from multiple encounters with an action, that is shaped by payoffs and costs, such as the one shown in Fig 4. If trained on reinforcements sampled from that distribution, the plasticity rules Eqs 2 and 3 will enable learning of the mean payoffs and costs if the conditions in Eqs 22 and 23 hold. These conditions imply (but do not follow from) a non-vanishing but small nonlinearity in the transmission of the prediction error, and a non-vanishing but small decay of the connection weights. Here, a small decay is characterized by a decay rate λ which is small compared to the learning rate α.
• If trained on a pattern of reinforcements that alternates between payoffs of magnitude p and costs of magnitude n, the plasticity rules Eqs 2 and 3 will capture those exact payoffs and costs if, in addition to Eqs 22 and 23, the condition $\alpha \ll 1$ (Eq 24) holds. In words, unbiased learning of payoffs and costs in deterministic scenarios explicitly requires a small learning rate α.

Simulations of learning
The previous sections revealed what to expect from training the learning rules Eqs 2 and 3 on certain types of reinforcement. Specifically, we investigated the connection strengths G and N after many experiences of either totally predictable or totally random reinforcements. In this section, we aim to confirm and extend those results using numerical simulations rather than analytic methods. But while the cost is again held constant at −n, this time the payoff P is sampled from a fixed distribution (a normal distribution with mean p and non-vanishing variance) in each trial. Thus, the task includes both stochastic and deterministic components: each repetition of an action results in a fixed cost, which is followed by an uncertain reinforcement. The depicted simulations show that under such conditions, N eventually represents the cost n, while G converges towards the mean payoff $p = \mathbb{E}[P]$.
Finally, Fig 5c and 5d contain simulations of repeated actions with reinforcements drawn completely at random from fixed distributions. In Fig 5c, the obtained reinforcements are valued either p or −n, with probabilities 1/2 each. In Fig 5d, reinforcements are sampled from a normal distribution with mean $\mu_r = \frac{1}{2}(p - n)$ and standard deviation $\sigma_r = \frac{1}{2}\sqrt{\pi/2}\,(p + n)$. We simulate the experience resulting from such actions by sampling reinforcements from a fixed distribution on each trial. The stochastic nature of this procedure causes the evolution of the weights G and N to be different each time the simulation is run. To overcome that effect and separate random fluctuations from reproducible effects, we collect and average a large number of runs. Each row in Fig 5b-5d contains both a single run of the simulation and an average of 500 successive runs. In the above sections, we proved that in purely stochastic tasks, the weights would approximate key statistics of the reinforcement distribution after convergence. Those statistics are indeed approximated in the simulations, confirming the results of the analytic treatment above.
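The particular standard deviation used for the normal distribution in Fig 5d can be motivated with a short numerical check. The sketch below assumes that the "spread" s of the reinforcement distribution is its mean absolute deviation from the mean q (our reading of the text), so that q + s and s − q should recover p and n. For a normal distribution the mean absolute deviation equals σ·sqrt(2/π), which gives s = (p + n)/2 for the stated σ_r. Function names, sample size and seed are illustrative.

```python
import math
import random


def matched_sigma(p, n):
    """Standard deviation stated in the text: 1/2 * sqrt(pi/2) * (p + n)."""
    return 0.5 * math.sqrt(math.pi / 2.0) * (p + n)


def sample_statistics(p, n, samples=200_000, seed=0):
    """Estimate the mean q and the mean absolute deviation s from samples."""
    rng = random.Random(seed)
    mu = 0.5 * (p - n)
    sigma = matched_sigma(p, n)
    draws = [rng.gauss(mu, sigma) for _ in range(samples)]
    q = sum(draws) / samples
    s = sum(abs(r - q) for r in draws) / samples
    return q, s


if __name__ == "__main__":
    p, n = 2.0, 1.0  # illustrative payoff and cost
    q, s = sample_statistics(p, n)
    # Under the assumption above, q + s estimates p and s - q estimates n.
    print("q + s =", q + s, " s - q =", s - q)
```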

Simulations of the effect of D2 blocking
In the previous sections, we focused on the change of the synaptic weights associated with a single action during the accumulation of experience. In this section, we redirect our attention. Instead of considering one action during learning, we now consider multiple actions after learning, and ask: can effects of dopamine depletion on choice behavior be explained in terms of payoffs versus costs?
In a classic experiment illustrated in Fig 6a, rats were given a choice between pressing a lever in order to obtain a nutritious pellet and freely available lab chow [22]. Normal animals were willing to work for pellets, but after blocking D2 receptors with the drug haloperidol they were no longer willing to make an effort and preferred a less valuable but free option. Collins and Frank [8] provided a mechanistic explanation for this surprising effect. The theory proposed in this paper accounts for it in a conceptually similar but slightly simpler way. Here, we explain our modeling of the experiment and then describe the simulations; the differences from the account of the OpAL model are pointed out in the Discussion.
To model the experiment, we need to specify how the striatal weights G and N and the motivation signal transmitted by dopamine affect the output of the basal ganglia system, and how that output then affects choice. We refer to the output of the basal ganglia as the thalamic activity, denoted by T. T depends on the cortico-striatal weights G and N, and on the dopaminergic motivation signal, denoted by D. Even though this relationship might admittedly be complex, we restrict ourselves to capturing the signs of the dependencies by using a linear approximation:

$$T = DG - (1 - D)N. \qquad (25)$$

In the above equation, the first term DG corresponds to the input from the striatal Go neurons. This term is positive because the projection from striatal Go neurons to the thalamus involves double inhibitory connections (see Fig 1), resulting in an overall excitatory effect. The activity of the Go neurons depends on the synaptic weights G. We assume that their gain is modulated by the dopaminergic input D, extrapolating from the observation of an increased slope of the firing-input relationship in the presence of dopamine in pyramidal neurons expressing D1 receptors [23]. These data are replotted in Fig 7a. The second term −(1 − D)N corresponds to the input from the striatal No-Go neurons. It has a negative sign because the projection from the No-Go neurons to the thalamus includes three inhibitory connections. The activity of the striatal No-Go neurons depends on their synaptic weights N, and we assume that their gain is reduced by dopamine, so the synaptic input is scaled by (1 − D). This assumption is based on data showing that D2 agonists reduce the slope of the firing-input relationship of striatal No-Go neurons [24], which are replotted in Fig 7b. Those assumptions about the impact of dopamine on the activity of striatal neurons are backed up by detailed modeling studies [25,26], which predict precisely that dopamine enhances activity in the Go pathway and inhibits activity in the No-Go pathway. In Eq 25, we further assume that $D \in [0, 1]$ and that the value D = 0.5 corresponds to a baseline level of dopamine for which both striatal populations equally affect the thalamic activity.

Fig 7. (a) fI-curves of a D1-expressing neuron, replotted from [23]. The blue points are recorded from a neuron at a higher level of D1 receptor activation (e.g. with dopamine present), the black points are recorded at a lower level of receptor activation (e.g. without dopamine). Smooth curves have been obtained from the data through LOESS regression to serve as visual guides (black and blue lines). (b) fI-curves of a D2-expressing neuron, replotted from [24]. The blue points are recorded from a neuron at a higher level of D2 receptor activation (e.g. in the presence of the D2 agonist quinpirole), the black points are recorded from a neuron in the control group at a lower level of D2 activation (e.g. in the absence of the agonist). As in panel (a), LOESS curves (black and blue lines) have been added as visual guides. https://doi.org/10.1371/journal.pcbi.1006285.g007
Although arising from a slightly different line of reasoning, the action value defined by Eq 25 is directly proportional to the action value proposed by Collins and Frank, which is defined by Eq 4 of their publication [8]: $Q \propto \beta_G G - \beta_N N$. One easily verifies the direct proportionality of the two expressions by rewriting Eq 25 in this form, with D and 1 − D taking the roles of $\beta_G$ and $\beta_N$.

How does thalamic activity affect choice? Again, we use a very simple dependency to capture the key aspects of that relationship: In our model of the experiment, we calculate the thalamic activity for each option. Then, we add some random noise independently to each option. Finally, all options with negative noisy thalamic activity are discarded, and the option with the highest noisy thalamic activity is chosen. If the noisy thalamic activity is negative for all available options, no choice will be made; the model defaults to staying inactive.
Often in similar situations, the softmax rule is the preferred choice procedure. According to that rule, one should first transform the set of different action values (or thalamic activities in this case) into a probability distribution over the available actions, by use of the softmax function. Then, one should sample an action from that distribution, and declare it the choice of that trial. Collins and Frank's OpAL model [8] exemplifies the use of the softmax rule.
We deliberately decided against this conventional approach and in favor of the above-described procedure to accommodate a certain feature of the data presented in [22]: The group with D2 antagonist differed from the control group not only in their willingness to work for food but also in their overall food consumption. The rats with D2 antagonist consumed less food in total (see Fig 8c). We can hope to capture this effect with our model, since it allows for the possibility of making no choice at all, and thus of consuming neither of the food items. A softmax decision rule, on the other hand, forces a choice on each trial, and must therefore always lead to the same number of consumed food items.
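The choice procedure described above (compute the thalamic activity of each option, add independent noise, discard options with negative noisy activity, pick the maximum, or stay inactive) can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the reported simulations: the thalamic activity is taken as T = DG − (1 − D)N, the noise is assumed to be Gaussian with a hypothetical standard deviation `noise_sd`, and the option names and weights are made up.

```python
import random


def thalamic_activity(G, N, D=0.5):
    """Linear approximation of thalamic activity: T = D*G - (1 - D)*N."""
    return D * G - (1.0 - D) * N


def choose(options, D=0.5, noise_sd=0.1, rng=random):
    """Return the name of the chosen option, or None if the model stays inactive.

    options maps an option name to its (G, N) weights.
    noise_sd is a hypothetical noise level; Gaussian noise is an assumption.
    """
    best_name, best_value = None, 0.0
    for name, (G, N) in options.items():
        noisy = thalamic_activity(G, N, D) + rng.gauss(0.0, noise_sd)
        # Only options with positive noisy activity can be selected.
        if noisy > best_value:
            best_name, best_value = name, noisy
    return best_name


if __name__ == "__main__":
    rng = random.Random(1)
    options = {"lever": (2.0, 1.0), "chow": (0.5, 0.0)}  # illustrative weights
    counts = {"lever": 0, "chow": 0, None: 0}
    for _ in range(1000):
        counts[choose(options, D=0.5, noise_sd=0.3, rng=rng)] += 1
    print(counts)
```

Unlike a softmax rule, this procedure can return no choice at all, which is what allows total consumption to vary between conditions.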
Finally, how does the drug haloperidol affect the thalamic activity, and hence choice? Haloperidol is a D2 antagonist; it blocks the D2 receptors on the medium spiny neurons of the No-Go pathway. This blocking reduces the (inhibiting) impact of dopamine on the activity N of that pathway. To account for this in our model, we introduce another factor $\kappa_N \in [0, 1]$ into our expression for the thalamic activity:

$$T = DG - (1 - \kappa_N D)N.$$

The parameter $\kappa_N$ controls how much dopamine affects the activity of the No-Go pathway N, and is hence suitable to model D2-blocking: $\kappa_N = 1$ recovers the normal thalamic activity given in Eq 25, while $\kappa_N = 0$ (total blocking) fully removes the impact of dopamine on the indirect pathway, leading to completely uninhibited activity N. In the control group of the experiment, $\kappa_N$ is set to 1 (no medication is administered, no blocking happens). In the group that received the medication, $\kappa_N$ is a free parameter that must be fitted to the data. The best fit featured $\kappa_N = 0.7507$, corresponding to blocking of D2 receptors with an efficiency of roughly 25%.

Fig 6b illustrates how the model can account for the behaviour when the dopamine level has a normal baseline value. In the figure, the strength of the cortico-striatal connections is denoted by the labels and the thickness of arrows. Pressing the lever gives a high payoff, so the weights of Go neurons selective for this action are strong, but it also has a substantial cost, so the No-Go weights are also present. On the other hand, the free food is not particularly nutritious, so the Go weights are weak, and there is no cost, so the No-Go weight is negligible. When no medication is administered, the positive and negative consequences are weighted equally, so the thalamic neurons selective for pressing the lever have overall slightly higher activity, which ultimately leads to a higher likelihood for this action to be chosen over the free option.
By contrast, Fig 6c shows that when the D2 receptors are blocked, costs are weighted more than payoffs, and the thalamic activity associated with pressing the lever decreases. Approaching free food has only negligible cost; therefore, the activity of thalamic neurons selective for this option is now higher, and this action is overall more likely to be chosen.
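The preference reversal under D2 blocking can be illustrated with a few lines of arithmetic. The sketch below assumes that the blocking factor enters the thalamic activity as T = DG − (1 − κ_N·D)N, which reduces to Eq 25 for κ_N = 1 and removes the dopaminergic modulation of the No-Go pathway for κ_N = 0. The weights for the two options are hypothetical values chosen for illustration; they are not the fitted weights reported in Fig 6.

```python
def thalamic_activity(G, N, D=0.5, kappa_n=1.0):
    """Thalamic activity with D2 blocking factor kappa_n in [0, 1].

    kappa_n = 1: no blocking (normal activity).
    kappa_n = 0: dopamine no longer inhibits the No-Go pathway.
    """
    return D * G - (1.0 - kappa_n * D) * N


# Hypothetical learned weights (G, N): the lever has high payoff and some cost,
# the free chow has low payoff and negligible cost.
LEVER = (2.0, 1.2)
CHOW = (0.6, 0.0)

if __name__ == "__main__":
    for label, kappa_n in [("control", 1.0), ("haloperidol", 0.7507)]:
        t_lever = thalamic_activity(*LEVER, D=0.5, kappa_n=kappa_n)
        t_chow = thalamic_activity(*CHOW, D=0.5, kappa_n=kappa_n)
        print(label, "lever:", t_lever, "chow:", t_chow)
```

With these illustrative numbers, the lever has the higher activity in the control condition and the chow has the higher activity under blocking, because only the option that carries a cost is penalized when κ_N decreases.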
A quantitative fit of our model to Salamone et al.'s experimental results [22] is illustrated in Fig 8. The panels on the left side in Fig 8 summarize experimental data: the top-left display corresponds to a condition in which both high-valued pellets and the low-valued lab chow were freely available. In this case, the animals preferred pellets irrespective of the dopamine level. The bottom-left panel corresponds to the condition in which the animal had to press a lever in order to obtain a pellet, and as mentioned before, after injections of a D2 antagonist they started to prefer the lab chow.
In our model of the experiment, we run through a sequence of trials mimicking those illustrated in Fig 6: on each trial, the model makes a choice between two actions (pressing a lever or approaching lab chow) or remains inactive. Before the main experiments, the animals were trained to press a lever to obtain rewards and were exposed to the lab chow [22]. To parallel this in simulations, the model was first trained such that it experienced each action a number of times, received corresponding payoffs and costs, and updated its weights according to Eqs 2 and 3. The weights resulting from that learning are reported in Fig 6b and 6c. Then, the model was tested with and without blocking, i.e. with $\kappa_N$ as a free parameter and with $\kappa_N$ fixed to one. As described in Materials and Methods, the parameters of the model were optimized to match experimentally observed behavior. As shown in the right displays in Fig 8, the model was able to reproduce the observed pattern of behavior. This illustrates the model's ability to capture both learning about payoffs and costs associated with individual actions and the effects of the dopamine level on choices.

Robustness
Above, we dedicated a whole section to deriving conditions on the parameters of the learning rules Eqs 2 and 3 that guarantee correct (i.e. unbiased) estimation of payoffs and costs. We also pointed out that these conditions cannot be satisfied exactly even in theory; in fact, our own simulations throughout this paper yield parameter settings that significantly violate the conditions. The proposed biological implementation of the rules, certainly imperfect and subject to unpredictable influences, is even less likely to feature parameters close to the constraint surface. How robust is the presented learning algorithm under parameter detuning? How much variation around the conditions can the rules take without breaking? Here, we first describe the effect of parameter detuning on the values to which Go and No-Go weights converge. Then, we argue that the algorithm will still produce useful results even under substantial detuning of the parameters.
We are interested in the coding of payoffs and costs after learning, and should therefore investigate the equilibrium values $G^*$ and $N^*$ of G and N. Those equilibrium values may be obtained by combining the equilibrium values $Q^*$ and $S^*$ given in Eqs 8 and 11:

$$G^* = S^* + Q^* \approx c_S s + c_Q q, \qquad N^* = S^* - Q^* \approx c_S s - c_Q q.$$

Here, we assumed that the average spread around $c_Q q$ is approximately equal to the average spread around q, which is a good approximation if the spread of a distribution is comparable to the mean. Next, we can use the relation of payoffs p and costs n to the statistics q and s of the reinforcement distribution they generate. These relations are given in Eqs 4 and 5; inverting and inserting those yields

$$G^* \approx \frac{1}{2}\left[(c_S + c_Q)\,p + (c_S - c_Q)\,n\right], \qquad N^* \approx \frac{1}{2}\left[(c_S - c_Q)\,p + (c_S + c_Q)\,n\right].$$

We observe that as long as $c_Q = c_S$, the Go and No-Go weights converge to the vicinity of values proportional to the payoffs and costs. Thus, as long as $c_Q = c_S$, the payoffs and costs are encoded separately in the two pathways.
Expressed in terms of the elementary parameters α, λ and $\epsilon$, and solved for $\epsilon$, this condition becomes

$$\epsilon = \sqrt{1 + 4\left(\frac{\lambda}{\alpha}\right)^2} - 2\,\frac{\lambda}{\alpha}. \qquad (31)$$

A second solution of the condition exists; however, it yields $\epsilon < 0$, which is biologically implausible. Hence, we ignore that second solution and focus our attention on Eq 31: if λ/α is very small (i.e. if decay is weak relative to learning), then $\epsilon$ approaches one, rendering the learning rules approximately linear. If, on the other hand, λ/α is very large (i.e. decay is very strong compared to learning), then $\epsilon$ approaches zero, rendering the learning rules maximally non-linear. This relationship between $\epsilon$ and λ is not surprising; in fact, we have seen in Fig 2 that decay is necessary to balance the unconstrained strengthening of the weights that results from introducing the nonlinearity (compare Fig 2b and Fig 2c). Eq 31 makes this manifest: the stronger the nonlinearity (i.e. the closer $\epsilon$ gets to zero), the stronger the decay relative to learning, and vice versa.

Now, after investigating the effect of detuning on $G^*$ and $N^*$, let us explore the effect of detuning on the thalamic activity T, which is the relevant output of our model as far as action selection is concerned. Substituting the above equations into the definition of thalamic activity in Eq 25, we obtain:

$$T \approx \frac{1}{2}(c_S + c_Q)\left[Dp - (1 - D)\,n\right] + \frac{1}{2}(c_S - c_Q)\left[Dn - (1 - D)\,p\right].$$

When $c_Q = c_S \neq 1$, the thalamic activity becomes scaled by a constant $c_S$, but as this scaling constant is the same for all actions, the network can still select actions on the basis of payoffs and costs modulated by the motivation signal D, in the same way as described in the previous subsection. Importantly, the effect of dopamine (to emphasize the payoff when increased, and to emphasize the cost when decreased) is present as long as $c_S > 0$, even if $c_Q \neq c_S$. These signature effects of the proposed mechanism are thus robust even under significant detuning.
However, the disadvantage of setting parameters such that $c_Q \neq c_S$ is that the dopaminergic motivation signal D would have a relatively smaller effect on changing the weighting between payoffs and costs; for example, the payoffs or costs could no longer be ignored by setting D to its extreme values of 0 or 1. From this analysis, we may conclude that while action selection is quite robust under violation of the derived conditions, dopaminergic regulation works most effectively if the conditions are met approximately.
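The robustness argument can be made concrete with a small numerical sketch. It rests on several assumptions that should be read as such: the equilibrium values are taken to be Q* ≈ c_Q·q and S* ≈ c_S·s, the reinforcement statistics are q = (p − n)/2 and s = (p + n)/2, and the weights follow from G = S + Q and N = S − Q. The scaling constants used below are arbitrary illustrative values.

```python
def equilibrium_weights(p, n, c_q, c_s):
    """Approximate equilibrium Go and No-Go weights under detuning.

    Assumes Q* = c_q * q and S* = c_s * s with q = (p - n)/2, s = (p + n)/2.
    """
    q = 0.5 * (p - n)
    s = 0.5 * (p + n)
    return c_s * s + c_q * q, c_s * s - c_q * q


def thalamic_activity(G, N, D):
    """Linear approximation of thalamic activity: T = D*G - (1 - D)*N."""
    return D * G - (1.0 - D) * N


if __name__ == "__main__":
    p, n = 2.0, 1.0
    for c_q, c_s in [(0.8, 0.8), (0.6, 0.9)]:  # matched vs detuned constants
        G, N = equilibrium_weights(p, n, c_q, c_s)
        print("c_Q =", c_q, "c_S =", c_s, "G* =", G, "N* =", N,
              "T(D=0) =", thalamic_activity(G, N, 0.0),
              "T(D=1) =", thalamic_activity(G, N, 1.0))
```

In the matched case the weights are scaled copies of p and n, so D = 1 removes the cost from the evaluation entirely. In the detuned case G* also depends on n, so the cost cannot be fully ignored, yet raising D still increases T because G* + N* = 2·c_S·s remains positive.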

An actor-critic variation
So far, we assumed that the outcome prediction is computed by the same striatal neurons that encode the payoffs and costs of actions. Only one network was involved: that which is responsible for the choice of action. We refer to such a network as the 'actor' in the remainder of this exposition. In this section, we look at how the theory described above generalizes to the actor-critic framework [27]. That framework assumes that the outcome prediction is not computed by the actor, but by a separate group of striatal patch neurons called the 'critic'. More formally, the purpose of that critic is to learn the value V of the current state. One way to generalize our theory in this direction is to keep the actor network unaltered, while supplementing it with a similar critic network whose weights $G_{\text{critic}}(s)$ and $N_{\text{critic}}(s)$ are learned by rules analogous to Eqs 2 and 3 (Eqs 33 and 34). The crucial difference between the actor and the critic is that the critic network is not selective for the action, but only for the situation (note that $G_{\text{critic}}(s)$ and $N_{\text{critic}}(s)$ depend on s, but not on a, as opposed to $G_{\text{actor}}(s, a)$ and $N_{\text{actor}}(s, a)$). It thus learns the value of a situation irrespective of the actions chosen. Importantly, the critic is in charge of supplying the outcome predictions. Those predictions are compared to the actual outcomes to produce the outcome prediction errors δ from which both networks learn.
We take the state value to be encoded in the difference of $G_{\text{critic}}(s)$ and $N_{\text{critic}}(s)$: $V_{\text{critic}}(s) = \frac{1}{2}(G_{\text{critic}}(s) - N_{\text{critic}}(s))$. The change of the state value on each trial can be obtained by subtracting Eqs 33 and 34:

$$\Delta V_{\text{critic}}(s) = \alpha_Q\,\delta - \lambda V_{\text{critic}}(s).$$

The prediction error δ, which teaches the actor as well, is the difference between the obtained reinforcement r and the reinforcement prediction by the critic:

$$\delta = r - V_{\text{critic}}(s).$$

What would be learned with that architecture? If the same action is selected on each trial, the actor will learn in exactly the same way as the critic. Then, the prediction error in the actor-critic model is the same as in the actor-only model described above, and the weights of the actor in the actor-critic model converge to exactly the same values as for the actor-only model. However, this reasoning does not seem to apply if more than one action is available: empirically, animals then select the actions that maximize their rewards in their own perception. In the process of learning, they will likely sample all available actions.
If such behavior generates input for an actor-critic model, the critic will integrate the experience of all those trials, and will thus represent a mixture of the expected reinforcements associated with the available actions. This generally interferes with correct learning of the payoffs and costs of the different actions. However, there is a caveat: one of the available actions will eventually prove most useful; as soon as the animal has determined that best action, it will select it in the majority of cases. That, in turn, forces the critic into mainly representing the expected reinforcement of this best action. As a final consequence, the payoff and cost of that best action are also inferred correctly.
We confirmed this empirically for the model specified above: in Fig 9, we present simulations of a task in which the subject must choose between two actions. Both actions reliably yield a constant cost followed by a constant payoff each time they are selected. One of the actions is unambiguously superior to the other: its payoff is larger and its cost is lower.
Both an actor-only model and an actor-critic model interacted with that task. On each trial, an action was selected by sampling from a softmax distribution over all available actions: the probability of choosing action a in situation s was proportional to exp(βQ(s, a)), where Q(s, a) = 1/2(G(s, a) − N(s, a)) was the action value, and β was the softmax inverse temperature. Fig 9 shows the temporal evolution of the involved synaptic weights over the course of learning. Fig 9a and 9b depict the actor-only evolution of the weights G and N that encode the payoffs and costs of actions 1 and 2, respectively. For both actions, payoffs and costs are learned correctly. Learning is notably slower for action 1. This is easily explained: action 1 is the worse of the two options and thus chosen much less frequently. In contrast, the actor-critic driven evolution of the same weights presented in Fig 9d and 9e leads to a correct estimate of the payoff and cost only for the superior action 1. Learning is impaired for the inferior action 2, as anticipated in the qualitative discussion above. The state value, presented in Fig 9c, provides further confidence in the validity of that discussion: instead of encoding a mixture of the values of all available actions, it converges to the value of the superior action, indicated by the higher of the two dashed lines.
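The softmax selection step used in these simulations can be sketched as below. The sketch only covers action selection, not learning; the weights and the value of β are illustrative, and subtracting the maximum before exponentiating is a standard numerical safeguard that does not change the resulting probabilities.

```python
import math
import random


def softmax_probabilities(G, N, beta):
    """Choice probabilities proportional to exp(beta * Q), with Q = (G - N) / 2."""
    q_values = [0.5 * (g - n) for g, n in zip(G, N)]
    q_max = max(q_values)  # shift for numerical stability
    weights = [math.exp(beta * (q - q_max)) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def sample_action(G, N, beta, rng=random):
    """Sample an action index from the softmax distribution."""
    probs = softmax_probabilities(G, N, beta)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Illustrative weights for two actions; the first has the higher action value.
    G = [2.0, 1.0]
    N = [0.5, 1.0]
    print(softmax_probabilities(G, N, beta=2.0))
    print(sample_action(G, N, beta=2.0, rng=random.Random(0)))
```

Because the option with the higher action value is sampled more often, the other option contributes fewer learning trials, which is the sampling bias discussed in this section.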
What have we learned in this section? We set out to analyze an actor-critic formulation of our model, where the feedback signal that teaches the actor is computed by a different network called the critic. We found that our formulation (which is by no means the only possible one) enables the actor to learn accurate estimates of the payoffs and costs of the most advantageous action from the critic's feedback. The payoffs and costs of the other actions were not estimated as accurately, which was due to a sampling bias towards more rewarding options. This does not necessarily compromise behavior-after all, one may trust the model to provide accurate information on the actions that are most frequently picked, and thus to be helpful in the majority of cases. However, we believe that a more sophisticated actor-critic variant of our model could conceivably provide good estimates of the payoffs and costs of all actions. The development of this improved actor-critic variant is left to future work; here we merely demonstrate that our model is not meant to compete with actor-critic models, but rather to complement them.

Discussion
This article describes how the positive and negative consequences of actions can be separately learned on the basis of a single teaching signal encoding outcome prediction error. In this section, we relate the theory to data and other models, state experimental predictions, and highlight the directions in which the theory needs to be developed further.

Relationship to experimental data
The model described in this paper was shown in simulations to avoid actions that require effort when the motivational signal was reduced. The unwillingness to make an effort for reward in a dopamine-depleted state has also been observed in other paradigms: during a choice in a T-maze, dopamine-depleted animals were less likely to go to an arm with more pellets behind a barrier, but rather chose the arm with easily accessible but fewer pellets [28]. Parkinson's patients were not willing to exert as much physical effort by squeezing a handle in order to obtain reward as healthy controls, especially if they were off medication [29]. These effects can be explained in an analogous way [8] by assuming that in the dopamine-depleted state the effort of crossing the barrier or squeezing a handle is weighted more, resulting in lower activity of thalamic neurons selective for this option. Both in OpAL and in the model proposed here, reducing the dopamine level reduces the tendency to choose actions involving costs, and thus changes preferences.
Let us now consider how the weight changes in our model relate to known data on synaptic plasticity in the striatum. Fig 10b illustrates the weight changes when an animal performs an action involving a cost n in order to obtain a payoff p (Fig 10a), e.g. pressing a lever in order to obtain a pellet. The directions of change in G and N, which depend on the sign of δ, are consistent with the changes of synaptic weights of Go and No-Go neurons observed at different dopamine concentrations. Fig 10c shows experimentally observed changes in synaptic strengths when the level of dopamine is low (displays with white background) and in the presence of agonists (blue background) [11]. Note that the directions of change match those in the corresponding displays above, in Fig 10b. These directions of changes in striatal weights are also consistent with other models of the basal ganglia [8,12], but the unique prediction of the rules described in this paper is that the increase in dopaminergic teaching signal should mainly affect changes in G, while the decrease in dopamine should primarily affect N. Thus, the dopamine receptors on the Go and No-Go neurons should be most sensitive to increases and decreases in dopamine level, respectively. This matches the properties of these receptors. The D2 receptors on No-Go neurons have a higher affinity than the D1 receptors on Go neurons and are therefore sensitive to lower levels of dopamine [31]. This property is illustrated in Fig 10d, where the green and red curves show the probabilities of D1 and D2 receptors being occupied as a function of dopamine concentration. The blue dashed lines indicate the levels of dopamine in the striatum predicted to result from the spontaneous firing of dopaminergic neurons [32]. At these levels, most D1 receptors are deactivated. Thus the D1 receptor activation will change when the dopamine goes up, but not when it goes down, as indicated by the black arrows.
This is consistent with the stronger impact of positive prediction errors on the weight changes of the Go neurons implemented in Eq 2. By contrast, the D2 receptors are activated at baseline dopamine levels, so their activation is affected by decreases in dopamine level but little by increases, in agreement with a stronger impact of negative prediction errors on the No-Go neurons implemented in Eq 3.
Our model further requires decay of relevant weights whenever prediction errors are absent. In terms of neural implementation, this translates into mild LTD resulting from co-activation of the pre- and postsynaptic cells at baseline dopamine levels. Recently, this effect has been observed at cortico-striatal synapses in vivo [33]: in anesthetized rats, presynaptic activity followed by postsynaptic activity caused LTD in the absence of an induced dopaminergic response.
In summary, the plasticity rules that allow learning of positive and negative consequences are consistent with the observed plasticity and the receptor properties.
Recently, there has been a debate concerning the fundamental concept of basal ganglia function, i.e. the relationship between the Go and No-Go neurons: on one hand they have opposite effects on the tendency to make movements [2], but on the other hand they are co-activated during action selection [34]. The presented theory is consistent with both observations: it assumes that Go and No-Go neurons have opposite effects on movement initiation. But during action selection, the basal ganglia need to calculate the utility, which combines information encoded by both populations and may therefore require their co-activation.

Fig 10. (c) Experimentally observed changes in synaptic strength, from Figures 3E, 3B, 3F and 1H in [11]. (d) Changes in dopamine receptor occupancy. The green and red curves show the probabilities of D1 and D2 receptor occupancies in a biophysical model [30]. The two dashed blue lines in each panel indicate the levels of dopamine in dorsal (60 nM) and ventral (85 nM) striatum estimated on the basis of spontaneous firing of dopaminergic neurons using the biophysical model [32]. Displays with white and blue backgrounds illustrate changes in receptor occupancy when the level of dopamine is reduced or increased, respectively. https://doi.org/10.1371/journal.pcbi.1006285.g010
The proposed model assumes that while an animal makes an effort, the outcome prediction error should be negative, and thus the dopamine level should decrease. However, at the time of lever pressing the system needs to be energized to perform a movement, so one could expect an increased level of dopamine. Furthermore, voltammetry studies measuring dopamine concentration in the striatum did not observe a decrease in dopamine level during lever pressing [35]. Nevertheless, a recent study recording the activity of single dopaminergic neurons, which provided a better temporal resolution, reported that dopaminergic neurons increased their activity before movement, and then decreased it below baseline during movement [32]. The increase before movement may be related to energizing the system for movement, while the decrease during movement may be related to representing effort.
In addition to effort, other negative experiences lead to phasic decreases in dopaminergic activity as well: the unexpected experience of pain [36], aversive stimuli such as air puffs [37] and, for humans, monetary losses (literal costs) [38] all coincide with decreased activity of dopamine neurons. This supports the general idea that the No-Go pathway encodes costs of all kinds.

Experimental predictions
A direct test of the proposed model could involve recording the activity of Go and No-Go neurons (e.g. with photometry) during a task in which an animal learns the payoffs and costs associated with an action. Assuming that G and N are reflected in the activity of the Go and No-Go neurons while the animal evaluates an action (i.e. just before its selection), one could analyze the changes in the activity of Go and No-Go neurons across trials. One could then test whether they follow the pattern predicted by the rules given in this paper, or rather by other rules proposed to describe learning in striatal neurons [7,8,14].
Just as the OpAL model [8], the theory proposes that the positive and negative consequences are separately encoded by the Go and No-Go neurons which are differentially modulated by dopamine. The theory predicts that agonists specific to just one of the striatal populations change the effect of consequences encoded by this population without changing the impact of the other population. For instance, a D1 antagonist would suppress the reception of dopamine in the direct pathway. There, dopamine increases activity. Hence, the D1 antagonist would diminish the impact of the direct pathway, and therefore of learned positive consequences, on choices. However, it would not change the impact of the indirect pathway, i.e. the impact of learned negative consequences. This prediction could be tested in an experiment involving choices between options with both payoff and cost. Consider, for instance, the decision between a neutral option (p = 1, n = 1) and a high-payoff option (p = 2, n = 1). Since a D1 antagonist decreases the impact of payoffs on decisions, it should decrease the preference for the high-payoff option. On the other hand, the avoidance of a high-cost option (p = 1, n = 2) over the neutral option should not be affected by the D1 antagonist, since it does not affect the impact of costs on decisions.
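The D1-antagonist prediction can be illustrated with a small numerical sketch. It assumes, purely for illustration, that the utility takes the linear form T = k_go · D · G − (1 − D) · N, that the weights have already converged to G = p and N = n, and that the antagonist acts as a hypothetical gain k_go < 1 on the Go pathway. The exact form of Eq 25 may differ from this assumption.

```python
# Sketch only: the linear utility and the gain k_go are assumptions made for
# illustration; they are not taken from the paper's Eq 25.

def thalamic_activity(G, N, D=0.5, k_go=1.0):
    """Assumed utility: Go input weighted by D, No-Go input weighted by (1 - D).
    k_go < 1 mimics a D1 antagonist weakening dopamine's effect on Go neurons."""
    return k_go * D * G - (1.0 - D) * N

# Options written as (G, N), assuming learned weights equal payoff p and cost n.
neutral = (1.0, 1.0)
high_payoff = (2.0, 1.0)
high_cost = (1.0, 2.0)

def advantage(a, b, k_go):
    """How much option a is favoured over option b for a given Go-pathway gain."""
    return thalamic_activity(*a, k_go=k_go) - thalamic_activity(*b, k_go=k_go)

for k_go in (1.0, 0.5):  # control versus simulated D1 antagonist
    print(
        f"k_go={k_go}: "
        f"preference for high payoff = {advantage(high_payoff, neutral, k_go)}, "
        f"avoidance of high cost = {advantage(neutral, high_cost, k_go)}"
    )
```

Under these assumptions, lowering k_go shrinks the advantage of the high-payoff option but leaves the advantage of the neutral option over the high-cost option unchanged, because the two options in the second comparison share the same G.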
It could also be worthwhile to investigate whether changing the influence of positive and negative consequences on choice can not only be achieved by pharmacological manipulations, but also by changing a behavioral context such as hunger, or reward rate which has been shown to affect the average dopamine level [19].
The theory assumes that the synaptic plasticity rules include a decay term proportional to the value of the synaptic weights themselves. Decay terms are also present in other models of learning in the basal ganglia [15,39,40]. This class of models predicts that synaptic weights of striatal neurons which are already high increase less during potentiation than smaller weights (the opposite prediction is made by the OpAL model [8], where the weights scale the prediction error in the update rule). This prediction could be tested by observing the Excitatory Post-Synaptic Currents (EPSCs) evoked at individual spines. The class of models including decay predicts that the spines with smaller evoked EPSCs before inducing plasticity should be more likely to potentiate.
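The contrast between the two classes of rules can be made concrete with a minimal sketch. The decay rule below is a simplified stand-in (a potentiation term minus a decay proportional to the weight, for a positive prediction error only), and the parameter values alpha and lam are hypothetical; the OpAL-style rule uses the weight as a multiplier of the prediction error, as described in the text.

```python
# Sketch only: simplified update rules with hypothetical parameter values.

def dG_with_decay(G, delta, alpha=0.1, lam=0.05):
    """Weight change for a rule with a decay term proportional to the weight."""
    return alpha * delta - lam * G

def dG_opal(G, delta, alpha=0.1):
    """OpAL-style weight change: the weight scales the prediction error."""
    return alpha * G * delta

delta = 1.0  # the same positive prediction error applied to both synapses
for G in (0.2, 1.0):  # a weak and a strong synapse
    print(f"G={G}: decay rule dG={dG_with_decay(G, delta):.3f}, "
          f"OpAL rule dG={dG_opal(G, delta):.3f}")
```

With the decay rule the weaker synapse potentiates more, whereas with the OpAL-style rule the stronger synapse potentiates more, which is the difference the proposed EPSC experiment would probe.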

Relationship to other theories
The proposed model builds on the seminal work of Collins and Frank [8], who proposed that the Go and No-Go neurons learn the tendency to execute and inhibit movements, and how the level of dopamine changes the influence of the Go and No-Go pathways on choice. The key new feature of the present model is the ability to learn both payoffs and costs associated with a single action. We demonstrated above that when the model repeatedly selects an action resulting first in a cost and then in the payoff, G and N (under certain conditions that we specified) converge to the magnitudes of that payoff and cost. This is not so in the original OpAL model, as we shall show in a brief analysis.
Collins and Frank [8] demonstrated that when the environment is stationary and prediction error δ converges to zero, then the weights G and N in the OpAL model converge to bounded values. However, we will show that Go and No-Go weights converge to zero when an action that results first in a cost and then in the payoff is repeatedly selected.
The OpAL model is based on the actor-critic framework; hence, the prediction error is defined as in Eq 36. The weights of the critic are modified simply as ΔV = αδ. The weights of the actor are modified according to the following equations [8]:

$$\Delta G = \alpha G \delta \tag{37}$$

$$\Delta N = \alpha N (-\delta) \tag{38}$$

Fig 11 shows how the weights change in a simulation of the OpAL model. The weights of the critic approach a value close to the average of payoff and cost. Let us consider what happens in the model once the critic weight stops changing between trials (i.e. from ~10th trial onward in Fig 11). The weight of the critic still changes within a trial, i.e. it decreases when the cost is incurred and increases after a payoff. This happens because the prediction error oscillates around 0, i.e. it is equal to δ = −d while incurring a cost and δ = d while receiving a payoff, where d is a constant. If so, let us consider how a Go weight changes within a trial. According to Eq 37 the weight changes as follows:

$$G_{\text{after cost}} = G_{\text{before action}} - \alpha G_{\text{before action}} d \tag{39}$$

$$G_{\text{after payoff}} = G_{\text{after cost}} + \alpha G_{\text{after cost}} d \tag{40}$$

Substituting Eq 39 into Eq 40 we obtain:

$$G_{\text{after payoff}} = G_{\text{before action}} - \alpha G_{\text{before action}} d + \alpha \left(G_{\text{before action}} - \alpha G_{\text{before action}} d\right) d = \left(1 - \alpha^2 d^2\right) G_{\text{before action}}$$

We see that within a trial a Go weight decays proportionally to its value, resulting in the exponential decay across trials seen in Fig 11. Analogous calculations show that the No-Go weight decays in the same way. We conclude that the OpAL model is unable to estimate positive and negative consequences for actions which result in both payoffs and costs. It is worth noting that the decay of actor weights to zero demonstrated above is specific to the version of the basal ganglia model proposed by Collins and Frank [8], and would not be present in another version of the model [39], where the learning rules include a special term preventing the weights from approaching zero. On the other hand, nothing in the above calculation depended on G, N and V updating at the same learning rate α: the derivation can be carried out in exactly the same way assuming α_V ≠ α_N ≠ α_G. Hence, we may summarise that even such generalised OpAL models must fail to learn payoffs and costs of actions, irrespective of the specific parameter values, unless further terms are added to the learning rules. Our analysis suggests that learning payoffs and costs can be enabled by different effective learning rates after positive versus negative feedback for Go and No-Go synapses, which in our model is achieved by setting ε < 1.
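The decay of the OpAL actor weights can be checked numerically with a short sketch. It assumes that the prediction error has the standard form δ = r − V (Eq 36 is not reproduced here), that the cost enters as reinforcement r = −n and the payoff as r = p, and that one learning rate α is shared by critic and actor. The function name and all parameter values are hypothetical.

```python
# Sketch only: delta = r - V is an assumed form of the prediction error, and
# the parameter values are illustrative rather than those used for Fig 11.

def simulate_opal(p=2.0, n=1.0, alpha=0.1, trials=200, G0=1.0, N0=1.0, V0=0.0):
    """Repeatedly select one action that yields a cost and then a payoff."""
    V, G, N = V0, G0, N0
    history = []
    for _ in range(trials):
        for r in (-n, p):                # cost first, then payoff
            delta = r - V                # assumed prediction error
            G += alpha * G * delta       # Go weight, OpAL rule
            N += alpha * N * (-delta)    # No-Go weight, OpAL rule
            V += alpha * delta           # critic
        history.append((V, G, N))
    return history

history = simulate_opal()
for trial in (1, 10, 50, 200):
    V, G, N = history[trial - 1]
    print(f"trial {trial:3d}: V={V:.4f}  G={G:.4f}  N={N:.4f}")
```

In this sketch the critic settles to a constant value between trials while both actor weights shrink by the same factor on every trial, in line with the factor 1 − α²d² obtained in the derivation. Under the stated assumptions the steady-state magnitude of the prediction error works out to d = (p + n)/(2 − α).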
To interpret this result, note that we do not claim that the OpAL model is incapable of optimizing the policy. It is set up as a policy-improving algorithm, and might even reflect the payoffs and costs of actions in the weights G and N in certain situations. However, as we have shown, there are also situations in which OpAL is not able to encode the payoffs and costs. In contrast, we showed above that the model presented in this paper does encode payoffs and costs in any situation, given a suitable set of parameters and enough time to learn.
The model described in this paper has been shown to account for the effects of dopamine depletion on the willingness to make an effort, which have also been simulated with the OpAL model. To simulate the effects of dopamine depletion on the choice between an arm of a T-maze with more pellets behind a barrier and an arm with fewer pellets, Collins and Frank [8] trained a model on three separate actions: eating in the left arm, eating in the right arm, and crossing a barrier. In this way, it was ensured that each action had just a payoff or just a cost, and the model could learn them. Subsequently, during choice, the model decided between a combination of two actions (e.g. crossing a barrier and eating in the left arm) and the other action. By contrast, the model proposed in this paper chose just between the two options available to an animal in an analogous task (Fig 6), because it was able to learn both payoffs and costs associated with each option. This is a useful ability, as most real-world actions have both payoffs and costs.
In the original paper introducing the plasticity rules [16], it was proposed that the rules allow the Go and No-Go neurons to encode reinforcement variability because when an action results in variable reinforcements, both G and N increase during learning. It was further proposed that the tonic level of dopamine controls the tendency to make risky choices, as observed in experiments [41], because it leads to emphasizing potential gains, and under-weighting potential losses. However, here it is proposed that the striatal learning rules primarily subserve a function more fundamental for survival, i.e. learning payoffs and costs of actions. From this perspective, the influence of dopamine level on the tendency to make risky choices arises as a by-product of a system primarily optimized to weight payoffs and costs according to the current motivational state.

Directions for future work
There are multiple directions in which the presented theory could be extended. For example, the theory has to be integrated with the models of action selection in the basal ganglia to describe how the circuit selects the action with the best trade-off of payoffs and costs. Furthermore, the theory may be extended to describe the dependence of the dopaminergic teaching signal on the motivational state. Learning experiments in which an animal may be deprived of physiologically required substances suggest that both terms in the outcome prediction error encoded by dopamine (i.e. the reinforcement and the expected outcome) are scaled by motivation [42]. It would be interesting to incorporate such scaling in our model, where the direct pathway, as well as the indirect pathway, contribute to the outcome estimate, which is then compared to the experienced reinforcement to compute the prediction error. If dopaminergic modulation is taken into account also at this stage, the dopaminergic motivation signal should affect the outcome estimate, and hence influence learning.
A limitation of our current model is the rudimentary form of the basal ganglia output, given in Eq 25. It is known that the effect of dopamine on the activity in the two pathways is not linear (as assumed in this paper), but exhibits saturation effects. The fact that the reception of dopamine is nonlinear plays a crucial role in the learning part of our model (the piecewise linear functions f_ε introduce exactly that nonlinearity), and could also be implemented at the decision-making stage, if the activity of Go and No-Go neurons (combined in Eq 25) depended nonlinearly on the dopamine level. In such a more elaborate formulation, the fine-tuning of the baseline dopamine level becomes critical. Including nonlinear effects of dopamine on activity during choice would allow studying interactions between learning and decision making, which would both be affected by the position of the baseline and the strength of the nonlinearity.
It is intriguing to ask whether the evaluation of actions combining separately encoded positive and negative consequences is also performed by areas beyond the basal ganglia. Indeed, positive and negative associations are encoded by different populations of neurons in the amygdala [43]. Moreover, an imaging study [44] suggests that costs and payoffs are predicted by the amygdala and the ventral striatum respectively, and ultimately compared in the prefrontal cortex. Furthermore, different cortical regions preferentially project to Go or No-Go neurons [45], raising the possibility that the positive and negative consequences are also encoded separately in the cortex. Therefore, it seems promising to investigate if similar plasticity rules could also describe learning beyond the basal ganglia.

Materials and methods
During simulations of an experiment by Salamone et al. [22], the model received payoff p_chow = 1 for approaching the lab chow, and payoff p_pellet for choosing a pellet. The model was simulated in two conditions differing in the cost of choosing a pellet, which was equal to n_pellet = 0 in the free-pellet condition, and to n_pellet = n_lever in a condition requiring lever pressing to obtain a pellet. There was no cost of choosing lab chow (n_chow = 0) in either condition.
For each condition, the model was simulated in two operational modes: in the control state, the coupling κ_N of dopamine to the D2-expressing neurons was fixed at κ_N = 1 during choice (making manifest the assumed fully functioning dopaminergic modulation in the control group). Conversely, in the state corresponding to the presence of the D2-antagonist haloperidol, κ_N was treated as a variable valued in [0, 1], now allowing for impaired dopaminergic regulation. The level of dopamine D was kept fixed at D = 0.5 throughout, assuming a largely unaltered baseline level for both groups.
For each condition and state, the behavior of N_rats animals was simulated. Each simulation consisted of 180 training and 180 testing trials (each animal in the experiment of [22] was tested for 30 minutes, so 180 trials correspond to the assumption that a single trial took 10 s). At the start of each simulation, the weights were initialized to G_pellet = N_pellet = G_chow = N_chow = 0.1. During each training trial, the model experienced choosing a pellet as well as approaching the lab chow. In detail, it received the cost n_pellet, modified the weights G_pellet and N_pellet, then received the payoff p_pellet and modified the weights again, and analogously for the lab chow. During each testing trial, the thalamic activity for each option was calculated from Eq 25, and Gaussian noise with standard deviation σ was added. The option with the highest thalamic activity was selected, and if this activity was positive, the action was executed, resulting in the corresponding cost and payoff and weight modification. If the thalamic activity for both options was negative, no action was executed and no weights were updated.
The values of the model parameters p_pellet, n_lever, κ_N and σ were optimized to match the choices made by the animals. In particular, for each set of parameters, the model was simulated N_rats = 100 times, and the average number of choices c^sim_{i,j,k} of option i in dopamine state j and experimental condition k was computed. The mismatch with the corresponding consumption in the experiment, c^exp_{i,j,k}, was quantified by a normalized summed squared error:

Cost = [equation missing from the source text]

In the above equation, Z^dataset_k is a normalization term equal to the total number of choices or consumption in a particular condition:

[equation missing from the source text]

The values of the parameters minimizing the cost function were sought using the Simplex optimization algorithm implemented in Matlab, and the following values were found: p_pellet = 15.511751, n_lever = 14.510517, κ_N = 0.7507 and σ = 1.066246.
Subsequently, the model with these optimized parameters was simulated with N_rats = 6, which was the number of animals tested by [22]. The resulting mean numbers of choices across animals are shown in Fig 8.
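The selection step of a testing trial, as described above, can be sketched as follows. The function takes the thalamic activities as given, so it does not depend on the form of Eq 25; the function name, the dictionary interface and the example activities are hypothetical.

```python
import random

# Sketch only: names and example values are illustrative, not from the paper.

def select_action(thalamic, sigma, rng):
    """Add Gaussian noise (standard deviation sigma) to each option's thalamic
    activity, pick the option with the highest noisy activity, and execute it
    only if that activity is positive. Returns the option name, or None when
    no action is executed."""
    noisy = {option: t + rng.gauss(0.0, sigma) for option, t in thalamic.items()}
    best = max(noisy, key=noisy.get)
    return best if noisy[best] > 0 else None

rng = random.Random(0)  # fixed seed so the run is reproducible
activities = {"pellet": 0.4, "chow": 0.3}  # hypothetical thalamic activities
choices = [select_action(activities, sigma=1.0, rng=rng) for _ in range(180)]
for option in ("pellet", "chow", None):
    print(option, choices.count(option))
```

In a full simulation the weights of the executed option would be updated after each choice; that step is omitted here because it depends on the learning rules defined elsewhere in the paper.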