The dopamine circuit as a reward-taxis navigation system

Studying the brain circuits that control behavior is challenging, since in addition to their structural complexity there are continuous feedback interactions between actions and sensed inputs from the environment. It is therefore important to identify mathematical principles that can be used to develop testable hypotheses. In this study, we use ideas and concepts from systems biology to study the dopamine system, which controls learning, motivation, and movement. Using data from neuronal recordings in behavioral experiments, we developed a mathematical model for dopamine responses and the effect of dopamine on movement. We show that the dopamine system shares core functional analogies with bacterial chemotaxis. Just as chemotaxis robustly climbs chemical attractant gradients, the dopamine circuit performs ‘reward-taxis’ where the attractant is the expected value of reward. The reward-taxis mechanism provides a simple explanation for scale-invariant dopaminergic responses and for matching in free operant settings, and makes testable quantitative predictions. We propose that reward-taxis is a simple and robust navigation strategy that complements other, more goal-directed navigation mechanisms.

Introduction Dopamine transmission in the midbrain has several major biological functions for the regulation of behavior and learning. Dopamine signals encode reward prediction errors (RPEs) [1][2][3][4][5][6] ( Fig 1A). Reward prediction errors are the difference between the experienced and predicted rewards. They play a key role in a method of reinforcement learning called temporal difference learning (TD learning) [5][6][7][8], and theory from reinforcement learning has been pivotal for explaining dopamine function.
In addition to rapid sub-second responses encoding reward prediction errors, dopamine may also slowly ramp up when approaching a reward [9][10][11]. A recent experiment by Kim et al. showed that even on this slower timescale (seconds), dopamine levels track a derivative of the input signal [12]. Kim et al. used virtual reality to manipulate the rate of change of the movement of a mouse as it moved towards a target. Dopamine changed in a way that is consistent with computing a temporal derivative of an input field that decays away from the reward (Fig 1B, such an input field is referred to as the spatially-discounted expected reward).
Considering both the signal processing and movement-invigorating properties of dopamine reveals an intriguing feedback that is inherent to the system (Fig 1D). Since dopamine computes a temporal derivative on a spatial input field, the modulation of movement speed by dopamine should by itself affect dopaminergic output (as directly observed by Kim et al. [12], Fig 1C). Thus, in the context of a moving animal, the different roles of dopamine become tightly coupled. Analyzing such feedback interactions is challenging for current theoretical frameworks for dopamine, which typically model behavior using discrete choice processes occurring in discrete steps (see for example [25]).
Our study aims to use concepts from systems biology to analyze functional properties of the interaction between sensing and movement in the dopamine system. We first develop a minimal mechanistic model of dopamine responses. The model is similar to a continuous version of the classic TD-RPE model, with an important modification based on dopaminergic response curves-the circuit is activated by the logarithm of expected reward. Our model provides a new and simple explanation for the puzzling rescaling of dopaminergic responses [26]. Notably, the model establishes a connection between the dopamine circuit and the concepts of exact adaptation and fold-change detection, which have fundamental importance in the systems biology of signaling circuits [27].
We then use the model to study the interaction between dopamine signaling and movement. We considered one of the best-established empirical behavioral laws-the matching law of operant behavior, where the ratio of responses to concurrent rewarding alternatives is proportional to the ratio of obtained rewards raised to a power β (where β~1). Matching is typically observed in experiments where the animal is allowed to move freely, presenting a challenge to modelling approaches based on discrete choices and time steps. By considering a simple movement model, which we call reward-taxis, we show that the dopamine model predicts matching and provides a quantitative value for β as the ratio of dopamine gain to baseline activity. Matching results from the mathematical analogy between stochastic movement guided by reward-taxis and algorithms for the sampling of probability distributions. We conclude by proposing that reward-taxis is a simple and robust navigation strategy that complements other, more goal-directed navigation strategies employed by animals.

Dopamine release as fold-change detection of expected reward
We begin by developing a minimal model for continuous dopamine dynamics (Fig 2A). Consider a behaving animal exploring an open field for a reward of magnitude u, such as a food or drink. For simplicity, we assume a uniform response amongst all dopaminergic neurons. In reality, there are heterogeneities between and within midbrain structures, and some dopaminergic neurons may be specialized to specific cues [29][30][31]17,[32][33][34][35]. As the animal approaches the reward, there is an increase in expected reward R, which decays with distance from the target [9,12,36].
Here expected reward is defined based on TD learning: the expected temporally discounted sum of present and future rewards (see Methods). According to the TD-RPE theory of dopamine function, the difference between dopamine and its baseline Δd encodes a prediction error signal about expected rewards [4,5]. The prediction error signal allows the agent to learn about the spatial input field R with recursive learning rules [7,37].
What is the quantitative relationship between reward magnitude and the dopaminergic response? Recordings from VTA dopaminergic neurons in mice reveal a sublinear relationship between the magnitude of received rewards u and Δd(u) [30,38] (Fig 2B). The sublinear relationship indicates that dopamine neurons may (at least in some magnitude range) be  [38]) is well-described by the logarithmic relation Δd = μ log(au+b) (red line r 2 = 0.999, best-fit parameters a = 0.5±0.1, b = 1.5±0.05, μ = 4.9±0.45, N = 7 reward magnitudes). When the reward is preceded by an odor cue (gray squares) the response is well-described by subtracting a constant from the uncued response (pink line, r 2 = 0.99, best-fit subtraction constant is −3.2±0.1). (C) Simulation of dopamine and GABA responses to a step increase in expected reward input which corresponds to the presentation of a reward-predicting cue. The step is given by R(t) = R 0 +λθ(t−t 0 ) where θ(t−t 0 ) is a unit step function, R 0 = 1 and λ = 7 for the black line (large reward) and λ = 1.8 for the gray line (small reward). Insets. Average change in firing rates from dopaminergic (type I) and GABAergic (type II) VTA neurons, in response to reward-predicting cues for a small reward (gray) or a large reward (black). Data from Fig 2D of [49]. (DE) Population responses of dopaminergic neurons of two Macaque monkeys to variable size liquid reward, either without a preceding cue (left panels, n = 55 neurons), or with a preceding visual cue that predicts reward delivery with 50% probability (right panels, n = 57 neurons). (D) The expected-reward input following the reward-predicting cue is R = 0.5(b+λu) (where λ is proportional to reward magnitude), which is doubled following reward delivery, R = b+λu, where u is the reward magnitude (we use b = 2 and λ = 10ml −1 ). Dashed lines correspond to reward omission. (E) Experimentally measured average dopaminergic responses, using data from Fig 2A and Fig 4B of [26]. When the reward is delivered without a cue, dopaminergic responses increase with reward magnitude (left panel). When it is given after a cue that predicts reward delivery with 50% probability, dopaminergic responses to reward delivery are identical (right panel), as predicted by the FCD property. This is despite the 10-fold difference in reward magnitude. (F) Simulations of the dopamine model capture the experimentally observed dynamics. All simulation parameters are provided in Table 1

PLOS COMPUTATIONAL BIOLOGY
The dopamine circuit as a reward-taxis navigation system responding to the logarithm of reward, namely Δd(u) = μ log(R(u)), with R(u) = au+b, where a is a scaling factor and b is a magnitude-independent component of the reward activation. The parameter μ is the gain of the dopaminergic response. Logarithmic responses are consistent with widespread logarithmic coding in the brain [39][40][41][42][43][44] as well as with economic notions of utility [45,46]. To test this, we fit the function μ log(au+b) to the average dopaminergic response to a variable water reward in mice [30], finding an excellent fit (r 2 = 0.999), with a gain of μ = 4.94 ±0. 45.
In addition to activation by expected reward, dopaminergic neurons in the VTA are inhibited following the presentation of a predicting cue in a subtractive manner ( [30,38], Fig 2B, the presented cue is the same for all reward magnitudes). The subtractive inhibition is thought to be due to the increase in the activity of adjacent GABAergic neurons [30,38]. We therefore propose the following minimal description of dopamine release dynamics: Where C is the baseline activity of the dopaminergic neurons when log R = g = 0, μ is the gain, R is perceived expected reward, and α is the effectiveness of inhibition by the GABAergic output g. Note that both the expected reward R(t) and the GABAergic output g(t) are dynamical, time-dependent variables. Since our model focuses on the regulation of behavior, rather than on learning or representation, we will assume that the log-transformed expected reward log R is an input signal that is provided to the circuit. Additionally, while subtractive inhibition was established for VTA dopaminergic neurons, we assume that similar regulation is shared among all midbrain dopaminergic neurons.
To complete the model requires a minimal description of the dynamics of GABAergic output g. The mechanisms of interaction between GABAergic and dopaminergic neurons are complex and there are many local and remote interactions [47,48]. However, there are experimental observations that impose constraints on these interactions. Upon presentation of a reward-predicting cue-equivalent to a step increase in R(t)-dopamine d(t) rapidly increases and then drops and adapts precisely to its baseline on a sub-second timescale [5,26,49] (Fig  2C), a phenomenon called exact adaptation, while GABAergic activity g(t) increases to a new steady-state that tracks R(t) [49] (Fig 2C).
Exact adaptation is a well-studied property of biological circuits, which can be implemented by a handful of specific feed-forward and feedback mechanisms [50,51]. Since we do not know the mechanistic implementation of the adaptation property in the dopamine circuit, we make the simple assumption of a negative feedback loop. In this design, inhibitory neuron activity g is given by an integral-feedback equation with respect to dopamine release d: Or, more generally, _ g ¼ FðdÞ where F(d) increases with d and has a single zero at d = d 0 . This feedback loop generates exact adaptation of d: the only steady state solution is d = d 0 , which is the homeostatic activity level of dopaminergic neurons. This is about~5 spikes/s in mice [30,52]. The parameter ω determines the adaptation time of the dopaminergic neurons after a change in R(t). This timescale is on the order of hundreds of milliseconds. For the GABAergic neurons, after a step change in R(t), the steady-state firing rate in the model increases proportionally to the logarithm of R(t), such that g ¼ CÀ d 0 þmlogR a (this is because GABAergic output integrates dopaminergic activity). Finally, dopamine release represents a temporal-derivative-like computation of R(t) as observed by Kim et al. [12] (S1 Fig). Taken together, Eqs 1 and 2 provide a minimal model for dopamine responses to expected reward inputs R(t). The model is similar to the classic TD-RPE model of dopamine function and can explain the classic observations of prediction error signals that occur during learning. However, there is an important difference: in the TD-RPE model dopamine is activated by expected reward, whereas in our model it is activated by the logarithm of expected reward. For learning R, this difference requires a slight modification of the recursive learning rules (Methods). In the following sections we will show that logarithmic sensing has crucial implications for the dynamics of learning and behavioral regulation.

Model predicts scale-invariant dopamine responses
We now show that the model given by Eqs 1 and 2 can explain one of the most puzzling observations on dopaminergic responses-the scale invariant dopamine responses observed by Tobler et al. [26] (Fig 2D and 2E). In their experiment, Tobler et al. recorded midbrain dopamine neurons of primates presented with a visual cue that predicted liquid rewards with 50% probability. Three cues were presented in pseudorandom order, and each of the cues predicted a reward of a different magnitude over a 10-fold range (0.05ml, 0.15ml, 0.5ml, note that this is in contrast to the case presented in Fig 2B where the same cue was used for all reward magnitudes). Both predictive stimuli and reward reception elicited a positive dopaminergic response, as expected from the TD-RPE theory. However, while the response to the predictive stimuli increased with expected reward magnitude (S2 Fig), the response to the reward itself was invariant despite large differences in reward magnitude. This scale invariance is not consistent with the classical TD-RPE model which predicts that responses to rewards should also increase with reward magnitude [53]. In order to explain this puzzling observation, it has been suggested that there is a normalization process that scales dopamine responses, e.g. according to the standard deviation of the reward distributions [26,53].
Here we show that the observations of Tobler et al. [26] can be explained by our model without invoking any additional normalization process ( Fig 2F). The reason for this is that the model has a circuit feature known in systems biology as fold-change detection (FCD) [54,55]. FCD is a property where after adaptation the circuit output depends only on relative changes in the input, rather than absolute changes. FCD circuits output the temporal (logarithmic) derivative of low-frequency input signals [56][57][58]. We therefore call the dopamine model presented here the dopamine-FCD model (see Methods for a proof that the model has the FCD property).
To see why the FCD property can explain the observations of Tobler et al., [26] consider the input function for each of the cue-reward sequences. When a reward-predicting stimulus appears, the expected reward changes from its previous baseline value in a step-like manner to some value 0.5R that depends on predicted reward magnitude, causing a dopamine response that increases with R. At the time point when the reward itself is received, the input function increases by~2-fold (from 0.5R to R, which again for simplicity is modeled as a step increase). Since the dopaminergic response depends only on the fold-change in input, the model predicts identical responses, as observed by Tobler et al. [26] The FCD property causes the learning and behavior-regulating functions of dopamine to be invariant (i.e., not affected by) multiplying the input field by a scalar [54]-in other words, by multiplying all expected rewards by a constant factor λ. The model thus predicts scaleinvariance of the dopamine system. This property may be crucial for the dopamine circuit, since rewards can vary widely in magnitude.

A reward-taxis model for dopamine regulation of behavior
In the following section we will consider whether we can use Eqs 1 and 2 to gain insight into the regulation of behavior by dopamine. To link the dopamine circuit to animal behavior, we first provide an additional equation as a minimal model for dopamine control of motion. The equation is motivated by the well-established role of dopamine as a regulator of action vigor and locomotor activity [13,17,[59][60][61]. The equation posits that dopamine d increases movement speed v: Where v 0 is movement speed at the homeostatic level d = d 0 . Dopamine d in Eq 3 corresponds to the (normalized) dopaminergic activity in the brain, so that loss of dopaminergic neurons (as occurs in Parkinson's disease) reduces d proportionally, which effectively reduces homeostatic movement speed v 0 . We assume a proportional relation between movement speed and d, which is consistent with the gradual increase in movement vigor with dopaminergic activity (see Fig 1J and 1K in da Silva et al. [17]).
When the animal is moving towards higher expected rewards, that is, the input field R(x) increases as the animal moves, d rises above its baseline and movement speed increases; conversely, when the animal moves down R(x), movement speed decreases. The change in d may be transient if the change in the input is step-like or gradual (linear); veering away from the reward, however, will result in an undershoot in d, and such movements will be inhibited. More generally, since d tracks the temporal logarithmic derivative of the input, then as the animal moves its speed will be continuously modulated according the spatial gradients of log R (x), with movements up the gradient invigorated and movements down the gradient inhibited. This movement regulation, defined by Eq 1-3, results in spending longer times near the peaks of the reward field R(x). We thus call this model reward-taxis.

Reward-taxis quantitatively provides the matching law of operant behavior
In the following section we will argue that the reward-taxis model provides a distinct and quantitative explanation for the general matching law of operant behavior, one of the bestestablished behavioral phenomena [62][63][64][65][66][66][67][68][69][70]. Matching is typically observed in concurrent reward schedules where a freely behaving animal harvests rewards that appear stochastically in two separate locations x 1 , x 2 . The rewards are depleted after harvesting and renew after a random time-interval drawn from a memoryless distribution. In the simplest setting, the same reward is provided in both locations but the average renewal time differs between the locations. In more general settings other parameters (e.g. amount or quality of reinforcement) can vary [71]. There is also usually a cost to switching between options. The matching law, in its time-allocation form [72], posits that the long-term average of the relative amount of time the animal chooses each reward location P(x 1 ), P(x 2 ) goes as a power β of the ratio of rewards harvested from the locations R 1 , R 2 (Fig 3A and 3B): where R 1 , R 2 correspond to the expected reward at each location (the product of rate and amount of reinforcement [72]). The parameter k is a bias term which corresponds to the tendency of the animal to prefer one reward over another even when reinforcement is equal (R 1 = R 2 ). The bias term typically varies between experiments. The matching law was originally proposed with perfect matching β = 1 [73]. A large number of studies in various vertebrate species under different experimental conditions observed that β can be somewhat variable, showing slight undermatching (β<1) and overmatching (β>1), with the former more commonly observed [63][64][65]68,69,74,75]. Matching has also been observed in wild animal foraging [76,77]. Matching is a robust property which holds over orders of magnitude of reward ratios (up to~1:500 in pigeons [65]). The overall robustness of the matching law has led authors to suggest that it reflects intrinsic properties of the vertebrate nervous system [69].
Matching is an emergent property of the free behavior of the animal, but its underlying origins are unclear. The continuous, free behavior of the animal contrasts with the discrete choice experiments where the animal can harvest rewards from two locations, which may be baited or empty. Following the consumption of the reward, the location becomes re-baited after some random amount of time, which may be different between the locations. The generalized matching law, amply supported experimentally, is a power-law relation between the reward harvested and the frequency of responses Pðx 1 Þ dopamine. Random walk model simulation for choice between two expected rewards R 1 , R 2 . Expected reward input field is the sum of two Gaussians: R x ð Þ ¼ ). An example of a simulation is presented in panel C. The model was simulated for different ratios of R 1 /R 2 , and the response ratio was estimated by the ratio of the time spent ±2.5cm from A = x 1  trials processes that are typical for reinforcement learning models. Matching is not optimalthe optimal policy would be for the animal to regularly switch between the alternatives [78], whereas behavior in experiments is characterized by a memoryless switching process (that is, fixed switching probabilities) [79]. Previous explanations for the general matching law assumed an underlying choice process, such as competition between the groups of neurons representing each reward location [80][81][82][83].
Here we provide a novel and distinct explanation for the matching law based on the reward-taxis mechanism (Eqs 1-3). We show that matching is a robust emergent property of the dopamine system, and, moreover, we provide an estimate for β in terms of parameters of the dopamine system that can be directly inferred from neuronal recordings (Fig 3C and 3D).
To test whether the model provides the matching law, we model the dynamics of the location of a behaving animal as a stochastic process. Let R(x) be the input field, which is the expected reward R as a function of location x. We assume that R(x) is fixed at x 1 , x 2 by the harvested rewards R 1 , R 2 . To account for the stochastic behavior of the animal in the matching experiments, we model its movement as a biased random walk process, where the animal moves in straight lines at speed v (modulated by dopamine) and reorients at random with some fixed probability τ −1 . Allowing the new direction to be correlated with the previous direction does not affect the conclusions, and a model where τ (rather than v) is modulated by dopamine leads to the same conclusions.
This biased random walk model is analogous to bacterial chemotaxis: bacteria such as E. coli use a run-and-tumble navigation strategy to climb gradients of chemical attractants (Fig 4A-4E) [54,56,84,85]. The bacterial chemotaxis signaling network is based on an FCD circuit that controls run duration [54,86]. It therefore maps onto dopamine release dynamics, where expected-reward inputs play the role of chemoattractant concentration in chemotaxis.
The key difference is that in chemotaxis the input field results from the diffusion of attractant molecules, whereas in the dopamine system the input field (expected reward) is learned by the animal.
At long time-and length-scales, run-and-tumble motion resembles Brownian motion [87], with a diffusion coefficient D�m −1 v 2 τ, where m is the dimension (we assume that d is close to the adapted level d 0 ). Brownian motion in the model is biased by longer runs when the agent is moving up the gradient. To account for this, one can add to the diffusion process an advection term that is proportional to the logarithmic gradient: χr log R(x) [88][89][90]. The advection term corresponds to the average flow at location x. Taken together, the stochastic dynamics of the agent are approximated by a Langevin equation similar to the classic Keller-Segel equation used to model chemotaxis [91]: where W is an m-dimensional Wiener process (see S1 Text for the derivation of Eq 5). For relatively short runs compared with the adaptation time (sub-second timescale), the advection parameter χ, also called chemotactic drift, is given by: w � m À 1 v 2 t � md À 1 0 [88], and thus rises with velocity v, gain μ and mean run duration τ. It is important to note that Eq 5 holds also for the model variant in which dopamine modulates τ (see S1 Text).
Eq 5 captures how animal movement depends on the parameters of the dopamine circuit, as well as on movement parameters τ, v. Decreasing the average run duration τ or average movement speed v (as in Parkinson's disease) decreases both diffusivity D and advection χ, resulting in slower effective motion and gradient climbing. Gradient climbing efficiency (chemotactic drift) increases with md À 1 0 , which is the gain of dopamine neurons normalized by their baseline activity. Other circuit parameters do not affect movement dynamics.
Eq 5 is equivalent to the Langevin Monte Carlo algorithm, a widely-used algorithm from statistical physics for sampling probability distributions and for global optimization [92][93][94][95]. The steady-state distribution can be readily solved, using standard methods of statistical physics, similar to a Boltzmann distribution in a potential field. The motion samples a probability distribution P(x) proportional to a power β of the expected-reward distribution: where the power-law β equals the normalized gain of the dopaminergic neurons: From Eq 6 we infer that for any two expected rewards R 1 , R 2 the response rates P(x 1 ), P(x 2 ) obey the general matching law of Eq 4. We note that this results in matching with k = 1, but any biases in reward preference, which are multiplicative in R 1 or R 2 , will result in a fixed bias term k6 ¼1.
The reward-taxis model therefore predicts operant matching with a power law of b ¼ md À 1 0 . Thus, β is equal to the average ratio of gain to baseline activity in dopaminergic neurons. As mentioned above, these values can be estimated from the neuronal measurements of Eshel et al. [30], μ�5 spikes/s and d 0 �5 spikes/s. These values yield β�1, in agreement with the matching law. Similar parameters are found also in primates (S2 Fig). The agreement is striking since there is no a-priori reason for the gain and baseline to be similar; normalized gain md À 1 0 could in principle have a wide range of values including md À 1 0 � 1 or md À 1 0 � 1. The matching exponent β only depends on parameters that are intrinsic to the dopaminergic neurons; β is independent of movement speed v or run duration τ, which may vary depending on animal physiology and the environmental context, as well as on the number of dopaminergic neurons. This may explain the robustness of the matching phenomena across species and experimental conditions. The logarithmic derivative property is crucial for obtaining the matching law. A non-logarithmic derivative, or absolute responses, do not provide matching (S1 Text). Taken together, the reward-taxis model can provide a physiological mechanism underpinning operant matching.
Variation in intrinsic neuronal parameters can affect gain μ and baseline firing rate d 0 . The model predicts that manipulating the relative gain of dopamine neurons md À 1 0 will change the reward sensitivity parameter β in the matching law. This prediction can be tested by measuring md À 1 0 in genetically modified animals where matching behavior is different from wild-type. One such case is mice that are deficient in the cannabinoid type-1 receptor (CB -/-), which have β that is lower by~30% compared with wild-type mice [96]. In agreement with the model prediction, CB -/mice also have deficient dopaminergic responses relative to baseline [97].
Other disorders can be due to reduction in the number of functional dopaminergic neurons, as in Parkinson's disease, where SNc dopaminergic neurons are lost. Such damage is predicted to change v 0 . If the damage does not sizably affect the intrinsic properties of surviving neurons, such as gain and baseline, they are not expected to change β in matching experiments.

Discussion
In this study we showed that concepts from systems biology, including exact adaptation, foldchange detection and stochastic navigation, can be mapped to the dopamine system in the brain. We showed that the dopamine circuit may implement a 'reward-taxis' mechanism that shares core analogies with bacterial chemotaxis. To show this we developed a mechanistic model of dopamine dynamics based on experimental measurements. The model has similar behavior to the classic TD-RPE model, with a key difference-the circuit is activated by the logarithm of expected reward. The model predicts that dopamine output is invariant to the scale of the distribution of rewards, as observed by Tobler et al. [26], and explains matching in freeoperant behavior. Reward-taxis results from the interaction between sensing and movement and implements a simple strategy for climbing gradients of expected reward.
Scale invariance is a recurring motif in biological sensory systems [55]. The model of dopamine transmission as fold-change detection (FCD) of expected reward is thus in line with the conceptualization of dopamine neurons as sensory neurons for reward [98]. FCD includes the classic Weber's law of sensory systems, which posits that the maximal response to a change in input is normalized by the background level of the signal [99]. FCD is more general than Weber's law in that the entire dynamics of the output, including amplitude and duration, is normalized by the background input level. FCD allows the system to function in a scale-invariant manner across several decades of background input [54]. It also provides a common scale to compare different types of sensory inputs, by referring to their relative (rather than absolute) changes [100].
While the model focused on the average activity of dopaminergic neurons, the proposed mechanism for FCD (inhibition from neighboring GABAergic neurons) may apply at the level of individual dopaminergic neurons or groups of neurons. This raises the possibility that different dopaminergic neurons could become adapted to different expected-reward levels at the same time, consistent with a recent study that demonstrated that a single reward can simultaneously elicit positive and negative prediction errors in different neurons [101].
The FCD model proposes that RPEs become normalized by the scale of the rewards, but does not account for possible effects of the reward distribution. Such distribution-based effects are evident in the dopamine system. In a recent study, Rothenhoefer et al. showed that rare rewards appear to amplify RPEs more than commonly observed rewards [102]. To further disentangle effects of reward distribution vs. reward scale, we propose the following set of experiments. The FCD model predicts that for a reward schedule with a fixed distribution and a shifted mean X, the dopaminergic responses should decay as X is increased. To test this we may consider a reward schedule where the animal is randomly rewarded with rewards of magnitudes r 1 = X+Y, r 2 = X−Y, preceded by a cue that predicts that some reward will be delivered. The FCD model predicts that dopaminergic responses should decay as X increases. For example, when the reward schedule alternates between rewards of magnitude r 1 = 1.5, r 2 = 0.5 (X = 1, Y = 0.5), the dopaminergic response to the reward of magnitude 1.5 should be larger than the dopaminergic response to a reward of magnitude 11.5 under an alternating schedule with rewards of magnitude r 1 = 11.5, r 2 = 10.5 (X = 11, Y = 0.5). This contrasts with models based on std. normalization which predict identical responses in both scenarios. Similarly, the FCD model predicts that dopaminergic responses should increase with Y, while models based on std. normalization predict identical dopaminergic responses in this scenario as well.
The present study unifies two main effects of dopamine, encoding reward-derivative and increasing movement vigor, by mapping them to a reward-taxis navigational circuit. The circuit is analogous to the bacterial chemotaxis circuit, where in the dopamine case navigation is along gradients of expected reward. The mapping is based on mathematical analogies at both the physiological and behavioral levels. At the physiological level both circuits have the FCD property. At the behavioral level, dopamine increases the probability and vigor of movements, thus increasing the duration of correlated motion ("runs") compared with reorientations ("tumbles"). Both aspects map to the well-characterized chemotaxis navigation circuit in bacteria.
The stochastic model is sufficient for explaining matching behavior, and provides an accurate mechanistic estimate for the matching constant β. The estimate is derived under mild mechanistic assumptions-that movement speed (or run length) is controlled proportionally by dopamine levels, and that run times are relatively short compared with adaptation time. Improved experimental characterization of movement control by dopamine will allow us to relax these assumptions to obtain better estimates for β. For example, a nonlinear gain for movement speed regulation by dopamine v/d h would result in multiplication by a constant prefactor β = hμ/d 0 , and longer run times would result in a proportional decrease in β [103]. As long as these effects are mild, we expect our estimate β�1 to hold. Importantly, we still expect the matching constant β to be proportional to μ and inversely proportional to d 0 .
Our study connects between vertebrate motion regulation and the wider family of run-andtumble stochastic navigation circuits, which includes motion regulation in bacteria, algae, and simple animals [104][105][106][107]. Reward-taxis was anticipated in the early work on TD learning, where Montague et al. showed that run-and-tumble dynamics driven by reward prediction errors can explain observations on bee foraging [2].
There are also differences between bacterial chemotaxis and the reward-taxis model for dopamine. The value of β in bacterial chemotaxis is much larger than in the dopamine rewardtaxis model, with β>10 in E. coli [108,109] and β~1 estimated for the dopamine system. The high β value in bacteria indicates a strong preference for higher rewards, akin to an optimization for accumulation near attractant peaks. It also allows for collective migration [91]. A value of β~1 (which results in the matching law) allows a greater range for exploration of submaximal rewards.
The reward-taxis model was presented for whole-body spatial movement, but its assumptions are general and may potentially extend to other aspects of behavior. One such aspect is hippocampal replay, the activation of hippocampal place cells during sharp-wave ripples [110][111][112]. Hippocampal replay consists of a firing sequence of neurons that represents temporally compressed motion trajectories [111,113,114]. It can occur either during sleep or rest ("offline") or when the animal is awake and engaged in a task ("online"). Online replay plays an important role in planning and navigation [114]. The activation of the place neurons corresponds to stochastic movement trajectories with a characteristic speed [115] that are biased towards the location of rewards [114]; when foraging is random, the trajectories are diffusive, resembling Brownian motion [112]. Hippocampal activity during online replay is tightly coordinated with reward-associated dopamine transmission [110]. To map reward taxis to hippocampal replay requires that dopamine transmission modulate the stochastic trajectories of hippocampal replay, for example through the modulation of velocity or reorientation frequency. Another potentially relevant system is eye movements. Eye movements are modulated by dopamine and impaired in Parkinson's disease [34,116,117], and their vigor is modulated by reward prediction errors [118]. Additionally, random walk models capture gaze dynamics during tasks such as visual search [119][120][121]. Since eye movements are commonly studied in behavioral experiments such as reward matching, they may be a good candidate to test the reward-taxis model.
While taxis navigation systems in organisms such as E. coli are based on gradients that are created due to diffusion, for the dopamine system the input field is generated by learning-TD learning is sufficient for generating gradients of expected reward. From the point of view of signal processing, TD learning smooths away the high-frequency (or "phasic") input components [122], leaving a low-frequency input signal that is used for navigation. In this way the dopamine system can allow for gradient-based navigation over fields that are derived from arbitrary sensory inputs.
The reward-taxis model does not assume any explicit choice process-in the model, navigation towards regions of higher expected reward is only due to the modulation of movement statistics by dopamine. This may at first sight appear more primitive than standard reinforcement strategies, where the agent compares the expected reward of different alternatives before acting. However, reward-taxis may be advantageous in certain settings. The first advantage is that reward-taxis is computationally cheap-it only requires activation by a single local scalarwhich allows for efficient continuous modulation of movements, rather than discrete movement adjustments. The second advantage is that it provides effective sampling of the rewards distributed in the environment by implementing a search algorithm (Eq 5) mathematically analogous to the Langevin Monte Carlo (LMC) algorithm for sampling probability distributions [92][93][94][95] and for global optimization [123][124][125][126][127][128]. Sampling allows the animal to incorporate uncertainty on reward magnitude, probability, or location into its navigation. It also allows the animal to efficiently navigate in complex input fields that include many local minima and maxima [109]. Finally, run-and-tumble navigation provides benefits beyond the Langevin Monte Carlo algorithm by boosting gradient climbing only on sufficiently steep reward gradients due to the positive feedback between behavior and sensing. The positive feedback occurs since running along the gradient provides an increasing input that further enhances the run duration [129]. These advantages suggest that reward-taxis may be a useful strategy when the expected reward input field is complex or uncertain.
The relation between the dopaminergic system and sampling, and in particular the relation between dopaminergic parameters and the matching law parameter β, may be relevant to recent findings on dopamine and exploration [130,131]. In mice and humans, dopaminergic antagonists appear to specifically increase random exploration, rather than affect learning [130,131]. Under our modelling framework, this effect may correspond to a decrease in β by the treatment, for example, due to a reduction in the effective dopaminergic gain. This would result in altered behavioral output, without necessarily affecting learning. More generally, the sampling framework can provide a quantitative theoretical framework to model the relation between dopamine and various aspects of stochastic exploration, such as novelty-driven exploration [132,133].
A more realistic and complete model would include other aspects of decision making such as goal-directed behavior and planning. It is important to note that since the FCD model does not hinder learning, it is compatible with these aspects and they are likely to complement reward-taxis with more directed movement. Such a combination of navigation mechanisms is evident also in simple organisms that employ run-and-tumble navigation. For example, in C. elegans thermotaxis, run-and-tumble navigation is combined with biased reorientations in order to navigate towards an optimal temperature range [106]. Formally, while run-and-tumble navigation resembles Langevin-based sampling, directed reorientations are more closely related to gradient descent, which is efficient for local optimization but poor for global optimization [127,134,135]. We thus propose that the reward-taxis mechanism we describe can complement other navigation and decision making-mechanisms to allow for efficient navigation in complex environments.

Model equations and fold-change detection
The equations for dopamine (d) and GABAergic inhibition (g) are provided by: For the dopamine equation, ω d determines the dopamine degradation rate, μ is dopamine gain, R is expected reward (defined in the next Methods section), and α is GABAergic inhibition strength. For the GABAergic inhibition equation, ω determines the adaptation rate and d 0 is the adapted steady-state of dopamine. For simplicity, we assume that dopamine dynamics are faster than the dynamics of adaptation due to g (i.e., ω d is large compared with ω, this assumption is not important for our conclusions) so we take: Eqs 7, 8 and 9 can represent the average activity of individual neurons, or the total activity of many neurons. We therefore used the same equations both to model average individual neuron recordings (as in Fig 2), and to model the effect of dopamine on movement, which is likely to be the sum of the activity of many neurons.
Consider now a constant input R = R 0 , so that after some time the system reaches steady-state. To find the steady state, we solve Eqs 7 and 8, taking _ d ¼ 0; _ g ¼ 0, which yields the steady-state solutions d st = d 0 and g st = α −1 (C−d 0 +μ log R 0 ). The observation that d st = d 0 regardless of R 0 and other circuit parameters is an important circuit feature from systems biology known as exact adaptation [27,50,[136][137][138]. This feature is essential for explaining why dopamine activity returns precisely to baseline after a step increase in expected reward, while GABAergic activity increases in a way that tracks expected reward.
Beyond exact adaptation, the system has an even stronger property of fold-change detection (FCD). FCD is defined as dynamics of dopamine (d) in response to an input λR(t) that are independent of λ, starting from initial conditions at steady-state for λR(0). To show this we relabel g = g 0 +α −1 μ log λ: Note that the rate equation for _ g 0 is the same as for _ g since dg 0 dt ¼ dg0 dg dg dt ¼ dg dt . Eqs 10 and 11 are completely independent of λ, and their steady-state g 0 st ¼ a À 1 ðC À d 0 þ mlogR 0 Þ and d st = d 0 is also independent of λ. This means that the dynamics of the system have the FCD property. The FCD property is essential for explaining the scale invariance of the dopaminergic responses to rewards in Fig 2 -the response only depends on the fold-change of expected reward (two-fold change upon reception of reward at p = 0.5) but not on reward magnitude. While Eqs 7, 8 and 9 provide FCD, they are not the only possible model that provides FCD for this system. In particular, a feed-forward model where expected reward activates g is also possible, i.e.: For this circuit, the steady state for a constant input R = R 0 is g st ¼ aþmlogR 0 a and d st = d 0 . FCD can also be analogously shown. Given an input λr(t), we can take g = g 0 +α −1 μ log λ, which again provides equations and steady-state that are independent of λ: While this simple log-linear model captures various important experimental observations, it is important to note that it has some clear limitations. One limitation is that both d and g can in principle reach negative values when R is small. Measurements of dopamine responses in monkeys indeed show deviations from sub-linearity for small rewards [139]. The model can be adjusted to prevent negative undershoots (S3 Fig). Future studies may build on improved measurements and better mechanistic characterization of the dopamine circuit to refine this model. Finally, the original studies quantifying the input/output relation between reward magnitude and dopaminergic output, presented in Fig 2, considered fits by strongly sublinear power-and hill-functions [30,38]. It is not possible to discriminate between these functions and the logarithmic relation with the available data, and such a fit would require more accurate measurements over large magnitude ranges.

Definition of expected reward and relation between the circuit and TD learning
Here we will define the input to the circuit, which is the logarithmic expected reward log R, and present it in the context of the temporal difference (TD) learning theory of dopamine function [5,7]. We first define the expected temporally discounted sum of future rewards V (also known as the value function in TD learning): Where γ<1 is a "future discounting" factor and r(t) is the reward received at time t into the future (here for simplicity we take discrete time; for equivalent formulation for continuous time, see Doya [140]). It is possible to think of V as a function of the current state s of the agent, which may include for example its location in space x. This is known as the Markovian setting, where we denote the value function as V(s). The value function plays an important role in decision making-learning the value function is a principal focus of reinforcement learning algorithms [5,7].
In our model, the input to the circuit for an agent moving into a state s at time t is defined using the expected reward R: As an example, consider the setting of Fig 2D-2F, where a reward of size r = y is delivered with probability p at Δt time-units into future: the expected reward would in this case be R (0)�pγ Δt y. An actual delivery of the reward would then increase R to R(Δt)�y, so the ratio RðDtÞ Rð0Þ ¼ 1 pg Dt is independent of reward magnitude. Note that due to discounting and uncertainty, R decays with the distance from a location where a reward is delivered, as in Fig 1D [36].
We will now show that our model is consistent with the TD learning theory of dopamine function with a slight modification to the TD learning rule. We will first briefly present the TD learning algorithm. In reinforcement learning, the agent usually does not know V and needs to learn it from experience. This presents a computational problem, since V is an infinite sum over unknown future events. A way to get around this is to update the learned V using dynamic programming [141]. The key insight is that Eq 16 can be rewritten as: The above equation implies that V can be estimated iteratively with a simple update rule, which is at the heart of TD learning. If the agent is at state s at time t, and at state s' at time t+1, the update rule is: Vðt þ 1; sÞ Vðt; sÞ þ a ðrðtÞ þ gVðt; s 0 Þ À Vðt þ 1; sÞÞ |ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl {zffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl ffl } prediction error Where V(t, s) is the computed estimate of the expected reward at state s at time t, α is the learning rate, r(t) is the reward delivered at time t and γ is the discounting factor. There is extensive literature demonstrating correspondence between TD learning and midbrain dopamine function (reviewed by [8]); specifically, experiments show a correspondence between phasic dopamine secretion and the prediction error term of Eq 19 [5,8], in the sense that positive or negative firing of dopamine neurons relative to baseline corresponds positive and negative predictions errors in TD models of learning.
We will now show that our model is capable of learning the logarithm of V (that is, the logarithm of the entire discounted sum over future rewards), in a manner similar to the learning of V by classic TD learning. Since both are equivalent, our model is sufficient for explaining TD learning by dopaminergic responses. For this we will develop a plausible temporal difference learning rule based on logarithmic prediction errors: d log ¼ logðrðtÞ þ gVðt; s 0 ÞÞ À logVðt; sÞ.
The learning rule is an extension of the learning rule presented in Eqs 15-17 in Coulthard et al. [142]. To devise the learning rule, consider the Taylor expansion of the logarithm of the update rule given in Eq 19 around r(t)+γV(t, s 0 ) = V(t, s): logVðt þ 1; sÞ logðVðt; sÞ þ aðrðtÞ þ gVðt; s 0 Þ À Vðt; sÞÞÞ � logVðt; sÞþ a rðtÞ þ gVðt; s 0 Þ Vðt; sÞ À 1 � � ¼ logVðt; sÞ þ a e logðrðtÞþgVðt;s 0 ÞÞÀ logVðt;sÞ À 1 À � ¼ logVðt; sÞþ The above equation represents an update rule that only needs the modified prediction error term δ log in order to learn the value function. In the continuous limit, and in the absence of reward, the error term is approximately proportional the logarithmic derivative of the value function. This corresponds to the output of our proposed FCD model for low frequency signals.
The output of the circuit to a delivered reward in a transition from a state s to a state s' is also approximately proportional to the above error term: The final equality is due to the fact that prior to reward delivery, GABAergic output adapts to α −1 (C−d 0 +μ log V(s)). The FCD model is therefore compatible with the TD learning theory of dopamine function. In S4 Fig we provide simulations for learning with the modified learning rule, where we show that it indeed learns log V.

Analysis
The fit of the dopaminergic responses in Fig 2 (including confidence intervals) was performed using the NonlinearModelFit function of Mathematica (version 12.1.1). All other figures and simulations were produced using Python (version 3.8.5). The source code and data to produce all the figures is available at https://github.com/omerka-weizmann/reward_taxis.  [12]. Insets. Corresponding dopaminergic outputs from mice (left to right: n = 11, n = 11, n = 15, n = 5) VTA neurons measured by calcium imaging, from Fig 2C, 2G, 2K and 2O in [12], smoothed using a Savitzky-Golay filter. All simulations were performed with the parameters provided in Table 1.