Reward Optimization in the Primate Brain: A Probabilistic Model of Decision Making under Uncertainty

A key problem in neuroscience is understanding how the brain makes decisions under uncertainty. Important insights have been gained using tasks such as the random dots motion discrimination task, in which the subject makes decisions based on noisy stimuli. A descriptive model known as the drift diffusion model has previously been used to explain psychometric and reaction time data from such tasks, but to fully explain the data one is forced to make ad-hoc assumptions such as a time-dependent collapsing decision boundary. We show that such assumptions are unnecessary when decision making is viewed within the framework of partially observable Markov decision processes (POMDPs), and we propose an alternative model for decision making based on this framework. We show that the motion discrimination task reduces to the problems of (1) computing beliefs (posterior distributions) over the unknown direction and motion strength from noisy observations in a Bayesian manner, and (2) selecting actions based on these beliefs to maximize the expected sum of future rewards. The resulting optimal policy (belief-to-action mapping) is shown to be equivalent to a collapsing decision threshold that governs the switch from evidence accumulation to a discrimination decision. We show that the model accounts for both accuracy and reaction time as a function of stimulus strength, as well as different speed-accuracy conditions, in the random dots task.


Introduction
Animals are constantly confronted with the problem of making decisions given noisy sensory measurements and incomplete knowledge of their environment. Making decisions under such circumstances is difficult because it requires (1) inferring hidden states in the environment that are generating the noisy sensory observations, and (2) determining if one decision (or action) is better than another based on uncertain and delayed reinforcement. Experimental and theoretical studies [1][2][3][4][5][6] have suggested that the brain may implement an approximate form of Bayesian inference for solving the hidden state problem. However, these studies typically do not address the question of how probabilistic representations of hidden state are employed in action selection based on reinforcement. Daw, Dayan and their colleagues [7,8] explored the suitability of decision theoretic and reinforcement learning models in understanding several well-known neurobiological experiments. Bogacz and colleagues proposed a model that combines a traditional decision making model with reinforcement learning [9] (see also [10]). Rao [11] proposed a neural model for decision making based on the framework of partially observable Markov decision processes (POMDPs) [12]; the model focused on network implementation and learning but assumed a deadline to explain the collapsing decision threshold. Drugowitsch et al. [13] sought to explain the collapsing decision threshold by combining a traditional drift diffusion model with reward rate maximization.
Other recent studies have used the general framework of POMDPs to explain experimental data in decision making tasks such as those involving a stop-signal [14,15] and different types of prior knowledge [16].
In this paper, we derive from first principles a POMDP model for the well-known random dots motion discrimination task [17]. We show that the task reduces to the problems of (1) computing beta-distributed beliefs over the unknown direction and motion strength from noisy observations, and (2) selecting actions based on these beliefs in order to maximize the expected sum of future rewards. Without making ad-hoc assumptions such as a hypothetical deadline, a collapsing decision threshold emerges naturally via expected reward maximization. We present results comparing the model's predictions to experimental data and show that the model can explain both reaction time and accuracy as a function of stimulus strength as well as different speed-accuracy conditions.

POMDP framework
We model the random dots motion discrimination task as a POMDP. The POMDP framework assumes that at any particular time step, the environment is in a particular hidden state, m, that is not directly accessible to the animal. This hidden state can, however, be inferred by making a sequence of sensory measurements. At each time step t, the animal receives a sensory measurement (observation), o_t, from the environment, which is determined by an emission probability distribution P(o_t | m). Since the hidden state m is unknown, the animal must maintain a belief (posterior probability distribution) over the set of possible states given the sensory observations seen so far: b_t(m | o_{1:t}), where o_{1:t} represents the sequence of observations that the animal has accumulated so far. At each time step, an action (decision) a_t ∈ A made by the animal can affect the environment by changing the current state to another according to a transition probability distribution P(m' | m, a_t), where m is the current state and m' is the new state. The animal then gets a reward R(m, a_t) from the environment, depending on the current state and the action taken. During training, the animal learns a policy, π(b) ∈ A, which indicates which action a to perform for each belief state b. We make two main assumptions in the POMDP model. First, the animal uses Bayes' rule to update its belief about the hidden state after each new observation. Second, the animal is trained to follow an optimal policy π*(b) that maximizes the animal's expected total future reward in the task. Figure 1 illustrates the decision making process using the POMDP framework. Note that in the decision making tasks we model in this paper, the hidden state m is fixed by the experimenters within a trial, and thus there is no transition distribution to include in the belief update equation. In general, the hidden state in a POMDP model follows a Markov chain, making the observations o_{1:t} temporally correlated.
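To make the belief update concrete, the following is a minimal sketch of the Bayesian update described above for a discrete hidden state. The two-state space, emission matrix, and observation sequence are hypothetical placeholders, and no transition step is applied because, as noted above, the hidden state is fixed within a trial.

```python
import numpy as np

def update_belief(belief, obs, emission):
    """One Bayesian belief update: b_t(m) proportional to P(o_t | m) * b_{t-1}(m).

    belief   : prior probabilities over hidden states, shape (num_states,)
    obs      : index of the observation received at this time step
    emission : emission matrix, emission[m, o] = P(o | m)
    """
    posterior = emission[:, obs] * belief      # Bayes' rule (unnormalized)
    return posterior / posterior.sum()         # normalize to a distribution

# Hypothetical two-state example; the hidden state is fixed within a trial,
# so no transition step is applied between updates.
emission = np.array([[0.7, 0.3],               # P(o | m = 0)
                     [0.2, 0.8]])              # P(o | m = 1)
belief = np.array([0.5, 0.5])                  # uniform prior
for obs in [1, 1, 0, 1]:                       # a hypothetical observation sequence
    belief = update_belief(belief, obs, emission)
print(belief)                                  # posterior over the two hidden states
```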

Random dots task as a POMDP
We now describe how the general framework of POMDPs can be applied to the random dots motion discrimination task, as shown in Figure 1. In each trial, the experimenter chooses a fixed direction d ∈ {−1, +1}, corresponding to leftward and rightward motion respectively, and a stimulus strength (motion coherence) c ∈ [0, 1], where 0 corresponds to completely random motion and 1 corresponds to 100% coherent motion (i.e., all dots moving in the same direction). Intermediate values of c represent a corresponding fraction of dots moving in the coherent direction (e.g., 0.5 represents 50% coherent motion). The animal is shown a movie of randomly moving dots, a fraction c of which are moving in the same direction d.
In a given trial, neither the direction d nor the coherence c is known to the animal. We therefore regard (c, d) as the joint hidden environment state in the POMDP model. Neurophysiological evidence suggests that information regarding random dot motion is received from neurons in cortical area MT [18][19][20][21]. Therefore, following previous models (e.g., [22][23][24]), we define the observation model P(o_t | m) in the POMDP as a function of the responses of MT neurons. Let the firing rates of MT neurons preferring the rightward and leftward directions be λ_R^MT and λ_L^MT respectively. We define these rates as functions of the coherence c and direction d (Equation 1), where λ_0^MT = 20 spikes/second is the average spike rate for a 0% coherent motion stimulus, and r_pref = 40 and r_null = −20 are the "drives" in the preferred and null directions respectively. These constants (r_pref, r_null and λ_0^MT) are based on fits to experimental data as reported in [23,25]. Let τ_t be the elapsed time between time steps t and t+1. The number of spikes r^MT emitted by MT neurons within τ_t then follows a Poisson distribution. We define the observation o_t at time t as the spike count from MT neurons preferring rightward motion, given the total spike count from rightward- and leftward-preferring neurons; that is, the observation is a conditional random variable o_t = r_R^MT | n_t, where n_t = r_R^MT + r_L^MT. Then o_t follows a stationary binomial distribution Bino(n, μ). Note that the duration of each POMDP time step need not be fixed, and we can therefore adjust τ_t such that n_t = n for some fixed n, i.e., the animal updates the posterior distribution over the hidden state each time it receives n spikes from the MT population. τ_t is exponentially distributed, and the standard deviation of τ_t approaches zero as n increases. When n = 1, o_t becomes an indicator random variable representing whether or not a spike was emitted by a rightward-motion-preferring neuron.
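As an illustration of this observation model, the sketch below generates observations o_t from two Poisson MT pools. The constants λ_0^MT = 20, r_pref = 40, and r_null = −20 are taken from the text, but the linear form of the rate functions is an assumption standing in for Equation 1 (not reproduced here); the binomial form of o_t given n total spikes follows from the superposition of the two Poisson processes.

```python
import numpy as np

rng = np.random.default_rng(0)

LAMBDA_0 = 20.0   # spikes/s at 0% coherence (from the text)
R_PREF = 40.0     # "drive" in the preferred direction (from the text)
R_NULL = -20.0    # "drive" in the null direction (from the text)

def mt_rates(c, d):
    """Assumed linear stand-in for Equation 1: rates of the rightward- and
    leftward-preferring MT pools as functions of coherence c and direction d."""
    lam_R = LAMBDA_0 + c * (R_PREF if d == +1 else R_NULL)
    lam_L = LAMBDA_0 + c * (R_PREF if d == -1 else R_NULL)
    return lam_R, lam_L

def next_observation(c, d, n=1):
    """Draw one observation o_t: the number of rightward spikes among the next
    n spikes emitted by the two MT pools combined.  Because the pools are
    independent Poisson processes, each spike is rightward with probability
    lam_R / (lam_R + lam_L), so o_t given n is binomial."""
    lam_R, lam_L = mt_rates(c, d)
    return rng.binomial(n, lam_R / (lam_R + lam_L))

# Example: 12.8% coherent rightward motion, updating after every spike (n = 1).
obs = [next_observation(c=0.128, d=+1) for _ in range(20)]
print(obs)
```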
It can be shown [26] that o_t follows a binomial distribution Bino(n, μ), where the parameter μ is determined by the MT firing rates λ_R^MT and λ_L^MT.

Bayesian inference of hidden state
Given the framework above, the task of deciding the direction of motion of the coherently moving dots is equivalent to the task of deciding whether d = +1 or not, and deciding when to make such a decision. The POMDP model makes decisions based on the "belief" state b_t(μ) = P(μ | o_{1:t}), which is the posterior probability distribution over μ = (cd + 1)/2 given a sequence of observations o_{1:t}. To facilitate the analysis, we represent the prior probability Pr[μ] as a beta distribution with parameters α_0 and β_0. Note that the beta distribution is quite flexible: for example, a uniform prior can be obtained using α_0 = β_0 = 1. Without loss of generality, we fix α_0 = β_0 = 1 throughout this paper. Because the beta prior is conjugate to the binomial observation model, the posterior can be written as a beta distribution with parameters α = m_R + α_0 and β = m_L + β_0, where m_R and m_L are the total spike counts observed so far from rightward- and leftward-preferring neurons respectively. The belief state b_t at time step t thus follows a beta distribution with the two parameters α and β defined above, and the posterior probability distribution over μ depends only on the spike counts m_R and m_L. These in turn determine μ̂ and t, where μ̂ = (m_R + α_0)/(m + α_0 + β_0) is the point estimator of μ, m = m_R + m_L, and t = (m_R + m_L)/n. The animal only needs to keep track of μ̂ and t in order to encode the belief state b_t = Beta[μ | α = μ̂(nt + α_0 + β_0), β = (1 − μ̂)(nt + α_0 + β_0)]. After marginalizing over the coherence c, we obtain the posterior probability over the direction d: Pr[d = +1 | o_{1:t}] = 1 − I_0.5(α, β), where I_x(α, β) = ∫_0^x Beta(μ | α, β) dμ is the regularized incomplete beta function.
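Because the beta prior is conjugate to the binomial observation model, maintaining the belief state reduces to counting spikes; a minimal sketch using SciPy's regularized incomplete beta function:

```python
from scipy.special import betainc

def belief_params(m_R, m_L, a0=1.0, b0=1.0):
    """Beta posterior parameters after observing m_R rightward and m_L
    leftward spikes, starting from a Beta(a0, b0) prior."""
    return m_R + a0, m_L + b0

def prob_rightward(m_R, m_L, a0=1.0, b0=1.0):
    """Pr[d = +1 | o_1:t] = Pr[mu > 1/2] = 1 - I_0.5(alpha, beta)."""
    a, b = belief_params(m_R, m_L, a0, b0)
    return 1.0 - betainc(a, b, 0.5)

# Example: 14 rightward vs 6 leftward spikes under a uniform Beta(1, 1) prior.
print(prob_rightward(14, 6))   # posterior probability that the motion is rightward
```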

Actions, rewards, and value function
The animal updates its belief after receiving the current observation o_t and chooses one of three actions (decisions) a ∈ {A_R, A_L, A_S}, denoting a rightward eye movement, a leftward eye movement, and sampling (i.e., waiting for one more observation) respectively. The model assumes the animal receives rewards R(m, a) as follows (rewards are modeled using real numbers). When the animal makes a correct choice, i.e., a rightward eye movement A_R when d = +1 (μ > 1/2) or a leftward eye movement A_L when d = −1 (μ < 1/2), the animal receives a positive reward R_P > 0.
The animal receives a negative reward (i.e., a penalty) or nothing when an incorrect action is chosen: R_N ≤ 0. We further assume that the animal is motivated by hunger or thirst to make a decision as quickly as possible. This is modeled using a unit penalty R_S = −1 for each observation the animal makes, representing the cost the animal pays when choosing the sampling action A_S.
Recall that a belief state b_t is determined by the parameters α and β. The goal of the animal is to find an optimal "policy" π* that maximizes the "value" function v^π(b_t), defined as the expected sum of future rewards given the current belief state (equation 9), where the expectation is taken with respect to all future belief states (b_{t+1}, ..., b_{t+k}, ...). The reward term R(b_t, a) in this expression is the expected reward for the given belief state and action (equation 10), which can be interpreted as follows. When A_S is selected, the animal receives n more samples at a cost of nR_S. When A_R is selected, the expected reward R(b_t, A_R) depends on the probability density function of the hidden parameter μ given the belief state b_t: with probability I_0.5(α, β), the true parameter μ is less than 0.5, making A_R an incorrect decision with penalty R_N, and with probability 1 − I_0.5(α, β), action A_R is correct, earning the reward R_P.
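The expected immediate reward of each action at a belief state Beta(α, β) can be computed directly from I_0.5(α, β) as described above; a small sketch follows (the example reward values match those used later for Figure 2, and the long-run value of A_S, which also includes future rewards, is computed by the dynamic programming step in the next section):

```python
from scipy.special import betainc

R_P, R_N, R_S = 50.0, 0.0, -0.1   # example reward parameters (as in Figure 2)

def expected_reward(a, b, action, n=1):
    """Expected immediate reward R(b_t, action) for a belief state Beta(a, b)."""
    p_left = betainc(a, b, 0.5)          # Pr[mu < 1/2] = I_0.5(a, b)
    if action == "A_R":                  # rightward choice
        return (1.0 - p_left) * R_P + p_left * R_N
    if action == "A_L":                  # leftward choice
        return p_left * R_P + (1.0 - p_left) * R_N
    return n * R_S                       # A_S: pay the sampling cost for n spikes

for action in ("A_R", "A_L", "A_S"):
    print(action, expected_reward(15.0, 7.0, action))
```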

Finding the optimal policy
A policy π(b_t) defines a mapping from a belief state to one of the available actions a. A method for learning a POMDP policy by trial and error using temporal difference (TD) learning was suggested in [11]. Here, we derive a policy from first principles and compare the result with behavioral data.
One standard way [12] to solve a POMDP is to first convert it into a Markov decision process (MDP) over belief states, and then apply standard dynamic programming techniques such as value iteration [27] to compute the value function in equation 9. For the corresponding belief MDP, we need to define the transition probabilities T(b_t | b_{t−1}, a_{t−1}). When a_{t−1} = A_S, the belief state can be updated from the previous belief state and the current observation using Bayes' rule (equation 11), which yields a stationary transition distribution independent of the time t. When the selected action is A_R or A_L, the animal stops sampling and makes an eye movement. To account for such cases, we include an additional terminal state C with zero reward, R(C, a) = 0, and absorbing behavior, T(C | C, a) = 1 for all actions a. Formally, the transition probabilities into the absorbing (termination) state are defined as Pr[C | b_t, a ∈ {A_R, A_L}] = 1 for all b_t, indicating the end of a trial.
Given the time-independent belief state transition Pr[b'_t | b_t, a], the optimal value v* and policy π* = argmax_π v^π can be obtained by solving Bellman's equation (equation 13). Before we proceed to results from the model, we note that the one-step belief transition probability matrix T(b_t | b_{t−1}, A_S) with n = n_0 can be shown to be mathematically equivalent to the n_0-step transition matrix T^{n_0}(b_t | b_{t−1}, A_S) with n = 1. The solution to Bellman's equation 13 is therefore independent of n. Unless otherwise mentioned, the results are based on the most general scenario in which the animal may select an action whenever a new spike is received, i.e., n = 1.
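The sketch below solves the resulting belief MDP for n = 1 by backward induction over the layered space of spike counts (m_R, m_L), one way to carry out the value iteration described above. The horizon T_MAX that truncates the sweep (forcing a choice at the last layer) is an assumption made for tractability, and the predictive probability α/(α + β) is the stationary belief transition for the sampling action implied by the beta-binomial model.

```python
import numpy as np
from scipy.special import betainc

R_P, R_N, R_S = 50.0, 0.0, -0.1    # reward parameters (as in Figure 2)
A0, B0 = 1.0, 1.0                  # Beta(1, 1) prior
T_MAX = 400                        # horizon used to truncate the backward sweep

def terminal_values(a, b):
    """Expected reward of the two terminal actions at belief Beta(a, b)."""
    p_left = betainc(a, b, 0.5)
    q_R = (1.0 - p_left) * R_P + p_left * R_N
    q_L = p_left * R_P + (1.0 - p_left) * R_N
    return q_R, q_L

# V[m_R, m_L] and policy[m_R, m_L] on the layered belief space (n = 1 spike/step).
V = np.zeros((T_MAX + 1, T_MAX + 1))
policy = np.empty((T_MAX + 1, T_MAX + 1), dtype=object)

for m in range(T_MAX, -1, -1):               # backward induction over m = m_R + m_L
    for m_R in range(m + 1):
        m_L = m - m_R
        a, b = m_R + A0, m_L + B0
        q_R, q_L = terminal_values(a, b)
        options = {"A_R": q_R, "A_L": q_L}
        if m < T_MAX:                        # sampling allowed before the horizon
            p_right = a / (a + b)            # predictive Pr[o = 1 | belief]
            q_S = R_S + p_right * V[m_R + 1, m_L] + (1 - p_right) * V[m_R, m_L + 1]
            options["A_S"] = q_S
        best = max(options, key=options.get)
        V[m_R, m_L] = options[best]
        policy[m_R, m_L] = best

print(policy[1, 1], policy[60, 10])          # sample when uncertain, commit when confident
```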
We summarize the model variables as well as their statistical relationships in Table 1.

Optimal value function and policy
Figure 2 (a) shows the optimal value function computed by applying value iteration [27] to the POMDP defined in the Methods and Analysis section, with parameters R_P = 50, R_N = 0, and R_S = −0.1. The x-axis of Figure 2 (a) represents the total number of observations m = m_R + m_L encountered thus far, which is equal to the elapsed time t in the trial. The y-axis represents the ratio μ̂ = (m_R + α_0)/(m + α_0 + β_0), which is the estimator of the hidden parameter μ. In general, the model predicts a high value when μ̂ is close to 1 or 0, or equivalently, when the estimated coherence is close to 1. This is because at these two extremes, selecting the appropriate action has a high probability of receiving a large positive reward R_P. On the other hand, for μ̂ near 0.5 (estimated c near 0), choosing A_L or A_R in these states has a high chance of resulting in an incorrect decision and a large negative reward R_N (see [11] for a similar result using a different model and under the assumption of a deadline). Thus, belief states with m_R ≈ m_L have a much lower value than belief states with m_R ≫ m_L or m_R ≪ m_L. Figure 2 (b) shows the corresponding optimal policy π* as a joint function of μ̂ and t. The optimal policy π* partitions the belief space into three regions, P_R, P_L, and P_S, representing the sets of belief states preferring actions A_R, A_L and A_S respectively. Let P_m^a be the set of belief states preferring action a after m observations, for a ∈ {A_R, A_L, A_S} and m = m_R + m_L. Early in a trial, when m is small, the model selects the sampling action A_S regardless of the value of μ̂. This is because for small m, the variance of the point estimator μ̂(m) is high. For example, even when μ̂ = 1 with m = 2, the probability that the true μ < 0.5 is still high. The sampling action A_S is required to reduce this variance by accruing more evidence. As m becomes larger, the variance of μ̂ decreases, and the deviation between μ̂ and the true value of μ diminishes by the law of large numbers. Consequently, the animal will pick action A_R even when μ̂ is only slightly above 0.5. This gradual decrease over time in the threshold for choosing the overt actions A_R or A_L has been called a "collapsing bound" in the decision making literature [28][29][30].
The optimal policy π* is entirely determined by the three reward parameters {R_P, R_N, R_S}. At a given belief state, π* picks the one of the three available actions that leads to the largest expected future reward. Thus, the choice is determined by the relative, not the absolute, values of the expected future rewards for the different actions. From equation 10, if we regard the sampling penalty R_S as specifying the unit of reward, the optimal policy π* is determined by the ratio (R_N − R_P)/R_S alone. Figure 2 (c) shows the relationship between (R_N − R_P)/R_S and the optimal policy π* by plotting the rightward decision boundaries w_R(t) for different values of (R_N − R_P)/R_S. As (R_N − R_P)/R_S increases (e.g., by making the sampling cost R_S smaller in magnitude), the boundary w_R(t) gradually moves towards the upper right corner, giving the animal more time to make decisions, which results in more accurate decisions. To better understand this relationship, we fit the decision boundary to a hyperbolic function of t with half-time parameter t_{1/2}. We find that t_{1/2} exhibits nearly logarithmic growth with (R_N − R_P)/R_S. Interestingly, a collapsing bound is obtained even for extremely small R_S, because the goal is reward maximization across trials: it is better to terminate a trial and accrue reward in future trials than to continue sampling noisy (possibly 0% coherent) stimuli.
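Given a policy computed as in the previous sketch, the rightward decision boundary w_R(t) can be read off as the smallest point estimate μ̂ at which A_R is preferred after m = t observations (n = 1); a minimal helper, written to take such a policy array as its argument:

```python
import numpy as np

def rightward_boundary(policy, a0=1.0, b0=1.0):
    """Extract w_R(t): for each number of observations m = t (with n = 1), the
    smallest point estimate mu_hat = (m_R + a0) / (m + a0 + b0) at which the
    policy prefers the rightward action A_R.

    policy : 2-D array of action labels indexed by (m_R, m_L),
             e.g. the array produced by the value-iteration sketch above.
    """
    T = policy.shape[0] - 1
    w_R = np.full(T + 1, np.nan)               # NaN where A_R is never chosen
    for m in range(T + 1):
        for m_R in range(m + 1):
            if policy[m_R, m - m_R] == "A_R":
                w_R[m] = (m_R + a0) / (m + a0 + b0)
                break                          # smallest m_R choosing A_R
    return w_R

# Example (continuing from the value-iteration sketch): w = rightward_boundary(policy)
```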

Model predictions: psychometric function and reaction time
We compare predictions of the model based on the learned policy π* with experimental data from the reaction time version (rather than the fixed duration version) of the motion discrimination task [31]. As illustrated in Figure 3, the model assumes that motion information regarding the random dots on the screen is processed by MT neurons. These neurons provide the observations o_t (and n − o_t) to right- and left-direction coding LIP neurons, which maintain the belief state b_t = {α = Σ_t o_t, β = Σ_t (n − o_t)}. Actions are selected based on the optimal policy π*. If b_t ∈ P_t^R or b_t ∈ P_t^L, the animal makes a rightward or leftward decision respectively and terminates the trial. When b_t ∈ P_t^S, the animal chooses the sampling action and obtains a new observation o_{t+1}. The performance on the task using the optimal policy π* can be measured in terms of both the accuracy of direction discrimination (the so-called psychometric function) and the reaction time required to reach a decision (the chronometric function). In this section, we derive the expected accuracy and reaction time as functions of the stimulus coherence c, and compare them to the psychometric and chronometric functions of a monkey performing the same task [31].
The sequence of random variables {μ̂_1, μ̂_2, ..., μ̂_t} forms a (nonstationary) Markov chain with transition probabilities determined by equation 11. Let Ψ(μ̂_t, t | μ) be the joint probability that the animal keeps selecting A_S until time step t. At t = 0, the animal selects A_S regardless of μ̂ under π*, so Ψ(μ̂, 0 | μ) = Pr[μ̂_0]. For t ≥ 1, Ψ(μ̂_t, t | μ) can be expressed recursively in terms of Ψ(μ̂_{t−1}, t − 1 | μ) and the transition probabilities. Let Pr[t, R | μ] and Pr[t, L | μ] be the joint probability mass functions that the animal makes a rightward or leftward choice at time t, respectively. These correspond to the probability that the point estimator μ̂(t) crosses the boundary of P_R or P_L for the first time at time t. The probabilities of making a rightward or leftward eye movement are the marginal probabilities summed over all possible crossing times: Pr[R | μ] = Σ_{t=1}^∞ Pr[t, R | μ] and Pr[L | μ] = Σ_{t=1}^∞ Pr[t, L | μ]. When the underlying motion direction is rightward, Pr[R | μ] represents the accuracy of motion discrimination and Pr[L | μ] represents the error rate. The mean reaction times for correct and error choices are the expected crossing times under the conditional probability that the animal makes decision A_R or A_L at time t, e.g., RT_R(μ) = Σ_{t=1}^∞ t · Pr[t, R | μ] / Pr[R | μ] (equation 20).
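The sketch below implements this first-passage computation for n = 1: the probability mass Ψ of still sampling is propagated forward one observation at a time, and mass that crosses into P_R or P_L is accumulated into the choice probabilities and mean crossing times. It assumes a `policy` array indexed by (m_R, m_L), such as the one produced by the value-iteration sketch above; the computation is truncated at the horizon of that array.

```python
import numpy as np

def choice_and_rt(policy, mu, t_max=None):
    """Compute Pr[R | mu], Pr[L | mu] and the mean number of sampling steps
    before each choice, for a true binomial parameter mu and a policy array
    indexed by (m_R, m_L) with n = 1 spike per step.

    Implements the forward recursion for Psi(mu_hat, t | mu): the probability
    of still sampling, carried forward one observation at a time.
    """
    if t_max is None:
        t_max = policy.shape[0] - 1
    psi = np.zeros(t_max + 2)      # psi[m_R] = Pr[still sampling with m_R rightward spikes]
    psi[0] = 1.0                   # before any observation the model always samples
    p_R, p_L = 0.0, 0.0
    rt_R, rt_L = 0.0, 0.0
    for t in range(1, t_max + 1):
        new_psi = np.zeros(t_max + 2)
        for m_R in range(t):                   # states still sampling at t - 1
            if psi[m_R] == 0.0:
                continue
            for o, p_o in ((1, mu), (0, 1.0 - mu)):
                nm_R = m_R + o
                action = policy[nm_R, t - nm_R]
                mass = psi[m_R] * p_o
                if action == "A_S":
                    new_psi[nm_R] += mass      # keep sampling
                elif action == "A_R":
                    p_R += mass; rt_R += t * mass
                else:
                    p_L += mass; rt_L += t * mass
        psi = new_psi
    return p_R, p_L, rt_R / max(p_R, 1e-12), rt_L / max(p_L, 1e-12)

# Example (continuing from the value-iteration sketch), rightward motion at 12.8% coherence:
# pR, pL, rtR, rtL = choice_and_rt(policy, mu=(1 + 0.128) / 2)
```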
Table 1. Model variables and their statistical relationships.

c — The coherence (motion strength) of the random dots stimulus, c ∈ [0, 1]; c is fixed within a trial.
d — The underlying motion direction, d ∈ {−1, +1}; d is fixed within a trial.
λ_R^MT, λ_L^MT — The average spike rates of MT neurons preferring the rightward or leftward direction, respectively, as functions of both the coherence c and the direction d, described in Equation 1.
b_t — The belief (posterior distribution) b_t = P(μ | o_{1:t}). With a beta-distributed initial belief b_0 = Beta(α_0, β_0), b_t is also beta distributed because the emission probability P(o_t | μ) is binomial. Without loss of generality, α_0 = β_0 = 1 throughout the paper.
R_S — A negative reward associated with the cost of an observation.
R_P — A positive reward associated with a correct eye movement.
R_N — A negative reward associated with an incorrect eye movement.
RT_step — The duration of a single observation, i.e., the real elapsed time per POMDP step. Only used to translate the number of POMDP time steps into real elapsed time when comparing with experimental data.

The left panel of Figure 4 shows performance accuracy as a function of motion strength c for the model (solid curve) and a monkey (black dots). The model parameters are the same as those in Figure 2, obtained using a binary search over R_P in the range 0 to 2000 with a minimum step size of 10.
The right panel of Figure 4 shows, for the same model parameters, the predicted mean reaction time RT_R(μ) for correct choices as a function of coherence c (with fixed direction d = +1) for the model (solid curve) and the monkey (black dots). Note that RT_R(μ) represents the expected number of POMDP time steps before a rightward eye movement A_R. It follows from the Poisson spiking process that the duration of each POMDP time step follows an exponential distribution with expectation inversely proportional to λ_R(μ) + λ_L(μ). In order to make a direct comparison with the monkey data RT*_R(μ), which is in units of real time, a linear regression was used to determine the duration RT_step of a single POMDP step. Note that the reaction time in a trial is the sum of the decision time and non-decision delays whose properties are not well understood; the offset RT_0 represents this non-decision residual time. We used the experimental mean reaction times reported in [31] for motion coherences c = {0.032, 0.064, 0.128, 0.256, 0.512} to compute the two coefficients RT_step and RT_0. The fitted unit duration per POMDP step is RT_step = 9.20 ms/step, and the offset is RT_0 = 358.5 ms, which is comparable to the roughly 300 ms average non-decision time reported in the literature [23,32]. There is essentially one parameter in our model needed to fit the experimental accuracy data, namely the reward ratio (R_N − R_P)/R_S. The other two parameters, RT_step and RT_0, are independent of the POMDP model and are used only to translate POMDP time steps into real elapsed time. The reward ratio has a direct physical interpretation and can easily be manipulated by experimenters: for example, changing the amount of reward for correct/incorrect choices, or giving subjects different speed instructions, will effectively change (R_N − R_P)/R_S. In Figure 5 (a), we show performance accuracies Pr[R | μ] and predicted mean reaction times for different values of the reward ratio: for fixed R_N and R_P, decreasing the magnitude of the sampling cost R_S makes observations more affordable and allows subjects to accumulate more evidence, which in turn leads to longer decision times and higher accuracy. Our model thus provides a quantitative framework for predicting the effects of reward parameters on the accuracy and speed of decision making. To test our theory, we compare the model predictions with experimental data from a human subject, reported by Hanks et al. [33], under different speed-accuracy regimes. In their experiments, human subjects were instructed to perform the random dots task under different speed-accuracy conditions. The red crosses in Figure 5 (b) represent the response time and accuracy of a human subject in the direction discrimination task with instructions to perform the task more carefully at a slower speed, while the black dots represent the task under normal speed conditions. The slower speed instruction encourages human subjects to accumulate more observations before making the final decision; in the model, this amounts to reducing the magnitude of the negative cost R_S associated with each sample. Indeed, this tradeoff between speed and accuracy was consistent with the predicted effects of changing the reward ratio. We first fit the model parameters to the experimental data under normal speed conditions, obtaining (R_N − R_P)/R_S, RT_step = 7.7 ms/step, and RT_0 = 204 ms (Figure 5 (b), black solid curves). The red dashed lines in Figure 5 (b) are model fits to the data under the slower speed instruction.
There is just one degree of freedom in this fit: all model parameters except the reward ratio, including the per-step duration RT_step = 7.7 ms/step and the non-decision residual time RT_0 = 204 ms, were fixed to the values used to fit the data in the normal speed regime. The human data in Figure 5 are from subject LH in [33].

Neural response during direction discrimination task
From Figure 2 (b), it is clear that for the random dots task, the animal does not need to store the whole two-dimensional optimal policy but only the two one-dimensional decision boundaries w_R and w_L. This naturally suggests a neural mechanism for decision making similar to that in drift diffusion models: LIP neurons compute the belief state from MT responses and employ divisive normalization to maintain the point estimate μ̂_t = (m_R + α_0)/(m + α_0 + β_0). We now explore the hypothesis that the response of LIP neurons represents the difference between μ̂ and the optimal decision threshold w_R(t). In this model, a rightward eye movement is initiated only when the difference m_R/(m_R + m_L) − w_R reaches a fixed bound (in this case, 0). Therefore, we modeled the firing rate λ_R^LIP of neurons in the lateral intraparietal area (LIP) as a function of this difference, where λ_0^LIP is the spontaneous firing rate of LIP neurons. Since B represents the termination bound, B = 49.6 spikes s^−1 from [30]. The firing rate λ_L^LIP is defined similarly. The above model makes two testable predictions about neural responses in LIP. The first is that the neural response to 0% coherent motion (the so-called "urgency" signal [30,34]) encodes the decision boundary w_R(t) (or w_L(t) for leftward-preferring LIP neurons). In Figure 6 (a), we plot the model response to 0% coherent motion, along with a fit to a hyperbolic function u(t) ∝ t/(t + t_{1/2}), the same function that Churchland et al. [30] used to parameterize the experimentally observed urgency signal. The parameter t_{1/2} is the time taken to reach 50% of the maximum. The estimate of t_{1/2} for the model from Figure 6 (a) is 123 ms, which is consistent with the t_{1/2} = 133.2 ms estimated from neural data [30]. The second prediction concerns the buildup rate (in units of spikes s^−2 coh^−1) of the LIP firing rates. The buildup rate of LIP at each motion strength is calculated from the slope of a line fit to the model LIP firing rate during the first 120 ms of decision time. As shown in Figure 6 (b), buildup rates scale approximately linearly as a function of motion coherence. The effect of a unit change in coherence on the buildup rate can be estimated from the slope of the fitted line to be 227.7 spikes s^−2 coh^−1, similar to what has been reported in the literature [30] (222.5 spikes s^−2 coh^−1).
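The two fits described above can be reproduced with standard routines. The sketch below assumes one already has a model LIP firing-rate trace (`t_ms`, `lip_rate`) for a given coherence, for example generated from the decision boundary as hypothesized above, and applies the hyperbolic urgency fit u(t) ∝ t/(t + t_{1/2}) and the 120 ms buildup-rate line fit from [30]; the explicit scale parameter u_max is an assumption implied by the proportionality.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_urgency_half_time(t_ms, rate):
    """Fit the 0%-coherence model response to u(t) = u_max * t / (t + t_half),
    the hyperbolic form used in [30], and return t_half (time to reach 50% of
    the maximum) in the same units as t_ms."""
    def hyperbolic(t, u_max, t_half):
        return u_max * t / (t + t_half)
    (u_max, t_half), _ = curve_fit(hyperbolic, t_ms, rate, p0=[rate.max(), 100.0])
    return t_half

def buildup_rate(t_ms, rate, window_ms=120.0):
    """Slope of a line fit to the firing rate over the first `window_ms` of
    decision time, following the procedure in [30]; returned in spikes s^-2."""
    keep = t_ms <= window_ms
    slope_per_ms, _intercept = np.polyfit(t_ms[keep], rate[keep], 1)
    return slope_per_ms * 1000.0               # convert spikes/s per ms to spikes s^-2

# Hypothetical usage, with t_ms and lip_rate taken from a simulated LIP trace:
# t_half = fit_urgency_half_time(t_ms, lip_rate)   # compare with ~133 ms in [30]
# slope = buildup_rate(t_ms, lip_rate)             # buildup rate at this coherence
```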

Discussion
The random dots motion discrimination task has provided a wealth of information regarding decision making in the primate brain. Much of this data has previously been modeled using the drift diffusion model [35,36], but to fully account for the experimental data, one sometimes has to invoke ad-hoc assumptions. This paper introduces an alternative model for explaining the monkey's behavior based on the framework of partially observable Markov decision processes (POMDPs).
We believe that the POMDP model provides a more versatile framework for decision making than the drift diffusion model, which can be viewed as a special case of sequential statistical hypothesis testing (SSHT) [37]. Sequential statistical hypothesis testing assumes that the stimuli (observations) are independent and identically distributed, whereas the POMDP model allows observations to be temporally correlated: the observations in the POMDP are conditionally independent given the hidden state m, which evolves according to a Markov chain. Thus, the POMDP framework for decision making [11,14,16,38,39] can be regarded as a strictly more general model than the SSHT models. We intend to explore the applicability of our POMDP model to time-dependent stimuli, such as temporally dynamic attention [40] and temporally blurred stimulus representations [41], in future studies. Another advantage of a POMDP model is that the model parameters have direct physical interpretations and can be easily manipulated by the experimenter. Our analysis shows that the optimal policy is fully determined by the reward parameters {R_P, R_N, R_S}. Thus, the model's psychometric and chronometric functions, which are derived from the optimal policy, are also fully determined by these parameters. Experimenters can control these reward parameters by changing the amount of reward for correct/incorrect choices, or by giving subjects different speed instructions. This allows our model to make testable predictions, as demonstrated by the effects of changes in the reward ratio on the speed-accuracy trade-off. It should be noted that these reward parameters can be subjective and may vary from individual to individual. For example, R_P can be directly related to the external food or juice reward provided by the experimenter, while R_S may be linked to internal factors such as degree of hunger or thirst, drive, and motivation. The precise relationship between these reward parameters and the external reward/risk controlled by the experimenter remains unknown. Our model thus provides a quantitative framework for studying this relationship between internal reward mechanisms and external physical reward.
The proposed model demonstrates how the monkey's choices in the random dots task can be interpreted as being optimal under the hypothesis of reward maximization. The reward maximization hypothesis has previously been used to explain behavioral data from conditioning experiments [8] and dopaminergic responses under the framework of temporal difference (TD) learning [42]. Our model extends these results to the more general problem of decision making under uncertainty. The model predicts psychometric and chronometric functions that are quantitatively close to those observed in monkeys and humans solving the random dots task.
We showed through analytical derivations and numerical simulation that the optimal threshold for selecting overt actions is a declining function of time. Such a collapsing decision bound has previously been obtained for decision making under a deadline [11,29]. It has also been proposed as an ad-hoc mechanism in drift diffusion models [28,30,43] for explaining finite response time at zero percent coherence. Our results demonstrate that a collapsing bound emerges naturally as a consequence of reward maximization. Additionally, the POMDP model readily generalizes to the case of decision making with arbitrary numbers of states and actions, as well as time-varying state.
Instead of traditional dynamic programming techniques, the optimal policy π* and value v* can be learned via Monte Carlo approximation-based methods such as temporal difference (TD) learning [27]. There is much evidence suggesting that the firing rate of midbrain dopaminergic neurons might represent the reward prediction error in TD learning. Thus, the learning of value and policy in the current model could potentially be implemented in a manner similar to previous TD learning models of the basal ganglia [8,9,11,42].