
The authors have declared that no competing interests exist.

Conceived and designed the experiments: LR EG. Performed the experiments: LR EG. Analyzed the data: LR EG. Contributed reagents/materials/analysis tools: LR EG. Wrote the paper: LR EG.

Costs (e.g. energetic expenditure) and benefits (e.g. food) are central determinants of behavior. In ecology and economics, they are combined to form a utility function which is maximized to guide choices. This principle is widely used in neuroscience as a normative model of decision and action, but current versions of this model fail to consider how decisions are actually converted into actions (i.e. the formation of trajectories). Here, we describe an approach where decision making and motor control are optimal, iterative processes derived from the maximization of the discounted, weighted difference between expected rewards and foreseeable motor efforts. The model accounts for decision making in cost/benefit situations, and detailed characteristics of control and goal tracking in realistic motor tasks. As a normative construction, the model is relevant to address the neural bases and pathological aspects of decision making and motor control.

Behavior is made of decisions and actions. The decisions are based on the costs and benefits of potential actions, and the chosen actions are executed through the proper control of body segments. The corresponding processes are generally considered in separate theories of decision making and motor control, which cannot explain how the actual costs and benefits of a chosen action can be consistent with the expected costs and benefits involved at the decision stage. Here, we propose an overarching optimal model of decision and motor control based on the maximization of a mixed function of costs and benefits. The model provides a unified account of decision in cost/benefit situations (e.g. choice between small reward/low effort and large reward/high effort options), and motor control in realistic motor tasks. The model appears suitable to advance our understanding of the neural bases and pathological aspects of decision making and motor control.

Consider a simple living creature that needs to move in its environment to collect food for survival (foraging problem;

Most theories of decision making and motor control do not account for these characteristics of behavior. The main reason for this is that decision and control are essentially blind to each other in the proposed frameworks

Here, we consider a normative approach to decision making and motor control derived from the theory of

The proposed model is a model for decision and action. It is based on an objective function representing a trade-off between expected benefits and foreseeable costs of potential actions (

The model contains five parameters (x*, r, ρ, ε, γ). Parameter x* specifies the location of the goal to be pursued, and acts as a classic boundary condition for a control policy. Parameter r specifies the magnitude of the reward available at the goal. Parameters x* and r are called task parameters.

For the purpose of decision and action, a reward value needs to be translated into an internal currency which measures “how much a reward is rewarding” (parameter ρ). A subject may not attribute the same value to food when hungry as when satiated, or the same value to money when playing Monopoly as when trading at the stock exchange.

Parameter ε is a scaling factor that expresses “how much an effort is effortful”. A subject may not attribute the same value to effort when rested as when exhausted. ρ and ε are redundant in the sense that only their ratio matters (
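The redundancy of ρ and ε can be checked directly. The sketch below uses a toy utility of the form U(T) = ρ e^{−T/γ} − ε C/T³ (an illustrative functional form, not the paper's exact objective) and shows that multiplying ρ and ε by the same constant leaves the optimal duration unchanged.

```python
import numpy as np

def optimal_duration(rho, eps, gamma=1.0, C=1.0):
    """Grid-search the duration T maximizing a toy utility
    U(T) = rho*exp(-T/gamma) - eps*C/T**3 (illustrative form only)."""
    T = np.linspace(0.1, 10.0, 10000)
    U = rho * np.exp(-T / gamma) - eps * C / T**3
    return T[np.argmax(U)]

# Scaling rho and eps by the same constant rescales U but leaves
# its argmax, hence the behavior, unchanged: only rho/eps matters.
T1 = optimal_duration(rho=2.0, eps=0.5)
T2 = optimal_duration(rho=20.0, eps=5.0)
```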

Parameter γ is a discount factor on reward and effort. It is both a computational parameter necessary for the formulation of the model, and a factor related to the process by which delayed or far away reinforcers lose value

In the following, ρ, ε, and γ are called internal parameters.

We note that the principle of the model is independent of the values of the parameters, i.e. the decision process and the control policy are generic characteristics of the model.

The model provides a normative criterion for decision making when choices involve different costs and benefits. To explore this issue, we considered the simple situation depicted in

The model further states that the same parameters underlie both decision and movement production. To test this idea, we modeled the experiment reported by Stevens et al.

σ_{SINs} = .001, σ_{SDNm} = 1.

For Object I, we have an analytic formula for the optimal movement duration T*(D, r) as a function of distance D and reward r; the two conditions correspond to (D_{1}, r_{1}) and (D_{2}, r_{2}).

We randomly drew pairs of movement durations (one for each condition) from a Gaussian distribution specified by the mean and sd (sd = s.e.m.×sqrt(N)), and solved T_{1} = T*(D_{1}, r_{1}) and T_{2} = T*(D_{2}, r_{2}).
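The sampling step can be sketched as follows; the means, s.e.m. values and sample size below are illustrative placeholders, not the published measurements.

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_durations(mean, sem, n_subjects, n_draws):
    """Draw movement durations from a Gaussian whose sd is
    reconstructed from the reported s.e.m.: sd = sem * sqrt(N)."""
    sd = sem * np.sqrt(n_subjects)
    return rng.normal(mean, sd, size=n_draws)

# Hypothetical values for the two conditions (not the published data)
d1 = draw_durations(mean=1.2, sem=0.05, n_subjects=9, n_draws=1000)
d2 = draw_durations(mean=2.0, sem=0.08, n_subjects=9, n_draws=1000)
```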

Then we computed for each monkey (i.e. for each set of parameters shown in

To determine the choice behavior of the monkeys from option utilities, we calculated the probability of choosing the large reward at the different distances vs the small reward at the shortest distance using a softmax rule, P(large) = 1/(1 + exp(−β(J_{∞}^{large} − J_{∞}^{small}))), where J_{∞}^{large} and J_{∞}^{small} are the utilities for the large reward and small reward options, respectively, and β a temperature parameter which represents the degree of randomness of the action selection. It should be noted that the softmax transform is not a part of the model, but a way to translate utilities into choice proportions, using the natural principle that different option utilities should lead to a proportion near 1 (or 0), and equal option utilities to a proportion of 0.5. The parameter β, which had no qualitative effect on the predicted preferences, was selected for each monkey to fit the data from Stevens. The model quantitatively reproduced the empirical results in the decision task for the two monkey species (
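The softmax transform is a one-line computation; the sketch below shows indifference for equal utilities and a near-deterministic choice for a clear utility gap (the utilities and β used are arbitrary examples).

```python
import math

def p_choose_large(u_large, u_small, beta):
    """Softmax over two option utilities: probability of choosing
    the large-reward option. beta controls choice randomness."""
    return 1.0 / (1.0 + math.exp(-beta * (u_large - u_small)))

# Equal utilities -> indifference (0.5);
# a clear utility gap -> near-certain choice.
```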

To assess more precisely the ability of the model to predict the choices, we performed a detailed analysis over the two sets of simulated utilities (not over choices, to rule out any confound induced by β). We found that distance to the large reward modulated the utility of the large reward for both species, and that: 1. for tamarins, the large reward option had a larger utility than the small reward option for all distances; 2. for marmosets, the large reward option had a larger utility than the small reward option only for test distances strictly smaller than 210 cm. These results exactly parallel the effects found by Stevens, and show that the model can quantitatively predict the inversion of preferences of the different species. This further supports the hypothesis that the same process governs decision making and action in a cost/benefit choice situation.

The model reproduced basic characteristics of motor behavior, as expected from the close relationship with previous optimal control models

Unexpected events can perturb an ongoing action, and prevent a planned movement from reaching its goal. Typical examples are sudden changes in target location

In the experiment of Liu and Todorov, target jumps were simulated by changing the goal (x*) in the controller at different times (perturbation time + Δ, to account for delayed perception of the change). The parameters of the model were estimated from unperturbed trials. The model quantitatively reproduced trajectory formation (
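The logic of goal updating under feedback control can be sketched with a toy simulation. The controller below is a simple PD feedback law, not the model's optimal controller, and the gains, delay, and goal positions are illustrative assumptions; the point is only that switching the internal goal at perturbation time + Δ smoothly redirects the trajectory.

```python
import numpy as np

def track_with_jump(x_goal_initial, x_goal_final, t_jump, delay,
                    T=1.5, dt=0.001, kp=200.0, kd=30.0, m=1.0):
    """Simulate a point mass driven by PD feedback toward the current
    goal; the internal goal switches to x_goal_final at t_jump + delay
    (delayed perception of the target jump)."""
    n = int(T / dt)
    x, v = 0.0, 0.0
    xs = np.empty(n)
    for i in range(n):
        t = i * dt
        goal = x_goal_final if t >= t_jump + delay else x_goal_initial
        u = kp * (goal - x) - kd * v       # PD feedback toward goal
        v += (u / m) * dt                  # semi-implicit Euler step
        x += v * dt
        xs[i] = x
    return xs

traj = track_with_jump(x_goal_initial=0.1, x_goal_final=0.2,
                       t_jump=0.3, delay=0.1)
```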

In the experiment of Shadmehr and Mussa-Ivaldi

These results illustrate how a unique set of parameters, and thus a unique controller, explains both normal trajectory formation, and complex updating of motor commands and trajectories when participants face unexpected perturbations. The same mechanisms (optimality, feedback control, implicit determination of duration) underlie basic motor characteristics (scaling law), and flexible control and goal tracking in complex situations.

The model is governed by the vigor (ρ/ε) and discount (γ) factors that can modulate both the decision process and the control policy (

Decision making in a cost/benefit situation (

Motor control was characterized by scaling laws (

Along the scaling laws defined by each factor (

σ_{SINs} = .001, σ_{SDNm} = 1.

Overall, these results show that the internal parameters modulate decision making and motor control in a way that makes sense from a physiological and psychological point of view.

We have presented a computational framework that describes decision making and motor control as an ecological problem. The problem was cast in the framework of reinforcement learning, and the solution formulated as an optimal decision process and an optimal control policy. The resulting model successfully addressed decision making in cost/benefit situations and control in realistic motor tasks.

The proposed model is not intended to be a general theory of decision making and motor control, which may not be feasible (e.g.

Our model is closely related to previous works in the field of decision making and motor control. The central idea derives from optimal feedback control theory

A series of studies by Trommershäuser and colleagues

A central and novel aspect of the model is the integration of motor control into the decision process. This idea was not exploited in previous models because movement duration was fixed

The model was described here in its simplest form. In particular, decision making was considered as a deterministic process. The scope of the model could easily be extended to address stochastic paradigms as in previous models

An analysis of behavior in terms of costs and benefits has long been standard in behavioral ecology

A central observation in behavioral settings is that the calculation of cost involves a detailed knowledge of motor behavior

The study of Dean et al.

A central property of the model is motor control, i.e. the formation of trajectories for redundant biomechanical systems. This property is inherited from a close proximity with previous models based on optimal feedback control

The model is governed by task and internal parameters that specify choices in cost/benefit situations, and kinematics and precision in motor tasks. These parameters have a psychological and neural dimension that we discuss below.

Parameter

Parameter γ has two dimensions. On the one hand, it is a

The second question is related to the relationship between delay discounting and velocity. The study of Stevens et al.

The model was applied to pure motor tasks in which there was no explicit reward

The model is built on a classic control/estimation architecture (

A central proposal of the model is a common basis for decision and action. The only available data that quantitatively support this proposal are those of Stevens et al.

The preceding results involved locomotor patterns, but appropriate data for arm movements could be obtained using methods described in

Our objective is to formulate a unified model of decision making and motor control. Classical normative approaches formalize decision making as a maximization process on a

We have arbitrarily chosen the notations of control theory (

The principles of the model are first explained on a simple, deterministic example. Then the complete, stochastic version is described. The model is cast in the framework of reinforcement learning, although we exploit only the optimal planning/decision processes of RL, not the learning processes. The rationale for this choice is the following. Formally, the model corresponds to an infinite-horizon optimal control problem

We consider an inertial point (controlled object) described by its mass m, driven by a force u(t) (t ∈ [t_{0}; t_{f}]) that can displace the point between given states in the duration t_{f}−t_{0}. In the framework of optimal control theory, the control policy is derived from the constraint to minimize a cost function over [t_{0}; t_{f}], where
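For this starting example, the minimizing trajectory can be written in closed form. The sketch below assumes a squared-control cost (∫u² dt) and rest-to-rest boundary conditions, which yield the classic cubic position profile with a linearly decreasing control; D, T and m are arbitrary illustrative values.

```python
import numpy as np

def min_effort_trajectory(D, T, m=1.0, n=1001):
    """Rest-to-rest minimum-effort (integral of u^2) move of a point
    mass: x(t) = D*(3s^2 - 2s^3) with s = t/T, so that
    u(t) = m*x''(t) = m*(6D/T^2)*(1 - 2s)."""
    t = np.linspace(0.0, T, n)
    s = t / T
    x = D * (3 * s**2 - 2 * s**3)          # position
    v = (D / T) * (6 * s - 6 * s**2)       # velocity (bell-shaped)
    u = m * (6 * D / T**2) * (1 - 2 * s)   # control force
    return t, x, v, u

t, x, v, u = min_effort_trajectory(D=0.3, T=0.5)
```

The total effort of this solution is ∫u² dt = 12 m²D²/T³, so shorter durations are steeply more effortful, which is the cost side of the trade-off used throughout the model.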

An alternative approach has been elaborated as an extension of RL in continuous time and space

We consider the case of a simple reward landscape with a single rewarded state x*, i.e. the reward function equals r at x*, and 0 everywhere else, and ρ and ε are scaling factors for reward and effort, respectively (see

where the term ρ r e^{−T/γ} is the discounted reward (this result comes from the fact that the reward is collected only at time T, so the discounted reward integral reduces to ρ r e^{−T/γ}).

The maximization of J_{∞} requires finding a time T* that realizes the best trade-off between the discounted reward (ρ r e^{−T/γ}) and the discounted effort. Such a T* may not exist in general, depending on the shape of the reward and effort terms. The model thus defines a decision process (act only if T* exists), and a control process (if T* exists, act with the optimal control policy defined by T*). In the following, the maximal value of J_{∞} (attained for T*) will be called the utility.
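This trade-off can be made concrete with a simplified instance of the objective: the discounted reward ρ r e^{−T/γ} minus the minimum squared-control effort of a rest-to-rest point-mass move (12 m²D²/T³, a standard result). Discounting of the effort term is omitted for simplicity, so this is an illustrative simplification, not the paper's exact function; the parameter values are arbitrary.

```python
import numpy as np

def utility(T, D, r, rho=1.0, eps=1.0, gamma=5.0, m=1.0):
    """Discounted reward minus effort for a move of duration T over
    distance D. Effort term 12*m^2*D^2/T^3 is the minimum integral of
    u^2 for a point mass (effort discounting omitted for simplicity)."""
    return rho * r * np.exp(-T / gamma) - eps * 12 * m**2 * D**2 / T**3

def optimal_duration(D, r, **kw):
    """Maximize utility over a grid of candidate durations; returns
    (T*, U(T*)). A negative maximal utility would mean that acting
    is not worthwhile (the decision process declines)."""
    T = np.linspace(0.05, 20.0, 20000)
    U = utility(T, D, r, **kw)
    i = np.argmax(U)
    return T[i], U[i]

# Longer distances call for longer optimal durations...
T_near, U_near = optimal_duration(D=0.5, r=10.0)
T_far, U_far = optimal_duration(D=2.0, r=10.0)
# ...and larger rewards for shorter (more vigorous) movements.
T_small, _ = optimal_duration(D=1.0, r=10.0)
T_large, _ = optimal_duration(D=1.0, r=100.0)
```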

This description in terms of duration should not hide the fact that duration is only an intermediate quantity in the maximization of the utility function, and direct computation of choices and commands is possible without explicit calculation of duration

If there are multiple reward states in the environment, the utility defines a normative priority order among these states. A decision process which selects the action with the highest utility will choose the best possible cost/benefit compromise.

The proposed objective function involves two elements that are central to a decision making process: the benefit and the cost associated with a choice. A third element is uncertainty on the outcome of a choice. In the case where uncertainty can be represented by a probability (risk), this element could be integrated in the decision process without substantial modification of the model. A solution is to weight the reward value by the probability, in order to obtain an “expected value”. Another solution is to consider that temporal discounting already contains a representation of risk

In summary,

For any dynamics (

The general control architecture comprises: 1. A controlled object with state x_{OBJ}; 2. A controller which generates the controls from the state estimate x̂ (described below); 3. An estimator which updates x̂ according to x_{OBS}, the observation vector corrupted by observation noise, with Δ the time delay in sensory feedback pathways. The observed states were the position and velocity of the controlled object.

Object noise was a multiplicative (signal-dependent) noise with standard deviation σ_{SDNm}, and observation noise was an additive (signal-independent) noise with standard deviation σ_{SINs}

A simulation consisted in calculating the utility (maximal value of the objective function), and the timecourse of object state and controls for a given dynamics and a given set of parameters (x*, ρ, ε, γ, σ_{SINs}, σ_{SDNm}, Δ). The solution was calculated iteratively at discretized times (timestep η). At each time step, the state estimate x̂ was updated with the Kalman filter (using σ_{SINs}, σ_{SDNm}) to obtain x̂(t+η), and the corresponding control was applied.
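A minimal version of such a control/estimation loop can be sketched as follows, assuming a 1-D point mass with position/velocity observations. The feedback law is a simple PD controller standing in for the paper's optimal policy, and all gains and noise magnitudes are illustrative; what the sketch shows is the structure of the loop: signal-dependent motor noise on the command, additive observation noise, and a Kalman filter maintaining the estimate used by the controller.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(x_star=0.2, T=1.0, eta=0.01, m=1.0,
             sdn_m=0.2, sin_s=0.001, kp=80.0, kd=15.0):
    """Discretized loop: multiplicative (signal-dependent) motor noise
    of sd sdn_m*|u|, additive observation noise of sd sin_s, and a
    Kalman filter feeding a PD feedback controller."""
    A = np.array([[1.0, eta], [0.0, 1.0]])   # state transition
    B = np.array([[0.0], [eta / m]])         # control input
    H = np.eye(2)                            # observe pos and vel
    R = (sin_s**2) * np.eye(2)               # observation noise cov
    x = np.zeros((2, 1))                     # true state
    x_hat = np.zeros((2, 1))                 # state estimate
    P = 1e-6 * np.eye(2)                     # estimate covariance
    for _ in range(int(T / eta)):
        u = kp * (x_star - x_hat[0, 0]) - kd * x_hat[1, 0]
        u_noisy = u + sdn_m * abs(u) * rng.standard_normal()
        x = A @ x + B * u_noisy              # true dynamics
        Q = B @ B.T * (sdn_m * u)**2         # process noise cov
        x_hat = A @ x_hat + B * u            # Kalman prediction
        P = A @ P @ A.T + Q
        y = H @ x + sin_s * rng.standard_normal((2, 1))
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x_hat = x_hat + K @ (y - H @ x_hat)  # Kalman update
        P = (np.eye(2) - K @ H) @ P
    return x, x_hat

x_final, x_hat_final = simulate()
```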

Three types of object were considered, corresponding to different purposes. The rationale was to use the simplest object which is deemed sufficient for the intended demonstration. Object I was a unidimensional linear object similar to that described in the starting example. The force generating system was

For Objects I and II, the mass and the noise parameters (σ_{SINs}, σ_{SDNm}) were chosen to obtain an appropriate functioning of the Kalman filter, and a realistic level of variability. The remaining parameters (x*,

Object III is a two-joint (shoulder, elbow) planar arm. Its dynamics is given by a standard rigid-body equation, where (θ_{1}, θ_{2}) is the vector of joint angles, F_{max} the matrix of maximal muscular forces, and

For each segment (1: upper arm, 2: forearm), the parameters are the length l, the moment of inertia I, the mass m, and the distance lc between the joint and the center of mass. F_{max} (N) is diag([700; 382; 572; 449]). Matrix

Two sets of parameter values were used in the simulations. For Object IIIa, we used the values l_{1} = .30, l_{2} = .33, I_{1} = .025, I_{2} = .045, m_{1} = 1.4, m_{2} = 1.1, lc_{1} = .11, lc_{2} = .16. For Object IIIb, we used the values l_{1} = .33, l_{2} = .34, I_{1} = .0141, I_{2} = .0188, m_{1} = 1.93, m_{2} = 1.52, lc_{1} = .165, lc_{2} = .19.
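From these segment parameters, the configuration-dependent inertia matrix of the arm follows from the standard planar two-joint rigid-body formula. The sketch below uses the Object IIIa values, assuming l, I, m, lc denote segment length, moment of inertia, mass, and joint-to-center-of-mass distance; it is a generic textbook construction, not code from the study.

```python
import numpy as np

# Object IIIa segment parameters (1: upper arm, 2: forearm)
l1, l2 = 0.30, 0.33      # lengths (m)
I1, I2 = 0.025, 0.045    # moments of inertia (kg m^2)
m1, m2 = 1.4, 1.1        # masses (kg)
lc1, lc2 = 0.11, 0.16    # joint-to-center-of-mass distances (m)

def inertia_matrix(theta2):
    """Inertia matrix of a planar two-joint arm (standard rigid-body
    form); it depends only on the elbow angle theta2."""
    a1 = I1 + I2 + m2 * l1**2
    a2 = m2 * l1 * lc2
    a3 = I2
    c2 = np.cos(theta2)
    return np.array([[a1 + 2 * a2 * c2, a3 + a2 * c2],
                     [a3 + a2 * c2,     a3]])

M = inertia_matrix(np.pi / 2)   # elbow at 90 degrees
```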

The problem is to find the sequence of controls u(t) that maximizes the objective function, given the initial condition x(t_{0}) = x_{0} and the goal x*, for a given dynamics

If the dynamics is linear (Objects I and II), T* and the optimal controls can be obtained analytically. Symbolic calculus was performed with Maxima (Maxima, a Computer Algebra System. Version 5.18.1 (2009)

When the dynamics is nonlinear (Object III), the set of differential equations (

We thank O. Sigaud, A. Terekhov, P. Baraduc, and M. Desmurget for fruitful discussions.
