## Abstract

Which strategies subjects follow in various behavioral circumstances has been a central issue in decision making. In particular, which behavioral strategy, maximizing or matching, is more fundamental to animals' decision behavior has been a matter of debate. Here, we prove that any algorithm to achieve the stationary condition for maximizing the average reward should lead to matching when it ignores the dependence of the expected outcome on the subject's past choices. We may term this strategy of partial reward maximization “matching strategy”. Then, this strategy is applied to the case where the subject's decision system updates the information for making a decision. Such information includes the subject's past actions or sensory stimuli, and the internal storage of this information is often called “state variables”. We demonstrate that the matching strategy provides an easy way to maximize reward when combined with the exploration of the state variables that correctly represent the crucial information for reward maximization. Our results reveal for the first time how a strategy to achieve matching behavior is beneficial to reward maximization, providing novel insight into the relationship between maximizing and matching.

**Citation:** Sakai Y, Fukai T (2008) When Does Reward Maximization Lead to Matching Law? PLoS ONE 3(11): e3795. https://doi.org/10.1371/journal.pone.0003795

**Editor:** Tim Bussey, University of Cambridge, United Kingdom

**Received:** April 4, 2008; **Accepted:** November 4, 2008; **Published:** November 24, 2008

**Copyright:** © 2008 Sakai et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** This work was partially supported by the Grant in Aid for Priority Researches, no. 17022036 and no. 17021038. The funder had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

How do animals, including humans, determine appropriate behavioral responses when their behavioral outcomes are uncertain? Decision-making is a fundamental process of the brain for organizing behaviors, and depends crucially on how subjects have been rewarded in their past behavioral responses. The mechanisms of reward-driven learning have been studied extensively, both theoretically and experimentally. A well-known example is the reinforcement learning theory based on the temporal difference (TD) error algorithm[1], which is powerful enough to solve difficult problems in machine control and accounts for the basal-ganglia activity representing reward expectancy in monkeys and humans[2]–[4]. It is generally considered that subjects attempt to choose a behavioral policy that will maximize the amount of reward under a given environmental condition [5]. In addition, many algorithms in machine learning and other brain-style computations aim at reward maximization or, somewhat more generally, optimization of a given cost function.

Nevertheless, animals often exhibit matching behavior in a variety of decision-making tasks[6]–[9], even if such behavior does not necessarily maximize reward. The matching law states that the frequency of choosing an option is proportional to the amount of past reward obtained from that option[6]: $N_a/(N_1+N_2+\dots+N_n) = I_a/(I_1+I_2+\dots+I_n)$, where $N_a$ ($a = 1, \dots, n$) represents the number of times option $a$ has been chosen and $I_a$ the total amount of income obtained from that option. A typical example showing this law is the alternative choice task, in which subjects have to choose one of two options that may be rewarded at different average rates. Matching and maximizing are mathematically equivalent in simple tasks[10], [11], but not in arbitrary tasks[12]–[15].

Decision-making models to reproduce the matching behavior have been proposed[9], [16], [17], and recent computational studies pointed out possible origins of matching behavior in biological neural systems[18], [19]. For instance, a recent model proposed that the matching law results from the covariance learning rule in synaptic plasticity[19]. In addition, we previously demonstrated that the matching law emerges in a class of reinforcement learning systems including the actor-critic[20], [21], which has been widely used in engineering applications. However, whether matching and maximizing share a common computational principle and whether matching behavior is beneficial to decision making remain unclear. In this study, we propose a view that unifies matching behavior into the general computational framework of reward maximization.

## Results

We first prove that partial maximization of reward leads to matching behavior irrespective of the mathematical algorithm used for this computation. A crucial step is to define “the matching strategy” that plays a central role in the present study. We then demonstrate how the matching strategy substitutes for the maximizing strategy in a decision-making task that is difficult to solve, when matching is combined with an appropriate utilization of available information sources.

### Matching as a Sub-optimal Maximizing Strategy in Independent Choice Behaviors

The analysis is easier if we express the matching law as follows[8]:

$$\langle r|a\rangle = \langle r\rangle \quad \text{for every option } a \text{ with } N_a > 0, \tag{1}$$

where $\langle r\rangle$ is the average reward per choice from all options and $\langle r|a\rangle$ the average reward conditioned on choice of option $a$. We can derive the above expression from the relationship $I_a \cong \langle r|a\rangle N_a$. Thus, the matching law equalizes the expected returns on all the options that are chosen sufficiently many times. Note that the matching law should not be confused with “probability matching”[22], which states that the frequency of choosing option $a$ is proportional to $\langle r|a\rangle$ rather than $I_a$. Probability matching is typically observed in a task in which each expected return $\langle r|a\rangle$ is fixed and independent of the subject's behavior (i.e., concurrent variable-ratio schedules). In such a simple task, the maximizing behavior satisfies the matching law, but not probability matching. Hereafter, we focus on the matching law. Moreover, we consider the case where subjects make choices at fixed intervals. We can employ discrete time steps without much loss of generality, since the framework describes a free-response task in continuous time if the interval is sufficiently short and choosing nothing is an available option.
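The step from the matching law to the equal-returns form can be written out explicitly. The following is a sketch under the assumption that the total number of choices is denoted $N \equiv \sum_b N_b$ (a symbol introduced here for illustration only) and that the total income satisfies $\sum_b I_b \cong \langle r\rangle N$ with $\langle r\rangle \neq 0$:

```latex
\begin{aligned}
\frac{N_a}{\sum_b N_b} = \frac{I_a}{\sum_b I_b}
&\;\Longrightarrow\; \frac{N_a}{N} \cong \frac{\langle r|a\rangle\, N_a}{\langle r\rangle\, N}
&& \text{using } I_a \cong \langle r|a\rangle N_a \\
&\;\Longrightarrow\; N_a \bigl( \langle r|a\rangle - \langle r\rangle \bigr) \cong 0 \\
&\;\Longrightarrow\; \langle r|a\rangle \cong \langle r\rangle
&& \text{whenever } N_a \neq 0 .
\end{aligned}
```

Options that are never chosen ($N_a = 0$) satisfy the matching law trivially, which is why the equality of returns is required only for options chosen sufficiently many times.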

We analyze the outcome of the decision process without specifying the details of the neural decision system. To this end, we assume a set of ‘synapses’ $\mathbf{w} = (w_1, w_2, \dots, w_m)$ that determines the behavioral policy for making decisions. These variables are often called “policy parameters” in mathematical models of decision making. Then, the probability of choosing option $a$ is given as a function $p_a(\mathbf{w})$ of the synaptic weights. To ensure a smooth search for an optimal set of choice probabilities, we require that arbitrary infinitesimal changes of $\{p_a(\mathbf{w})\}$ allowed in the space of choice probabilities can be caused by some set of infinitesimal changes $\{dw_j\}$.

With the above definitions, we can describe the average reward per choice as $\langle r\rangle = \sum_a \langle r|a\rangle\, p_a(\mathbf{w})$. Many decision-making algorithms attempt to maximize $\langle r\rangle$ by modifying behavioral outputs. Whatever algorithm is used, the synaptic weights that maximize $\langle r\rangle$ should satisfy the stationary condition $\partial\langle r\rangle/\partial w_j = 0$ for arbitrary $j$, i.e.,

$$\sum_a \langle r|a\rangle \frac{\partial p_a(\mathbf{w})}{\partial w_j} + \sum_a p_a(\mathbf{w}) \frac{\partial \langle r|a\rangle}{\partial w_j} = 0. \tag{2}$$

The first term contains the explicit dependence of the choice probability on $w_j$, whereas the second term contains the possible change in $\langle r|a\rangle$ generated implicitly by the change in the subject's behavioral policy. The conditional expectation value $\langle r|a\rangle$ is obtained by taking an average over all possible patterns of past choices in which the newest choice is option $a$. In general, the reward probability depends not only on the current choice, but also on the history of the past choices[6], [12]–[15]. In such a case, $\langle r|a\rangle$ depends on the choice probabilities that produced the past choices, and hence depends on $w_j$.

In order to maximize reward, the brain has to explore the correct dependence of the reward probability on the past choices. It seems, however, difficult to infer this dependency correctly with little knowledge of an accurate model of the environment. In such a difficult situation, the brain may simply omit the second term in Eq. 2 in its practical attempt to maximize reward,

$$\sum_a \langle r|a\rangle \frac{\partial p_a(\mathbf{w})}{\partial w_j} = 0. \tag{3}$$

Multiplying Eq. 3 by arbitrary variations $\{dw_j\}$ and taking a summation over $j$ gives $\mathbf{R}\cdot d\mathbf{p}(\mathbf{w}) = 0$, where $dp_a(\mathbf{w}) \equiv \sum_j (\partial p_a/\partial w_j)\,dw_j$ represents the infinitesimal change caused by $\{dw_j\}$, and $\mathbf{R} \equiv (\langle r|1\rangle, \langle r|2\rangle, \dots, \langle r|n\rangle)$ and $d\mathbf{p}(\mathbf{w}) \equiv (dp_1(\mathbf{w}), dp_2(\mathbf{w}), \dots, dp_n(\mathbf{w}))$ are vectors in the space of multiple options. If all options have non-vanishing stationary choice probabilities, the probability changes $d\mathbf{p}(\mathbf{w})$ may occur in an arbitrary direction that satisfies the probability conservation $\mathbf{1}\cdot d\mathbf{p}(\mathbf{w}) = 0$, where $\mathbf{1} \equiv (1, 1, \dots, 1)$ is the $n$-dimensional vector of ones. Therefore, the conditions $\mathbf{R}\cdot d\mathbf{p}(\mathbf{w}) = 0$ and $\mathbf{1}\cdot d\mathbf{p}(\mathbf{w}) = 0$ can simultaneously be satisfied only by an $\mathbf{R}$ that is parallel to $\mathbf{1}$. If the stationary choice probability vanishes for some option, $p_a = 0$, we can forbid the changes in this direction ($dp_a = 0$), and $\mathbf{R}$ should have identical components for all the options exhibiting non-zero choice probabilities. These results and Eq. 1 imply that the truncated stationary condition given by Eq. 3 is equivalent to the matching law.

Thus, the steady choice behavior exhibits matching when the decision system ignores the influence of the subject's past choices on the expected outcome in aiming for the stationary condition of reward maximization. Hereafter, we call this suboptimal maximization strategy to achieve Eq. 3 “the matching strategy”. By contrast, we call the strategy to directly solve Eq. 2 “the maximizing strategy”.
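As a worked special case of the argument above, consider two options ($n = 2$) that are both chosen with non-zero probability. This is a sketch of the general vector argument for illustration, not an additional result:

```latex
\begin{aligned}
\mathbf{1}\cdot d\mathbf{p} = dp_1 + dp_2 = 0
  &\;\Longrightarrow\; dp_2 = -dp_1, \\
\mathbf{R}\cdot d\mathbf{p}
  = \langle r|1\rangle\, dp_1 + \langle r|2\rangle\, dp_2
  &= \bigl( \langle r|1\rangle - \langle r|2\rangle \bigr)\, dp_1 = 0 \\
  &\;\Longrightarrow\; \langle r|1\rangle = \langle r|2\rangle
  \qquad (dp_1 \neq 0).
\end{aligned}
```

Because $\langle r\rangle = \langle r|1\rangle p_1 + \langle r|2\rangle p_2$ and $p_1 + p_2 = 1$, equal conditional returns also equal the overall average $\langle r\rangle$, which is the equal-returns form of the matching law.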

To demonstrate the above relationship between the matching and maximizing strategies, we study an alternative choice task ($n = 2$), in which the expectation value of return on each choice pattern is specified completely by the subject's current ($a_t$) and most recent ($a_{t-1}$) choices as $\langle r_t|a_t, a_{t-1}\rangle = g_{a_t a_{t-1}}$ (see Methods). We consider the case where the subject's current choice is independent of its past choices. Hereafter, such decision behavior is called “independent choice behavior”. Since $p_2(\mathbf{w}) = 1 - p_1(\mathbf{w})$, the subject's decision system controls only the choice probability $p_1(\mathbf{w})$ through $\mathbf{w}$, and chooses option 1 with probability $p_1(\mathbf{w})$ at every time step. Then the average return on the current choice $\langle r_t|a_t\rangle$ is obtained by averaging over the possible patterns $a_{t-1} = 1, 2$ as $\langle r_t|a_t\rangle = g_{a_t 1}\, p_1(\mathbf{w}) + g_{a_t 2}\,(1 - p_1(\mathbf{w}))$, and hence depends on $\mathbf{w}$ through the choice probability $p_1(\mathbf{w})$. Since $\partial\langle r_t|a_t\rangle/\partial w_j \neq 0$, the matching strategy does not maximize reward in this task. Actually, it gives $\langle r\rangle = 0.25$ whereas the maximizing strategy yields $\langle r\rangle = 0.45$ (Figure 1).

**Figure 1.** The reward probability is given as a function of the current and most recent choices, but the subject makes each choice independently of the past choices. The task parameters are set as $g_{11} = 0$, $g_{21} = 0.2$, $g_{12} = 1$ and $g_{22} = 0.4$. The expectation values are given as $\langle r|a\rangle = g_{a1}\, p_1 + g_{a2}\,(1 - p_1)$ and $\langle r\rangle = \langle r|1\rangle\, p_1 + \langle r|2\rangle\,(1 - p_1)$. The matching (vertical solid line) and maximizing (vertical dashed line) choice probabilities are obtained as solutions of the equations $\langle r|1\rangle = \langle r|2\rangle$ and $d\langle r\rangle/dp_1 = 0$, respectively. The matching strategy ($\langle r\rangle = 0.25$) earns less than the maximizing strategy ($\langle r\rangle = 0.45$) in this task.
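The two choice probabilities quoted for this task can be checked numerically. The sketch below uses the task parameters from the Figure 1 legend and a simple grid search over $p_1$; the function names and the grid resolution are illustrative choices, not part of the original study.

```python
import numpy as np

# Task parameters from the Figure 1 legend: G[a-1, a_prev-1] = g_{a a_prev}.
G = np.array([[0.0, 1.0],
              [0.2, 0.4]])


def conditional_returns(p1):
    """Return (<r|1>, <r|2>) for an independent chooser with P(option 1) = p1."""
    previous_choice = np.array([p1, 1.0 - p1])  # distribution of a_{t-1}
    return G @ previous_choice


def average_reward(p1):
    """Average reward per choice, <r> = <r|1> p1 + <r|2> (1 - p1)."""
    r1, r2 = conditional_returns(p1)
    return r1 * p1 + r2 * (1.0 - p1)


def return_gap(p1):
    """Absolute difference |<r|1> - <r|2>|, which vanishes at matching."""
    r1, r2 = conditional_returns(p1)
    return abs(r1 - r2)


grid = np.linspace(0.0, 1.0, 10001)
p_match = grid[np.argmin([return_gap(p) for p in grid])]
p_max = grid[np.argmax([average_reward(p) for p in grid])]

print("matching:   p1 =", p_match, " <r> =", average_reward(p_match))
print("maximizing: p1 =", p_max, " <r> =", average_reward(p_max))
```

With these parameters the matching condition is met near $p_1 = 0.75$ and the maximum of $\langle r\rangle$ lies near $p_1 = 0.25$, in agreement with the average rewards of 0.25 and 0.45 stated in the legend.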

The matching strategy enables us to derive a variety of learning rules that lead to matching behavior (Supporting Text S1). For instance, such a category of learning rules includes the well-known actor-critic in the reinforcement learning theory [1], [20], [21], direct actor[23], melioration[16] and local matching[9]. In particular, the actor-critic and direct actor also belong to the covariance rule[19]. We numerically solved the decision task analyzed in Figure 1 to show that all these learning algorithms generate matching behavior (Figure 2A). By contrast, indirect actor [23] does not exhibit matching in the steady behavior (Figure 2B). The indirect actor belongs to Q-learning without state variables[1] (see below for the state variables). Since Q-learning determines the choice probabilities by estimating “action values”, i.e., the expected returns on individual options, it does not show matching.

**Figure 2.** The horizontal and vertical axes indicate the cumulative numbers of choices given to option 1 and 2, respectively. Dashed and solid line segments indicate the slopes corresponding to the maximizing and matching choice probabilities, respectively. See Methods for details of the algorithms. (A) The actor-critic (red), direct actor (magenta), local matching (blue) and melioration (green) were numerically simulated with *β* = 4. (B) Q-learning was simulated for *β* = 2, 4, 8, 16, and 32. At *β* = 32, the system eventually learns to choose only option 2.
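The difference between the full and the truncated stationary conditions can also be illustrated without any stochastic learning rule. The sketch below is not one of the algorithms simulated in the paper: it assumes a single weight `w` with a logistic choice probability $p_1 = 1/(1 + e^{-w})$ and performs deterministic gradient ascent on exact expectations, either with both terms of the stationary condition or with the first term only. The step size and number of steps are arbitrary illustrative values.

```python
import math

# Task parameters g_{a a'} from the Figure 1 legend (rows: current choice a,
# columns: previous choice a').
G = ((0.0, 1.0),
     (0.2, 0.4))


def conditional_returns(p1):
    """Return (<r|1>, <r|2>) for an independent chooser with P(option 1) = p1."""
    r1 = G[0][0] * p1 + G[0][1] * (1.0 - p1)
    r2 = G[1][0] * p1 + G[1][1] * (1.0 - p1)
    return r1, r2


def ascend(use_second_term, w=0.0, rate=1.0, steps=5000):
    """Deterministic gradient ascent on a single logistic weight (assumed form).

    use_second_term=False keeps only the explicit term (matching strategy);
    use_second_term=True adds the implicit term (maximizing strategy).
    Returns the final choice probability p1.
    """
    for _ in range(steps):
        p1 = 1.0 / (1.0 + math.exp(-w))
        dp1 = p1 * (1.0 - p1)          # dp_1/dw; note dp_2/dw = -dp1
        r1, r2 = conditional_returns(p1)
        grad = (r1 - r2) * dp1          # sum_a <r|a> dp_a/dw
        if use_second_term:
            dr1 = (G[0][0] - G[0][1]) * dp1   # d<r|1>/dw
            dr2 = (G[1][0] - G[1][1]) * dp1   # d<r|2>/dw
            grad += p1 * dr1 + (1.0 - p1) * dr2
        w += rate * grad
    return 1.0 / (1.0 + math.exp(-w))


p_matching = ascend(use_second_term=False)
p_maximizing = ascend(use_second_term=True)
print(p_matching, p_maximizing)
```

Under these assumptions the truncated update settles where $\langle r|1\rangle = \langle r|2\rangle$, while the full update settles at the maximum of $\langle r\rangle$ among independent choice behaviors.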

### Matching vs. Maximizing over All Possible Choice Behaviors

The quantitative analysis conducted in Figure 1 was restricted to the case where the subject generates independent choice behaviors. It was shown that the maximizing strategy earns more than the matching strategy. However, the average reward $\langle r\rangle = 0.45$ achieved by the maximizing strategy in Figure 1 is not the global maximum, but only the best among independent choice behaviors. For instance, an alternating choice pattern of 1212…, where the current choice depends on the most recent choice, can earn more ($\langle r\rangle = (g_{12}+g_{21})/2 = 0.6$) than the best independent choice behavior in that task. Thus, to produce a better outcome in some situations, the subject is required to make each choice depending on the past choices or other available information. Below, we investigate the relationship between the matching and maximizing strategies, taking all possible choice behaviors into account.
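The comparison between independent and history-dependent choice behaviors can be sketched numerically. The code below assumes a policy that conditions only on the previous choice, treats the previous choice as a two-state Markov chain, and obtains its stationary distribution by least squares; the function name and the solver choice are illustrative assumptions, and the stationary distribution is assumed to be unique.

```python
import numpy as np

# Task parameters from the Figure 1 legend: G[a-1, a_prev-1] = g_{a a_prev}.
G = np.array([[0.0, 1.0],
              [0.2, 0.4]])


def average_reward(p1_given_prev):
    """Average reward of a policy that conditions on the previous choice.

    p1_given_prev[k] = P(a_t = 1 | a_{t-1} = k + 1).
    Assumes the chain of previous choices has a unique stationary distribution.
    """
    q = np.asarray(p1_given_prev, dtype=float)
    # policy[a, a'] = P(a_t = a+1 | a_{t-1} = a'+1); it is also the
    # column-stochastic transition matrix of the previous-choice chain.
    policy = np.vstack([q, 1.0 - q])
    # Solve (policy - I) pi = 0 together with sum(pi) = 1.
    A = np.vstack([policy - np.eye(2), np.ones((1, 2))])
    b = np.array([0.0, 0.0, 1.0])
    pi = np.linalg.lstsq(A, b, rcond=None)[0]
    # <r> = sum_{a, a'} g_{a a'} P(a | a') P(a')
    return float(np.sum((G * policy) @ pi))


alternating = average_reward([0.0, 1.0])          # pattern 1212...
best_independent = average_reward([0.25, 0.25])   # ignores the previous choice
print(alternating, best_independent)
```

The alternating policy is the special case in which option 1 is never repeated and always follows option 2, so the same function covers both the independent and the history-dependent behaviors discussed above.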

To make the argument as general as possible, we include the case where the subject may receive sensory signals $\boldsymbol{\sigma}_t$ before making a choice $a_t$ at time $t$. Then, in a given task, the external and internal information available to the subject at time $t$ consists of the histories of sensory signals, the subject's past choices and the past returns: $\mathbf{H}_t = (\boldsymbol{\sigma}_t, r_{t-1}, a_{t-1}, \boldsymbol{\sigma}_{t-1}, r_{t-2}, a_{t-2}, \boldsymbol{\sigma}_{t-2}, \dots)$. A decision-making task specifies the conditional probability distribution $P(\boldsymbol{\sigma}_{t+1}, r_t|a_t, \mathbf{H}_t)$. In contrast, the general rule to determine the subject's choice behavior is described by the conditional probability distribution $P(a_t|\mathbf{H}_t)$. The problem is how to explore an optimal behavioral policy $\hat{P}(a_t|\mathbf{H}_t)$ to maximize the average reward $\langle r\rangle$ in a given task.

In practice, however, it is difficult to optimize the dependence of $P(a_t|\mathbf{H}_t)$ on the whole history $\mathbf{H}_t$. Hence, the subject's decision system may extract partial information $\mathbf{s}_t$ from $\mathbf{H}_t$, and restrict the behavioral policy as

$$P(a_t|\mathbf{H}_t) = P(a_t|\mathbf{s}_t). \tag{4}$$

We may call the above $\mathbf{s}_t$ “state variables”. We assume that the decision system controls the definition of state $\mathbf{s}_t$, $\mathbf{H}_t \mapsto \mathbf{s}_t$, and $P(a_t|\mathbf{s}_t)$. In order to maximize the average reward, the decision system has to adopt an appropriate definition of state with which an optimal behavioral policy $\hat{P}(a_t|\mathbf{H}_t)$ satisfies Eq. 4. It has been proved [24] that if a map $\mathbf{H}_t \mapsto \mathbf{s}_t$ satisfies

$$P(\mathbf{s}_{t+1}, r_t|a_t, \mathbf{H}_t) = P(\mathbf{s}_{t+1}, r_t|a_t, \mathbf{s}_t) \tag{5}$$

for a given task, then the maximal average reward can be obtained by a behavioral policy that satisfies Eq. 4. The average reward obtained by an arbitrary choice sequence can be expressed by $P(\mathbf{s}_{t+1}, r_t|a_t, \mathbf{s}_t)$ that satisfies Eq. 5 and does not depend on the variables that are not reflected in $\mathbf{s}$. Therefore, a state $\mathbf{s}$ that satisfies Eq. 5 represents crucial information about reward delivery in that task. The above theorem means that the optimal policy $\hat{P}(a_t|\mathbf{H}_t)$ depends only on the crucial information. Hereafter, we may say that a definition of state variables, $\mathbf{H}_t \mapsto \mathbf{s}_t$, is correct if and only if $\mathbf{s}_t$ satisfies Eq. 5. Note that the selection of the correct definition may not be unique.

Suppose that the decision system adopts a certain definition of state variables, $\mathbf{H}_t \mapsto \mathbf{s}_t$. Let $p_{a\mathbf{s}} = P(a_t = a|\mathbf{s}_t = \mathbf{s})$ be the choice probability with which the decision system in state $\mathbf{s}$ chooses option $a$. Each state-dependent choice probability is determined as a function of the synaptic weights, $p_{a\mathbf{s}}(\mathbf{w})$. In order to explore all possible patterns of state-dependent choice probabilities smoothly, we assume that an arbitrary pattern of $\{p_{a\mathbf{s}}\}$ and an arbitrary direction of infinitesimal changes $\{dp_{a\mathbf{s}}\}$ allowed in the space of probabilities can be expressed by some pattern of $\mathbf{w}$ and some direction of infinitesimal changes $d\mathbf{w}$, respectively (see Methods).

Taking the state dependence into account, the average reward is written as $\langle r\rangle = \sum_{\mathbf{s}}\sum_a \langle r|a,\mathbf{s}\rangle\, p_{a\mathbf{s}}(\mathbf{w})\, P(\mathbf{s})$, where $\langle r|a,\mathbf{s}\rangle$ is the average reward conditioned on choice of option $a$ in state $\mathbf{s}$, and $P(\mathbf{s})$ is the distribution of the states that the subject has visited over sufficiently many decision trials with fixed $\{p_{a\mathbf{s}}(\mathbf{w})\}$. The stationary condition for reward maximization $\partial\langle r\rangle/\partial w_j = 0$ is written as

$$\sum_{\mathbf{s}}\sum_a \left[ \langle r|a,\mathbf{s}\rangle \frac{\partial p_{a\mathbf{s}}(\mathbf{w})}{\partial w_j} P(\mathbf{s}) + \langle r|a,\mathbf{s}\rangle\, p_{a\mathbf{s}}(\mathbf{w}) \frac{\partial P(\mathbf{s})}{\partial w_j} + \frac{\partial \langle r|a,\mathbf{s}\rangle}{\partial w_j}\, p_{a\mathbf{s}}(\mathbf{w})\, P(\mathbf{s}) \right] = 0. \tag{6}$$

The maximizing strategy attempts to achieve Eq. 6 taking the whole dependence on $\mathbf{w}$ into account. In contrast, as in the previous case, the matching strategy ignores the dependence of the expected outcome of the current choice on $\mathbf{w}$ in aiming for the stationary condition. The outcome in the present case consists of the return $r_t$ and the next state $\mathbf{s}_{t+1}$. Therefore, the matching strategy ignores the dependence of $P(\mathbf{s}_{t+1}, r_t|a_t, \mathbf{s}_t)$ on $\mathbf{w}$, and hence ignores $\partial\langle r|a,\mathbf{s}\rangle/\partial w_j$ and $\partial P(\mathbf{s}'|a,\mathbf{s})/\partial w_j$, where $P(\mathbf{s}'|a,\mathbf{s}) \equiv P(\mathbf{s}_{t+1} = \mathbf{s}'|a_t = a, \mathbf{s}_t = \mathbf{s})$. By transforming the second term repetitively with the recursive relation $P(\mathbf{s}') = \sum_{\mathbf{s},a} P(\mathbf{s}'|a,\mathbf{s})\, p_{a\mathbf{s}}(\mathbf{w})\, P(\mathbf{s})$ and by setting $\partial\langle r|a,\mathbf{s}\rangle/\partial w_j = \partial P(\mathbf{s}'|a,\mathbf{s})/\partial w_j = 0$, we obtain the stationary condition of the matching strategy (Supporting Text S2): (7)

Note that the terms omitted in the matching strategy differ for different definitions of the state. Then, using Eq. 7 and the probability conservation, we can extend the matching law to the case of state-dependent choice behaviors (Supporting Text S2): (8)

The extended matching law given as Eq. 8 depends also on the definition of the state.

We schematically illustrate the relationships between the maximizing and matching strategies with correct and incorrect definitions of the state variables (Figure 3A). The horizontal plane represents the multi-dimensional space of arbitrary choice behaviors. Defining state variables restricts the state-dependent choice behavior to a certain subspace. If state variables are correctly defined to satisfy Eq. 5, the subspace (red curve) includes the optimal choice behavior (red circle). The conditional probability $P(\mathbf{s}_{t+1}, r_t|a_t, \mathbf{s}_t)$ takes a fixed value specified by the task, which is actually independent of $\mathbf{w}$. Therefore, the matching strategy coincides with the maximizing strategy, which indeed earns the globally maximal average reward (red triangle) unless the choice behavior is trapped by a local stationary point. In contrast, if an incorrect definition of state variables is chosen, the set of generable choice behaviors (blue curve) does not necessarily include the optimal choice behavior. Therefore, the maximizing strategy can lead only to the best choice behavior (blue triangle) within the restricted set. The conditional probability $P(\mathbf{s}_{t+1}, r_t|a_t, \mathbf{s}_t)$ depends on the past choices that are not reflected in state $\mathbf{s}_t$, and hence depends on $\mathbf{w}$. Therefore, the matching strategy (blue cross) in general deviates from the maximizing one (blue triangle).

**Figure 3.** (A) The performance of the matching and maximizing strategies based on correctly (red) or incorrectly (blue) defined state variables is shown schematically. (B) Actor-critic systems (Methods) were trained on a decision task in which the subject's current and most recent two choices, $a_t$, $a_{t-1}$ and $a_{t-2}$, specify the reward probability according to the following task parameters: $g_{111} = 0$, $g_{211} = 0.6$, $g_{121} = 0.9$, $g_{221} = 1$, $g_{112} = 1$, $g_{212} = 0.6$, $g_{122} = 1$, and $g_{222} = 0$ (Methods). Curves and dashed lines display the local temporal averages of the rewards earned by the actor-critic systems and the best average rewards obtainable by the maximizing strategy, respectively, in three cases: no state variable (blue); an imperfect state variable $s_t = a_{t-1}$ (magenta); correct state variables $\mathbf{s}_t = (a_{t-1}, a_{t-2})$ (red).

To explain the above results, we conduct numerical simulations of a simple alternative task in which the reward probability is given as a function of the current and most recent two choices ($a_t$, $a_{t-1}$, $a_{t-2}$) (see Methods). A correct definition of state variables for making choice $a_t$ is $\mathbf{s}_t = (a_{t-1}, a_{t-2})$. An actor-critic system (see Methods) operating on the correct state variables earns the globally maximal average reward (Figure 3B, red dashed line). In contrast, for an incorrectly defined state, such as $s_t = a_{t-1}$ or no state variable, the best average rewards (magenta and blue dashed lines, respectively) are smaller than the globally maximal one, and the average rewards earned by the actor-critic systems operating on the incorrect state variables (magenta and blue curves) are smaller still.
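The globally maximal average reward for this task can be estimated by brute force. The sketch below enumerates the 16 deterministic policies on the state $(a_{t-1}, a_{t-2})$ and evaluates the reward averaged over the cycle each policy eventually enters. Restricting the search to deterministic policies is an assumption, relying on the standard result that a finite Markov decision process with the average-reward criterion admits a deterministic stationary optimal policy. The function names are illustrative, and the code does not reproduce the restricted-state optima shown as magenta and blue dashed lines in Figure 3B.

```python
from itertools import product

# Task parameters from the Figure 3B legend, keyed as (a_t, a_{t-1}, a_{t-2}).
G3 = {
    (1, 1, 1): 0.0, (2, 1, 1): 0.6, (1, 2, 1): 0.9, (2, 2, 1): 1.0,
    (1, 1, 2): 1.0, (2, 1, 2): 0.6, (1, 2, 2): 1.0, (2, 2, 2): 0.0,
}
STATES = [(1, 1), (1, 2), (2, 1), (2, 2)]  # s_t = (a_{t-1}, a_{t-2})


def cycle_average(policy, start):
    """Average expected reward over the cycle reached from `start`.

    `policy` maps each state to a deterministic choice (1 or 2).
    """
    state = start
    first_visit = {}
    rewards = []
    while state not in first_visit:
        first_visit[state] = len(rewards)
        choice = policy[state]
        rewards.append(G3[(choice, state[0], state[1])])
        state = (choice, state[0])  # the new choice becomes a_{t-1}
    cycle = rewards[first_visit[state]:]
    return sum(cycle) / len(cycle)


def best_deterministic_policy():
    """Search all deterministic policies and start states for the best cycle."""
    best_value, best_policy = -1.0, None
    for choices in product((1, 2), repeat=len(STATES)):
        policy = dict(zip(STATES, choices))
        for start in STATES:
            value = cycle_average(policy, start)
            if value > best_value:
                best_value, best_policy = value, policy
    return best_value, best_policy


best_value, best_policy = best_deterministic_policy()
print(best_value, best_policy)
```

Under these assumptions the search finds an average reward of about 0.9, reached by the repeating choice pattern 2, 2, 1, 1, which uses both of the two most recent choices and therefore cannot be generated with $s_t = a_{t-1}$ alone.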

Thus, the matching strategy is as efficient as the maximizing one if they are combined with a mechanism to explore and select a correct definition of state variables. However, the matching strategy in general deviates from the maximizing one for the choice behaviors restricted by an incorrect definition of state variables.

## Discussion

How subjects decide behavioral responses based on their experience and reward expectancy is a current topic in neuroscience. In particular, which choice behavior, matching or maximizing, is more fundamental in decision making has long been debated. The relationship between matching and maximizing behaviors has often been discussed in the restricted case where every choice is independent of the past choices. For instance, Loewenstein and Seung [18] recently proved for independent choice behaviors that the maximizing behavior is achieved by synaptic learning rules that cancel out the infinite sum of the covariances between the current return and all of the current and past decision-related neural activities, and that the matching behavior appears when only the first term in the sum, i.e., the covariance between the current return and current decision-related neural activity, vanishes. This relationship corresponds to the relationship between Eqs. 2 and 3 when the choice probabilities are described as (Supporting Text S1). This study has further extended their results to derive a more general statement: any attempt to achieve the stationary condition for reward maximization results in matching behavior if it ignores the influence of the past choices on the expected outcome. This result depends on neither a specific learning algorithm nor a specific reward schedule.

Most importantly, we have clarified the general relationship between matching and maximizing strategies among all the possible choice behaviors. We have proved that the matching strategy can lead to the optimal choice behavior when the subject's decision system correctly discovers the information sources sufficient to specify the expected outcome, and can utilize the information through state variables. Differences between the matching and maximizing strategies can arise when the decision system assigns incorrect information sources to the state variables. Our results for the first time revealed how a strategy to achieve the matching behavior is beneficial to reward maximization, and how the ignorance of the relevant information leads to the matching behavior.

The information sources relevant to the expected outcome are task-dependent. In realistic situations, the subject would have no *a priori* knowledge about the probabilistic rule of the outcomes of their behavioral responses. It seems unlikely that the brain easily identifies the relevant information sources from infinitely many combinations of the histories of past sensory inputs, returns and choices. This might explain why the matching law appears so robustly in various animal species and in various decision-making tasks as a result of ignorance of the relevant information sources. In contrast, the matching strategy with the incorrect selection of information sources may replicate various deviations from the matching behavior, such as the under/over-matching observed in various situations [25]–[28]. Our results provide a theoretical framework to investigate the deviations from matching on the basis of selected information sources. How the brain explores the relevant information sources remains open for further studies. Since this ability of the brain is what discriminates it from any existing artificial machine with human-like adaptive behavior, clarifying the underlying mechanism is an exciting challenge in neuroscience and its application to robotics.

## Methods

### Summary of assumptions

Our proof of the matching law (Eq. 3) is valid for a wide class of natural learning rules, including those employing a widely-used soft-max function for choice probabilities (see below). In the following, however, we explicitly describe the assumptions necessary to make our proof mathematically rigorous. For decision-making tasks, we assumed 1) discrete time steps $t$ at which the subject is required to make a decision, 2) a finite number of fixed options ($a = 1, 2, \dots, n$) available to the subject at every time step, and 3) a scalar amount of reward given to the subject at every time step. For the decision system, we required the following assumptions: 4) the decision system can control the definition of state $\mathbf{s}_t$ and the state-dependent choice probabilities $\{p_{a\mathbf{s}}\}$ through a set of synapses $\mathbf{w} = (w_1, w_2, \dots, w_m)$, 5) it adopts a definition of state $\mathbf{s}_t$ with which the number of possible states is finite ($l$), and 6) on a certain definition of state, an arbitrary pattern of possible $\{p_{a\mathbf{s}}\}$ and an arbitrary direction of possible infinitesimal changes $\{dp_{a\mathbf{s}}\}$ can be expressed by some $\mathbf{w}$ and $d\mathbf{w}$, respectively. Assumption 6 requires the following condition: (9) where $\mathbf{q}(\mathbf{w})$ represents the $ln$-dimensional vector function consisting of the state-dependent choice probabilities $\{p_{a\mathbf{s}}(\mathbf{w})\}$, and $J(\mathbf{w})$ is the Jacobian matrix of $\mathbf{q}(\mathbf{w})$: $J_{ij}(\mathbf{w}) = \partial q_i(\mathbf{w})/\partial w_j$. Equation 9 requires $m \ge l(n-1)$. Independent choice behaviors are generated in the case $l = 1$.

### Decision-making task for demonstrations

To examine the performance of the matching and maximizing strategies, we introduced a decision-making task in which reward is given ($r_t = 1$) or not given ($r_t = 0$) to the subject according to the probability determined by the subject's current ($a_t$) and most recent one or two choices ($a_{t-1}$ and $a_{t-2}$). Each choice should be taken from one of two options ($a = 1, 2$), although it is straightforward to extend the present results to more general tasks with more than two options. The conditional expectation value of return on each choice pattern is given as a task parameter: $g_{aa'} \equiv \langle r_t|a_t = a, a_{t-1} = a'\rangle$ or $g_{aa'a''} \equiv \langle r_t|a_t = a, a_{t-1} = a', a_{t-2} = a''\rangle$. The values of these parameters are given in figure legends. For given task parameters $\{g_{aa'a''}\}$, we can calculate the maximum of the average reward $\langle r\rangle = \sum_{a,a',a''} g_{aa'a''}\, p_{aa'a''}\, P(a', a'')$, where $p_{aa'a''}$ is the conditional choice probability $p_{aa'a''} \equiv P(a_t = a|a_{t-1} = a', a_{t-2} = a'')$, and $P(a', a'')$ is the probability distribution $P(a', a'') \equiv P(a_{t-1} = a', a_{t-2} = a'')$ obtained as a solution of the equation $P(a, a') = \sum_{a''} p_{aa'a''}\, P(a', a'')$. The best average rewards obtainable by the restricted choice behaviors with state definition $s_t \equiv a_{t-1}$ and no state variable can be calculated by restricting $p_{aa'a''}$ as $p_{aa'1} = p_{aa'2} = p_{aa'}$ and $p_{a1} = p_{a2} = p_a$, respectively.

### Learning rules for independent choice behaviors

Synapse-updating rules can be described by the change $\Delta w_j$ in $w_j$ at time $t$, $w_j(t+1) = w_j(t) + \Delta w_j(t)$. Melioration[16] proposes to increase the choice probability of the option that has the largest expectation value of return. An implementation of melioration is described as $p_1(\mathbf{w}) = w_0$, $p_2(\mathbf{w}) = 1 - w_0$, $\Delta w_0 = \alpha(w_1 - w_2)$ and , where $\alpha$ is a positive constant, and if $a_t = a$, and otherwise. The average returns $\langle r|1\rangle$ and $\langle r|2\rangle$ are estimated as $w_1$ and $w_2$, and the choice probabilities are determined by $w_0$ updated by the estimated average returns. Local matching[9] is designed to directly achieve the matching law as and . For actor-critic[1], direct actor[23] and Q-learning[1], we used a soft-max function as each choice probability: , where $\beta$ is a positive constant. Individual updating rules are described as and $\Delta u = \alpha(r_t - u)$ (actor-critic), (direct actor) and (Q-learning). The details of the algorithms and the relations to the matching strategy and the covariance rule[19] are discussed in Supporting Text S1.

### Actor-critic model with state variables

An iterative method to achieve Eq. 7 was shown in [29], [30]. Assuming a set of synapses corresponding to individual options in individual states $\{w_{a\mathbf{s}}\}$ and defining the choice probabilities in each state as , we can obtain the stochastic gradient ascent rule for Eq. 7 as $\langle\Delta w_{a\mathbf{s}}\rangle = \lambda\beta\, P(\mathbf{s})\, p_{a\mathbf{s}}\, (Q_{a\mathbf{s}} - V_{\mathbf{s}})$, where $\lambda$ is a positive constant, and and $V_{\mathbf{s}} \equiv \sum_a Q_{a\mathbf{s}}\, p_{a\mathbf{s}}$ represent the relative values of choosing $a$ in state $\mathbf{s}$ (relative action-value) and of state $\mathbf{s}$ (relative state-value), respectively. Using the relations and , we can obtain the actor-critic model as an implementation of the matching strategy: (10) where if $\mathbf{s}_t = \mathbf{s}$, and otherwise. The variable $u$ estimates the average reward and the variable $\nu_{\mathbf{s}}$ represents the state-value of $\mathbf{s}$ estimated with the temporal difference (TD) error algorithm. While the actor-critic system is usually designed for maximizing a discounted sum of future rewards[1], the updating rule in Eq. 10 was derived to maximize the average reward[24], [29], [30].

### Numerical simulations

In the simulations shown in Figures 2 and 3B, model parameters were set as $\alpha = \lambda\beta = 0.05$, and the initial values of all dynamical variables were set to 1. The value of $\beta$ was set as $\beta = 4$ by default, while it was varied for the Q-learning simulations (Figure 2B). To show the time evolution of reward in Figure 3B, we updated the local average $y$ according to $\Delta y = (r_t - y)/200$ from an initial value of 0.64, which is the average reward obtained with even choice probabilities: $p_1 = p_2 = 0.5$.
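The local average used for Figure 3B is an exponentially weighted running mean, which can be sketched as below. The function name `local_average` and the reward sequence passed to it are illustrative assumptions; the update $\Delta y = (r_t - y)/200$ and the initial value 0.64 come from the text.

```python
def local_average(rewards, y0=0.64, window=200):
    """Return the trace of y under Delta y = (r_t - y) / window."""
    y = y0
    trace = []
    for r in rewards:
        y += (r - y) / window
        trace.append(y)
    return trace


if __name__ == "__main__":
    # Hypothetical input: a constant reward of 1 on every trial.
    trace = local_average([1.0] * 1000)
    print(trace[0], trace[-1])
```

Dividing by 200 makes $y$ forget past rewards with a time constant of roughly 200 trials, so the trace smooths trial-to-trial noise while still tracking slow changes in reward rate.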

## Supporting Information

### Text S1.

Strategies of different learning rules. Several well-known learning algorithms are categorized into the matching, maximizing and other strategies.

https://doi.org/10.1371/journal.pone.0003795.s001

(0.19 MB DOC)

### Text S2.

Matching strategy in state-dependent choice behaviors. The extensions of the stationary condition and the matching law are derived.

https://doi.org/10.1371/journal.pone.0003795.s002

(0.19 MB DOC)

## Author Contributions

Conceived and designed the experiments: YS. Performed the experiments: YS. Wrote the paper: TF.

## References

- 1. Sutton RS, Barto AG (1998) Reinforcement Learning. Cambridge, MA (USA): MIT Press.
- 2. Houk JC, Davis JL, Beiser DG (1994) Models of Information Processing in the Basal Ganglia (Computational Neuroscience). Bradford Books.
- 3. Schultz W (1998) Predictive reward signal of dopamine neurons. J Neurophysiol 80: 1–27.
- 4. Tanaka SC, Doya K, Okada G, Ueda K, Okamoto Y, et al. (2004) Prediction of immediate and future rewards differentially recruits corticobasal ganglia loops. Nature Neurosci 7: 887–893.
- 5. Mazur JE (2005) Learning and Behavior. Prentice Hall.
- 6. Herrnstein RJ (1961) Relative and absolute strength of response as a function of frequency of reinforcement. J Exp Anal Behav 4: 267–272.
- 7. Davison M, McCarthy D (1987) The Matching Law: A Research Review. Lawrence Erlbaum Assoc Inc.
- 8. Herrnstein RJ (1997) The Matching Law: Papers in Psychology and Economics. Cambridge, MA (USA): Harvard Univ Press.
- 9. Sugrue LP, Corrado GS, Newsome WT (2004) Matching behavior and the representation of value in the parietal cortex. Science 304: 1782–1787.
- 10. Heyman GM (1979) A Markov model description of changeover probabilities on concurrent variable-interval schedules. J Exp Anal Behav 31: 41–51.
- 11. Baum WM (1981) Optimization and the matching law as accounts of instrumental behavior. J Exp Anal Behav 36: 387–402.
- 12. Herrnstein RJ, Heyman GM (1979) Is matching compatible with reinforcement maximization on concurrent variable interval, variable ratio? J Exp Anal Behav 31: 209–223.
- 13. Mazur J (1981) Optimization theory fails to predict performance of pigeons in a two-response situation. Science 214: 823–825.
- 14. Vaughan WJ (1981) Melioration, matching, and maximization. J Exp Anal Behav 36: 141–149.
- 15. DeCarlo LT (1985) Matching and maximizing with variable-time schedules. J Exp Anal Behav 43: 75–81.
- 16. Herrnstein RJ, Vaughan WJ (1980) Melioration and behavioral allocation. In: Staddon J, editor. Limits to action: The allocation of individual behavior. New York (USA): Academic Press.
- 17. Corrado GS, Sugrue LP, Newsome WT (2005) Linear-nonlinear-Poisson models of primate choice dynamics. J Exp Anal Behav 84: 581–617.
- 18. Soltani A, Wang X (2006) A biophysically based neural model of matching law behavior: melioration by stochastic synapses. J Neurosci 26: 3731–3744.
- 19. Loewenstein Y, Seung H (2006) Operant matching is a generic outcome of synaptic plasticity based on the covariance between reward and neural activity. Proc Natl Acad Sci 103: 15224–15229.
- 20. Sakai Y, Okamoto H, Fukai T (2006) Computational algorithms and neuronal network models underlying decision processes. Neural Netw 19: 1091–1105.
- 21. Sakai Y, Fukai T (2008) The actor-critic learning is behind the matching law: Matching vs. optimal behaviors. Neural Comput 20: 227–251.
- 22. Shanks DR, Tunney RJ, McCarthy JD (2002) A re-examination of probability matching and rational choice. J Behav Dec Making 15: 233–250.
- 23. Dayan P, Abbott L (2001) Theoretical Neuroscience. Cambridge, MA (USA): MIT Press.
- 24. Bertsekas DP, Tsitsiklis JN (1996) Neuro-Dynamic Programming. Belmont, MA (USA): Athena Scientific.
- 25. Baum WM (1979) Matching, undermatching, and overmatching in studies of choice. J Exp Anal Behav 32: 269–281.
- 26. Davison M, Kerr A (1989) Sensitivity of time allocation to an overall reinforcer rate feedback function in concurrent interval schedules. J Exp Anal Behav 51: 215–231.
- 27. Alsop B, Davison M (1991) Effects of varying stimulus disparity and the reinforcer ratio in concurrent-schedule and signal-detection procedures. J Exp Anal Behav 56: 67–80.
- 28. Davison M, Nevin J (1999) Stimuli, reinforcers, and behavior: an integration. J Exp Anal Behav 71: 439–482.
- 29. Marbach P, Tsitsiklis JN (2001) Simulation-based optimization of Markov reward processes. IEEE Trans Automat Contr 46: 191–209.
- 30. Konda VR, Tsitsiklis JN (2003) On actor-critic algorithms. SIAM J Contr Optim 42: 1143–1166.