When Does Reward Maximization Lead to Matching Law?

What kind of strategy subjects follow in various behavioral circumstances has been a central issue in decision making. In particular, which behavioral strategy, maximizing or matching, is more fundamental to animals' decision behavior has been a matter of debate. Here, we prove that any algorithm that achieves the stationary condition for maximizing the average reward leads to matching when it ignores the dependence of the expected outcome on the subject's past choices. We term this strategy of partial reward maximization the "matching strategy". We then apply this strategy to the case where the subject's decision system updates the information used for making decisions. Such information includes the subject's past actions or sensory stimuli, and the internal storage of this information is often called "state variables". We demonstrate that the matching strategy provides an easy way to maximize reward when combined with the exploration of state variables that correctly represent the crucial information for reward maximization. Our results reveal for the first time how a strategy that achieves matching behavior can be beneficial to reward maximization, providing a novel insight into the relationship between maximizing and matching.


Introduction
How do animals, including humans, determine appropriate behavioral responses when their behavioral outcomes are uncertain? Decision-making is a fundamental process of the brain for organizing behaviors, and depends crucially on how subjects have been rewarded for their past behavioral responses. The mechanisms of reward-driven learning have been studied extensively, both theoretically and experimentally. A well-known example is the reinforcement learning theory based on the temporal difference (TD) error algorithm [1], which is powerful enough to solve difficult problems in machine control and accounts for the basal-ganglia activity representing reward expectancy in monkeys and humans [2][3][4]. It is generally considered that subjects attempt to choose a behavioral policy that will maximize the amount of reward under a given environmental condition [5]. In addition, many algorithms in machine learning and other brain-style computations aim at reward maximization or, somewhat more generally, optimization of a given cost function.
Nevertheless, animals often exhibit matching behavior in a variety of decision-making tasks [6][7][8][9], even if such behavior does not necessarily maximize reward. The matching law states that the frequency of choosing an option is proportional to the amount of past reward obtained from that option [6]: N_a/(N_1 + N_2 + … + N_n) = I_a/(I_1 + I_2 + … + I_n), where N_a (a = 1, …, n) represents the number of times option a has been chosen and I_a the total amount of income obtained from that option. A typical example showing this law is the alternative choice task, in which subjects have to choose one of two options that may be rewarded at different average rates. Matching and maximizing are mathematically equivalent in simple tasks [10,11], but not in arbitrary tasks [12][13][14][15].
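The fraction form of the matching law above is equivalent to saying that every chosen option yields the same income per choice. As a quick sanity check, here is a minimal sketch in Python with made-up counts N_a and incomes I_a (not data from the paper):

```python
# Check the matching law in its two equivalent forms, using
# hypothetical choice counts N and incomes I (illustrative values only).
def fractions(xs):
    total = sum(xs)
    return [x / total for x in xs]

# Counts and incomes constructed so that every option yields the same
# return per choice (I_a / N_a = 0.3), i.e. matching holds.
N = [100, 50, 250]
I = [30.0, 15.0, 75.0]

choice_fracs = fractions(N)   # N_a / (N_1 + ... + N_n)
income_fracs = fractions(I)   # I_a / (I_1 + ... + I_n)

# The matching law: the two fraction lists coincide ...
matching = all(abs(c - i) < 1e-12 for c, i in zip(choice_fracs, income_fracs))
# ... exactly when the per-choice returns I_a / N_a are equal.
returns = [i / n for i, n in zip(I, N)]
print(matching, returns)
```

Equal per-choice returns is the form of the matching law used throughout the Results section.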
Decision-making models that reproduce the matching behavior have been proposed [9,16,17], and recent computational studies have pointed out possible origins of matching behavior in biological neural systems [18,19]. For instance, a recent model proposed that the matching law results from a covariance learning rule in synaptic plasticity [19]. In addition, we previously demonstrated that the matching law emerges in a class of reinforcement learning systems including the actor-critic [20,21], which has been widely used in engineering applications. However, whether matching and maximizing share a common computational principle and whether matching behavior is beneficial to decision making remain unclear. In this study, we propose a view that unifies matching behavior into the general computational framework of reward maximization.

Results
We first prove that partial maximization of reward leads to matching behavior irrespective of the mathematical algorithm used for this computation. A crucial step is to define "the matching strategy" that plays a central role in the present study. We then demonstrate how the matching strategy substitutes for the maximizing strategy in a decision-making task that is difficult to solve, when matching is combined with an appropriate utilization of available information sources.

Matching as a Sub-optimal Maximizing Strategy in Independent Choice Behaviors
The analysis is easier if we express the matching law as follows [8]:

⟨r|a⟩ = ⟨r⟩, for every option a that is chosen sufficiently many times, (1)

where ⟨r⟩ is the average reward per choice from all options and ⟨r|a⟩ the average reward conditioned on choice of option a. We can derive the above expression from the relationship I_a ≈ ⟨r|a⟩N_a. Thus, the matching law equalizes the expected returns on all the options that are chosen sufficiently many times. Note that the matching law should not be confused with "probability matching" [22], which states that the frequency of choosing option a is proportional to ⟨r|a⟩ rather than I_a. Probability matching is typically observed in a task in which each expected return ⟨r|a⟩ is fixed and independent of the subject's behavior (i.e., concurrent variable-ratio schedules). In such a simple task, the maximizing behavior satisfies the matching law, but not probability matching. Hereafter, we focus on the matching law. Moreover, we consider the case where subjects make choices at fixed intervals. We can employ discrete time steps without much loss of generality, since the framework describes a free-response task on continuous time if the interval is sufficiently short and choosing nothing is an available option.
We analyze the outcome of the decision process without specifying the details of the neural decision system. To this end, we assume a set of 'synapses' w = (w_1, w_2, …, w_m) that determines the behavioral policy for making decisions. These variables are often called "policy parameters" in mathematical models of decision making. Then, the probability of choosing option a is given as a function p_a(w) of the synaptic weights. To ensure a smooth search for an optimal set of choice probabilities, we require that arbitrary infinitesimal changes of {p_a(w)} allowed in the space of choice probabilities can be caused by some set of infinitesimal changes {dw_j}.
With the above definitions, we can describe the average reward per choice as ⟨r⟩ = Σ_{a=1}^{n} ⟨r|a⟩ p_a(w). Many decision-making algorithms attempt to maximize ⟨r⟩ by modifying behavioral outputs. Whatever algorithm is used, the synaptic weights that maximize ⟨r⟩ should satisfy the stationary condition ∂⟨r⟩/∂w_j = 0 for arbitrary j, i.e.,

Σ_{a=1}^{n} ⟨r|a⟩ ∂p_a(w)/∂w_j + Σ_{a=1}^{n} (∂⟨r|a⟩/∂w_j) p_a(w) = 0, for ∀j. (2)

The first term contains the explicit dependence of the choice probability on w_j, whereas the second term contains the possible change in ⟨r|a⟩ generated implicitly by the change in the subject's behavioral policy. The conditional expectation value ⟨r|a⟩ is obtained by taking an average over all possible patterns of past choices in which the newest choice is option a. In general, the reward probability depends not only on the current choice, but also on the history of past choices [6,[12][13][14][15]. In such a case, ⟨r|a⟩ depends on the choice probabilities that produced the past choices, and hence depends on w_j.
In order to maximize reward, the brain has to explore the correct dependence of the reward probability on the past choices. It seems, however, difficult to infer this dependence correctly with little knowledge of an accurate model of the environment. In such a difficult situation, the brain may simply omit the second term in Eq. 2 in its practical attempt to maximize reward:

Σ_{a=1}^{n} ⟨r|a⟩ ∂p_a(w)/∂w_j = 0, for ∀j. (3)

Multiplying Eq. 3 by arbitrary variations {dw_j} and taking a summation over j gives Σ_{a=1}^{n} ⟨r|a⟩ dp_a(w) = R · dp(w) = 0, where dp_a(w) ≡ Σ_j (∂p_a/∂w_j) dw_j represents the infinitesimal change caused by {dw_j}, and R ≡ (⟨r|1⟩, ⟨r|2⟩, …, ⟨r|n⟩) and dp(w) ≡ (dp_1(w), dp_2(w), …, dp_n(w)) are vectors in the space of multiple options. If all options have non-vanishing stationary choice probabilities, the probability changes dp(w) may occur in an arbitrary direction that satisfies the probability conservation 1 · dp(w) = d(Σ_{a=1}^{n} p_a(w)) = 0, where 1 ≡ (1, 1, …, 1) is an n-dimensional vector of ones. Therefore, the conditions R · dp(w) = 0 and 1 · dp(w) = 0 can simultaneously be satisfied only by such R that is parallel to 1. If the stationary choice probability vanishes for some option, p_a = 0, we can forbid the changes in this direction (dp_a = 0), and R should have identical components for all the options exhibiting non-zero choice probabilities. These results and Eq. 1 imply that the truncated stationary condition given by Eq. 3 is equivalent to the matching law.
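The geometric argument above can be checked numerically for a concrete parameterization. The sketch below assumes a soft-max policy (the parameterization used later in Methods) and illustrative return values of our own choosing; it verifies that the truncated gradient of Eq. 3 vanishes exactly when all expected returns ⟨r|a⟩ are equal:

```python
import math

# For a soft-max policy p_a(w) = exp(w_a) / sum_a' exp(w_a'), we have
# dp_a/dw_j = p_a (delta_aj - p_j), so the truncated gradient of Eq. 3,
# sum_a <r|a> dp_a/dw_j, reduces to p_j (<r|j> - <r>).  It therefore
# vanishes exactly when all returns with p_a > 0 are equal (matching).
# The weight and return values below are illustrative only.

def softmax(w):
    e = [math.exp(x) for x in w]
    z = sum(e)
    return [x / z for x in e]

def truncated_grad(w, R):
    p = softmax(w)
    r_bar = sum(ra * pa for ra, pa in zip(R, p))   # <r>
    # component j of the truncated gradient: p_j (R_j - <r>)
    return [pj * (rj - r_bar) for pj, rj in zip(p, R)]

w = [0.2, -0.5, 1.3]
equal_R = [0.4, 0.4, 0.4]     # equal expected returns: matching holds
unequal_R = [0.7, 0.4, 0.1]   # unequal returns: gradient is nonzero

g_eq = truncated_grad(w, equal_R)
g_neq = truncated_grad(w, unequal_R)
print(g_eq, g_neq)
```

The gradient is zero for every w when the returns are equal, and nonzero otherwise, mirroring the conclusion that Eq. 3 is equivalent to the matching law.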
Thus, the steady choice behavior exhibits matching when the decision system ignores the influence of the subject's past choices on the expected outcome in aiming for the stationary condition of reward maximization. Hereafter, we call this suboptimal maximization strategy to achieve Eq. 3 the "matching strategy". By contrast, we call the strategy that directly solves Eq. 2 the "maximizing strategy".
To demonstrate the above relationship between the matching and maximizing strategies, we study an alternative choice task (n = 2), in which the expectation value of return on each choice pattern is specified completely by the subject's current choice (a_t) and most recent choice (a_{t−1}) as g_{a_t a_{t−1}} ≡ ⟨r_t|a_t, a_{t−1}⟩ (see Methods). We consider the case where the subject's current choice is independent of its past choices. Hereafter, such decision behavior is called "independent choice behavior". Since p_2(w) = 1 − p_1(w), the subject's decision system controls only the choice probability p_1(w) through w, and makes every choice with probability p_1(w). Then the average return on the current choice ⟨r_t|a_t⟩ is obtained by averaging g_{a_t a_{t−1}} over the possible patterns a_{t−1} = 1, 2 as ⟨r_t|a_t⟩ = g_{a_t 1} p_1(w) + g_{a_t 2}(1 − p_1(w)), and hence depends on w through the choice probability p_1(w). Since ∂⟨r_t|a_t⟩/∂w_j ≠ 0, the matching strategy does not maximize reward in this task. Actually, it gives ⟨r⟩ = 0.25 whereas the maximizing strategy yields ⟨r⟩ = 0.45 (Figure 1).
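The gap between the two strategies for independent choice behavior can be reproduced with any sufficiently asymmetric parameter set. Below is a sketch with hypothetical g values (the actual Figure 1 parameters are given in the figure legend, not in the text):

```python
# Matching vs. maximizing for independent choice behavior in a task
# where the return depends on the previous choice:
# <r|a> = g[a][0]*p + g[a][1]*(1 - p), with p the probability of option 1.
# The g values are hypothetical, chosen only to illustrate the gap.
g = [[0.2, 0.9],   # returns for option 1 after previous choice 1, 2
     [0.7, 0.1]]   # returns for option 2 after previous choice 1, 2

def returns(p):
    return [g[a][0] * p + g[a][1] * (1 - p) for a in (0, 1)]

def avg_reward(p):
    R = returns(p)
    return p * R[0] + (1 - p) * R[1]

# Matching point: solve <r|1> = <r|2>, a linear equation in p.
num = g[1][1] - g[0][1]
den = (g[0][0] - g[1][0]) + (g[1][1] - g[0][1])
p_match = num / den

# Maximizing point: <r>(p) is quadratic in p; locate it on a fine grid.
p_max = max((i / 10000 for i in range(10001)), key=avg_reward)

r_match, r_max = avg_reward(p_match), avg_reward(p_max)
print(p_match, r_match, p_max, r_max)
```

The matching point equalizes the two conditional returns, but the quadratic ⟨r⟩(p) peaks at a different p, so the maximizing strategy earns strictly more, as in Figure 1.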
The matching strategy enables us to derive a variety of learning rules that lead to matching behavior (Supporting Text S1). For instance, this category of learning rules includes the well-known actor-critic in reinforcement learning theory [1,20,21], the direct actor [23], melioration [16] and local matching [9]. In particular, the actor-critic and direct actor also belong to the covariance rule [19]. We numerically solved the decision task analyzed in Figure 1 to show that all these learning algorithms generate matching behavior (Figure 2A). By contrast, the indirect actor [23] does not exhibit matching in the steady behavior (Figure 2B). The indirect actor belongs to Q-learning without state variables [1] (see below for the state variables). Since Q-learning determines the choice probabilities by estimating "action values", i.e., the expected returns on the individual options, it does not show matching.

Matching vs. Maximizing over All Possible Choice Behaviors
The quantitative analysis conducted in Figure 1 was restricted to the case where the subject generates independent choice behaviors. It was shown that the maximizing strategy earns more than the matching strategy. However, the average reward ⟨r⟩ = 0.45 achieved by the maximizing strategy in Figure 1 is not the global maximum, but is only the best one among independent choice behaviors. For instance, an alternating choice pattern of 1212…, where the current choice depends on the most recent choice, can earn more (⟨r⟩ = (g_12 + g_21)/2 = 0.6) than the best independent choice behavior in that task. Thus, to produce a better outcome in some situations, the subject is required to make each choice depending on the past choices or other available information. Below, we investigate the relationship between the matching and maximizing strategies, taking all possible choice behaviors into account.
To make the argument as general as possible, we include the case where the subject may receive sensory signals s_t before making a choice a_t at time t. Then, in a given task, the external and internal information available to the subject at time t consists of the histories of sensory signals, the subject's past choices and the past returns: H_t = (s_t, r_{t−1}, a_{t−1}, s_{t−1}, r_{t−2}, a_{t−2}, s_{t−2}, …). A decision-making task specifies the conditional probability distribution P(s_{t+1}, r_t|a_t, H_t). In contrast, the general rule that determines the subject's choice behavior is described by the conditional probability distribution P(a_t|H_t). The problem is how to explore an optimal behavioral policy P(a_t|H_t) to maximize the average reward ⟨r⟩ in a given task.
In practice, however, it is difficult to optimize the dependence of P(a_t|H_t) on the whole history H_t. Hence, the subject's decision system may extract partial information s_t from H_t, and restrict the behavioral policy as

P(a_t|H_t) = P(a_t|s_t). (4)

We may call the above s_t "state variables". We assume that the decision system controls the definition of the state, i.e., the map H_t ↦ s_t, and P(a_t|s_t). In order to maximize the average reward, the decision system has to adopt an appropriate definition of state with which an optimal behavioral policy P(a_t|H_t) satisfies Eq. 4. It has been proved [24] that if a map H_t ↦ s_t satisfies

P(s_{t+1}, r_t|a_t, H_t) = P(s_{t+1}, r_t|a_t, s_t) (5)

for a given task, then the maximal average reward can be obtained by a behavioral policy that satisfies Eq. 4. The average reward obtained by an arbitrary choice sequence can be expressed by P(s_{t+1}, r_t|a_t, s_t) that satisfies Eq. 5 and does not depend on the variables that are not reflected in s. Therefore, a state s that satisfies Eq. 5 represents the crucial information about reward delivery in that task. The above theorem means that the optimal policy P(a_t|H_t) depends on only this crucial information. Hereafter, we say that a definition of state variables, H_t ↦ s_t, is correct if and only if s_t satisfies Eq. 5. Note that the correct definition may not be unique. Suppose that the decision system adopts a certain definition of state variables, H_t ↦ s_t. Let p_as = P(a_t = a|s_t = s) be the choice probability with which the decision system in state s chooses option a. Each state-dependent choice probability is determined as a function of the synaptic weights, p_as(w). In order to explore all possible patterns of state-dependent choice probabilities smoothly, we assume that an arbitrary pattern of {p_as} and an arbitrary direction of infinitesimal changes {dp_as} allowed in the space of probabilities can be expressed by some pattern of w and some direction of infinitesimal changes dw, respectively (see Methods).
Taking the state dependence into account, the average reward is written as ⟨r⟩ = Σ_s Σ_a ⟨r|a, s⟩ p_as(w) P(s), where ⟨r|a, s⟩ is the average reward conditioned on choice of option a in state s, and P(s) is the distribution of the states that the subject has visited over sufficiently many decision trials with fixed {p_as(w)}. The stationary condition for reward maximization, ∂⟨r⟩/∂w_j = 0, is then written as

Σ_{s,a} ⟨r|a, s⟩ (∂p_as(w)/∂w_j) P(s) + Σ_{s,a} ⟨r|a, s⟩ p_as(w) ∂P(s)/∂w_j + Σ_{s,a} (∂⟨r|a, s⟩/∂w_j) p_as(w) P(s) = 0, for ∀j. (6)

The maximizing strategy attempts to achieve Eq. 6 taking the whole dependence on w into account. In contrast, as in the previous case, the matching strategy ignores the dependence of the expected outcome of the current choice on w in aiming for the stationary condition. The outcome in the present case consists of the return r_t and the next state s_{t+1}. Therefore, the matching strategy ignores the dependence of P(s_{t+1}, r_t|a_t, s_t) on w, and hence ignores ∂⟨r|a, s⟩/∂w_j and ∂P(s′|a, s)/∂w_j, where P(s′|a, s) ≡ P(s_{t+1} = s′|a_t = a, s_t = s). By transforming the second term repetitively with the recursive relation P(s′) = Σ_{s,a} P(s′|a, s) p_as(w) P(s) and by setting ∂⟨r|a, s⟩/∂w_j = ∂P(s′|a, s)/∂w_j = 0, we obtain the stationary condition of the matching strategy (Supporting Text S2):

Σ_{s,a} (∂p_as(w)/∂w_j) P(s) lim_{T→∞} Σ_{τ=0}^{T} (⟨r_{t+τ}|a_t = a, s_t = s⟩ − ⟨r⟩) = 0, for ∀j. (7)

Note that the terms omitted in the matching strategy differ for different definitions of the state. Then, using Eq. 7 and the probability conservation, we can extend the matching law to the case of state-dependent choice behaviors (Supporting Text S2):

lim_{T→∞} Σ_{τ=0}^{T} (⟨r_{t+τ}|a_t = a, s_t = s⟩ − ⟨r_{t+τ}|s_t = s⟩) = 0, for ∀a, s. (8)

The extended matching law given as Eq. 8 depends also on the definition of the state.
We schematically illustrate the relationships between the maximizing and matching strategies with correct and incorrect definitions of the state variables (Figure 3A). The horizontal plane represents the multi-dimensional space of arbitrary choice behaviors. Defining state variables restricts the state-dependent choice behavior to a certain subspace. If the state variables are correctly defined so as to satisfy Eq. 5, the subspace (red curve) includes the optimal choice behavior (red circle). The conditional probability P(s_{t+1}, r_t|a_t, s_t) takes a fixed value specified by the task, which is independent of w. Therefore, the matching strategy coincides with the maximizing strategy, which indeed earns the globally maximal average reward (red triangle) unless the choice behavior is trapped by a local stationary point. In contrast, if an incorrect definition of state variables is chosen, the set of generable choice behaviors (blue curve) does not necessarily include the optimal choice behavior. Therefore, the maximizing strategy can lead only to the best choice behavior (blue triangle) within the restricted set. The conditional probability P(s_{t+1}, r_t|a_t, s_t) then depends on the past choices that are not reflected in the state s_t, and hence depends on w. Therefore, the matching strategy (blue cross) in general deviates from the maximizing one (blue triangle).
To illustrate the above results, we conducted numerical simulations of a simple alternative choice task in which the reward probability is given as a function of the current and most recent two choices (a_t, a_{t−1}, a_{t−2}) (see Methods). A correct definition of state variables for making choice a_t is s_t = (a_{t−1}, a_{t−2}). An actor-critic system (see Methods) operating on the correct state variables earns the globally maximal average reward (Figure 3B, red dashed line). In contrast, for an incorrectly defined state, such as s_t = a_{t−1} or no state variable, the best average rewards (magenta and blue dashed lines, respectively) are smaller than the globally maximal one, and the average rewards earned by the actor-critic systems operating on the incorrect state variables (magenta and blue curves) are smaller still.
Thus, the matching strategy is as efficient as the maximizing one if they are combined with a mechanism to explore and select a correct definition of state variables. However, the matching strategy in general deviates from the maximizing one for the choice behaviors restricted by an incorrect definition of state variables.

Discussion
How subjects decide behavioral responses based on their experience and reward expectancy is a current topic in neuroscience. In particular, which choice behavior, matching or maximizing, is more fundamental in decision making has long been debated. The relationship between matching and maximizing behaviors has often been discussed in the restricted case where every choice is independent of the past choices. For instance, Loewenstein and Seung [18] recently proved for independent choice behaviors that the maximizing behavior is achieved by synaptic learning rules that cancel out the infinite sum of the covariances between the current return and all of the current and past decision-related neural activities, and that the matching behavior appears when only the first term in the sum, i.e., the covariance between the current return and the current decision-related neural activity, vanishes. This relationship corresponds to the relationship between Eqs. 2 and 3 when the choice probabilities are described as p_a(w) = e^{βw_a}/Σ_{a′} e^{βw_{a′}} (Supporting Text S1). This study has further extended their results to derive a more general statement: any attempt to achieve the stationary condition for reward maximization results in matching behavior if it ignores the influence of the past choices on the expected outcome. This result depends on neither a specific learning algorithm nor a specific reward schedule.
Most importantly, we have clarified the general relationship between the matching and maximizing strategies among all possible choice behaviors. We have proved that the matching strategy can lead to the optimal choice behavior when the subject's decision system correctly discovers the information sources sufficient to specify the expected outcome, and can utilize this information through state variables. Differences between the matching and maximizing strategies can arise when the decision system assigns incorrect information sources to the state variables. Our results reveal for the first time how a strategy that achieves the matching behavior is beneficial to reward maximization, and how ignorance of the relevant information leads to the matching behavior.
The information sources relevant to the expected outcome are task-dependent. In realistic situations, the subject would have no a priori knowledge about the probabilistic rule governing the outcomes of its behavioral responses. It seems unlikely that the brain easily identifies the relevant information sources from infinitely many combinations of the histories of past sensory inputs, returns and choices. This might explain why the matching law appears so robustly, in various animal species and in various decision-making tasks, as a result of ignorance of the relevant information sources. In contrast, the matching strategy with an incorrect selection of information sources may replicate various deviations from matching behavior, such as the under- and over-matching observed in various situations [25][26][27][28]. Our results thus provide a theoretical framework for investigating deviations from matching on the basis of the selected information sources. How the brain explores the relevant information sources remains open for further studies. Since this ability of the brain is what discriminates it from any existing artificial machine with human-like adaptive behavior, clarifying the underlying mechanism is an exciting challenge in neuroscience and its application to robotics.

Summary of assumptions
Our proof of the matching law (Eq. 3) is valid for a wide class of natural learning rules, including those employing the widely-used soft-max function for choice probabilities (see below). In the following, however, we explicitly describe the assumptions necessary to make our proof mathematically rigorous. For decision-making tasks, we assumed 1) discrete time steps t at which the subject is required to make a decision, 2) a finite number of fixed options (a = 1, 2, …, n) available to the subject at every time step, and 3) a scalar amount of reward given to the subject at every time step. For the decision system, we required the following assumptions: 4) the decision system can control the definition of the state s_t and the state-dependent choice probabilities {p_as} through a set of synapses w = (w_1, w_2, …, w_m), 5) it adopts a definition of state s_t with which the number of possible states is finite (l), and 6) on a given definition of state, an arbitrary pattern of possible {p_as} and an arbitrary direction of possible infinitesimal changes {dp_as} can be expressed by some w and dw, respectively.

[Figure 3. (A) The performance of the matching and maximizing strategies based on correctly (red) or incorrectly (blue) defined state variables is shown schematically. (B) Actor-critic systems (Methods) were trained on a decision task in which the subject's current and most recent two choices, a_t, a_{t−1} and a_{t−2}, specify the reward probability according to the following task parameters: g_111 = 0, g_211 = 0.6, g_121 = 0.9, g_221 = 1, g_112 = 1, g_212 = 0.6, g_122 = 1, and g_222 = 0 (Methods). Curves and dashed lines display the local temporal averages of the rewards earned by the actor-critic systems and the best average rewards obtainable by the maximizing strategy, respectively, in three cases: no state variable (blue); an imperfect state variable s_t = a_{t−1} (magenta); correct state variables s_t = (a_{t−1}, a_{t−2}) (red). doi:10.1371/journal.pone.0003795.g003]
Assumption 6 requires the following condition:

for ∀{y_as} such that y_as > 0 ∀a, s and Σ_{a=1}^{n} y_as = 1 ∀s, there exists w such that y_as = p_as(w) ∀a, s; moreover, J(w) exists and rank J(w) = l(n − 1), (9)

where q(w) represents the ln-dimensional vector function consisting of the state-dependent choice probabilities {p_as(w)}, and J(w) is the Jacobian matrix of q(w): J_ij(w) = ∂q_i(w)/∂w_j. Equation 9 requires m ≥ l(n − 1). Independent choice behaviors are generated in the case l = 1.

Decision-making task for demonstrations
To examine the performance of the matching and maximizing strategies, we introduced a decision-making task in which reward is given (r_t = 1) or not given (r_t = 0) to the subject according to a probability determined by the subject's current (a_t) and most recent one or two choices (a_{t−1} and a_{t−2}). Each choice is taken from one of two options (a = 1, 2), although it is straightforward to extend the present results to more general tasks with more than two options. The conditional expectation value of return on each choice pattern is given as a task parameter: g_{a_t a_{t−1}} ≡ ⟨r_t|a_t, a_{t−1}⟩ or g_{a_t a_{t−1} a_{t−2}} ≡ ⟨r_t|a_t, a_{t−1}, a_{t−2}⟩. The values of these parameters are given in the figure legends. For given task parameters {g_{a_t a_{t−1} a_{t−2}}}, we can calculate the maximum of the average reward ⟨r⟩ = Σ_{a,a′,a″} g_{aa′a″} p_{aa′a″} P(a′, a″), where p_{aa′a″} is the conditional choice probability p_{aa′a″} ≡ P(a_t = a|a_{t−1} = a′, a_{t−2} = a″), and P(a′, a″) is the probability distribution P(a′, a″) ≡ P(a_{t−1} = a′, a_{t−2} = a″) obtained as a solution of the equation P(a, a′) = Σ_{a″} p_{aa′a″} P(a′, a″). The best average rewards obtainable by the restricted choice behaviors with state definition s_t ≡ a_{t−1} and no state variable can be calculated by restricting p_{aa′a″} as p_{aa′1} = p_{aa′2} = p_{aa′} and p_{a1} = p_{a2} = p_a, respectively.
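For the Figure 3 task (parameters taken from its legend), the globally maximal average reward can also be found by brute force: with the correct state s_t = (a_{t−1}, a_{t−2}) the task is a finite average-reward problem, so some deterministic policy attains the optimum. A sketch, with options coded 0/1 instead of 1/2:

```python
from itertools import product

# Global maximum of the average reward for the Figure 3 task, by brute
# force over deterministic policies on the correct state (a_{t-1}, a_{t-2}).
# Task parameters g[(a, a1, a2)] = <r_t | a_t=a, a_{t-1}=a1, a_{t-2}=a2>,
# from the Figure 3 legend with options recoded 1 -> 0, 2 -> 1.
g = {(0,0,0): 0.0, (1,0,0): 0.6, (0,1,0): 0.9, (1,1,0): 1.0,
     (0,0,1): 1.0, (1,0,1): 0.6, (0,1,1): 1.0, (1,1,1): 0.0}

states = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1)]

def average_reward(policy):
    # policy maps state -> action; follow the deterministic dynamics
    # (next state is (a_t, a_{t-1})), discard a burn-in, then average
    # over 1200 steps (divisible by every possible cycle length <= 4).
    best = 0.0
    for start in states:          # max over starting states
        s = start
        for _ in range(100):      # burn-in: settle into the attracting cycle
            s = (policy[s], s[0])
        total = 0.0
        for _ in range(1200):
            a = policy[s]
            total += g[(a, s[0], s[1])]
            s = (a, s[0])
        best = max(best, total / 1200)
    return best

best_reward = max(average_reward(dict(zip(states, acts)))
                  for acts in product((0, 1), repeat=4))
print(best_reward)
```

The optimal behavior turns out to be a cycle over all four states, confirming that a memory of the two most recent choices is needed to earn the global maximum.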

Learning rules for independent choice behaviors
Synapse-updating rules can be described by the change Δw_j in w_j at time t: w_j(t+1) = w_j(t) + Δw_j(t). Melioration [16] proposes to increase the choice probability of the option that has the largest expectation value of return. An implementation of melioration is described as p_1(w) = w_0, p_2(w) = 1 − w_0, Δw_0 = α(w_1 − w_2) and Δw_a = α δ_{a a_t}(r_t − w_a), where α is a positive constant, and δ_{a a_t} = 1 if a_t = a, and δ_{a a_t} = 0 otherwise. The average returns ⟨r|1⟩ and ⟨r|2⟩ are estimated as w_1 and w_2, and the choice probabilities are determined by w_0, which is updated according to the estimated average returns. Local matching [9] is designed to directly achieve the matching law as p_a(w) = w_a/Σ_{a′=1}^{n} w_{a′} and Δw_a = α δ_{a a_t}(r_t − w_a). For the actor-critic [1], direct actor [23] and Q-learning [1], we used a soft-max function for each choice probability: p_a(w) = e^{βw_a}/Σ_{a′=1}^{n} e^{βw_{a′}}, where β is a positive constant. The individual updating rules are described as Δw_a = λβ δ_{a a_t}(r_t − u) and Δu = α(r_t − u) (actor-critic), Δw_a = α r_t(δ_{a a_t} − p_a) (direct actor) and Δw_a = α δ_{a a_t}(r_t − w_a) (Q-learning). The details of the algorithms and their relations to the matching strategy and the covariance rule [19] are discussed in Supporting Text S1.
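In expectation, the direct actor update Δw_a = αr_t(δ_{a a_t} − p_a) averages to αp_a(⟨r|a⟩ − ⟨r⟩), so its noise-free dynamics drive the policy toward matching. A sketch of these averaged dynamics on a task where the return depends on the previous choice, with hypothetical g values (not the paper's Figure 1 parameters) and the adiabatic assumption that the previous-choice distribution equals the current policy:

```python
import math

# Expected-update (noise-free) dynamics of the direct actor on a
# two-option task with <r|a> = g[a][0]*p1 + g[a][1]*(1 - p1).
# E[dw_a] = alpha * p_a * (<r|a> - <r>) vanishes only at the matching
# point <r|1> = <r|2>.  The g values below are illustrative only.
g = [[0.1, 0.8],   # returns for option 1 after previous choice 1, 2
     [0.6, 0.2]]   # returns for option 2 after previous choice 1, 2
alpha, beta = 0.05, 4.0
w = [1.0, 1.0]     # initial synaptic weights

def softmax(w):
    e = [math.exp(beta * x) for x in w]
    z = sum(e)
    return [x / z for x in e]

for _ in range(20000):
    p = softmax(w)
    R = [g[a][0] * p[0] + g[a][1] * p[1] for a in (0, 1)]   # <r|a>
    r_bar = R[0] * p[0] + R[1] * p[1]                        # <r>
    w = [wa + alpha * pa * (Ra - r_bar) for wa, pa, Ra in zip(w, p, R)]

p = softmax(w)
R = [g[a][0] * p[0] + g[a][1] * p[1] for a in (0, 1)]
print(p, R)
```

At convergence the two conditional returns are equal, i.e., the steady behavior of the direct actor satisfies the matching law rather than maximizing ⟨r⟩.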

Actor-critic model with state variables
An iterative method to achieve Eq. 7 was given in [29,30]. Assuming a set of synapses corresponding to the individual options in the individual states, {w_as}, and defining the choice probabilities in each state as p_as(w) = e^{βw_as}/Σ_{a′=1}^{n} e^{βw_{a′s}}, we can obtain the stochastic gradient ascent rule for Eq. 7 as ⟨Δw_as⟩ = λβP(s)p_as(Q_as − V_s), where λ is a positive constant, and Q_as ≡ lim_{T→∞} Σ_{τ=0}^{T} (⟨r_{t+τ}|a_t = a, s_t = s⟩ − ⟨r⟩) and V_s ≡ Σ_a Q_as p_as represent the relative values of choosing a in state s (relative action value) and of state s (relative state value), respectively. Using the relations P(s)p_as Q_as = ⟨δ_{s s_t} δ_{a a_t}(r_t − ⟨r⟩ + V_{s_{t+1}})⟩ and P(s)p_as V_s = ⟨δ_{s s_t} δ_{a a_t} V_{s_t}⟩, we can obtain the actor-critic model as an implementation of the matching strategy:

Δv_s = α δ_{s s_t}(r_t − u + v_{s_{t+1}} − v_s), Δw_as = λβ δ_{s s_t} δ_{a a_t}(r_t − u + v_{s_{t+1}} − v_s), (10)

where δ_{s s_t} = 1 if s_t = s, and δ_{s s_t} = 0 otherwise. The variable u estimates the average reward, and the variable v_s represents the state value of s estimated with the temporal difference (TD) error algorithm. While the actor-critic system is usually designed to maximize a discounted sum of future rewards [1], the updating rule in Eq. 10 was derived to maximize the average reward [24,29,30].
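A minimal sketch of the updating rule in Eq. 10, run on the Figure 3 task with the correct state s_t = (a_{t−1}, a_{t−2}). Parameters follow the Numerical simulations section (α = λβ = 0.05, β = 4, initial values 1); the 0/1 option coding, the random seed and the run length are our own choices:

```python
import math, random

# Actor-critic of Eq. 10 on the Figure 3 task with the correct state
# s_t = (a_{t-1}, a_{t-2}); g values are from the Figure 3 legend with
# options recoded 1 -> 0, 2 -> 1.
random.seed(0)
g = {(0,0,0): 0.0, (1,0,0): 0.6, (0,1,0): 0.9, (1,1,0): 1.0,
     (0,0,1): 1.0, (1,0,1): 0.6, (0,1,1): 1.0, (1,1,1): 0.0}
alpha, beta = 0.05, 4.0
lam_beta = 0.05                                    # lambda * beta
states = [(a1, a2) for a1 in (0, 1) for a2 in (0, 1)]
w = {(a, s): 1.0 for a in (0, 1) for s in states}  # actor weights
v = {s: 1.0 for s in states}                       # relative state values
u = 1.0                                            # average-reward estimate

def policy(s):
    e = [math.exp(beta * w[(a, s)]) for a in (0, 1)]
    z = e[0] + e[1]
    return [e[0] / z, e[1] / z]

s = (0, 1)
earned = 0.0
for t in range(50000):
    p = policy(s)
    a = 0 if random.random() < p[0] else 1
    r = 1.0 if random.random() < g[(a, s[0], s[1])] else 0.0
    s_next = (a, s[0])
    delta = r - u + v[s_next] - v[s]   # average-reward TD error
    v[s] += alpha * delta              # critic update (Eq. 10)
    w[(a, s)] += lam_beta * delta      # actor update (Eq. 10)
    u += alpha * (r - u)
    earned += r
    s = s_next
print(earned / 50000)
```

With the correct state, the terms ignored by the matching strategy vanish (Eq. 5), so this learner climbs the true average-reward gradient, which is the situation shown by the red curve in Figure 3B.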

Numerical simulations
In the simulations shown in Figures 2 and 3B, the model parameters were set as α = λβ = 0.05, and the initial values of all dynamical variables were set to 1. The value of β was set to β = 4 by default, while it was varied in the Q-learning simulations (Figure 2B). To show the time evolution of reward in Figure 3B, we updated the local average y according to Δy = (r_t − y)/200 from an initial value of 0.64, which is the average reward obtained with even choice probabilities: p_1 = p_2 = 0.5.

Supporting Information
Text S1 Strategies of different learning rules. Several well-known learning algorithms are categorized into the matching, maximizing and other strategies.