
Conceived and designed the experiments: YS. Performed the experiments: YS. Wrote the paper: TF.

The authors have declared that no competing interests exist.

What kinds of strategies subjects follow in various behavioral circumstances has been a central issue in decision making. In particular, which behavioral strategy, maximizing or matching, is more fundamental to animals' decision behavior has been a matter of debate. Here, we prove that any algorithm that achieves the stationary condition for maximizing the average reward leads to matching when it ignores the dependence of the expected outcome on the subject's past choices. We term this strategy of partial reward maximization the “matching strategy”. We then apply this strategy to the case where the subject's decision system updates the information used for making a decision. Such information includes the subject's past actions or sensory stimuli, and the internal storage of this information is often called “state variables”. We demonstrate that the matching strategy provides an easy way to maximize reward when combined with exploration of the state variables that correctly represent the information crucial for reward maximization. Our results reveal for the first time how a strategy that achieves matching behavior is beneficial to reward maximization, providing a novel insight into the relationship between maximizing and matching.

How do animals, including humans, determine appropriate behavioral responses when their behavioral outcomes are uncertain? Decision-making is a fundamental process by which the brain organizes behavior, and it depends crucially on how subjects have been rewarded for their past behavioral responses. The mechanism of reward-driven learning has been studied extensively, both theoretically and experimentally. A well-known example is the reinforcement learning theory based on the temporal difference (TD) error algorithm.

Nevertheless, animals often exhibit matching behavior in a variety of decision-making tasks. The matching law states that the fraction of choices allocated to each option matches the fraction of income earned from that option: $n_a/(n_1 + n_2 + \dots + n_N) = I_a/(I_1 + I_2 + \dots + I_N)$, where $n_a$ denotes the number of choices of option $a$ and $I_a$ the total income that these choices have earned.

Decision-making models that reproduce matching behavior have been proposed.

We first prove that partial maximization of reward leads to matching behavior irrespective of the mathematical algorithm used for this computation. A crucial step is to define “the matching strategy” that plays a central role in the present study. We then demonstrate how the matching strategy substitutes for the maximizing strategy in a decision-making task that is difficult to solve, when matching is combined with an appropriate utilization of available information sources.

The analysis is easier if we express the matching law as follows: the average return per choice of option $a$, $\langle r\rangle_a = I_a/n_a$, takes the same value for every option $a$ that the subject chooses with nonzero probability.
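As a minimal numerical check (our illustration, with hypothetical choice counts), the per-choice form of the matching law above is equivalent to the classic income-fraction form:

```python
# Illustrative check (hypothetical numbers, not data from the paper):
# if the average return per choice is equal across options, the classic
# matching law (income fractions match choice fractions) follows.
n = [300, 100]            # hypothetical choice counts for options 1 and 2
r_per_choice = 0.45       # common average return per choice, <r>_a
income = [n_a * r_per_choice for n_a in n]     # I_a = n_a * <r>_a

choice_frac = [n_a / sum(n) for n_a in n]
income_frac = [I_a / sum(income) for I_a in income]
assert all(abs(c - i) < 1e-12 for c, i in zip(choice_frac, income_frac))
```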

We analyze the outcome of the decision process without specifying the details of the neural decision system. To this end, we assume a set of ‘synapses’ $w_1, w_2, \dots, w_m$ whose values determine the choice probability $p_a$ of each option $a$. The average return $\langle r\rangle_a$ may also depend on the synaptic values $w_j$, because the expected outcome can depend on the past choices generated by these synapses.

With the above definitions, we can describe the average reward per choice as $\langle r\rangle = \sum_a p_a \langle r\rangle_a$. The stationary condition for maximizing the average reward is obtained by differentiating $\langle r\rangle$ with respect to each synapse $w_j$: $\frac{\partial \langle r\rangle}{\partial w_j} = \sum_a \frac{\partial p_a}{\partial w_j}\langle r\rangle_a + \sum_a p_a \frac{\partial \langle r\rangle_a}{\partial w_j} = 0$ (Eq. 2).

The first term contains the explicit dependence of the choice probabilities $p_a$ on $w_j$, whereas the second term contains the implicit dependence of the average returns $\langle r\rangle_a$ on $w_j$, which arises because the expected outcome may depend on the past choices.

In order to maximize reward, the brain has to explore the correct dependence of the reward probability on the past choices. It seems, however, difficult to infer this dependence correctly with little knowledge of an accurate model of the environment. In such a difficult situation, the brain may simply omit the second term in Eq. 2 in its practical attempt to maximize reward, seeking synaptic values that satisfy $\sum_a \frac{\partial p_a}{\partial w_j}\langle r\rangle_a = 0$ for every $j$ (Eq. 3).

Multiplying Eq. 3 by arbitrary variations $\{\delta w_j\}$ of the synapses and summing over $j$, we obtain $\sum_a \delta p_a \langle r\rangle_a = 0$, where $\delta p_a = \sum_j (\partial p_a/\partial w_j)\,\delta w_j$ is the resulting variation of the choice probability. Because the probabilities sum to one, the variations obey $\delta p_1 + \delta p_2 + \dots + \delta p_N = 0$. Since the variations are otherwise arbitrary, the average return $\langle r\rangle_a$ must take the same value for every option $a$ whose probability can vary, which is precisely the matching law.

Thus, the steady choice behavior exhibits matching when the decision system, in aiming at the stationary condition of reward maximization, ignores the influence of the subject's past choices on the expected outcome. Hereafter, we call this suboptimal maximization strategy, which achieves Eq. 3, the “matching strategy”. By contrast, we call the strategy that directly solves Eq. 2 the “maximizing strategy”.

To demonstrate the above relationship between the matching and maximizing strategies, we study an alternative choice task in which the reward probability depends on the current choice $a_t$ and the most recent choice $a_{t-1}$, taking values $r_{a_t a_{t-1}}$ with $a_t, a_{t-1} = 1, 2$. The subject, however, determines each choice independently of the past choices, choosing option 1 with a fixed probability $p_1$.

The reward probability is given as a function of the current and most recent choices, but the subject makes each choice independently of the past choices. The task parameters are set as $r_{11} = 0$, $r_{21} = 0.2$, $r_{12} = 1$ and $r_{22} = 0.4$. The expected returns are given as $\langle r\rangle_a = r_{a1} p_1 + r_{a2}(1 - p_1)$ and $\langle r\rangle = p_1\langle r\rangle_1 + (1 - p_1)\langle r\rangle_2$. The matching (vertical solid line) and maximizing (vertical dashed line) choice probabilities are obtained as solutions of the equations $\langle r\rangle_1 = \langle r\rangle_2$ and $d\langle r\rangle/dp_1 = 0$, respectively. The matching strategy yields a smaller average reward than the maximizing strategy.
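The matching and maximizing solutions for these parameters can be computed directly. The following sketch (our own illustration, using the task parameters above) recovers both points:

```python
# Matching vs. maximizing for the two-choice task with
# r11 = 0, r21 = 0.2, r12 = 1, r22 = 0.4 (parameters from the text).
r = {(1, 1): 0.0, (2, 1): 0.2, (1, 2): 1.0, (2, 2): 0.4}

def returns(p1):
    """Per-option average returns when choices are i.i.d. with P(a=1) = p1."""
    r1 = r[(1, 1)] * p1 + r[(1, 2)] * (1 - p1)   # <r>_1
    r2 = r[(2, 1)] * p1 + r[(2, 2)] * (1 - p1)   # <r>_2
    return r1, r2

def avg_reward(p1):
    r1, r2 = returns(p1)
    return p1 * r1 + (1 - p1) * r2

# Matching point: <r>_1 = <r>_2  ->  1 - p1 = 0.4 - 0.2*p1  ->  p1 = 0.75
p_match = 0.75
assert abs(returns(p_match)[0] - returns(p_match)[1]) < 1e-12

# Maximizing point: <r> = -0.8*p1^2 + 0.4*p1 + 0.4 peaks at p1 = 0.25
p_max = max((i / 1000 for i in range(1001)), key=avg_reward)
assert abs(p_max - 0.25) < 1e-3
print(avg_reward(p_match), avg_reward(p_max))  # matching ~0.25 vs. maximizing ~0.45
```

The matching point equalizes the two returns at $p_1 = 0.75$, while the average reward over independent choices is maximized at $p_1 = 0.25$; the matching strategy therefore earns less than the maximizing strategy in this task.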

The matching strategy enables us to derive a variety of learning rules that lead to matching behavior (see Supporting Information).

The horizontal and vertical axes indicate the cumulative numbers of choices given to options 1 and 2, respectively. Dashed and solid line segments indicate the slopes corresponding to the maximizing and matching choice probabilities, respectively. See Methods for details.

The quantitative analysis conducted above shows that alternating between the two options yields a greater average reward ($(r_{12} + r_{21})/2 = 0.6$) than the best independent choice behavior in that task. Thus, to produce a better outcome in some situations, the subject is required to make each choice depending on the past choices or other available information. Below, we investigate the relationship between the matching and maximizing strategies, taking all possible choice behaviors into account.
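The value of strict alternation can be checked directly (a sketch of our own, not the paper's simulation code, using the task parameters introduced above):

```python
# Task parameters from the two-choice example: r[(a_t, a_{t-1})].
r = {(1, 1): 0.0, (2, 1): 0.2, (1, 2): 1.0, (2, 2): 0.4}

def average_reward_of_cycle(cycle):
    """Expected reward per choice of a deterministic periodic choice sequence."""
    n = len(cycle)
    return sum(r[(cycle[t], cycle[t - 1])] for t in range(n)) / n

# Strict alternation ...1,2,1,2,... earns (r12 + r21)/2 = 0.6 per choice,
# exceeding the best choice-independent behavior (0.45).
alternating = average_reward_of_cycle([1, 2])
assert abs(alternating - 0.6) < 1e-12
```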

To make the argument as general as possible, we include the case where the subject may receive sensory signals $x_t$ at each time step $t$. Let $h_t$ denote the history of the subject's past experiences, $h_t = (a_{t-1}, r_{t-1}, x_{t-1}, a_{t-2}, r_{t-2}, x_{t-2}, \dots)$. A decision-making task then specifies the conditional probability distribution $P(x_{t+1}, r_t \mid a_t, x_t, h_t)$ of the current outcome $r_t$ and the next sensory signal $x_{t+1}$, given the current choice $a_t$, the current signal $x_t$, and the history $h_t$.

In practice, however, it is difficult to optimize the dependence of the choice $a_t$ on the entire history $(x_t, h_t)$, since the amount of this information grows unboundedly as time proceeds. The decision system therefore has to compress the history into a tractable representation for choosing $a_t$.

We may call the above compressed representation the ‘state variables’ $s_t$: the decision system computes $s_t$ from $(x_t, h_t)$ and determines the choice $a_t$ depending only on $s_t$. If $s_t$ correctly retains all the information relevant to the outcome, the task statistics are fully specified by the conditional probability $P(r_t, s_{t+1} \mid a_t, s_t)$ of the outcome $r_t$ and the next state $s_{t+1}$, given the current choice $a_t$ and state $s_t$.

Suppose that the decision system adopts a certain definition of the state variables $s_t$. At each time $t$, the system chooses option $a$ in state $s$ with probability $p_{as}$, which is determined by the synapses. Let $\langle r\rangle_{as}$ denote the average return obtained by choosing option $a$ in state $s$; like $p_{as}$, the return $\langle r\rangle_{as}$ may depend on the subject's choice behavior.

Taking the state dependence into account, the average reward is written as $\langle r\rangle = \sum_s P_s \sum_a p_{as}\langle r\rangle_{as}$, where $P_s$ is the probability that state $s$ occurs. The stationary condition for maximizing the average reward is obtained by differentiating $\langle r\rangle$ with respect to each synapse $w_j$ (Eq. 6).

The maximizing strategy attempts to achieve Eq. 6 by taking the whole dependence on the synapses into account, including the dependence of the state distribution $P_s$ and the returns $\langle r\rangle_{as}$ on the past choices through the state transitions to $s_{t+1}$. The matching strategy, by contrast, ignores the dependence of the outcome $r_t$ and the next state $s_{t+1}$ on the past choices, and retains only the explicit dependence of the choice probabilities $p_{as}$ on the synapses $w_j$. This yields the stationary condition $\sum_{s,a} P_s \frac{\partial p_{as}}{\partial w_j}\langle r\rangle_{as} = 0$ for every $j$ (Eq. 7).

Note that the terms omitted in the matching strategy differ for different definitions of the state. Then, using Eq. 7 and the conservation of probability, we can extend the matching law to the case of state-dependent choice behaviors (see Supporting Information): in each state $s$, the average return $\langle r\rangle_{as}$ takes the same value for every option $a$ chosen with nonzero probability (Eq. 8).

The extended matching law given as Eq. 8 depends also on the definition of the state.

We schematically illustrate the relationships between the maximizing and matching strategies under correct and incorrect definitions of the state variables. When the state variables are defined correctly, the outcome statistics are fully captured by $P(r_t, s_{t+1} \mid a_t, s_t)$, and the matching strategy can reach the globally maximal reward attained by the maximizing strategy. When the definition is incorrect, $P(r_t, s_{t+1} \mid a_t, s_t)$ fails to capture the task statistics, and the two strategies generally settle into different, suboptimal behaviors.

(A) The performance of the matching and maximizing strategies based on correctly (red) or incorrectly (blue) defined state variables is shown schematically. (B) Actor-critic systems (see Methods) were simulated in a task in which the current and two most recent choices, $a_t$, $a_{t-1}$ and $a_{t-2}$, specify the reward probability according to the following task parameters: $r_{111} = 0$, $r_{211} = 0.6$, $r_{121} = 0.9$, $r_{221} = 1$, $r_{112} = 1$, $r_{212} = 0.6$, $r_{122} = 1$, and $r_{222} = 0$. The systems employed no state variables (blue); the incorrect state variable $a_{t-1}$ (magenta); or the correct state variables $(a_{t-1}, a_{t-2})$ (red).

To explain the above results, we conduct numerical simulations of a simple alternative task in which the reward probability is given as a function of the current and two most recent choices, $r_{a_t a_{t-1} a_{t-2}}$ (see Methods). In this task, the globally maximal reward can be attained only when each choice $a_t$ depends on the correct state variables $(a_{t-1}, a_{t-2})$. An actor-critic system (see Methods) operating on the correct state variables achieves the globally maximal average reward. When the system instead employs the incorrect state variable $a_{t-1}$ or no state variable, the best average rewards (magenta and blue dashed lines, respectively) are smaller than the globally maximal one, and the average rewards earned by the actor-critic systems operating on the incorrect state variables (magenta and blue curves) are still smaller.
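The role of the state definition can be illustrated by enumerating deterministic policies (our simplified sketch, not the actor-critic simulation itself; the task parameters are those of the figure):

```python
from itertools import product

# Task parameters r[(a_t, a_{t-1}, a_{t-2})] from the figure.
r = {(1, 1, 1): 0.0, (2, 1, 1): 0.6, (1, 2, 1): 0.9, (2, 2, 1): 1.0,
     (1, 1, 2): 1.0, (2, 1, 2): 0.6, (1, 2, 2): 1.0, (2, 2, 2): 0.0}

def cycle_reward(policy):
    """Average reward of the cycle induced by a deterministic policy
    mapping the state (a_{t-1}, a_{t-2}) to the next choice."""
    state, seen, rewards = (1, 1), {}, []
    while state not in seen:
        seen[state] = len(rewards)
        a = policy[state]
        rewards.append(r[(a,) + state])
        state = (a, state[0])
    cycle = rewards[seen[state]:]          # discard the transient prefix
    return sum(cycle) / len(cycle)

states = list(product((1, 2), repeat=2))
# Correct state variables (a_{t-1}, a_{t-2}): all 16 deterministic policies.
best_full = max(cycle_reward(dict(zip(states, choice)))
                for choice in product((1, 2), repeat=4))
# Incorrect state variable a_{t-1} only: policy ignores a_{t-2}.
best_prev = max(cycle_reward({s: f[s[0] - 1] for s in states})
                for f in product((1, 2), repeat=2))
print(best_full, best_prev)  # ~0.9 and 0.75
```

Only policies that condition on both $(a_{t-1}, a_{t-2})$ reach the cycle $1,1,2,2$ with average reward 0.9; deterministic policies restricted to $a_{t-1}$ alone can do no better than alternation (0.75), consistent with the gap between the red and magenta performance levels.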

Thus, the matching strategy is as efficient as the maximizing one when combined with a mechanism to explore and select a correct definition of the state variables. However, the matching strategy generally deviates from the maximizing one when the choice behaviors are restricted by an incorrect definition of the state variables.

How subjects decide behavioral responses based on their experience and reward expectancy is a central topic in neuroscience. In particular, whether matching or maximizing is the more fundamental choice behavior in decision making has long been debated. The relationship between matching and maximizing behaviors has often been discussed in the restricted case where every choice is independent of the past choices. For instance, Loewenstein and Seung proved that learning rules driven by the covariance between reward and neural activity lead to matching behavior.

Most importantly, we have clarified the general relationship between the matching and maximizing strategies among all possible choice behaviors. We have proved that the matching strategy can lead to the optimal choice behavior when the subject's decision system correctly discovers the information sources sufficient to specify the expected outcome, and can utilize this information through state variables. Differences between the matching and maximizing strategies can arise when the decision system assigns incorrect information sources to the state variables. Our results reveal for the first time how a strategy that achieves matching behavior is beneficial to reward maximization, and how ignorance of the relevant information leads to matching behavior.

The information sources relevant to the expected outcome are task-dependent. In realistic situations, the subject would have no a priori knowledge of which information sources are relevant, and would have to explore them through experience.

Our proof of the matching law (Eq. 3) is valid for a wide class of natural learning rules, including those employing the widely used soft-max function for choice probabilities (see below). In the following, however, we explicitly describe the assumptions necessary to make our proof mathematically rigorous. For decision-making tasks, we assumed 1) a discrete time step $t$ at which each choice is made and its outcome is delivered; 2) choice probabilities $p_{as}$ that are determined by a set of synapses $w_1, w_2, \dots, w_m$; and 3) differentiability of $p_{as}$ with respect to every synapse $w_j$.

To examine the performance of the matching and maximizing strategies, we introduced a decision-making task in which reward is either given ($r_t = 1$) or not ($r_t = 0$) depending on the current choice $a_t$ and the two most recent choices $a_{t-1}$ and $a_{t-2}$. Each choice is taken from one of two options ($a = 1, 2$), and the reward probability is specified by the task parameters $r_{aa'a''} \equiv P(r_t = 1 \mid a_t = a, a_{t-1} = a', a_{t-2} = a'')$. The best behaviors for decision systems with the incorrect state variable $a_{t-1}$ and with no state variable can be calculated by restricting the choice probabilities $p_{aa'a''}$ as $p_{aa'1} = p_{aa'2} = p_{aa'}$ and $p_{a1} = p_{a2} = p_a$, respectively.

Synapse-updating rules can be described by the change $\Delta w_j$ of each synapse $w_j$ after every choice. In the two-option case, the choice probabilities $p_1(w_0)$ and $p_2(w_0)$ can be given by a soft-max function of a decision variable $w_0$, which is updated by the difference between the estimated average returns of the two options, $\Delta w_0 \propto \hat r_1 - \hat r_2$. The local matching rule estimates these returns by leaky integration of the recent reward history at each time $t$.
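As a toy illustration of such a rule (a melioration-style sketch under our own assumptions, not a rule derived in the text), the following deterministic dynamics climb the difference between the average returns and settle at the matching point of the two-choice task introduced in the Results:

```python
# Task parameters from the two-choice example: r[(a_t, a_{t-1})].
r = {(1, 1): 0.0, (2, 1): 0.2, (1, 2): 1.0, (2, 2): 0.4}

p1, eta = 0.5, 0.1            # initial choice probability and learning rate
for _ in range(2000):
    # Average returns under the current i.i.d. choice probability p1.
    r1 = r[(1, 1)] * p1 + r[(1, 2)] * (1 - p1)   # <r>_1
    r2 = r[(2, 1)] * p1 + r[(2, 2)] * (1 - p1)   # <r>_2
    # Shift probability toward the option with the higher average return.
    p1 += eta * (r1 - r2) * p1 * (1 - p1)

# The dynamics converge to the matching point p1 = 0.75 (where <r>_1 = <r>_2),
# not to the reward-maximizing probability p1 = 0.25.
assert abs(p1 - 0.75) < 1e-6
```

The stationary point of these dynamics equalizes the per-option returns, which is exactly the matching condition; this is one concrete way a local rule can implement the matching strategy while remaining blind to the influence of past choices on the outcome.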

An iterative method to achieve Eq. 7 can be implemented by an actor-critic system. The actor updates the choice probabilities $p_{as}$, while the critic estimates the values of the states $s$. The differences $Q_{as} - V_s$ between the action values $Q_{as}$ and the state values $V_s$ represent the relative values of choosing option $a$ in state $s$, and drive the actor's updates at each time $t$.

In the simulations, the initial choice probabilities were set to $p_1 = p_2 = 0.5$.

Strategies of different learning rules. Several well-known learning algorithms are categorized into the matching, maximizing and other strategies.

(0.19 MB DOC)

Matching strategy in state-dependent choice behaviors. The extensions of the stationary condition and the matching law are derived.

(0.19 MB DOC)

We thank Helena Wang for careful reading of the manuscript.