Implicit Value Updating Explains Transitive Inference Performance: The Betasort Model

Greg Jensen; Fabian Muñoz; Yelda Alkan; Vincent P. Ferrera; Herbert S. Terrace

doi:10.1371/journal.pcbi.1004523

Abstract

Transitive inference (the ability to infer that B > D given that B > C and C > D) is a widespread characteristic of serial learning, observed in dozens of species. Despite these robust behavioral effects, reinforcement learning models reliant on reward prediction error or associative strength routinely fail to perform these inferences. We propose an algorithm called betasort, inspired by cognitive processes, which performs transitive inference at low computational cost. This is accomplished by (1) representing stimulus positions along a unit span using beta distributions, (2) treating positive and negative feedback asymmetrically, and (3) updating the position of every stimulus during every trial, whether that stimulus was visible or not. Performance was compared for rhesus macaques, humans, and the betasort algorithm, as well as Q-learning, an established reward-prediction error (RPE) model. Of these, only Q-learning failed to respond above chance during critical test trials. Betasort’s success (when compared to RPE models) and its computational efficiency (when compared to full Markov decision process implementations) suggest that the study of reinforcement learning in organisms will be best served by a feature-driven approach to comparing formal models.

Author Summary

Although machine learning systems can solve a wide variety of problems, they remain limited in their ability to make logical inferences. We developed a new computational model, called betasort, which addresses these limitations for a certain class of problems: Those in which the algorithm must infer the order of a set of items by trial and error. Unlike extant machine learning systems (but like children and many non-human animals), betasort is able to perform “transitive inferences” about the ordering of a set of images. The patterns of error made by betasort resemble those made by children and non-human animals, and the resulting learning is achieved at low computational cost. Additionally, betasort is difficult to classify as either “model-free” or “model-based” according to the formal specifications of those classifications in the machine learning literature. One of the broader implications of these results is that achieving a more comprehensive understanding of how the brain learns will require analysts to entertain other candidate learning models.

Citation: Jensen G, Muñoz F, Alkan Y, Ferrera VP, Terrace HS (2015) Implicit Value Updating Explains Transitive Inference Performance: The Betasort Model. PLoS Comput Biol 11(9): e1004523. https://doi.org/10.1371/journal.pcbi.1004523

Editor: Jill X. O’Reilly, Oxford University, UNITED KINGDOM

Received: May 6, 2015; Accepted: August 24, 2015; Published: September 25, 2015

Copyright: © 2015 Jensen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: This work was supported by US National Institute of Mental Health <http://www.nimh.nih.gov/>, grant number 5R01MH081153 awarded to VPF and HST. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Tests of transitive inference (TI) are among the oldest tools for assessing abstract thinking. First introduced by Piaget [1] to demonstrate the emergence of logic in child development, TI has since been studied in many species. The cognitive faculties that permit transitive inference are very general: To date, TI has been observed in every vertebrate species tested [2], including primates [3], rodents [4], birds [5], and even fish [6]. The widespread occurrence of this phenomenon suggests that TI procedures tap into deep and enduring learning systems.

There are obvious benefits to being able to compare scalar quantities like distance and amount, all of which are transitive by definition. Evidence also suggests that subjective evaluations of temporal duration [7] and subjective utility [8] are treated as scalar variables. For all of these characteristics, organisms will inevitably be faced with choices. Transitive comparisons avoid costly (and potentially risky) trial-and-error by allowing subjects to compare relative values following a minimal number of exposures. Provided an appropriately scalar encoding, such inferences can be achieved by direct comparison.

Some biologically relevant orderings are even more abstract, and may change rapidly. Social dominance hierarchies are an important example. Systematic analysis suggests that the vast majority of dominance relations in animals are transitive [9]. Status is often not obvious from physical appearance alone, and animals can avoid costly conflicts if they can discover and update hierarchies as third-party observers. Transitive inferences of social rank, based on observation alone, have been reported in pinyon jays [10], tilapia [8], and rhesus monkeys [11]. Furthermore, comparative studies in both corvid species [5] and lemur species [3] report a link between TI performance in a given species and the typical size of social groups in that species. Given that social groups can, in some species, consist of dozens or even hundreds of individuals, inferring social relations from partial information depends on an efficient algorithm.

To avoid confounds, classical TI tasks are entirely abstract, using ordered lists assembled from otherwise arbitrary stimuli. For example, seven photographic stimuli are given the ordered labels A through G. During training, subjects are only shown randomly selected adjacent pairs (AB, BC, CD, DE, EF, and FG), and are required to select one stimulus in every trial. The only feedback provided is a reward (if the earlier list item was selected) or no reward (if the later item was selected). No other cues indicate that stimuli have an ordering, and no more than two items are ever simultaneously presented. Following training, preference is assessed for non-adjacent pairs (e.g. BD). If subjects select earlier items in novel pairs at above-chance levels, they are said to have performed a “transitive inference” because doing so exploits the transitive relationship that B > C and C > D implies B > D. Fig 1 depicts sample stimuli, trial structure, and stimulus pairings for a 7-item TI task.

Fig 1. The transitive inference procedure, as implemented for rhesus macaques responding using eye tracking.

(Top) Each session used a novel seven-item list, like the one depicted here. However, subjects were never presented with the entire list. (Middle) Each trial began with a fixation point. Following fixation, two stimuli appeared, and subjects received feedback upon a saccade to either stimulus. If the stimulus appearing earlier in the list was selected, a reward was delivered; if the other stimulus was selected, the animal was subjected to a timeout. Either outcome constituted the completion of a trial. An incomplete trial (e.g. subjects fixating but failing to saccade to a stimulus) did not count toward the set of trials completed. All dashed lines and arrows represent eye movements and fixation areas, and did not appear on the screen. (Bottom) Subjects were initially trained only on the six adjacent pairs. Following adjacent pair training, subjects were then tested on all twenty-one pairs. These varied in their ordinal distance (with the pair AG being the largest). Additionally, six pairs were considered the critical transfer pairs (shaded in gray) because they were neither adjacent nor did they include the terminal items. Consequently, these are the pairs that provide the strongest test of transitive inference and symbolic distance effects.

https://doi.org/10.1371/journal.pcbi.1004523.g001

In the above example, only the stimuli A and G are differentially rewarded, and can therefore be identified on the basis of reward prediction error. Accordingly, these stimuli are correctly identified more often (the terminal item effect). Correct choices among stimuli B, C, D, E, and F are more difficult to explain, however, because their expected value during training is 0.5. The pair BD is a critical pair during testing because that pair is novel and contains no terminal items. Learning models that rely on only the expected values of stimuli fail to make the inference and respond at chance levels [12].
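The expected values described above can be checked by enumeration. The following is a minimal sketch, not part of the original analysis: the names `ITEMS`, `adjacent`, `critical`, and `expected_value` are hypothetical, and the calculation assumes that each adjacent pair is presented equally often during training.

```python
from itertools import combinations

ITEMS = "ABCDEFG"  # ordered list; the earlier item in a pair is the rewarded choice

# Training presents only the six adjacent pairs.
adjacent = [(ITEMS[i], ITEMS[i + 1]) for i in range(len(ITEMS) - 1)]

# Testing presents all pairs; critical pairs are non-adjacent and exclude A and G.
all_pairs = list(combinations(ITEMS, 2))
critical = [
    (a, b) for a, b in all_pairs
    if ITEMS.index(b) - ITEMS.index(a) > 1
    and "A" not in (a, b) and "G" not in (a, b)
]

def expected_value(item):
    """Mean reward for choosing `item`, averaged over the training pairs containing it."""
    outcomes = [1.0 if item == first else 0.0
                for first, second in adjacent if item in (first, second)]
    return sum(outcomes) / len(outcomes)

values = {item: expected_value(item) for item in ITEMS}
print(values)    # only A and G differ from 0.5
print(critical)  # the six critical transfer pairs
```

Under these assumptions, a learner that stores only these values has no basis for preferring B over D.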

Despite decades of research, controversy remains over what exactly is learned during TI tasks. One such debate regards whether such learning requires cognitive processes, or can instead be explained merely by associative mechanisms. The cognitive learning school of thought holds that inferring the order of BD based only on BC and CD implies internal representation of the ordered list [13]. On the other hand, the associative learning school maintains that TI can be explained by stimulus-response-outcome associations alone [14]. Although associative models of TI struggle to accommodate the full range of empirical findings [15], their mathematical formalism at least permits specific predictions [16]. Cognitive models, by contrast, have historically been too vague to permit the simulation of behavior [12].

Here, we attempt to resolve this difficulty by comparing the ability of computational models to explain aspects of TI performance observed in humans and monkeys. These include the transfer of knowledge from adjacent to non-adjacent pairs, and symbolic distance effects. One model, drawn from the machine learning literature, can only learn from frequencies of reward delivery. Another is a new model, called betasort, that can infer the relative list positions of the stimuli. We argue that these models are representative of associative learning on the one hand, and cognitive learning on the other.

Humans and monkeys performed a transitive inference task, and their performance was characterized in terms of several learning models, including betasort. Betasort successfully performed the transitive inference task at low computational cost, whereas the associative model was unable to learn during adjacent pair training or to show distance effects at transfer.

Models

Terminology and Notation

When situating TI in the current literature, it is important to define terms. The overarching topic of reinforcement learning (RL) pertains to how subjects learn by trial and error, whether through associative or cognitive processes. This approach is informed by the machine learning literature (popularized by Sutton and Barto [17]), which specifies a different distinction: “model-free” RL vs. “model-based” RL [18]. Typically, these two groupings of algorithms are presented in the following fashion:

A number of accounts of human and animal behavior posit the operation of parallel and competing valuation systems in the control of choice behavior. In these accounts, a flexible but computationally expensive model-based reinforcement-learning system has been contrasted with a less flexible but more efficient model-free reinforcement-learning system.

Otto and colleagues, 2013, p. 751 [19]

With few exceptions, the “model-based” algorithms used by computational neuroscientists rely on contingency tables that relate states and actions. This represents a vast range of potential models, which either are or seek to approximate the behavior of Markov decision processes (MDPs). “Model-free” algorithms are in turn typically assumed to have the following characteristics:

Each action is represented by an expected value of reward.
Values are updated as a function of discrepancy between the expectation and outcome, called reward-prediction error (RPE).
Predictions are made about available actions, so only values associated with available actions are updated.

Such algorithms can solve certain problems without contingency tables, instead using RPE to approximate the expected value of a given action. These ‘value function approximations’ often rely on dynamic programming techniques pioneered by Bellman [20]. These estimated values converge at the limit with the stochastic expectations of MDP models under certain conditions. When conditions are good for rapid convergence, RPE models give rise to adaptive behavior without instantiating a contingency table [21].
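The three characteristics listed above can be illustrated with a generic delta-rule update. This is a sketch of the general RPE idea only, not the Q-learning implementation evaluated in this study; the function name `rpe_update` and the learning rate `alpha` are assumptions introduced for illustration.

```python
def rpe_update(values, chosen, reward, alpha=0.1):
    """Delta-rule update: only the chosen action's value moves toward the outcome."""
    updated = dict(values)
    prediction_error = reward - updated[chosen]  # reward-prediction error (RPE)
    updated[chosen] += alpha * prediction_error
    return updated

# Every stimulus starts with the same expected value.
values = {item: 0.5 for item in "ABCDEFG"}

# A rewarded choice of C changes the value of C and nothing else.
after = rpe_update(values, chosen="C", reward=1.0)
print(after["C"], after["D"])
```

Because stimuli that were not acted upon are left untouched, feedback about one pair carries no information about any other pair.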

Defining ‘available actions’ in a clever fashion permits RPE models to generalize. For example, by recognizing that pressing a button with one’s left or right hand may be functionally equivalent, an algorithm can learn the general predicted value of ‘button pressing’ independent of which hand is used. The most powerful such generalizations yet observed are those of “deep Q-network” (DQN) learning [23], which performs well under many (but by no means all) testing conditions. Other value function approximations are sometimes labeled as “model-free,” such as the Rescorla-Wagner model [22]. To avoid misunderstanding, we henceforth will refer to MDP and RPE models directly, rather than refer to broad categories of algorithms.

Although value function approximation can produce successful behavior in many contexts, it routinely fails to yield effective solutions to TI problems. Because each ‘state’ (i.e. the pair of stimuli currently visible) is independent of the previous action, and because the stimuli themselves are assigned a rank arbitrarily, there are no explicit cues that RPE algorithms can use to enhance their predictions about the expected value of the non-terminal items. Furthermore, because subjects are only told whether a response was ‘correct’ or ‘incorrect’ (as opposed, for example, to being told the distance between items following every trial), no additional information is provided about the relationship between stimuli.

Let ch_t denote the index associated with a subject’s choice at time t. Let r_t indicate the delivery of a reward (or lack thereof), indicated by a value of 1.0 or 0.0 respectively. Let ℵ denote the set of all stimuli presently employed in the experiment. ℵ+_t denotes only those stimuli that are presented during the current trial, while ℵ−_t denotes those stimuli whose presence is implied by past experience but that are not currently visible. Additionally, let nc_t denote the set of stimuli not chosen. The models here described also make a distinction between an updating policy (which modifies memory as a function of feedback) and a choice policy (which selects the next behavior). These are best understood as subroutines.

The Betasort Algorithm

Overview.

Betasort is designed to be a computationally inexpensive formalization of the spatial coding hypothesis [13, 24–27]. By coding stimulus position spatially, betasort can perform inferences over arbitrarily large sets of items. By treating item position as a density function, rather than a point estimate, the uncertainty associated with a position can also be represented. The feedback provided during the TI task is used to shift and consolidate those stimulus densities.

Betasort directly instantiates a spatial model, and so bears little functional resemblance to an MDP approximator. It is instead based on three principles. The first is the use of beta distributions. Although commonly used as sampling distributions for probabilities, we instead use them here to represent stimulus position on a unit scale. Betasort selects behaviors using these distributions, and then updates stimulus positions and their uncertainty. The second principle is that feedback should be used to update the position of a stimulus, rather than its expected value. Consequently, when the outcome of an action is satisfactory, one should consolidate the current position, rather than shift it. The third principle is that every stimulus representation should be updated during every trial, regardless of which stimuli are presented. Collectively, these principles provide a plausible mechanism for transitive inference. A schematic representation of the algorithm is provided in Fig 2.

Fig 2. Outline of the betasort algorithm over the course of one trial.

The algorithm’s logic is presented in both a schematic (left) and detailed (right) outline. Rectangles refer to operations, diamonds to logical branches, and octagons to loops that iterate over sets of items. Four phases are depicted: the choice policy (red), the relaxation of the contents of memory (green), the processing of explicit feedback (blue), and implicit inference (yellow).

https://doi.org/10.1371/journal.pcbi.1004523.g002

The position of a stimulus i is represented by two parameters: An “upper” parameter U_i and a “lower” parameter L_i, both positive. If U_i > L_i, then the stimulus position is closer to the top of the scale; if L_i > U_i, then it is closer to the bottom. As U_i and L_i both get larger, the uncertainty associated with the stimulus position decreases. The density function over a sample space from 0.0 to 1.0 is given by:

$$\mathrm{Beta}(x; U_i, L_i) = \frac{\Gamma(U_i + L_i)}{\Gamma(U_i)\,\Gamma(L_i)}\, x^{U_i - 1} (1 - x)^{L_i - 1} \tag{1}$$

Here, Γ() represents the gamma function. When U_i = L_i = 1.0, the probability density is uniform; it grows increasingly normal as these parameters increase. In order to consolidate a stimulus position, rather than shift it, these parameters are increased by a proportion of their current value (i.e. U_i ← U_i + U_i/(U_i + L_i) and L_i ← L_i + L_i/(U_i + L_i)). This distributes a single reward across both parameters, leaving the position intact while reducing its uncertainty.
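As an illustration of consolidation, the sketch below evaluates the beta density and applies one proportional increment. The function names `beta_pdf` and `consolidate` are hypothetical, and the starting values are arbitrary; the increment follows the description above, in which one unit of reward is divided between the two parameters in proportion to their current values.

```python
import math

def beta_pdf(x, upper, lower):
    """Beta density at x in (0, 1) with parameters U (upper) and L (lower)."""
    log_norm = math.lgamma(upper + lower) - math.lgamma(upper) - math.lgamma(lower)
    return math.exp(log_norm
                    + (upper - 1.0) * math.log(x)
                    + (lower - 1.0) * math.log(1.0 - x))

def consolidate(upper, lower):
    """Divide one unit of reward between U and L in proportion to their values."""
    total = upper + lower
    return upper + upper / total, lower + lower / total

u0, l0 = 6.0, 2.0                 # arbitrary example values
u1, l1 = consolidate(u0, l0)

mean_before = u0 / (u0 + l0)
mean_after = u1 / (u1 + l1)
print(mean_before, mean_after)    # the position (mean) is unchanged
print(beta_pdf(mean_before, u0, l0), beta_pdf(mean_after, u1, l1))  # the density at the mean rises
```

The mean stays where it was while the density around it becomes more concentrated, which is the sense in which consolidation reduces uncertainty without shifting the position.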

Incrementing values of U_i and L_i is effectively Bayesian updating. The beta distribution at the time of a choice represents a subject’s prior belief about where the stimulus might be, based on the evidence collected up to that point. When the subject receives feedback, this new evidence is factored in, changing the distribution to a posterior belief. The resulting posterior then acts as the prior for the subsequent trial. Although Bayesian updating of most continuous distributions is computationally expensive, the beta distribution is an exception because it is a conjugate prior. This means that, if our prior on an unknown probability is beta-distributed, then so too is the posterior. The parameters of this new posterior are identical to the prior values of U_i and L_i, plus a small increment that corresponds to the feedback. Consequently, updating the beta prior entails almost no computational cost.

Because of this elegant property, the beta distribution is commonly used as a sampling distribution for an unknown probability, on the basis of a set of binary outcomes [28]. If H and T are thought of as the accumulated number of Heads and Tails resulting from flipping a coin, then Beta(p;H, T) yields a credible interval for the probability p of the next toss coming up Heads. Each additional Head or Tail is added to its corresponding count, tightening the beta distribution around the coin’s true probability of Heads. In betasort’s case, however, the aim is not to estimate an unknown probability, but rather an unknown position along the unit scale.
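The coin-flipping case can be written out in a few lines. This is a minimal sketch of conjugate beta updating in general, not of betasort itself; the function names and the sequence of tosses are hypothetical.

```python
def update_counts(heads, tails, outcome):
    """Conjugate beta update: add the new observation to the matching count."""
    return (heads + 1, tails) if outcome == "H" else (heads, tails + 1)

def beta_mean_and_variance(heads, tails):
    """Mean and variance of Beta(heads, tails)."""
    total = heads + tails
    mean = heads / total
    variance = heads * tails / (total ** 2 * (total + 1))
    return mean, variance

heads, tails = 1, 1                       # uniform prior, Beta(1, 1)
prior_mean, prior_var = beta_mean_and_variance(heads, tails)

for outcome in "HHTHHHTH":                # hypothetical sequence of tosses
    heads, tails = update_counts(heads, tails, outcome)

post_mean, post_var = beta_mean_and_variance(heads, tails)
print(heads, tails)
print(prior_var, post_var)                # the posterior is narrower than the prior
```

Each update is a single addition, which is why the conjugate form is so inexpensive compared with numerical Bayesian updating.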

Betasort also tracks the rewards (R_i) and non-rewards (N_i) associated with trials that include each stimulus. Importantly, if a trial is rewarded, the value of R_i is increased for both stimuli. This is because R_i and N_i control the algorithm’s explore/exploit tradeoff, increasing the variability of behavior when the current representation is not functioning effectively.

Betasort’s choice policy (red in Fig 2) draws random values from each position distribution, and selects the largest from among the available actions. The policy uses one free parameter: noise (0.0 ≤ τ ≤ 1.0), which is the probability that betasort ignores its memory and selects an action at random. When τ = 1.0, the algorithm is entirely stochastic; when τ = 0.0, all choices are governed by memory. Note that, early in learning, choices governed by memory will also look like guessing, because of the substantial uncertainty about each stimulus position. Consequently, τ is not strictly a variable that governs ‘guessing behavior,’ but rather one that governs how often the algorithm disregards the contents of memory.

Betasort’s updating policy begins with the relaxation phase (green in Fig 2), which makes use of another free parameter: recall (0.0 ≤ ξ ≤ 1.0), which scales the contents of memory downward during every trial prior to processing the feedback for that trial. For example, if U_i = 20 and L_i = 10, then given ξ = 0.9, these values will be updated to (ξ × U_i) = 18 and (ξ × L_i) = 9, respectively. These representations are further relaxed as a function of R_i and N_i: As the algorithm makes more mistakes, it discounts its representation more rapidly (and thus explores more); given fewer mistakes, it discounts more slowly (and thus exploits more).

Following trial feedback, betasort applies explicit feedback (blue in Fig 2) to those stimuli present in the current trial. If the choice was rewarded, both have their current positions consolidated. If the choice was not rewarded, their positions are shifted to improve performance during later trials. Next, betasort applies implicit inference (yellow in Fig 2) to the values of all stimuli not presented during the trial. If the choice was rewarded, all inferred positions are consolidated; if not, those stimuli that fall between the trial stimuli are consolidated, but those that fall outside the trial pair are shifted toward the edge of the sample space. Fig 3 presents relaxation, explicit feedback, and implicit inference during a single incorrect trial.

Fig 3. Visualization of Betasort’s adjustment of the beta distributions during a single trial in which an incorrect response is given.

For this example, the trial stimuli are the pair CE. The initial conditions show the beta distributions of a well-learned list, with means marked by a vertical line. During the choice phase, a value is drawn randomly from the beta distributions of each trial stimulus, and the stimulus with the larger random value is chosen. In this example, the algorithm incorrectly selects stimulus E, an unlikely but possible event. Immediately following the choice, but before feedback is taken into account, the positions of all stimuli are relaxed (using ξ = 0.6 for this example). This has the effect of making all density functions slightly more uniform, and reduces the influence of older trials in favor of more recent ones. During explicit feedback, the algorithm increases L_E by one, while also increasing U_C by one. This increases the odds of future selections of stimulus C, while decreasing the odds of future selections of stimulus E. Next, the algorithm makes implicit inferences about the positions of all known stimuli that did not appear during the trial. Because stimulus D falls between C and E, its count of successes and failures is consolidated and its position does not change. Stimuli A and B are positioned above the trial stimuli, and so are shifted upward. Stimuli F and G are shifted downward.

https://doi.org/10.1371/journal.pcbi.1004523.g003

Memory structure.

The betasort algorithm makes use of four 7 × 1 vectors to track feedback concerning the available stimuli: U, L, R, and N. The vector U indicates the degree to which stimulus i is close to the top of the unit scale. The vector L plays a similar role for the bottom end of the scale. Jointly, U and L provide the parameters to the beta distributions that represent the estimated position of each stimulus on the unit span. Meanwhile, R and N store rewarded and unrewarded trials for each stimulus, respectively. Thus, if R_i = 10.5 and N_i = 4.5, then the algorithm estimates 70% probability of reward during trials in which i was present, based on the last 15 trials. Although all four vectors conceptually represent sums of discrete events, they support fractional values, resulting from the relaxing phase of the updating policy.
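One way to hold this memory in code is sketched below. The class name `BetasortMemory`, the zero initial values, and the fallback of 0.5 when a stimulus has no history are all assumptions made for illustration; only the four vectors and the reward-rate calculation come from the description above.

```python
from dataclasses import dataclass

@dataclass
class BetasortMemory:
    """Four per-stimulus vectors: U, L (position) and R, N (reward history)."""
    n_items: int = 7

    def __post_init__(self):
        # Assumed starting state: no accumulated evidence for any stimulus.
        self.U = [0.0] * self.n_items
        self.L = [0.0] * self.n_items
        self.R = [0.0] * self.n_items
        self.N = [0.0] * self.n_items

    def reward_rate(self, i):
        """Estimated probability of reward on trials that included stimulus i."""
        total = self.R[i] + self.N[i]
        return self.R[i] / total if total > 0 else 0.5  # assumed fallback

mem = BetasortMemory()
mem.R[2], mem.N[2] = 10.5, 4.5     # fractional values are permitted
print(mem.reward_rate(2))
```

Plain lists of floats are used so that the fractional values produced by relaxation need no special handling.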

Choice policy.

At stimulus onset for every trial, a number X_i is drawn at random for each stimulus in the set ℵ+_t. This number is drawn either from a beta distribution whose parameters U_i and L_i are governed by past learning, or else from a uniform distribution (in which case behavior is entirely random). The probability of choosing entirely randomly is governed by the “noise” parameter τ, such that 0 ≤ τ ≤ 1:

$$X_i \sim \begin{cases} \mathrm{Beta}(U_i + 1,\, L_i + 1) & \text{with probability } 1 - \tau \\ \mathrm{Uniform}(0, 1) & \text{with probability } \tau \end{cases} \tag{2}$$

A value of 1 is added to U_i and L_i in order to act as a prior on the probability distribution. This also prevents the distribution from approaching a singularity as a consequence of some edge conditions during updating.

The betasort choice policy is to select the alternative whose random value is largest:

$$ch_t = \underset{i \in \aleph^{+}_t}{\arg\max}\; X_i \tag{3}$$

This choice policy is only marginally more expensive than softmax, in that it involves only drawing random values from a handful of closed-form probability density functions. It also has the added benefit that, because the absolute values of U_i and L_i are preserved, they govern the narrowness of the beta distributions and thus model a subject’s growing accuracy as a function of increased experience.
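The choice policy can be sketched with the standard library's beta sampler. The function names `draw_positions` and `choose` are hypothetical, and treating the noise decision as a single draw per trial (rather than per stimulus) is an assumption of this sketch.

```python
import random

def draw_positions(presented, U, L, tau, rng):
    """Draw one random value X_i for each presented stimulus."""
    if rng.random() < tau:
        # Noise trial: memory is ignored and every value is uniform on (0, 1).
        return {i: rng.random() for i in presented}
    # Memory trial: draw from Beta(U_i + 1, L_i + 1); the +1 acts as a prior.
    return {i: rng.betavariate(U[i] + 1.0, L[i] + 1.0) for i in presented}

def choose(presented, U, L, tau, rng):
    """Select the presented stimulus whose random value is largest."""
    draws = draw_positions(presented, U, L, tau, rng)
    return max(draws, key=draws.get)

rng = random.Random(0)
U = [50.0, 0.0]   # stimulus 0 sits near the top of the scale
L = [0.0, 50.0]   # stimulus 1 sits near the bottom

memory_choices = [choose([0, 1], U, L, 0.0, rng) for _ in range(200)]
noise_choices = [choose([0, 1], U, L, 1.0, rng) for _ in range(1000)]
print(sum(c == 0 for c in memory_choices) / 200)
print(sum(c == 0 for c in noise_choices) / 1000)
```

With τ = 0.0 and well-separated distributions the higher stimulus is chosen essentially every time, whereas with τ = 1.0 the same memory produces chance-level responding.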

Updating policy.

Betasort’s updating policy involves three stages: Relaxation, explicit updating, and implicit inference. Relaxation weakens the influence of old information in favor of more recent feedback. This is governed by the “recall” parameter ξ, such that 0 ≤ ξ ≤ 1. All four vectors (U, L, R, and N) are multiplied by ξ, so all qualify as “leaky accumulators” [29], steadily decreasing in absolute value. These losses are then counteracted by subsequent updating. In addition to ξ, the values of U and L are further relaxed by a vector of factors ξ_R, based on the reward rate accrued during trials in which a given stimulus was present. When accuracy is high, this additional relaxation is minimal; however, when accuracy is low, more aggressive relaxation yields greater variability in behavior, helping to keep the algorithm from being trapped in local minima. Collectively, relaxation makes the following modifications, for all stimuli i:

$$\begin{aligned} R_i &\leftarrow \xi R_i \\ N_i &\leftarrow \xi N_i \\ \xi_{R,i} &= \frac{R_i}{R_i + N_i} \\ U_i &\leftarrow \xi_{R,i}\, \xi\, U_i \\ L_i &\leftarrow \xi_{R,i}\, \xi\, L_i \end{aligned} \tag{4}$$
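A minimal sketch of the relaxation stage follows. The function name `relax` is hypothetical. The sketch takes the reward-rate factor ξ_R to be R_i / (R_i + N_i), and applies no additional relaxation when a stimulus has no reward history; both choices are assumptions of this sketch rather than details confirmed here.

```python
def relax(U, L, R, N, xi):
    """Return relaxed copies of the four memory vectors."""
    R_new = [xi * r for r in R]
    N_new = [xi * n for n in N]
    U_new, L_new = [], []
    for u, l, r, n in zip(U, L, R_new, N_new):
        total = r + n
        # Assumed form of the reward-rate factor; 1.0 when no history exists.
        xi_r = r / total if total > 0 else 1.0
        U_new.append(xi * xi_r * u)
        L_new.append(xi * xi_r * l)
    return U_new, L_new, R_new, N_new

# With a perfect reward history, only the recall parameter applies.
U1, L1, R1, N1 = relax([20.0], [10.0], [5.0], [0.0], xi=0.9)
print(U1, L1)

# With a 75% reward history, the representation is discounted further.
U2, L2, R2, N2 = relax([20.0], [10.0], [3.0], [1.0], xi=0.9)
print(U2, L2)
```

The first call reproduces the worked example in which U_i = 20 and L_i = 10 relax to 18 and 9 under ξ = 0.9. Because U and L are scaled by the same factor, relaxation flattens each distribution without moving its mean.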

Subsequent updating depends on a vector of “expected values” V of each stimulus. These are not expected values in the econometric sense, but instead represent the best estimate of the position of each stimulus along the unit span:

$$V_i = \frac{U_i}{U_i + L_i} \tag{5}$$

In the event that (U_i = L_i = 0.0), V_i is set to 0.5. Subsequent updating depends on (1) whether the trial resulted in a reward or not, (2) whether each stimulus i was part of the set present during the trial or not, and (3) the relative values of V.

If the response is rewarded, then the algorithm consolidates its current estimates. This is done by increasing every U_i by an amount equal to V_i, whereas every L_i is increased by an amount equal to (1 − V_i):

$$\begin{aligned} U_i &\leftarrow U_i + V_i \\ L_i &\leftarrow L_i + (1 - V_i) \end{aligned} \tag{6}$$

This is done regardless of whether the stimulus was present on the current trial. If, on the other hand, the response was not rewarded, then L_ch is increased by one, as is U_nc. Then, for all other stimuli j not present during the trial, their values are updated as a function of their V_j relative to the stimuli presented:

$$\begin{cases} U_j \leftarrow U_j + 1 & \text{if } V_j > V_{ch} \text{ and } V_j > V_{nc} \\ L_j \leftarrow L_j + 1 & \text{if } V_j < V_{ch} \text{ and } V_j < V_{nc} \\ U_j \leftarrow U_j + V_j,\;\; L_j \leftarrow L_j + (1 - V_j) & \text{otherwise} \end{cases} \tag{7}$$

Thus, in the cases where the response was incorrect, the algorithm consolidates the representation of those stimuli falling between the pair, and pushes those lying outside the pair outward toward the margins.
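The explicit and implicit updating stages can be sketched as a single function. This is an illustration under stated assumptions rather than the published pseudocode: the names `position` and `update` are hypothetical, relaxation and the bookkeeping of R and N are omitted, and the unit increment applied to stimuli lying outside the trial pair is an assumption.

```python
def position(u, l):
    """Estimated position V_i on the unit span (0.5 when no evidence exists)."""
    total = u + l
    return u / total if total > 0 else 0.5

def update(U, L, chosen, not_chosen, rewarded):
    """Return updated copies of U and L after one two-stimulus trial."""
    U, L = list(U), list(L)
    V = [position(u, l) for u, l in zip(U, L)]
    if rewarded:
        # Consolidate every stimulus, whether or not it was presented.
        for i in range(len(U)):
            U[i] += V[i]
            L[i] += 1.0 - V[i]
        return U, L
    # Explicit feedback: push the chosen item down and the unchosen item up.
    L[chosen] += 1.0
    U[not_chosen] += 1.0
    low, high = sorted((V[chosen], V[not_chosen]))
    for j in range(len(U)):
        if j in (chosen, not_chosen):
            continue
        if V[j] > high:        # above the pair: shift upward (assumed step of 1)
            U[j] += 1.0
        elif V[j] < low:       # below the pair: shift downward (assumed step of 1)
            L[j] += 1.0
        else:                  # between the pair: consolidate in place
            U[j] += V[j]
            L[j] += 1.0 - V[j]
    return U, L

# A well-learned list A..G (indices 0..6); the pair CE is shown and E is chosen.
U0 = [14.0, 12.0, 10.0, 8.0, 6.0, 4.0, 2.0]
L0 = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
U1, L1 = update(U0, L0, chosen=4, not_chosen=2, rewarded=False)
print(U1)
print(L1)
```

In this example, D keeps its position of 0.5 while gaining evidence, A and B move upward, and F and G move downward, matching the qualitative pattern described for Fig 3.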

The entire process for updating is specified by the pseudocode entitled Algorithm 1 in S1 Text.

Although this procedure involves a number of logical comparisons, its adjustments are otherwise strictly arithmetic, and can be computed rapidly without recourse to bootstrapping.

Parameter estimation.

Although generating choices and updating memory can both be accomplished rapidly, computing the likelihood of an observed response is more computationally costly. Doing so requires computing the incomplete beta distribution:

$$I_x(U, L) = \frac{\Gamma(U + L)}{\Gamma(U)\,\Gamma(L)} \int_0^x t^{U - 1} (1 - t)^{L - 1}\, dt \tag{8}$$

In the case of a two-stimulus trial, the probability of stimulus A being chosen over stimulus B depends both on the noise parameter τ and on integrating over two convolved distributions [30]:

$$p(ch_t = A \mid AB) = \frac{\tau}{2} + (1 - \tau) \int_0^1 \mathrm{Beta}(x; U_A + 1, L_A + 1)\, I_x(U_B + 1, L_B + 1)\, dx \tag{9}$$

Given this formula, computing log-likelihoods associated with the parameters (τ, ξ) for a set of observed data can be performed in much the same manner as with the Q/softmax algorithm.
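The choice probability can be approximated numerically without special-function libraries. The sketch below assumes that the probability takes the form τ/2 + (1 − τ) · P(X_A > X_B), with each X drawn from Beta(U + 1, L + 1); the function names and the midpoint-rule integration are choices made for this illustration and are not the method used in the original analysis.

```python
import math

def beta_pdf(x, a, b):
    """Beta density at x in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1.0) * math.log(x) + (b - 1.0) * math.log(1.0 - x))

def prob_choose_a(U_a, L_a, U_b, L_b, tau, n=2000):
    """Approximate P(A chosen over B) by midpoint integration on n cells."""
    dx = 1.0 / n
    cdf_b = 0.0        # probability mass of B to the left of the current cell
    p_a_larger = 0.0   # running estimate of P(X_A > X_B)
    for k in range(n):
        x = (k + 0.5) * dx
        f_a = beta_pdf(x, U_a + 1.0, L_a + 1.0)
        f_b = beta_pdf(x, U_b + 1.0, L_b + 1.0)
        # CDF of B at the midpoint: mass to the left plus half of this cell.
        p_a_larger += f_a * (cdf_b + 0.5 * f_b * dx) * dx
        cdf_b += f_b * dx
    return tau / 2.0 + (1.0 - tau) * p_a_larger

def log_likelihood(p_a, chose_a):
    """Log-likelihood of a single observed choice."""
    return math.log(p_a if chose_a else 1.0 - p_a)

p_equal = prob_choose_a(3.0, 3.0, 3.0, 3.0, tau=0.0)     # identical distributions
p_apart = prob_choose_a(20.0, 2.0, 2.0, 20.0, tau=0.2)   # well-separated distributions
print(p_equal, p_apart)
print(log_likelihood(p_apart, chose_a=True))
```

Summing `log_likelihood` over the trials of a session, while replaying the updating policy between trials, would give the quantity to be maximized over ξ.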

Unfortunately, τ and ξ are not strictly orthogonal: Performance near chance can alternatively be explained by high values for τ or low values for ξ. To avoid unstable parameter estimates, values of τ were estimated heuristically, based on the observation that subjects reliably showed near-ceiling performance on the pairs AF, BG, and AG. Under the assumption that the integral above equals 1.0 for those pairs, a bit of arithmetic yields the following estimate:

$$\hat{\tau} = 2 \left( 1 - p(\text{correct} \mid AF, BG, AG) \right) \tag{10}$$

Having set this parameter, we then used the fminsearch() optimizer packaged with Matlab 2014b (The MathWorks, Inc.) to identify the maximum likelihood parameter estimate for ξ. Parameters were obtained for each session.
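The arithmetic behind the heuristic can be made explicit. If the probability of a correct response is τ/2 + (1 − τ) · I, and the integral I is taken to be 1.0 for the pairs AF, BG, and AG, then p(correct) = 1 − τ/2, so τ = 2(1 − p(correct)). The sketch below applies that relation; the function name `estimate_tau` and the clamping of the result to the interval from 0 to 1 are assumptions added for illustration.

```python
def estimate_tau(n_correct, n_trials):
    """Heuristic noise estimate from accuracy on pairs assumed to be at ceiling."""
    p_correct = n_correct / n_trials
    tau = 2.0 * (1.0 - p_correct)
    return min(max(tau, 0.0), 1.0)   # assumed clamp to the valid range of tau

# Hypothetical session: 95 of 100 trials correct on the pairs AF, BG, and AG.
tau_hat = estimate_tau(95, 100)
print(tau_hat)
```

With τ fixed in this way, only ξ remains to be fit by maximum likelihood, which avoids the trade-off between the two parameters.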