Breaking the bonds of reinforcement: Effects of trial outcome, rule consistency and rule complexity against exploitable and unexploitable opponents

Jukka Sundvall; Benjamin James Dyson

doi:10.1371/journal.pone.0262249

Abstract

In two experiments, we used the simple zero-sum game Rock, Paper and Scissors to study the common reinforcement-based rules of repeating choices after winning (win-stay) and shifting from previous choice options after losing (lose-shift). Participants played the game against both computer opponents who could not be exploited and computer opponents who could be exploited by making choices that would at times conflict with reinforcement. Against unexploitable opponents, participants achieved an approximation of random behavior, contrary to previous research commonly finding reinforcement biases. Against exploitable opponents, the participants learned to exploit the opponent regardless of whether optimal choices conflicted with reinforcement or not. The data suggest that learning a rule that allows one to exploit was largely determined by the outcome of the previous trial.

Citation: Sundvall J, Dyson BJ (2022) Breaking the bonds of reinforcement: Effects of trial outcome, rule consistency and rule complexity against exploitable and unexploitable opponents. PLoS ONE 17(2): e0262249. https://doi.org/10.1371/journal.pone.0262249

Editor: Emmanouil Konstantinidis, University of Warwick, UNITED KINGDOM

Received: August 11, 2020; Accepted: December 21, 2021; Published: February 2, 2022

Copyright: © 2022 Sundvall, Dyson. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper and its Supporting Information files.

Funding: JS and BJD received funding from the School of Psychology, University of Sussex for this work (this work was conducted as part of JS's PhD thesis at the University of Sussex). JS also received a personal grant for his PhD from Osk. Huttunen Foundation (grant number 194). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

When organisms compete for mutually exclusive outcomes, success requires the minimization of losses and the maximization of gains [1]. To achieve these complementary goals, organisms must avoid exploitation (loss minimization) but also be able to exploit their opponents (gain maximization). Sequentially played zero-sum games such as Rock, Paper, Scissors (RPS) or Matching Pennies (MP) are examples of controlled competitive spaces in which the relative success of these goals can be clearly assessed. Within these types of games, the only way to guarantee loss minimization is to behave according to a mixed strategy (MS) [2,3]. Under a mixed strategy, no single response option is played more often than another, and the events of the previous round do not influence response selection on the next round. In sum, MS amounts to playing each round randomly. However, the extent to which humans are capable of sequential random decisions has been the source of some debate.

While several early studies seemed to indicate that people have trouble both recognizing and producing randomness (see [4,5] for reviews), critics also identified a number of problems associated with these approaches (see [3,6]). For example, randomness production tasks were often quite artificial, often contained instructional biases (nudging participants towards specific errors in randomness production), and suffered from a lack of incentive and feedback (e.g., no indication to the participant that they are being “sufficiently random”). As a remedy, Rapoport and Budescu suggested two-player zero-sum games would side-step these issues [3]. In a zero-sum game, randomness can become the implicit goal of the game, rather than an explicit task for the participants to “be random.” Zero-sum games are also easy to incentivize and allow for regular feedback (in the form of gains and losses, unlike in many pure randomness production tasks). Rapoport and Budescu found that participants playing a binary-choice zero-sum game against each other were noticeably better at approximating randomness than participants who were simply asked to produce random sequences of game choices without an opponent [3].

Although zero-sum games increase the likelihood of expressing randomness, there remain a number of common predictabilities in behavior based on reinforcement [7]. Actions followed by positive outcomes are more likely to be repeated (win-stay) whereas actions followed by negative outcomes are less likely to be repeated (lose-shift). That is, game feedback (wins and losses in a zero-sum game) may remove some forms of deviations from randomness, but may give rise to other forms. These associations between outcomes and future behavior are evidenced in both human [8] and animal work [2,9].

We will refer to tendencies to repeat decisions that yielded rewards and to switch away from decisions that did not yield rewards as “reinforcement biases” throughout the paper. We use this term in a similar manner to, e.g., [10,11]. In this tradition of using the term, the word “reinforcement” refers to the specific effect of the immediately previous trial and its outcome on the current decision in a series of decisions. Thus, the term “reinforcement bias” does not cover biases in forms of reinforcement learning where information from more than one previous trial affects decisions. Note that by “bias”, we do not necessarily mean that these decisions are “faulty” or “irrational”. A bias may in fact be rational due to, e.g., evolutionary reasons [12]. Moreover, biases in environments that do not reward or punish predictability or unpredictability (e.g., playing against a randomly playing opponent who does not try to exploit the player) do not matter in terms of outcomes. Thus, any way of playing in such an environment could be called “rational”. Our intention in this article is not to make claims about the rationality, or lack thereof, of our participants. Rather, people’s decisions in differing environments may be more or less optimal if an optimal strategy exists: there can be differences in learning. In environments where any strategy leads to the same result, there are no optimal strategies, but we can still examine whether people deviate from randomness in such environments that do not incentivize playing non-randomly.

Reinforcement biases seem quite robust, even in the face of negative feedback. Both Scheibehenne et al. [12] and Wilke et al. [13] found that win-stay / lose-shift behavior was common not only in situations where rewards were random, but also when win-stay / lose-shift behavior led to a decrease in reward in a simulated slot machine game. Similarly, Achtziger et al. [10] found that in a Bayesian belief updating task, where the reversed win-shift / lose-stay strategy was the optimal approach, suboptimal win-stay / lose-shift behavior persisted. Thus, it seems that people may adopt a strategy based on reinforcement in the narrow sense (the result of the round immediately before a given round) even when it leads to more frequent losses, with these losses not necessarily being enough to lead to players adopting another strategy.

Although lose-shift behavior might seem a simple mirror image of win-stay behavior, these mechanisms are under different degrees of control [14–17]. A common finding using three-response zero-sum games (such as Rock, Paper, Scissors) is that lose-shift behavior is more frequent than win-stay behavior [18,19]. In the following experiments, we explored the tension between the expression of random behavior and behavior guided by reinforcement.

Experiment 1

In Experiment 1, we explored the joint ability of participants to play randomly when there is no winning strategy, and to play against reinforcement by designing two types of opponent. Our exploitable opponent was designed with a bias towards shifting behavior across consecutive trials. In the context of Rock, Paper, Scissors, there are three responses and hence two forms of shift available. We chose opponent downgrading [20] for Experiment 1, whereby the opponent would shift to the response that would have lost to its previous selection (e.g., moving from Rock to Scissors). This opponent bias then allows us to clearly define the optimal outcome-response associations for the participant (see Table 1). In order to maximize wins against this exploitable opponent, participants should win-downgrade (after winning, shift to the response that would have been beaten by your previous response), lose-upgrade (after losing, shift to the response that would have beaten your previous response), and draw-stay (after drawing, repeat your previous response).

Table 1. Optimal strategy for trial n+1 as a function of outcome at trial n against exploitable opponents in Experiments 1 and 2.

https://doi.org/10.1371/journal.pone.0262249.t001
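
The contingencies in Table 1 follow mechanically from the opponent's shift rule. The following is a minimal sketch in Python (the authors report using R, so this is illustrative only, and all function names are hypothetical). It enumerates every pairing of participant and opponent choices, assumes the opponent always applies its rule on the next trial, and records which participant shift would beat that next move.

```python
# Illustrative sketch only; names are hypothetical and not taken from the paper.
BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}  # winner -> loser
BEATEN_BY = {loser: winner for winner, loser in BEATS.items()}      # loser -> winner


def downgrade(item):
    """Return the item that would have lost to `item`."""
    return BEATS[item]


def upgrade(item):
    """Return the item that would have beaten `item`."""
    return BEATEN_BY[item]


def outcome(player, opponent):
    """Outcome of one trial from the player's point of view."""
    if player == opponent:
        return "draw"
    return "win" if BEATS[player] == opponent else "lose"


def shift_type(previous, current):
    """Classify the move from `previous` to `current` as stay/downgrade/upgrade."""
    if current == previous:
        return "stay"
    return "downgrade" if current == downgrade(previous) else "upgrade"


def optimal_rules(opponent_rule):
    """Map each outcome at trial n to the player's best shift at trial n+1,
    assuming the opponent applies `opponent_rule` deterministically."""
    rules = {}
    for player in BEATS:
        for opponent in BEATS:
            best_reply = upgrade(opponent_rule(opponent))  # beats the opponent's next move
            result = outcome(player, opponent)
            rules.setdefault(result, set()).add(shift_type(player, best_reply))
    # Keep only outcomes for which a single shift type is optimal in every case.
    return {result: shifts.pop() for result, shifts in rules.items() if len(shifts) == 1}
```

Under these assumptions, `optimal_rules(downgrade)` yields win-downgrade, lose-upgrade and draw-stay, which is the set of contingencies described for the exploitable opponent in Experiment 1.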

We can then assess whether the requirement of optimal responding is in alignment or out of alignment with the standard reinforcement learning principles of win-stay and lose-shift. Specifically, win-downgrade and draw-stay behaviors are misaligned with these principles, as participants must change a response following positive outcomes (contra win-stay) and maintain a response following negative outcomes (contra draw-shift). Only lose-upgrade is consistent with changing a response following a negative outcome (lose-shift). Additionally, none of the optimal responses are aligned with myopic best reply (i.e., the assumption that the opponent will repeat their last move; see [11]), while the draw-stay response is aligned with inertia (repetition bias; see [11]). On the basis that it is difficult to express behaviors other than win-stay and lose-shift even when such behavior works against the maximization of wins [10,12,13], we predicted that both the proportion of win-downgrade and draw-stay behavior should be lower than lose-upgrade behavior, as a result of the misalignment of win-downgrade and draw-stay (and alignment of lose-upgrade) with reinforcement.

Method

Participants.

40 individuals (31 female) from the University of Sussex community participated in the study; mean age was 21.13 years (SD = 4.37) and 39 were right-handed. Sample sizes were based on previous studies from the lab showing reliable lose-shift biases within zero-sum game contexts (e.g., [18], N = 31; [19], Ns = 36; [21], Ns = 40). Participants received course credit or £10 (their choice, unless course enrollment required them to take the credit) for their participation. Informed consent was obtained from all participants before testing, and the experiment was approved by the Sciences Technology Research Ethics Committee (C-REC) at the University of Sussex (ER/JS753/1).

Materials.

Static pictures of separate white-gloved and blue-gloved hands signaling Rock, Paper and Scissors poses (from [22]) were displayed center screen at approximately 6° x 6°, with participants sat approximately 57 cm away from a 22” Diamond Plus CRT monitor (Mitsubishi, Tokyo, Japan). Stimulus presentation was controlled by Presentation 19 (build 03.31.15) and responses were recorded using a keyboard.

Design.

Experiment 1 had a 2x2 within-participants design with the factors of opponent (unexploitable, exploitable) and value (low, high; see S1 File). Each participant completed a block of 90 game trials in each of the four conditions (360 trials in total) in a semi-counterbalanced order across participants. The only constraint imposed on the counterbalancing orders was that no two consecutive blocks were allowed to be against the same opponent type; this was to avoid potential ceiling effects against exploitable opponents.
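
The counterbalancing constraint can be made concrete by enumerating the admissible block orders. This is a minimal Python sketch; the block labels are hypothetical, and the text does not say how the admissible orders were assigned to participants, so only the constraint itself is modeled.

```python
from itertools import permutations

# Hypothetical labels for the four blocks: (opponent type, value).
BLOCKS = [
    ("unexploitable", "low"),
    ("unexploitable", "high"),
    ("exploitable", "low"),
    ("exploitable", "high"),
]


def valid_orders(blocks=BLOCKS):
    """Return all block orders in which no two consecutive blocks
    share the same opponent type."""
    orders = []
    for order in permutations(blocks):
        if all(first[0] != second[0] for first, second in zip(order, order[1:])):
            orders.append(order)
    return orders
```

Of the 24 possible orders of four blocks, 8 satisfy the constraint, because the opponent type must alternate from block to block.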

In the unexploitable condition, the opponent drew randomly without replacement from an equal distribution of responses (30 instances of Rock, Paper and Scissors each). Note that this is a slight deviation from true MS, where the draws would be with replacement (this deviation was made to eliminate the possibility of item bias). In the exploitable condition, the computer followed a downgrade rule for 70% of the time (63 trials) where the computer’s next response was the one that would have been beaten by the computer’s previous response. For the rest of the time (30%; 27 trials), the computer drew randomly, without replacement, from an equal distribution. This led to the following optimal outcome-strategy contingencies for the participant in the exploitable conditions: win-downgrade, lose-upgrade, draw-stay (see Table 1).
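
The two opponent types can be sketched as sequence generators. This Python sketch is illustrative only and is not the authors' Presentation script. The text does not state how rule-following and random trials were interleaved, nor how the first trial (which has no predecessor) was chosen. Here the first trial is assumed to be random and the remaining rule trials are assumed to be placed at random positions.

```python
import random

BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}  # winner -> loser
ITEMS = list(BEATS)


def unexploitable_block(rng, n_per_item=30):
    """Draw without replacement from 30 instances of each item (90 trials)."""
    pool = ITEMS * n_per_item
    rng.shuffle(pool)
    return pool


def exploitable_block(rng, n_trials=90, n_rule=63):
    """Downgrade on `n_rule` trials; otherwise draw without replacement
    from an equal pool of the remaining trials (27 here, 9 per item)."""
    n_random = n_trials - n_rule
    pool = ITEMS * (n_random // len(ITEMS))
    rng.shuffle(pool)
    # Assumption: the first trial is random, the rest are in shuffled order.
    flags = [True] * n_rule + [False] * (n_random - 1)
    rng.shuffle(flags)
    flags = [False] + flags
    moves = []
    for follows_rule in flags:
        if follows_rule:
            moves.append(BEATS[moves[-1]])  # item beaten by the previous move
        else:
            moves.append(pool.pop())
    return moves, flags
```

Calling `exploitable_block(random.Random(1))` returns the 90 opponent moves together with a flag per trial marking whether that move followed the downgrade rule.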

Procedure.

At the beginning of each block, the experimental program informed participants how much outcomes were worth. Regardless of the opponent condition, participants were informed that their opponent would play in a certain way that would be revealed to them at the end of the experiment. Participants were instructed to try and win as many rounds as possible.

For each trial, the participant was first presented with three pictures of a hand in a white glove representing the three possible choices, presented in the same order as the response buttons used. Opponent and player scores were displayed at the bottom of the screen. Upon pressing a response button, the choices made by the opponent (hand in a blue glove on the left) and the participant (hand in a white glove on the right) were displayed for 1000ms. This was replaced by an outcome screen for 1000ms informing the participant if they had won, lost, or drawn the trial. After a 500ms pause, the scoreboard and trial counter were updated, and after another 500ms pause, the next trial began.

For every 9th trial in the block, after the participant had made their choice and before presenting the results, there was a 500ms pause and the program asked the participant to state their confidence of a win or a loss on a 5-point scale. The scale ran from 1 for “extremely confident of win” through 3 for “unsure either way” to 5 for “extremely confident of loss”. These items were reverse coded in the final analyses (see S1 File). After another 500ms pause, the trial continued with the outcome reveal. At the end of each block, three short questionnaires were completed (see S1 File) and participants wrote down a short description of how they thought the opponent played. After the final block, a self-report personality inventory was completed, after which participants were thanked for their time and debriefed.

Results

Item selection and outcome at trial n.

We analyzed proportions of item selection at trial n for each block with a three-way repeated measures ANOVA with opponent (unexploitable, exploitable), value (low, high) and item choice (rock, paper, scissors) entered as factors (see Table 2). This and all subsequent analyses were conducted using R 4.0.2 [23]. For repeated measures ANOVAs, we used the anova_test function from the rstatix package [24]. In this and all other cases of ANOVAs, degrees of freedom were corrected using Greenhouse-Geisser estimates when Mauchly’s test indicated violations of sphericity. There was no main effect of item choice [F(1.61, 62.90) = 1.27, MSE = .02, p = .282, η_p² = .03] nor an interaction between item choice and value [F(2, 78) = 0.62, MSE < .01, p = .538, η_p² = .01] or item choice and opponent [F(2, 78) = 1.04, MSE < .01, p = .358, η_p² = .02]. There was no three-way interaction [F(1.58, 61.69) = 0.64, MSE < .01, p = .494, η_p² = .01]. Thus, there was no overall item bias, nor any item biases as a function of experimental condition.

Table 2. Item and outcome proportions as a function of opponent and value in Experiments 1 and 2.

https://doi.org/10.1371/journal.pone.0262249.t002

We also conducted the analysis using a linear mixed model, using the lmerTest package [25], with opponent, value and item choice as fixed effects, and with random intercepts for participants on average and for participants within each main effect and two-way interaction effect. For this analysis, we removed extreme outlier data points, which the ANOVA approach did not allow for (requiring the removal of all of a participant’s data due to one outlier value). We used the rstatix package [24] and its identify_outliers function. Extreme outliers were defined as values lying more than three times the cell-level interquartile range (Q3 − Q1) above Q3 or below Q1. We detected eight extreme outlier values and removed them from the linear mixed effects analysis. The results differed from the results of the ANOVA only in the main effect of item, which was now significant [F(2, 111.42) = 1.27, p = .001, η_p² = .04]. Rock (EMM = 35.3%, 95% CI = [33.8%, 36.7%]) was chosen significantly more often than paper (EMM = 32.2%, 95% CI = [30.8%, 33.7%]) or scissors (EMM = 31.6%, 95% CI = [30.1%, 33.1%]). Thus, the numerical trend in the data (see Table 2) was significant in this analysis. Conducting the same model without exclusions but with a robust or Bayesian linear mixed effects approach, using the robustlmm and blme packages [26,27], respectively, yielded similar results. Thus, the outlier values seem to have masked a small item bias effect.
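
The extreme-outlier criterion described above can be written as a pair of fences, Q1 − 3 × IQR and Q3 + 3 × IQR. The Python sketch below is illustrative only: the authors used the identify_outliers function in R, and the quartile interpolation method chosen here (`method="inclusive"`) is an assumption that may not match the original computation in every case.

```python
from statistics import quantiles


def extreme_outliers(values, k=3.0):
    """Return the values lying more than k * IQR below Q1 or above Q3.

    Assumption: quartiles use the 'inclusive' interpolation of the
    standard library; other quantile definitions can shift the fences.
    """
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]
```

In the analysis described above, a rule of this kind would be applied separately within each design cell, so that a value is judged only against other values from the same condition.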

We conducted an identical series of analyses for rates of different outcome types (win, lose, draw). In the three-way repeated measures ANOVA, there was a significant main effect for outcome type [F(1.29, 50.54) = 46.86, MSE = .02, p < .001, η_p² = .54], and a significant interaction between outcome type and opponent [F(1.17, 45.71) = 36.09, MSE = .03, p < .001, η_p² = .48]. There were no significant differences between wins and other outcome types in the unexploitable condition (33.4%, 34.1% and 32.4%, respectively; Tukey’s HSD; p > .05 for both comparisons). In contrast, in the exploitable condition, wins were more frequent than losses or draws (49.4%, 23.6% and 27.0%, respectively; Tukey’s HSD; p < .05 for both comparisons), confirming that as a group, participants acquired some knowledge of the correct strategy. The main effect of value [F(2, 78) = .24, MSE < .01, p = .785, η_p² < .01], and the three-way interaction [F(1.70, 66.48) = .30, MSE < .01, p = .705, η_p² < .01] were not significant.

We also conducted the outcome analysis using regular (excluding extreme outliers), robust (without exclusions), and Bayesian (without exclusions) linear mixed effects approaches, as we did for item choice data. We detected one extreme outlier. There were no meaningful differences between the results of the different linear mixed effects models, and the results of these models did not differ from the results of the ANOVA. There were also no differences between the results of pairwise comparisons using the full data or the data that excluded extreme outliers.

First-order repetition effects.

We conducted a similar series of analyses for first-order repetition effects (i.e., player strategy) as we did for item choice and outcome. We first analyzed proportion data using the last 89 trials in each block (the first trial having no history) with a four-way repeated-measures ANOVA, with opponent (unexploitable, exploitable), value (low, high), outcome at trial n (win, lose, draw) and player strategy at trial n+1 (stay, upgrade, downgrade) as factors (see Fig 1). The data show no particular pattern of responding in the unexploitable condition. In the exploitable condition, optimal responses were most likely following wins, less likely after draws, and least likely after losses.

Fig 1. Proportion response data from Experiment 1 in terms of the relationships between the outcome at trial n (win, lose, draw) and the strategy selected at trial n+1 (stay, downgrade, upgrade).

Participants competed against four opponents defined by the factors of value (low, high) and exploitability (unexploitable, exploitable). Error bars represent standard error.

https://doi.org/10.1371/journal.pone.0262249.g001
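
The first-order repetition measure plotted in Fig 1 can be sketched as follows. This is an illustrative Python sketch with hypothetical names, not the authors' analysis code. For each trial n that has a successor, it pairs the outcome at trial n with the participant's shift at trial n+1 and converts the counts to proportions within each outcome.

```python
from collections import Counter

BEATS = {"rock": "scissors", "scissors": "paper", "paper": "rock"}  # winner -> loser


def outcome(player, opponent):
    """Outcome of one trial from the player's point of view."""
    if player == opponent:
        return "draw"
    return "win" if BEATS[player] == opponent else "lose"


def shift_type(previous, current):
    """Classify the player's move relative to their previous move."""
    if current == previous:
        return "stay"
    return "downgrade" if current == BEATS[previous] else "upgrade"


def transition_proportions(player_moves, opponent_moves):
    """Proportion of stay/upgrade/downgrade at trial n+1 for each outcome at trial n.

    Outcomes that never occur are omitted from the result.
    """
    counts = {}
    for n in range(len(player_moves) - 1):
        result = outcome(player_moves[n], opponent_moves[n])
        shift = shift_type(player_moves[n], player_moves[n + 1])
        counts.setdefault(result, Counter())[shift] += 1
    return {
        result: {s: c[s] / sum(c.values()) for s in ("stay", "upgrade", "downgrade")}
        for result, c in counts.items()
    }
```

A 90-trial block yields 89 such outcome-shift pairs, which corresponds to dropping the first trial because it has no history.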

The main effect of strategy at trial n+1 was significant [F(1.52, 59.24) = 7.73, MSE = .17, p = .003, η_p² = .17]. There was also a significant interaction effect between opponent and player strategy at trial n+1 [F(2, 78) = 6.77, MSE = .03, p = .002, η_p² = .15], as well as between outcome at trial n and strategy at trial n+1 [F(1.93, 75.09) = 33.98, MSE = .13, p < .001, η_p² = .47]. There was a significant three-way interaction between opponent, outcome at trial n and strategy at trial n+1 [F(2.13, 83.23) = 31.12, MSE = .08, p < .001, η_p² = .44]. The interactions between value and player strategy at trial n+1 [F(2, 78) = 0.53, MSE = .04, p = .589, η_p² = .01], value, opponent and player strategy at trial n+1 [F(1.74, 68.03) = 0.45, MSE = .03, p = .611, η_p² = .01] and between all four factors [F(2.95, 114.87) = 0.20, MSE = .03, p = .89, η_p² = .01] were all non-significant, suggesting no effect of the value manipulation on behavior.

Against unexploitable opponents, there were no significant differences between the three different strategies at trial n+1 as a function of outcome at trial n (Tukey’s HSD, p > .05 for all comparisons; see top panel of Fig 1). In these regards, the data represent a fair approximation of random behavior at the group level. As an additional test of random behavior, we carried out two-tailed binomial tests for stay, upgrade and downgrade responses after wins, losses and draws for each individual, under the null hypothesis of 33.3% for each test. We classified an individual as having a bias if two conditions were fulfilled. First, the null hypothesis was rejected for a given decision type following a given outcome. Second, the rate of that type of decision following that outcome was higher than 33.3% (a rate significantly lower than 33.3% would imply a bias toward some other decision type). Moreover, we aggregated upgrade and downgrade biases into a general shift bias: a participant who had either an upgrade or downgrade bias or both after a given outcome type was classified as having a shift bias. There were 9 individuals in the low value condition and 11 in the high value condition for whom the null hypothesis could not be rejected (p > .05) for any decision type following any outcome type. That is, these participants had no biases in any direction as a function of reinforcement. In general, after any outcome type, the most common pattern of responding was one with no bias either toward or away from reinforcement (see Table 3).

Table 3. Numbers of participants with specific biases or no biases in the unexploitable blocks of Experiments 1 (N = 40) and 2 (N = 40).

https://doi.org/10.1371/journal.pone.0262249.t003
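
The individual-level classification summarized in Table 3 can be sketched as below. This Python sketch is illustrative only, and its function names are hypothetical. The two-tailed p-value sums the probabilities of all counts that are no more likely than the observed one, a common convention for exact binomial tests. The text does not specify which routine was used, so this choice is an assumption.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_binom_p(k, n, p=1 / 3):
    """Exact two-tailed p-value: total probability of outcomes
    no more likely than the observed count (assumed convention)."""
    observed = binom_pmf(k, n, p)
    total = sum(
        pm for pm in (binom_pmf(i, n, p) for i in range(n + 1))
        if pm <= observed * (1 + 1e-7)  # small tolerance for float ties
    )
    return min(1.0, total)


def classify_biases(counts, p0=1 / 3, alpha=0.05):
    """Classify one participant's responses after one outcome type.

    `counts` maps 'stay', 'upgrade' and 'downgrade' to response counts.
    A decision type counts as a bias only if the test rejects the null
    and the observed rate is above p0; upgrade and downgrade biases are
    merged into a single 'shift' label.
    """
    n = sum(counts.values())
    biased = {
        decision for decision, k in counts.items()
        if k / n > p0 and two_sided_binom_p(k, n, p0) < alpha
    }
    labels = set()
    if "stay" in biased:
        labels.add("stay")
    if biased & {"upgrade", "downgrade"}:
        labels.add("shift")
    return labels
```

An empty set returned for every outcome type would correspond to a participant with no detectable bias in any direction.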

Against exploitable opponents, the optimal downgrade response following wins was more likely than the other two responses, as was the optimal stay response following draws (Tukey’s HSD, p < .05 for each comparison; see bottom panel of Fig 1). However, the proportion of the optimal upgrade responses following losses did not differ significantly from the other two responses (Tukey’s HSD, p > .05 for both comparisons). The proportion of win-downgrade responses was not significantly different from draw-stay responses (Tukey’s HSD, p > .05), but lose-upgrade responses were less frequent than win-downgrade and draw-stay responses (Tukey’s HSD, p < .05 for both comparisons). Taken together, the results suggest that the participants’ ability to maintain optimal strategic responding against an exploitable opponent was compromised following the experience of losing combined with the initiation of a response aligned with reinforcement, relative to the experience of drawing or winning combined with the initiation of a response misaligned with reinforcement.

We further explored the exploitable condition data by categorizing participants’ win-rates as successful or unsuccessful based on a one-tailed one-sample proportions test, with 33.3% wins as the null hypothesis, run separately for each participant in each block. Of the 40 participants, 8 failed to reach a win percentage significantly higher than chance on both blocks, and 15 other participants failed on one block out of two (8 in the low value and 7 in the high value block). For the remaining 17 successful participants, the percentages of optimal responding across the three outcomes were similar to those of the entire sample (see top panel of Fig 2). In contrast, the unsuccessful participants did not show strategic learning following any outcome, and instead behaved similarly to an overall MS trend in both conditions (see bottom panel of Fig 2). Therefore, the observed differences in optimal behavior after different outcomes were not driven by the unsuccessful participants.

Fig 2. Proportion response data from unsuccessful and successful participants in the exploitable condition of Experiment 1 in terms of the relationship between outcome at trial n (win, lose, draw) and the strategy selected at trial n+1 (stay, downgrade, upgrade).

Error bars represent standard error.

https://doi.org/10.1371/journal.pone.0262249.g002
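
The success criterion used to split participants can be sketched as a one-tailed test of the block win rate against chance. This Python sketch is illustrative only. The text does not state whether a continuity correction or an exact test was applied, so a plain normal approximation without correction is assumed here, and the function names are hypothetical.

```python
from math import erfc, sqrt


def one_tailed_proportion_p(wins, n_trials, p0=1 / 3):
    """Upper-tail p-value for H0: win rate = p0, using a normal
    approximation without continuity correction (an assumption)."""
    z = (wins / n_trials - p0) / sqrt(p0 * (1 - p0) / n_trials)
    return 0.5 * erfc(z / sqrt(2))  # P(Z >= z) for a standard normal


def is_successful(wins, n_trials=90, alpha=0.05):
    """True if the block win rate is significantly above chance."""
    return one_tailed_proportion_p(wins, n_trials) < alpha
```

Under this approximation, a 90-trial block would need roughly 38 or more wins to count as successful. A different test variant could move that threshold slightly.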

We also conducted the first-order repetition effect analysis using regular (with exclusions), robust (without exclusions), and Bayesian (without exclusions) linear mixed effects approaches. In these models, we included the factors in the ANOVA as fixed effects, with random intercepts for participants on average and for participants within each main effect and each two-way and three-way interaction effect. We detected four extreme outliers in the data. There were no meaningful differences between the results of the different linear mixed effects models, and the results of these models did not differ from the results of the ANOVA. There were also no differences between the results of pairwise comparisons using the full data or the data that excluded extreme outliers.

Discussion

In Experiment 1, behavior against unexploitable opponents was notable in achieving a rough approximation of random behavior in relation to game outcomes at a group level. At the individual level, the predicted pattern of win-stay, lose-shift and draw-shift biases seemed to be less frequent than non-biased responding. This is quite different from previous studies where participants reliably exhibited a tendency to shift behavior following negative outcomes on the group level during RPS [18,22] or other types of games [12,13]. We did replicate a previously observed, small item bias in favor of choosing rock [8,18,28]. Thus, the results do not fully align with the presumption that humans should find it difficult to produce random behavior [29,30], but we must note that the results are based on measures of very specific kinds of deviations from randomness (item biases and reinforcement biases).

According to Rapoport and Budescu, MS behavior is more likely within the context of game environments where randomness is an optimal, but not explicitly stated, requirement of the task [3]. However, in Rapoport and Budescu’s study, randomness was achieved when a pair of human players played against each other. Human players likely attempt to exploit each other in a dynamic fashion, such that there is no stable strategy that leads to above-chance performance (see [31] for an exploration of a model of how humans play). This is at odds with the use of a computer opponent that fails to produce the same kind of “dynamic coupling” (see [31]) with a human player: the program in Experiment 1 paid no attention to how participants played nor did it attempt to exploit them.

With respect to performance against exploitable opponents, we examined differences in adopting strategies based on whether the strategies aligned with win-stay / lose-shift rules [10,13]. In Experiment 1, both win and draw trials required participants’ choices to go against reinforcement in order to be optimal (see Table 1). However, we observed that performance after both wins and draws was better than performance after losses. This suggests that even when the optimal strategy was in alignment with reinforcement for that kind of trial (i.e., lose-shift), unambiguously negative outcomes outweighed this potential advantage. Note that since participants on average experienced similar rates of both draws and losses (see Table 2), the participants’ behavior after losses cannot be explained simply by the low frequency of losses. The possibility remains that poorer performance following losses might be a result of the complexity of the strategy required. In the context of RPS, shifting requires the additional step of selecting one of the two responses that differ from the previous trial. This is in contrast to staying, which involves the repetition of only a single option. Staying also aligns with decision inertia, i.e., a bias towards simply repeating choices [11]. We addressed this concern in Experiment 2.

Experiment 2

In Experiment 1, we found that against exploitable opponents with clearly defined counterstrategies, losing led to suboptimal decisions relative to either drawing or winning, even when the specific strategy was consistent with the default lose-shift rule. It is further notable that the rate of optimal decisions made after wins was not significantly different from the rate of optimal decisions after draws, even though the proportion of wins (49.4%) exceeded both losses (23.6%) and draws (27.0%). Participants thus had more opportunity to learn the correct exploitable choice following wins relative to draws, but this did not yield higher optimization following wins. Furthermore, participants also had roughly equal opportunity to learn the pattern of outcome-response contingencies following both draws and losses. However, given the difference between the rates of optimization following draws and losses, it would appear losses impact decision-making in ways that draws do not (similar to [32,33]). A problem with this interpretation, though, is that the nature of the present outcome might interact with the complexity of the future action required by it. Specifically, shifting in a three-response game such as RPS requires at least one more processing step relative to staying: not only does one have to decide to switch, one has to decide which response to switch to. Therefore, lose-upgrade might be more difficult than draw-stay because of the requirement to upgrade, and not the actual loss.

To test this idea, we reconfigured the outcome-response pairings in Experiment 2 such that the optimal choice after losses was to stay. The other optimal responses were win-upgrade and draw-downgrade (see Table 1). Thus, the optimal strategy after a draw was now more complex but in line with reinforcement, whereas the optimal strategy after a loss was simpler but misaligned with reinforcement and aligned with decision inertia. Again, none of the optimal responses aligned with myopic best reply. If the reduction in optimal performance following losses in Experiment 1 was due to stay responses being easier than specific shift responses, then the difference in rates of optimal responding after losses and draws should reverse in Experiment 2.