Melioration Learning in Two-Person Games

Melioration learning is an empirically well-grounded model of reinforcement learning. By means of computer simulations, this paper derives predictions for several repeatedly played two-person games from this model. The results indicate a likely convergence to a pure Nash equilibrium of the game. If no pure equilibrium exists, the relative frequencies of choice may approach the predictions of the mixed Nash equilibrium. Yet in some games, no stable state is reached.


Introduction
Various learning models have been analysed in the game-theoretic literature. The best known ones, such as fictitious play or Bayesian learning, describe normative processes that enable the players to find an equilibrium during the repeated play of a game [1]. Those models presume that information about the preferences and past actions of all players is available. More recently, researchers have evaluated whether equilibria can be reached without knowing the preferences of other players [2] or even without considering the other players' presence [3]. The latter condition was called radically or completely uncoupled learning.
In completely uncoupled learning, a player's strategy is based only on his own previous actions and rewards. Some dynamics still ensure the convergence to Nash ε-equilibria or pure Nash equilibria [4]. More specifically, regret-testing [3,5] and interactive trial-and-error (ITE) learning [6] are two examples of completely uncoupled learning that imply this convergence.
Under the name of reinforcement learning, further completely uncoupled dynamics have been analysed in different fields. In economics, for instance, one of these models stems from Roth and Erev [7]. In computer science, multiple studies in artificial intelligence deal with algorithms of reinforcement learning, e.g. Q-learning or SARSA [8]. Some psychological models, too, are based entirely on an actor's own experiences [9] and are, hence, completely uncoupled.
In contrast to regret-testing or ITE learning, most models of reinforcement learning are not guaranteed to converge to an equilibrium in interactive situations. Instead of being designed to ensure this convergence, they constitute simple and realistic representations of human learning. Psychological models in particular have been built to represent the development of human behaviour as realistically as possible while keeping it analytically tractable, e.g. [10]. This paper employs a simple psychological model of completely uncoupled learning, called melioration learning, which may not converge towards equilibrium states. The next section describes the underlying theory of decision-making and its implementation as an instance of the Q-learning algorithm. Afterwards, the model is applied to various two-person games. A connection to the previous literature is established by comparing its predictions to the outcomes of the Roth-Erev model [7].
Generally speaking, melioration learning states that behaviour is strengthened by highly valued events that are perceived as consequences of this behaviour. In the original literature, this process was phrased as "behaviour shifts toward higher local rates of reinforcement" (p. 75, [12]). The local reinforcement rate was defined as "the reinforcement actually obtained from an alternative [.] divided by the time allocated to it" (p. 76, [12]).
Elsewhere, Vaughan and Herrnstein [26] described the process of melioration more formally by a differential equation. Let there be a two-element choice set {1, 2}. Given a point in time t ∈ (0, ∞), p_i(t) ∈ [0, 1] denotes the relative frequency of having chosen alternative i ∈ {1, 2}. The authors stated that the frequency p_1(t) changes over time in accordance with

    dp_1(t)/dt = f(v̄_1(t) − v̄_2(t))    (1)

In Eq (1), f : ℝ → ℝ is a differentiable and strictly monotonically increasing function with f(0) = 0. The term v̄_i(t) (i ∈ {1, 2}) stands for the local reinforcement rate of alternative i at time t.
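To make Eq (1) concrete, the melioration dynamic can be integrated numerically. The following sketch assumes f = tanh (which is differentiable, strictly increasing, and satisfies f(0) = 0) and constant local reinforcement rates; both choices are illustrative, not taken from the original model.

```python
import math

def meliorate(p1, v1, v2, dt=0.01, f=math.tanh):
    """One Euler step of Eq (1): dp1/dt = f(v1 - v2), clipped to [0, 1]."""
    p1 += f(v1 - v2) * dt
    return min(max(p1, 0.0), 1.0)

# If alternative 1 yields the higher local reinforcement rate,
# its relative frequency of choice grows towards 1.
p = 0.5
for _ in range(1000):
    p = meliorate(p, v1=1.0, v2=0.5)
print(p)  # 1.0
```

With equal local rates, f(0) = 0 leaves the frequency unchanged, so allocations where both rates coincide are rest points of the dynamic.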
Without specifying the function f of Eq (1), the melioration learning rule remains vague, and long-term behaviour cannot be analysed. In contrast to previous specifications [32][33][34], this paper presents a formal representation of melioration learning that is perfectly consistent with Eq (1) and builds on a well-established algorithm of reinforcement learning. More precisely, melioration is suggested to be formalised by an instance of the Q-learning algorithm [35] with ε-greedy strategy.
Q-learning is a form of temporal-difference (TD) learning and originates from a sub-field of artificial intelligence [8]. While TD models were initially used to represent classical conditioning [36], they can be "applied to stochastic sequential decision tasks to produce an analog of instrumental learning" (pp. 541-542, [37]). A general model of sequential decision tasks is specified in Definition 1 and illustrated in Fig 1.

Definition 1 Let E be a finite set of choice alternatives. A situation of sequential decision-making is given by two stochastic processes (X_t)_{t=1}^∞ and (R_t)_{t=1}^∞ with values in E and [0, ∞), respectively.
In the situation of Definition 1, decisions are made in discrete time steps t ∈ ℕ. At time t, the actor emits an action by choosing an element X_t ∈ E from the set of alternatives. Subsequently, a non-negative reward R_t is received from the social environment. In this paper, the action-process (X_t)_{t=1}^∞ is specified by Algorithm 1, which contains an instance of Q-learning with ε-greedy strategy.

Algorithm 1 The melioration learning algorithm
Require: exploration rate ε ∈ (0, 1), set of alternatives E
 1: t ← 0
 2: initialise Q_1(j) ← 0, for all j ∈ E
 3: initialise K_1(j) ← 0, for all j ∈ E
 4: repeat
 5:   t ← t + 1
 6:   r ← random number between 0 and 1 (uniformly distributed)
 7:   if ε > r then
 8:     choose a random action X_t ← e ∈ E using a uniform distribution
 9:   else
10:     choose an action X_t ← e such that Q_t(e) = max_{j ∈ E} Q_t(j)
11:   end if
12:   observe reward R_t = y
13:   K_{t+1}(e) ← K_t(e) + 1
14:   Q_{t+1}(e) ← Q_t(e) + (1 / K_{t+1}(e)) (y − Q_t(e))
15:   for all j ≠ e do
16:     Q_{t+1}(j) ← Q_t(j) and K_{t+1}(j) ← K_t(j)
17:   end for
18: until the end of the simulation

In Algorithm 1, an actor is assumed to maintain a set of Q-values {Q_t(e)}_{e ∈ E} at every time step t ∈ ℕ. The Q-values are initially set to zero and iteratively updated. At every round, an alternative e ∈ E is chosen randomly with probability ε or greedily otherwise. Greedy choice means that an alternative with the currently highest Q-value is selected. The Q-value Q_t(e) of the chosen alternative e is modified by the realisation of R_t such that it equals the average of all past rewards of e.
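Algorithm 1 can be transcribed into a short program. The sketch below is a faithful but illustrative rendering: the function name, the `rewards` callback, and the tie-breaking rule (ties go to the first element of E) are implementation choices, not part of the original algorithm.

```python
import random

def melioration_agent(E, eps, rewards, rounds):
    """Epsilon-greedy Q-learning with sample-average updates (Algorithm 1).

    E       -- list of alternatives
    eps     -- exploration rate in (0, 1)
    rewards -- callback mapping a chosen alternative to a reward
    """
    Q = {e: 0.0 for e in E}  # local reinforcement rate estimates
    K = {e: 0 for e in E}    # number of times each alternative was chosen
    history = []
    for _ in range(rounds):
        if random.random() < eps:
            e = random.choice(E)            # explore uniformly
        else:
            e = max(E, key=lambda j: Q[j])  # exploit (ties: first element of E)
        y = rewards(e)
        K[e] += 1
        Q[e] += (y - Q[e]) / K[e]           # running average of past rewards
        history.append(e)
    return Q, history
```

For example, with rewards of 1 for A and 0 for B, the agent settles on A and chooses B only during exploration.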
In the words of Herrnstein and Vaughan [11], Q_t(e) corresponds to the local reinforcement rate of action e ∈ E at time t ∈ ℕ. If the actor always chooses an action with the currently highest Q-value, the relative frequency of this action increases as required by Eq (1). Consequently, Algorithm 1 with ε = 0 conforms to the theory of melioration learning. A strictly positive exploration rate ε > 0 allows a trade-off between exploiting the currently best actions and exploring alternatives. If this rate decreases sufficiently slowly towards zero over time, past research proved that Q-learning converges to optimal behaviour under certain assumptions of stationarity [38,39]. For example, convergence is assured if, for every t ∈ ℕ, the reward R_t is bounded and its expected value depends only on X_t.
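The update in line 14 of Algorithm 1 maintains exactly the sample average of past rewards, i.e. the local reinforcement rate. A minimal check:

```python
def incremental_average(rewards):
    """Apply the update of line 14 of Algorithm 1 to a single alternative."""
    Q, K = 0.0, 0
    for y in rewards:
        K += 1
        Q += (y - Q) / K
    return Q

# The incremental update reproduces the arithmetic mean of all rewards.
assert incremental_average([4.0, 2.0, 6.0, 8.0]) == sum([4.0, 2.0, 6.0, 8.0]) / 4
```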
However, convergence of Q-learning is impeded if multiple persons interact and reinforcements are contingent upon the decisions of everyone (p. 451, [40]). While equilibria are reached in some instances of the prisoner's dilemma or the coordination game [41][42][43], the behaviour fails to converge in others. The results depend on the reward structure of the situation [44] as well as the particular version of Q-learning [45].
In the next section, various examples of two-person games are explored by agent-based simulations. The outcomes of Algorithm 1 are compared to the predictions of another model of reinforcement learning, which is widely known in economics and was developed by Roth and Erev [7]. Algorithm 2 specifies this model. Similar to Algorithm 1, an actor holds a set of values {P_t(e)}_{e ∈ E} that reflect the previous experiences with the alternatives. In [7], these values are called propensities. At each time step, an alternative e ∈ E is chosen with probability P_t(e) / Σ_{j ∈ E} P_t(j). The parameter ε maintains a level of exploration.

Algorithm 2 The Roth-Erev learning algorithm
Require: exploration rate ε ∈ (0, 1), set of alternatives E
 1: t ← 0
 2: initialise P_1(e) ← 1, for all e ∈ E
 3: repeat
 4:   t ← t + 1
 5:   choose action X_t ← e ∈ E randomly using the probabilities P_t(j) / Σ_{k ∈ E} P_t(k)
 6:   observe reward R_t = y
 7:   P_{t+1}(e) ← P_t(e) + (1 − ε) y
 8:   for all j ≠ e do
 9:     P_{t+1}(j) ← P_t(j) + (ε / (|E| − 1)) y
10:   end for
11: until the end of the simulation

There are two small differences between Algorithm 2 and the original model of [7]. First, gradual forgetting is not considered because the melioration algorithm omits this feature as well. Second, the exploration quantity (ε / (|E| − 1)) y is added to all alternatives instead of just the "adjacent" ones. In [46], this approach was used for two-action games or if a linear order of the alternatives was absent.
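As with Algorithm 1, the Roth-Erev rule can be transcribed directly. The steps for updating the propensities follow the description in the surrounding text; the function name and the `rewards` callback are illustrative choices.

```python
import random

def roth_erev_agent(E, eps, rewards, rounds):
    """Simplified Roth-Erev learning (Algorithm 2).

    Choice probabilities are proportional to the propensities P; the reward
    is credited to the chosen alternative with weight (1 - eps), and the
    exploration share eps is spread evenly over the other alternatives.
    """
    P = {e: 1.0 for e in E}  # initial propensities
    history = []
    for _ in range(rounds):
        total = sum(P.values())
        e = random.choices(E, weights=[P[j] / total for j in E])[0]
        y = rewards(e)
        P[e] += (1 - eps) * y
        for j in E:
            if j != e:
                P[j] += eps * y / (len(E) - 1)
        history.append(e)
    return P, history
```

Because rewards only grow the propensities, actions that paid off in the past are chosen more often, while every alternative keeps a strictly positive choice probability.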
The following analysis focuses on the Roth-Erev model instead of other learning processes because it is similar to melioration. Both models take a "mechanistic perspective on learning", which means that "people are assumed to learn according to fixed mechanisms or routines" (p. 903, [47]). Additionally, simple versions with only one parameter (the exploration rate) exist. Other models of reinforcement learning, such as regret-testing, ITE, Bush-Mosteller [48], or experience-weighted attraction [49], require additional assumptions and the specification of further parameters.

Results
Algorithms 1 and 2 were applied to different two-person games by means of agent-based simulations. The simulations were implemented in NetLogo [50]. All games are presented in normal form. The two players, who are also called agents, are labelled "x" and "y". Capital letters or integers denote the alternatives. The following rules specify the simulations.
• For each game, a simulation of 20000 pairs of agents was run. Every agent interacted with the same partner during the whole simulation.
• Half of the pairs of agents employed Algorithm 1 (melioration learning). The other half used Algorithm 2 (Roth-Erev). In both cases, ε was set to 0.1.
• Every player repeatedly chose one of the alternatives according to Algorithm 1 or 2 until 1000 choices had been made.
• The agents observed only their own choices and rewards. They were not aware of the structure of the game or the partner's choices and rewards.
• The payoff matrices show mean rewards. The actual rewards were drawn from normal distributions with standard deviations of one.
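The rules above can be sketched for a single pair of melioration agents. The payoff matrix below is a hypothetical prisoner's dilemma (B dominates A), not the exact values of Fig 2; rewards are drawn from normal distributions with standard deviation one, as in the simulations.

```python
import random

# Hypothetical prisoner's dilemma payoffs (mean rewards); B dominates A.
PAYOFFS = {("A", "A"): (3, 3), ("A", "B"): (0, 4),
           ("B", "A"): (4, 0), ("B", "B"): (1, 1)}
E, EPS, ROUNDS = ["A", "B"], 0.1, 1000

def choose(Q):
    """Epsilon-greedy choice as in Algorithm 1."""
    if random.random() < EPS:
        return random.choice(E)
    return max(E, key=lambda j: Q[j])

Qx = {e: 0.0 for e in E}; Kx = {e: 0 for e in E}
Qy = {e: 0.0 for e in E}; Ky = {e: 0 for e in E}
for _ in range(ROUNDS):
    x, y = choose(Qx), choose(Qy)
    mx, my = PAYOFFS[(x, y)]
    # Actual rewards are noisy: normal around the mean, standard deviation 1.
    rx, ry = random.gauss(mx, 1), random.gauss(my, 1)
    Kx[x] += 1; Qx[x] += (rx - Qx[x]) / Kx[x]
    Ky[y] += 1; Qy[y] += (ry - Qy[y]) / Ky[y]

print(max(E, key=lambda j: Qx[j]), max(E, key=lambda j: Qy[j]))
```

In most runs, both agents end the simulation greedy on the dominant alternative B.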
Statistical tests were omitted in the comparison of the two learning models because they are largely unnecessary. Since there were 10000 pairs of agents in each group, any standard test would have marked a difference as small as 150 pairs as statistically significant. For example, in the histogram of Fig 2, the first two bars at (A,A) show a difference of 178 pairs. The reader may decide whether the reported differences are theoretically or practically significant.
In the following, three classes of two-person games are distinguished. The first class contains games in which both players have a (weakly) dominant alternative. Second, games without dominant alternatives but with several pure Nash equilibria are considered. The last class covers games with exactly one mixed Nash equilibrium. This division is not exhaustive, but it clarifies the properties of melioration learning in two-person games.

Games with dominant alternatives
An alternative of a player is dominant if the choice of this alternative comes with a mean reward that is strictly greater than the mean reward of any other alternative given one choice of the partner, and greater than or equal to the mean reward of any other alternative given the remaining choices of the partner (cf. weak dominance in [51], p. 77). A representative member of this class of games is the prisoner's dilemma. In the example of Fig 2, alternative B is dominant for both players. The outcome (B,B) is, therefore, a Nash equilibrium. All other outcomes are Pareto-optimal.
In Fig 2, the frequency distribution of pairs of agents at the 1000th round of the simulation is shown (for the temporal development, see S1 Fig). It distinguishes between pairs of agents who learned by melioration (mel) and pairs of agents who used the Roth-Erev model (RE). Both types of agents predominantly chose the Nash equilibrium. Because of the exploration rate, the non-equilibrium outcomes (A,B) and (B,A) also occurred. In case of melioration learning, the frequencies approximated the expected ones. Agents who used the Roth-Erev model showed slightly higher frequencies of non-equilibrium outcomes.
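The expected frequencies can be checked by a back-of-the-envelope calculation: once both agents have settled on the dominant alternative B, each still plays A with probability ε/2 under the ε-greedy strategy. Assuming independent play, the expected distribution over 10000 pairs is:

```python
eps, pairs = 0.1, 10000
p_B = 1 - eps / 2  # each settled agent plays B with probability 0.95
expected = {
    ("B", "B"): p_B * p_B * pairs,
    ("A", "B"): (eps / 2) * p_B * pairs,
    ("B", "A"): p_B * (eps / 2) * pairs,
    ("A", "A"): (eps / 2) ** 2 * pairs,
}
print({k: round(v) for k, v in expected.items()})
# {('B', 'B'): 9025, ('A', 'B'): 475, ('B', 'A'): 475, ('A', 'A'): 25}
```

This rough benchmark ignores the early learning phase and is only meant to show the order of magnitude against which the simulated frequencies can be compared.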
Another example of a game with dominant alternatives is "guess 2/3 of the average". Fig 3 contains a discrete version of this game with four alternatives. In this game, each player tries to guess what two-thirds of the average of both guesses will be. The agent who is closest to this value "wins" the game. In case of melioration learning, the acquisition of the dominant alternative was due to the exploration rate. Exploration guaranteed that the fourth outcome (A,B) was selected occasionally, especially in the beginning of the simulation. For player x, this meant that the average value of alternative A (Q_t(A)) was between 0 and 10. The Q-value of alternative B, on the other hand, was approximately 10. The reverse held for player y, which led to the combination (B,A) in rounds without exploration.
Result 1 In two-person games, the process of melioration learning yielded the choice of a (weakly) dominant alternative.

Games with multiple pure equilibria
The exploration rate was a key factor in the simulations of the previous section because it rendered dominated alternatives inferior. In games without a dominant alternative, this argument did not apply, and actors were not drawn to a single alternative. Games with a strictly mixed Nash equilibrium are considered in the next section. In this section, games with at least two pure equilibria are analysed.
A basic game with two or more Nash equilibria is the coordination game. It refers to a class of situations in which the players prefer to coordinate their choices in some way. In the particular example of Fig 5, the outcomes (A,A) and (B,B) are pure Nash equilibria, and (A,A) payoff-dominates (B,B) because of higher mean rewards (p. 81, [52]). This game has an additional mixed equilibrium with probabilities (A: 4/9, B: 5/9) for both players. At the 1000th round of the simulation, the agents mainly chose a pure Nash equilibrium, the payoff-dominant one with a slightly higher frequency. In other words, most pairs of agents were able to coordinate their choices. The deviations to (A,B) and (B,A) were due to the exploration rate.

In comparison, the melioration model was more successful in the coordination of actions than the Roth-Erev model. The latter led to non-equilibrium outcomes more frequently than predicted by the exploration rate. This was even more apparent in the "battle of the sexes", which is a particular kind of coordination game. It describes an interaction between two persons with complementary preferences about two alternatives but with an additional preference for choosing the same one. A sample reward matrix is given by the left-sided table of Fig 8.

A similar effect arose in the game of chicken (right-sided table of Fig 8), which resembles a basic conflict between two parties that requires the retreat of at least one of them to be solved. In this case, agents who learned by melioration predominantly chose one of the two pure Nash equilibria.

Finally, a game with more than two pure Nash equilibria was analysed. Fig 9 contains heat maps of a dispersion game with four alternatives. It is, in some respect, the opposite of a coordination game. Each agent prefers not to match the choice of the other agent. This means that all but the diagonal outcomes are optimal Nash equilibria. Consequently, most agents of the simulations were distributed evenly among the non-diagonal outcomes.
Agents who applied the Roth-Erev model were more often found in non-equilibrium outcomes (see also S8 Fig).
Result 2 In two-person games without dominant alternatives, agents who learned by melioration arrived at one of the pure Nash equilibria. The frequency distribution over the equilibria depended on the structure of the game.

Games without pure Nash equilibria
Simulations of games without pure equilibria required a higher number of rounds until the behaviour of the agents had converged. Therefore, the following simulations were run with only 2000 pairs of agents but for 20000 rounds of the game. The relative frequencies of choice were calculated over the whole period of 20000 rounds and for each agent separately. Furthermore, a slightly higher exploration rate (ε = 0.2) was assumed because it increased the speed of convergence (see S9 Fig).
First, the game "matching pennies" as shown in Fig 10 was analysed. It is a zero-sum game, and its single Nash equilibrium is given by the probabilities (A: 0.5, B: 0.5) for both players. Fig 10 contains histograms over the relative frequencies of alternative A. For both types of players, the relative frequencies were in accordance with the probabilities of the mixed Nash equilibrium. The agents displayed a mix of alternatives in which each was chosen half of the time.
A similar result was obtained in the game "rock-paper-scissors", which is zero-sum with three alternatives per player (Fig 11). The agents' behaviour approached the predictions of the mixed Nash equilibrium: (A: 1/3, B: 1/3, C: 1/3). The rate of convergence is seen in S10 Fig.

Fig 12 displays a game that is not zero-sum and has a single mixed Nash equilibrium at (x: (A: 1/2, B: 1/2); y: (A: 5/7, B: 2/7)). In the past, this game was taken to model the interaction between criminals and police [54] and was, therefore, called the inspection game [55]. The simulation demonstrated that the behaviour of agents who learned by melioration approached the Nash equilibrium (see also S11 Fig). Further simulations were run with different payoffs for player x given the outcome (A,A). This payoff refers to the punishment of a crime. Since the predictions of the Nash equilibrium for player x remained constant and the results of the simulations stayed in line with the Nash equilibrium, criminals who learned by melioration chose to commit a crime with a relative frequency of 0.5 regardless of the level of punishment.
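The mixed equilibria quoted above can be recomputed from the indifference conditions of a 2×2 game. The helper below is a generic sketch, not code from the paper; it is demonstrated on matching pennies with the conventional ±1 payoffs, since the exact reward matrices of Figs 10 and 12 are given in the figures.

```python
from fractions import Fraction

def mixed_equilibrium_2x2(Ax, Ay):
    """Strictly mixed equilibrium of a 2x2 game via indifference conditions.

    Ax[i][j] / Ay[i][j]: payoffs of the row / column player.
    Returns (p, q): the probabilities of action 0 for row and column player.
    """
    # q makes the row player indifferent between its two actions.
    q = Fraction(Ax[1][1] - Ax[0][1],
                 Ax[0][0] - Ax[0][1] - Ax[1][0] + Ax[1][1])
    # p makes the column player indifferent between its two actions.
    p = Fraction(Ay[1][1] - Ay[1][0],
                 Ay[0][0] - Ay[1][0] - Ay[0][1] + Ay[1][1])
    return p, q

# Matching pennies: the row player wins on a match, loses otherwise.
print(mixed_equilibrium_2x2([[1, -1], [-1, 1]], [[-1, 1], [1, -1]]))
# (Fraction(1, 2), Fraction(1, 2))
```

The computation also illustrates why a player's equilibrium mix depends only on the opponent's payoffs, which is why the crime rate above stayed at 0.5 when the punishment payoff of player x was varied.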
In previous research, laboratory experiments indicated that the level of punishment has an effect on the crime rate. More specifically, the level of punishment was negatively correlated with the crime rate [55]. However, the experiments lasted for only 15 rounds of decision-making. If humans learn slowly, the behaviour might not have converged to a stable point yet. In additional simulations, the payoffs of [55] were used. All agents used the melioration learning model. The mean value of 1000 agents is plotted on a logarithmic scale of time. In case of low punishment (upper row), the Nash equilibrium (0.5) was approached from above. If punishment was high (lower row), the equilibrium was approached from below. Hence, there was a long period in which crime rates were higher for low punishment than for high punishment. The inspection rates, too, conformed qualitatively to the experimental results if the focus is on the early rounds.
Last, some games impeded the convergence of the behaviour of agents who learned by melioration or by the Roth-Erev model. One example is presented in Fig 14. This game is sometimes referred to as Shapley's game and is known for its difficulties in regard to the convergence of learning algorithms [56]. It is similar to the game "rock-paper-scissors" except for the diagonal rewards, which are (0,0) instead of (5,5). The Nash equilibrium is given by the probabilities (A: 1/3, B: 1/3, C: 1/3) for both players. In the simulations, the relative frequencies of choice oscillated without a visible decrease in the height of the waves, which could have led to a stable outcome. In case of the Roth-Erev model, the dynamic was slower, but no convergence was visible either.
Result 3 In two-person games without pure Nash equilibrium, agents who learned by melioration chose several alternatives with strictly positive relative frequency. In some of the games, the long-term relative frequencies corresponded to the mixed Nash equilibrium. Other games prevented the convergence of the agents' behaviour.

Conclusion
A simple process of completely uncoupled learning was investigated. It differs from previous models such as regret-testing or trial-and-error learning because, on the one hand, it is derived from empirical research and, on the other hand, the convergence to equilibrium states in social interactions is not guaranteed.
Nevertheless, computer simulations revealed that the outcomes of melioration were largely in line with game-theoretical predictions. More specifically, actors who learned by melioration chose a dominant alternative in two-person games. If no alternative was dominant, mainly pure Nash equilibria occurred. The structure of the game, which includes the rewards of non-equilibrium outcomes, affected the distribution of outcomes. Compared to the Roth-Erev model, pure equilibria were selected with a higher frequency, and the melioration model was more successful in the selection of optimal ones.
In contrast to earlier models of learning, very few assumptions about available information and cognitive skills are needed. The actors must remember their own choices, observe the subsequent rewards, and be able to aggregate them into average values. They can neglect the other actors, their decisions, and their outcomes. Furthermore, apart from the exploration rate, no further parameters have to be specified.

In the past, melioration was often seen as too simplistic to adequately represent the complexity of human behaviour [57]. Yet, its predictions might be sufficiently accurate on a social level. Another advantage of melioration learning is that, with Q-learning, there is an algorithm that implements this theory and has been studied extensively in the past. First, this means that results about its convergence can be appropriated for an application in social theory. Second, multiple extensions of Q-learning exist. If melioration turns out to be too simple, there are many ways to adjust the model in order to make it a more realistic representation of human behaviour.