From rationality to cooperativeness: The totally mixed Nash equilibrium in Markov strategies in the iterated Prisoner’s Dilemma

In this research, the social behavior of the participants in a Prisoner's Dilemma laboratory game is explained on the basis of the quantal response equilibrium concept and the representation of the game in Markov strategies. In previous research, we demonstrated that social interaction during the experiment has a positive influence on cooperation, trust, and gratefulness. This research shows that the quantal response equilibrium concept agrees only with the results of experiments on cooperation in the Prisoner's Dilemma prior to social interaction; it does not explain participants' behavior after social interaction. As an alternative theoretical approach, we examined the iterated Prisoner's Dilemma game in Markov strategies. We built a totally mixed Nash equilibrium in this game; the equilibrium agrees with the results of the experiments both before and after social interaction.


Introduction
The traditional approach to analyzing the decision-making process of the participants in game-like interaction is based on the individual rationality principle of each participant [1,2]. The Nash equilibrium and its numerous generalizations postulate the principle of the best response by each participant in the interaction to the behavior of others [3][4][5].
Such an approach has made it possible to create and study numerous models of social and economic behavior, as noted, in particular, by several Nobel prizewinners in economics. At the same time, extensive empirical and experimental data on game-like interaction have been accumulated. In these data, people's behavior cannot be explained by individual rationality alone [6][7][8]. Thus, we must consider the social characteristics of the decisions taken: 2. fairness, based on non-acceptance of inequality [9,11,12]; 3. trust and gratefulness [13,14]; and 4. level of social responsibility [15,16].
One of the standard methods for the theoretical description of data that do not correspond to the theory of rationality is the quantal response equilibrium (QRE) model. To date there have been several attempts to use the QRE approach in the analysis of experimental data. In [17] it was found that experimental data on auctions are well interpreted using QRE. Moreover, in [18] it was shown that a QRE model complements the method of maximum likelihood by accounting for the irrationality of the players participating in experiments. The application of QRE to 2×2 games was studied in [4]. Another approach, based on introducing Markov chains, was demonstrated in [19]. The main problem in such research is that the behavioral experimental data in each study are unique, which demands an individual approach and theoretical basis. In this paper we explain how a QRE model and Markov chains were applied to the data of real experiments with the Prisoner's Dilemma game and the Trust Game.

Materials and methods

Participants
To analyze the social characteristics of people's behavior during game-like interaction in small groups (4-12 subjects), numerous experiments that clearly reveal one or another social characteristic of behavior were conducted in 2013-2016 at the Laboratory of Experimental Economics (LEE) at the Moscow Institute of Physics and Technology (MIPT) in cooperation with the Skolkovo Institute of Science and Technology. In this paper, the results of eight experiments are presented. Each experiment involved 12 participants; thus, data on 96 participants (59 males, 37 females) were taken into consideration. For each experiment, MIPT students who did not know each other were selected as participants. Characteristics such as major, group, and year of studies were considered during the selection. Participants were recruited through advertisements in the VKontakte social network (vk.com). The Skolkovo Institute of Science and Technology Human Subjects Committee approved the study procedures involving human participants. Written informed consent was obtained from the participants. The experimental data are available on Harvard Dataverse: http://dx.doi.org/10.7910/DVN/ZGW6ZP.

Design and procedures
During the experiment the participants were asked to play the following games: 1. Prisoner's Dilemma (PD). Each of two participants has two strategies: Cooperation (Up or Left) or Defection (Down or Right). In the standard PD, both players receive the same reward, R, for mutual Cooperation and a smaller payoff, P, for mutual Defection. If one of the players cooperates and the other defects, the cooperator receives the smallest payoff, S, and the defector takes the largest payoff, T. Thus, the payoffs are ordered as T>R>P>S (Table 1) [6]. Defection is more profitable than Cooperation whatever the partner chooses, but mutual Cooperation is more profitable for both than mutual Defection. The Nash equilibrium corresponds to mutual Defection (P, P), but the participants try to establish mutual Cooperation (R, R) [20].
2. Trust and Gratefulness (Trust Game) (TG). One of the participants (the Grantor) can entrust another participant (the Grateful) with some of his or her own money (from 0 to 10). The money obtained (invested) is tripled and the Grateful can share any part of this increased amount with the Grantor (Fig 1). In the totally mixed Nash equilibrium, there is no sense in gratefulness, and therefore there is no sense in trust, which leads to a zero result for both participants [21,22].
In the experiments, gratefulness and trust on average are significantly greater than zero.
Each experiment was divided into three parts:
Part 1, the Anonymous stage. The participants were invited to play 11 rounds of PD first and then 11 rounds of TG. z-Tree, a specialized tool for designing and carrying out group experiments in experimental economics developed at the University of Zurich, was used [23]. The participants were able to move to the next round only after all 12 participants had made their choices. No one knew who their opponents were, and in each round the pairs of participants changed randomly. After each round, the result of the round and the overall result for the current point in the game were displayed on the monitor.

Table 1. Prisoner's Dilemma payoffs.
              Cooperation   Defection
Cooperation   R, R          S, T
Defection     T, S          P, P
https://doi.org/10.1371/journal.pone.0180754.t001

Part 2, the Socialization stage. The participants were invited to take part in interactive cooperation. First the participants memorize each other's names with the help of a snowball game: they sit in a circle; the first one gives his or her name and a personal characteristic that starts with the same letter as the name; the next participant repeats the name and the characteristic of the first participant and adds his or her own name and characteristic; the game continues along the chain to the last person in the circle, who repeats all the names and all the characteristics. Then the participants, in reverse order, share personal information: hometown, major, hobby, and interests. Then two captains are chosen as volunteers from among the participants. The other participants must choose the captain whose team they want to join and how many points they are prepared to pay for that. The participants find out their gain for the first part of the game. Then each of the participants except the captains must write on a piece of paper the name of the chosen captain and a number of points, from 0 to 50, that they are ready to pay in order to join the team of the chosen captain. The pieces of paper are personally given to the organizer, who sorts them by captain and points. In this way, two teams of four people with captains are formed. The remaining four participants, who paid less than the others, continue as individual participants; they are forbidden to communicate or even look at each other (Fig 2).
The participants are informed about the procedure of distribution by teams beforehand, so all their steps can be regarded as considered and deliberate. At the end of the Socialization stage, the participants in the teams with captains have five minutes to find five common characteristics (eye color, favorite food, movie, etc.) and decide on the name of the team.
Part 3, the Socialized stage. After Socialization, the participants are divided into three groups: two groups of four participants with captains and the four remaining participants. In this stage, the participants play PD and TG for 18 rounds within each group: the participants of group 1, including the captain, played only with each other, as did the participants of group 2, and the four remaining participants also played only with each other. Therefore, we have the behavioral data of participants before Socialization in the general group of 12 people and after Socialization in the respective groups of four.

Recent findings
In our study, we focus on the mechanisms that promote cooperation. It has been shown that cooperation can be boosted by heterogeneous coupling between interdependent lattices [24], the link weight mechanism [25], and the size of the interaction neighborhood [26]. These findings explain the evolution of cooperation, especially the emergence of cooperation in non-cooperative games such as PD [27].
Another approach to investigating cooperation is proposed in [1,28-30], where it was shown that incorporating Socialization into the PD experiment increases cooperation. The average level of cooperation in Part 1, the Anonymous stage, is 21%, whereas in Part 3, the Socialized stage, the average level of cooperation in the socialized groups is 53% [28]. From the viewpoint of the theory of rationality, the participants should not choose the strategy of cooperation; therefore, participants' behavior in social experiments of this kind does not fit classic economic theory [31]. The increased level of cooperation is explained by incorporating an additional component into the utility function: the social one. In this way, general utility consists of economic (rational) and social utility. The social component is understood as the completion of a socially useful accomplishment. For example, a cooperative move by a participant gives an equal gain to the opponent, which increases social utility. However, we were interested in how well the obtained data agree with other well-known models, so we turned to the idea of QRE.

About quantal response equilibrium
In this section we discuss attempts to explain the deviation of participants' observed behavior from the theoretical Nash equilibrium on the basis of the QRE concept. The QRE concept appeared at the intersection of game theory and experimental economics in order to explain behavior of participants in laboratory experiments that differed significantly from the Nash equilibrium [32,33].
"QRE is an internally consistent equilibrium model, in the sense that the quantal response functions are based on the equilibrium probability distribution of the opponents' strategy choices rather than simply on arbitrary beliefs the players could have about those probabilities" [32]. One of the model's features is that it allows modeling of players who make mistakes. QRE imposes a requirement that beliefs should correspond to the equilibrium choice probabilities. In this way, QRE requires a solution at a fixed point of the choice probabilities, similar to the Nash equilibrium. However, unlike the classic Nash equilibrium, QRE supposes that participants pursue the best response only in the probabilistic sense: the better a response is, the higher the probability that a participant will choose it [10,34]. "The QRE has been compared with experimental observation and generally provides a better fit to the data than the NE" [35]. On this basis, we decided to evaluate the model using our experimental data.

Fig 2. Other participants must choose the captain whose team they want to join and how many points they are prepared to pay for that. The four participants, who paid less than the others, continue as individual participants in Group 3; they are forbidden to communicate or even look at each other. The participants in the teams with captains form Group 1 and Group 2. https://doi.org/10.1371/journal.pone.0180754.g002
According to [33], we introduce QRE through the logistic quantal response function:
$$\alpha_{ij}(u_{i1}, \ldots, u_{iJ_i}) = \frac{\exp(\lambda u_{ij})}{\sum_{k=1}^{J_i} \exp(\lambda u_{ik})}. \qquad (1)$$
Here $u_{ij}$ is the expected payoff of player i with strategy j, $j \in \{1, \ldots, J_i\}$. "If each player uses a logistic response function, QRE or logit equilibria are the solutions of $P_{ij} = \alpha_{ij}$, where $P_{ij}$ is the frequency of strategy j in player i" [36].
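As an illustration of the logistic response rule, the following minimal Python sketch maps a list of expected payoffs to choice probabilities. The function name `logit_response`, the argument name `lam` for the precision parameter, and the example payoffs are ours and serve only as an illustration; they are not taken from the cited works.

```python
import math

def logit_response(payoffs, lam):
    """Return logit choice probabilities for the given expected payoffs.

    payoffs: expected payoff of each available strategy
    lam:     precision parameter (lam = 0 gives uniform random choice)
    """
    # Subtract the largest exponent before exponentiating to avoid overflow.
    m = max(lam * u for u in payoffs)
    weights = [math.exp(lam * u - m) for u in payoffs]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative payoffs only: two strategies with expected payoffs 5 and 9.
for lam in (0.0, 0.5, 2.0):
    print(lam, logit_response([5.0, 9.0], lam))
```

At `lam = 0` every strategy is equally likely; as `lam` grows, the probability mass concentrates on the strategy with the highest expected payoff.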

QRE in the PD game
For the PD game that we considered (Table 1), the QRE could be determined as follows. Let p be the probability of the cooperative move of a partner; then the expected gain from the cooperative action equals 5p+0(1-p) = 5p and the expected gain from the uncooperative action equals 10p+1(1-p) = 9p+1 [4]. We define λ as the precision parameter, which is inversely related to the variance of the error. For every λ, p = QRE(λ) is found by substituting these gains into formula (1), i.e., as the solution of Eq (2):
$$p = \frac{\exp(\lambda \cdot 5p)}{\exp(\lambda \cdot 5p) + \exp(\lambda (9p+1))}. \qquad (2)$$
At λ = 0 the probability of the cooperative move QRE(λ) equals 0.5 (chaotic behavior). As λ increases, the probability of cooperation QRE(λ) decreases, and in the limit λ→∞ it tends to 0, which corresponds to the unique Nash equilibrium in PD. Thus, from the QRE position any percentage of cooperative moves less than 50% could be justified. This is quite suitable for games before Socialization. Mathematically, this means solving the equation QRE(λ) = p for the parameter λ given an observed level of cooperative moves p. In this case, the equation is easily solved:
$$\lambda = \frac{\ln\left(\frac{1-p}{p}\right)}{4p+1}.$$
We give the solutions of this equation for the series of experiments in fall 2015 (Table 2).
For clarity, we represent this dependence graphically (Fig 3). We see that the maximum degree of cooperation before Socialization was achieved on 12.10.2015 and is 41.7%, which corresponds to a clearly positive value λ = 0.126. The minimum degree of cooperation before Socialization was reached on 15.09.2015 and is 12.1% at λ = 1.334 (Fig 3), which is far from the limit value. The average value of cooperation in the experiment series is 28% at λ = 0.44.
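The relation between the cooperation level p and the precision λ in the PD can be checked numerically. The sketch below assumes the expected gains stated above (5p for cooperation, 9p+1 for defection); the function names are illustrative. Taking the log-odds of the logit condition gives ln(p/(1-p)) = -λ(4p+1), which the first function inverts for λ, while the second solves the fixed-point condition for p by bisection.

```python
import math

def lambda_from_p(p):
    """Precision lam for which the logit QRE cooperation level equals p (0 < p < 1)."""
    return math.log((1.0 - p) / p) / (4.0 * p + 1.0)

def qre_cooperation(lam, iterations=100):
    """Solve p = 1 / (1 + exp(lam * (4p + 1))) for p, assuming lam >= 0."""
    def gap(p):
        x = lam * (4.0 * p + 1.0)   # payoff advantage of defection, scaled by lam
        e = math.exp(-x)            # x >= 0 here, so this cannot overflow
        return p - e / (1.0 + e)    # p minus the logit probability of cooperation

    lo, hi = 0.0, 0.5               # gap(0) < 0 <= gap(0.5), and gap is increasing in p
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if gap(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Cooperation levels quoted in the text for the fall 2015 series.
for p_obs in (0.417, 0.121, 0.28):
    lam = lambda_from_p(p_obs)
    print(p_obs, round(lam, 3), round(qre_cooperation(lam), 3))
```

For the three cooperation levels quoted in the text, the loop should give values of λ close to the reported ones (about 0.126, 1.334, and 0.44), up to rounding of the percentages. For λ ≥ 0 the bisection never leaves the interval [0, 0.5], which is the numerical counterpart of the statement that QRE can only justify cooperation levels up to 50%.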
Thus, the calculations show that the behavior of the participants in the PD experiments before Socialization is fully described by the QRE concept, an accepted relaxation of the Nash equilibrium that weakens the best-response requirement.
However, after Socialization the situation radically changes.
As Table 3 shows, the level of cooperation after Socialization is over 50% in all the experiments. Since QRE can justify only levels of cooperation below 50%, the QRE concept is not applicable in this case.
This means that it is necessary to search for an alternative theoretical game model for the behavior of participants.

QRE in the TG
Let us find the QRE for the TG, which was also played in the series of experiments in fall 2015.
Unlike the static PD game, TG is a dynamic game with perfect information. The QRE concept is theoretically applicable in this case too. Let k = 0,…,10 be the trust level of player 1, and n = 0,…,3k be the gratefulness level of player 2. For given levels k and n, the payoff of player 1 is 10-k+n, and the payoff of player 2 is 3k-n. According to QRE, the probability $p_2(k,n)$ of gratefulness of level n for a given level of trust k is determined by the formula
$$p_2(k,n) = \frac{\exp(\lambda (3k-n))}{\sum_{m=0}^{3k} \exp(\lambda (3k-m))}.$$
For this reason, the expected payoff $u_1(k)$ of player 1 with a level of trust k is
$$u_1(k) = \sum_{n=0}^{3k} p_2(k,n)\,(10-k+n).$$
Then the probability $p_1(k)$ of trust of level k in QRE is determined by the formula
$$p_1(k) = \frac{\exp(\lambda u_1(k))}{\sum_{j=0}^{10} \exp(\lambda u_1(j))}.$$
To find the parameter λ from the results of the experiment, let us calculate the average observed levels of trust $k^*$ and gratefulness $n^*$ and compare them with the theoretical expected levels of trust k(λ) and gratefulness n(λ), which are calculated as
$$k(\lambda) = \sum_{k=0}^{10} k\,p_1(k), \qquad n(\lambda) = \sum_{k=0}^{10} p_1(k) \sum_{n=0}^{3k} n\,p_2(k,n).$$
Let us select the parameter λ so that the levels (k(λ), n(λ)) are as close as possible to the levels ($k^*$, $n^*$) observed in the experiment.
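The two-stage logit calculation for the TG can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: both players are assumed to share one precision parameter `lam`; player 1's payoff is taken as endowment minus trust plus return (10 - k + n); the expected gratefulness is taken as the unconditional expectation over both stages; and λ is fitted by a simple grid search on squared distance. All function names and the observed averages in the usage line are hypothetical.

```python
import math

def softmax(values, lam):
    """Logit choice probabilities for the given payoffs and precision lam."""
    m = max(lam * v for v in values)
    weights = [math.exp(lam * v - m) for v in values]
    total = sum(weights)
    return [w / total for w in weights]

def tg_expected_levels(lam, endowment=10, multiplier=3):
    """Expected trust k(lam) and gratefulness n(lam) in the logit QRE of the TG."""
    u1 = []       # expected payoff of player 1 for each trust level k
    mean_n = []   # expected return E[n | k] of player 2
    for k in range(endowment + 1):
        returns = range(multiplier * k + 1)
        # Player 2 keeps multiplier*k - n when returning n.
        p2 = softmax([multiplier * k - n for n in returns], lam)
        u1.append(sum(p * (endowment - k + n) for p, n in zip(p2, returns)))
        mean_n.append(sum(p * n for p, n in zip(p2, returns)))
    p1 = softmax(u1, lam)
    k_bar = sum(k * p for k, p in enumerate(p1))
    n_bar = sum(p * e for p, e in zip(p1, mean_n))
    return k_bar, n_bar

def fit_lambda(k_obs, n_obs, grid):
    """Grid value of lam whose (k(lam), n(lam)) is closest to the observed averages."""
    def dist(lam):
        k_bar, n_bar = tg_expected_levels(lam)
        return (k_bar - k_obs) ** 2 + (n_bar - n_obs) ** 2
    return min(grid, key=dist)

# Hypothetical observed averages, for illustration only.
grid = [i / 100 for i in range(301)]
print(fit_lambda(4.0, 5.0, grid))
```

At `lam = 0` both stages are uniform, so the expected trust is 5 and the expected gratefulness is 7.5; as `lam` grows, both levels fall toward zero, the equilibrium outcome described earlier for the TG.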
As the following results of calculations show, the parameter λ, even for games before Socialization, is rather close to 0; thus, according to QRE, the behavior of the participants in some experiments is treated as nearly chaotic. Table 4 shows the average values of trust and gratefulness for each experiment in the fall 2015 series, together with the calculated parameter λ and the theoretical values of trust and gratefulness that best approximate these averages. The average value of λ for this series is estimated at 0.17, which is significantly lower than the average value of λ obtained previously for the PD game in the same series of experiments. Hence, we can conclude that the QRE concept poorly explains the results of the TG experiments even before Socialization.

Model of iterated PD in Markov strategies
Let us construct and analyze a model of the iterated PD in Markov strategies. First, we take into account that the PD is played repeatedly with random partners. For simplicity let us assume that every participant responds only to the move made by his or her partner in the previous round. Such strategies are called Markov strategies or strategies with memory length equal to one [37,38]. Markov chains have been used more than once to find equilibria for the Prisoner's Dilemma [19,37-40]. In [19] only equilibria for "good", cooperative strategies were found. Based on these past results, we also decided to apply Markov strategies for the theoretical justification of the experimental data. However, we were interested not in extreme cases, which lead to total cooperation rarely observed in experiments, but in the internal equilibrium points, in which both cooperation and betrayal are selected with positive probabilities. For the PD game after Socialization, the following approach, described in previous works [38,41-45], proved the most suitable.
Let $\gamma_i$ denote reciprocal cooperation, i.e., the probability that participant i will act cooperatively after a round in which his or her partner played cooperatively.
Let $\alpha_i$ denote tolerance to defection, i.e., the probability that participant i will act cooperatively after a round in which his or her partner played non-cooperatively.
For the given parameters of cooperation and tolerance of the pair of participants, we obtain a Markov process with a finite number of states [46][47][48]. In the stationary distribution, each player of the pair will be in one of two possible states: {Cooperation, Defection}. The stationary probability $p_i^c$ for participant i to be in the cooperative state depends on the stationary probability $p_j^c$ of participant j ≠ i and on the strategic parameters of reciprocal cooperation $\gamma_i$ and tolerance to defection $\alpha_i$ of participant i in the following way:
$$p_i^c = \gamma_i\, p_j^c + \alpha_i\,(1 - p_j^c).$$
For the given strategies $\{\alpha_1, \alpha_2, \gamma_1, \gamma_2\}$ of the two participants (1 is the first participant and 2 is the second), this system of two linear equations with two unknowns is easily solved in explicit form:
$$p_1^c = \frac{\alpha_1 + (\gamma_1 - \alpha_1)\,\alpha_2}{1 - (\gamma_1 - \alpha_1)(\gamma_2 - \alpha_2)}, \qquad p_2^c = \frac{\alpha_2 + (\gamma_2 - \alpha_2)\,\alpha_1}{1 - (\gamma_1 - \alpha_1)(\gamma_2 - \alpha_2)}.$$
The product of the corresponding probabilities gives the stationary distribution for all four pairs of actions of the participants, and based on this distribution we can calculate the payoffs of the participants. Omitting the intermediate calculations, let us write the expression for the payoff of participant 1:
$$u_1 = R\,p_1^c p_2^c + S\,p_1^c (1 - p_2^c) + T\,(1 - p_1^c)\,p_2^c + P\,(1 - p_1^c)(1 - p_2^c).$$
It should be remembered that $(p_1^c, p_2^c)$ in turn depend on $\{\alpha_1, \alpha_2, \gamma_1, \gamma_2\}$ as indicated above. Therefore, we obtain a game in normal form with nonlinear payoff functions. Nevertheless, the symmetric totally mixed Nash equilibria (in Markov strategies) {α, α, γ, γ} can be found in explicit form in this game. With the payoffs used above (R = 5, S = 0, T = 10, P = 1), they satisfy
$$5\alpha^2 - 14\alpha\gamma + 9\gamma^2 + 14\alpha - 10\gamma + 1 = 0.$$
It can be checked that this second-order curve is a hyperbola. Let us represent its intersection with the unit square of tolerance and cooperation (Fig 4).
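The equilibrium condition can be checked numerically. The sketch below is an illustration under stated assumptions: the payoffs are R = 5, S = 0, T = 10, P = 1 (the values implied by the expected gains in the QRE section), the stationary cooperation probabilities solve p_i = γ_i p_j + α_i(1 - p_j), and the joint distribution is the product of the two marginals. Under these assumptions, both first-order conditions of participant 1 reduce at a symmetric point to -1 - 4p + (γ - α)(9 - 4p) = 0 with p = α/(1 - γ + α), which gives α = (9d - 1)(1 - d)/(4(1 + d)) for d = γ - α. All function names are ours.

```python
R, S, T, P = 5.0, 0.0, 10.0, 1.0   # assumed payoffs (see lead-in)

def stationary(a1, g1, a2, g2):
    """Stationary cooperation probabilities of the two participants."""
    d1, d2 = g1 - a1, g2 - a2
    den = 1.0 - d1 * d2   # zero when d1 * d2 == 1, e.g. for two tit-for-tat players
    return (a1 + d1 * a2) / den, (a2 + d2 * a1) / den

def payoff1(a1, g1, a2, g2):
    """Expected stationary payoff of participant 1 (product-form distribution)."""
    p1, p2 = stationary(a1, g1, a2, g2)
    return (R * p1 * p2 + S * p1 * (1 - p2)
            + T * (1 - p1) * p2 + P * (1 - p1) * (1 - p2))

def alpha_on_curve(d):
    """Tolerance alpha of the symmetric equilibrium with gamma - alpha = d."""
    return (9.0 * d - 1.0) * (1.0 - d) / (4.0 * (1.0 + d))

def curve(a, g):
    """Left-hand side of the second-order equilibrium curve."""
    return 5*a*a - 14*a*g + 9*g*g + 14*a - 10*g + 1

def slope(f, x, h=1e-6):
    """Central finite-difference derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

d = 0.5                          # an arbitrary point on the curve
a = alpha_on_curve(d)
g = a + d
# Unilateral deviations of participant 1 in tolerance and in reciprocity.
da = slope(lambda x: payoff1(x, g, a, g), a)
dg = slope(lambda x: payoff1(a, x, a, g), g)
print(a, g, curve(a, g), da, dg)
```

Both finite-difference slopes should vanish up to numerical error, so a unilateral change in either tolerance or reciprocal cooperation does not improve participant 1's payoff to first order at such a point.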
The upper point (0, 1) in Fig 4 corresponds to the standard tit-for-tat strategy with 100% reciprocal cooperation and zero tolerance to defection [49]. It is not an interior point of the space of strategic parameters, so the stationary distribution for a pair of such strategies is not determined uniquely and depends on the initial conditions. Without going into detail, let us assume that the participants always make a cooperative move in the first round; then a pair of tit-for-tat strategies leads to complete cooperation.
For us, the equilibrium with maximum tolerance to defection, approximately (0.3, 0.8), is of particular importance. We will treat the section of the hyperbolic curve below this point as equilibria before Socialization, and the section above it as equilibria after Socialization. Note that high levels of reciprocal cooperation (over 80%) are realized only at rather low tolerance.
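The location of the maximum-tolerance point can be reproduced with a short calculation. The sketch assumes the payoffs R = 5, S = 0, T = 10, P = 1 and the resulting parametrization of the symmetric equilibrium curve by d = γ - α, namely α(d) = (9d - 1)(1 - d)/(4(1 + d)); this parametrization is our derivation from the stationary equations, not a formula quoted from the paper. Setting the derivative of α(d) to zero gives 9d² + 18d - 11 = 0.

```python
import math

def alpha_on_curve(d):
    """Tolerance on the symmetric equilibrium curve for d = gamma - alpha."""
    return (9.0 * d - 1.0) * (1.0 - d) / (4.0 * (1.0 + d))

# Root of 9 d^2 + 18 d - 11 = 0 that lies in (0, 1).
d_star = 2.0 * math.sqrt(5.0) / 3.0 - 1.0
alpha_star = alpha_on_curve(d_star)
gamma_star = alpha_star + d_star

# Brute-force cross-check on a fine grid of d values.
grid = [i / 10000 for i in range(1, 10000)]
d_grid = max(grid, key=alpha_on_curve)

print(round(alpha_star, 3), round(gamma_star, 3), round(d_grid, 3))
```

Under these assumptions the maximum tolerance is about 0.29 at a reciprocal cooperation of about 0.78, in line with the rounded point (0.3, 0.8) used in the text.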

Results and discussion
Let us apply the theoretical calculations obtained to the experimental data. We will consider the data on the PD before and after Socialization. Table 5 displays the numbers of cooperative and non-cooperative moves, reciprocal cooperation, and tolerance to defection in PD before and after Socialization. Here Ncoop is the number of a partner's cooperative moves in the previous round; Nrecoop is the number of cooperative moves in response to cooperation in the previous round; Ndefault is the number of a partner's non-cooperative moves in the previous round; and Ntolerant is the number of cooperative moves after the partner's defection in the previous round.
These data allow us to assess the parameters α and γ for each experiment before and after Socialization. It is natural to estimate α as Ntolerant/Ndefault and γ as Nrecoop/Ncoop. These estimates are presented in Table 6. Now let us place the pairs of estimates from Table 6 on the plane (α,γ) together with the section of the hyperbolic curve falling within the unit square (Fig 5).
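The two ratio estimates are simple enough to state as code. In the sketch below, the function name and the counts in the usage example are hypothetical and are not taken from Table 5.

```python
def estimate_markov_parameters(n_coop, n_recoop, n_default, n_tolerant):
    """Estimate tolerance alpha and reciprocal cooperation gamma from move counts.

    n_coop:     partner's cooperative moves in the previous round
    n_recoop:   cooperative replies to those cooperative moves
    n_default:  partner's non-cooperative moves in the previous round
    n_tolerant: cooperative replies to those non-cooperative moves
    """
    if n_coop <= 0 or n_default <= 0:
        raise ValueError("both kinds of partner moves must be observed at least once")
    gamma = n_recoop / n_coop        # share of cooperation that was reciprocated
    alpha = n_tolerant / n_default   # share of defections that were tolerated
    return alpha, gamma

# Hypothetical counts, for illustration only.
alpha, gamma = estimate_markov_parameters(n_coop=40, n_recoop=30,
                                          n_default=60, n_tolerant=12)
print(alpha, gamma)
```

Each estimate is a conditional relative frequency, so it is undefined when the conditioning event (a partner's cooperation or defection in the previous round) never occurs; the sketch raises an error in that case.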
From the data shown in Fig 5 we can formulate the following results:
Let us measure the horizontal distance (in tolerance) from the experimental points to the hyperbolic curve to support results 3 and 4.
The experiments in which the distance to the theoretical equilibria of tolerance is less than 0.1 are highlighted in Table 7. Now let us check how significant the deviation in profit is from the theoretical equilibria. To do this, we will make the following calculation.
1. Let us calculate the point on the hyperbolic curve with the same reciprocal cooperation for each experimental point.
2. For each point on the hyperbolic curve let us calculate the values of equilibrium of the player's profit.
3. Let us treat each experimental point as a deviation of one of the players in tolerance to betrayal from the equilibrium, assuming that the other player adheres to the equilibrium.
4. Let us calculate the decrease in the profit of the deviating player as a percentage of his or her equilibrium profit.
The results of this calculation are presented in Table 8. For clarity, let us sort the values from Table 8 in descending order and present them graphically (Fig 6). Fig 6 shows that the maximum payoff loss relative to the equilibrium is only 1.7%, and that the loss is no more than 0.5% in 75% of cases. This means that nearly all the experiments are in a position of ε-equilibria for the iterated PD in Markov strategies. This result is fundamental.
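The calculation steps listed above can be sketched in Python. This is an illustration under stated assumptions rather than the authors' code: the payoffs are R = 5, S = 0, T = 10, P = 1; the stationary probabilities solve p_i = γ_i p_j + α_i(1 - p_j) with a product-form joint distribution; and the symmetric equilibrium curve is taken as 5α² - 14αγ + 9γ² + 14α - 10γ + 1 = 0, solved for α at a given γ. The experimental point in the usage line is hypothetical.

```python
import math

R, S, T, P = 5.0, 0.0, 10.0, 1.0   # assumed payoffs (see lead-in)

def payoff1(a1, g1, a2, g2):
    """Stationary payoff of participant 1 against participant 2."""
    d1, d2 = g1 - a1, g2 - a2
    den = 1.0 - d1 * d2
    p1 = (a1 + d1 * a2) / den
    p2 = (a2 + d2 * a1) / den
    return (R * p1 * p2 + S * p1 * (1 - p2)
            + T * (1 - p1) * p2 + P * (1 - p1) * (1 - p2))

def alpha_eq(gamma):
    """Step 1: equilibrium tolerance with the same reciprocal cooperation.

    Non-negative root in alpha of the assumed curve; valid for 1/9 <= gamma < 1.
    """
    disc = (1.0 - gamma) * (11.0 - gamma)
    return (-14.0 * (1.0 - gamma) + 4.0 * math.sqrt(disc)) / 10.0

def tolerance_distance(alpha_exp, gamma_exp):
    """Horizontal distance from an experimental point to the curve."""
    return abs(alpha_exp - alpha_eq(gamma_exp))

def loss_percent(alpha_exp, gamma_exp):
    """Steps 2-4: payoff loss of a player deviating in tolerance, in percent."""
    a_eq = alpha_eq(gamma_exp)
    u_eq = payoff1(a_eq, gamma_exp, a_eq, gamma_exp)        # both at equilibrium
    u_dev = payoff1(alpha_exp, gamma_exp, a_eq, gamma_exp)  # player 1 deviates
    return 100.0 * (u_eq - u_dev) / u_eq

# Hypothetical experimental point (alpha, gamma), for illustration only.
print(tolerance_distance(0.2, 0.8), loss_percent(0.2, 0.8))
```

Because the deviating player's payoff is stationary at the equilibrium tolerance, the loss grows only quadratically with the tolerance distance, which is consistent with small percentage losses even for points visibly off the curve.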

Conclusions
This research applied a QRE model to the results of an experiment designed to study cooperation and trust under the influence of social interaction. The resulting data were divided into two categories: before and after social interaction. The peculiarity of the results is that the data on cooperation in PD after Socialization are significantly different from those before Socialization. The calculations showed that the behavior of the participants before Socialization could be described with the QRE concept, which is an accepted deviation from the concept of Nash equilibrium that weakens the requirements of the best response. However, the standard QRE approach cannot be applied to describing the behavior of the participants after Socialization. Therefore, we have proposed a variant of the description of equilibria in the iterated PD in Markov strategies. For this repeated game in Markov strategies, we managed to explicitly find all the equilibria with a positive probability of reciprocal cooperation and tolerance to betrayal. The primary result is that all the experiments are in a position of ε-equilibria for the repeated PD game in Markov strategies. Open questions remain regarding the theoretical justification of the results of such games as the Trust Game and the Ultimatum Game, whose experimental data, within our research on the influence of social interaction, do not correspond to known game-theoretic models.