Abstract
Using a reinforcement-learning algorithm, we model an agent-based simulation of a public goods game with endogenous punishment institutions. We propose an outcome-based model of social preferences that determines the agent’s utility, contribution, and voting behavior during the learning procedure. Comparing our simulation to experimental evidence, we find that the model can replicate human behavior and we can explain the underlying motives of this behavior. We argue that our approach can be generalized to more complex simulations of human behavior.
Citation: Bühren C, Haarde J, Hirschmann C, Kesten-Kühne J (2023) Social preferences in the public goods game–An Agent-Based simulation with EconSim. PLoS ONE 18(3): e0282112. https://doi.org/10.1371/journal.pone.0282112
Editor: Jaume Garcia-Segarra, Universitat Jaume I Departament d’Economia, SPAIN
Received: November 3, 2022; Accepted: February 8, 2023; Published: March 15, 2023
Copyright: © 2023 Bühren et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data files for the agent-based model in EconSim are available from the open science framework: (https://osf.io/xhgfq).
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Economic theories assume (over)simplified models without complex interactions between economic actors and markets. The partial equilibrium model, for example, focuses on single markets without interdependencies between markets. Laboratory experiments can be used to test economic theories with humans. Experiments are a tool to examine a broad spectrum of economic problems, such as human behavior in competitive or cooperative games. The public goods game models dilemmas that modern society faces every day–from the cooperation of roommates or team members to the cooperation of the entire world in combating climate change [1]. In experiments, however, subjects decide in simplistic and unnatural environments. Hence, it seems doubtful that useful insights into the complexity of real economic behavior can be derived from experiments alone.
Agent-based simulations, on the other hand, can model this complexity. [2] set the foundation for agent-based modeling in the social sciences by showing that agents with “zero intelligence” can be sufficient to ensure a reasonably high allocative efficiency on markets [3]. However, as these models substitute human subjects with robot agents, realistic behavior has to be implemented to yield credible results. Experiments and agent-based models have already been successfully combined. [4] studied price expectations simultaneously with human and robotic subjects. Also, the results of laboratory experiments can be used to calibrate realistically behaving agents [5].
In theoretical economic models, the neoclassical homo oeconomicus is the standard assumption for the analysis of games and the determination of equilibrium strategies. By default, the homo oeconomicus is defined as a self-interest-maximizing individual, implying non-cooperative behavior in social dilemmas. However, this theoretical construct does not explain what is usually observed in experiments. In contrast to pure rational choice, participants cooperate at least in the first rounds of the public goods game [6]. Behavioral models of social preferences better describe the observed behavior. In these models, the outcomes of others influence the own evaluation of an allocation. However, assumptions that do not solely rely on self-interest do not necessarily contradict neoclassical theory [7]. The Fehr-Schmidt model [8] allows people to be self-interested and inequity averse. Furthermore, it distinguishes between advantageous and disadvantageous inequity. Bolton and Ockenfels’s model of inequity aversion [9] does not differentiate between these two forms. Moreover, [10] consider the efficiency of the distribution and the minimal outcome in a group.
While some people cooperate even in social dilemmas, such as conditional cooperators [11] or altruists [12], many people tend to be selfish. Consequently, it is important to study mechanisms that may increase cooperation levels in groups. A powerful tool to enforce cooperation in public goods games is the punishment of free-riders. Experimental economists introduced many different punishment institutions. Informal punishment means that individuals may choose to sanction other group members directly [13]. Formal punishment, on the other hand, imposes a sanctioning rule, which specifies the occurrence and the extent of punishment [14, 15]. [16] compared both schemes in a setting with endogenous punishment, i.e., subjects voted either for formal or informal punishment. Evolutionary game theorists studied the effects of punishing individuals in the public goods game with replicator dynamics (see [17] for a recent overview). For instance, [18] show that tax-based pure punishment (not helping others but punishing free-riders) has an evolutionary advantage in sustaining cooperation over pure punishment without tax. Moreover, [19] find that a formal sanctioning institution that combines reward and punishment promotes cooperation at a lower cost than either reward or punishment alone.
The experimental results of [20]’s public goods game with controlled group formation and endogenous formal punishment indicate heterogeneous behavior of different subject groups in similar situations. While groups of “low cooperators” need punishment institutions to enforce cooperation, “high cooperators” act in a socially beneficial way even without punishment. Social preferences may explain these observations. Whereas subjects can be assumed to follow their own social preferences when deciding how much to contribute to the public good, we can directly incorporate the expected preference structures in agent-based simulations. Thus, possible motives of human behavior in dilemma situations could be better explained by comparing the experimental and simulation results [21].
In our paper, we show that reinforcement learning can replicate human behavior. For that reason, we construct a model of social preferences and use it in an agent-based simulation of a public goods game with a setting similar to the design of [20]. Our model consists of different motives that may explain the behavior observed in the experiment. By weighting these motives, we create utility functions embodying different preference structures, which are incorporated into our agents. With reinforcement learning, homogeneous agents learn to play their optimal strategies in three punishment settings, replicating the behavior of the human subjects in [20]. Reinforcement learning is a trial-and-error learning paradigm based on the principle of operant conditioning, assuming that learners evaluate actions by trying them and getting rewards as feedback. Operant conditioning is a psychological framework for learning based on the law of effect by [22], which states that actions that reward the individual are executed more often. Over time, the learner explores good actions for given situations and exploits the rewards of these actions once satisfactory reward levels are reached. Reinforcement learning might be the closest to human learning among the various machine learning algorithms as it is based on biological learning [23]. Thus, we use it to replicate human behavior in our agent-based model. The game-theoretic or simulative reproduction of human behavior is not only interesting in theoretical economic models but also in practical applications such as wireless network sharing [24], traffic simulation [25], the simulation of escape routes in case of emergencies [26], or delay management in railway networks [27].
Further contributions of our paper are the replication of the public goods game of [20] in an agent-based simulation and the introduction of a newly combined social preference model. Additionally, we show that human behavior in public good experiments can be replicated using simple motives with only a few parameters. Suppose the experimental results constitute a Petri dish with some bacterial cultures flourishing in it. We do not know which motives were the exact drivers of the behavior. Instead, we reproduce several comparable Petri dishes with our model of social preferences and a reinforcement-learning algorithm as ingredients.
2. Effect of information on the demand for punishment in a public goods experiment
[20] combine endogenous punishment institutions in the public goods game (“institution formation game”) with a sorting mechanism based on participants’ previous contributions in a one-shot public goods game (“sorting game”). Thereby, homogeneous groups of participants with similar ex-ante cooperation levels play the public goods game, in which they can vote for punishment institutions. To examine the role of information on the demand for punishment, the authors implemented two treatments: Sorted and Sorted-Info. In both treatments, group members in the institution formation game are sorted–they contributed similar amounts of tokens in the sorting game. Thus, their initial level of cooperativeness is similar. In Sorted, players only learn about their group members’ contributions during the institution formation game; in Sorted-Info, they are additionally informed about their group members’ initial contributions in the sorting game.
2.1 Experimental design
The experiment of [20] consisted of two games–the sorting and the institution formation game. In the sorting game, subjects were randomly shuffled into groups of five to play a standard one-shot public goods game. In the game, the endowment was set to 350 tokens, and the participants simultaneously decided on their individual contribution gi ∈ [0,350] to the public good. The marginal product of the public good is a = 1.5 and the marginal per capita return a/n = 1.5/5 = 0.3. These parameters were chosen to guarantee the basic characteristics of the public goods game (a > 1 and a/n < 1) and appropriate payoffs for the participants. In this case, the payoff function can be written as

πi = 350 − gi + 0.3 ⋅ Σj gj (1)

where the sum runs over the contributions of all five group members.
The endowment in the sorting game was set at 350 tokens because the authors tried to induce enough variation in the sorting game contributions and a less obvious focal point in the middle of the scale–as compared to the more standard 100 tokens, with which initial contributions may be clustered around 50 (see, e.g., [28]). Based on the contributions in this sorting game, groups of like-minded cooperators were formed for the institution formation game. High cooperators contributed more than 250 tokens, low cooperators less than 150 tokens, and middle cooperators between 150 and 250 tokens.
In the Sorted-Info treatment, participants received more information on their group members of the institution formation game than participants in the Sorted treatment. In the Sorted-Info (but not in the Sorted) treatment, participants knew the sorting game contributions of their group members in the second game. This second game consisted of six phases with four rounds each and an endowment of E = 100 tokens every round. In the institution formation game, the authors used this standard endowment to make their results better comparable to previous literature. After each round, the contributions of the group members are displayed in random order. At the beginning of every phase, the groups vote on a punishment institution. The payoff function for a given round in a phase without punishment is

πi = 100 − gi + 0.3 ⋅ Σj gj (2)
If three or more players voted to play with punishment, the group had to decide between mild and severe punishment. In every round, every group member had to pay an institutional fee of 5 for mild or 20 for severe punishment, deducted after the contribution stage. Every person contributing less than the maximum of 100 tokens to the public good automatically had to pay a fine of 50 under mild punishment and 90 in the severe punishment scheme. With punishment, the payoffs are calculated by

πi = 100 − gi + 0.3 ⋅ Σj gj − f(k) − sgn(100 − gi) ⋅ p(k) (3)

where f(k) is the fee and p(k) the fine in institution k. The signum function sgn(⋅) equals 1 when its argument is strictly positive (i.e., the contribution gi is less than 100) and 0 otherwise. Table 1 summarizes the punishment institutions.
Note that mild punishment does not solve the dilemma structure of the game, even though it decreases the difference in payoff between complete defection and complete cooperation. In contrast, the payoff maximizing strategy under severe punishment is to make a full contribution. These formal punishment institutions are similar to [14]. Even the mild punishment option has the potential to increase cooperative behavior because a fixed fine is imposed on players who contributed less than their whole endowment. However, the net effect of the institutions on payoffs is less clear because their implementation is costly.
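To make the institutions’ incentives concrete, the following Python sketch evaluates the payoff of Eq (3) for the parameters above; the function name, the example group, and the comments are our own illustration and not part of the original experimental software.

```python
# Minimal sketch of the stage-game payoffs (Eqs. 1-3); names and the example
# group are illustrative, not taken from the original implementation.
from typing import Sequence

MPCR = 1.5 / 5  # marginal per capita return a/n = 0.3

def payoff(contributions: Sequence[float], i: int, fee: float = 0.0,
           fine: float = 0.0, endowment: float = 100.0) -> float:
    """Payoff of player i; fee = fine = 0 reproduces Eq. 2 (Eq. 1 with endowment = 350)."""
    base = endowment - contributions[i] + MPCR * sum(contributions)
    penalty = fine if contributions[i] < endowment else 0.0  # sgn(100 - g_i) * p(k)
    return base - fee - penalty

# One free-rider among four full cooperators:
group = [0, 100, 100, 100, 100]
print(payoff(group, 0, fee=5, fine=50))     # mild:   165 -> free-riding still pays
print(payoff(group, 1, fee=5, fine=50))     # mild:   115 for a cooperator in this group
print(payoff(group, 0, fee=20, fine=90))    # severe: 110
print(payoff([100]*5, 0, fee=20, fine=90))  # severe: 130 -> full contribution is optimal
```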
2.2 Results
Groups of high cooperators achieved high contribution levels irrespective of the institution. Because of the costs of punishment, the average payoff is lower for cooperative groups with punishment than without. The contributions of middle and especially of low cooperators were significantly higher with punishment than without. However, middle cooperators only earned significantly more with punishment in Sorted and low cooperators in Sorted-Info (see Fig 2 in [20]).
As expected, in the information treatment, low cooperators had the highest demand for punishment. They implemented severe punishment significantly more often than high and middle cooperators (see Fig 5 in [20]). High cooperators had by far the lowest demand for punishment in Sorted-Info. However, without the information on the groups’ initial level of cooperation, the voting behavior throughout the institution formation game is detrimental to most of the groups’ payoffs. This finding is remarkable because, also in Sorted, subjects observe the contributions during the second game but do not seem to learn from them. Even when low cooperators implemented severe punishment, they did not behave wisely under this institution. The main reason is that some of the low cooperators are willing to sacrifice money as long as they earn more than others in their group.
3. An agent-based model of the public goods game in EconSim
This section introduces an agent-based simulation of the public goods game in EconSim, a framework for creating complex models, to investigate whether agents with social preferences and reinforcement learning produce results similar to those of the human participants in the experiment. We propose a model of social preferences based on different forms discussed in the previous literature (see Section 1) and on the possible motives of subjects derived from the experimental observations.
3.1 The modular framework EconSim
EconSim allows building complex agent-based models from predefined and highly adjustable components [29, 30]. It distinguishes between different types of agents: households, firms, states, and central banks. Households may represent consumers who focus on maximizing their utility, while firms represent the producers in an EconSim model. The state’s role is to deliver the legal framework of the simulation environment, which can be endogenous through agents’ voting decisions. Central banks may be used to control the supply of money, which can serve as a general numéraire.
Goods are pivotal for a simulation with EconSim. They are not only consumables or production units but also indicate a group affiliation. They can be traded on markets or transformed according to transformation plans that define how the good is produced, stored, recycled, transformed, or scrapped. The model can incorporate an arbitrary number of goods with different relations to other goods. For instance, two goods representing raw materials may transform into a consumable good that is consumed by households to gain utility. Trades are executed on markets that connect buyers and sellers using a set of market mediation rules. As another important part of the simulation, institutions can be implemented that influence the rules of the simulation environment, the behavior of agents, and especially the action space of state agents.
A public goods game could be implemented in simpler agent-based simulations. However, we chose EconSim because this framework provides powerful, ready-made algorithms that have been thoroughly verified and validated. It is easy to change the properties and the decision-making in this simulation software. Furthermore, the modular architecture of EconSim allows us to expand our model with additional elements. For instance, we could introduce a revolution mechanism into the model or extend the game to a more complex model that endogenizes the agents’ endowment. EconSim provides tools for future research to easily adapt our model, which is calibrated with experimental data.
3.2 Reinforcement learning
Traditional economic theory assumes rational decision-makers, an assumption challenged by experimental and empirical evidence. Reinforcement learning, however, applies bounded rationality by simulating non-optimizing but satisficing behavior similar to human decision-making [31]. It does not require specific information about the environment or other players. The agent chooses an action in a given state based on expected rewards. According to the action, the state of the simulation changes, and the reward function calculates the agent’s factual reward. As a consequence, the agent learns by updating the expectations of the rewards [21, 32]. “Reward” is the usual term for a measure of success in the context of reinforcement learning. Later, we will use utility function values as rewards. Payoffs in the public goods game are not to be misunderstood as rewards because the utility function value may differ.
Agent-based models provide an environment, in which heterogeneous agents interact with each other [33]. Each period in an agent-based simulation can be viewed as one state of a set of possible states. Agents can choose an action of a set of possible actions, which leads to a reward at the end of a period. Reinforcement learning enables agents to optimize their expected reward over time. Similar to other machine learning techniques, the modeler does not have to implement pre-determined rules of behavior, which increases the realism of the simulation. By varying the reward function and/or the learning parameters, different types of agent behavior can be created. Reinforcement learning allows for adaptive behavior and learning on the individual level. Therefore, its use in an agent-based model to simulate decision-making and learning is convenient. According to the review of [34], reinforcement learning is the most prevalent machine learning method in agent-based modeling. Either single agents learn to maximize their own goal, or multiple agents learn to adapt towards a common target [35].
In our agent-based model, we use a simplified version of Roth and Erev’s reinforcement learning algorithm [36]–a variant of the normalized exponentially smoothed attractivity-based strategy selection algorithm (NESASS) by [30]. Every possible action possesses an attractivity value, which determines the probability of choosing this action and is initialized either randomly or with a pre-defined value. The following function defines the updating of the attractivity q(a, t) of the chosen action at the end of period t and, thus, the learning process:

q(at, t) = αold ⋅ q(at, t-1) + αnew ⋅ r(st, at) (4)
The new attractivity value q(a, t) for the chosen action at is a weighted sum of the old value q(a, t-1) and the received reward r(st, a) of the action. In this algorithm, agents do not have information about the current state st. The parameters αnew and αold are the weighting factors. If αnew + αold = 1, this update procedure follows a classic exponential smoothing. Actions that are not chosen are devalued in attractivity, controlled by the depreciation parameter φ. Note that reward and attractivity are different concepts. Alternatively, the attractivity value of an action can be interpreted as the agent’s reward expectation for this particular action. Rewards are the metric for the experienced success in a particular period.
Based on the attractivity values, the probability p(a, t) that action a is chosen at any time t is defined by the following softmax function:

p(a, t) = exp[(q(a, t) − qmax) / μ] / Σa′∈A exp[(q(a′, t) − qmax) / μ] (5)
The variable qmax denotes the attractivity value of the currently most attractive action, and A is the set of available actions. To dynamically adapt the learning process, the temperature parameter μ regulates the sensitivity for choosing the best action with regard to its attractivity. With low values of μ, the agent tends to greedily choose the best action, even when it is only slightly better. The limit of p is

lim μ→∞ p(a, t) = 1/|A|, (6)

which is equivalent to drawing from a discrete uniform distribution. Thus, high values of μ indicate more explorative behavior. The μ-value should depend on the learning process and should account for uncertainty. In our model, we update μ(t) as follows: (7)
This update scheme is active if, in the last two periods t and t-1, there is exactly one action with the highest attractivity value; otherwise μ(t) = μ(t-1). Therefore, it can be active at the end of period 2 for the first time, when at least one updated attractivity is higher than the initial value. For instance, if we initialize the attractivity vector with 50 and the chosen actions achieve rewards of 40 and 45, the update scheme is not active. This is because there are multiple actions with the same (depreciated) attractivity value of 50 ⋅ (1 - φ)², assuming a low depreciation parameter. If the best action in period t is the same as in period t-1, μ decreases because no new best action has been found. If this happens regularly, the action space seems to be sufficiently explored. If not, there is uncertainty about the optimum within the action space, and μ rises. Additionally, we introduce the parameter ε (epsilon greedy). The product ε ⋅ μ(t) delivers a small probability with which the agent chooses from a discrete uniform distribution, ignoring the current attractivity values. In a dynamic environment, where the best actions may change over time, ε induces the exploration of new actions even if the optimum seems to have been found. As we want to achieve a stable convergence without random decisions, we multiply ε by μ(t) to reduce explorative behavior once agents’ learning has converged (low μ(t)). The ε-greedy parameter is not to be misunderstood as a control parameter for agents to act greedily. It is a means to manage the exploration-exploitation problem in a reinforcement-learning approach. With no or too low an ε-greedy value, learning might converge too fast into a local optimum because the strategy space is not explored well enough. If the ε-greedy value is too high, we would expect no convergence at all because random decisions dominate. See [23] for a detailed description. We provide a pseudocode of our algorithm in S1 File.
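To complement the pseudocode in S1 File, the following Python sketch illustrates the core of the algorithm: the attractivity update of Eq (4), the softmax selection of Eq (5), and the ε ⋅ μ(t) exploration step. The adaptive μ-update of Eq (7) is omitted because it is only described qualitatively above, and all parameter values are illustrative placeholders rather than the settings of Table 4.

```python
import math
import random

class NesassLearner:
    """Minimal sketch of a NESASS-style learner (Eqs. 4-6); mu is kept fixed here
    and all defaults are illustrative placeholders, not the paper's settings."""

    def __init__(self, actions, q_init=50.0, alpha_new=0.7, alpha_old=0.3,
                 phi=0.01, mu=1.0, epsilon=0.1):
        self.q = {a: q_init for a in actions}   # attractivity value per action
        self.alpha_new, self.alpha_old = alpha_new, alpha_old
        self.phi = phi                          # depreciation of unchosen actions
        self.mu = mu                            # temperature: high = more exploration
        self.epsilon = epsilon                  # epsilon-greedy factor

    def choose(self):
        # With probability epsilon * mu(t), ignore attractivities and draw uniformly.
        if random.random() < self.epsilon * self.mu:
            return random.choice(list(self.q))
        # Softmax over attractivities, normalized by the current maximum (Eq. 5).
        q_max = max(self.q.values())
        actions = list(self.q)
        weights = [math.exp((self.q[a] - q_max) / self.mu) for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

    def update(self, chosen, reward):
        # Eq. 4: exponential smoothing for the chosen action, depreciation for the rest.
        for a in self.q:
            if a == chosen:
                self.q[a] = self.alpha_old * self.q[a] + self.alpha_new * reward
            else:
                self.q[a] *= (1.0 - self.phi)

# Example: one agent learning over contribution levels 0, 10, ..., 100.
learner = NesassLearner(actions=range(0, 101, 10))
g = learner.choose()
learner.update(g, reward=42.0)  # the reward would be the utility of Eq. 8 in our model
```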
With the described reinforcement-learning algorithm, agents learn to choose their cooperation level according to the resulting rewards (payoffs) of their different decisions. In Section 3.6, we apply this learning mechanism to the voting on the punishment institutions. Reinforcement learning is also known as operant conditioning and describes a trial-and-error process until a sufficiently good decision in a given situation is discovered. This process of exploitation and exploration is regularly found in animals and humans [37]. Thus, the algorithm used can mimic human behavior.
3.3 The setup in EconSim
S1 Fig shows the graph of our agent-based model. It replicates the experimental design of [20] by using households as agents, a public good, and punishment institutions. We use three types of agents to resemble the outcome of the experiment’s sorting mechanisms. The model consists of a standard good serving as tokens endowed by the households. At the beginning of each period, the households decide how much to contribute to the public good. Based on the contributions, payoffs and objective function values are calculated. Moreover, the simple majority of a group of five households decides if a punishment institution is implemented and (if yes) whether this institution is mild or severe. Fig 1 shows a flow chart of the essential modules of our simulation model.
3.4 Our model of social preferences
The goal of our agent-based simulation is to show that a simple learning algorithm combined with a model of social preferences can replicate the experimental findings of [20]. We developed this model of social preferences inspired by possible motives that may explain subjects’ behavior in the experiment. It describes an agent by a weighted average of five different motives, defined by the utility functions presented in Table 2. A purely selfish agent only looks at its own payoffs. The utility of an inequity-averse agent is high if all agents have relatively equal payoffs. Competitive agents want to earn more than others [38]. The fourth motive aims to maximize social welfare, and the utility of altruistic agents simply increases with their contributions. We normalize the utilities to a range from 0 to 100 to make them better comparable across motives and types of agents.
In line with other models of social preferences, agents may combine different motives. The model of [8] combines selfishness as well as advantageous and disadvantageous inequity aversion. [5] combine selfishness with social welfare and reciprocity. To keep it simple, we assume an additive combination of different motives, where wip is the weight of a social preference p in agent i. Regardless of the motives, the fee f(k) causes every agent to have a clear tendency to avoid punishment when it is not necessary (see Section 2.1).
Ui = Σp wip ⋅ uip, (8)

where uip denotes the normalized (0–100) utility of motive p from Table 2. This objective function will serve as the reward function. Thus, it is a measure of an agent’s success in single rounds. The reinforcement-learning algorithm is used to make agents decide and learn autonomously based on success. In our simulation, agents decide about their contribution level in each round. The attractivity of a contribution is its expected utility function value in the chosen punishment institution. This attractivity is updated in every round. The new expected utility of the chosen contribution level is the weighted sum of the old and the new value. All other contributions’ expected utility values are depreciated by one percent in this round. Thus, our agents remember old experiences but also forget about the success of contribution levels that were not played for a while. Moreover, after round 10, agents decide with a 20 percent probability by imitating the previous round’s decision of another agent. With the resulting objective function value, the attractivity of the imitated decision is updated. Imitation of others is also a form of exploration. Agents may learn from other agents’ contributions by imitating a random co-player. In some cases, it will not be helpful to only learn from the actions that lead to the highest reward for another agent. Examples are dynamic and unstable environments or agents’ heterogeneous preference structures. In our simulation, when severe punishment is active, an agent with a contribution of less than 10 will have a higher payoff than agents contributing the efficient amount of 100. Only learning from the most successful agent would be inefficient in this case.
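The following minimal sketch indicates how the weighted objective of Eq (8) and the imitation rule could be implemented; the motive utilities enter as placeholder inputs because the functional forms of Table 2 are not reproduced here, and all names and defaults are illustrative assumptions.

```python
import random
from dataclasses import dataclass
from typing import Dict, List

def combined_utility(weights: Dict[str, float], motive_utils: Dict[str, float]) -> float:
    """Eq. 8: additively weighted motives; motive_utils holds the normalized (0-100)
    utilities of the five motives (Table 2) realized in the current round."""
    return sum(weights[p] * motive_utils[p] for p in weights)

@dataclass
class Agent:
    weights: Dict[str, float]   # w_ip per motive, e.g. a row of Table 3
    last_contribution: int = 0

def choose_contribution(own_rl_choice: int, co_players: List[Agent],
                        round_no: int, p_imitate: float = 0.2) -> int:
    """After round 10, copy a random co-player's previous contribution with
    probability p_imitate; otherwise use the reinforcement-learning choice."""
    if round_no > 10 and random.random() < p_imitate:
        return random.choice(co_players).last_contribution
    return own_rl_choice
```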
3.5 Exogenous punishment institutions
The agents in our simulation represent the three types described in [20]: low, middle, and high cooperators. Using their objective function, agents can be modeled by choosing the weights wip for every given social preference motive (see Tables 2 and 3). To predict the agents’ tendency for cooperation, we calculate their marginal utility of an additional unit of contribution for a given weight combination in Table 3. In this analysis, we neglect the effect of inequity aversion because we assume homogeneous agents. For them, inequity aversion will not change the game-theoretic prediction because all agents will maximize their utility at the same contribution level. However, inequity aversion affects our simulated results. See Section 3.7 for a detailed description. For the social preference motives, we chose the weights of Table 3 such that low cooperators have a negative marginal utility and high cooperators have a positive marginal utility. In contrast, the middle cooperator has no clear preference for or against cooperative behavior.
During our simulation, weights are kept constant. The main setup follows the experimental design of the institution formation game by [20]. Instead of 24 rounds, however, our simulation runs for 300 periods to allow agents’ learning based on reinforcement learning. Table 4 summarizes the parameter settings for the reinforcement-learning algorithm. As social preferences are directly modeled into the agents, the sorting game of [20] is not needed. For the simulations, we build different types of agents, based on our model of social preferences, who may behave similarly to the subjects of the experiment. Similar to [20], we create groups of five agents following the same type of cooperation. Our simulation resembles the Sorted treatment of [20]–agents are sorted into groups of “like-minded people” but are not informed about any sorting criteria. In contrast to their design, we first exogenously impose one of the three punishment institutions before introducing endogenous institutions (see Section 3.6). Exogenous punishment institutions are standard in public goods experiments (see [39] for a meta-analysis); endogenous punishment institutions, in which subjects vote on the institution, are relatively new [16, 40]. The datasets of our simulation results are published at https://osf.io/xhgfq, and the results of the different types of agents are described below.
Rational low cooperators.
Our simulation starts with solely self-interested agents (type i = 1, rational low cooperators) whose only interest is to maximize their own payoff without looking at the results of others. We use this first step to validate the program by comparing the results to game-theoretic predictions. Furthermore, we compare the results to the low cooperators of [20]. We compare the average contributions and payoffs under the different institutions in the experiment with our simulation: a single run and the arithmetic mean of 100 simulation runs (Figs 2 and S3 and Table 5). Learning is highly dependent on chance as agents decide randomly at the beginning and have to find their individual optima. Furthermore, agents might be tricked by the dynamics of the simulation. If an agent plays its personal optimum, but all others randomly contribute at a relatively low level, this agent will undervalue this choice. Subjects in the experiment do not have an underlying objective function, or at least it is not directly observable; but the patterns emerging from their behavior may be reconstructed by using only simple motives like the ones described above.
Arithmetic mean of 100 simulation runs. In gray: confidence interval (95%) around the mean.
S3 Fig shows the contributions and payoffs of a group of five rational low cooperators, while Fig 2 shows the average contributions and payoffs after 100 simulation runs. In the beginning, agents decide randomly. Yet they quickly learn that lower contributions lead to higher individual payoffs in the setting without punishment. This behavior, however, leads to low payoffs of around 100 on average for every group member, which corresponds to the Nash equilibrium. Similarly, in the experiment, low cooperators in Sorted earned an average payoff of 108 without punishment.
Under mild punishment, we observe similar results as in the simulation without punishment. This is because the Nash Equilibrium does not change with mild punishment–it is designed as a non-deterrent sanction scheme [14]. Agents still maximize their individual payoff by free-riding. Due to the costs of punishment (5 tokens fee and 50 tokens fine), the agents’ average payoff is reduced to 45.31 tokens in later periods using this institution. Similarly, low cooperators in the Sorted treatment of [20] earn less with punishment than without–but not significantly less and, with an average of 101 tokens in the mild institution, much more than agents in the simulation.
The individual optimum with severe punishment is identical to the social optimum, namely, every agent fully contributes to the public good. Even though some agents seem to have problems finding the optimal contribution level, imitation leads to the optimum with a higher average payoff than without punishment. Note that imitation would not necessarily help if agents learned from the best agent only. Agents with a contribution lower than ten tokens despite the punishment get a higher payoff than fully contributing ones. Therefore, only these agents would be imitated, resulting in a worse payoff for the imitating individual. This is why, in the case of imitation, the last decision of a random agent is copied. In the end, agents earn an average payoff of 129.99 (reduced by the fee of 20) and commit to complete cooperation. In the experiment, low cooperators in Sorted contributed only 84 out of 100 under the severe punishment institution. They even earned slightly less than under the no punishment institution (payoffs of 106 vs. 108). The type rational low cooperator behaves as a homo oeconomicus after learning has converged. Thus, our model can match classical economic theory.
High cooperators.
Next, we try to mimic the behavior of the high cooperators in the experiment of [20]. In the simulation, this type of agent (i = 2) uses an objective function that consists of selfishness and altruism, social welfare, and some inequity aversion (see Table 3). This agent type resembles subjects who are interested in their own payoff. However, altruistic preferences, concerns for social welfare, and inequity aversion dominate the utility function of type-2 agents. S4 Fig shows the development of contributions and payoffs of five high cooperators in a single run, while Fig 3 represents the average results of 100 simulation runs with this setting. Even though agents do not necessarily learn to contribute the entire endowment, there is a clear tendency to choose a contribution level slightly below full cooperation, leading to an average payoff of 146.96 without punishment. Likewise, in the experiment, it was 141 tokens. Under mild punishment, the high cooperator agents increase their contribution to almost full cooperation with 98.20 tokens. However, this small increase does not compensate for the costs of punishment, leading to a slight decrease in payoff, in line with the experimental findings. Severe punishment reduces payoffs further, to around 130 for high cooperators in the simulation and 122 in the experiment.
Arithmetic mean of 100 simulation runs. In gray: confidence interval (95%) around the mean.
Middle cooperators.
Between the types of low and high cooperators, we calibrated the type of middle cooperator (see Table 3). This type of agent is more self-interested, less altruistic, and less inequity-averse than the high cooperator. Yet agents of type 3 are less self-interested than rational low cooperators. Figs 4 and S5 show the simulation results for these agents in a single simulation run and, on average, after 100 runs. After the learning periods, their average contribution is around half of the endowment under no punishment. However, middle cooperators seem to agree less on a cooperation level than the other types. Thus, their payoffs are heterogeneous with an average of 128.01 tokens. In the experiment, it was only 114 tokens. Under mild and severe punishment, the agents of the middle cooperator type (almost) fully cooperate. In the experiment, punishment also leads to significantly higher payoffs for middle cooperators in Sorted, but not to full cooperation.
Arithmetic mean of 100 simulation runs. In gray: confidence interval (95%) around the mean.
Competitive low cooperators.
While the middle and high cooperators in the simulation reflect the behavior observed in the experiment quite well, agents of type i = 1 cannot explain the suboptimal contributions of low cooperators in Sorted. Indeed, this was one of the most challenging results of [20]. Some participants seem to be willing to earn more than others in their group regardless of the total amount of the payoffs. This behavior inspired our competitive type of agent (i = 4). These agents are still mainly self-interested, similar to the type of low cooperator. However, they are additionally described by competitive traits and modest social welfare preferences. Figs 5 and S6 show that the competitive low cooperators free-ride under any institution at the end of the learning process. Under no and mild punishment, the results are similar to the rational low cooperator. Yet under severe punishment, after a longer learning process, agents hurt themselves trying to come out best. This simulation result is in line with the assumption of competitive preferences of some of the low cooperators in [20].
Arithmetic mean of 100 simulation runs. In gray: confidence interval (95%) around the mean.
To conclude, our model of social preferences allows modeling agents that behave as if the actual participants possessed the same preferences. The agent-based simulation qualitatively reproduces the aggregate results of the cooperator classes in the experiment. Thus, our model meets the requirements of validation level 1 described by [41]. To explain the behavior of participants categorized as low cooperators, however, we need two classes of agents–rational and competitive low cooperators.
3.6 Endogenous punishment institutions
Following the simulation with fixed exogenous institutions, we extended the simulation by agents’ voting decisions on the punishment institutions. Similar to [20], agents vote on whether they want to introduce punishment and whether the punishment scheme should be mild or severe. The simple majority in a group decides. In contrast to the experimental design, agents make these two voting decisions simultaneously, i.e., we also know whether they prefer mild or severe punishment when the majority of the group votes against punishment. Furthermore, the simulation starts with a 1500-period learning phase, in which the institutions are determined randomly every fourth round. This procedure ensures that agents “understand” every institution before they vote on it every fourth round of the following 100 rounds. This resembles the experimental instructions, which try to ensure subjects’ understanding through examples and incentivized control questions. We use the same reinforcement-learning parameterization as in Section 3.5 (Table 4), except for reducing the attractivity depreciation parameter to φ ≔ 0.001 and the probability to imitate to pimit ≔ 0.05. The first adaption guarantees successful learning in a larger strategy space. The second is needed for our endogenous punishment institutions because our previous imitation mechanism is of limited use when institutions change.
The agents’ decisions on the institutions are based on the expected reward (more precisely, utility) in each institution. We use the same reinforcement-learning routine that is used to determine the agents’ contribution. There are four different institution voting choices (no punishment/mild punishment if the majority voted for punishment; no/severe; yes/mild; yes/severe). The expected utility of the institutions is formed by:

qinst(t) = Σa∈A p(a, t) ⋅ q(a, t) (9)
For each institution choice, the expected utility qinst(t) is formed by summing the product of probability p(a, t), given that the institution is active, and the expected reward q(a, t) for each possible contribution level. Following that, the probability an agent chooses an institution is given by: (10)
Agents do not learn from their voting decisions directly. Instead, they update the expected utility of their contributions after each period only for the active institution.
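A small sketch of the voting mechanism described above: Eq (9) aggregates the contribution-level attractivities into an institution attractivity, and, since the paper states that the same reinforcement-learning routine is reused, we assume that the choice among the four voting options follows a softmax analogous to Eq (5); names and data structures are illustrative.

```python
import math
from typing import Dict

def institution_attractivity(p_contrib: Dict[int, float], q_contrib: Dict[int, float]) -> float:
    """Eq. 9: expected utility of one institution, summing over all contribution
    levels the choice probability (given that institution) times the expected reward."""
    return sum(p_contrib[g] * q_contrib[g] for g in p_contrib)

def institution_choice_probs(q_inst: Dict[str, float], mu: float) -> Dict[str, float]:
    """Assumed analogue of Eq. 5 for the four voting options (Eq. 10)."""
    q_max = max(q_inst.values())
    w = {k: math.exp((q - q_max) / mu) for k, q in q_inst.items()}
    total = sum(w.values())
    return {k: v / total for k, v in w.items()}
```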
Table 6 presents the average results of the EconSim model with endogenous institutions in the last 50 periods after 50 simulation runs. Competitive low cooperators did not implement punishment and only earned around 100 tokens. Rational low cooperators, in contrast, chose severe punishment in 82% of the cases (18% no punishment) and managed to get a payoff of 125, middle cooperators implemented mild punishment (78%; 22% no punishment) and achieved 144 tokens, and high cooperators nearly achieved the highest possible payoff (of 150 without punishment) by choosing no punishment (80%; 20% mild punishment).
Thus, the agent types learned to implement the institutions that maximize their objective functions. This contradicts the institution implementation in [20]’s Sorted treatment but corresponds to the Sorted-Info treatment (see Fig 5 in [20]). Whereas in Sorted-Info, over the 24 rounds, subjects learn to choose institutions that help them, this learning procedure is extremely limited in Sorted. Note that these two treatments differ only in the additional information about the sorting game contributions; subjects get information about contributions either 24 or 25 times. The initial contribution level seems to be essential for the human learning process. Our agents’ learning process is effective without this information. Yet, in total, they have the chance to learn in 1600 instead of 24 periods.
3.7 Sensitivity analysis
To provide some insights into the influence of the parameters of the reinforcement-learning algorithm and to find a good parameter combination for the replication of the experiment, we made some preliminary simulation runs with five agents of type i = 1 (see Tables 2 and 3). This type of agent corresponds to the selfish homo oeconomicus. In line with our simulation studies in Sections 3.5 and 3.6, we run the preliminary study with 300 periods per run and, to have some statistical validation, 100 simulation runs per parameter combination. Unlike in Sections 3.5 and 3.6, punishment is deactivated. Thus, the game-theoretic prediction is zero contributions to the public good. We label the learning process in a single run as successfully converged when μ(t) (see Section 3.2) of each agent reaches the defined minimum threshold of 0.01 and each agent contributes at most five tokens for at least ten periods. S1 Table presents different parameter combinations and their respective performances. Parameter combinations 1–17 are conducted with a one-at-a-time approach to explore the effects of single parameter changes relative to the baseline combination 0, assuming not too much interaction between the parameters. Parameter combinations 18–20 are variations in multiple dimensions.
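The convergence criterion just described could be checked as in the following sketch, under the assumed reading that both conditions must hold simultaneously for at least ten consecutive periods; the data layout (one list of per-agent values per period) is our own choice.

```python
from typing import List, Sequence

def converged(mu_by_period: Sequence[List[float]],
              contrib_by_period: Sequence[List[float]],
              mu_threshold: float = 0.01, max_contrib: float = 5.0,
              window: int = 10) -> bool:
    """True if, for at least `window` consecutive periods, every agent's mu(t) is at
    or below the threshold and every agent contributes at most `max_contrib` tokens."""
    streak = 0
    for mus, contribs in zip(mu_by_period, contrib_by_period):
        if max(mus) <= mu_threshold and max(contribs) <= max_contrib:
            streak += 1
            if streak >= window:
                return True
        else:
            streak = 0
    return False
```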
Imitation, namely copying another agent’s action, seems to be a powerful mechanism leading to fast convergence (see S2 Fig). The agents quickly find the optimal or close-to-optimal contribution levels with high reliability. Combinations with a high probability of imitation are 12, 15–18, and 20. In particular, imitating the best agent leads to good results (13–15). As we pointed out in Section 3.4, copying the best agent’s contribution may yield undesired outcomes in simulation runs with severe punishment. This is why we opted for the imitation of random agents in the last two sections. The weights in the exponential smoothing of the attractivity also seem to have a strong influence on convergence: simulation runs perform better if the models put more weight on new values. A lower attractivity depreciation also leads to better performance. Parameter combination 20 is the combination that we used in the previous sections because it performed fast and stably.
To gain some understanding of the agent’s utility function with additively connected social preferences, we calculate the marginal utility of the contribution for an agent endowed with only one of our five motives (see Table 3). Selfishness and competitiveness both have a strictly negative marginal utility, meaning that an agent with only these motives will show uncooperative behavior. In contrast, social welfare and altruism yield strictly positive marginal utilities. Agents with only these motives will be cooperative. The utility of an inequity-averse agent depends on the average contribution level of the other agents in the group: the marginal utility will be positive if the others’ contribution level is higher than the agent’s own contribution, and vice versa.
Combining only the motives with marginal utilities independent of other agents’ contributions, we can calculate the decisions of a homo oeconomicus characterized by this utility function. As we assume bounded rationality and have limited computing time, our agents might not find the optimum in every case, especially when the marginal utility of the contribution approaches zero. Moreover, including inequity aversion makes it difficult to tell whether an agent will contribute at low or high levels. This is why we provide some simulation runs with varying weights of inequity aversion (ui2) and constant weights for the other motives (ui1 = 0.7, ui3 = 0, ui4 = 0, ui5 = 0.3). We calculate a slight tendency towards uncooperative behavior with a marginal utility of -0.03. However, our simulation results show medium contribution levels. With an increasing weight of inequity aversion, the resulting contributions at the end of our simulation runs are close to the game-theoretic prediction (Fig 6). In similar simulation runs with a more cooperative tendency (ui1 = 0.66, ui3 = 0, ui4 = 0, ui5 = 0.34; marginal utility 0.03), we observe high contribution levels that decrease with higher weights of inequity aversion (Fig 7). For both parameter constellations, cooperation declines over time. This raises the question of whether inequity aversion yields a tendency for defection in our model. However, the simulation runs with a purely inequity-averse agent (ui2 = 1) suggest that there is no such tendency (Fig 8).
Selfishness: 0.70, competitiveness: 0, social welfare: 0, altruism: 0.30. Marginal utility of the own contribution: -0.03. In gray: confidence interval (95%) around the mean.
Selfishness: 0.66, competitiveness: 0, social welfare: 0, altruism: 0.34. Marginal utility of the own contribution: 0.03. In gray: confidence interval (95%) around the mean.
Selfishness: 0, inequity aversion: 1, competitiveness: 0, social welfare: 0, altruism: 0. Marginal utility of the own contribution: 0. In gray: confidence interval (95%) around the mean.
When we introduce punishment, the game-theoretic properties change only for the selfish motive because it is the only motive that depends on the payoff. Punishment adds a discontinuity into the function ui1 and decreases the partial utility of the selfishness motive by the penalty level p(k) in case of gi < 100. A rational agent would ignore the punishment if its utility level is higher for some contribution level lower than 100, or formally ∃ g’ ∈ {0, 1, …, 99} with Ui(gi = g’) > Ui(gi = 100). As we weight the motives to get a combined utility function Ui, we can calculate whether punishment is high enough to affect the agent at a specific contribution level gi = 100 - h with h ∈ {1, 2, …, 100}. Rational agents will cooperate when there is no h for which the following inequality is true:

Ui(gi = 100 - h) > Ui(gi = 100) (11)
Thus, we can predict whether agents will be influenced by punishment or not. As we assume homogeneous agents, the game-theoretic properties will not change with varying weights of inequity aversion because every agent will prefer the same level of contribution. It is sufficient to analyze the utility at the contribution level gi = 0. In our simulation model with boundedly rational agents, the game-theoretic prediction seems to attract behavior. Still, this depends on the strength of attraction (marginal utility of contributions) and the dynamics of learning.
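A short sketch of the deterrence check implied by Eq (11); the utility argument is a placeholder for the agent’s combined utility Ui, evaluated under the homogeneity assumption that every group member chooses the same contribution.

```python
from typing import Callable

def punishment_deters(utility: Callable[[int], float]) -> bool:
    """Eq. 11: punishment deters free-riding if no contribution level 100 - h yields
    a higher combined utility U_i than full contribution (homogeneous agents assumed,
    so utility(g) is U_i when every group member contributes g)."""
    u_full = utility(100)
    return all(utility(100 - h) <= u_full for h in range(1, 101))
```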
4. Conclusion and discussion
We introduce an agent-based model of a public goods game with punishment institutions, in which we build agents based on a social preference model inspired by previous theory and experimental evidence. We are able to calibrate our model of social preferences in such a way that agents in the simulation show behavior very similar to that of humans in the experiment of [20]. The reinforcement-learning algorithm can replicate human behavior in our model with the assumption of simple motives. Coming back to the Petri dish analogy of the introduction, we accomplished the replication of Petri dishes similar to the original one, the experimental result. Therefore, our reinforcement-learning algorithm can be used to imitate human behavior in more complex settings in the future. This might include sophisticated agent-based models with complex trading and production of goods. Furthermore, the created model of the public goods game can be used to test how well it predicts human decisions in different designs.
Additionally, we introduced a new model of social preferences, which combines different motives by an additive connection. We allow contradictory motives because they may be part of everyday decision-making. Nevertheless, we still cannot be sure whether participants in the experiment actually behave according to these motives. For instance, we do not consider trust and reciprocity in our model. Our model of social preferences is only outcome-based and not intention-based. While this protects our agent-based model against self-fulfilling beliefs [42], intention-based preferences may be more realistic [5] and a promising venture for future research in agent-based simulations. Including an intention-based model could help to address the question of how agents react to information about other agents. Our model is limited in this respect.
Another potential for future research is to endogenize the social preferences of agents. The standard assumption in behavioral economics is that social preferences are given traits of subjects that do not change. Yet it might be more realistic to assume that social preferences are characteristics that are influenced by treatments or the behavior of other subjects. In an agent-based model, a start population with given social preferences may develop into a generation with another structure of social preferences. To this end, reproduction and mutation rules should consider the efficiency of different forms of social preferences in different institutions. Finally, we see potential in implementing our motives into other settings of agent-based models, replications of experiments, or complex economic simulations of society. For example, they could be able to explain preferences for the redistribution of income. Furthermore, our model of social preferences can be used in a simulation model that studies the effects of inequality on voting behavior.
Supporting information
S1 Fig. Graph of the agent-based model in EconSim-GUI.
https://doi.org/10.1371/journal.pone.0282112.s001
(TIF)
S2 Fig. Boxplots of the convergence time of different parameter combinations (see S1 Table for detailed parameter settings).
In red: chosen parameter configuration in sections 3.5 and (slightly adjusted) 3.6.
https://doi.org/10.1371/journal.pone.0282112.s002
(TIF)
S3 Fig. Dynamics of contributions and payoffs of low cooperators under no, mild, and severe punishment in a single run including the arithmetic mean of the agents.
https://doi.org/10.1371/journal.pone.0282112.s003
(TIF)
S4 Fig. Dynamics of contributions and payoffs of high cooperators under no, mild, and severe punishment in a single run including the arithmetic mean of the agents.
https://doi.org/10.1371/journal.pone.0282112.s004
(TIF)
S5 Fig. Dynamics of contributions and payoffs of middle cooperators under no, mild, and severe punishment in a single run including the arithmetic mean of the agents.
https://doi.org/10.1371/journal.pone.0282112.s005
(TIF)
S6 Fig. Dynamics of contributions and payoffs of competitive low cooperators under no, mild, and severe punishment in a single run including the arithmetic mean of the agents.
https://doi.org/10.1371/journal.pone.0282112.s006
(TIF)
S1 Table. Sensitivity analysis of the reinforcement learning parameters.
Underlined: changes relative to Combination 0.
https://doi.org/10.1371/journal.pone.0282112.s007
(PDF)
S1 File. This file contains a pseudocode of the used reinforcement learning algorithm (NESASS) and some additional comments.
https://doi.org/10.1371/journal.pone.0282112.s008
(PDF)
References
- 1. Tavoni A, Dannenberg A, Kallis G, Löschel A. Inequality, communication, and the avoidance of disastrous climate change in a public goods game. Proc Natl Acad Sci U S A. 2011; 108:11825–9. Epub 2011/07/05. pmid:21730154.
- 2. Gode DK, Sunder S. Allocative Efficiency of Markets with Zero-Intelligence Traders: Market as a Partial Substitute for Individual Rationality. Journal of Political Economy. 1993; 101:119–37.
- 3. Hommes CH, LeBaron BD, editors. Heterogeneous agent modeling. Amsterdam, Netherlands; Kidlington, Oxford: North-Holland, an imprint of Elsevier; 2018.
- 4. Hommes C, Sonnemans J, Tuinstra J, van de Velden H. Coordination of Expectations in Asset Pricing Experiments. Rev Financ Stud. 2005; 18:955–80.
- 5. Heckbert S. Experimental economics and agent-based models. 18th World IMACS/MODSIM Congress; 2009. pp. 2997–3003.
- 6. Ledyard J. Public Goods: A Survey of Experimental Research. Public Economics; 1994.
- 7. Weimann J, Brosig-Koch J. Methods in experimental economics. An introduction. Cham, Switzerland: Springer; 2019.
- 8. Fehr E, Schmidt K. A Theory of Fairness, Competition, and Cooperation. The Quarterly Journal of Economics. 1999; 114:817–68. Available from: http://www.jstor.org/stable/2586885.
- 9. Bolton G, Ockenfels A. ERC: A Theory of Equity, Reciprocity, and Competition. American Economic Review. 2000; 90:166–93.
- 10. Charness G, Rabin M. Understanding Social Preferences with Simple Tests. The Quarterly Journal of Economics. 2002; 117:817–69. Available from: http://www.jstor.org/stable/4132490.
- 11. Fischbacher U, Gächter S, Fehr E. Are people conditionally cooperative? Evidence from a public goods experiment. Economics Letters. 2001; 71:397–404.
- 12. Fehr E, Fischbacher U. The nature of human altruism. Nature. 2003; 425:785–91. pmid:14574401.
- 13. Fehr E, Gächter S. Cooperation and Punishment in Public Goods Experiments. American Economic Review. 2000; 90:980–94.
- 14. Tyran J-R, Feld LP. Achieving Compliance when Legal Sanctions are Non-deterrent. Scand J Econ. 2006; 108:135–56.
- 15. Putterman L, Tyran J-R, Kamei K. Public goods and voting on formal sanction schemes. Journal of Public Economics. 2011; 95:1213–22.
- 16. Kamei K, Putterman L, Tyran J-R. State or nature? Endogenous formal versus informal sanctions in the voluntary provision of public goods. Exp Econ. 2015; 18:38–65.
- 17. Wang S, Liu L, Chen X. Incentive strategies for the evolution of cooperation: Analysis and optimization. EPL. 2021; 136:68002.
- 18. Wang S, Liu L, Chen X. Tax-based pure punishment and reward in the public goods game. Physics Letters A. 2021; 386:126965.
- 19. Chen X, Sasaki T, Brännström Å, Dieckmann U. First carrot, then stick: how the adaptive hybridization of incentives promotes cooperation. J R Soc Interface. 2015; 12:20140935. pmid:25551138.
- 20. Bühren C, Dannenberg A. The Demand for Punishment to Promote Cooperation Among Like-Minded People. European Economic Review. 2021; 138:103862.
- 21. Duffy J. Agent-based models and human subject experiments. In: Handbook of computational economics, Vol. 2: Agent-based computational economics. Amsterdam: Elsevier, North-Holland; 2006.
- 22. Thorndike EL. The Law of Effect. The American Journal of Psychology. 1927; 39:212.
- 23. Sutton RS, Barto A. Reinforcement learning: An introduction. 2nd ed. Cambridge: MIT Press; 2018.
- 24. Papavassiliou S, Tsiropoulou EE, Promponas P, Vamvakas P. A Paradigm Shift Toward Satisfaction, Realism and Efficiency in Wireless Networks Resource Sharing. IEEE Network. 2021; 35:348–55.
- 25. Doniec A, Mandiau R, Piechowiak S, Espié S. A behavioral multi-agent model for road traffic simulation. Engineering Applications of Artificial Intelligence. 2008; 21:1443–54.
- 26. Tan L, Hu M, Lin H. Agent-based simulation of building evacuation: Combining human behavior with predictable spatial accessibility in a fire emergency. Information Sciences. 2015; 295:53–66.
- 27. Albert S, Kraus P, Müller JP, Schöbel A. Passenger-Induced Delay Propagation: Agent-Based Simulation of Passengers in Rail Networks. In: Baum M, Brenner G, Grabowski J, Hanschke T, Hartmann S, et al., editors. Simulation Science. Cham: Springer International Publishing; 2018. pp. 3–23.
- 28. Gunnthorsdottir A, Houser D, McCabe K. Disposition, history and contributions in public goods experiments. Journal of Economic Behavior & Organization. 2007; 62:304–15.
- 29. Kesten-Kühne J. EconSim–A simulation framework for modeling complex and dynamic economies; 2020.
- 30. Kesten-Kühne J. EconSim. Ein modulares Framework für agentenbasierte Modelle zur Untersuchung komplexer und dynamischer Wirtschaften [EconSim. A modular framework for agent-based models for studying complex and dynamic economies]. 1st ed. Wiesbaden: Springer Fachmedien Wiesbaden; Imprint: Springer Gabler; 2020.
- 31. Gigerenzer G, Selten R, editors. Bounded rationality: The adaptive toolbox [report of the 84th Dahlem Workshop on Bounded Rationality: The Adaptive Toolbox, Berlin, March 14–19, 1999]. Cambridge, Mass.: MIT Press; 2001.
- 32. Chen S-H. Agent-Based Computational Economics. How the idea originated and where it is going. 1st ed. London: Taylor and Francis; 2015.
- 33. Dignum V, Padget J. Multiagent Organizations. In: Weiss G, editor. Multiagent systems. 2nd ed. Cambridge, Mass.: MIT Press; 2013.
- 34. Dahlke J, Bogner K, Mueller M, Berger T, Pyka A, Ebersberger B. Is the Juice Worth the Squeeze? Machine Learning (ML) In and For Agent-Based Modelling (ABM). arXiv; 2020.
- 35. Busoniu L, Babuska R, Schutter B de. A Comprehensive Survey of Multiagent Reinforcement Learning. IEEE Trans Syst., Man, Cybern C. 2008; 38:156–72.
- 36. Erev I, Roth AE. Simple Reinforcement Learning Models and Reciprocation in the Prisoner’s Dilemma Game. In: Gigerenzer G, Selten R, editors. Bounded Rationality. The MIT Press; 2002.
- 37. Mehlhorn K, Newell BR, Todd PM, Lee MD, Morgan K, Braithwaite VA, et al. Unpacking the exploration–exploitation tradeoff: A synthesis of human and animal literatures. Decision. 2015; 2:191–215.
- 38. Charness G, Masclet D, Villeval M-C. Competitive preferences and status as an incentive. Experimental evidence. Bonn: IZA; 2010.
- 39. Balliet D, Mulder LB, van Lange PAM. Reward, punishment, and cooperation: a meta-analysis. Psychol Bull. 2011; 137:594–615. pmid:21574679.
- 40. Markussen T, Putterman L, Tyran J-R. Self-Organization for Collective Action: An Experimental Study of Voting on Sanction Regimes. The Review of Economic Studies. 2014; 81:301–24.
- 41. Fagiolo G, Guerini M, Lamperti F, Moneta A, Roventini A. Validation of Agent-Based Models in Economics and Finance. In: Beisbart C, editor. Computer Simulation Validation. Fundamental Concepts, Methodological Frameworks, and Philosophical Perspectives. Cham: Springer International Publishing AG; 2019. pp. 763–87.
- 42. Schmidt K. Social Preferences and Competition. Journal of Money, Credit and Banking. 2011; 43:207–31.