Spike-based Decision Learning of Nash Equilibria in Two-Player Games

doi:10.1371/journal.pcbi.1002691

Figure 1.

Neuronal architecture implementing the two players.

Each player is represented by a population of decision-making neurons, which receive an input spike pattern and generate an output spike pattern. The population decision is represented by a readout unit, with one decision being more likely when more decision-making neurons fire at least one output spike. The synaptic weights are adapted as a function of several quantities, including the reward signal delivered by a critic.


Table 1.

Average bank payoff for our version of blackjack.


Figure 2.

Playing blackjack with pRL converges toward pure Nash equilibrium.

(A) Average strategy () after (open circles) and (filled circles) games where the gambler (blue) is a neural net as well as the croupier (black). The dotted vertical lines left of and show the separation line of drawing/not drawing another card for the optimal Nash strategy pair. (B) Average strategy () after games for a neural net as gambler playing against a croupier that follows a given strategy (blue), (red) or (green). The colored dotted lines left of show the separation line of drawing/not drawing another card for the optimal strategy given that the croupier stops drawing at (from left to right). (C) Average reward () of the gambler for the scenario described in (B). The colored dotted lines show the maximal reachable average reward. (D) Average strategy () over the last out of a total of games for a neural net (red) or human (green) as gambler playing against a croupier that follows a given strategy . The initial weights of the network were chosen such that the strategy in the first trials (blue) mimics the strategy of humans instructed about the game rules (black).


Table 2.

Payoff matrix of the inspector game.


Figure 3.

pRL but not TD-learning fits data and follows a mixed Nash equilibrium.

(A) Choice behavior for pRL versus pRL (employee green, employer red) and human versus human (employee black, employer gray) [11]. The cost of inspection was stepped from to to , respectively, and this also corresponds to the shirk rate in the Nash equilibrium (thick black lines). The inspection rate in the Nash equilibrium would always be . (B) Average choice behavior of pRL vs pRL (dark green circles) and TD vs TD (light green circles), pRL for the employee vs computer algorithm for the employer (blue squares), human vs human (black), human as an employee vs computer algorithm (orange) and monkey vs computer algorithm (cyan) for trials/block as a function of the inspection cost. The solid line indicates the Nash equilibrium. (C) Reward as a function of the inspection cost for trials/block. Coloring as in (B). pRL simulations are more similar to the experimental data than the TD simulations. (D) Average choice behavior as in (B) but for trials/block. The inspect rates for pRL vs pRL (TD vs TD) (dark (light) red circles) and pRL vs computer algorithm (purple squares) are shown too. The lines indicate the Nash equilibrium for the employee (diagonal) and the employer (horizontal). pRL behaves according to the Nash equilibrium, whereas TD does not. (E) Time course of the probability to shirk with inspection cost for pRL vs algorithm (blue line) and pRL vs pRL (TD vs TD) (dark (light) green line). For the latter, the probability of the employer to inspect is shown too (dark (light) red line). pRL oscillates around the Nash equilibrium (drawn lines), whereas TD deviates completely from Nash. (F) Time course of the probability to shirk or inspect, respectively, with inspection cost for pRL vs pRL (green and red, respectively, solid) as in (E), but shifted up for clarity and overlaid with the negative change in the shirk rate (green dashed) and the change in the inspect rate (red dashed) to show the counteractive behavior.
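
The mixed Nash equilibrium referred to in this caption can be illustrated with a minimal sketch of the standard indifference conditions for a two-player, two-action game. The function name `mixed_nash_2x2` and the matching-pennies payoffs are assumptions chosen for illustration only; they are not the inspector-game payoffs of Table 2, which are not reproduced here, and this is not the authors' simulation code.

```python
from typing import Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def mixed_nash_2x2(row_payoff: Matrix, col_payoff: Matrix) -> Tuple[float, float]:
    """Return (p, q) for a fully mixed equilibrium of a 2x2 game.

    p is the probability that the row player picks row 0,
    q is the probability that the column player picks column 0.
    Each player mixes so that the opponent is indifferent between actions.
    """
    a, b = row_payoff, col_payoff
    # Row player is indifferent between rows 0 and 1 given q.
    denom_q = a[0][0] - a[0][1] - a[1][0] + a[1][1]
    # Column player is indifferent between columns 0 and 1 given p.
    denom_p = b[0][0] - b[0][1] - b[1][0] + b[1][1]
    if denom_q == 0 or denom_p == 0:
        raise ValueError("indifference conditions are degenerate")
    q = (a[1][1] - a[0][1]) / denom_q
    p = (b[1][1] - b[1][0]) / denom_p
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("no fully mixed equilibrium for these payoffs")
    return p, q


# Hypothetical example: matching pennies (zero-sum), not the inspector game.
pennies_row = [[1, -1], [-1, 1]]
pennies_col = [[-1, 1], [1, -1]]
print(mixed_nash_2x2(pennies_row, pennies_col))  # (0.5, 0.5)
```

With the payoff matrix of a different game substituted for the two hypothetical matrices, the same function returns the mixing probabilities against which learned choice rates could be compared, provided a fully mixed equilibrium exists.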


Figure 4.

Covariance learning rules may lead to a mixed Nash equilibrium, but also to deterministic non-Nash strategies. pRL fits data better than basic reinforcement models.

Time course of the probability to shirk (A,C) and inspect (B,D) with inspection cost for pCOV vs algorithm (A,B) and pCOV vs pCOV (C,D). In each panel the horizontal lines depict the Nash equilibrium, and inspection and shirk rates are shown for 10 simulation runs (the same color in (A,B) and in (C,D), respectively, corresponds to the same run). Only a small fraction of all runs converge to or oscillate around the Nash equilibrium, while the other runs result in a deterministic strategy pair. The initial distribution of synaptic weights was Gaussian with mean and standard deviation . The learning rate was set to , but its value did not change the proportion of runs converging to the pure strategy. (E) Average choice behavior of pRL vs pRL (green), RE1 vs RE1 (blue), RE3 vs RE3 (red) and human vs human (black) for trials/block as a function of the inspection cost. The light red circles show the average choice behavior for RE3 vs RE3 and trials/block. Individual runs converged to a pure strategy, hence the shown averages over 200 runs reflect the percentage of runs converging to a pure shirk strategy. (F) Reward as a function of the inspection cost for trials/block. Coloring as in (E). The solid lines indicate the Nash equilibrium.


Table 3.

Probability distribution of hand values after drawing the last card.
