Figure 1.
Neuronal architecture implementing the two players.
Each player (
) is represented by a population of decision-making neurons (shown
of each) which receive an input spike pattern
and generate an output spike pattern
. The population decision
is represented by a readout unit, with
being more likely when more decision-making neurons fire at least one output spike. The synaptic weights
are adapted as a function of
,
,
and the reward signal
delivered by a critic.
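The readout described in this caption can be illustrated with a minimal sketch. The caption only states that the decision becomes more likely when more neurons fire at least one output spike; the logistic mapping, the `gain` parameter, the coding of the decision as +1/-1 and all function names below are assumptions made for illustration, not the readout used in the paper.

```python
import math
import random


def active_fraction(spike_counts):
    """Fraction of neurons that fired at least one output spike."""
    return sum(1 for c in spike_counts if c >= 1) / len(spike_counts)


def decision_probability(spike_counts, gain=5.0):
    """Probability of the 'positive' population decision.

    Assumption: a logistic function of the active fraction, centered at 0.5.
    Any monotonically increasing function would match the caption equally well.
    """
    a = active_fraction(spike_counts)
    return 1.0 / (1.0 + math.exp(-gain * (2.0 * a - 1.0)))


def sample_decision(spike_counts, rng=random):
    """Draw a stochastic decision, coded here (hypothetically) as +1 or -1."""
    return 1 if rng.random() < decision_probability(spike_counts) else -1


# Output spike counts of a small, made-up population of four neurons.
for counts in ([0, 0, 0, 0], [1, 0, 2, 0], [1, 3, 1, 2]):
    print(counts, round(decision_probability(counts), 3))
```

Only whether a neuron spiked at all enters the readout, so a neuron firing three spikes contributes as much as one firing a single spike.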
Table 1.
Average bank payoff for our version of blackjack.
Figure 2.
Playing blackjack with pRL converges toward pure Nash equilibrium.
(A) Average strategy () after
(open circles) and
(filled circles) games where both the gambler (blue) and the croupier (black) are neural nets. The dotted vertical lines to the left of
and
show the separation line of drawing/not drawing another card for the optimal Nash strategy pair. (B) Average strategy (
) after
games for a neural net as gambler playing against a croupier that follows a given strategy
(blue),
(red) or
(green). The colored dotted lines to the left of
show the separation line of drawing/not drawing another card for the optimal strategy given that the croupier stops drawing at
(from left to right). (C) Average reward (
) of the gambler for the scenario described in (B). The colored dotted lines show the maximal reachable average reward. (D) Average strategy (
) over the last
out of a total of
games for a neural net (red) or human (green) as gambler playing against a croupier that follows a given strategy
. The initial weights of the network were chosen such that the strategy in the first
trials (blue) mimics the strategy of humans instructed about the game rules (black).
Table 2.
Payoff matrix of the inspector game.
Figure 3.
pRL but not TD-learning fits data and follows a mixed Nash equilibrium.
(A) Choice behavior for pRL versus pRL (employee green, employer red) and human versus human (employee black, employer gray) [11]. The cost of inspection was stepped from to
to
, respectively, which also corresponds to the shirk rate in the Nash equilibrium (thick black lines). The inspection rate in the Nash equilibrium would always be
. (B) Average choice behavior of pRL vs pRL (dark green circles) and TD vs TD (light green circles), pRL for the employee vs computer algorithm for the employer (blue squares), human vs human (black), human as an employee vs computer algorithm (orange) and monkey vs computer algorithm (cyan) for
trials/block as a function of the inspection cost. The solid line indicates the Nash equilibrium. (C) Reward as a function of the inspection cost for
trials/block. Coloring as in (B). pRL simulations are more similar to the experimental data than the TD simulations. (D) Average choice behavior as in (B) but for
trials/block. The inspection rates for pRL vs pRL (dark red circles), TD vs TD (light red circles) and pRL vs computer algorithm (purple squares) are also shown. The lines indicate the Nash equilibrium for the employee (diagonal) and the employer (horizontal). pRL behaves according to the Nash equilibrium, whereas TD does not. (E) Time course of the probability to shirk with inspection cost
for pRL vs algorithm (blue line), pRL vs pRL (dark green line) and TD vs TD (light green line). For the latter two, the employer's probability to inspect is also shown (dark and light red lines, respectively). pRL oscillates around the Nash equilibrium (drawn lines), whereas TD deviates completely from the Nash equilibrium. (F) Time course of the probability to shirk or inspect, respectively, with inspection cost
for pRL vs pRL (green and red, respectively; solid lines) as in (E), but shifted up for clarity and overlaid with the negative change in the shirk rate (green dashed) and the change in the inspection rate (red dashed) to show the counteractive behavior.
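The mixed Nash equilibrium referred to in this caption can be computed from the indifference conditions of a 2x2 game, as in the sketch below. The payoff entries in `inspector_game` are assumptions and are not taken from Table 2; they are chosen so that the equilibrium shirk rate equals the inspection cost, as the caption states. The constant inspection rate of 0.5 that results follows from these assumed payoffs only. The function names are hypothetical.

```python
def mixed_nash_2x2(row_payoff, col_payoff):
    """Fully mixed equilibrium of a 2x2 game.

    Payoffs are indexed as payoff[row_action][col_action]. Returns (p, q):
    the probability that the row player plays action 1 and the probability
    that the column player plays action 1. Assumes an interior equilibrium
    exists, i.e. both denominators are nonzero.
    """
    A, B = row_payoff, col_payoff
    # p makes the column player indifferent between its two actions.
    p = (B[0][1] - B[0][0]) / (B[1][0] - B[0][0] - B[1][1] + B[0][1])
    # q makes the row player indifferent between its two actions.
    q = (A[1][0] - A[0][0]) / (A[0][1] - A[0][0] - A[1][1] + A[1][0])
    return p, q


def inspector_game(cost):
    """Hypothetical payoffs (assumed for illustration, not from Table 2).

    Rows: employee works (0) or shirks (1).
    Columns: employer does not inspect (0) or inspects (1).
    """
    employee = [[0.5, 0.5], [1.0, 0.0]]
    employer = [[2.0, 2.0 - cost], [0.0, 1.0 - cost]]
    return employee, employer


for cost in (0.1, 0.3, 0.5, 0.7, 0.9):
    p_shirk, q_inspect = mixed_nash_2x2(*inspector_game(cost))
    print(f"cost={cost:.1f}  shirk={p_shirk:.3f}  inspect={q_inspect:.3f}")
```

Each player's equilibrium probability is determined by the other player's payoffs. This is why, under these payoffs, changing the employer's inspection cost moves the employee's shirk rate and leaves the inspection rate unchanged.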
Figure 4.
Covariance learning rules may lead to a mixed Nash equilibrium, but also to deterministic non-Nash strategies. pRL fits data better than basic reinforcement models.
Time course of the probability to shirk (A,C) and inspect (B,D) with inspection cost for pCOV vs algorithm (A,B) and pCOV vs pCOV (C,D). In each panel the horizontal lines depict the Nash equilibrium, and inspection and shirk rates are shown for 10 simulation runs (the same color within (A,B) and within (C,D) corresponds to the same run). Only a small fraction of all runs converge to or oscillate around the Nash equilibrium, while the other runs result in a deterministic strategy pair. The initial distribution of synaptic weights
was Gaussian with mean
and standard deviation
. The learning rate was set to
, but
did not change the proportion of runs converging to the pure strategy. (E) Average choice behavior of pRL vs pRL (green), RE1 vs RE1 (blue), RE3 vs RE3 (red) and human vs human (black) for
trials/block as a function of the inspection cost. The light red circles show the average choice behavior for RE3 vs RE3 and
trials/block. Individual runs converged to a pure strategy; hence the averages shown, taken over 200 runs, reflect the percentage of runs converging to a pure shirk strategy. (F) Reward as a function of the inspection cost for
trials/block. Coloring as in (E). The solid lines indicate the Nash equilibrium.
Table 3.
Probability distribution of hand values after drawing the last card.