Mediating Artificial Intelligence Developments through Negative and Positive Incentives

The field of Artificial Intelligence (AI) is going through a period of great expectations, introducing a certain level of anxiety in research, business and also policy. This anxiety is further energised by an AI race narrative that makes people believe they might be missing out. Whether real or not, a belief in this narrative may be detrimental as some stakeholders will feel obliged to cut corners on safety precautions, or ignore societal consequences, just to "win". Starting from a baseline model that describes a broad class of technology races where winners draw a significant benefit compared to others (such as AI advances, patent races, pharmaceutical technologies), we investigate here how positive (rewards) and negative (punishments) incentives may beneficially influence the outcomes. We uncover conditions in which punishment is either capable of reducing the development speed of unsafe participants or has the capacity to reduce innovation through over-regulation. Alternatively, we show that, in several scenarios, rewarding those that follow safety measures may increase the development speed while ensuring safe choices. Moreover, in the latter regimes, rewards do not suffer from the issue of over-regulation as is the case for punishment. Overall, our findings provide valuable insights into the nature and kinds of regulatory actions most suitable to improve safety compliance in the contexts of both smooth and sudden technological shifts.

Introduction
[...] parties involved will make monitoring and compliance enforcement difficult (if not impossible). Therefore, for all to enjoy the benefits provided by safe, ethical and trustworthy AI, it is crucial to design and impose appropriate incentivising strategies in order to ensure mutual benefits and safety-compliance from all sides involved. Given these concerns, many calls for developing efficient forms of regulation have been made [2,8,9].
In this paper, we aim to understand how different forms of incentives can be efficiently used to influence safety decision making within a development race for domain supremacy through AI (DSAI), resorting to population dynamics and Evolutionary Game Theory (EGT) [10][11][12]. Although AI development is used here to frame the model and to discuss the results, both model and conclusions may easily be adopted for other technology races, especially where a winner-takes-all situation occurs [13][14][15].
We posit that it requires time to reach DSAI, modelling this by a number of development steps or technological advancement rounds [16]. In each round the development teams (or players) need to choose between one of two strategic options: to follow safety precautions (the SAFE action) or ignore safety precautions (the UNSAFE action). Because it takes more time and more effort to comply with precautionary requirements, playing SAFE is not just costlier, but implies slower development speed too, compared to playing UNSAFE. We consequently assume that to play SAFE involves paying a cost c > 0, while playing UNSAFE costs nothing (c = 0). Moreover, the development speed of playing UNSAFE is s > 1 whilst the speed of playing SAFE is normalised to s = 1. The interaction is iterated until one or more teams establish DSAI, which occurs probabilistically, i.e. the model assumes, upon completion of each round, that there is a probability ω that another development round is required to reach DSAI, which results in an average number W = (1 − ω)^{−1} of rounds per competition/race [12]. We thus do not make any assumption about the time required to reach DSAI in a given domain. Yet once the race ends, a large benefit or prize B is acquired that is shared amongst those reaching the target simultaneously.
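The relation between the continuation probability ω and the average race length W can be checked numerically. The following is a minimal sketch, not part of the original model code: the function names and the Monte Carlo settings are illustrative assumptions.

```python
import random


def expected_rounds(omega: float) -> float:
    """Closed-form mean number of rounds, W = 1 / (1 - omega)."""
    if not 0.0 <= omega < 1.0:
        raise ValueError("omega must lie in [0, 1)")
    return 1.0 / (1.0 - omega)


def simulate_rounds(omega: float, rng: random.Random) -> int:
    """Play one race: after each round, another round is needed with probability omega."""
    rounds = 1
    while rng.random() < omega:
        rounds += 1
    return rounds


def mean_simulated_rounds(omega: float, n_races: int = 20000, seed: int = 0) -> float:
    """Average race length over many simulated races (illustrative settings)."""
    rng = random.Random(seed)
    return sum(simulate_rounds(omega, rng) for _ in range(n_races)) / n_races
```

With ω = 0.99 the closed form gives W = 100, the value used for the numerical results in this paper; the simulated mean approaches the closed form as the number of races grows.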
The DSAI model further assumes that a development setback or disaster might occur, with a probability assumed to increase with the number of occasions the safety requirements have been omitted by the winning team(s) at each round. Although many potential AI disaster scenarios have been sketched [1,17], the uncertainties in accurately predicting these outcomes have been shown to be high. When such a disaster occurs, the risk-taking participant loses all its accumulated benefits. We denote by p_r the risk probability of such a disaster occurring when no safety precaution is followed (see Materials and Methods section for further details).
As shown in [16], when the time-scale of reaching the target is short, such that the average benefit over all the development rounds, i.e. B/W, is significantly larger than the intermediate benefit b obtained in every round, there is a large parameter space where societal interest is in conflict with the personal one: unsafe behaviour is dominant despite the fact that safe development would lead to a greater social welfare (see region II in Figure 2 and Supporting Information (SI) for details). The reason is that those who completely ignore safety precautions can always achieve the big prize B when playing against safe participants. The two other zones, i.e. region I and region III in Figure 2, do not suffer from a dilemma between individual and group benefits as is the case for region II. Whereas in region I safe development is preferred due to excessively high risks, region III prefers unsafe, risk-taking behaviour, both from an individual and societal perspective.
From a regulatory perspective, only region II requires additional measures that ensure or enhance safe and globally beneficial outcomes, avoiding any potential disaster. Large-scale surveys and expert analyses of beliefs and predictions about progress in AI indicate that the perceived time-scale for supremacy through AI is highly diverse across domains as well as regions [18,19]. Also note that despite focusing on DSAI in this paper, the proposed model is generally applicable to any kind of long-term competing situation, such as technological innovation development and patent racing, where there is a significant advantage (i.e. large B) to be achieved by reaching an important target first [13][14][15]. Other domains include pharmaceutical development, where firms could try to cut corners by not following safe clinical trial protocols in an effort to be the first to develop a pharmaceutical product (e.g. a cure for cancer), in order to take the highest possible share of the market benefit [20]. Besides tremendous economic advantage, a winner of a vaccine race, such as for Covid-19 treatment, can also gain significant political and reputational influence [21].
In this paper, we explore whether and how incentives such as reward and punishment can help in avoiding disasters and generate a wide benefit of AI-based solutions. Namely, players can attempt to prevent others from moving as fast as they want (i.e., an elementary form of punishment of wrong-doers) or help others to speed up their development (rewarding right-doers), at a given cost. Slowing down unsafe participants can be obtained by reporting misconduct to authorities and media, or by refusal to share and collaborate with companies not following the same deontological principles. Similarly, rewards can correspond to support, exchange of knowledge, staff, etc. of safety conscious participants. Note that reasons for intervening with the development speed of competitors may also be nefarious, e.g. cyber-attacks, in order to get a speed advantage. The current work only considers interventions by safe players as a result of the unsafe behaviour of co-players. We show that both negative and positive incentives can be efficient and naturally self-organize (even when costly). However, we also show that such incentives should be carefully introduced, as they can have negative effects otherwise. To this end, we identify the conditions under which positive and negative incentives are conducive to desired collective outcomes.

Related Work
Although there have been a number of proposals and debates on how to avert, regulate, or mediate a race for technological supremacy [2,4,8,9,22,23,24], few formal modelling studies have been proposed [1,16]. The current paper takes the next step, further filling this gap. Namely, it will resort to Evolutionary Game Theory (EGT) methods to investigate how positive and negative incentives can improve the outcomes of DSAI and, more generally, a broad class of innovation race dynamics.
Incentives such as punishment and rewards have been shown to provide important mechanisms to promote the emergence of positive behaviour (such as cooperation and fairness) in the context of social dilemmas [25][26][27][28][29][30][31][32][33][34][35]. Notwithstanding, none of the existing modelling approaches to AI governance [1,16] studies how incentives can be used to enhance safety compliance. Moreover, there have been incentive-modelling studies addressing other kinds of risk, such as climate change and nuclear war, see e.g. [32,36,37]. Following from an analysis of several large global catastrophic risks [17], it has been shown that the race for domain supremacy through AI and its related risks are rather unique. Analyses of climate change disasters primarily focus on participants' unwillingness to take upon themselves some personal cost for a desired collective target, and imply a collective risk for all parties involved [32]. In contrast, in a race to become leader in a particular AI application domain, the winner(s) will extract significant advantage relative to that of others. More importantly, this AI risk is also more directed towards individual developers or users than collective ones.

DSAIR model definition
Let us depart from the innovation race or domain supremacy through AI race (DSAIR) model developed in [16]. We adopt a two-player repeated game, consisting of, on average, W rounds. At each development round, players can collect benefits from their intermediate AI products, subject to whether they choose playing SAFE or UNSAFE. By assuming some fixed benefit, b, resulting from the AI market, the teams share this benefit in proportion to their development speed. Hence, for every round of the race, we can write, with respect to the row player i, a payoff matrix denoted by Π, where each entry is represented by Π_ij (with j corresponding to a column), as follows
\[
\Pi =
\begin{array}{c|cc}
 & \text{SAFE} & \text{UNSAFE} \\ \hline
\text{SAFE} & -c + \frac{b}{2} & -c + \frac{b}{s+1} \\
\text{UNSAFE} & \frac{sb}{s+1} & \frac{b}{2}
\end{array}
\]
The payoff matrix can be explained as follows. First of all, whenever two SAFE players interact, each will pay the cost c and share the resulting benefit b. Differently, when two UNSAFE players interact, each will share the benefit b without having to pay c. When a SAFE player interacts with an UNSAFE player, the SAFE one pays a cost c and receives a (smaller) part b/(s + 1) of the benefit b, while the UNSAFE one obtains the larger part sb/(s + 1) without having to pay c. Note that Π is a simplification of the matrix defined in [16] since it was shown that the parameters defined here are sufficient to explain the results in the current time-scale.
We will analyse evolutionary outcomes of safety behaviour within a well-mixed, finite population consisting of Z players, who repeatedly interact with each other in the AI development process. They will adopt one of the following two strategies [16]:
• AS: always complies with safety precaution, playing SAFE in all the rounds.
• AU: never complies with safety precaution, playing UNSAFE in all the rounds.
The payoff matrix defining averaged payoffs for AU vs AS is given by
\[
\begin{array}{c|cc}
 & \text{AS} & \text{AU} \\ \hline
\text{AS} & \frac{B}{2W} + \Pi_{11} & \Pi_{12} \\
\text{AU} & p\left(\frac{sB}{W} + \Pi_{21}\right) & p\left(\frac{sB}{2W} + \Pi_{22}\right)
\end{array}
\]
where, solely with the purpose of presentation, we denote p = 1 − p_r. As was shown in [16] by considering when AU is risk-dominant against AS, three different regions can be identified in the parameter space s-p_r (see Figure 2; details are provided in SI): (I) when p_r > 1 − 1/(3s), AU is risk-dominated by AS: safety compliance is both the preferred collective outcome and selected by evolution; (II) when 1 − 1/(3s) > p_r > 1 − 1/s: even though it is more desirable to ensure safety compliance as the collective outcome, social learning dynamics would lead the population to the state wherein the safety precaution is mostly ignored; (III) when p_r < 1 − 1/s (AU is risk-dominant against AS), then unsafe development is both preferred collectively and selected by social learning dynamics.
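The three regions can be illustrated with a short numerical sketch. The function names are hypothetical, and the averaged payoff form used in `averaged_payoffs` is an assumption chosen to be consistent with the region boundaries stated above, not code from the original study.

```python
def round_payoffs(b, c, s):
    """Per-round payoff matrix; rows and columns are ordered (SAFE, UNSAFE)."""
    return [[-c + b / 2, -c + b / (s + 1)],
            [s * b / (s + 1), b / 2]]


def averaged_payoffs(b, c, s, B, W, p_r):
    """Average payoffs for (AS, AU); assumed form, with p = 1 - p_r."""
    pi = round_payoffs(b, c, s)
    p = 1 - p_r
    return [[B / (2 * W) + pi[0][0], pi[0][1]],
            [p * (s * B / W + pi[1][0]), p * (s * B / (2 * W) + pi[1][1])]]


def region(s, p_r):
    """Classify a point of the s-p_r plane using the large-B boundaries."""
    if p_r > 1 - 1 / (3 * s):
        return "I"    # safety preferred collectively and selected
    if p_r > 1 - 1 / s:
        return "II"   # dilemma: safety preferred, unsafe behaviour selected
    return "III"      # unsafe development preferred and selected
```

For instance, with b = 4, c = 1, W = 100, B = 10000, s = 1.5 and p_r = 0.5, the point lies in region (II): an all-AS population earns more than an all-AU one, yet AU is risk-dominant against AS.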
It is worth noting that adding a conditional strategy (that, for instance, plays SAFE in the first round and thereafter adopts the same move its co-player used in the previous round) does not influence the dynamics or improve safe outcomes (see details in SI). This is contrary to the prevalent models of direct reciprocity in the repeated social dilemmas context [12,38,39]. Therefore, additional measures need to be put in place for driving the race dynamics towards a more beneficial outcome. To this end, we explore in this work the effects of negative (sanctions) and positive (rewards) incentives.

Punishment and reward in innovation races
Given the DSAIR model, one can now introduce incentives that affect the development speed of the players. These incentives reduce or increase the speed of development of a player as this is the key factor in gaining b as well as B once the game ends [16]. While there are many ways to incorporate them, we assume here a minimal model where the effect on speed is constant and fixed over time, hence not cumulative with the number of unsafe or safe actions of the co-player. Given this constant assumption, a negative incentive reduces the speed of a co-player taking an UNSAFE action to a lower but constant speed-level. Similarly, a positive incentive increases the speed of a co-player that took a SAFE action to a fixed higher speed-level. In both cases these incentives are attributed in the next round, after observing the UNSAFE or SAFE action respectively. Moreover, both positive and negative incentives are considered to be costly, meaning that the strategy that awards them will reduce its own speed by providing the incentive. Given these assumptions the following two strategies are studied in relation to the AS and AU strategies defined earlier:
• A strategy PS that always plays SAFE but will sanction the co-player after she has played UNSAFE in the previous round. The punishment by PS imposes a reduction s_β on the opponent's speed as well as a reduction s_α on her own speed (see Figure 1, orange line/area).
• A strategy RS that always chooses the SAFE action and will reward a SAFE action of a co-player by increasing her speed by s_β while paying a cost s_α on her own speed (see Figure 1, blue line/area).
The analysis performed in the Results section aims to show whether having PS and/or RS in the population leads to more societal welfare in region (II), where there is a conflict between individual and societal interests. The methods used in this analysis are discussed in the next section.

Evolutionary Dynamics for Finite Populations
We employ EGT methods for finite populations [12,40,41], both in the analytical and in the numerical results obtained here. Within such a setting, the players' payoffs stand for their fitness or social success, and social learning shapes the evolutionary dynamics, according to which the most successful players will more often tend to be imitated by other players. Social learning is herein modelled utilising the so-called pairwise comparison rule [40], assuming that a player A with fitness f_A adopts the strategy of another player B with fitness f_B with probability assigned by the Fermi function,
\[
\left(1 + e^{-\beta (f_B - f_A)}\right)^{-1},
\]
where β conveniently describes the intensity of selection. The long-term frequency of each and every strategy in a population where several of them are in co-presence can be computed simply by calculating the stationary distribution of a Markov chain whose states represent those strategies. In the absence of behavioural exploration or mutations, end states of evolution inevitably are monomorphic. That is, whenever such a state is reached, it cannot be escaped via imitation. Thus, we further assume that, with some mutation probability, an agent can freely explore its behavioural space (in our case, consisting of two actions, SAFE and UNSAFE), randomly adopting an action therein. In the limit of a small mutation probability, the population consists of at most two strategies at any time. Consequently, the social dynamics can be described using a Markov chain, where each state represents a monomorphic population and the transition probabilities are given by the fixation probability of a single mutant [42,43]. The Markov chain's stationary distribution describes the average time the population spends in each of the monomorphic end states (see the examples in Figure 3 for illustration).
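Denote by π_{X,Y} the payoff a strategist X obtains in a pairwise interaction with strategist Y (defined in the payoff matrices). Suppose there exist at most two strategies in the population, say, k agents using strategy A (0 ≤ k ≤ Z) and (Z − k) agents using strategy B. Thus, the (average) payoffs of an agent that uses A and of an agent that uses B can be written as follows, respectively,
\[
\Pi_A(k) = \frac{(k-1)\,\pi_{A,A} + (Z-k)\,\pi_{A,B}}{Z-1}, \qquad
\Pi_B(k) = \frac{k\,\pi_{B,A} + (Z-k-1)\,\pi_{B,B}}{Z-1}.
\]
Now, in each time step, the probability of change by ±1 of the number k of agents using strategy A can be specified as [40]
\[
T^{\pm}(k) = \frac{Z-k}{Z}\,\frac{k}{Z}\,\left[1 + e^{\mp\beta\left[\Pi_A(k) - \Pi_B(k)\right]}\right]^{-1}.
\]
The fixation probability of a single mutant adopting A, in a population of (Z − 1) agents adopting B, is specified by [40,43]
\[
\rho_{B,A} = \left(1 + \sum_{i=1}^{Z-1} \prod_{j=1}^{i} \frac{T^{-}(j)}{T^{+}(j)}\right)^{-1}.
\]
When considering a set {1, ..., s} of distinct strategies (here s denotes the number of strategies), these fixation probabilities determine the Markov chain transition matrix
\[
M = \{T_{ij}\}_{i,j=1}^{s}, \qquad T_{ij,\, j \neq i} = \frac{\rho_{i,j}}{s-1}, \qquad T_{ii} = 1 - \sum_{j=1,\, j \neq i}^{s} T_{ij},
\]
where ρ_{i,j} is the probability that a single mutant adopting strategy j fixates in a population of agents adopting strategy i. The normalized eigenvector of the transpose of M associated with the eigenvalue 1 provides the above described stationary distribution [42], which defines the relative time the population spends while adopting each of the strategies.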
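The small-mutation procedure described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration under stated assumptions rather than the code used for the figures: the function names are hypothetical, the ratio T^-(k)/T^+(k) is simplified analytically for the Fermi rule, and power iteration is swapped in for the eigenvector computation to avoid external libraries.

```python
import math


def fixation_probability(pi_aa, pi_ab, pi_ba, pi_bb, Z, beta):
    """Probability that a single A mutant takes over a population of Z - 1 B players."""
    total, product = 1.0, 1.0
    for k in range(1, Z):
        f_a = ((k - 1) * pi_aa + (Z - k) * pi_ab) / (Z - 1)
        f_b = (k * pi_ba + (Z - k - 1) * pi_bb) / (Z - 1)
        # For the Fermi rule, T^-(k) / T^+(k) equals exp(-beta * (f_a - f_b)).
        product *= math.exp(-beta * (f_a - f_b))
        total += product
    return 1.0 / total


def transition_matrix(payoff, Z, beta):
    """payoff[i][j] is the average payoff of strategy i against strategy j."""
    n = len(payoff)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):            # resident strategy
        for j in range(n):        # mutant strategy
            if i != j:
                rho = fixation_probability(payoff[j][j], payoff[j][i],
                                           payoff[i][j], payoff[i][i], Z, beta)
                M[i][j] = rho / (n - 1)
        M[i][i] = 1.0 - sum(M[i])  # diagonal is still 0 here, so this sums off-diagonals
    return M


def stationary_distribution(M, iterations=100000, tol=1e-13):
    """Left fixed point of M found by repeatedly applying dist <- dist * M."""
    n = len(M)
    dist = [1.0 / n] * n
    for _ in range(iterations):
        new = [sum(dist[i] * M[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist
```

Under neutral drift (all payoffs equal) the fixation probability reduces to 1/Z, which provides a quick sanity check. Power iteration converges slowly when all fixation probabilities are very small, in which case a direct eigenvector or linear-system solve is preferable.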
Risk-dominance
An important approach for comparing two strategies A and B is to ask in which direction the transition is stronger or more probable: that of an A mutant fixating in a population of agents employing B, ρ_{B,A}, or that of a B mutant fixating in the population of agents employing A, ρ_{A,B}. Strategy A is said to be risk-dominant against B when the former is larger. In the limit of large population size (large Z), this condition can be simplified to [12]
\[
\pi_{A,A} + \pi_{A,B} > \pi_{B,A} + \pi_{B,B}. \qquad (6)
\]
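As a worked instance of this large-Z condition, the boundary between regions (I) and (II) can be recovered by comparing AS with AU. The sketch below assumes the averaged payoffs take the form π_{AS,AS} = B/(2W) + Π_{11}, π_{AS,AU} = Π_{12}, π_{AU,AS} = p(sB/W + Π_{21}) and π_{AU,AU} = p(sB/(2W) + Π_{22}), with p = 1 − p_r. This form is an assumption chosen to be consistent with the region boundaries quoted from [16]; the per-round terms are dropped in the large-prize limit B/W ≫ b.

```latex
\begin{align*}
\pi_{AS,AS} + \pi_{AS,AU} &> \pi_{AU,AS} + \pi_{AU,AU} \\
\frac{B}{2W} + \Pi_{11} + \Pi_{12}
  &> p\left(\frac{sB}{W} + \frac{sB}{2W} + \Pi_{21} + \Pi_{22}\right) \\
\frac{B}{2W} &\gtrsim p\,\frac{3sB}{2W} \qquad (B/W \gg b) \\
p < \frac{1}{3s} \quad &\Longleftrightarrow \quad p_r > 1 - \frac{1}{3s}.
\end{align*}
```

For s = 1.5 this gives p_r > 1 − 1/4.5 ≈ 0.78, matching the boundary of the compliance zone.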

Results
Negative incentives are a double-edged sword
As explained in Methods, PS reduces the speed of an AU player from s to s − s_β, while reducing its own speed from 1 (since it always plays SAFE) to 1 − s_α. Hence one can define s′ = 1 − s_α as the new speed for PS and s″ = s − s_β as the new speed for AU.
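The speed adjustment can be made concrete with a small sketch. The function names and the labels returned by `describe` are illustrative assumptions; only the two speed formulas come from the text.

```python
def punished_speeds(s, s_alpha, s_beta):
    """Return (speed of PS, speed of AU) once PS sanctions an UNSAFE co-player."""
    return 1.0 - s_alpha, s - s_beta


def describe(speed):
    """Label a development speed (hypothetical labels for illustration)."""
    if speed > 0:
        return "progress"
    if speed == 0:
        return "no progress"
    return "development destroyed"
```

For example, with s = 1.5, s_α = 0.5 and s_β = 1.0, both players end up with speed 0.5, so the sanction removes the speed advantage of the unsafe player at the price of halving the punisher's own speed.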
Depending on the values of s_α and s_β, these speeds may also be zero or even negative, which represent situations where no progress is being made or where punishment even destroys existing development, respectively. In the following we consider these situations in two different ways. First, a theoretical analysis is performed for the situation where s_β = s_α. Second, this assumption is relaxed and a numerical study of the generalised case is provided.
There are two scenarios to consider when s_β = s_α: (i) when s_α ≥ s and (ii) when it is not. In scenario (i), the speeds under punishment, s′ = 1 − s_α for PS and s″ = s − s_β for AU, are both non-positive, resulting in an infinite number of rounds since the target can never be reached. The average payoffs of PS and AU when playing against each other are thus −c and 0, respectively (assuming that when a team's development speed is non-positive, its intermediate benefit, b, is zero). The condition for PS to be risk-dominant against AU (see Equation 6 in Methods, and noting that the payoff of PS against another PS is the same as that of AS against another AS) reads
\[
\frac{B}{2W} + \Pi_{11} - c > p\left(\frac{sB}{2W} + \Pi_{22}\right).
\]
For sufficiently large B (fixing W), this condition reduces to p_r > 1 − 1/s. That is, PS is risk-dominant against AU for the whole region (II), thereby ensuring that safe behaviour is promoted in that dilemma region.
In scenario (ii), for sufficiently large B, the condition for PS to be risk-dominant against AU simplifies to a condition expressed in terms of a quantity r (Equation 7). This condition is easier to achieve for smaller r. Since r is an increasing function of s_α, to optimise the safety outcome, the highest possible s_α should be adopted, i.e. the strongest possible effort in slowing down the opponent should be made. Figure 4a shows the condition for different values of s_α in relation to s (fixing the ratio s_α/s). Numerical results in Figure 4b for a population of PS, AS and AU corroborate this analytical condition. Equation 7 splits the region (II) into two parts, (IIa) and (IIb), where PS is now also preferred to AU in the first one. In part (IIa), the transition is stronger from AU to PS than vice versa (see Figure 3b). Recall that in the whole region (II) the transition is stronger from AS to AU, thus leading to a cyclic pattern between these three strategies.

Figure 5 (caption, partial): In (I), both lead to no AU, as desired. In (II), punishment is more efficient except for when reward is rather costly but highly cost-efficient (the areas inside the white triangles). It is noteworthy that RS has very low frequency in all cases, as it catalyses the success of AS. In (III), RS always leads to the desired outcome of high AU frequency, while PS might lead to an undesired result of a reduced AU frequency (over-regulation) when highly efficient (non-red area). Parameters: b = 4, c = 1, W = 100, B = 10000, s = 1.5, β = 0.01, population size Z = 100.
When relaxing the assumption that s_β = s_α (see SI for the detailed calculation of payoffs), the effect of punishment for all variations of the parameters can be studied. The results are shown in Figure 5 (bottom row), for all the three regions shown in Figure 5 in inverse order. First, when looking at the right panel (bottom row) of Figure 5, one can observe that punishment does not alter the desired outcome (safety behaviour is the preferred outcome) in region (I), i.e. safe behaviour remains dominant. Significantly less unsafe behaviour is observed in region (II), i.e. the middle panel (bottom row) of Figure 5, where it is not desirable, especially when s_α is small and s_β is sufficiently large (purple area). However, punishment has an undesirable effect in region (III), i.e. the left panel (bottom row) of Figure 5, as it leads to a reduction of AU when punishment is highly efficient (see the non-red area) while AU remains the preferred collective outcome in that region. The reason is that, for sufficiently small s_α and large s_β (such that s′ > 0 and s′ > s″, where s′ = 1 − s_α is the speed of PS and s″ = s − s_β that of AU), PS gains a significant advantage against AU, thereby dominating it even for low p_r.
In summary, reducing the development speed of unsafe players leads to a positive effect, especially when the personal cost is much less than the effect it induces on the unsafe player. Yet at the same time, it may lead to unwanted sanctioning effects in the region where risk-taking should be promoted.

Reward vs punishment for promoting safety compliance
Here we investigate how positive incentives, as explained in Methods, influence the outcome in all three regions, considering the average payoffs among the three strategies AS, AU and RS. The payoff of RS against another RS is computed under the assumption that reward is sufficiently cost-efficient, such that 1 + s_β > s_α; otherwise, this payoff would be Π_11. On the one hand, one can observe that RS is always dominated by AS. On the other hand, the condition for RS to be risk-dominant against AU is, for sufficiently large B (fixing W), equivalent to
\[
p_r > 1 - \frac{1 + s_\beta - s_\alpha}{3s}.
\]
Hence, RS can improve upon AS when playing against AU whenever s_β > s_α (recall that the condition for AS to be risk-dominant against AU is p_r > 1 − 1/(3s)). This differs from the peer punishment strategy PS, which can lead to improvement even when s_β ≤ s_α.
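The comparison between the two risk-dominance thresholds can be sketched numerically. This is an illustration under an explicit assumption: that two RS players develop at speed 1 + s_β − s_α, so that the large-B threshold is p_r > 1 − (1 + s_β − s_α)/(3s). The function names are hypothetical.

```python
def as_threshold(s):
    """Large-B risk threshold above which AS is risk-dominant against AU."""
    return 1 - 1 / (3 * s)


def rs_threshold(s, s_alpha, s_beta):
    """Large-B risk threshold for RS against AU (assumed form, see lead-in)."""
    joint_speed = 1 + s_beta - s_alpha  # assumed speed of RS facing another RS
    if joint_speed <= 0:
        raise ValueError("reward must be cost-efficient enough: 1 + s_beta > s_alpha")
    return 1 - joint_speed / (3 * s)
```

With s = 1.5, s_α = 1.5 and s_β = 3.0, the RS threshold is about 0.44 against about 0.78 for AS, so RS is risk-dominant against AU over a larger part of the dilemma region; when s_β = s_α the two thresholds coincide.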
Thus, under the above condition, a cyclic pattern emerges (see Figure 3b): from AS to AU, to RS, then back to AS. In contrast to punishment, the rewarding strategy RS has a very low frequency in general (as it is always dominated by the non-rewarding safe player AS). Nonetheless, RS catalyses the emergence of safe behaviour. Figure 5 (top row) shows the frequencies of AU in a population with AS and RS, for varying s_α and s_β, in comparison with those from the punishment model, for the three regions. One can observe that, in region (II), i.e. the middle panel (top row) of Figure 5, punishment is more (or at least as) efficient than reward in suppressing AU, except when incentivising is rather costly (i.e. sufficiently large s_α) but highly cost-efficient (s_β > s_α) (the areas inside the white triangles; see also Figure 7 in SI for a clearer difference with larger β). This is because RS can take over AU effectively only when the incentive is highly cost-efficient (see again Equation 9); furthermore, the larger both s_α and s_β are, the stronger the transition from RS to AS, to a degree that can overcome the transition from AS to AU. For an example satisfying these conditions, where s_α = 1.5 and s_β = 3.0, see Figure 10 in SI.
In regions (I) and (III), i.e. the right and left panels (top row) of Figure 5, the rewarding strategy does not change the outcomes, as is desired (for punishment, this holds only in region (I)). Note however that, differently from punishment, in region (I), i.e. the right panel (top row) of Figure 5, only AS dominates the population, while in the case of punishment, AS and PS are neutral and together dominate the population (see Figure 5, comparing panels c and f). Most interestingly, rewards do not harm region (III), i.e. the left panel (top row) of Figure 5, which suffers from over-regulation in the case of punishment; this is because of the stronger transitions from RS to AS and from AS to AU. Additional numerical analysis shows that all these observations are robust for larger β (see SI, Figure 7).
In SI, we also consider the scenario where both peer reward and punishment are present, in a population of four strategies, AS, AU, PS and RS (see Figures 8 and 9). Since PS behaves in the same way as AS when interacting with RS, there is always a stronger transition from RS to PS. This results in an outcome in terms of AU frequency similar to the case when only PS is present, suggesting that, in a self-organized scenario, peer punishment is more likely to prevail than peer rewarding when individuals face a technological race.
Finally, it is noteworthy that all results obtained in this paper are robust if one considers that, with some probability in each round, UNSAFE players can be detected, resulting in those UNSAFE players losing all payoff in that round [16]. This confirms that in a short-term AI regime only participants' speeds matter (in relation to the disaster risk, p_r), and controlling the speeds is important to ensure a beneficial outcome (see also [16]).

Conclusion
In this paper we study the dynamics associated with technological races, using as a case study those having the objective of being the first to bring some AI technology to market. The model proposed, however, is general enough to be applicable to other innovation dynamics which face the conflict between safety and rapid development [13,20]. We address this problem resorting to a multiagent and complex systems approach, while adopting well-established methods from evolutionary game theory and population dynamics.
We propose a plausible adaptation of a baseline model [16] which can be useful when thinking about policies and regulations, namely incipient forms of community enforcing mechanisms, such as peer rewards and sanctions. We identify the conditions under which these incentives provide the desired effects while highlighting the importance of clarifying the risk disaster regimes and the time-scales associated with the problem. In particular, our results suggest that punishment, by forcibly reducing the development speed of unsafe participants, can generally reduce unsafe behaviour even when sanctions are not particularly efficient. In contrast, when punishment is highly efficient, it can lead to over-regulation and an undesired reduction of innovation, noting that a speedy and unsafe development is acceptable and more beneficial for the whole population whenever the risk for setbacks or disaster is low compared to the extra speed gained by ignoring safety precautions. Similarly, rewarding a safe co-player to speed up its development may, in some regimes, stimulate safe behaviours, whilst avoiding the detrimental impact of over-regulation.
These results show that, similarly to peer incentives in the context of one-shot social dilemmas (such as the Prisoner's Dilemma and the Public Goods Game) [25][26][27][28][30][31][32][33][34][35], strategies that target development speed in DSAIR can influence the evolutionary dynamics, but interestingly, they produce some very different effects from those of incentives in social dilemmas [44]. For example, we have shown that strong punishment, even when highly inefficient, can lead to improvement of safety outcome; while punishment in social dilemmas can promote cooperation only when highly cost-efficient. On the other hand, when punishment is too strong, it might lead to an undesired effect of over-regulation (reducing innovation where desirable), which is not generally the case in social dilemmas.
Our model and analysis of elementary forms of incentives thus provides an instrument for policy makers to ponder on the supporting mechanisms (e.g. positive and negative incentives), in the context of technological races [45][46][47]. Concretely, both sanctioning of wrong-doers (e.g. rogue or unsafe developers/teams) and rewarding of right-doers (e.g. safe-compliant developers/teams) can lead to enhancement of the desirable outcome (it being that of innovation or risk-taking in low risk cases, and safety-compliance in higher risk cases). Notably, while the former can be detrimental for innovation in low risk cases, it leads to a stronger enhancement for a wider range of effect-to-cost ratio of incentives. Thus, when it is not clear from the beginning what is the risk level associated (with the technology to be developed), then positive incentives appear to be the safer choice than negative ones (in line with historical data on rewards usage in innovation policy in the UK [46] as well as suggestions for Covid-19 vaccine innovation policy [21]). This is the case for many kinds of technological races especially when data about the effect of a new technology is usually lacking and only becomes available when it has been created and used enough (see the Collingridge Dilemma [48]), as are the cases of the domain supremacy race through AI [18,19] and the race for creating the first Covid-19 vaccines [21,49]. On the other hand, when one can determine early on that the associated level of risk is sufficiently high (i.e. above a certain threshold as determined in our analysis), negative incentives might provide a stronger mechanism. For instance, high risk technologies such as new airplane models and medical products [50] might benefit from putting strong sanctioning mechanisms in place.
In short, our analysis has shown, within an idealised model of an AI race and using a game theoretical framework, that some simple forms of peer incentives, if used suitably (to avoid over-regulation, for example) can provide a way to escape the dilemma of acting safely even when speedy unsafe development is preferred. Future studies may look at more complex incentivising mechanisms [47] such as reputation and public image manipulation [51,52], emotional motives of guilt and apology-forgiveness [53,54], institutional and coordinated incentives [28,36], and the subtle combination of different forms of incentive (e.g., stick-and-carrot approach and incentives for agreement compliance) [32,34,55,56,57].

T.L. acknowledges support by the FuturICT2.0 (www.futurict2.eu) project funded by the FLAG-ERA JCT 2016.

Supporting information
Details of analysis for three strategies AS, AU, CS
Let CS be a conditionally safe strategy, playing SAFE in the first round and choosing the same move as the co-player's choice in the previous round. We recall below the detailed calculations for this case, as described in [16], just for completeness, based on the average payoff matrix for the three strategies AS, AU, CS (for the row player). We derive below the conditions under which (i) a SAFE population has a larger average payoff than an UNSAFE one, i.e. Π_{AS,AS} > Π_{AU,AU}, meaning by definition that safety compliance is the preferred collective outcome, and (ii) AS and CS are more likely to be imitated against AU (i.e., risk-dominant). First, for condition (i), it must hold that
\[
\frac{B}{2W} + \Pi_{11} > p\left(\frac{sB}{2W} + \Pi_{22}\right).
\]
Thus,
\[
p_r > 1 - \frac{B + 2W\,\Pi_{11}}{sB + 2W\,\Pi_{22}},
\]
which is equivalent to (since B/W ≫ b)
\[
p_r > 1 - \frac{1}{s}. \qquad (13)
\]
This inequality means that, whenever the risk of a disaster or personal setback, p_r, is larger than the gain that can be obtained from a greater development speed, the preferred collective action in the population is safety compliance. Now, for condition (ii), the requirements that AS and CS be risk-dominant against AU are both equivalent to (since B/W ≫ b)
\[
p_r > 1 - \frac{1}{3s}. \qquad (16)
\]
The two boundary conditions for (i) and (ii), as given in Equations 13 and 16, split the s − p_r parameter space into three regions, as exhibited in Figure 6a:
(I) when p_r > 1 − 1/(3s): This corresponds to the AIS compliance zone, in which safe AI compliance is preferred collectively and unconditionally (AS) and conditionally (CS) safe development is the social norm (an example for s = 1.5 is given in Figure 6b: p_r > 0.78);
(II) when 1 − 1/(3s) > p_r > 1 − 1/s: This intermediate zone is the one that captures a dilemma because, collectively, safe AI developments are preferred, though the social dynamics pushes the whole population to the state where all develop AI in an unsafe manner. We shall refer to this zone as the AIS dilemma zone (for s = 1.5, 0.78 > p_r > 0.33, see Figure 6c);
(III) when p_r < 1 − 1/s: This defines the AIS innovation zone, in which unsafe development is not only the preferred collective outcome but also the one the social dynamics selects.
Calculation for π_{PS,AU} and π_{AU,PS} in general case
Below, R denotes the average number of rounds; B_1 and B_2 the benefits PS and AU might obtain from the winning benefit B when either of them wins the race by being the first to have made W development steps; b_1 and b_2 the intermediate benefits PS and AU might obtain in each round of the game; and p_loss the probability that all the benefit is not lost when AU wins or draws the race. Clearly, all these values depend on the development speeds (s′ = 1 − s_α for PS and s″ = s − s_β for AU).

Figure 7 (caption, partial): Other parameters are the same as in Figure 5 in the main text. The observations in that figure are also robust for larger intensities of selection.

Figure 10 (caption, partial): [...] Figures 5 and 7: s_α = 1.5, s_β = 3. We observe that the frequency of AU is lower in the case of reward than in that of punishment. Other parameters as in Figure 2.