Policy search with rare significant events: Choosing the right partner to cooperate with

This paper focuses on a class of reinforcement learning problems where significant events are rare and limited to a single positive reward per episode. A typical example is that of an agent who has to choose a partner to cooperate with, while a large number of partners are simply not interested in cooperating, regardless of what the agent has to offer. We address this problem in a continuous state and action space with two different kinds of search methods: a gradient policy search method and a direct policy search method using an evolution strategy. We show that when significant events are rare, gradient information is also scarce, making it difficult for policy gradient search methods to find an optimal policy, with or without a deep neural architecture. On the other hand, we show that direct policy search methods are invariant to the rarity of significant events, which is yet another confirmation of the unique role evolutionary algorithms have to play as a reinforcement learning method.


INTRODUCTION
We consider a particular class of reinforcement learning problems where only rare events can result in non-zero rewards and where the agent can experience at most one positive reward in a limited time. This problem is closely related to the problem of learning with rare significant events in reinforcement learning [2,5,11], where rare events can significantly affect performance (e.g. in network and communication systems or control problems where failure can be catastrophic). In this paper, we consider that while significant events occur independently of the agent's actions, the agent's policy determines if a positive reward should be obtained when such an event occurs. Significant events are thus defined as unique opportunities to obtain a positive reward and stop the game. Each opportunity can either be seized for an immediate reward or ignored if the agent hopes to get a better reward in the future.
We address this problem in the context of an independent, on-line and on-policy episodic learning task with continuous state and action spaces. The practical application addressed in this paper is that of an agent learning to choose a partner for a task that requires cooperation (e.g., predators hunting a large prey or individuals selecting a lifelong mate). The agent can choose to cooperate or not with a potential partner, based on the effort this partner is willing to invest in the cooperation. At the same time, the agent must invest enough so that its partner also accepts to cooperate. In this setup, the agent may face partners willing to invest various amounts of energy in cooperation (i.e., a possibly significant event), or even partners that refuse to cooperate whatever the agent is ready to invest (i.e. a non-significant event).
Results from theoretical biology [4,7,10,17] have shown that cooperation with partner choice is optimal only under certain conditions. First, the number of cooperation opportunities must be large enough that an agent can refuse to cooperate with a potential partner and still have the opportunity to meet a more interesting partner. Second, if an agent and its partner both decide to cooperate, the actual duration of this cooperation must be long enough to make cooperation with an uninteresting partner significantly costly (which is the case when there can be only one single successful cooperation event). Under these conditions, the optimal strategy for an agent is to be very demanding in choosing its partner.
The question raised in this paper is whether reinforcement learning algorithms actually succeed in learning an optimal strategy when the necessary conditions are met. We are particularly interested in how the rarity of significant events influences convergence speed and performance of policy learning. Indeed, it is not clear how gradient-based policy search methods can deal with a possibly large number of non-significant events that provide zero reward.
We use two state-of-the-art methods for on-policy reinforcement learning with continuous state and action spaces: (1) a deep learning method (PPO [23]) for gradient policy search and (2) an evolutionary method (CMAES [13]) for direct policy search. While both methods provide similar results when the agent is always presented with significant events, policy search methods are not equal when such events become rarer. While the direct policy search method is oblivious to the rarity of significant events, the gradient policy search method suffers significantly from rarity.
The paper is structured as follows: the reinforcement learning problem with rare significant events and a single reward per episode is formalized, and the partner choice learning problem is presented as a variation of a continuous prisoner's dilemma. Algorithms and results are then presented, and learned policies are analysed and compared.

METHODS

Learning with Rare Significant Events
Formally, we consider an independent learner x•, called the focal agent, which is placed in an aspatial environment. At each time step, x• is presented with either a cooperative partner x+_i ∈ X+ or a non-cooperative partner x−_j ∈ X−. X+ (resp. X−) is the finite set of all cooperative (resp. non-cooperative) agents, with both |X+| and |X−| ∈ N and |X+| > 0, |X−| ≥ 0. When presented with a non-cooperative partner x−_j, the focal agent's reward will always be zero. When presented with a cooperative partner x+_i, the focal agent's reward will depend on its own action and that of its partner (see Section 2.2 for details).
Our objective is to endow the focal agent x• with the ability to learn how to best cooperate, which implies negotiating with its potential partners and deciding whether cooperation is worth investing energy in, or not (see Section 2.3 for details). The focal agent faces an individual learning problem, as it must optimize its own gain over time in a competitive setup, whether its partners are also learning agents or not. For cooperation to occur between the focal agent and a partner, the partner must be willing to cooperate (i.e. be one of X+) and both the focal agent and the cooperative partner must estimate that their own energy invested in cooperation is worth the benefits.
We use the standard reinforcement learning framework proposed by Sutton and Barto [27] to formalize the learning task from the focal agent's viewpoint, which is essentially a single agent reinforcement learning problem.
The focal agent x• interacts with the environment in a discrete time manner. At each time step t = 0, 1, 2, ..., x• is in a state s_t ∈ R which describes its current partner's investment value, and plays a continuous value a_t ∈ R which represents its decision to cooperate (a_t > 0) or not (a_t ≤ 0).
Let π_θ be the parametrised policy of the focal agent, with θ ∈ R^d. The learning task is to search for the optimal parameters θ* = argmax_θ J(θ), with J(θ) the global function to be optimized, defined as the expected sum of rewards J(θ) = E_π_θ [Σ_t r_t], with reward r_t at time t. Rewards are defined such that r_t ∈ R and depend on the current state s_t and action a_t, and are produced according to the following probability generator: with probability p, the focal agent meets a cooperative partner and r_t = payoff(s_t, a_t); with probability 1 − p, it meets a non-cooperative partner and r_t = 0. The probability p ∈ [0, 1] thus determines the probability to encounter a cooperative agent (i.e. one of X+). The value of p depends on the setup, and determines how rarely significant events occur when p < 1.0. A probability of p = 1.0 means the focal agent x• encounters a cooperative partner at each time step t, with a possible positive reward (if cooperation is accepted by both agents) that depends on the payoff function. Non-zero rewards become rarer (but still possible) as p → 0. Note that payoff(s, a) is non-zero only if both the focal agent and its cooperative partner accept to cooperate. Cf. Section 2.3 for details on the negotiation process.
The problem presented here is very similar to that of Rare Significant Events as formulated by Frank et al. [11]. However, our problem differs on two aspects. Firstly, we consider on-line on-policy search of a parametrised policy, where the frequency of significant events cannot be controlled. Secondly, and even more importantly, a learning episode stops right after the focal agent and one cooperative agent have reached a consensus to cooperate. If no cooperation is triggered, an episode stops after a maximum number of iterations T_max, defined as T_max = 100/p. As a result, the expected number of meetings with cooperative partners is held constant independently from the value of p (i.e. E = 100). It is therefore possible to obtain episodes of different lengths but with the same expected number of significant events.
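The episode-length rule and the reward generator above can be sketched as follows (a minimal sketch; `payoff_fn` is a placeholder for the payoff function of Section 2.2):

```python
import random

def max_episode_length(p, expected_meetings=100):
    # Maximum episode length, chosen so that the expected number of meetings
    # with cooperative partners stays constant (here 100) whatever the value of p.
    return round(expected_meetings / p)

def step_reward(p, payoff_fn):
    # Reward generator sketch: with probability p the focal agent meets a
    # cooperative partner (a significant event) and the reward is given by
    # the payoff function; otherwise the reward is 0.
    if random.random() < p:
        return payoff_fn()  # may still be 0 if cooperation is refused
    return 0.0

for p in (1.0, 0.5, 0.2, 0.1):
    print(p, max_episode_length(p))  # 100, 200, 500, 1000
```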
The situation that is modelled here corresponds to many collective tasks observed in nature [3,25,28], where each agent has to balance between looking for partners and cooperating with the current partner, the latter possibly taking significant time. As a matter of fact, it has been shown elsewhere [4,7,9,10,17] that optimal partner choice strategies can be reached only when the cost of cooperation is large (i.e. the duration of cooperation is long with regards to looking for cooperative partners).

Partner Choice and Payoff Function
Whenever the focal agent x• and a cooperative partner x+_i interact together, they play a variation of a continuous Prisoner's Dilemma. Cooperation actually takes place if both agents deem it worthwhile. The two-step procedure for partner choice is the following: (1) each agent simultaneously announces the investment it is willing to pay to cooperate; (2) each agent then chooses whether to continue the cooperation, based on the investment announced by its partner and its own.
To simplify notations, we use x• and x+_i to represent both the agents and the investment values they play, i.e. x• (resp. x+_i) plays x• (resp. x+_i). The gain G(x•, x+_i) received by the focal agent x• is defined by a payoff function with parameters a, b ≥ 0 and x+_i > 0. This payoff function combines both a prisoner's dilemma and a public good game, and was first introduced in Ecoffet et al. [10]. Two different equilibria can be reached for x•:
• x• = x_d = a. This is a sub-optimal equilibrium, which corresponds to an agent cheating, a typical outcome in the prisoner's dilemma where an agent maximizes its own gain, but also minimizes its exposure to defection. This ensures the best payoff for the agent if it is unable to distinguish a cheater from a cooperator.
• x• = x_c = a + b. This is the optimal equilibrium, where both agents cooperate to maximize their long-term gain. The public good game is included in the payoff function to help distinguish between agents that are simply ignoring the cooperation game (x• = 0) and those who take part in it, even if they defect (x• ≥ x_d).
The focal agent can get the optimal payoff if it plays x• = x_c and its partner plays x+_i ≥ x_c, which can occur if particular conditions are met when partner choice is enabled. Partner choice can lead to optimal individual gain whenever a successful cooperation removes the possibility for further gain with other partners. In other words: the focal agent can meet with any number of possible partners, but will take the gain of the first and single mutually accepted cooperation offer.
In this paper, we set a = 5 and b = 5, therefore x_d = 5 and x_c = 10. The maximum payoff is obtained when the agent invests x• = x_c and its partner invests equally x+_i = x_c. In this context, G(x•, x+_i) = 50. The focal agent's investment is bounded as 0.0 ≤ x• ≤ 15.0, and similarly for x+_i. G(x•, x+_i) and payoff(s, a) (introduced in Equation 3) differ, as the G function relates to the game-theoretical setting while the payoff function relates to the reinforcement learning problem. On the one hand, the payoff function computes the focal individual's reward whether or not cooperation was initiated. On the other hand, G computes the focal individual's gain resulting from a cooperation game between two agents that accepted to cooperate. However, both functions are linked. From a notational standpoint, s relates to the investment value of the focal individual x•, and a represents the decision to cooperate, which depends on both s and that of its partner x+_i (which is implicit). The return value of payoff(s, a) depends on whether cooperation was initiated or not. If both agents decided to cooperate, then the focal agent's payoff is payoff(s, a) = G(x•, x+_i), with G(x•, x+_i) ≤ 50 in this case. If cooperation fails, the focal agent's payoff is payoff(s, a) = 0 (which is obtained without having to compute G). The payoff function in Equation 3 can thus be written as follows, assuming a• > 0 (resp. a+_i > 0) means the focal agent (resp. partner) is willing to cooperate: payoff(s, a) = G(x•, x+_i) if a• > 0 and a+_i > 0, and payoff(s, a) = 0 otherwise.
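The exact expression of G from Ecoffet et al. [10] did not survive extraction here, so the sketch below uses one functional form chosen only to be consistent with the constraints stated above (selfish best response at x_d = a, joint optimum at x_c = a + b, and G(x_c, x_c) = 50 for a = b = 5); it is an assumption, not the paper's equation:

```python
A, B = 5.0, 5.0        # payoff parameters (a = b = 5 in the paper)
X_D, X_C = A, A + B    # defection and cooperation equilibria (5 and 10)

def gain(x_focal, x_partner):
    # Assumed form: a quadratic own-investment term whose selfish best
    # response is x = a, plus a partner-benefit (public good) term that
    # moves the joint optimum to x = a + b.
    return A * x_focal - x_focal ** 2 / 2.0 + B * x_partner

def payoff(x_focal, x_partner, accept_focal, accept_partner):
    # Non-zero reward only if both agents accept to cooperate (Eq. 3).
    if accept_focal > 0 and accept_partner > 0:
        return gain(x_focal, x_partner)
    return 0.0
```

With this assumed form, gain(10, 10) = 50, the best selfish reply to any partner investment is 5, and the joint optimum is 10, matching the two equilibria described above.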

Behavioural Strategies
For each interaction, the focal agent's investment value x• ∈ [0, 15] is computed and, once the investment value of its partner is known, its decision to cooperate a• ∈ R is computed to determine whether cooperation should be pursued or not. Each value is provided by a dedicated decision module:
• the investment module, which provides the cost x• that the focal agent is willing to invest to cooperate. This module takes no input, as it is endogenous to the agent (i.e. the proposed cost x• is fixed throughout an episode);
• the choice module, which takes both the focal agent's own investment value (x•) and that of its partner (x+_i or x−_j), and computes a•, which is used to determine whether cooperation is an interesting choice (a• > 0) or not (a• ≤ 0). The choice module is essentially a function choice(x•, x_k) → a• with x_k ∈ X+ ∪ X−. The parameters of this function are learned, and the decision to cooperate is computed accordingly (as the decision to cooperate is conditioned by the partner's investment).
With respect to the focal individual, Section 3 describes how the investment and choice modules are defined and how learning is performed depending on the learning algorithm used.
Cooperative partners x+_i and non-cooperative partners x−_j also use similar decision modules, providing investment and choice values. However, all use deterministic fixed strategies, which may differ from one partner to another. Firstly, non-cooperative partners x−_j all follow the same strategy: both the investment value x−_j and the decision to cooperate a−_j are always 0, ∀j. Secondly, each cooperative partner x+_i follows a stereotypical cooperative strategy depending on the value i. Each cooperating partner invests a fixed value x+_i ∈ [0, 15] defined as x+_i = 0.5 × (i − 1). Each cooperative partner then accepts to cooperate if and only if the focal agent's investment value x• is greater than or equal to its own investment, i.e. x• ≥ x+_i. In the following, there are 31 cooperating partners (x+_i ∈ X+, i ∈ {1, . . ., 31}). Following Eq. 8, this means cooperating partner x+_1 (resp. x+_2, ..., x+_31) plays 0 (resp. 0.5, ..., 15).
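The fixed partner strategies above can be sketched as follows (function names are illustrative):

```python
N_COOP = 31  # number of cooperative partners

def cooperative_investment(i):
    # Fixed investment of cooperative partner x+_i, i in {1, ..., 31}:
    # partner 1 plays 0.0, partner 2 plays 0.5, ..., partner 31 plays 15.0.
    return 0.5 * (i - 1)

def cooperative_accepts(i, x_focal):
    # A cooperative partner accepts iff the focal agent invests at least
    # as much as the partner itself does.
    return x_focal >= cooperative_investment(i)

def non_cooperative_plays():
    # Non-cooperative partners always invest 0 and always refuse.
    return 0.0, False
```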

PARAMETER SETTINGS AND ALGORITHMS
We use two reinforcement learning algorithms: a gradient policy search algorithm (PPO) and a direct policy search algorithm (CMAES). Both algorithms are used to learn the parameters of the focal agent's decision modules. For both algorithms, the performance of a policy (i.e. the return or the fitness, depending on the vocabulary used) during one episode is computed as the sum of rewards during the episode (cf. Section 2.1), which is either zero, or the value of the unique non-zero reward obtained before the episode ends.

Proximal Policy Optimization
The deep reinforcement learning algorithm Proximal Policy Optimisation (PPO) [23] is a variation of the Policy Gradient algorithm [27].

As the expected value of a given state-action pair varies according to the policy itself, updating a new policy from samples acquired with an old policy may cause inaccurate predictions, as the expected value of a state-action pair may be wrong with respect to the new policy. PPO ensures that the policy updated from these samples remains within a so-called trust region at each learning step.
As we are dealing with episodes and do not want to encourage the focal agent to act in the least number of time steps possible, the discount factor is set to γ = 1.0, as recommended by Sutton and Barto [27, p.68]. The PPO hyper-parameters used are reported in Table 1.
The investment and choice modules are both represented as Artificial Neural Networks (ANN). A module is composed of both a decision network and a Value function, as PPO runs as an actor-critic algorithm. The Value function network has the same layout as the decision network, but only outputs the (continuous) value of the state.
The decision network for the investment module is a simple neural network with one single input set to 1.0, no hidden layer and two outputs: the investment mean μ and standard deviation σ. The investment x• is drawn from the distribution N(μ, σ²) and clipped between 0 and 15. This continuous stochastic action selection is essential to the PPO search algorithm.
The decision network for the choice module is a multilayer perceptron with two input neurons and two output neurons (for accepting or refusing cooperation). The output neurons use a linear activation function, and a softmax probabilistic choice is made to choose which action to take (accept or decline). Hidden units use a hyperbolic tangent activation function. A bias node is used, which projects on both the hidden layer(s) and the output neurons. The Value function estimator uses the same architecture as the choice network, with only one output.
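The two stochastic heads described above can be sketched as follows (a plain-Python sketch of the sampling logic only, not the actual RLlib implementation):

```python
import math
import random

def sample_investment(mu, sigma):
    # Investment head: draw from N(mu, sigma^2), then clip to [0, 15].
    x = random.gauss(mu, sigma)
    return min(max(x, 0.0), 15.0)

def sample_choice(logit_accept, logit_decline):
    # Choice head: two linear outputs turned into a probabilistic
    # accept/decline decision via softmax (stabilised by subtracting
    # the maximum logit before exponentiation).
    m = max(logit_accept, logit_decline)
    p_accept = math.exp(logit_accept - m)
    p_decline = math.exp(logit_decline - m)
    return random.random() < p_accept / (p_accept + p_decline)
```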
In Section 4, two different architectures are evaluated, which we refer to as PPO-MLP and PPO-DEEP.While both use the decision network for the investment module described before, they differ with respect to the architecture used for the choice module.PPO-MLP implements a single hidden layer with 3 neurons, and PPO-DEEP implements a deep architecture with two hidden layers, each with 256 neurons.While PPO-DEEP may seem overpowered at first sight, over-parametrization has been shown to be very effective in deep learning as multiple gradients can be followed in wide neural networks [1,8,19].
All parameter values and module architectures result from an extensive search (summarised in the Supplementary Materials). In particular, a grid search was performed to select the best values for each parameter, including the learning rate. The number of Stochastic Gradient Descent iterations, the batch size and the minibatch size had little impact on either performance or convergence. In addition, we performed additional experiments to evaluate the impact of using (1) a discount factor γ < 1.0 (i.e. 0.9, 0.99 and 0.999) and (2) PPO without actor-critic. None of these settings provided better (or even comparable) results to those obtained with the parameters reported in Table 1.

Covariance Matrix Adaptation Evolution Strategy
The Covariance Matrix Adaptation Evolution Strategy (CMAES) is a derivative-free, black-box optimisation algorithm [13]. The goal of CMAES is to find θ* that maximizes (or minimizes) a continuous function f. CMAES does not require the function to be convex or differentiable, and relies on stochastic sampling around the current estimate of the solution.
CMAES creates a population of size λ using a multivariate Gaussian distribution. Each individual of the population is evaluated, and CMAES then updates its distribution estimate based on the average of the sampled candidates weighted by their evaluation rank. Furthermore, the covariance matrix of the multivariate Gaussian distribution is updated so that the distribution is biased toward the most promising direction. The investment module is represented as a single real value (the investment), which is clipped between 0 and 15 when used. The partner choice module is a neural network with 2 inputs, one hidden layer with three neurons and two neurons on the output layer, used to compute the probability to accept or refuse cooperation. A softmax probabilistic choice is made to choose which action to take. A bias node is also used, neurons from the hidden layer use a hyperbolic tangent activation function, and the output units use a linear activation function. There are 17 neural network weights.
The parameters for both modules are compiled into a single vector of real values. To make the search space similar to that of PPO, dummy parameters are added to the vector (i.e. values which can be modified by the algorithm, but with no impact on the outcome) to reach a total number of 34 real values (i.e. Θ ∈ R^34).
Table 2 summarizes the parameters used for the CMAES algorithm. As CMAES is mostly parameter-free, there was no need to perform an extensive preliminary search, and we used the default values. We chose σ_0 = 1.0 for the initial standard deviation and a vector of zeros as the initial guess. The population size λ is the default population size in the python CMAES implementation [12], i.e. λ = 4 + ⌊3 × ln(n)⌋ = 14, with n = 34 the number of dimensions in the model. Once the λ candidate solutions are evaluated, a new population is generated according to their performance. A new population is thus generated every 14 episodes, and so forth until the evaluation budget is consumed.
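The default population size can be checked directly; the optimisation loop itself follows the standard ask/tell interface of the python cma package (sketched in comments, under the assumption that `evaluate(theta)` returns the episode return of a candidate policy):

```python
import math

def default_popsize(n):
    # Default CMA-ES population size: lambda = 4 + floor(3 * ln(n)).
    return 4 + math.floor(3 * math.log(n))

print(default_popsize(34))  # 14 for the 34-dimensional search space

# Ask/tell loop sketch (requires the `cma` package; `evaluate` is a
# placeholder for the one-episode evaluation of a candidate policy):
# import cma
# es = cma.CMAEvolutionStrategy(34 * [0.0], 1.0)
# while not es.stop():
#     candidates = es.ask()
#     es.tell(candidates, [-evaluate(theta) for theta in candidates])  # cma minimizes
```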
A candidate solution for the focal agent is evaluated on one episode only, the length of which may vary depending on when the focal agent and its partner both accept to cooperate (maximal duration defined in Eq. 4).

Table 2: Parameters for the CMAES algorithm
Parameter | Value
Population size λ | 14
Number of episodes per evaluation | 1
Search space | R^34

RESULTS
The environment, the models and the learning algorithms are implemented with ray, rllib and pytorch. We use the cma package in python for the CMAES implementation. Source code is available at https://github.com/PaulEcoffet/RLCoopExp/releases/tag/v1.1.
For a given value of the probability p of rare significant events, we performed 24 independent runs for each algorithm. A run lasts 200 000 episodes. The maximum duration of an episode is fixed as described in Section 2.1, so that the expected number of significant events remains identical whatever the actual rarity throughout one episode (cf. Equation 4). In practice, an episode lasts at most 100 (resp. 200, 500, 1000) iterations for p = 1.0 (resp. 0.5, 0.2, 0.1).
Performance of the current policy is plotted every 4000 iterations, which corresponds to the batch size used by both PPO instances for learning. As episodes are significantly shorter than 4000 iterations, the policy's performance is averaged over several episodes. For CMAES, we extract the best policy of the current generation and re-evaluate it 10 times (i.e. for 10 episodes) to get a similarly averaged performance. Results are shown on figures with a data point every 1000 episodes.

Learning when All Events are Significant
Figure 1 shows the performance throughout learning for CMAES, PPO-DEEP and PPO-MLP when p = 1.0 (i.e. the focal agent faces only cooperative partners). Each figure shows 24 curves corresponding to the 24 independent runs. Both PPO versions and CMAES are shown to learn near-optimal policies (performance → 50) in almost all runs. CMAES is the fastest to converge, and PPO-DEEP (despite the huge number of dimensions) is faster than PPO-MLP. On the other hand, CMAES offers less robustness, as 20 (out of 24) runs with CMAES reach a performance above 40, to be compared to 23 (out of 24) runs with PPO-MLP and 24 runs with PPO-DEEP.
In order to better compare the quality of the policies learned by each algorithm, the best policy from the end of each run is selected and re-evaluated for 1000 extra episodes without learning. Results are shown in Figure 2, with all three methods faring similarly. The median value for CMAES (47.64) is only slightly higher than that of PPO-DEEP (46.99) and PPO-MLP (45.58).
Therefore, we conclude that all three algorithms provide excellent and comparable results when only significant events are experienced (p = 1.0).

Learning when Significant Events are Rare
Figure 3 shows the performance of the agent throughout its learning with both PPO algorithms and the CMAES algorithm for different conditions of rare significant events (p ∈ {0.1, 0.2, 0.5}), as well as with the control condition where all events are significant (p = 1.0, taken from the previous Section). Each figure compiles 24 independent runs per condition, tracing the median performance and 95% confidence interval over the 24 runs. CMAES is only marginally impacted when significant events become rarer (i.e. p < 1.0), with all setups showing convergence towards a similar performance value close to the optimal (above 40). While PPO-DEEP fares better than PPO-MLP for p < 1.0, both are largely affected. In the extreme case where p = 0.1, the average performance is 35.7 ± 5.2 for PPO-DEEP and 24.9 ± 4.2 for PPO-MLP, to be compared to 46.2 ± 3.2 for CMAES.
Figure 4 shows the results of the additional analysis where the best policy from each run for each condition p ∈ {0.1, 0.2, 0.5, 1.0} is selected and re-evaluated for 1000 extra episodes without learning and with the condition p = 1.0 (i.e. only significant events matter here). Results confirm that the difference in the performance of policies obtained with CMAES compared to either version of PPO widens as significant events become rarer (p < 1.0), with both PPO-MLP and PPO-DEEP faring significantly worse than CMAES (p-value < 0.0001, Mann-Whitney U-test).

Analysing the Best Policies for Partner Choice
In order to better understand why policies' performance differs among learning algorithms and conditions, the agent's policy obtained at the end of each run is extracted and analysed (i.e. 24 policies per algorithm per condition). Figure 5 illustrates the outcome of the Investment Module (x•), i.e. the investment value offered by the focal agent when faced with a potential partner. It is obtained by measuring the investment value of the focal agent over 1000 episodes with p = 1.0 and without learning. Policies learned with CMAES play close to x_c = 10, which is the optimal play for the payoff function (Section 2.2), whatever the frequency of significant events. As expected, this is different for policies learned with PPO, as the outcome values of the Investment Module are significantly lower when the frequency of significant events decreases (p < 1.0).
Figure 6 illustrates the investment values played by cooperative partners when the focal agent accepts to cooperate (whether or not cooperation actually takes place, as it also depends on the partner's acceptance). In other words, it represents how demanding the focal agent is with respect to its partners' intention to invest in cooperation. The probability to accept cooperation is computed for the policies of each run. Each policy is presented with all 31 possible cooperative partners, 100 times each, to estimate the focal agent's strategy. While CMAES produced consistent policies that follow quasi-identical strategies for all conditions (i.e. accepting partners that invest close to the optimal x_c = 10 or above), this is not the case for PPO policies, which are less demanding for lower values of p, with many of the policies learned by PPO-MLP with condition p = 0.1 actually accepting any partner. PPO-DEEP policies fare better than PPO-MLP policies, but still worse than policies learned with CMAES when significant events are rarer.
Figure 7 takes a detailed look at the results shown in Figure 6. It shows the strategy profile for partner choice of the best policy obtained with each algorithm in each condition. Focal agents obtained with CMAES follow an efficient and clear-cut strategy: they play the optimal investment value (x• = x_c = 10, green vertical line) and accept partners only when those play a similar or better value (x+_i ≥ 10, blue line). Policies obtained with PPO-DEEP and PPO-MLP either follow roughly the same profile with a more stochastic behaviour (PPO-MLP policies for p = 1.0 and 0.5, PPO-DEEP policies for p = 1.0, 0.2 and 0.1) or display a selective strategy, choosing partners only when they play close to the optimal investment value x+_i ≈ x_c. Only PPO-MLP produced policies which are clearly suboptimal, for p = 0.2 and p = 0.1, with a mean investment below the optimal investment value (x• < x_c).

CONCLUDING REMARKS
In this article, we focused on an on-policy reinforcement learning problem of an autonomous agent that needs to maximize its gain when interacting with other agents, with whom our agent may or may not decide to cooperate. The peculiarity of this problem is to present a (very) small number of significant events, during which the agent can obtain only one single positive reward. The challenge is therefore to learn how to best choose a partner, by making a compromise between the chances of finding a better partner and the cost of an interaction.

Figure 7: Strategy profiles for partner choice, for each condition and all algorithms (top: CMAES, center: PPO-DEEP, bottom: PPO-MLP). For each setup, only the best policy is shown. Each graph plots the probability to accept cooperation for the focal agent following the best policy (y-axis) depending on its partner's proposed investment (x-axis). Data are computed by presenting each of the 31 possible cooperative partners to the focal agent for 100 iterations, as policies are stochastic. The green vertical line represents the mean investment of the focal agent.
We studied the dynamics of two reinforcement learning methods: a gradient policy search algorithm and a direct policy search algorithm with an evolution strategy.Both algorithms succeeded in learning policies that make an optimal use of partner choice when interaction opportunities are frequent.However, the two algorithms differ fundamentally when interaction opportunities are rare.The direct policy search algorithm shows total robustness, while the gradient policy search algorithm collapses, resulting in sub-optimal policies.
The robustness of the direct policy search method can be expected, as the sequential and temporal aspects of the task are lost within one evaluation. As long as the evaluation time is long enough to sample the whole population of relevant partners, there is no cost nor change in the algorithm's dynamics when significant events are diluted within a longer sequence, but remain of the same number. Such independence to action frequency and delayed rewards has actually been observed elsewhere, though for different problems (e.g. a robotic control problem [22]). This is of course different for the gradient policy search method, where increased rarity means that many learning steps will be performed with zero reward, resulting in poor gradient information most of the time. Not only does this slow down learning, even with a similar number of iterations, but it also prevents learning from converging towards a truly optimal partner choice strategy. This remains true even when a large search space is considered, in which over-parametrization in deep neural networks helps gradient search [8,19].
The broader motivation behind this work is to identify reinforcement learning problems for which evolutionary algorithms, as direct policy search methods, offer a competitive advantage over gradient policy search methods (cf. also [6,14,18,20,22,24,26]). The take-home message that emerges from this paper is that one of these problems occurs when significant events are rare, a setting for which direct policy search shows an invariance to rarity.
As a final remark, it may be tempting to relate the problem of rare significant events to that of sparse rewards, which has gained a lot of attention recently [15,16,21]. However, they differ fundamentally, as significant events may be rare, but eventually occur. This is not the case with sparse rewards, whose occurrences are conditioned by the policy itself (e.g. a robotic arm must be within reach of a target to trigger a reward) and may never be obtained. We also argue that problems where significant events are rare rather than sparse may be more numerous than expected: a complex environment offers multiple learning opportunities, as long as one is able to seize them as they arise.

A.2 Re-evaluation performance statistical score
We perform two-tailed Mann-Whitney U-tests to compare the distributions of the performance of the PPO-MLP, PPO-DEEP and CMA-ES agents for each probability p of meeting a x+ partner. The median performance of each learning algorithm for each p is reported in Table 1.

A.3 Timing
We measure the execution time (wall time) for both PPO-MLP and CMA-ES on a single CPU. To do so, we take the best agent out of the 24 simulations for each algorithm and each condition, and restart learning from its state for 15 minutes. We then divide the total time taken by the algorithm by the number of episode time steps it completed. CMA-ES's overhead is not constant, as CMA-ES updates become more frequent per episode time step as p increases: the shorter the episodes, the more often CMA-ES updates trigger. PPO-MLP and PPO-DEEP updates are constant with respect to the episode length, as they always update once 4000 episode time steps have been completed. The difference between the PPO-MLP and CMA-ES execution times is explained by the RLlib overhead as well as by the learning computational cost. The PPO-DEEP execution time is greater than the PPO-MLP execution time, which is explained both by the larger network, requiring a more computationally intensive evaluation, and by the gradient descent, which involves many more weights.
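The timing protocol above can be sketched as follows. `run_steps` and the budget are stand-ins for the actual RLlib/CMA-ES learning loop and the 15-minute window used in the paper:

```python
import time

def time_per_env_step(run_steps, budget_s=15 * 60):
    """Wall-clock time per episode time step: run the learner from a saved
    state for a fixed time budget, then divide the elapsed time by the number
    of environment steps completed (learning updates included).
    `run_steps` is a callable that advances the learner by one chunk and
    returns the number of environment steps it performed."""
    start = time.perf_counter()
    total_steps = 0
    while time.perf_counter() - start < budget_s:
        total_steps += run_steps()
    return (time.perf_counter() - start) / max(total_steps, 1)

# Usage sketch with a dummy learner and a tiny budget:
print(time_per_env_step(lambda: 1000, budget_s=0.01))
```

Because learning updates run inside the timed loop, per-step figures obtained this way include the algorithms' update overhead, which is the quantity compared in Figure 4.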
The total time divided by the number of iterations is reported in Figure 4 as well as in Table 4. Speed ratios are shown in Table 3.

Figure 1: Performance of the best policy throughout learning with CMA-ES (top), PPO-DEEP (center) and PPO-MLP (bottom), with 24 independent runs per method, for 200×10³ episodes. 20/24 runs produced a policy with performance above 40 with CMA-ES, 20/24 for PPO-DEEP and 23/24 for PPO-MLP. Note that PPO-DEEP produces 24/24 runs with performance above 40 around episode 80×10³, with performance occasionally degrading and immediately recovering for some runs afterwards due to the learning step size (see Annex for further analysis).

Figure 5: Investment value of the focal agent given by the Investment Module for the best learned policies with CMA-ES (blue), PPO-DEEP (orange) and PPO-MLP (green) algorithms, for each condition. Each violin plot represents the outcome of the 24 best policies for a given algorithm and condition after being re-evaluated for 1000 episodes without learning.

Figure 6: Decision of the focal agent to accept cooperation when facing a cooperative partner with a particular investment value. Results for CMA-ES (blue), PPO-DEEP (orange) and PPO-MLP (green) are shown as violin plots. X-axis: algorithms and conditions; Y-axis: partner's investment value for which the focal agent accepts to cooperate.

Figure 7: Analysis of the Partner Choice module for all conditions (by columns: p ∈ {0.1, 0.2, 0.5, 1.0}) and all algorithms (top: CMA-ES, center: PPO-DEEP, bottom: PPO-MLP). For each setup, only the best policy is shown. Each graph plots the probability that the focal agent following the best policy accepts cooperation (y-axis) depending on its partner's proposed investment (x-axis). Data are computed by presenting each of the 31 possible cooperative partners to the focal agent for 100 iterations, as policies are stochastic. The green vertical line represents the mean investment of the focal agent.

Figure 4: Average time per episode time step (learning included) for CMA-ES, PPO-DEEP and PPO-MLP for the different conditions p.

Table 1: Parameters for the PPO algorithm. Policy gradient algorithms maximize the global performance by updating the parameters θ of the policy π_θ (cf. Eq. 2).

Table 1: Median of the re-evaluations, 24 runs per condition. The U-statistics and p-values of the tests are reported in Table 2.

Table 2: Statistical results of the two-tailed Mann-Whitney U-test comparing the performance of agents using PPO-MLP, PPO-DEEP and CMA-ES in the re-evaluation setup. n = 24 for each condition and algorithm.

Table 3: Speed ratios between algorithms.

Table 4: Computational wall time per environmental time step for CMA-ES, PPO-DEEP and PPO-MLP on a single core. The average time per environmental step includes the time needed by the learning algorithm to update the policy.