Confidence Sharing: An Economic Strategy for Efficient Information Flows in Animal Groups

Social animals may share information to obtain a more complete and accurate picture of their surroundings. However, physical constraints on communication limit the flow of information between interacting individuals in a way that can cause an accumulation of errors and deteriorated collective behaviors. Here, we theoretically study a general model of information sharing within animal groups. We take an algorithmic perspective to identify efficient communication schemes that are, nevertheless, economic in terms of communication, memory and individual internal computation. We present a simple and natural algorithm in which each agent compresses all information it has gathered into a single parameter that represents its confidence in its behavior. Confidence is communicated between agents by means of active signaling. We motivate this model by novel and existing empirical evidence for confidence sharing in animal groups. We rigorously show that this algorithm competes extremely well with the best possible algorithm that operates without any computational constraints. We also show that this algorithm is minimal, in the sense that further reduction in communication may significantly reduce performance. Our proofs rely on the Cramér-Rao bound and on our definition of a Fisher Channel Capacity. We use these concepts to quantify information flows within the group, which are then used to obtain lower bounds on collective performance. The abstract nature of our model makes it rigorously solvable and its conclusions highly general. Indeed, our results suggest confidence sharing as a central notion in the context of animal communication.


Model definitions
We study a simple model for information sharing and dissemination within a population of agents. By executing an algorithm A, at any given time t, each agent a holds an external real variable x_a(t, A) ∈ R (representing, e.g., its physical location) and possesses some internal state y_a(t, A), which we call memory. To simplify the presentation, whenever the algorithm A is clear from the context, we may omit its mention, writing, e.g., x_a(t) instead of x_a(t, A). In addition, we refer to the external state x_a(t) as the location from now on. Initially, the location x_a(0) of each agent a is randomly chosen according to some arbitrary unbiased distribution Φ_a centered around an "environmental" target value θ* ∈ R. We assume an environment that is symmetric with respect to this external state. For example, if x_a signifies the direction of motion, then such symmetry implies rotational invariance, so that all 360° of motion are initially equivalent and only relative angle measurements are meaningful. Informally, the initial location x_a(0) is perceived as an inaccurate sample of the environment. We typically assume that the variance of Φ_a is known to agent a; in more liberal settings we may further assume that full knowledge of the parameterized distribution family {(Φ_a; θ)}_{θ∈R} is available to each agent a. (The family is parameterized by the location of the target θ ∈ R.)
The goal of each agent a is to relocate itself so that at any given time t, its location x_a(t) is as close as possible to θ*. Convergence to the desired value θ* is achieved by both social interactions and environmental cues¹, where in between such events agents are free to adjust their internal state and modify their location by making a move. That is, an agent a adjusts its location x_a(t) by moving, where a move instruction is specified by some distance quantity ∆(x), shifting the agent from its current location x to location x + ∆(x). We view x_a(t) as an estimator of θ*. In particular, we require that at any given time t, the location x_a(t) is an unbiased estimator of θ*.

Communication
We focus on pair-wise interactions, which can be either uni- or bi-directional (our results transfer to interactions with a larger number of agents in a straightforward manner). The information transferred in such interactions may contain passive signals that are possibly accompanied by active signals. Passive information is obtained as agent a measures its relative distance from agent b, that is, d̃_ab(t) = x_b(t) − x_a(t) + η, where the additive noise term η is chosen from some arbitrary distribution N(η) whose variance is known to the agents. Active signals are modeled as messages that expose some part of the internal memory state of agent b to the observing agent a.
At this point we note that, in contrast to the internal state y_a(t), we do not assume that the absolute location x_a(t) is known to agent a. That is, we view the surroundings as "relative", meaning, in particular, that initially, when x_a(0) is sampled from Φ_a, agent a has no information about the value of its own location x_a(0). In particular, the internal state y_a(0) of each agent a contains merely var(Φ_a) and is independent of the actual value x_a(0).

Rounds
Execution proceeds in discrete rounds. In each round (or time step), starting at time 0, each agent may first choose to change (or not) its location by moving, and then, if specified in the meeting pattern P (see next section), it views another specified agent, thus obtaining some information. To summarize, the round structure is the following:
• Perform internal computation;
• Move by some ∆(x);
• View (or not) another agent b (obtaining passive and possibly active information).
For simplicity, all three operations, that is, internal computation, move, and view, are assumed to occur instantaneously, i.e., to take zero time.
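For concreteness, the round structure above can be sketched in code. This is a minimal illustration, not the paper's implementation: the `Agent` class, the `run_rounds` helper, and the Gaussian passive noise are our assumptions, and for simplicity the move here reacts immediately to the view rather than being deferred to the next round.

```python
import random
from dataclasses import dataclass, field

@dataclass
class Agent:
    # External state x_a(t): the agent's location (an estimator of theta*).
    x: float
    # Internal state y_a(t): a dict standing in for arbitrary memory.
    y: dict = field(default_factory=dict)

def run_rounds(agents, pattern, step):
    """Execute the round structure: compute, move, view.

    `pattern[t]` is a list of (viewer, viewed) index pairs for round t.
    `step(viewer, viewed, noisy_distance)` returns the viewer's move.
    """
    for meetings in pattern:
        for i, j in meetings:
            a, b = agents[i], agents[j]
            # Passive signal: noisy relative distance d~ = (x_b - x_a) + eta.
            noisy_d = (b.x - a.x) + random.gauss(0.0, 1.0)
            a.x += step(a, b, noisy_d)  # move by Delta(x)
    return agents
```

Here the `step` callback stands in for the algorithm's internal computation, e.g. `lambda a, b, d: d / 2` moves the viewer halfway toward the viewed agent.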

Meeting patterns
A finite execution is associated with a pattern of meetings P, which specifies the interactions at any round, up to some final round T_P. The system is considered anonymous, that is, we do not assume that the agents are aware of whom they view. In addition, agents cannot even assume whether or not they will interact with other agents in subsequent rounds. In particular, the pattern of meetings P is not known to the agents in advance. We note, however, that for lower bound purposes, we will consider the most liberal version of this model (see algorithm Opt below), assuming that agents have unique identities and know the pattern P in advance. Independent meeting patterns: Given a pattern of meetings P, an agent a and a time t, we recursively define the set of relevant agents of a at time t, denoted by R_a(t, P). Informally, only agents in R_a(t, P) are relevant for the move decision made by a at time t. At time zero, we define R_a(0, P) := {a}, and at time t, if a views some agent b then R_a(t, P) := R_a(t − 1, P) ∪ R_b(t − 1, P), and otherwise R_a(t, P) := R_a(t − 1, P). A meeting pattern P is called independent if whenever some agent a views some other agent b at some time t ≤ T_P, then R_a(t − 1, P) ∩ R_b(t − 1, P) = ∅.
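The relevant-set recursion and the independence condition translate directly into code. A sketch with hypothetical helper names, encoding a pattern as a list of per-round (viewer, viewed) pairs:

```python
def relevant_sets(pattern, n_agents):
    """Compute R_a(t, P) for all agents and times.

    pattern[t] lists the (viewer, viewed) pairs of the meetings held
    between time t and time t + 1. Returns history[t][a] = R_a(t, P).
    """
    R = [{a} for a in range(n_agents)]           # R_a(0, P) = {a}
    history = [[set(s) for s in R]]
    for meetings in pattern:
        new_R = [set(s) for s in R]
        for viewer, viewed in meetings:
            new_R[viewer] |= R[viewed]           # a absorbs b's relevant set
        R = new_R
        history.append([set(s) for s in R])
    return history

def is_independent(pattern, n_agents):
    """P is independent iff viewer and viewed share no relevant agents
    just before each meeting: R_a(t-1) ∩ R_b(t-1) = ∅."""
    hist = relevant_sets(pattern, n_agents)
    for t, meetings in enumerate(pattern):
        for viewer, viewed in meetings:
            if hist[t][viewer] & hist[t][viewed]:
                return False
    return True
```

For example, a binary-tree pattern such as `[[(0, 1)], [(0, 2)]]` is independent, while meeting the same partner twice, `[[(0, 1)], [(0, 1)]]`, is not.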

Random variables
Consider an algorithm A. At any given time t, the location and internal state of each agent are random variables, which we denote by X_a(t, A) and Y_a(t, A), respectively (specific assignments to X_a(t, A) and Y_a(t, A) are denoted by x_a(t, A) and y_a(t, A)). The value of the random variable X_a(t, A) indicating the location of a is affected by: 1. The environmental value θ*; 2. The relative distances of the initial locations of agents b ∈ R_a(t, P) (governed by the distributions Φ_b) with respect to θ*; 3. The noises η in the distance samplings; 4. The moves of agents a and b ∈ R_a(t, P); 5. The instruction rules of algorithm A itself (which, in particular, may be a randomized algorithm and may thus use coin flips). These rules determine, e.g., the previous moves made by agents and the information transferred by communicating active signals.
The memory random variable Y_a(t, A) is affected by Items (2-5) above, but is independent of the value of θ*. In the case where the pattern of meetings is chosen by a random scheduler [74], the random variables X_a(t, A) and Y_a(t, A) are also affected by the coin flips made by the random scheduler.

Specification of Opt
Our reference for evaluating the performance of algorithms is the optimal algorithm, denoted by Opt, which is the best algorithm operating under the most liberal assumptions regarding internal memory resources, internal (individual) computation abilities and communication capacities. That is, when considering algorithm Opt, we assume no restrictions on the communication capacity, the memory capacity or the internal computation abilities of each agent. Being as liberal as possible, we further assume that active communication occurs without any noise². In addition, we assume that each agent a has a unique identity and is aware of the entire pattern of meetings P in advance.
Each agent a starts by holding in its memory the full knowledge of the parameterized distribution family (Φ_a, θ) (which includes, in particular, the variance of Φ_a). Without loss of generality, we assume that all individual decisions of a are encoded in its memory. Specifically, whenever moving by some quantity ∆(x) at some time t, the value of ∆(x) is stored in the memory Y_a(t + 1) of the agent. Note that, in the case where algorithm Opt is probabilistic, the choice of ∆(x) may depend not only on Y_a(t) but also on the results of a sequence r_a(t) of coin flips. In this case, the sequence r_a(t) is written in Y_a(t) as well. Throughout the execution, each agent a stores in its memory Y_a(t) all its meetings' information together with all the previous meeting information stored at the agents it has met, and their previous meeting information, and so on. That is, during an interaction at time t + 1, the viewing agent a stores in its internal variable Y_a(t + 1) not only its internal state before the meeting, namely Y_a(t), and the noisy sample d̃_ab(t) of its distance from the second agent b, but also b's identifier and its full memory content as stored in Y_b(t). (This includes b's initial parameterized distribution family (Φ_b, θ) and all measurements agent b previously took.) This leads to an internal state Y_a(t + 1) composed of a large nested structure which contains all the information that the agent could possibly obtain at that time from the set of relevant agents R_a(t, P). Note that, given the pattern P, this information Y_a(t) subsumes the information the agent would have received at time t under any other algorithm.
The goal of each agent a at time t is to locate itself as close as possible to θ*, that is, to minimize |x_a(t) − θ*|. (Recall that agent a is not aware of the value of x_a(t).) Algorithm Opt must maintain unbiased estimators X_a(t) at any time t, and among all such estimators, it minimizes the mean square error of X_a(t) around θ* for any given time t. Note that algorithm Opt is well-defined, since the moves used for optimizing the variance at time t do not conflict with attempts to optimize the variance at other times. The reason is that the algorithm at each agent a encodes the previous moves of agent a in a's memory Y_a, and thus these moves can later be subtracted out by any other agent that takes agent a into account. Since the noise in the distance samples is additive, this means that, without loss of generality, at least where algorithm Opt is concerned, we can assume that the moves of agents do not affect later decisions made by the same or other agents.

Specification of Opt in an independent meeting pattern
First, please note that none of the proofs in this manuscript relies on the identification of an optimal algorithm. In practice, the performance of any algorithm is evaluated by comparing it to the Cramér-Rao bound. For completeness, we specify the optimal algorithm Opt for independent meeting patterns.
Under an independent meeting pattern, algorithm Opt can be described explicitly. At any time, each agent keeps a pdf in its memory (centered at zero), describing the distribution of the relative distance between the target value θ* and the current location of the agent. At t = 0, since the initial distribution Φ_a is unbiased around θ*, its mirror image, namely Φ_a(−x), centered at zero, is the pdf describing the relative distance of θ* from the agent's location. Note that before any interaction occurs, an agent's optimal strategy is not to move from its initial location, since any move would either make its location a biased estimator of θ* or increase its variance around θ*.
Assuming a prior distribution regarding the target location that is much wider than the uncertainty of any agent, one can use Bayesian considerations to calculate the update rule for interactions between independent agents. In this case, when one agent views another, it updates its pdf by point-wise multiplication of its current pdf with the noise-convoluted and translated (by the distance measurement) pdf of the viewed agent. The agent then relocates itself to the mean of the new pdf. More specifically, let f_{a,t}(x) (respectively, f_{b,t}(x)) denote the pdf of agent a (respectively, b) at time t. Consider a third agent c that has no knowledge regarding its distance from θ*, which now receives the distribution f_{b,t}(x) and a noisy measurement d̃_cb = d_cb + η of its distance d_cb from agent b. Given this new knowledge, the distribution of the relative distance from c to θ* is described by the following pdf:

h_{c,b}(x) = (f_{b,t} ⊗ N)(x − d̃_cb).

Now, assume that at time t, instead of agent c, it is agent a that receives the distribution f_{b,t}(x) and a noisy measurement d̃_ab = d_ab + η. Since the meeting pattern is independent, the previous knowledge of a regarding its distance from θ* at time t (described by f_{a,t}(x)) is independent of the new information it now received from b. Hence, the distribution of θ* at time t + 1 is proportional to

g_{a,b}(x) = f_{a,t}(x) · (f_{b,t} ⊗ N)(x − d̃_ab).

The normalized version of g_{a,b}(x), termed ḡ_{a,b}(x), therefore describes the distribution of the distance from a's location to θ* at time t, given the knowledge obtained by a after viewing agent b. To make its location an unbiased estimator of θ*, the agent then moves to the mean of this distribution. That is, we have

∆(x)_{a,t+1} = ∫ x ḡ_{a,b}(x) dx.

Finally, the new pdf of a at time t + 1, namely f_{a,t+1}(x), is ḡ_{a,b}(x) shifted by the last move of the agent, specifically,

f_{a,t+1}(x) = ḡ_{a,b}(x + ∆(x)_{a,t+1}).

Summarizing the aforementioned discussion, we have the following description of algorithm Opt.
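The Opt update can be illustrated numerically on a discretized grid. The sketch below is ours, not the paper's implementation: the helper names are hypothetical, and Gaussian pdfs are assumed only for concreteness (for Gaussians the posterior precision is the sum of the two precisions, which makes the result easy to check).

```python
import numpy as np

# Discretized relative-coordinate grid for the pdfs f_{a,t}(x).
x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def opt_update(f_a, f_b, d_meas, noise_pdf):
    """One Opt interaction: a views b at noisy distance d_meas.
    Returns (move, new pdf of a)."""
    # Noise-convolve b's pdf, then translate it by the measured distance.
    fb_noisy = np.convolve(f_b, noise_pdf, mode="same") * dx
    shifted = np.interp(x - d_meas, x, fb_noisy, left=0.0, right=0.0)
    # Point-wise product with a's current pdf, then normalize.
    g = f_a * shifted
    g /= g.sum() * dx
    move = np.sum(x * g) * dx                                 # mean of g
    f_new = np.interp(x + move, x, g, left=0.0, right=0.0)    # re-center after moving
    return move, f_new
```

For two unit-variance Gaussian agents and unit-variance Gaussian noise, viewing at measured distance 3 should move the viewer to the precision-weighted mean (by 1.0) and leave a zero-centered pdf of variance 2/3.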

Algorithm Opt
In general, even under independent conditions, the pdfs maintained by algorithm Opt will be complex: to begin with, the initial distributions Φ_a may already be complex in terms of their description. Further, as a result of an interaction, calculating the new pdf, as well as its new mean, involves complex operations such as integration and convolution. Note that with time the pdfs become increasingly intricate.

Fisher information and the Cramér-Rao bound
We consider a multi-variable probability density function (pdf) family {(Φ; θ)}_{θ∈R}, where θ is a translation parameter [45]. Let z̄ = {z_1, z_2, ..., z_k} be a vector of random variables, distributed according to (Φ; θ). The Fisher information of {(Φ; θ)}_{θ∈R} with respect to θ is defined as:

J_Φ(θ) := E[ (∂ log Φ(z̄; θ)/∂θ)² ],

where the expectation is taken with respect to (Φ; θ). Note that since θ is a translation parameter, the Fisher information is both unique (there is no freedom in choosing the parametrization) and independent of θ [45].
The Cramér-Rao inequality bounds from below the variance of the best possible estimator of θ* which is based on the random vector sample z̄ taken from (Φ; θ*), by the reciprocal of the corresponding Fisher information.
Theorem 3.1. Let X̂ be any unbiased estimator of θ* which is based on a multi-variable sample z̄ taken from (Φ; θ*). Then

var(X̂ − θ*) ≥ 1/J_Φ(θ*).
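As a quick numerical illustration of Theorem 3.1, the sample mean of k Gaussian observations is unbiased and attains the bound 1/J = σ²/k. A Monte-Carlo sketch (the constants are arbitrary):

```python
import numpy as np

# Monte-Carlo check of the Cramer-Rao bound for a Gaussian translation
# family: with k i.i.d. samples of variance sigma^2, the Fisher information
# is J = k / sigma^2, and the sample mean attains the bound var = 1/J.
rng = np.random.default_rng(0)
theta_star, sigma, k, trials = 5.0, 2.0, 10, 200_000

z = rng.normal(theta_star, sigma, size=(trials, k))  # samples around theta*
estimates = z.mean(axis=1)                           # unbiased estimator of theta*

cr_bound = sigma**2 / k                              # 1 / Fisher information = 0.4
emp_var = estimates.var()                            # empirically close to 0.4
```

Any other unbiased estimator built from the same samples (e.g. the median) would show an empirical variance above `cr_bound`.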

Initial Fisher-deviation
To define the initial Fisher-deviation parameter ∆_0, we first define the Fisher-deviation of a single-variable distribution Φ centered at θ* as

∆(Φ) := var(Φ) · J_Φ(θ*).

(Note that ∆(Φ) ≥ 1 for any unbiased distribution Φ, by the Cramér-Rao bound.) Similarly, the Fisher-deviation ∆(N) of the noise distribution N(η) is defined as:

∆(N) := var(N) · J_η.

The initial Fisher-deviation ∆_0 is the supremum of the Fisher-deviations over all the (unbiased) distributions involved, namely, the Φ_a distributions governing the initial locations and the noise distribution N(η). Specifically, let

∆_0 := max{ sup_a ∆(Φ_a), ∆(N) }.

Observe that if the distributions Φ_a and N(η) are all Gaussians, then ∆_0 = 1.
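The Fisher-deviation is easy to evaluate numerically for concrete distributions. The sketch below (our helper, computed on a grid) recovers ∆ = 1 for a Gaussian and, for instance, ∆ = 2 for a Laplace distribution (variance 2b², Fisher information 1/b²):

```python
import numpy as np

def fisher_deviation(pdf, x):
    """Delta(Phi) = var(Phi) * J_Phi for a translation family Phi(x - theta),
    evaluated numerically on a grid x with values pdf (not necessarily
    normalized)."""
    dx = x[1] - x[0]
    f = pdf / (pdf.sum() * dx)                        # normalize on the grid
    mean = np.sum(x * f) * dx
    var = np.sum((x - mean) ** 2 * f) * dx
    df = np.gradient(f, dx)
    J = np.sum(df**2 / np.maximum(f, 1e-300)) * dx    # E[(d log f / dx)^2]
    return var * J

x = np.linspace(-30, 30, 60001)
gauss = np.exp(-0.5 * x**2)       # Delta = 1 (Cramer-Rao is tight)
laplace = np.exp(-np.abs(x))      # Delta = 2
```

The helper only applies to smooth-enough densities; the Laplace kink contributes a negligible grid-level error here.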

Relative Fisher Information
Fix any pattern of meetings P and any algorithm A. In addition to the noise N(η) on the passive communication, we now allow any level of noise on the active communication (although we do not specify this noise explicitly). We start off by defining the notion of relative Fisher information associated with an agent a at a time t operating under algorithm A. This definition will be used to bound from below the variance of the location X_a(t, A) of agent a at time t around the environmental value θ*.
Recall, we consider a group of agents that apply algorithm A in order to maintain, for each time t, an unbiased estimator X a (t, A) of the environmental state θ * . Note that once the location of the environment is chosen to be some θ * , all agents' initial locations are chosen with respect to θ * . Hence, informally, since agents have no knowledge regarding their actual location and only obtain relative distance samples, from their perspective, all scenarios are identical, regardless of the actual location of the target.
Each agent stores in its memory a multi-valued random variable Y_a(t, A), using it to perform its move ∆(x)_{a,t+1,A} at the beginning of round t + 1 (see Subsection 1.1). This yields its new location at time t + 1, that is,

X_a(t + 1, A) = X_a(t, A) + ∆(x)_{a,t+1,A}.

The random variable Y_a(t, A) is distributed according to some distribution f_y that depends on:
• The pattern of meetings P, the agent a and the time t;
• The rules of algorithm A (determining moves and communication).
It is important to note that f_y is independent of θ*, and therefore one cannot construct an estimator for θ* by relying on f_y alone. On the other hand, f_y contains information regarding the relative distance between the agent and θ*.
Consider now an (imaginary) external observer a* that at any given time needs to estimate the actual target value θ*. The external observer a* at the beginning of round t + 1 receives the pair

z_{a*}(t + 1, A) = (y_a(t + 1, A), x_a(t)),

where the sample y_a(t + 1, A) is the memory of a at the beginning of round t + 1, and x_a(t) is the actual location of a at time t. Note that the random variables Y_a(t + 1, A) and X_a(t) may be dependent. In contrast to y_a(t + 1, A), the value of x_a(t) depends on θ*. Hence, z_{a*}(t + 1, A) is distributed according to a pdf family {(g_{a*}(t), θ)} parameterized by θ; we assume some smoothness conditions on A and integrability properties of the noise N(η), the noise in the active communication, and the initial distributions, so that the Fisher information of this family is well-defined. Based on z_{a*}(t + 1, A), the external observer needs to output an unbiased estimation X̂_{a*}(t + 1, A) of θ*, that is, we must have:

E[X̂_{a*}(t + 1, A)] = θ*,

where the mean is taken with respect to the distribution of the random multivariable z_{a*}(t + 1, A) and, in case a* is probabilistic, also the coin tosses of a*, used for deciding its output. For example, one possible estimator would be x_a(t, A) + ∆(x)_{a,t+1,A}, where ∆(x)_{a,t+1,A} is the move of a at the beginning of round t + 1 under algorithm A, assuming a had y_a(t + 1, A) in its memory. In this case X̂_{a*}(t + 1, A) is simply X_a(t + 1, A), and hence unbiased.

Relative Fisher information:
We define the relative Fisher information of agent a at the beginning of round t as J a (t, A) := J g a * (t) (θ).
By the Cramér-Rao bound, the variance var(X̂_{a*}(t, A) − θ*) of any unbiased estimator X̂_{a*}(t, A) of θ* used by the external observer a* at time t is bounded from below by the reciprocal of the relative Fisher information of agent a at that time. That is, we have:

var(X̂_{a*}(t, A) − θ*) ≥ 1/J_a(t, A).

Since X_a(t, A) is one possible such estimator, we obtain the following.
Lower bounds on the variance of the location of an agent around θ * can therefore be obtained by bounding from above the corresponding relative Fisher information.
For simplicity of notation, from here onwards, we refer to the relative Fisher information of agent a at time t simply as the Fisher information of agent a at time t.

Proof for upper bounds
Throughout this section, we fix an algorithm A used by the agents and an independent pattern of meetings P. In addition to the noise N(η) on the passive communication, we allow any level of noise on the active communication. We assume smoothness properties of A and integrability properties of the noises and the initial distributions {Φ_b}_b, so that the corresponding Fisher information is defined at all agents and times.
The goal of this section is to prove the following theorem.
Theorem 4.1. For any time t < T_P, any algorithm A, and any independent meeting pattern in which agent a views agent b at time t, the Fisher information of agent a satisfies:

J_a(t + 1, A) ≤ J_a(t, A) + 1/(1/J_b(t, A) + 1/J_η).

Proof. Our goal is to bound the Fisher information in the family {(g_{a*}(t), θ)} parameterized by θ.
(Recall, (g a * (t), θ) is the density function corresponding to the pair Z a * (t) = (Y a (t), X a (t)), assuming the location of the environmental value is θ).
Assume that at time t, agent a views agent b. Let us take a closer look at the multivariate random variable Y_a(t + 1) corresponding to the memory of a at the beginning of round t + 1. Note that the algorithm A may choose to erase or change some of the information the agent obtained. In addition, the algorithm at b may choose to transmit only part of the memory y_b(t) at b, and this transmission of the active information may further incur noise. Hence, overall, the memory Y_a(t + 1) is a function of:
• The random variable D̃_ab(t) := X_b(t) − X_a(t) + N (corresponding to the noisy distance measurement observed by agent a when viewing agent b);
• The memory Y_a(t);
• The memory Y_b(t).
We now aim at calculating the Fisher information with respect to the parameter θ available to the outside observer a* at time t + 1. This is the Fisher information with respect to θ in the pair Z_{a*}(t + 1) = (Y_a(t + 1), X_a(t)). Since Y_a(t + 1) is a function of D̃_ab(t), Y_a(t), and Y_b(t), the Fisher information in Z_{a*}(t + 1) is bounded from above by the Fisher information in the collection of random multi-variables (X_a(t), D̃_ab(t), Y_a(t), Y_b(t)); see [44]. This latter Fisher information is the same as the Fisher information in the random variables (X_a(t), X_b(t) + N, Y_a(t), Y_b(t)), since given X_a(t), each of D̃_ab(t) and X_b(t) + N can be recovered from the other. Our goal is therefore to bound from above the Fisher information J in the random multivariable (X_a(t), X_b(t) + N, Y_a(t), Y_b(t)). Since the meeting pattern is independent, given the location of the environmental value θ, the random multivariable (X_a(t), Y_a(t)) is independent of the random multivariable (X_b(t), Y_b(t)). The Fisher information J_a(t + 1, A) available to agent a* at time t + 1 with respect to θ is therefore bounded from above by the Fisher information J_a in the random multivariable (X_a(t), Y_a(t)) with respect to θ plus the Fisher information J̃_b in the random multivariable (X_b(t) + N, Y_b(t)) with respect to θ:

J_a(t + 1, A) ≤ J_a + J̃_b.    (1)

The term J_a is precisely J_a(t, A), namely, the Fisher information available to agent a* at time t. Let us now focus on the rightmost term in Equation 1 and calculate J̃_b, namely, the Fisher information in the random multivariable (X_b(t) + N, Y_b(t)). Let G(x_1, ȳ | θ) denote the density of the pair (X_b(t), Y_b(t)) given θ, where x_1 is a real-valued random variable. The density of (X_b(t) + N, Y_b(t)) is then

(G ⊗ N)(x_1, ȳ | θ) = ∫ G(x_1 − η, ȳ | θ) N(η) dη.    (2)

Observe that the right-hand side of Equation 2 is a convolution of G with N, where the convolution occurs with respect to the first random variable in G, namely, x_1. The Fisher information inequality [44, 45] bounds the Fisher information of a convolution, but it applies to single-variable distributions. Essentially, the theorem says that if x, y and θ are real values, K(x − θ), R(x − θ) and Q(x − θ) are parameterized families and K = R ⊗ Q, then J(K) ≤ 1/(1/J(R) + 1/J(Q)).
We would like to have such an inequality for our convolution of G and N, but we cannot directly apply it to our case, since we have a function G with multiple random variables, only one of which, x_1, is being convoluted. We rely on the fact that the random variable y_b(t) does not depend on the location of θ. This fact turns out to be sufficient to also overcome the potential complication arising from the fact that, given θ = µ, the random variable x_1 may depend on the random multi-variable y_b(t). In Subsection 4.2 we prove Lemma 4.2, which extends the Fisher information inequality to our multi-variable (possibly dependent) convolution case, enabling us to prove the following inequality:

1/J̃_b ≥ 1/J_b(t, A) + 1/J_η,

where J_η is the Fisher information in the parameterized family N(η − θ) with respect to θ. In other words, we obtain:

J̃_b ≤ 1/(1/J_b(t, A) + 1/J_η).    (3)

Together with Equation 1, we obtain the recursive inequality for the Fisher information as required by Theorem 4.1. This completes the proof of the theorem.
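The recursion of Theorem 4.1 can be iterated numerically over any meeting pattern to obtain per-agent upper bounds on the Fisher information. A small sketch (the function name and the pattern encoding are ours, not the paper's):

```python
def fisher_upper_bounds(J0, pattern, J_eta):
    """Iterate the recursion of Theorem 4.1 over a meeting pattern.

    When a views b:  J_a <- J_a + 1 / (1/J_b + 1/J_eta);
    all other agents keep their current bound. `pattern[t]` lists the
    (viewer, viewed) pairs of round t. Returns the bound at every time.
    """
    J = list(J0)
    trace = [list(J)]
    for meetings in pattern:
        new_J = list(J)
        for viewer, viewed in meetings:
            new_J[viewer] = J[viewer] + 1.0 / (1.0 / J[viewed] + 1.0 / J_eta)
        J = new_J
        trace.append(list(J))
    return trace
```

For instance, with all initial Fisher informations and J_eta equal to 1, a single meeting raises the viewer's bound from 1 to 1.5, and a binary-tree pattern over four agents raises the root's bound to 2.1.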

Extending the Fisher inequality
The Fisher information inequality [44] applies to three one-variable distribution families r(z), p_1(x_1), and p_2(x_2), parameterized by µ, such that r is a convolution of p_1 and p_2, that is, r(z) = ∫ p_1(z − t) · p_2(t) dt. The theorem gives an upper bound on the Fisher information J^µ_r of the family r(z − µ) (with respect to µ) based on the Fisher informations J^µ_{p_1} and J^µ_{p_2} of the families p_1(x_1 − µ) and p_2(x_2 − µ), respectively. Specifically, the theorem states that:

(α_1 + α_2)² J^µ_r ≤ α_1² J^µ_{p_1} + α_2² J^µ_{p_2},

for any two real numbers α_1 and α_2. This in particular implies that 1/J^µ_r ≥ 1/J^µ_{p_1} + 1/J^µ_{p_2}. The following lemma extends the Fisher information inequality to the case where the distributions p_1 and r are composed of multiple, not necessarily independent, variables, and the convolution with p_2 takes place over one of the variables of p_1.
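As a numerical sanity check of the single-variable inequality, one can verify on a grid that 1/J_r ≥ 1/J_{p_1} + 1/J_{p_2}, with equality for Gaussians (the convolution of N(0, 1) and N(0, 4) is N(0, 5), whose Fisher information is 1/5). A sketch, assuming grid-based numerical differentiation is accurate enough:

```python
import numpy as np

def fisher_info(f, dx):
    """Fisher information of a translation family, from grid values f."""
    f = f / (f.sum() * dx)                           # normalize
    df = np.gradient(f, dx)
    return np.sum(df**2 / np.maximum(f, 1e-300)) * dx

x = np.linspace(-20, 20, 4001)
dx = x[1] - x[0]
p1 = np.exp(-0.5 * (x / 1.0) ** 2)        # N(0, 1): J = 1
p2 = np.exp(-0.5 * (x / 2.0) ** 2)        # N(0, 4): J = 1/4
r = np.convolve(p1, p2, mode="same")      # N(0, 5): J = 1/5

J1, J2, Jr = (fisher_info(p, dx) for p in (p1, p2, r))
```

For non-Gaussian inputs the inequality is strict; here the reciprocals should satisfy 1/Jr ≈ 1/J1 + 1/J2 = 5.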
Lemma 4.2. Let r(z, x̄_3 | θ), p_1(x_1, x̄_3 | θ), and p_2(x_2 | θ) be three pdfs such that z, x_1 and x_2 are real variables, x̄_3 is a vector of multiple real-valued variables, and

r(z, x̄_3 | θ) = ∫ p_1(z − x_2, x̄_3 | θ) p_2(x_2 | θ) dx_2.

Then

1/J^θ_r ≥ 1/J^θ_{p_1} + 1/J^θ_{p_2},

where J^θ_f(·) is the Fisher information in the parameterized distribution family f with respect to the parameter θ.
Proof. We start by using the definition of r as a convolution over p_2 and the first variable of p_1:

r(z, x̄_3 | θ) = ∫ p_1(z − x_2, x̄_3 | θ) p_2(x_2 | θ) dx_2.

We can insert the density function p(x̄_3) to rewrite the right-hand side as:

r(z | x̄_3, θ) p(x̄_3) = ∫ p_1(z − x_2 | x̄_3, θ) p(x̄_3) p_2(x_2 | θ) dx_2,

implying that:

r(z | x̄_3, θ) = ∫ p_1(z − x_2 | x̄_3, θ) p_2(x_2 | θ) dx_2.

We now define the distributions R(z) = r(z − θ | x̄_3) and P_1(t) = p_1(t − θ | x̄_3), so that the previous equation becomes

R = P_1 ⊗ p_2,

to which we apply the original inequality as proved by Stam [44], and deduce that for any two real numbers α_1 and α_2, we have

(α_1 + α_2)² J^θ_R ≤ α_1² J^θ_{P_1} + α_2² J^θ_{p_2}.    (4)

We now multiply both sides of the equation by p(x̄_3) and integrate over x̄_3. Plugging in the definitions of the Fisher information and of R(z), the integral on the left-hand side becomes:

∫∫ (∂_θ r(z̃ | x̄_3, θ))² / r(z̃ | x̄_3, θ) · p(x̄_3) dz̃ dx̄_3 = J^θ_{r(z−θ, x̄_3)},

where we used z̃ = z − µ and the fact that x̄_3 is independent of θ.
Similarly, the integral over the first term on the right-hand side of Equation 4 gives J^θ_{p_1(x_1−θ, x̄_3)}. The last term is unchanged, by the normalization of the distribution of x̄_3.

An almost optimal compressed algorithm
In this section, we prove that under any given independent pattern of meetings, the performance of algorithm Conf, as defined in the main text, is near optimal.

Mean and variance of estimators in algorithm Conf
Our first goal is to prove that for any agent a and at any given time t, (1) algorithm Conf preserves an unbiased estimator X_a(t) := X_a(t, Conf), and (2) the confidence c_a(t) at a equals the reciprocal of the current variance of the location of a with respect to θ*.
Observe first that if the location of a before the interaction is x_a(t), then after the interaction, its location is

x_a(t + 1) = x_a(t) + (c̃_b / (c_a(t) + c̃_b)) · d̃_ab(t),   where c̃_b := 1/(1/c_b(t) + 1/c_N).    (5)

Recall that the distributions Φ_a are centered around the environmental value θ* and that the noise N(η) is centered around zero. Moreover, observe that at time t, the confidence at each agent a, namely c_a(t), is deterministically defined (once we fix the pattern of meetings). Equation 5 therefore implies the following, by induction.
Claim 5.1. At any time t and for any agent a, the location x a (t) serves as an unbiased estimator of θ * .
We are now ready to analyze the variance of the estimator x a (t) around θ * .

Lemma 5.2. Consider algorithm Conf. At any time t ≤ T_P and for any agent a, we have

c_a(t) = 1/var(x_a(t) − θ*).
Proof. The lemma holds for time t = 0 by the definition of c_a(0). Assume by induction that for any agent a at time t it holds that c_a(t) = 1/var(x_a(t) − θ*), and consider time t + 1. We consider an interaction at time t in which agent a views agent b. Writing c̃_b = 1/(1/c_b(t) + 1/c_N) and d̃_ab(t) = x_b(t) − x_a(t) + η, the variance of the new location of a is:

var(x_a(t + 1) − θ*) = var( (c_a(t)(x_a(t) − θ*) + c̃_b(x_b(t) + η − θ*)) / (c_a(t) + c̃_b) )
= (c_a(t)² var(x_a(t) − θ*) + c̃_b² (var(x_b(t) − θ*) + v_η)) / (c_a(t) + c̃_b)²
= (c_a(t) + c̃_b² (1/c_b(t) + 1/c_N)) / (c_a(t) + c̃_b)²
= (c_a(t) + c̃_b) / (c_a(t) + c̃_b)² = 1/(c_a(t) + c̃_b) = 1/c_a(t + 1),

where we used the independence of x_a(t), x_b(t) and η (since the meeting pattern is independent) and v_η = 1/c_N. This proves the induction step.
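Lemma 5.2 can be checked by direct simulation of Conf's update rule on an independent (tree) meeting pattern. The sketch below is our reading of the update in Equation 5, with Gaussian initial distributions and noise assumed for the test:

```python
import numpy as np

rng = np.random.default_rng(42)
theta, v0, v_eta, trials = 0.0, 4.0, 1.0, 200_000
c_N = 1.0 / v_eta                        # "confidence" of the noise

# Four agents, independent (binary-tree) meeting pattern:
# round 1: 0 views 1 and 2 views 3;  round 2: 0 views 2.
pattern = [[(0, 1), (2, 3)], [(0, 2)]]

x = rng.normal(theta, np.sqrt(v0), size=(trials, 4))   # initial locations
c = np.full(4, 1.0 / v0)                               # initial confidences

for meetings in pattern:
    c_new = c.copy()
    for a, b in meetings:
        d = x[:, b] - x[:, a] + rng.normal(0, np.sqrt(v_eta), trials)
        c_eff = 1.0 / (1.0 / c[b] + 1.0 / c_N)   # noise-discounted confidence of b
        x[:, a] += (c_eff / (c[a] + c_eff)) * d  # confidence-weighted step
        c_new[a] = c[a] + c_eff                  # Conf's confidence update
    c = c_new
```

After the two rounds, the empirical variance of agent 0's location around θ* should match 1/c_0, and its mean should remain θ* (unbiasedness).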

Competitive analysis -definition
To evaluate the performance of a given algorithm, we compare it to that of the best possible algorithm, namely, algorithm Opt. Specifically, for a given time t, the variance of an agent a under an algorithm A is defined as var(a, t, A) := var(X_a(t, A) − θ*). The competitiveness of algorithm A at time t is

Ψ(t, A) := max{ var(a, t, A) / var(a, t, Opt) | a is an agent }.
Finally, the competitiveness of algorithm A is defined as

Ψ(A) := max_{0 ≤ t ≤ T_P} Ψ(t, A).

The competitiveness of Conf
To evaluate the performance of algorithm Conf, we compare it to that of algorithm Opt (see Subsections 2 and 5.2.1), for a given (fixed³) independent pattern of meetings P. Specifically, our next goal is to prove that under algorithm Conf, for any agent a and any time t, the location of a at time t is an unbiased estimator of θ*. Furthermore, we shall show that the variance of a's location around θ* is at most a multiplicative factor of the initial Fisher-deviation ∆_0 over the corresponding variance at that time under algorithm Opt (see Subsection 3.1.2 for the definition of ∆_0). That is, we aim to prove that for any agent a and time t ≤ T_P the following equation holds:

var(a, t, Conf) ≤ ∆_0 · var(a, t, Opt).    (6)
In other words, the competitiveness of algorithm Conf at any time t is at most the initial Fisherdeviation ∆ 0 .
Lemma 3.3 gives a bound on the performance of an agent a operating under algorithm Opt, with respect to the Fisher information at the agent. Specifically, we have:

var(a, t, Opt) ≥ 1/J_a(t, Opt).
Initially, the Fisher information J_a(0, Opt) at an agent a equals the Fisher information in the parameterized family (Φ_a, θ) with respect to θ. For a given independent pattern of meetings P, Theorem 4.1 gives a recursive rule for calculating the Fisher information J_a(t, Opt); that is, whenever a views b at time t, we have:

J_a(t + 1, Opt) ≤ J_a(t, Opt) + 1/(1/J_b(t, Opt) + 1/J_η).

In the next section, we use this inequality to show that, in fact, under algorithm Conf, the variance of agent a at time t is bounded from above as follows:

var(a, t, Conf) ≤ ∆_0 / J_a(t, Opt).

Proving this equation would establish Equation 6.
We are now ready to analyze the competitiveness of algorithm Conf. We first connect the variance of an agent under Conf to its Fisher information under algorithm Opt. Recall that J_a(t, Opt) stands for the Fisher information of agent a at time t, assuming agents execute the optimal algorithm Opt. The following lemma states that the confidence at an agent in algorithm Conf equals, up to a multiplicative factor of ∆_0, the Fisher information of the corresponding imaginary agent at the same time in algorithm Opt.

Lemma 5.3. At every time t ≤ T_P, we have c_a(t) ≥ J_a(t, Opt)/∆_0.

Proof. The lemma holds initially by the definition of ∆_0. Assume by induction that the lemma holds at time t, and consider an interaction at time t in which agent a views agent b. By the induction hypothesis and the fact that c_a(t + 1) = c_a(t) + 1/(1/c_b(t) + 1/c_N), we obtain:

c_a(t + 1) ≥ J_a(t, Opt)/∆_0 + 1/(∆_0/J_b(t, Opt) + 1/c_N),

implying that:

c_a(t + 1) ≥ (1/∆_0) · ( J_a(t, Opt) + 1/(1/J_b(t, Opt) + 1/(∆_0 · c_N)) ).

By definition of ∆_0, we have ∆_0 ≥ J_η/c_N, that is, 1/(∆_0 · c_N) ≤ 1/J_η. Hence:

c_a(t + 1) ≥ (1/∆_0) · ( J_a(t, Opt) + 1/(1/J_b(t, Opt) + 1/J_η) ) ≥ J_a(t + 1, Opt)/∆_0,

where the second inequality holds by Theorem 4.1. This completes the proof of the lemma.
Lemmas 5.2 and 5.3 together with Lemma 3.3 imply that

var(a, t, Conf) = 1/c_a(t) ≤ ∆_0/J_a(t, Opt) ≤ ∆_0 · var(a, t, Opt),

establishing Equation 6. Hence, we obtain the following bound.

On the performances of Conf at large times
We now consider the case where the number of interactions per agent is arbitrarily large, and show that for large times, the variance of Conf becomes arbitrarily close to zero, and that the performance of Conf becomes closer and closer to that of Opt.
We consider a noise N(η) whose variance is v_η := var(N(η)). Let C_min denote the minimum initial confidence of an agent in a given population. For fixed values of v_η and C_min, we consider larger and larger populations (n goes to infinity) and, correspondingly, independent patterns of meetings with larger and larger depth M_n.
For a given population, let C_min(t) denote the minimum confidence of an agent at time t (in particular, C_min(0) = C_min).
Proof. When agent a views agent b, the gain in confidence for agent a is 1/(1/c_b(t) + 1/c_N). It follows that C_min(t) increases in a single round either by a multiplicative factor of 3/2 or by an additive term of c_N/2. This implies that the confidences go to infinity as time goes to infinity.
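The divergence of the minimal confidence can be illustrated by simulating only the confidence recursion. The sketch below uses our own illustrative parameters and a uniform random pattern of meeting, neither of which is specified in the claim.

```python
import random

def min_confidence_after(rounds, n=100, c0=1.0, c_n=5.0, seed=0):
    """Iterate only the confidence recursion of Conf on n agents: each
    round every agent views a uniformly random other agent and gains
    1/(1/c_b + 1/c_n). Returns the minimum confidence in the population.
    (Parameter values are illustrative, not from the paper.)"""
    rng = random.Random(seed)
    conf = [c0] * n
    for _ in range(rounds):
        # Compute all gains from the confidences at the start of the round.
        gains = []
        for a in range(n):
            b = rng.randrange(n - 1)
            if b >= a:          # pick a uniformly random agent other than a
                b += 1
            gains.append(1.0 / (1.0 / conf[b] + 1.0 / c_n))
        conf = [c + g for c, g in zip(conf, gains)]
    return min(conf)
```

Since every gain is positive and bounded below once confidences are comparable to c_n, the minimum confidence grows without bound, as the claim asserts.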
For a given population, let var max (t) denote the maximal variance of an agent at time t under algorithm Conf. Claim 5.5 together with Claim 5.2 immediately imply the following.
Clearly, since algorithm Opt is superior to algorithm Conf, the same limit property of the variance applies to algorithm Opt as well. We now claim that, in fact, if the noise N(η) is Gaussian, then the variances in Conf and Opt go to zero at roughly the same speed.
Proof. Since the noise is Gaussian, we have v_η = 1/J_η. (Recall that J_η is the Fisher information in the noise, that is, J_η := J_(N−θ, θ)(θ).) Note now that as the confidences become larger and larger, the gain in confidence becomes very close to J_η. Specifically, we have c_a(t+1) = c_a(t) + 1/(1/c_b(t) + 1/J_η).
Consider now the case that c_b(t) > x·J_η for some large x. Here, the increase in confidence at a is some quantity ∆J_Conf(t) satisfying

J_η/(1 + 1/x) < ∆J_Conf(t) ≤ J_η.

The Cramér-Rao bound and Lemma 5.2 imply that J_b(t, Opt) ≥ c_b(t), and hence J_b(t, Opt) > x·J_η. This, together with Theorem 4.1, implies that at time t, the increase ∆J_Opt(t) in the Fisher information of a under algorithm Opt is some quantity satisfying

J_η/(1 + 1/x) < ∆J_Opt(t) ≤ J_η.

Hence, ∆J_Conf(t) and ∆J_Opt(t) are the same quantity up to a multiplicative factor of 1 + 1/x. Since, by Claim 5.5, x goes to infinity as M_n goes to infinity, we get that ∆J_Conf(t)/∆J_Opt(t) → 1. This completes the proof of Lemma 5.7.
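The sandwich argument above can be checked numerically. The snippet below is our illustrative check (numbers are ours): when the viewed confidence exceeds the threshold x·J_η, the per-interaction gain is squeezed between J_η/(1 + 1/x) and the capacity J_η.

```python
def gain(j_or_c, j_eta):
    """Per-interaction gain: harmonic combination with the noise term,
    used both for confidence (Conf) and Fisher information (Opt)."""
    return 1.0 / (1.0 / j_or_c + 1.0 / j_eta)

j_eta = 2.0
x = 100.0
c_b = 2.0 * x * j_eta        # viewed confidence, above the threshold x * j_eta
dJ_conf = gain(c_b, j_eta)

# The gain lies strictly inside the sandwich (j_eta/(1+1/x), j_eta].
assert dJ_conf > j_eta / (1.0 + 1.0 / x)
assert dJ_conf <= j_eta
print(dJ_conf / j_eta)       # approaches 1 as x grows
```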

The Fisher Channel Capacity and convergence times

The Fisher Channel Capacity
Given the rule stated in Theorem 4.1 (and the given interaction pattern P), one could now recursively upper bound the Fisher information J a (t, A). It is this information which, according to the Cramér-Rao bound, sets a lower bound on the variance of X a (t) around θ * (see Lemma 3.3). Theorem 4.1 directly implies the following.
The corollary above sets a bound of J_η on the increase in Fisher information per round. This bound holds with respect to any level of noise in active communication, and in particular when active communication is noiseless. Note, moreover, that this bound of J_η on the information increase holds with respect to any algorithm A. In analogy to the channel capacity defined by Shannon [67], we term this upper bound the Fisher Channel Capacity.

Bounds on the convergence time
The restriction on information flow given by the Fisher Channel Capacity can be translated into lower bounds on the convergence time of an algorithm A, i.e., the time it takes the population of agents to enter a certain tight window around θ*. More formally, recall that the variance of an agent a is defined as var_a(t, A) = mean((x_a(t, A) − θ*)²), where the mean is taken over all possible random initial locations and communication noises, as well as, possibly, over all random choices made by the agents themselves. The variance of the population is defined as the average of these variances, that is, var(t, A) = mean(var_a(t, A)), where the mean is taken over all agents a.
Given a small ε > 0, the convergence time T(ε) is defined as the first time until the variance of the population around θ* is less than ε². (Note that T(ε) is a random variable.) Convergence thus requires that the estimator applied by the typical agent has variance on the order of magnitude of ε². In particular, by the Markov inequality, at least half of the agents must have variance less than 2ε². By Lemma 3.3, such an agent a must have large Fisher information, specifically J_a(t, A) ≥ 1/(2ε²).
To get some intuition on the convergence time, assume without loss of generality that the number of agents is odd, and let J_0 denote the median initial relative Fisher information of the agents (the median over the values J_a(0) = J_(Φ_a − θ)), and assume J_0 ≪ 1/ε². By definition, more than half of the population has initial Fisher information at most J_0. By the pigeonhole principle, at least one agent a has initial Fisher information at most J_0 while, at time T(ε), its Fisher information is at least 1/(2ε²). Corollary 6.1 thus implies a bound on the convergence time T(ε).
Lemma 6.2. Assume that J_0 ≪ 1/ε² for some small ε > 0. Then T(ε) ≥ (1/(2ε²) − J_0)/J_η.
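Under our reading of this bound, the lemma follows directly from the capacity argument: some agent must raise its Fisher information from at most J_0 to at least 1/(2ε²), gaining at most J_η per round. A minimal sketch (names ours):

```python
def convergence_time_lower_bound(eps, j0, j_eta):
    """Capacity-based lower bound on T(eps): an agent starting with
    Fisher information at most j0 must reach at least 1/(2 eps^2),
    and each round adds at most j_eta (the Fisher Channel Capacity)."""
    return (1.0 / (2.0 * eps ** 2) - j0) / j_eta

# Example: j0 = 1, j_eta = 10, eps = 0.01 -> at least (5000 - 1)/10 rounds.
print(convergence_time_lower_bound(0.01, 1.0, 10.0))
```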
Let J_t^max denote the maximal Fisher information over the agents at time t. In the case where J_0^max ≪ J_η, one can obtain a tighter lower bound on T(ε). Indeed, Theorem 4.1 implies that the maximum Fisher information grows by at most a factor of 2 in each round, that is, at any time t, J_{t+1}^max ≤ 2·J_t^max. This implies that the time it takes until J_t^max reaches J_η is at least log₂(J_η/J_0^max). From this time onwards, the bound implied by Lemma 6.2 holds. Hence, we obtain the following.

Lemma 6.3. Assume that J_0^max ≪ J_η ≪ 1/ε². Then T(ε) ≥ log₂(J_η/J_0^max) + (1/(2ε²) − 2J_η)/J_η.
Finally, we consider the case of exclusively-pairwise patterns, in which the pattern of meeting is such that an agent is never (or rarely) viewed by more than a single agent in each round. In exclusively-pairwise patterns, Theorem 4.1 implies that the average Fisher information, denoted J_t^mean, does not grow by more than a factor of two in each round. Similarly to Lemma 6.3, we obtain the following.

Lemma 6.4. Consider an exclusively-pairwise pattern of meeting, and assume that J_0^mean ≪ J_η ≪ 1/ε². Then T(ε) ≥ log₂(J_η/J_0^mean) + (1/(4ε²) − 2J_η)/J_η.

No active communication
We now compare the Conf algorithm to even simpler algorithms that rely only on passive communication. We show that efficient "passive" algorithms exist in some "uniform" settings or extremely noisy ones. We then turn our attention to even simpler algorithms, which are based on a fixed weighted average rule between the location of the observing agent a and the location estimation of the observed agent b. We restrict our analysis to meeting patterns in which each agent views another agent in each round.

Passive algorithms
When noise is not extremely high, "passive" algorithms seem to become less efficient as the conditions become increasingly non-uniform. To get some intuition, consider the extreme case in which all agents but one agent a are initially drawn from the same distribution Φ whose variance is extremely large (i.e., these agents are distributed almost uniformly in some large domain centered at θ*). Agent a, on the other hand, is drawn from an extremely concentrated distribution around θ*, making it "extremely knowledgeable". Further assume a uniform pattern of meeting, where in each round each agent views another agent chosen uniformly at random from the population. In such a case, active communication would allow standard rumor spreading, yielding very fast convergence, roughly within log n rounds, where n is the number of agents. On the other hand, it seems difficult to come up with a "passive" algorithm that competes well with Conf, as agents would somehow need to distinguish the knowledgeable agent a from the rest. Indeed, had the passive algorithm been executed with only the unknowledgeable agents (without agent a), it would not have converged fast; yet, since the algorithm must be unbiased and of low variance, some of the agents would in any case end up in the vicinity of θ*. The agents running in the first scenario would therefore need to distinguish the knowledgeable agent a from these other nearby agents, which appear similar in the second scenario. This is most likely impossible, since agents are anonymous and active communication is not allowed.

Fixed linear combination of locations
We now consider algorithms in which the update rule following an interaction is a simple linear combination of the location of the viewing agent and the estimated location of the viewed agent. More precisely, when agent a views agent b at time t, it shifts its location such that

x_a(t+1) = (1 − c)·x_a(t) + c·x̃_b(t),

for some constant 0 < c < 1, where x̃_b(t) denotes a's noisy estimate of the location of b (note that in algorithm Conf, c is not constant and is set according to the active message and a's current confidence). Informally, we find that such algorithms compare well with Conf under circumstances which are both very uniform and non-noisy. However, for large values of noise, or if the initial distributions vary significantly, the performance can be far from optimal.

Speed vs. accuracy tradeoff in uniform initial conditions
We focus first on initial conditions in which all agents are picked from distributions around the target θ* that all have the same variance var(0) ≫ v_η = var(N(η)). In this case, at any given time t, the variances of all agents are the same, and we denote this value by var(t). According to Equation 8, this variance satisfies the recursion rule

var(t+1) = ((1 − c)² + c²)·var(t) + c²·v_η.

The solution to this difference equation is

var(t) = (var(0) − var(∞))·(1 − 2c(1 − c))^t + var(∞), where var(∞) = c·v_η/(2(1 − c)).

The steady-state variance of this equation thus scales as c/(1 − c) and improves as c becomes smaller. However, for small values of c, the time to reduce the excess variance by half scales as log 2/(2c(1 − c)), which diverges for small values of c. Linear combination algorithms therefore exhibit a convergence time vs. accuracy tradeoff.
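Our reconstruction of this recursion can be checked numerically. The sketch below (names and numeric values ours) iterates the recursion and compares it against the closed-form steady state c·v_η/(2(1 − c)):

```python
def iterate_variance(var0, c, v_eta, rounds):
    """Iterate the uniform-population variance recursion of the fixed
    linear-combination rule x_a <- (1-c) x_a + c (x_b + noise)."""
    var = var0
    for _ in range(rounds):
        var = ((1.0 - c) ** 2 + c ** 2) * var + c ** 2 * v_eta
    return var

def steady_state(c, v_eta):
    """Fixed point of the recursion: c * v_eta / (2 (1 - c))."""
    return c * v_eta / (2.0 * (1.0 - c))

# Smaller c gives a lower plateau but a slower decay toward it.
print(iterate_variance(10.0, 0.5, 1.0, 200), steady_state(0.5, 1.0))
```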

The simple average algorithm
We start with the simplest, most natural algorithm, in which agents simply average their location with the estimated location of the agent they interact with; this is the case c = 1/2. This algorithm converges within time T ≈ log₂(var(0)/v_η) to the steady state var(∞) = v_η/2. In comparison with algorithm Conf, whose variance around θ* reaches arbitrarily small values (see Lemma 5.6), the variance of this simple average algorithm never goes below v_η/2. On the other hand, in the limit in which population variances are well above v_η, the rules of Conf reduce to this simple average algorithm. Therefore, for very small values of the noise variance, the performance of this simple average algorithm compares with that of Conf in terms of both convergence rate and steady-state variance.

Obtaining lower variance at the price of long convergence times
At steady state, the Fisher information passing through the channel is 2/3 of the maximal capacity. The simple average algorithm, obviously, takes no advantage of this information since, at steady state, the Fisher information of the agents remains fixed. To lift this restriction we can choose c = 2ε²/v_η for some small ε > 0 such that ε² ≪ v_η. Focusing again on a uniform meeting pattern, the variance at time t+1 can then be approximated as

var(t+1) ≈ (1 − 2c)·var(t) + c²·v_η.

It is easy to see that the steady state of this equation is var(∞) ≈ ε², which we can choose to be arbitrarily small. This, however, comes at the cost of very long convergence times:

T(ε) ≈ (v_η/(4ε²))·(ln(var(0)/v_η) + ln(v_η/ε²)),

where the first term, the time to reach variance v_η/2, is much longer than with c = 1/2 (see above), and the second term, the time to reach steady state from v_η/2, is much longer, for small ε > 0, than what is achieved by the Conf algorithm running under similar uniform conditions, which is on the order of 1/(ε²·J_η).
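The tradeoff can be made concrete with a short sketch, assuming the small-c choice c = 2ε²/v_η used in our reconstruction above (all names and values are ours): the plateau drops to roughly ε², while the half-life of the excess variance blows up like 1/c.

```python
import math

def steady_state_var(c, v_eta):
    """Fixed point of var -> ((1-c)^2 + c^2) * var + c^2 * v_eta."""
    return c * v_eta / (2.0 * (1.0 - c))

def half_life(c):
    """Rounds needed to halve the excess variance; the per-round
    decay factor of the recursion is (1-c)^2 + c^2 = 1 - 2c(1-c)."""
    return math.log(2.0) / -math.log(1.0 - 2.0 * c * (1.0 - c))

eps, v_eta = 0.05, 1.0
c = 2.0 * eps ** 2 / v_eta            # tune c so the plateau sits near eps^2
print(steady_state_var(c, v_eta))     # close to eps^2 = 0.0025
print(half_life(c) / half_life(0.5))  # vastly slower than the simple average
```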

Extensions of Conf to dynamic environments
The main analysis in this paper concerned the dissemination of information within a constant environment. This assumption allowed us to perform a rigorous analysis and aided in the definition of information flows within the population. The algorithm Conf, which we have shown to be highly competitive with an optimal algorithm in stationary environments, exhibits weaknesses when the environment is dynamic. These can mainly be attributed to the buildup of overconfident populations that occurs when agents with similar opinions repeatedly interact and enhance each other's confidence.
Once environmental conditions shift, this opinion may become wrong, but the agents may ignore any new information since their confidence in the wrong opinion is extremely high.
In this section we demonstrate how algorithm Conf may be extended to accommodate sudden changes in environmental conditions. These extensions require some minimal additional complexity in the capabilities of a single agent.

Algorithm 1: Communicating an extra 'updated' bit
In this extension, the agents perform the regular algorithm Conf, supplemented by storing and communicating an extra "updated" bit in their memory. We take this active communication to be noiseless. The "updated" bit is typically set to 0 (an 'outdated' agent) and may be set to 1 (an 'updated' agent) when an agent views the environment, or as described below. The environment is modeled by including an extra mobile agent who is always considered to be updated, and whose confidence is constantly high. We assume that interactions with the environment are rare in comparison to the rumor-spreading time scale, log(N), where N is the total number of agents. When an agent becomes updated, it changes its update bit to 1 and maintains it at this value for a·log(N) time steps, after which the agent resets this bit to 0 and becomes 'refractory and outdated' for b·log(N) extra time steps. Finally, (a + b)·log(N) time steps after it had first become 'updated', the agent returns to its initial, 'outdated', state.
The remaining interaction rules are as follows: • When an outdated agent views another outdated agent, it performs the regular rules as dictated by Conf.
• When an outdated agent observes an updated agent, it sets its own confidence to 0 and then performs the regular rules as dictated by Conf. If the observing agent is not refractory, it switches to the updated state.
• When a refractory outdated agent observes an updated agent it performs regular Conf but does not switch to an updated state.
• When an updated agent observes another updated agent it performs regular Conf.
• An updated agent ignores any interactions with outdated agents.
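The five rules above can be summarized as a small dispatch function. This is an illustrative sketch (all names are ours) that tracks only the update-state bookkeeping and flags whether the regular Conf update applies; the timed decay updated → refractory → outdated after a·log(N) and b·log(N) steps is assumed to be handled elsewhere.

```python
OUTDATED, UPDATED, REFRACTORY = "outdated", "updated", "refractory"

def interact(observer_state, observed_state):
    """Dispatch for the 'updated bit' extension of Conf.

    Returns (apply_conf, reset_confidence, new_observer_state): whether
    the regular Conf rules run, whether the observer first zeroes its
    confidence, and the observer's next update-state.
    """
    if observer_state == UPDATED:
        if observed_state == UPDATED:
            return True, False, UPDATED      # updated views updated: plain Conf
        return False, False, UPDATED         # updated ignores outdated agents
    if observed_state == UPDATED:
        if observer_state == REFRACTORY:
            return True, False, REFRACTORY   # refractory: Conf, but no switch
        return True, True, UPDATED           # outdated: reset confidence, switch
    return True, False, observer_state       # outdated views outdated: plain Conf
```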
An example of the performance of this algorithm is given in SI-Figure 1A.

Algorithm 2: Correcting interactions for correlations
Contrary to the algorithm presented in the previous subsection, which complemented the rules of Conf by additional rules regarding an extra bit, the algorithm presented in this subsection actually modifies the Conf location and confidence update rules. This algorithm follows the modifications to the simple inverse-variance weighting, as suggested by Oruc et al. [68], to deal with interactions between agents that possess correlated information. Here, the agents take into account the expected correlations between their locations before an interaction to more accurately update their position and confidence after the interaction.
The interaction rules used are an extension of those of Conf. Specifically, let v_η denote the variance of N(η). Upon receiving c_b(t) and d_ab(t), agent a, which views an agent b such that the correlation between their positions is ρ, proceeds as follows:
• Update confidence: c_a(t+1) is set according to the correlation-corrected inverse-variance weighting rule of [68].

We simulated this algorithm on a uniform population. This is because, in a non-uniform population, the correlation between any two agents may differ, and this is difficult to keep track of in an anonymous population of agents with limited memory, as is of interest here. In the simulation, we calculated the correlation between the agents one step at a time, so that the correct correlations could be applied to the interaction rules at every time step. We find (see SI-Figure 1B) that before the environment changes, and although the agents continuously interact, they do not become overconfident. In fact, after a long time in which all agents interacted with each other, the Fisher information at each agent equals the initial Fisher information in the entire population. This provides strong evidence that the algorithm suggested by Oruc et al. [68] actually performs well in a population setting. Furthermore, the fact that the agents' confidence stays bounded, even at long times, allows them to quickly track an environmental shift when it happens (see SI-Figure 1C).

Figure 1: Extensions of Conf to dynamic environments. A. In the 'updated bit' algorithm, the agents behave exactly as they would in algorithm Conf except for maintaining and communicating a single extra bit, which signifies that they have recently received new environmental evidence, either directly or indirectly. Agents remain 'updated' for a time of a·log(N) (N is the number of agents) and then become refractory for another b·log(N) time steps. The figure shows how this modified Conf algorithm follows an environmental change (at the green arrow), while Conf cannot. Here a = 2, b = 7 and N = 10^4, and communication noise is present. B. Agents use a modified interaction rule that takes into account the correlation between them prior to the interaction.
While under the Conf algorithm the confidence of agents is unbounded and explodes (note that the figure is in log scale), this modified algorithm ensures that their confidence remains bounded. In fact, the confidence of each agent at long time scales exactly equals the initial Fisher information of the whole population at t = 0 (depicted by the black line). This signifies that, despite the communication noise, this algorithm allows the agents to share all initial information while remaining realistically confident about their current state. C. The fact that under the modified interaction rule the agents do not become overconfident allows them to track an environmental shift (green arrow). This, again, stands in contrast to the abilities of the original Conf algorithm. In this case, the population size was taken to be small (N = 100), so that large correlations are created.

Heterogeneous populations
In a heterogeneous population there is no guarantee that all agents measure distance in exactly the same way. We tested how algorithm Conf performs in a population in which agents have different grasps of distance. SI-Figure 2 shows that the algorithm dominates an algorithm with no active communication even when different individuals within the population perceive distances with multiplicative errors that vary between 1/2 and 2. For even larger biases, Conf still performs better at long, but not short, time scales. SI-Figure 2 depicts the convergence of Conf on a population of N = 10^4 agents that vary in the way they measure distance. To simulate this population, each agent was matched with a factor that multiplies all its distance measurements. The factors were chosen uniformly at random in a range that varies with line color. For example, in the population corresponding to b = 4, agents perceive distances to be between 4 times smaller and 4 times larger than they actually are. The convergence of populations with different biases can be compared to the original Conf algorithm (b = 1) and to a simple average algorithm that does not communicate confidence and is unaffected by measurement biases.
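A simulation in this spirit can be sketched compactly. The following is our own illustrative reconstruction, not the paper's simulation code: each agent carries a multiplicative bias applied to every displacement it measures, interactions follow our inverse-variance reading of Conf, and the returned quantity is the mean squared error around the target (here 0). Parameter values, names, and the exact update rule are assumptions.

```python
import random
import statistics

def simulate_biased_conf(n=200, rounds=60, v0=4.0, v_eta=0.5, b=2.0, seed=1):
    """Sketch of Conf on a population whose agents mis-measure distances
    by a per-agent multiplicative factor drawn from [1/b, b]."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, v0 ** 0.5) for _ in range(n)]   # target is theta* = 0
    c = [1.0 / v0] * n                                  # confidence = 1/variance
    c_n = 1.0 / v_eta                                   # noise confidence
    bias = [rng.uniform(1.0 / b, b) for _ in range(n)]
    for _ in range(rounds):
        new_x, new_c = x[:], c[:]
        for a in range(n):
            other = rng.randrange(n - 1)
            if other >= a:                              # a views a random other agent
                other += 1
            # a's noisy, bias-distorted measurement of the displacement to b.
            d = bias[a] * (x[other] - x[a] + rng.gauss(0.0, v_eta ** 0.5))
            c_recv = 1.0 / (1.0 / c[other] + 1.0 / c_n)
            new_x[a] = x[a] + d * c_recv / (c[a] + c_recv)
            new_c[a] = c[a] + c_recv
        x, c = new_x, new_c
    return statistics.mean(v * v for v in x)            # MSE around the target
```

With biases confined to [1/2, 2] the population still contracts toward the target, in line with the robustness reported above.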