An Evolutionary Model of Bounded Rationality and Intelligence

Background
Most economic theories are based on the premise that individuals maximize their own self-interest and correctly incorporate the structure of their environment into all decisions, thanks to human intelligence. The influence of this paradigm goes far beyond academia: it underlies current macroeconomic and monetary policies, and is also an integral part of existing financial regulations. However, there is mounting empirical and experimental evidence, including the recent financial crisis, suggesting that humans do not always behave rationally, but often make seemingly random and suboptimal decisions.

Methods and Findings
Here we propose to reconcile these contradictory perspectives by developing a simple binary-choice model that takes evolutionary consequences of decisions into account as well as the role of intelligence, which we define as any ability of an individual to increase its genetic success. If no intelligence is present, our model produces results consistent with prior literature and shows that risks that are independent across individuals in a generation generally lead to risk-neutral behaviors, but that risks that are correlated across a generation can lead to behaviors such as risk aversion, loss aversion, probability matching, and randomization. When intelligence is present the nature of risk also matters, and we show that even when risks are independent, either risk-neutral behavior or probability matching will occur depending upon the cost of intelligence in terms of reproductive success. In the case of correlated risks, we derive an implicit formula that shows how intelligence can emerge via selection, why it may be bounded, and how such bounds typically imply the coexistence of multiple levels and types of intelligence as a reflection of varying environmental conditions.

Conclusions
Rational economic behavior in which individuals maximize their own self-interest is only one of many possible types of behavior that arise from natural selection. The key to understanding which types of behavior are more likely to survive is how behavior affects reproductive success in a given population's environment. From this perspective, intelligence is naturally defined as behavior that increases the probability of reproductive success, and bounds on rationality are determined by physiological and environmental constraints.


The General Model
We first present our general model, which encompasses the possibility of variation in outcomes across individuals within a generation, as well as across generations, and also the possibility of intelligent behavior. In the following sections we consider several special cases giving rise to the results stated in the accompanying paper.
Each individual in a population is faced with a single decision in its lifetime, choosing action $a$ or $b$, and this choice results in a certain number of offspring, $x_a$ or $x_b$, respectively. The quantities $x_a$ and $x_b$ are random variables with a joint distribution $\Phi(x_a, x_b)$. The behavior of individual $i$ is represented by a 0/1 Bernoulli trial, $I$, with probability $f$, i.e., $a$ is chosen with probability $f$ (in which case $I = 1$), and $b$ is chosen with probability $1 - f$ (in which case $I = 0$). When we wish to specify the outcomes applicable to a particular individual, $i$, for any of these variables, we add a subscript $i$. Similarly, when we wish to specify a particular generation, $t$, we add the subscript $t$.
We assume that an individual with a choice function $I$ has offspring with the identical choice function $I$. We are interested in the growth of the population of individuals with a specific choice function over time, and we write $n_t$ for the number of such individuals in generation $t$. In general, we have

$$n_t = \sum_{i=1}^{n_{t-1}} \left( I_{it}\, x_{ait} + (1 - I_{it})\, x_{bit} \right),$$

where the sum runs over all individuals in generation $t - 1$. We assume that although all the individuals have the same function $I$, the random variable for each individual is independent of all the others. We also assume that the distributions $\Phi_{it}$ are independent and identically distributed across individuals $i$ and times $t$. Under these assumptions, the value of $n_t$ can be expressed as

$$n_t \overset{p}{=} n_{t-1} \left( E[I_t]\,E[x_{at}] + \mathrm{Cov}(I_t, x_{at}) + E[1 - I_t]\,E[x_{bt}] + \mathrm{Cov}(1 - I_t, x_{bt}) \right), \tag{1}$$

where the expectations and covariances in (1) are calculated for a typical individual having offspring at time $t$. No subscript, $i$, is needed to index these individuals since all members of the population have the same expectation of outcomes and the same choice function $I_t$. The symbol $\overset{p}{=}$ denotes equivalence in probability, and this equivalence (1) follows from the Law of Large Numbers.¹

Introducing some new notation, we can rewrite (1) as

$$n_t \overset{p}{=} n_{t-1} \left( f \mu_{at} + (1 - f) \mu_{bt} + \sqrt{f(1 - f)}\, \left( \rho_{at} \sigma_{at} + \rho_{bt} \sigma_{bt} \right) \right), \tag{2}$$

where $\mu_{at}$ and $\mu_{bt}$ represent the common expected values for $x_{ait}$ and $x_{bit}$ for each individual $i$ at time $t$, where $\sigma_{at}$ and $\sigma_{bt}$ represent the corresponding standard deviations, and where $\rho_{at}$ and $\rho_{bt}$ represent the common correlations at time $t$ of each $x_{ait}$ and $x_{bit}$ with $I_{it}$ and $1 - I_{it}$, respectively. Because all of these values are the same across all individuals having offspring at time $t$, the subscript $i$ is not necessary in any of the terms in (2). It is also convenient to write

$$n_t \overset{p}{=} n_{t-1} \left( f \mu_{at} + (1 - f) \mu_{bt} + \rho_t \sigma_t \sqrt{f(1 - f)} \right), \tag{3}$$

where $\rho_t$ is the common correlation of $I_{it}$ with $y_{it} = x_{ait} - x_{bit}$ for each individual $i$ having offspring at time $t$, and where $\sigma_t$ is the common standard deviation of $y_{it}$ for such individuals.
We use backward recursion to find that

$$n_t \overset{p}{=} n_0 \prod_{s=1}^{t} \left( f \mu_{as} + (1 - f) \mu_{bs} + \rho_s \sigma_s \sqrt{f(1 - f)} \right),$$

where $n_0$ is the number of individuals in the population at time $t = 0$. From this we deduce that

$$\frac{1}{t} \log n_t \overset{p}{\to} E\left[ \log\left( f \mu_{at} + (1 - f) \mu_{bt} + \rho_t \sigma_t \sqrt{f(1 - f)} \right) \right], \tag{4}$$

where the expectation is taken with respect to the continuous limit of the distribution of the random variable values over times $t$.²

¹ In particular, the sums over the sample population converge almost surely to the unrestricted means and covariances. This follows since the variance of each relevant random variable must be bounded, provided that there is an upper bound on the possible number of offspring a single individual may have.

² Note that to apply the Law of Large Numbers here we assume that the terms $\log\left( f \mu_{at} + (1 - f) \mu_{bt} + \rho_t \sigma_t \sqrt{f(1 - f)} \right)$ have bounded variance. This assumption is valid provided that the distribution of the argument of the logarithm does not have positive mass in arbitrarily small neighborhoods of zero.
It is convenient to introduce a new notation for the right-hand side of (4), namely

$$\alpha = E\left[ \log\left( f \mu_{at} + (1 - f) \mu_{bt} + \rho_t \sigma_t \sqrt{f(1 - f)} \right) \right]. \tag{5}$$

In what follows, we seek to identify the values of $I$ that give rise to the maximum value for $\alpha$, since individuals with such values will dominate the population over time in a sense made precise by the following proposition.
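Proposition 1 Suppose that two different choice functions, $I_1$ and $I_2$, give rise to values $\alpha_1$ and $\alpha_2$, with the property that $\alpha_1 > \alpha_2$. Individuals with the choice function $I_1$ will become exponentially more numerous over time, since

$$\left( \frac{n_t^{I_2}}{n_t^{I_1}} \right)^{1/t} \overset{p}{\to} e^{\alpha_2 - \alpha_1} < 1 \quad \text{as } t \to \infty,$$

where $n_t^{I_k}$ denotes the number of individuals with choice function $I_k$ in generation $t$.

In the following sections, we consider the parameters giving rise to maximal values for $\alpha$ under various specific assumptions about the nature of the distributions $\Phi_{it}$.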
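As a rough numerical illustration of Proposition 1, the sketch below evaluates the growth rate for two behaviors in a hypothetical two-state environment. It assumes that (5) has the form $\alpha = E[\log(f\mu_{at} + (1-f)\mu_{bt} + \rho_t\sigma_t\sqrt{f(1-f)})]$, and the function name `growth_rate`, the state list, and all numerical values are illustrative assumptions, not taken from the paper.

```python
import math

def growth_rate(f, states, rho=0.0):
    """Assumed form of alpha in (5), averaged over discrete environmental states.

    states: list of (probability, mu_a, mu_b, sigma) tuples (hypothetical data).
    rho:    correlation between the choice and the payoff difference.
    """
    total = 0.0
    for prob, mu_a, mu_b, sigma in states:
        arg = f * mu_a + (1.0 - f) * mu_b + rho * sigma * math.sqrt(f * (1.0 - f))
        total += prob * math.log(arg)
    return total

# Hypothetical environment: action a is better 70% of the time.
states = [(0.7, 3.0, 1.0, 1.0), (0.3, 1.0, 3.0, 1.0)]

alpha_1 = growth_rate(0.9, states)  # behavior f = 0.9
alpha_2 = growth_rate(0.5, states)  # behavior f = 0.5

# Difference in log population size after 100 generations, in the limit sense
# of Proposition 1 (initial sizes are ignored).
log_ratio_100 = 100 * (alpha_1 - alpha_2)
print(alpha_1, alpha_2, log_ratio_100)
```

Because the comparison depends only on the difference $\alpha_1 - \alpha_2$, even a small gap in growth rates compounds into a large gap in population shares over many generations.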

The Case of No Intelligence
We say that a member of the population exhibits intelligence if its behavior correlates positively with outcomes, i.e., if $\rho > 0$. If $\rho = 0$, however, then we say that no intelligence is present.³ In this situation, we can write $\alpha = \alpha(f)$, and we write $f^*$ for the value of $f$ that gives rise to the maximum value of $\alpha$. Also, we can write the expression for $\alpha$ from (5) as

$$\alpha(f) = E\left[ \log\left( f \mu_{at} + (1 - f) \mu_{bt} \right) \right].$$

The value $f^*$ that maximizes this expression for $\alpha$ is characterized by the following proposition.
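Proposition 2 If intelligence is not present in a population, the growth-optimal behavior $f^*$ is given by

$$f^* = \begin{cases} 1 & \text{if } E[\mu_{at}/\mu_{bt}] > 1 \text{ and } E[\mu_{bt}/\mu_{at}] < 1, \\ \text{solution to (7)} & \text{if } E[\mu_{at}/\mu_{bt}] \ge 1 \text{ and } E[\mu_{bt}/\mu_{at}] \ge 1, \\ 0 & \text{if } E[\mu_{at}/\mu_{bt}] < 1 \text{ and } E[\mu_{bt}/\mu_{at}] > 1, \end{cases} \tag{6}$$

where $f^*$ is defined implicitly in the second case of (6) by

$$E\left[ \frac{\mu_{at} - \mu_{bt}}{f^* \mu_{at} + (1 - f^*) \mu_{bt}} \right] = 0, \tag{7}$$

and the expectations are taken with respect to the joint distributions across time $t$ for $\mu_{at}$ and $\mu_{bt}$, as these distributions are implied by the $\Phi_{it}$.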
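Proof. The result can be seen by computing the first and second derivatives of $\alpha$. Because the second derivative is strictly negative, there is exactly one maximum value obtained in the interval $f \in [0, 1]$. The values of the first derivative of $\alpha(f)$ at the endpoints of the interval are

$$\alpha'(0) = E\left[ \frac{\mu_{at}}{\mu_{bt}} \right] - 1, \qquad \alpha'(1) = 1 - E\left[ \frac{\mu_{bt}}{\mu_{at}} \right].$$

If $\alpha'(0)$ and $\alpha'(1)$ are both positive or both negative, then $\alpha(f)$ increases or decreases, respectively, throughout the interval and the maximum value is attained at $f = 1$ or $f = 0$, respectively. Otherwise, $\alpha'(0) \ge 0 \ge \alpha'(1)$, so there is a point $f^* \in [0, 1]$ at which $\alpha'(f^*) = 0$. This is the condition (7), and it is at this point that $\alpha(f)$ attains its maximum value. The expression (6) summarizes the results of these observations for the various possible cases.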
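The case analysis in this proof can be sketched numerically. The code below is a minimal illustration, assuming that the environment takes finitely many states and that $\alpha(f) = E[\log(f\mu_{at} + (1-f)\mu_{bt})]$; the function name `optimal_f` and the example states are hypothetical. It checks the signs of $\alpha'(0)$ and $\alpha'(1)$ and otherwise locates the interior root of $\alpha'(f)$ by bisection, which is justified by the concavity of $\alpha$.

```python
def optimal_f(states, tol=1e-10):
    """Growth-optimal f without intelligence, for discrete states.

    states: list of (probability, mu_a, mu_b) with mu_a, mu_b > 0 (hypothetical data).
    """
    # Endpoint derivatives of alpha(f) = E[log(f*mu_a + (1-f)*mu_b)].
    d0 = sum(p * a / b for p, a, b in states) - 1.0   # alpha'(0)
    d1 = 1.0 - sum(p * b / a for p, a, b in states)   # alpha'(1)
    if d0 <= 0.0:
        return 0.0   # alpha is non-increasing on [0, 1]
    if d1 >= 0.0:
        return 1.0   # alpha is non-decreasing on [0, 1]

    def dalpha(f):
        return sum(p * (a - b) / (f * a + (1.0 - f) * b) for p, a, b in states)

    # alpha is concave, so alpha' is decreasing and has a single interior root.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if dalpha(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical example: a pays 3 and b pays 1 with probability 0.7, reversed otherwise.
f_star = optimal_f([(0.7, 3.0, 1.0), (0.3, 1.0, 3.0)])
print(f_star)
```

In this example the optimum is interior, so the population randomizes between the two actions rather than always choosing the action with the higher expected payoff.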

A Universal Measure and Cost of Intelligence
As we have noted, the case of no intelligence corresponds to no correlation, i.e., $\rho_t = 0$, while the case of intelligence corresponds to positive correlation, i.e., $\rho_t > 0$, with higher values representing more intelligence. The correlation $\rho_t$ cannot necessarily assume any value in the range $[0, 1]$, however, and it is in fact constrained by the choice of $f$. More specifically, $\rho_t$ can assume all values in the range $[0, \rho_{t,\max}(f)]$ but no values outside this range, where $\rho_{t,\max}(f)$ is a function dependent on $f$ and the $\Phi_{it}$. A precise value for $\rho_{t,\max}$ is calculated in Proposition 4, below. Because the upper bound for $\rho_t$ depends upon $f$, the measure $\rho_t$ is difficult to use as a universal representation of underlying intelligence. We therefore introduce the additional variable $\gamma$, defined as

$$\gamma = \frac{\rho_t}{\rho_{t,\max}(f)}. \tag{8}$$

This is a universal representation of intelligence in the sense that it represents the fraction of the maximum possible correlation achievable, and this fraction remains constant even as the maximum possible correlation varies with $f$.
In the case in which there is no variation in the $\Phi_{it}$ across time, the values of $\rho_t$ and $\rho_{t,\max}(f)$ are the same for all $t$, and we write these common values as $\rho$ and $\rho_{\max}(f)$. This is the case we considered in our main paper, and there we simply used $\rho$ as the measure of intelligence instead of $\gamma$, since the two measures are the same up to a constant rescaling factor that is common across all generations. For purposes of this Supplementary Information, however, we deal with the more general situation in which we must use $\gamma$ instead of $\rho$ as the universal measure of intelligence.
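We suppose that a member of a population has a particular value of $f \in [0, 1]$ and a particular value of $\gamma \in [0, 1]$, and that these attributes are passed on to all offspring of an individual. In terms of $f$ and $\gamma$, the expression for $\alpha$ in (5) can be written

$$\alpha(f, \gamma) = E\left[ \log\left( f \mu_{at} + (1 - f) \mu_{bt} + \gamma\, \rho_{t,\max}(f)\, \sigma_t \sqrt{f(1 - f)} \right) \right]. \tag{9}$$

We also consider the possibility that $\gamma$ has a cost, $c(\gamma)$, associated with it, and that once this cost is factored in, the expression for $\alpha$ becomes

$$\alpha(f, \gamma) = E\left[ \log\left( f \mu_{at} + (1 - f) \mu_{bt} + (\gamma - c(\gamma))\, \rho_{t,\max}(f)\, \sigma_t \sqrt{f(1 - f)} \right) \right]. \tag{10}$$

We assume that $c(0) = 0$ and $c(\gamma) > 0$ for $\gamma > 0$. We also assume that $\gamma - c(\gamma) > 0$ for sufficiently small values of $\gamma$ and that $\gamma - c(\gamma) < 0$ for values of $\gamma$ sufficiently close to 1. Thus, at least some small amount of intelligence is beneficial, but high costs make the choice of $\gamma = 1$ prohibitively expensive. In addition, we make the further assumption that $c$ is twice continuously differentiable and that $c'(\gamma) > 0$ and $c''(\gamma) > 0$. Because of this assumption, there is a unique value $\gamma^*$ that maximizes $\gamma - c(\gamma)$:

$$\gamma^* = \text{unique } \gamma \text{ such that } c'(\gamma) = 1. \tag{11}$$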
It is convenient to introduce some additional notation related to the distribution of the $y_{it}$, which is given by the function $\Phi_{it}$. We write $\phi_t$ for the probability that $y_{it} > 0$ for individuals in generation $t$, i.e.,

$$\phi_t = \mathrm{Prob}(y_{it} > 0). \tag{12}$$

This value is thus the probability that choice $a$ is superior to choice $b$ in generation $t$. In addition, we write $\delta_t^+$ and $\delta_t^-$ for the expected value of $y_{it}$ conditional on either $y_{it} > 0$ or $y_{it} \le 0$, respectively. That is,

$$\delta_t^+ = E[y_{it} \mid y_{it} > 0], \qquad \delta_t^- = E[y_{it} \mid y_{it} \le 0]. \tag{13}$$

The values of $\phi_t$, $\delta_t^+$, and $\delta_t^-$ are constant across all individuals in generation $t$ because the functions $\Phi_{it}$ are independent and identical across individuals in generation $t$. In much of what follows, we also find it convenient to make the following assumption about the independence of $y_{it}$ and $I_{it}$, conditional on the sign of $y_{it}$. This assumption may be violated in a fully general case, but it allows us to simplify our analysis and obtain more tractable formulas while still retaining a rich framework in which to operate.
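A1 For all $i$ and $t$, conditional on the sign of $y_{it}$, the distribution of $y_{it}$ and the distribution of $I_{it}$ are independent. Thus,

$$E[I_{it}\, y_{it} \mid y_{it} > 0] = E[I_{it} \mid y_{it} > 0]\; E[y_{it} \mid y_{it} > 0], \qquad E[I_{it}\, y_{it} \mid y_{it} \le 0] = E[I_{it} \mid y_{it} \le 0]\; E[y_{it} \mid y_{it} \le 0].$$

In other words, the value of $I_{it}$ can only depend upon the sign of $y_{it}$, and thus the question of whether $a$ or $b$ is the superior choice, and not upon additional information about the degree of the superiority of one choice over the other.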
Under assumption A1, and using Propositions 3 and 4, which are proven in Section 4 below, we can rewrite (10) as

$$\alpha(f, \gamma) = E\left[ \log\left( f \mu_{at} + (1 - f) \mu_{bt} + (\gamma - c(\gamma))\, (\delta_t^+ - \delta_t^-)\, \left( \min(f, \phi_t) - f \phi_t \right) \right) \right]. \tag{14}$$

We seek the values of $f$ and $\gamma$ that maximize $\alpha$, as defined in (14). If the optimal value occurs when $f = 0$ or $f = 1$, then the amount of intelligence is clearly irrelevant, since the term involving intelligence vanishes and behavior is simply deterministic. If the optimal value occurs when $f \in (0, 1)$, then the optimal amount of intelligence is clearly $\gamma = \gamma^*$, as defined above. The nature of the optimal choice of $f$ is derived in Propositions 5 and 6, in the case of no systematic variation across generations, and Proposition 7, in the general case.

Upper Bound on Correlation
In this section we consider the restrictions on the possible values for $\rho_t$. The value of $\rho_t$ is subject to constraints that depend upon the nature of the distributions $\Phi_{it}(x_a, x_b)$, as well as on the value of $f$ for the population. The next proposition makes this dependence clear, when assumption A1 holds.

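Proposition 3 Under assumption A1, the value of $\rho_t$ is given by

$$\rho_t = r_t\, \frac{(\delta_t^+ - \delta_t^-) \sqrt{\phi_t (1 - \phi_t)}}{\sigma_t}, \qquad r_t = \mathrm{Corr}\left( I_{it}, H(y_{it}) \right), \tag{15}$$

where $H(y_{it})$ is the Heaviside function, which is 1 when $y_{it} > 0$ and 0 otherwise. The values of $\delta_t^+$, $\delta_t^-$, and $\phi_t$ are as defined in (13) and (12). The correlation between $I_{it}$ and $H(y_{it})$ may also be written as

$$r_t = \frac{\pi_t - f \phi_t}{\sqrt{f(1 - f)\, \phi_t (1 - \phi_t)}}, \tag{16}$$

where

$$\pi_t = \mathrm{Prob}(I_{it} = 1 \text{ and } y_{it} > 0). \tag{17}$$

The value of $r_t$ and that of $\pi_t$ are the same for each individual $i$ in a given generation $t$.

Proof. The result follows directly from the definition of correlation. We have

$$\rho_t = \frac{\mathrm{Cov}(I_{it}, y_{it})}{\sqrt{f(1 - f)}\, \sigma_t} = \frac{\pi_t \delta_t^+ + (f - \pi_t) \delta_t^- - f \left( \phi_t \delta_t^+ + (1 - \phi_t) \delta_t^- \right)}{\sqrt{f(1 - f)}\, \sigma_t} = \frac{(\pi_t - f \phi_t)(\delta_t^+ - \delta_t^-)}{\sqrt{f(1 - f)}\, \sigma_t} = r_t\, \frac{(\delta_t^+ - \delta_t^-) \sqrt{\phi_t (1 - \phi_t)}}{\sigma_t},$$

which proves (15).

The value of $\pi_t$ does not depend upon the choice of individual $i$ because the functions $\Phi_{it}$ and $I_{it}$ are independent and identically distributed across individuals in generation $t$. The lack of dependence of $r_t$ on the choice of individual $i$ within a given generation can be seen by noting that

$$r_t = \frac{\rho_t\, \sigma_t}{(\delta_t^+ - \delta_t^-) \sqrt{\phi_t (1 - \phi_t)}}$$

and observing that the right-hand side of this equation does not depend upon the choice of individual $i$.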
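The covariance identity behind Proposition 3 can be checked by exact enumeration. The sketch below assumes a small discrete distribution for $y$ and a choice rule that depends only on the sign of $y$, as in assumption A1; the names `check_identity`, `q_plus`, and `q_minus`, and all numerical values, are hypothetical. It compares $\mathrm{Cov}(I, y)$ computed directly with the expression $(\pi - f\phi)(\delta^+ - \delta^-)$.

```python
import math

def check_identity(ys, ps, q_plus, q_minus):
    """Compare Cov(I, y) with (pi - f*phi) * (delta_plus - delta_minus).

    ys, ps:   support and probabilities of y (hypothetical discrete distribution).
    q_plus:   Prob(I = 1 | y > 0);  q_minus: Prob(I = 1 | y <= 0)  (A1: sign only).
    """
    phi = sum(p for y, p in zip(ys, ps) if y > 0)
    delta_plus = sum(p * y for y, p in zip(ys, ps) if y > 0) / phi
    delta_minus = sum(p * y for y, p in zip(ys, ps) if y <= 0) / (1.0 - phi)
    f = phi * q_plus + (1.0 - phi) * q_minus      # Prob(I = 1)
    pi = phi * q_plus                             # Prob(I = 1 and y > 0)
    mean_y = sum(p * y for y, p in zip(ys, ps))
    var_y = sum(p * y * y for y, p in zip(ys, ps)) - mean_y ** 2
    # E[I*y] computed state by state, then the covariance.
    e_iy = sum(p * y * (q_plus if y > 0 else q_minus) for y, p in zip(ys, ps))
    cov_direct = e_iy - f * mean_y
    cov_formula = (pi - f * phi) * (delta_plus - delta_minus)
    rho = cov_direct / math.sqrt(f * (1.0 - f) * var_y)
    return cov_direct, cov_formula, rho

cov_direct, cov_formula, rho = check_identity(
    ys=[-2.0, -1.0, 1.0, 3.0], ps=[0.2, 0.3, 0.3, 0.2], q_plus=0.8, q_minus=0.4
)
print(cov_direct, cov_formula, rho)
```

The two covariance values agree, and the resulting correlation lies strictly between 0 and 1, since this choice rule is informative about the sign of $y$ but not perfectly so.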
An implication of the formula in (15) is that the possible values for $\rho_t$ cannot necessarily be arbitrarily close to 1. The range of possible values for $\rho_t$ is made more precise by the following proposition.

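Proposition 4 Under assumption A1, the range of possible values for $\pi_t$ when intelligence is present is

$$f \phi_t < \pi_t \le \min(f, \phi_t). \tag{18}$$

The corresponding range of possible values for $r_t$ is

$$0 < r_t \le \frac{\min(f, \phi_t) - f \phi_t}{\sqrt{f(1 - f)\, \phi_t (1 - \phi_t)}} \le 1. \tag{19}$$

The inequality $r_t < 1$ holds whenever $f \ne \phi_t$. The corresponding range of possible values for $\rho_t$ is

$$0 < \rho_t \le \rho_{t,\max}(f) = \frac{(\delta_t^+ - \delta_t^-) \left( \min(f, \phi_t) - f \phi_t \right)}{\sigma_t \sqrt{f(1 - f)}} \le 1. \tag{20}$$

The inequality $\rho_{t,\max} < 1$ holds unless $r_t = 1$ and both $E[y_{it}^2 \mid y_{it} > 0] = \left( E[y_{it} \mid y_{it} > 0] \right)^2$ and $E[y_{it}^2 \mid y_{it} \le 0] = \left( E[y_{it} \mid y_{it} \le 0] \right)^2$.

Proof. From the definition of $\pi_t$ in (17), it is clear that $\pi_t$ must be bounded above by $\min(f, \phi_t)$. Also, because intelligence only occurs when $\rho_t > 0$, the formulas in (15) and (16) show that it must also be the case that $\pi_t$ is bounded below by $f \phi_t$, and this suffices to prove (18).

The range of possible values for $r_t$ can be derived by substituting the limits of the possible range for $\pi_t$ into the expression for $r_t$ in (16), and the range of possible values for $\rho_t$ follows from (15).

To obtain an upper bound on $\rho_{t,\max}$, it is useful to proceed by first deriving a lower bound for $\sigma_t$. Note that Hölder's Inequality shows that

$$E[y_{it}^2 \mid y_{it} > 0] \ge (\delta_t^+)^2, \qquad E[y_{it}^2 \mid y_{it} \le 0] \ge (\delta_t^-)^2,$$

so that

$$\sigma_t^2 \ge \phi_t (\delta_t^+)^2 + (1 - \phi_t)(\delta_t^-)^2 - \left( \phi_t \delta_t^+ + (1 - \phi_t) \delta_t^- \right)^2.$$

The expectations are all taken with respect to a particular individual within generation $t$ and are also independent of the specific choice of individual within the generation. Also, the conditions for equality in Hölder's Inequality show that equality only holds for our lower bound on $\sigma_t^2$ when both of the conditional equalities stated in the proposition hold. The lower bound on $\sigma_t^2$ can be rewritten as

$$\sigma_t^2 \ge \phi_t (1 - \phi_t)(\delta_t^+ - \delta_t^-)^2,$$

and this yields the upper bound on $\rho_{t,\max}$ described in the proposition.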

Intelligence and No Variation Across Generations
We now consider the case in which there may be intelligence, so that it is possible to have $\gamma > 0$, but we assume that there is no variation in the distribution of possible outcomes across generations. Thus, we have $\mu_{at} = \mu_a$, $\mu_{bt} = \mu_b$, $\delta_t^+ = \delta^+$, $\delta_t^- = \delta^-$, and $\phi_t = \phi$. In this case, we can write the expression for $\alpha$ from (9) as

$$\alpha = \log\left( f \mu_a + (1 - f) \mu_b + \gamma\, \rho_{\max}(f)\, \sigma \sqrt{f(1 - f)} \right). \tag{21}$$

Note that we do not take the expectation of the logarithm in this expression for $\alpha$, since the value of the logarithm is constant across generations under the current assumptions. The values of $f$ and $\gamma$ that maximize $\alpha$ in the case in which intelligence is costless (so that $c(\gamma) \equiv 0$) are characterized by the following proposition.
Proposition 5 Under assumption A1, and under the further assumptions that intelligence has no cost and that there is no variation in outcome possibilities across generations, the values of $f$ and $\gamma$ at which $\alpha$ is maximized are $f = \phi$ and $\gamma = 1$, provided that $\phi \in (0, 1)$. If $\phi$ is either 0 or 1, then $\alpha$ is maximized when $f = \phi$, and the value of $\gamma$ is irrelevant.
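Proof. The expression for $\alpha$ in (21) is maximized when the argument of the logarithm is maximized, and under assumption A1 this can be written

$$e^{\alpha} = f \mu_a + (1 - f) \mu_b + \gamma\, (\delta^+ - \delta^-) \left( \min(f, \phi) - f \phi \right).$$

The partial derivative of $e^{\alpha}$ with respect to $\gamma$ is

$$\frac{\partial e^{\alpha}}{\partial \gamma} = (\delta^+ - \delta^-) \left( \min(f, \phi) - f \phi \right),$$

and this value is greater than 0 provided that $f \notin \{0, 1\}$, since $\delta^+ - \delta^- > 0$. Thus, if $f \notin \{0, 1\}$, $e^{\alpha}$ is a strictly increasing function of $\gamma$, and the maximum value of $e^{\alpha}$ for any fixed value of $f \in (0, 1)$ is obtained when $\gamma = 1$.

If $\phi$ is 0 or 1, then the value of $f$ that maximizes $\alpha$ and $e^{\alpha}$ is clearly $f = \phi$, and the value of $\gamma$ is irrelevant, since intelligence is irrelevant for an optimal outcome in this situation.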
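The step from $\gamma = 1$ to $f = \phi$ can be made explicit with a short worked derivation. The sketch below assumes that, under A1 with $c(\gamma) \equiv 0$, the argument of the logarithm takes the form $e^{\alpha} = f\mu_a + (1-f)\mu_b + \gamma(\delta^+ - \delta^-)(\min(f,\phi) - f\phi)$, which is a reconstruction rather than a formula quoted from the source. It also uses the identity $\mu_a - \mu_b = \phi\delta^+ + (1-\phi)\delta^-$, which follows from the definitions of $\phi$, $\delta^+$, and $\delta^-$.

```latex
% Derivative of e^alpha with respect to f at gamma = 1, for phi in (0,1).
\begin{align*}
f < \phi:\quad
\frac{\partial e^{\alpha}}{\partial f}
  &= \mu_a - \mu_b + (\delta^+ - \delta^-)(1 - \phi) \\
  &= \phi\,\delta^+ + (1-\phi)\,\delta^- + (1-\phi)\,\delta^+ - (1-\phi)\,\delta^-
   = \delta^+ > 0, \\[4pt]
f > \phi:\quad
\frac{\partial e^{\alpha}}{\partial f}
  &= \mu_a - \mu_b - (\delta^+ - \delta^-)\,\phi \\
  &= \phi\,\delta^+ + (1-\phi)\,\delta^- - \phi\,\delta^+ + \phi\,\delta^-
   = \delta^- \le 0 .
\end{align*}
% Hence e^alpha increases on [0, phi] and does not increase on [phi, 1],
% so the maximum over f is attained at f = phi.
```

Under these assumptions, $e^{\alpha}$ rises with slope $\delta^+$ up to $f = \phi$ and then changes with slope $\delta^-$, which is why the optimum with costless intelligence coincides with the probability-matching frequency.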
Proposition 5 shows that, when $c(\gamma) \equiv 0$, more intelligence, i.e., a higher $\gamma$ value, is always desirable, except in situations in which behavior is completely deterministic, i.e., $f$ is equal to 0 or 1. This result makes sense when intelligence is costless, but to make the situation more realistic, we also consider the situation in which an intelligence level of $\gamma$ is associated with a cost $c(\gamma) \not\equiv 0$ of the type described in Section 3. The next proposition characterizes the values of $f$ and $\gamma$ that maximize $\alpha$ when there is no variation across generations and when there is such a cost to intelligence.

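Proposition 6 Under assumption A1, and under the further assumptions that there is no variation in outcome possibilities across generations and that there is a cost $c(\gamma)$ of intelligence of the type described in Section 3, the values of $f$ and $\gamma$ that maximize $\alpha$ are characterized in the following way. If $\mu_b > \mu_a$ and $\phi \in (0, 1)$, then

$$f^* = \begin{cases} \phi & \text{if } \gamma^* - c(\gamma^*) > \dfrac{\mu_b - \mu_a}{\delta^+ + \mu_b - \mu_a}, \\[8pt] \text{any value in } [0, \phi] & \text{if } \gamma^* - c(\gamma^*) = \dfrac{\mu_b - \mu_a}{\delta^+ + \mu_b - \mu_a}, \\[8pt] 0 & \text{if } \gamma^* - c(\gamma^*) < \dfrac{\mu_b - \mu_a}{\delta^+ + \mu_b - \mu_a}. \end{cases} \tag{22}$$

Here $\gamma^*$ is as defined in (11). If $\mu_b < \mu_a$ and $\phi \in (0, 1)$, then

$$f^* = \begin{cases} \phi & \text{if } \gamma^* - c(\gamma^*) > \dfrac{\mu_a - \mu_b}{\mu_a - \mu_b - \delta^-}, \\[8pt] \text{any value in } [\phi, 1] & \text{if } \gamma^* - c(\gamma^*) = \dfrac{\mu_a - \mu_b}{\mu_a - \mu_b - \delta^-}, \\[8pt] 1 & \text{if } \gamma^* - c(\gamma^*) < \dfrac{\mu_a - \mu_b}{\mu_a - \mu_b - \delta^-}. \end{cases} \tag{23}$$

If $\mu_a = \mu_b$ then $f^* = \phi$, and if $\phi \in \{0, 1\}$, then $f^* = \phi$. In all cases for which $f^* \notin \{0, 1\}$, the optimal choice of $\gamma$ is $\gamma^*$. If, however, $f^* \in \{0, 1\}$, then intelligence is unimportant, and the choice of $\gamma$ does not matter.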
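Proof. The proof follows from a straightforward analysis of the partial derivatives of $e^{\alpha}$ with respect to $\gamma$ and $f$. The derivative of $e^{\alpha}$ with respect to $\gamma$ is

$$\frac{\partial e^{\alpha}}{\partial \gamma} = \left( 1 - c'(\gamma) \right) (\delta^+ - \delta^-) \left( \min(f, \phi) - f \phi \right),$$

and for any value of $f \in (0, 1)$, this partial derivative is zero exactly when $\gamma = \gamma^*$. Thus, if the optimal value of $f$ is in the interior of the interval $[0, 1]$, the optimal value of $\gamma$ is $\gamma^*$.

When $\gamma = \gamma^*$, the partial derivative of $e^{\alpha}$ with respect to $f$ can be written

$$\frac{\partial e^{\alpha}}{\partial f} = \mu_a - \mu_b + \left( \gamma^* - c(\gamma^*) \right) (\delta^+ - \delta^-)(1 - \phi) \tag{24}$$

when $f < \phi$, and

$$\frac{\partial e^{\alpha}}{\partial f} = \mu_a - \mu_b - \left( \gamma^* - c(\gamma^*) \right) (\delta^+ - \delta^-)\, \phi \tag{25}$$

when $f > \phi$. When $\mu_b > \mu_a$ and $\phi \in (0, 1)$, the expression in (25) is always negative, and so $e^{\alpha}$ is decreasing in $f$ in the region $f \in [\phi, 1]$. Also, because $(\delta^+ - \delta^-)(1 - \phi) = \delta^+ + \mu_b - \mu_a$, the sign of (24) is positive, zero, or negative, according to whether $\gamma^* - c(\gamma^*)$ is larger than, equal to, or less than, respectively, the value of $\frac{\mu_b - \mu_a}{\delta^+ + \mu_b - \mu_a}$. In these three situations, the function $e^{\alpha}$ is increasing, constant, or decreasing, respectively, in the region $f \in [0, \phi]$. These observations lead directly to the results in (22).

The remaining results of the proposition follow from similar analysis of the partial derivative of $e^{\alpha}$ with respect to $f$ in the various cases described.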
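The threshold rule of Proposition 6 can be compared against a brute-force search. The sketch below assumes that $e^{\alpha} = f\mu_a + (1-f)\mu_b + (\gamma^* - c(\gamma^*))(\delta^+ - \delta^-)(\min(f,\phi) - f\phi)$ and that the thresholds are $(\mu_b - \mu_a)/(\delta^+ + \mu_b - \mu_a)$ and $(\mu_a - \mu_b)/(\mu_a - \mu_b - \delta^-)$; the function names, the argument `net_gamma` (standing for $\gamma^* - c(\gamma^*)$), and the numerical values are hypothetical. The inputs must satisfy $\mu_a - \mu_b = \phi\delta^+ + (1-\phi)\delta^-$.

```python
def growth_factor(f, mu_a, mu_b, d_plus, d_minus, phi, net_gamma):
    """Assumed form of e^alpha with net intelligence net_gamma = gamma* - c(gamma*)."""
    return f * mu_a + (1.0 - f) * mu_b + net_gamma * (d_plus - d_minus) * (min(f, phi) - f * phi)

def optimal_f_with_cost(mu_a, mu_b, d_plus, d_minus, phi, net_gamma):
    """Threshold rule sketched from Proposition 6 (ties return the deterministic endpoint)."""
    if phi in (0.0, 1.0) or mu_a == mu_b:
        return phi
    if mu_b > mu_a:
        threshold = (mu_b - mu_a) / (d_plus + mu_b - mu_a)
        return phi if net_gamma > threshold else 0.0
    threshold = (mu_a - mu_b) / (mu_a - mu_b - d_minus)
    return phi if net_gamma > threshold else 1.0

# Hypothetical parameters with mu_a - mu_b = phi*d_plus + (1 - phi)*d_minus = -0.5.
params = dict(mu_a=1.0, mu_b=1.5, d_plus=1.8, d_minus=-2.8, phi=0.5)
grid = [i / 100 for i in range(101)]

def grid_argmax(net_gamma):
    # Brute-force maximizer of e^alpha over a grid of f values.
    return max(grid, key=lambda f: growth_factor(f, net_gamma=net_gamma, **params))

cheap = (optimal_f_with_cost(net_gamma=0.5, **params), grid_argmax(0.5))
costly = (optimal_f_with_cost(net_gamma=0.1, **params), grid_argmax(0.1))
print(cheap, costly)
```

With these values the threshold is about 0.217, so a net benefit of 0.5 leads to $f^* = \phi$ while a net benefit of 0.1 leads to the deterministic choice $f^* = 0$, and the grid search agrees in both cases.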

Example of Optimal Choice with Intelligence
In this section, we provide an illustrative example of our model in the case of no variation across generations and the possibility of intelligence in behavior.
For purposes of our example, we assume a cost function of a particular type, namely

$$c(\gamma) = \kappa\, \frac{\gamma^2}{1 - \gamma}, \tag{26}$$

where $\kappa > 0$ is a parameter that can be chosen higher to indicate a greater cost to intelligence, or lower to indicate the reverse situation. This function $c(\gamma)$ can also be written

$$c(\gamma) = \kappa \sum_{k=2}^{\infty} \gamma^k,$$

and it is straightforward to check that it satisfies all of our requirements for a cost function for $\gamma \in [0, 1)$. Specifically, $\gamma - c(\gamma) > 0$ for small values of $\gamma$, and $\gamma - c(\gamma) < 0$ for values of $\gamma$ sufficiently close to 1. Also, $c(\gamma)$ is twice continuously differentiable, is increasing, and is convex. For this cost function, the value of $\gamma^*$ defined in (11) can be written

$$\gamma^* = 1 - \sqrt{\frac{\kappa}{1 + \kappa}}. \tag{27}$$

The values of $\gamma^*$ and $\gamma^* - c(\gamma^*)$ are plotted as functions of $\kappa$ in Figure 1.

Figure 1: Values of $\gamma^*$ and $\gamma^* - c(\gamma^*)$ as functions of $\kappa$, the cost of intelligence parameter in (26).
The result of Proposition 6 is illustrated in Figure 2. We assume that $\mu_b > \mu_a$, and we use the horizontal axis to indicate the size of the ratio $r = \delta^+ / (\mu_b - \mu_a)$. We use the vertical axis to indicate the value of $\kappa$. For any $r$ and $\kappa$ values, Proposition 6 can be used to determine the optimal $f$ value, namely $f^*$. The value of $f^*$ is either 0 or $\phi$, except when $\gamma^* - c(\gamma^*) = 1/(1 + r)$, and in this special case the value of $f^*$ may be anywhere between 0 and $\phi$. The deterministic value 0 is possible while the deterministic value 1 is not simply because we have assumed that $\mu_b > \mu_a$. As the figure indicates, a sufficiently high cost of intelligence, as indicated by a high $\kappa$ value, corresponds to the deterministic choice $f^* = 0$ and no use of intelligence. When intelligence has a low enough cost for a given ratio value, however, the optimal choice is $f^* = \phi$, which is the same frequency for $f$ as occurs in probability matching.

Figure 2: Values of $f^*$ for particular values of $\kappa$ and $r = \delta^+ / (\mu_b - \mu_a)$. The region toward the upper left corresponds to relatively costly intelligence and deterministic behavior of the form $f^* = 0$. The region toward the lower right corresponds to relatively cheap intelligence and probability matching of the form $f^* = \phi$. On the line between the two large regions, any value for $f^*$ between 0 and $\phi$ is optimal.
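The two regions described for Figure 2 can be sketched in a few lines. The code below assumes the cost function $c(\gamma) = \kappa\gamma^2/(1-\gamma)$ and the closed form $\gamma^* = 1 - \sqrt{\kappa/(1+\kappa)}$, and it classifies a pair $(\kappa, r)$ by comparing $\gamma^* - c(\gamma^*)$ with $1/(1+r)$; the function names and the sample values of $\kappa$ and $r$ are hypothetical and are not read off the figure.

```python
import math

def cost(gamma, kappa):
    """Cost of intelligence c(gamma) = kappa * gamma^2 / (1 - gamma), for gamma in [0, 1)."""
    return kappa * gamma ** 2 / (1.0 - gamma)

def gamma_star(kappa):
    """Closed-form maximizer of gamma - c(gamma) for the cost function above."""
    return 1.0 - math.sqrt(kappa / (1.0 + kappa))

def net_benefit(kappa):
    """gamma* - c(gamma*), the net value of intelligence at the optimum."""
    g = gamma_star(kappa)
    return g - cost(g, kappa)

def regime(kappa, r):
    """Classify (kappa, r) for the case mu_b > mu_a, with r = delta_plus / (mu_b - mu_a)."""
    threshold = 1.0 / (1.0 + r)
    nb = net_benefit(kappa)
    if nb > threshold:
        return "probability matching (f* = phi)"
    if nb < threshold:
        return "deterministic (f* = 0)"
    return "boundary (any f* in [0, phi])"

kappa = 0.5
g = gamma_star(kappa)
h = 1e-6
# Central-difference estimate of c'(gamma*), which should be close to 1.
marginal_cost = (cost(g + h, kappa) - cost(g - h, kappa)) / (2.0 * h)
print(g, net_benefit(kappa), marginal_cost)
print(regime(kappa, r=1.0), regime(kappa, r=5.0))
```

For a fixed $\kappa$, raising $r$ lowers the threshold $1/(1+r)$, so the same cost of intelligence can fall in the deterministic region for small $r$ and in the probability-matching region for large $r$.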

Intelligence and Variation Across Generations
The final case we consider is the one in which individuals may be intelligent and in which there may be variation in outcomes over time. The following proposition describes the nature of the optimal choice of $f$ and $\gamma$ in this setting under certain assumptions.