## Abstract

Animals with rudimentary innate abilities require substantial learning to transform those abilities into useful skills, where a skill can be considered as a set of sensory–motor associations. Using linear neural network models, it is proved that if skills are stored as distributed representations, then within-lifetime learning of part of a skill can induce automatic learning of the remaining parts of that skill. More importantly, it is shown that this “free-lunch” learning (FLL) is responsible for accelerated evolution of skills, when compared with networks which either 1) cannot benefit from FLL or 2) cannot learn. Specifically, it is shown that FLL accelerates the appearance of adaptive behaviour, both in its innate form and as FLL-induced behaviour, and that FLL can accelerate the rate at which learned behaviours become innate.

## Author Summary

Some behaviours are purely innate (e.g., blinking), whereas other, “apparently innate,” behaviours require a degree of learning to refine them into a useful skill (e.g., nest building). In terms of biological fitness, it matters how quickly such learning occurs, because time spent learning is time spent not eating, or time spent being eaten, both of which reduce fitness. Using artificial neural networks as model organisms, it is proven that it is possible for an organism to be born with a set of “primed” connections which guarantee that learning part of a skill induces automatic learning of other skill components, an effect known as free-lunch learning (FLL). Critically, this effect depends on the assumption that associations are stored as distributed representations. Using a genetic algorithm, it is shown that primed organisms can evolve within 30 generations. This has three important consequences. First, primed organisms learn quickly, which increases their fitness. Second, the presence of FLL effectively accelerates the rate of evolution, for both learned and innate skill components. Third, FLL can accelerate the rate at which learned behaviours become innate. These findings suggest that species may depend on the presence of distributed representations to ensure rapid evolution of adaptive behaviours.

**Citation:** Stone JV (2007) Distributed Representations Accelerate Evolution of Adaptive Behaviours. PLoS Comput Biol 3(8): e147. doi:10.1371/journal.pcbi.0030147

**Editor:** Karl J. Friston, University College London, United Kingdom

**Received:** January 15, 2007; **Accepted:** June 11, 2007; **Published:** August 3, 2007

**Copyright:** © 2007 James V. Stone. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding:** The author received no specific funding for this study.

**Competing interests:** The author has declared that no competing interests exist.

**Abbreviations:** FLL, free-lunch learning

## Introduction

Both evolution and learning may be considered as different types of adaptation. Learning occurs within a lifetime, whereas genetic change occurs across lifetimes [1]. Whereas genetic change ensures that a task can be executed innately, learning permits even the most rudimentary innate ability to be honed into a useful skill.

In an environment that fluctuates from generation to generation, learning permits an innate ability to be adapted to the particular physical environment into which each generation is born. If the environment ceases to fluctuate, then genetic assimilation [2] can transform a rudimentary innate ability, which requires much learning, into an innate skill, which requires minimal learning. This transformation is more likely to occur if the cost of learning is high [3,4], and, in this case, computer simulations suggest that learning can accelerate the rate of genetic assimilation [5] via the Baldwin effect [6]. However, if learning is sufficiently inexpensive, then genetic change may not occur at all [7,8]. Overall, there appears to be a tradeoff between learning and genetic assimilation, such that learning can subsidize genetic change, especially if learning is inexpensive.

All but the most primitive organisms learn in order to survive, and organisms which learn quickly are at a selective advantage relative to those that learn slowly. Therefore, a mechanism which reduces the time required to learn a given behaviour confers a selective advantage. One candidate for such a mechanism is FLL [9,10].

As explained below, FLL ensures that, in the process of learning one set of associations or behaviours, another set of associations is usually learned. These associations could comprise either perceptual skills (such as face recognition, predator recognition [11], or prey recognition), or motor skills (such as catching prey, flying, seed pecking, or nest building).

### Free-Lunch Learning

Before considering how FLL can accelerate evolution of certain types of behaviours, FLL will be described in its original context of spontaneous recovery of memory in humans [9] and in neural network models [10]. Note that FLL is not unique to a specific class of network architectures, although it does assume that associations are learned using a form of supervised learning.

In humans, FLL has been demonstrated using a task in which participants learned the positions of letters on a nonstandard computer keyboard [9]. After a period of forgetting, participants relearned a proportion of these letter positions. Crucially, it was found that this relearning induced recovery of the non-relearned letter positions.

More recently, a set of theorems provided a formal characterization of FLL in linear neural network models [10]. In essence, FLL occurs in neural network models because each association is distributed amongst all connection weights (synapses) between units (model neurons). After partial forgetting, relearning some of the associations forces all of the weights closer to pre-forgetting values, resulting in improved performance even on non-relearned associations; a general proof is provided in [10]. A geometric demonstration of FLL for a network with two connection weights is given in Figure 1. Networks with multiple input and output units can be considered without loss of generality [10].
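The two-weight geometry of Figure 1 is easy to reproduce numerically. The sketch below (not from the paper; the input vectors, noise radius, and trial count are illustrative assumptions) places the weight vector at random points on the circle *C* and checks how often relearning *A*_{2} reduces error on *A*_{1}:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two associations in a two-weight linear network (as in Figure 1).
# Each association i maps a unit-length input x_i to a target d_i.
x1 = np.array([1.0, 0.0])
x2 = np.array([np.cos(1.0), np.sin(1.0)])  # deliberately not orthogonal to x1
w0 = np.array([0.7, -0.3])                 # weights satisfying both constraints
d1, d2 = x1 @ w0, x2 @ w0                  # targets consistent with w0

def relearn_A2(w):
    """Orthogonal projection of w onto the constraint line L2 = {w : w.x2 = d2}."""
    return w + (d2 - w @ x2) * x2 / (x2 @ x2)

def error_A1(w):
    """Performance error on association A1."""
    return (d1 - w @ x1) ** 2

# Partial forgetting: move w0 to a random point w1 on a circle of radius r.
r, trials, fll = 0.5, 10000, 0
for _ in range(trials):
    theta = rng.uniform(0.0, 2.0 * np.pi)
    w1 = w0 + r * np.array([np.cos(theta), np.sin(theta)])
    w2 = relearn_A2(w1)
    fll += error_A1(w2) < error_A1(w1)     # did relearning A2 help A1?

print(fll / trials)  # estimated probability of FLL for this configuration (> 0.5)
```

Because the two constraint lines are not orthogonal here, the estimated probability of FLL exceeds 1/2, matching the larger dashed arc *C*_{FLL} in Figure 1B.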

**Figure 1.** (A) Given a network with two input units and one output unit, its connection weights *w*_{a} and *w*_{b} define a weight vector **w** = (*w*_{a}, *w*_{b}). The network learns two subsets *A*_{1} and *A*_{2}. In this example, each subset is a single association, where *A*_{1} (for example) is the mapping from input vector **x**_{1} = (*x*_{11}, *x*_{12}) to desired output value *d*_{1}, and learning *A*_{1} consists of adjusting **w** until the network output *y*_{1} = **w** · **x**_{1} equals *d*_{1}.

(B) Each association *A*_{1} and *A*_{2} defines a constraint line *L*_{1} and *L*_{2}, respectively. The intersection of *L*_{1} and *L*_{2} defines a point **w**_{0} which satisfies both constraints, so that zero performance error on *A* = *A*_{1} ∪ *A*_{2} is obtained if **w** = **w**_{0}. After partial forgetting, the weight vector is a randomly chosen point **w**_{1} on the circle *C* of radius *r*, and the performance error on *A*_{1} is given by the squared distance *p*^{2}. After relearning *A*_{2}, the weight vector **w**_{2} lies in *L*_{2}, and performance error on *A*_{1} is the squared distance *q*^{2}. FLL occurs if performance error on *A*_{1} is decreased by relearning *A*_{2}, or equivalently if *p*^{2} > *q*^{2}. Relearning *A*_{2} has three possible effects, depending on the position of **w**_{1} on *C*: 1) if **w**_{1} is under the larger (dashed) arc *C*_{FLL} (as shown here), then *p*^{2} > *q*^{2}, and FLL is observed; 2) if **w**_{1} is under the smaller (dotted) arc, then *p*^{2} < *q*^{2}, and negative FLL is observed; 3) if **w**_{1} is at the critical point **w**_{crit}, then *p*^{2} = *q*^{2}, and no FLL is observed. Given that **w**_{1} is a randomly chosen point on the circle *C*, the probability of FLL is equal to the proportion of the upper semicircle under the (dashed) arc *C*_{FLL}. Symmetry implies that the above analysis also applies to the lower half of *C*. In terms of evolution, a network “born” with **w** = **w**^{*} has zero error on both *A*_{1} and *A*_{2} after learning only *A*_{2} (see text).

The protocol used to examine FLL in neural networks is as follows (see Figure 2). A network with *n* input units and one output unit has *n* connection weights. This network learns a set *A* of *m* ≤ *n* associations, where *A* = *A*_{1} ∪ *A*_{2} comprises two subsets *A*_{1} and *A*_{2} of *n*_{1} and *n*_{2} associations, respectively (note that *m* = *n*_{1} + *n*_{2}). After the associations *A* have been learned and then partially forgotten, performance error on subset *A*_{1} is measured (forgetting is induced by adding isotropic noise to the weights). Finally, *only A*_{2} is relearned, and then performance error on *A*_{1} is remeasured. FLL occurs if relearning *A*_{2} improves performance on *A*_{1}. It has been proven that the probability of FLL approaches unity as the number of weights increases [10]. For the sake of brevity, this is reflected in phrases such as “learning *A*_{2} *usually* improves performance on *A*_{1}” in this paper.
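The protocol can also be simulated directly for a network with many weights. In this sketch (an illustration with hypothetical parameter choices, not the paper's code), learning is implemented as a minimum-norm projection onto the solution subspace, and forgetting as isotropic weight noise:

```python
import numpy as np

rng = np.random.default_rng(1)
n, n1, n2 = 20, 10, 10          # weights, |A1|, |A2|

def project(w, X, d):
    """Learn the associations (X, d) by orthogonally projecting w onto
    the solution subspace {w : X w = d} (minimum-norm weight change)."""
    return w + np.linalg.pinv(X) @ (d - X @ w)

trials, fll = 200, 0
for _ in range(trials):
    X = rng.standard_normal((n1 + n2, n)) / np.sqrt(n)  # input vectors (rows)
    w_true = rng.standard_normal(n)                     # generates consistent targets
    d = X @ w_true
    X1, d1 = X[:n1], d[:n1]                             # subset A1
    X2, d2 = X[n1:], d[n1:]                             # subset A2

    w = project(rng.standard_normal(n), X, d)           # learn all of A
    w = w + 0.5 * rng.standard_normal(n)                # partial forgetting
    e_pre = np.sum((d1 - X1 @ w) ** 2)                  # error on A1 before relearning
    w = project(w, X2, d2)                              # relearn only A2
    e_post = np.sum((d1 - X1 @ w) ** 2)                 # error on A1 after relearning
    fll += e_post < e_pre

print(fll / trials)  # fraction of trials exhibiting FLL (well above 0.5)
```

With *n* = 20 weights, most trials exhibit FLL, consistent with the result that the probability of FLL grows with the number of weights [10].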

**Figure 2.** (A) FLL within the single lifetime of a neural network model. Two subsets of associations *A*_{1} and *A*_{2} are learned. After partial forgetting (see text), performance error on subset *A*_{1} is measured. Subset *A*_{2} is then relearned, and performance error on subset *A*_{1} is measured again. If performance error on *A*_{1} decreases as a result of learning *A*_{2}, then FLL has occurred.

(B) FLL across generations. Using a genetic algorithm, a network is born with connections similar to those of the network in (A) after it has learned and then partially forgotten subsets *A*_{1} and *A*_{2}. Consequently, innate performance error on *A*_{1} is similar to that of the network in (A) after it has partially forgotten both subsets. After measuring performance error on *A*_{1} at birth, subset *A*_{2} is learned for the first time, and performance error on subset *A*_{1} is measured again. After many generations, the innate connection values of each network ensure that if subset *A*_{2} is learned *for the first time,* then this induces automatic learning of subset *A*_{1}.

### FLL and Evolution

Now consider an organism *b*_{2} which is born with a genetically specified set of neuronal connections [12]. These connections are organised such that, if *b*_{2} learns one subset *A*_{2} of associations, then another subset *A*_{1} is usually learned as well. In other words, *the organism b*_{2} *happens to be born with neuronal connections similar to the connections of an organism b*_{1} *which had once learned and then forgotten subsets A*_{1} *and A*_{2} (e.g., isotropically distributed around **w**_{0} in Figure 1). Just as FLL ensures that if organism *b*_{1} relearns *A*_{2} then subset *A*_{1} is usually relearned (see Figure 2), so *if b*_{2} *learns A*_{2} *then A*_{1} *is usually learned*. In both cases, FLL ensures that learning one subset of associations induces learning of the other subset. Critically, whereas the FLL exhibited by organism *b*_{1} depends on previous learning and forgetting, FLL in organism *b*_{2} depends on being born in a state such that the first time *A*_{2} is learned, the associations in *A*_{1} are also usually acquired. Such a network can be evolved using a genetic algorithm, as shown below.

The use of two distinct subsets in this paper is clearly unrealistic when considered in the context of skill learning. However, the use of two subsets lies at one extreme along a continuum of tasks. At one extreme, associations are learned one by one in a strict order, and at the other extreme, all associations are learned simultaneously. In a biological context, the components of a skill which are learned first act as a “scaffold” for the others, and this effectively imposes a temporal order on the acquisition of different skill components. This is the type of scenario assumed for the simulations reported in this paper. Essentially, *A*_{2} is assumed to consist of a subset of skill components which provide a scaffold for the skill components in *A*_{1}.

### The Geometry of FLL

This section is a brief account of the basic geometry underlying FLL, in the absence of its interactions with evolution. For the present, and without loss of generality (see [10]), we assume that the network has one output unit and two input units, which implies *n* = 2 connection weights, and that *A*_{1} and *A*_{2} each consist of *n*_{1} = *n*_{2} = 1 association, as in Figure 1. Input units are connected to the output unit via weights *w*_{a} and *w*_{b}, which define a weight vector **w** = (*w*_{a}, *w*_{b}). Associations *A*_{1} and *A*_{2} consist of different mappings from the input vectors **x**_{1} = (*x*_{11}, *x*_{12}) and **x**_{2} = (*x*_{21}, *x*_{22}) to desired output values *d*_{1} and *d*_{2}, respectively. If a network is presented with input vectors **x**_{1} and **x**_{2}, then its output values are *y*_{1} = **w** · **x**_{1} = *w*_{a}*x*_{11} + *w*_{b}*x*_{12} and *y*_{2} = **w** · **x**_{2} = *w*_{a}*x*_{21} + *w*_{b}*x*_{22}, respectively. More generally, network performance error for *k* associations is defined as

*E*(**w**, *A*) = ∑_{i=1}^{k} (*d*_{i} − *y*_{i})^{2}.   (Equation 1)

The weight vector **w** defines a point in the (*w*_{a}, *w*_{b}) plane. For an input vector **x**_{1}, there are many different combinations of weight values *w*_{a} and *w*_{b} that give the desired output *d*_{1}. These combinations lie on a straight line *L*_{1}, because the network output is a linear weighted sum of input values. A corresponding constraint line *L*_{2} exists for *A*_{2}. The intersection of *L*_{1} and *L*_{2} therefore defines the only point **w**_{0} that satisfies both constraints, so that zero error on *A*_{1} and *A*_{2} is obtained if and only if **w** = **w**_{0}. Without loss of generality, we define the origin to be the intersection **w**_{0} of *L*_{1} and *L*_{2}. A general prerequisite for FLL is that *L*_{1} is not orthogonal to *L*_{2}.

We now consider the geometric effect of partial forgetting of both associations, followed by relearning *A*_{2}. This geometric account applies to a network with two weights (see Figure 1), and depends on the following observation: if the length of the input vector **x**_{1} is unity, then the performance error *E*(**w**,*A*_{1}) = (*d*_{1} − *y*_{1})^{2} of a network with weight vector **w** when tested on association *A*_{1} is equal to the squared distance between **w** and the constraint line *L*_{1} [10]. For example, if **w** is in *L*_{1}, then *E*(**w**,*A*_{1}) = 0, but as the distance between **w** and *L*_{1} increases so *E*(**w**,*A*_{1}) must increase. For the purposes of this geometric account, we assume the length of the input vectors is unity.
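This identity between squared output error and squared distance to the constraint line is easy to verify numerically (a minimal check with arbitrary illustrative values, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

# A unit-length input vector and a target define the constraint line L1.
x1 = rng.standard_normal(2)
x1 /= np.linalg.norm(x1)              # |x1| = 1, as assumed in the text
d1 = 0.8

w = rng.standard_normal(2)            # an arbitrary weight vector
error = (d1 - w @ x1) ** 2            # performance error E(w, A1)

# Squared distance from w to L1 = {w : w.x1 = d1}, via orthogonal projection.
w_proj = w + (d1 - w @ x1) * x1       # nearest point on L1
sq_dist = np.sum((w - w_proj) ** 2)

print(np.isclose(error, sq_dist))     # True: error equals squared distance
```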

If partial forgetting is induced by adding isotropic noise **g** to the weight vector **w** = **w**_{0}, then this effectively moves **w** to a randomly chosen point **w**_{1} = **w**_{0} + **g** on the circle *C* of radius *r*, where *r* is the length of **g** and represents the amount of forgetting. For a network with **w** = **w**_{1}, learning *A*_{2} moves **w** to the nearest point **w**_{2} on *L*_{2} [10], so that **w**_{2} is the orthogonal projection of **w**_{1} on *L*_{2}. Before relearning *A*_{2}, the performance error *E*(**w**, *A*_{1}) on *A*_{1} is the squared distance *p*^{2} between **w**_{1} and its orthogonal projection on *L*_{1}. After relearning *A*_{2}, the performance error *E*_{post} is the squared distance *q*^{2} between **w**_{2} and its orthogonal projection on *L*_{1}. The amount of FLL is *δ* = *E*(**w**_{1}, *A*_{1}) − *E*(**w**_{2}, *A*_{1}), and (for a network with two weights) is also given by *Q* = *p*^{2} − *q*^{2}. The probability *P*(*δ* > 0) of FLL given *L*_{1} and *L*_{2} is equal to the proportion of points on *C* for which *δ* > 0 (or, equivalently, for which *Q* > 0), and the mean value of this proportion over positions of **w**_{1} defines *P*(*δ* > 0) for a two-weight network like the one shown in Figure 1A. Given the particular configuration shown in Figure 1A, the critical point **w**_{crit} is defined such that the performance error before and after learning is the same (i.e., *δ* = 0).

### FLL Induces Perfect Performance

Given a network with *n* weights **w** and two subsets *A*_{1} and *A*_{2} of *n*_{1} and *n*_{2} associations, respectively, it is shown that weights **w**^{*} exist such that learning associations *A*_{2} is guaranteed to yield zero performance error on *A*_{1}, provided *n* ≥ *n*_{1} + *n*_{2}.

Consider a network with *n* = 2 weights and subsets *A*_{1} and *A*_{2}, each of which comprises a single association. Each association defines a constraint line *L*_{1} and *L*_{2}, respectively (see Figure 1). If the weight vector **w** is in *L*_{1}, then performance error on *A*_{1} is zero, and if **w** is in *L*_{2}, then performance error on *A*_{2} is zero. Clearly, if and only if **w** is at the intersection **w**_{0} of *L*_{1} and *L*_{2}, then performance error on both *A*_{1} and *A*_{2} is zero. If **w** is not in *L*_{2}, then learning *A*_{2} moves **w** from its current position to its orthogonal projection onto *L*_{2} [10]. Crucially, if **w** = **w**^{*} in Figure 1, then learning *A*_{2} moves **w** to the optimal weight vector **w**_{0}. In this case, learning *A*_{2} reduces performance error to zero on both *A*_{1} and *A*_{2}, and therefore learning *A*_{2} implies perfect performance on *A*_{1}.

This line of reasoning generalises to networks with more than two weights, as follows. If a network has more than two input units, then subsets *A*_{1} and *A*_{2} can have *n*_{1} > 1 and *n*_{2} > 1 associations. If *n* ≥ *n*_{1} + *n*_{2}, then *A*_{1} and *A*_{2} define an (*n* − *n*_{1})-dimensional subspace *L*_{1} and an (*n* − *n*_{2})-dimensional subspace *L*_{2}, respectively. The intersection *L*_{12} of *L*_{1} and *L*_{2} corresponds to weight vectors which generate zero error on *A* = *A*_{1} ∪ *A*_{2}. In this case, the circle in Figure 1 corresponds to an *n*-dimensional hypersphere, with its centre **w**_{0} in *L*_{12}. Given that learning *A*_{2} provides an orthogonal projection of **w** onto *L*_{2}, and that there exists a **w** = **w**^{*} such that its orthogonal projection onto *L*_{2} is **w**_{0}, it follows that learning *A*_{2} in a network with **w** = **w**^{*} yields zero performance error on both *A*_{2} and *A*_{1}.
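The existence of such primed weights **w**^{*} can be checked numerically: any **w**^{*} = **w**_{0} + **u** with **u** in the row space of the *A*_{2} input vectors projects straight back to **w**_{0} when *A*_{2} is learned. A sketch (an illustration, with dimensions chosen to match the simulations below):

```python
import numpy as np

rng = np.random.default_rng(3)
n, n1, n2 = 20, 10, 10

X = rng.standard_normal((n1 + n2, n)) / np.sqrt(n)  # input vectors (rows)
w0 = rng.standard_normal(n)                 # optimal weights: zero error on A
d = X @ w0
X1, d1 = X[:n1], d[:n1]                     # subset A1
X2, d2 = X[n1:], d[n1:]                     # subset A2

# A "primed" weight vector w* = w0 + u, with u in the row space of X2.
u = X2.T @ rng.standard_normal(n2)
w_star = w0 + u

def learn(w, X, d):
    """Orthogonal projection of w onto the solution subspace {w : X w = d}."""
    return w + np.linalg.pinv(X) @ (d - X @ w)

w_after = learn(w_star, X2, d2)             # learn only A2
print(np.allclose(w_after, w0))             # True: learning A2 lands on w0
print(np.allclose(X1 @ w_after, d1))        # True: zero error on the unlearned A1
```

Because the correction made by learning *A*_{2} always lies in the row space of the *A*_{2} inputs, it exactly cancels **u**, leaving **w**_{0} and hence zero error on both subsets.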

Given that the weight vector **w** is genetically specified with finite precision, a network is necessarily born with its weight vector **w** = **w**_{1} at a non-zero distance *r* from the optimal weight vector **w**_{0}. This finite precision defines a hypersphere *C* around **w**_{0}, and the location of **w**_{1} on *C* determines the amount of FLL. If a network is born with **w**_{1} = **w**^{*}, then learning *A*_{2} induces perfect performance on *A*_{1}. If fitness depends on performance on both *A*_{1} and *A*_{2} after learning only *A*_{2}, then there is selective pressure for networks to be born with weight vectors close to **w**^{*}, given a specific degree of genetic precision *r*. More generally, there is pressure for networks to be born with weight vectors close to the subspace which contains **w**_{0} and **w**^{*}.

### Terminology: Evolution, Innateness, and FLL

As this paper deals with subtle combinations of evolution and learning, involving two distinct subsets of associations (*A*_{1} and *A*_{2}), it is important to be clear about terminology. Specifically, we need to be careful about which subset is being referred to, and whether we are referring to innate performance or not. Accordingly, performance error on *A*_{1} before learning *A*_{2} is called just that, “innate performance error on *A*_{1}.” Behaviours that are induced by learning *A*_{2} are called *FLL-induced* behaviours, because they are not innate, nor are they learned, so that performance error on *A*_{1} after learning *A*_{2} is called “FLL-induced performance error on *A*_{1}.” If learning *A*_{2} does not affect performance on *A*_{1} (as in condition NoFLL, below), then this is referred to as “post-learning performance.” More generally, performance will be quoted with a specified context (e.g., performance on *A*_{1} after learning *A*_{2}).

## Methods

The effect of FLL on evolution was tested by measuring performance on *A*_{1} after learning *A*_{2} across generations. To eliminate the possibility that the observed results are artefacts, the effects of FLL were compared with two control conditions (described below).

Each generation consisted of 1,000 neural networks, each with 20 input units and one output unit. The genome of each network was defined by a one-to-one mapping of the *n* = 20 weight values in the network to a single string of *n* genes, where the value of each gene was set to the value of a corresponding network weight. The number of offspring generated by each network was proportional to its fitness, which depended only on its ability to provide the correct desired output value for each of 20 *n*-element input vectors. The mapping from each input vector to its output value defines one association (see Figure 1).

A network's output *y*_{i} is a weighted sum of input values, *y*_{i} = **w** · **x**_{i} = ∑_{j=1}^{n} *w*_{j}*x*_{ij}, where *x*_{ij} is the *j*^{th} value of the *i*^{th} input vector **x**_{i}, and each weight *w*_{j} is one input–output connection.

The fitness of each network was assessed with respect to its performance error on a single common set *A* = *A*_{1} ∪ *A*_{2} of *m* = 20 associations, where *A*_{1} and *A*_{2} are two disjoint subsets of *n*_{1} = 10 and *n*_{2} = 10 associations, respectively. The *m* associations in *A* were allocated randomly to the two subsets, *A*_{1} and *A*_{2}. The subsets *A*_{1} and *A*_{2} were intended to represent different components of a task, and were therefore the same *for all networks,* and *across all generations*. In the first generation, each network's weight values were chosen from a Gaussian distribution (see below for details).

The desired output value *d*_{i} for each input vector **x**_{i} was drawn from a Gaussian distribution with variance 1/*n*. An analytic method (see below) was used to solve for the optimal weight vector **w**_{0} which maps inputs to outputs: *d*_{i} = **w**_{0} · **x**_{i}. Given the variances of the inputs and outputs, the expected length of **w**_{0} is unity.

Each new generation was formed from 1,000 matings between 1,000 pairs of networks. The *K*^{th} network was chosen for mating according to its fitness *F*(*K*), with probability

*p*(*K*) = *F*(*K*) / ∑_{k} *F*(*k*),   (Equation 2)

where the denominator ensures that ∑_{k} *p*(*k*) = 1. Networks were chosen with replacement to ensure that the expected number of offspring from a given network was proportional to *p*(*K*). Half the weights of each offspring were copied from (randomly chosen) corresponding weight locations in one parent network, and half from the other parent. Aside from mutations, weight values inherited by an offspring were the same as those inherited by its parents (i.e., inheritance was Darwinian, not Lamarckian).

Each weight was selected for mutation independently with probability 0.05, using a uniform probability density function. Gaussian noise with a standard deviation of 0.05 was then added to the value of each weight chosen for mutation.
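The selection, crossover, and mutation operators described above can be sketched as follows (a minimal illustration; the fitness values used here are placeholders rather than the fitness functions defined below):

```python
import numpy as np

rng = np.random.default_rng(4)
pop_size, n = 1000, 20

def next_generation(population, fitness):
    """One generation: fitness-proportional mating (Equation 2), half-and-half
    crossover of weight locations, and sparse Gaussian mutation."""
    p = fitness / fitness.sum()                   # p(K) = F(K) / sum_k F(k)
    children = np.empty_like(population)
    for i in range(len(population)):
        # two parents chosen with replacement, proportional to fitness
        pa, pb = rng.choice(len(population), size=2, p=p)
        mask = np.zeros(n, dtype=bool)
        mask[rng.permutation(n)[: n // 2]] = True  # half the weight locations
        children[i] = np.where(mask, population[pa], population[pb])
    # mutate each weight with probability 0.05 (Gaussian noise, sd 0.05)
    mutate = rng.random(children.shape) < 0.05
    children[mutate] += rng.normal(0.0, 0.05, size=mutate.sum())
    return children

population = rng.standard_normal((pop_size, n)) / np.sqrt(n)
fitness = rng.random(pop_size)                    # placeholder fitness values
population = next_generation(population, fitness)
print(population.shape)  # (1000, 20)
```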

There were three conditions, *FLL*, *NoFLL*, and *NoLearn*, with corresponding fitness functions *F*_{FLL}, *F*_{NoFLL}, and *F*_{NoLearn}. The initial randomly chosen weight values (see Network Learning Algorithm) of the population of networks were the same in all conditions. In every condition, networks were selected for mating according to their performance on the combined set *A* = *A*_{1} ∪ *A*_{2}, using Equation 2, as described next.

### Condition FLL

Networks that exhibited high levels of FLL were preferentially selected for mating. Only associations *A*_{2} were learned, but the fitness of each network depended on its performance on both the learned associations *A*_{2} and on the unlearned associations *A*_{1}. The fitness *F*_{FLL}(*K*) of the *K*^{th} network is defined (Equation 3) in terms of its innate performance error *E*_{pre} on *A* = *A*_{1} ∪ *A*_{2}, and of its performance error *E*_{post} on *A* after learning *A*_{2} (Equation 4), where *E*_{pre} and *E*_{post} are sums of fitness errors *D*_{i} over the network's responses to the input vectors before and after learning *A*_{2}, respectively. The parameter *c* = 0.05 defines the balance between performance error on innate versus post-learning (e.g., FLL-induced) behaviours, and is interpreted as a cost-of-learning parameter (see below). The fitness error *D*_{i} is a function of the difference *e*_{i} = *y*_{i} − *d*_{i} between the network's response *y*_{i} to the *i*^{th} input vector and the desired output value *d*_{i}. This function ensures that output errors above *D*_{thresh} have a disproportionately large and detrimental effect on fitness, as shown in Figure 3. This, in turn, ensures that only those networks with “good” performance are likely to be selected for reproduction. The value of *D*_{thresh} was set to 0.01.

**Figure 3.** The response of the network to a given input vector **x** is *y* = **w** · **x**. Given a desired (target) output *d*, the solid curve shows how the fitness penalty *D* for an incorrect response increases sharply (to unity) if the magnitude of the difference *e* = *y* − *d* is greater than 0.1 (i.e., if *e*^{2} > 0.01). For comparison, the quadratic error function *e*^{2} = (*y* − *d*)^{2}, which is minimised during learning, is shown as a dashed curve. The range of *e*-values shown is typical for the simulations reported here. See Methods section for details.

For later use, we also define the fitness errors *E*_{pre} = *E*_{pre}(*A*_{1}) + *E*_{pre}(*A*_{2}) and *E*_{post} = *E*_{post}(*A*_{1}) + *E*_{post}(*A*_{2}) ≈ *E*_{post}(*A*_{1}). The approximation here and in Equation 5 emphasises the fact that, after learning *A*_{2}, the total fitness error is attributable almost exclusively to *A*_{1} (because error on *A*_{2} is then almost zero).

The inclusion of innate performance error *E*_{pre} in *F*_{FLL} ensures that the cost of learning is taken into account when assessing fitness. If *c* is small, then *E*_{pre} tends to be large, so that much learning is required to increase fitness. Conversely, if *c* is large, then *E*_{pre} tends to be small, so that little learning is required to increase fitness. Thus, *E*_{pre} is an implicit measure of the cost of learning (where *c* multiplies *E*_{pre}), and ensures that networks which require minimal learning have high innate fitness (although the small value of *c* used in Equation 3 defines a relatively low cost of learning).

### Condition NoFLL

This was identical to condition FLL, except that the effects of FLL were precluded by making all input vectors in *A* mutually orthogonal (using Gram–Schmidt orthogonalisation), whilst retaining the length of each vector as in condition FLL. This makes the input vectors in *A*_{1} orthogonal to those in *A*_{2}, which ensures that *L*_{1} and *L*_{2} are orthogonal. This, in turn, ensures that learning *A*_{2} cannot affect performance on *A*_{1}. The fitness function *F*_{NoFLL} was the same as in condition FLL (i.e., *F*_{NoFLL} = *F*_{FLL}).
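The effect of orthogonalisation can be verified numerically: once the *A*_{1} inputs are orthogonal to the *A*_{2} inputs, the weight change produced by learning *A*_{2} lies entirely in the span of the *A*_{2} inputs and so cannot alter error on *A*_{1}. A sketch (QR decomposition stands in for Gram–Schmidt; the parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n, n1, n2 = 20, 10, 10

# Random input vectors, then Gram-Schmidt (via QR) to make them mutually
# orthogonal while restoring each vector's original length.
X = rng.standard_normal((n1 + n2, n)) / np.sqrt(n)
lengths = np.linalg.norm(X, axis=1)
Q, _ = np.linalg.qr(X.T)                 # orthonormal basis from the rows of X
X_orth = (Q[:, : n1 + n2] * lengths).T   # mutually orthogonal input vectors

w0 = rng.standard_normal(n)
d = X_orth @ w0
X1, d1 = X_orth[:n1], d[:n1]             # subset A1
X2, d2 = X_orth[n1:], d[n1:]             # subset A2

w = w0 + 0.5 * rng.standard_normal(n)    # partial forgetting
e_pre = np.sum((d1 - X1 @ w) ** 2)
w = w + np.linalg.pinv(X2) @ (d2 - X2 @ w)  # learn A2 (orthogonal projection)
e_post = np.sum((d1 - X1 @ w) ** 2)

print(np.isclose(e_pre, e_post))         # True: A1 error is unchanged
```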

### Condition NoLearn

No learning occurred, so that improvement in performance over successive generations was due only to selection on innate performance on *A*_{1} and *A*_{2}. Fitness was defined as *F*_{NoLearn} = 1/*E*_{pre}, where *E*_{pre} is defined in Equation 4. This is equivalent to setting *c* = 1 in Equation 3.

### Network Learning Algorithm

The network learning algorithm used here involves a type of supervised learning. Note that Equation 1 defines the network error used for learning, whereas Equation 3 defines the fitness of a network.

Each network was initialised with *n* weight values drawn randomly from a Gaussian distribution with unit variance. Each weight value was then divided by *n*^{1/2}, which ensures that the expected length of weight vectors in the population is unity.

Given a network with *n* input units and one output unit, the set *A* of *m* *n*-element input vectors **x**_{i} (*i* = 1…*m*) and the *m* desired scalar output target values *d*_{i} were chosen randomly from a Gaussian distribution with unit variance. Each input vector was then divided by *n*^{1/2} so that the expected length of input vectors was unity (i.e., the variance of input values was 1/*n*).

In conditions FLL and NoFLL, each network learned the *n*_{2} associations in *A*_{2}. Rather than using the iterative weight update normally associated with the delta rule, an analytic solution was obtained. Learning *n*_{2} associations consists of finding the orthogonal projection operator which projects the initial weight vector **w**_{1} to its nearest point in the subspace (e.g., *L*_{2}) defined by the *n*_{2} input vectors being learned. The end result **w**_{2} is the same as that obtained using the standard delta rule with an infinitesimal learning rate [10]. As with the standard delta rule, this yielded a post-learning performance error of approximately zero on the learned associations *A*_{2}. This type of learning is most plausibly associated with motor learning in the cerebellum and basal ganglia [10].
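The equivalence between the analytic projection and the small-learning-rate delta rule can be checked numerically. In this sketch (an illustration, with a batch delta rule and a small fixed learning rate standing in for the infinitesimal-rate limit):

```python
import numpy as np

rng = np.random.default_rng(6)
n, n2 = 20, 10

X2 = rng.standard_normal((n2, n)) / np.sqrt(n)  # inputs for subset A2
d2 = rng.standard_normal(n2) / np.sqrt(n)       # desired outputs for A2
w1 = rng.standard_normal(n)                     # initial weight vector

# Analytic learning: orthogonal projection of w1 onto {w : X2 w = d2},
# i.e., the minimum-norm weight change satisfying all A2 constraints.
w_analytic = w1 + np.linalg.pinv(X2) @ (d2 - X2 @ w1)

# Batch delta rule with a small fixed learning rate, run to convergence.
w = w1.copy()
eta = 0.1
for _ in range(10000):
    w += eta * X2.T @ (d2 - X2 @ w)

print(np.allclose(w, w_analytic, atol=1e-3))    # True: same end point
```

The delta-rule updates always lie in the span of the learned input vectors, so the component of **w**_{1} orthogonal to that span is preserved and the iterates converge to the same projection that the analytic method computes directly.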

## Results

The results are based on ten computer simulation runs for each of the three conditions, FLL, NoFLL, and NoLearn, described above, and graphs show the mean of these ten runs. Each run involved a different fixed set *A* of 20 associations. As a reminder, the two free parameters are: 1) the cost-of-learning parameter, which was set to *c* = 0.05, and 2) the threshold of the fitness error function, which was set to *D*_{thresh} = 0.01.

The main results are shown in Figure 4, which summarises the more detailed results in Figure 5. Condition FLL yields an FLL-induced error (i.e., error on *A*_{1} after learning *A*_{2}) of approximately zero after 30 generations, whereas condition NoFLL requires about 60 generations to achieve an error of less than unity. Condition NoLearn (dotted curve) yields the slowest innate learning, and is included for comparison.

**Figure 4.** The graph shows the mean results of ten computer simulation runs in three conditions: FLL, NoFLL, and NoLearn (see text). In each run, the fitness of each of 1,000 networks was determined by performance error on a fixed set *A* = *A*_{1} ∪ *A*_{2} of 20 associations; a different set *A* was used for each run. For all graphs in this paper, the median (of error, here) of 1,000 networks was obtained throughout each run, and the mean of ten medians is shown for each generation in each condition (standard errors were no greater than 0.5, and are not shown for clarity).

**Condition FLL** (solid line): mean performance error *E _{post}*(

*A*

_{1}) on the ten associations in

*A*

_{1}after learning the ten associations in subset

*A*

_{2}. Learning

*A*

_{2}had a beneficial effect on performance on

*A*

_{1}over 100 generations, corresponding to an increase in the amount and prevalence of FLL (see Figure 6).

**Condition NoFLL** (dashed line): mean performance error *E*_{post}(*A*_{1}) was evaluated as in condition FLL, except that the input vectors in *A*_{1} were orthogonal to those in *A*_{2}, so that learning *A*_{2} could not have any effect on performance on *A*_{1} (see text).

**Condition NoLearn** (dotted line): mean performance error *E*_{pre}(*A*_{1}) was evaluated “at birth” (i.e., no within-lifetime learning was allowed).

Performance error in three conditions (FLL, NoFLL, and NoLearn). Different performance errors are drawn as follows: **solid line**, *E*_{post}(*A*_{1}), performance error on *A*_{1} after learning only *A*_{2}; **dashed line**, *E*_{pre}(*A*_{1}), innate performance error on *A*_{1}; **dotted line**, *E*_{pre}(*A*_{2}), innate performance error on *A*_{2}.

(A) **Condition FLL**: mean performance error *E*_{post}(*A*_{1}) on *A*_{1} (solid line) after learning *A*_{2} decreased over generations most rapidly in this condition. Innate error on *A*_{1} is slightly lower than on *A*_{2}.

(B) **Condition NoFLL**: precluding FLL ensured that mean innate error on *A*_{1} (dashed line) was essentially the same as after learning *A*_{2} (solid line). The dashed line has been plotted 0.1 units above the solid line for clarity. Innate error on *A*_{2} is large because the fitness cost of innate errors is low (*c* = 0.05).

(C) **Condition NoLearn**: mean innate performance errors *E*_{pre}(*A*_{1}) and *E*_{pre}(*A*_{2}) on *A*_{1} and *A*_{2}. For comparison, mean performance errors on *A*_{1} in each condition (i.e., the solid lines here) are summarised in Figure 4.

The lines in each graph correspond to the conditions: FLL = **solid line**, NoFLL = **dashed line**. Performance error on *A*_{1} was tested after learning *A*_{2} in conditions FLL and NoFLL. Each plotted line is the mean of ten computer simulation runs (see Figure 4 for details).

(A) **Prevalence of FLL**: the proportion of networks which showed improved performance on *A*_{1} after learning *A*_{2}. In condition FLL, the prevalence of FLL was non-zero in the first generation, as expected (see text), and increased across subsequent generations. In condition NoFLL, the prevalence remained at zero, as expected.

(B) **Amount of FLL**: in condition FLL, the amount of FLL increased dramatically over the first 30 generations. The absence of FLL in condition NoFLL is as expected (see text). The amount of FLL, defined for this graph only, is the difference in fitness error on *A*_{1} before and after learning *A*_{2}, expressed as a proportion of the error on *A*_{1} before learning *A*_{2}; or, equivalently, as [*E*_{pre}(*A*_{1}) − *E*_{post}(*A*_{1})]/*E*_{pre}(*A*_{1}).

The proportion of networks that exhibit FLL over generations is shown in Figure 6A, and the amount of FLL is shown in Figure 6B. Both the prevalence and the amount of FLL increase in condition FLL, as indicated by the solid line in each panel. The zero prevalence of FLL in condition NoFLL (dashed line, Figure 6A) is accompanied by a zero amount of FLL (Figure 6B). More detailed results are shown in Figure 5A–5C.

### Performance on *A*_{1}

Performance on *A*_{1} (solid line in Figure 5A) after learning *A*_{2} is better in condition FLL than in condition NoFLL (solid line, Figure 5B). Innate performance on *A*_{1} is also better in condition FLL (dashed line, Figure 5A) than in conditions NoFLL (dashed line, Figure 5B) and NoLearn (dashed line, Figure 5C). Together, these results suggest that FLL accelerates both the rate at which FLL-induced behaviours (*A*_{1}) appear and the rate at which those behaviours become genetically assimilated.

### Performance on *A*_{2}

Innate performance on subset *A*_{2} is better in condition FLL (dotted curve, Figure 5A) than in conditions NoFLL (dotted curve, Figure 5B) and NoLearn (dotted curve, Figure 5C).

Learning *A*_{2} reduces error on subset *A*_{1} even in the first generation in condition FLL (Figure 5A). Additionally, the proportion of networks showing FLL is greater than 0.5 (Figure 6A), and the amount of FLL is greater than zero (Figure 6B) in the first generation. These effects are not due to any special properties of the networks or of the associations. Indeed, they are entirely expected, and are consistent with the theoretical analysis in [10]. In essence, FLL is observed in the first generation because it is very unlikely that the manifolds *L*_{1} and *L*_{2} defined by *A*_{1} and *A*_{2} (respectively) are orthogonal, so that learning *A*_{2} usually reduces error on *A*_{1} (albeit by a small amount in the first generation).
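This first-generation effect can be reproduced with unselected random networks. The sketch below is a toy version under stated assumptions (Gaussian inputs, a single linear output, 20 mutually consistent associations split into two subsets of ten); it learns *A*_{2} by orthogonal projection and measures the prevalence of FLL and its amount, [*E*_{pre}(*A*_{1}) − *E*_{post}(*A*_{1})]/*E*_{pre}(*A*_{1}):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 20, 500
fll_amounts, fll_count = [], 0

for _ in range(trials):
    X = rng.standard_normal((20, n))      # input vectors for all of A
    w_star = rng.standard_normal(n)       # weights consistent with all 20 associations
    d = X @ w_star                        # targets for A = A1 ∪ A2
    X1, d1 = X[:10], d[:10]               # subset A1 (tested, never trained)
    X2, d2 = X[10:], d[10:]               # subset A2 (learned)

    w1 = rng.standard_normal(n)           # random first-generation weights
    # Learn A2 exactly: project w1 onto {w : X2 w = d2}
    w2 = w1 + X2.T @ np.linalg.solve(X2 @ X2.T, d2 - X2 @ w1)

    e_pre = np.sum((X1 @ w1 - d1) ** 2)   # error on A1 before learning A2
    e_post = np.sum((X1 @ w2 - d1) ** 2)  # error on A1 after learning A2
    fll_amounts.append((e_pre - e_post) / e_pre)
    fll_count += e_post < e_pre

prevalence = fll_count / trials
print(f"prevalence of FLL: {prevalence:.2f}, mean amount: {np.mean(fll_amounts):.2f}")
assert prevalence > 0.5 and np.mean(fll_amounts) > 0
```

Because the input subspaces of *A*_{1} and *A*_{2} are almost never orthogonal for random inputs, the weight change induced by learning *A*_{2} usually has a component that also reduces error on *A*_{1}, so FLL appears without any selection.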

## Discussion

Before discussing results in detail, it is important to clarify precisely what is being claimed here. The main claim is that, given a population of organisms which can learn, the presence of FLL accelerates the rate at which a given set of advantageous behaviours evolves relative to populations which 1) can learn but which do not have FLL (e.g., condition NoFLL) and 2) cannot learn (e.g., condition NoLearn). Specifically, it is claimed that FLL accelerates the appearance of adaptive behaviour, both in its innate form and as FLL-induced behaviour, and that FLL can accelerate the rate at which learned behaviours become innate.

It is also claimed that FLL increases the rate at which a set of behaviours (e.g., *A*) is acquired within a lifetime. Clearly, if learning one subset (e.g., *A*_{2}) induces learning of another subset (e.g., *A*_{1}), then the amount of learning required to learn both subsets (e.g., *A*) is reduced.

It is worth noting that FLL is not related to generalisation (see [10]), which cannot therefore be responsible for the effects reported here.

### Task Difficulty

The task was purposely made difficult, such that network outputs which were not close to desired target values were assigned an error value of unity. This heavily penalises networks that do not generate near-correct responses. This type of task may emulate tasks for which being “almost correct” provides no fitness benefit. Such tasks are exemplified by a predator which almost catches prey (e.g., a kingfisher almost catching a fish), where each failed attempt carries a large fitness cost, or by tasks where learning is incremental and stepwise (e.g., learning to catch progressively larger prey). Such tasks give rise to “needle-in-a-haystack” search spaces [5], which have rugged or uncorrelated landscapes [13].
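A minimal sketch of such an all-or-nothing error function (the exact form used in the simulations is assumed, not quoted; `d_thresh` plays the role of *D*_{thresh}):

```python
import numpy as np

def fitness_error(outputs, targets, d_thresh=0.01):
    """All-or-nothing fitness error: an association contributes zero error
    only if the network's output is within d_thresh of its target, and
    contributes an error of unity otherwise. Being 'almost correct'
    therefore earns no credit, producing a needle-in-a-haystack landscape."""
    return np.mean(np.where(np.abs(outputs - targets) > d_thresh, 1.0, 0.0))

targets = np.array([0.5, -0.2, 0.9])
assert fitness_error(np.array([0.5, -0.2, 0.9]), targets) == 0.0  # all within threshold
assert fitness_error(np.array([0.4, -0.2, 0.9]), targets) == 1/3  # one association missed
```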

### Is Accelerated Evolution due to Learning?

A cogent critique of research by Nolfi et al. [14] argues that accelerated evolution (specifically, assimilation) is a generic consequence of learning per se [15]. In results not shown here, replacing *A*_{2} with a new, randomly chosen subset every generation in condition FLL yields a more gradual evolution of FLL-induced and innate behaviours than is obtained in any of the conditions used here. This effectively excludes the possibility that the accelerated evolution reported here is due to learning per se.

### Reaction Norms

In terms of evolutionary theory, FLL-induced behaviours can be considered as the establishment of a new reaction norm. The specific “environment” that induces the reaction norm is learning a particular subset of behaviours (*A*_{2}), and the phenotypic reaction to this environment is another subset of behaviours (*A*_{1}).

### Genetic Assimilation

FLL does not necessarily force FLL-induced behaviours to become genetically assimilated. In fact, there is a tradeoff between the amount of acceleration induced by FLL and the extent to which behaviours become innate. If the cost-of-learning parameter is set to *c* = 1, then there is no incentive for FLL to increase over generations. In contrast, if *c* ≈ 0 (as in the simulations reported here), then the rapid evolution of FLL-induced behaviour shown in Figure 5A (solid line) is obtained, alongside the slower evolution of innate behaviour (dashed line in Figure 5A). Thus, even the small value of *c* (0.05) used here puts pressure on learned behaviours to become innate, as indicated by the decreasing innate performance errors on *A*_{1} (dashed line) and *A*_{2} (dotted line) in Figure 5A.

In practice, learning always has a non-zero fitness cost, if only in terms of the time required for that learning to occur. This is because time spent learning is time spent not eating, or time spent being eaten, both of which reduce fitness. Thus, the small value of *c* used here represents one value along the spectrum of learning costs. It therefore seems likely that even the simplest learned behaviours have a tendency to become innate, and that this tendency increases with the cost of learning. For example, in results not shown here, increasing the cost-of-learning parameter *c* decreases the rate at which FLL-induced performance on *A*_{1} improves, and increases the rate at which performance on *A*_{1} and *A*_{2} becomes innate (with *c* = 1, innate performance is effectively that obtained in condition NoLearn; see Figure 5C). It is therefore not easy to classify the effect reported here as a clear-cut example of the Baldwin effect [6], although these effects are almost certainly related.

### General Free-Lunch Effects

The basic geometry which underpins FLL within a lifetime (as in [10]) and across lifetimes (as here) can also be applied in two other contexts: 1) evolution of innate behaviours without learning, and 2) evolution of general phenotypic traits. These two cases are considered in the next two paragraphs. Both of these effects require the presence of environmental conditions that fluctuate over successive generations (e.g., fluctuations in temperature induced by ice ages, salinity, prey numbers, or predation pressure).

1) Accelerated evolution of innate behaviours without learning can be understood by considering an organism that has no learning ability, and which relies on genetic specification of its neuronal connections [12]. Natural selection ensures that its neuronal connections at birth yield innate behaviour matched to its environment. If the environment changes, then natural selection will induce a corresponding shift to a new set of innate connections. If the environment then shifts back to its original state, then organisms' connections will tend to revert to their original values. Let us assume that some connections revert faster than others over successive generations. For example, some connections may be specified by genes linked to other innate behaviours, and this genetic linkage would tend to reduce the rate of genetic change. In fact, for simplicity, assume that half of the connections revert quickly and half revert slowly. If the required behaviours are encoded as distributed representations, then this connection reversion will induce a FLL-type effect, such that *all* associations benefit from the reversion of a proportion (half here) of connection values.

2) Accelerated evolution of general phenotypic traits can be understood if we assume an extreme form of pleiotropy: that each of a given set of genes affects every phenotypic trait. This is equivalent to assuming that the genome is a *distributed representation* of the phenotype. Consider a population in which the fittest organism has a genome **w**_{0} which is perfectly adapted to its environment *e*_{0}. If the environment changes to *e*_{1}, then the fittest organism's genome will eventually evolve to a new state **w**_{1} that is suited to *e*_{1} (this is analogous to forgetting in FLL). Now, consider what happens if the environment changes back to *e*_{0}. The fittest organism's genome will be forced back toward **w**_{0}, but inevitably some genes will revert faster than others. For the sake of argument, assume that a subset *G*_{2} of genes reverts to its original values, while the remaining genes *G*_{1} stay as they were in **w**_{1} (this is analogous to relearning only *A*_{2}). Because each gene in *G*_{2} contributes to every phenotypic trait, the reversion of genes in *G*_{2} to their original values will push the entire phenotype back toward its state in the original environment *e*_{0}. Thus, the reappearance of an entire set of phenotypic traits (e.g., changes in size) can occur more quickly if those traits are encoded within a set of pleiotropic genes than if each trait is represented by a non-pleiotropic gene, which suggests a form of *free-lunch evolution*.
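This argument can be illustrated numerically under toy assumptions: a fixed linear genotype-to-phenotype map **M** in which every gene affects every trait, random genomes, and reversion of half the genes. The sketch measures how far the phenotype moves back toward its *e*_{0}-adapted state when only the *G*_{2} half of the genome reverts:

```python
import numpy as np

rng = np.random.default_rng(2)
n_genes, n_traits, trials = 20, 20, 200
ratios = []

for _ in range(trials):
    # Dense map: extreme pleiotropy, every gene affects every trait
    M = rng.standard_normal((n_traits, n_genes))
    w0 = rng.standard_normal(n_genes)   # genome adapted to environment e0
    w1 = rng.standard_normal(n_genes)   # genome after adapting to e1

    # Environment returns to e0: only genes in G2 (here, the second half)
    # revert to their e0 values; genes in G1 remain as in w1.
    w_partial = w1.copy()
    w_partial[n_genes // 2 :] = w0[n_genes // 2 :]

    d_before = np.linalg.norm(M @ (w1 - w0))       # phenotypic distance from e0-optimum
    d_after = np.linalg.norm(M @ (w_partial - w0))  # ... after partial reversion
    ratios.append(d_after / d_before)

print(f"mean phenotypic distance ratio after partial reversion: {np.mean(ratios):.2f}")
assert np.mean(ratios) < 1.0  # reverting half the genes moves the whole phenotype back
```

Because each reverted gene contributes to every trait, reversion of *G*_{2} alone shifts the entire phenotype toward its original state on average, the evolutionary analogue of relearning only *A*_{2}.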

### Conclusion

It has been demonstrated that FLL accelerates the evolution of behaviours in neural network models. Given that FLL appears to be a fundamental property of distributed representations, and given the reliance of neuronal systems on distributed representations, FLL-induced behaviours may constitute a significant component of apparently innate behaviours (e.g., nest-building). Results presented here suggest that any organism that did not take advantage of such a fundamental and ubiquitous effect would be at a selective disadvantage. Finally, if FLL accelerates evolution in the natural world, then it may have been involved in the Cambrian explosion, which began when brains (and therefore learning) first appeared.

## Acknowledgments

Thanks to N. Hunkin and R. Lister for comments on this paper, and to P. Parpia for useful discussions. Thanks also to two anonymous reviewers for their detailed comments.

## Author Contributions

JVS conceived and designed the experiments, performed the experiments, analyzed the data, and wrote the paper.

## References

- 1. Bateson G (1979) Mind and nature. Flamingo.
- 2. Waddington C (1959) Canalisation of development and genetic assimilation of acquired characters. Nature 183: 1654–1655.
- 3. Mery F, Kawecki T (2003) A fitness cost of learning in Drosophila melanogaster. Proc Roy Soc Lond B 270: 2465–2469.
- 4. Price T, Qvarnstrom A, Irwin D (2003) The role of phenotypic plasticity in driving genetic evolution. Proc Roy Soc Lond B 270: 1433–1440.
- 5. Hinton G, Nowlan S (1987) How learning can guide evolution. Complex Systems 1: 495–502.
- 6. Baldwin J (1896) A new factor in evolution. Am Nat 30: 441–451, 536–553.
- 7. Mery F, Kawecki T (2004) The effect of learning on experimental evolution of resource preference in Drosophila melanogaster. Evolution 58: 757–767.
- 8. Dopazo H, Gordon M, Perazzo R, Risau-Gusman S (2001) A model for the interaction of learning and evolution. Bull Math Biol 63: 117–134.
- 9. Stone J, Hunkin N, Hornby A (2001) Predicting spontaneous recovery of memory. Nature 414: 167–168.
- 10. Stone J, Jupp P (2007) Free-lunch learning: Modelling spontaneous recovery of memory. Neural Comput 19: 194–217.
- 11. Tinbergen N (1951) The study of instinct. Oxford: Oxford University Press.
- 12. Kaufman A, Dror G, Meilijson I, Ruppin E (2006) Gene expression of Caenorhabditis elegans neurons carries information on their synaptic connectivity. PLoS Comput Biol 2: 1561–1567.
- 13. Kauffman S, Levin S (1987) Towards a general theory of adaptive walks on rugged landscapes. J Theor Biol 128: 11–45.
- 14. Nolfi S, Elman J, Parisi D (1994) Learning and evolution in neural networks. Adaptive Behavior 3: 5–28.
- 15. Harvey I (1996) Relearning and evolution in neural networks. Adaptive Behavior 4: 81–84.