
Conceived and designed the experiments: JS. Performed the experiments: JS. Analyzed the data: JS. Contributed reagents/materials/analysis tools: JS. Wrote the paper: JS PEJ. Mathematical proofs: PEJ.

The authors have declared that no competing interests exist.

Long after a new language has been learned and forgotten, relearning a few words seems to trigger the recall of other words. This “free-lunch learning” (FLL) effect has been demonstrated both in humans and in neural network models. Specifically, previous work proved that linear networks that learn a set of associations, then partially forget them all, and finally relearn some of the associations, show improved performance on the remaining (i.e., nonrelearned) associations. Here, we prove that if, instead, forgetting is caused by connection weights that fall towards zero, then relearning some forgotten associations degrades performance on the remaining associations; that is, the expected free-lunch learning is negative.

If you learn a skill, then partially forget it, does relearning part of that skill induce recovery of other parts of the skill? More generally, if you learn a set of associations, then partially forget them, does relearning a subset induce recovery of the remaining associations? In previous work, in which participants learned the layout of a scrambled computer keyboard, the answer to this question appeared to be “yes.” More recently, we modeled this “free-lunch learning” effect using artificial neural networks, in which the synaptic strength between each pair of model neurons is a connection weight. We proved that if forgetting is induced by allowing each weight value to drift randomly, then free-lunch learning is almost inevitable. However, if, after learning a set of associations, forgetting is induced by allowing each connection weight to fall towards zero, then relearning a subset of the associations degrades performance on the remaining associations: free-lunch learning is replaced by negative free-lunch learning.

The idea that structural changes underpin the formation of new memories can be traced to the 19th century.

Part-set cueing inhibition refers to the related psychological finding that re-presenting a subset of learned items as cues can impair, rather than aid, recall of the remaining items.

If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations. Therefore, relearning some old and partially forgotten associations should affect the integrity of other associations learned at about the same time. As noted above, previous work has shown that relearning some forgotten associations does not disrupt other associations, but partially restores them. This FLL effect has also been demonstrated in neural network models.

The protocol used to examine FLL here is the same as that used in previous work. First, learn a set of n_1+n_2 associations A_1∪A_2 consisting of two subsets A_1 and A_2 of n_1 and n_2 associations, respectively. After all learned associations have been partially forgotten, measure performance on subset A_1. Finally, relearn A_2 and then remeasure performance on subset A_1. FLL occurs if relearning subset A_2 improves performance on A_1.

Two subsets of associations A_1 and A_2 are learned. After partial forgetting (see text), performance error E_pre on subset A_1 is measured. Subset A_2 is then relearned to pre-forgetting levels of performance, and performance error E_post on subset A_1 is re-measured. If E_post<E_pre then FLL has occurred, and the amount of FLL is δ = E_pre−E_post. Redrawn from previous work.

In order to preclude a common misunderstanding, we emphasize that, for a network with k input units, we require k ≥ n_1+n_2; that is, the number of connection weights on each output unit is not less than the number n_1+n_2 of learned associations. Using the class of linear network models described below, up to k associations can be learned perfectly.

The proofs below refer to a network with one output unit. However, these proofs apply to networks with multiple output units, because the output units are mutually independent, so each output unit can be treated as a separate network.

Each association consists of an input vector x = (x_1,…,x_k)^T and a desired (target) output value d. The network has k input units, with connection weight vector w = (w_1,…,w_k)^T, and a single output unit whose output is y = w·x.

The two subsets A_1 and A_2 consist of n_1 and n_2 associations, respectively. Let w_0 be the network weight vector after A_1 and A_2 are learned. When A_1 and A_2 are partially forgotten, the network weight vector changes to w_1, say, and the performance error on A_1 becomes E_pre = E(w_1, A_1). Relearning A_2 yields a new weight vector, w_2, say, and the performance error on A_1 is E_post = E(w_2, A_1). FLL occurs if the performance error on A_1 is less after relearning A_2 than it was before relearning A_2 (i.e., if E_post<E_pre).
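As a concrete illustration, the performance error used throughout can be computed as a sum of squared output errors. The sketch below is ours, not the authors' code: it assumes a squared-error measure, stores one association per column of an input matrix X, and uses the hypothetical helper name perf_error.

```python
import numpy as np

def perf_error(w, X, d):
    """Sum of squared output errors of a linear network with weights w on
    the associations whose input vectors are the columns of X, targets d."""
    return float(np.sum((X.T @ w - d) ** 2))

# Toy check with two associations in a two-weight network.
X = np.array([[1.0, 0.0],
              [0.0, 2.0]])              # columns are input vectors
d = np.array([3.0, 4.0])                # desired outputs
w_exact = np.linalg.solve(X.T, d)       # learn both associations exactly
print(perf_error(w_exact, X, d))        # 0.0: perfect performance
```

With this error measure, E_pre and E_post are simply perf_error evaluated at w_1 and w_2 on the inputs and targets of subset A_1.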

Given weight vectors w_1 and w_2, a matrix X of input vectors, and a vector d of desired outputs, the performance errors E_pre and E_post are defined precisely in the Methods section.

In previous work, forgetting was modeled by assuming that the weight change (w_1−w_0) has an isotropic distribution. Here we shall assume instead that the post-forgetting weight vector w_1 is given by w_1 = λw_0, with 0 ≤ λ < 1, so that the weights in w_0 “fall” towards the origin by a factor 1−λ.

We provide theoretical results, and compare these with results obtained using computer simulations. In essence, our theoretical and simulation results indicate that falling weights induce negative FLL, whose expected magnitude grows with the square of the falling factor 1−λ.

Our two main theorems are summarised here, and proofs are provided in the Methods section. Consider a network that learns n_1+n_2 associations A_1∪A_2, and then, after partial forgetting, relearns the n_2 associations in A_2.

We prove that if n_1+n_2 ≤ k (so that A_1 and A_2 are consistent) and the joint distribution of (X_1,d_1) is isotropic (where X_1 and d_1 are the matrix of inputs and the vector of desired outputs for subset A_1 of associations) then the expected amount of FLL is negative (Theorem 1), and the probability of FLL tends to zero as n_1 approaches ∞ (Theorem 2).

For every non-zero value of the falling factor 1−λ, the expected amount of FLL is negative.

From a physiological perspective, the case of greatest interest is that in which synaptic weights decay towards zero; any such shrinkage of w_0 towards the origin yields an expectation of negative FLL, in accordance with Theorem 1.

Under mild conditions on the distributions of the input/output pairs (X_1,d_1) and (X_2,d_2), the probability of FLL tends to zero as the number of associations increases (Theorem 2).

Theorem 2 implies that, if (i) the number (n_1) of associations in A_1 is a fixed non-zero proportion (n_1/n) of the total number n of associations, and (ii) E[∥d_1∥^2] E[∥d_2∥^{−2}] is bounded as n increases, then the probability of FLL tends to zero.

For example, if we assume that (i) each input vector x_1,…,x_n and (ii) each target output d_i are drawn independently from suitable spherically symmetric distributions, then E[∥d_1∥^2] E[∥d_2∥^{−2}] = n_1/(n_2−1). This ensures that the probability of FLL tends to zero as the number of associations increases.

Simulation was carried out on a network with k input units x_1,…,x_k, connection weights w_1,…,w_k, and one output unit. For each association, the input vector x_i is mapped to the output y_i = w·x_i = Σ_j w_j x_ij.

Given that the network error for a given set of associations is a quadratic function of the weight vector, learning could be implemented by gradient descent on that error.

However, in order to save time, we used an equivalent learning method. Learning of the n_1+n_2 associations in A_1∪A_2 was performed by solving a set of n_1+n_2 simultaneous linear equations, so that a weight vector w_0 with perfect performance on all n_1+n_2 associations was obtained. Forgetting was implemented by setting w_1 = λw_0, after which performance error was E_pre. Relearning the n_2 associations in A_2 was implemented by setting w_2 to the orthogonal projection of w_1 onto the set of weight vectors with perfect performance on A_2, as above, after which performance error was E_post.

In each simulation, each value in each input vector x_i and each target output d_i was chosen randomly, and subsets A_1 and A_2 each consisted of 50 associations. The value of δ = E_pre−E_post was obtained in each of 100 simulations, using a different random seed for each simulation.

A network with 100 input units and one output unit learns two subsets A_1 and A_2, each of which consists of 50 associations. After learning A_1 and A_2, the network has a weight vector w = w_0, but after partial forgetting, the weight vector is w = w_1. If forgetting consists of subtracting a proportion 1−λ of w_0 such that w_1 = w_0−(1−λ)w_0 then the weight vector “falls” towards the origin; the factor 1−λ is the falling factor. The performance error on A_1 is E_pre, an error which changes to E_post after relearning A_2, where this change is δ = E_pre−E_post. Given that there are n_1 associations in A_1, the free-lunch learning per association in A_1 is therefore δ/n_1, and its predicted expectation E_predict[δ/n_1] is proportional to −(1−λ)^2. As predicted, free-lunch learning in the simulations is negative, with magnitude growing as (1−λ)^2.

We present a brief account of the geometry which underpins the results reported here, for a network with two input units and one output unit, as shown in the figure, with two associations t_1 = (x_1,d_1) and t_2 = (x_2,d_2).

(A) A network with two input units and one output unit, with connection weights w_a and w_b. The weight vector w = (w_a,w_b) is adjusted as the network learns two associations t_1 and t_2. For example, t_1 is the mapping from input vector x_1 = (x_11,x_12) to desired output value d_1, and learning t_1 consists of adjusting w until the network output y_1 = w·x_1 equals d_1. (B) For a given association t_2 = (x_2,d_2), the corresponding constraint line in the space defined by (w_a,w_b) consists of those weight vectors that give perfect performance on t_2. Irrespective of the precise value of the target output value d_1 in association t_1, if d_1 is distributed isotropically then +d_1 is as probable as −d_1. When averaged over +d_1 and −d_1, the change in error on t_1 induced by relearning t_2 can be shown to be proportional to −(1−λ)^2, where w_1^± = λw_0^±. Since this is less than zero, the expected change in error is an increase.

Relearning t_2 increases the error on t_1. After learning t_1 and t_2, the weight vector is w_0. The effects of forgetting and relearning can be seen by ignoring the ± superscripts and subscripts for now. After partial forgetting, the weight vector is w_1, and the performance error on t_1 is E_pre, proportional to the squared distance between w_1 and the constraint line of t_1. Relearning t_2 yields w_2, the orthogonal projection of w_1 on to the constraint line of t_2, and the performance error is E_post, proportional to the squared distance between w_2 and the constraint line of t_1. FLL occurs if E_pre−E_post>0, or equivalently if the distance to the constraint line of t_1 decreases (see text). Forgetting consists of shrinking w_0 by a factor λ, so w_1 = λw_0.

The plus and minus signs in the figure refer to two instances of association t_1, in which the input x_1 is the same and the target d_1 has the same magnitude, but opposite signs: d_1^+ = +d_1 and d_1^− = −d_1.

We now find the expected change in error induced by relearning a given association t_2. After partial forgetting, the error on t_1 before relearning t_2 is E_pre, and after relearning t_2 it is E_post, so the change induced by relearning t_2 (on different occasions) is δ = E_pre−E_post. For each instance ±d_1 of t_1, if the distribution of d_1 is isotropic then +d_1 is as probable as −d_1. The total change in error for the two instances (±d_1) of t_1 is −2(1−λ)^2 m^2, where m = (x_1·x_2/∥x_2∥^2)d_2, so the expected change (conditional on the remaining quantities) is −(1−λ)^2 m^2. Therefore, if forgetting is induced by falling weight values, then the expected change in error is an increase: expected FLL is negative.
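The averaging argument can be checked numerically in the two-weight case. The vectors below and the helper name delta_for are illustrative choices of ours; the check confirms that, averaged over the two instances ±d_1, the change in error equals −(1−λ)^2 m^2 with m = (x_1·x_2/∥x_2∥^2)d_2.

```python
import numpy as np

x1, x2 = np.array([1.0, 0.3]), np.array([0.2, 1.0])   # arbitrary inputs
d2, lam = 0.7, 0.5                    # target of t2; shrink factor lambda

def delta_for(d1):
    """E_pre - E_post on t1 for one signed instance of the target d1."""
    w0 = np.linalg.solve(np.column_stack([x1, x2]).T, np.array([d1, d2]))
    w1 = lam * w0                     # forgetting: w falls towards origin
    e_pre = (x1 @ w1 - d1) ** 2
    # Relearn t2: project w1 onto the constraint line {w : x2.w = d2}.
    w2 = w1 + (d2 - x2 @ w1) / (x2 @ x2) * x2
    e_post = (x1 @ w2 - d1) ** 2
    return e_pre - e_post

d1 = 0.9
avg = 0.5 * (delta_for(+d1) + delta_for(-d1))         # average over +/- d1
m = (x1 @ x2) / (x2 @ x2) * d2
print(np.isclose(avg, -(1 - lam) ** 2 * m ** 2))      # True
```

The ±d_1 terms cancel in the average, leaving only the strictly negative −(1−λ)^2 m^2 term, which is why the expected FLL is negative whenever m ≠ 0 and λ ≠ 1.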

We have proved and demonstrated that, in one of the simplest forms of neural network model, relearning part of a previously learned set of associations reduces performance on the remaining non-relearned associations. This result is in stark contrast to our previous results, which proved that relearning induced partial recovery of non-relearned items.

An obvious physiological concomitant of Hebbian learning is long-term potentiation (LTP), which seems to underpin learned behaviors.

Given a choice between forgetting via synaptic weights that fall towards zero and weights that drift isotropically, has evolution chosen drifting or falling? If all other things were equal then forgetting via synaptic drift would seem to be the obvious choice. This is because drifting ensures that relearning a subset of associations improves performance on other associations, whereas falling decreases performance. However, other things are rarely equal. The expected magnitude of weights increases with drifting but decreases with falling. (Consider a hypersphere centered on the origin, with radius ∥w_0∥. Simple geometry shows that more than half of all directions emanating from w_0 yield a new weight vector w_1 which lies outside the hypersphere, and therefore E[∥w_1∥]>E[∥w_0∥] (assuming, for example, that all vectors w_1−w_0 have the same length).) This decrease in weight magnitudes effectively reduces neuronal firing rates, which reduces metabolic costs relative to costs incurred by synaptic drift. Synaptic drift therefore confers mnemonic benefits, but these benefits come at a metabolic price. Thus the increased fitness gained from the mnemonic benefits of synaptic drift must be offset against their metabolic costs. In essence, even free-lunch learning comes at a price.
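The claim that isotropic drift tends to inflate weight magnitudes, while falling shrinks them, is easy to verify by Monte Carlo. The sketch below is our construction (the step length of half the weight norm is an arbitrary choice): it draws fixed-length steps in uniformly random directions and compares the mean post-drift norm with ∥w_0∥.

```python
import numpy as np

rng = np.random.default_rng(0)
k = 100
w0 = rng.standard_normal(k)
step = 0.5 * np.linalg.norm(w0)          # arbitrary fixed drift length

# Isotropic drift: w1 = w0 + step * u, with u uniform on the unit sphere.
u = rng.standard_normal((5000, k))
u /= np.linalg.norm(u, axis=1, keepdims=True)
drift_norms = np.linalg.norm(w0 + step * u, axis=1)

lam = 0.5                                 # falling, by contrast, shrinks w0
fall_norm = np.linalg.norm(lam * w0)

print(drift_norms.mean() > np.linalg.norm(w0))   # True: drift inflates weights
print(fall_norm < np.linalg.norm(w0))            # True: falling shrinks them
```

In high dimensions the random step is nearly orthogonal to w_0, so the post-drift norm concentrates near sqrt(∥w_0∥^2 + step^2) > ∥w_0∥, which is the geometric point made in the parenthetical argument above.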

We proceed by deriving expressions for E_pre, E_post, and for δ = E_pre−E_post. We prove that if n_1+n_2 ≤ k then the expected value of δ is negative, and that the probability of FLL tends to zero as n_1 approaches ∞.

Given a k×n matrix X of input vectors and an n×1 vector d of target outputs, let A_{X,d} be the affine subspace of weight vectors w satisfying X^T w = d. Given weight vectors w_1 and w_2, the performance errors are E_pre = E(w_1; X, d) and E_post = E(w_2; X, d), where w_2 lies in A_{X,d}.

If x_i is the input vector and d_i the target output of the i-th association, then the error of the network on that association is (w·x_i−d_i)^2, and the performance error on a set of associations is the sum of these squared errors.

Now suppose that the learned associations comprise the two subsets, with input matrices X_1 and X_2 and target vectors d_1 and d_2, so that X = [X_1 X_2] and d = (d_1^T, d_2^T)^T.

Then, after the network has learned A_1 and A_2, the weight vector w_0 satisfies X^T w_0 = d (Equation 16). (If n_1+n_2 ≤ k, subsets A_1 and A_2 are consistent, and (X_1,d_1) has a continuous distribution then Equation 16 holds with probability 1.)

We now assume that forgetting is induced by weight values “falling” towards the origin, i.e., forgetting consists of shrinking the weight vector w_0 by a (possibly random) factor λ, so that the post-forgetting weight vector w_1 is given by w_1 = λw_0 (Equation 17). The weight change w_1−w_0 is then −(1−λ)w_0.

The form of forgetting given by Equation 17 is very different from that investigated in previous work, in which the weight change was isotropic and therefore independent of (X_1,d_1) and (X_2,d_2).

Let w_2 be the orthogonal projection of w_1 onto A_{X_2,d_2}. Then
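This projection has a standard closed form (a textbook result, stated here by us rather than taken from this section): w_2 = w_1 + X_2 (X_2^T X_2)^{-1} (d_2 − X_2^T w_1), valid when X_2 has full column rank. A quick numerical check:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n2 = 10, 4
X2 = rng.standard_normal((k, n2))       # inputs of A2, one per column
d2 = rng.standard_normal(n2)            # targets of A2
w1 = rng.standard_normal(k)             # post-forgetting weight vector

# Orthogonal projection of w1 onto the affine subspace {w : X2^T w = d2}.
w2 = w1 + X2 @ np.linalg.solve(X2.T @ X2, d2 - X2.T @ w1)
print(np.allclose(X2.T @ w2, d2))       # True: perfect performance on A2

# w2 is the closest such point: any other feasible point is further from w1.
z = rng.standard_normal(k)
v = z - X2 @ np.linalg.solve(X2.T @ X2, X2.T @ z)   # v lies in null(X2^T)
w_alt = w2 + v                                       # another feasible point
print(np.linalg.norm(w_alt - w1) >= np.linalg.norm(w2 - w1))  # True
```

The second check works because the correction w_2 − w_1 lies in the column space of X_2, which is orthogonal to every direction that stays inside A_{X_2,d_2}.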

Manipulation gives

Then Equations 14, 16, and 18–20 yield

In this section we assume that the distribution of (X_1,d_1) is isotropic, i.e., that (X_1 Q, Q^T d_1) has the same distribution as (X_1,d_1) for all orthogonal n_1×n_1 matrices Q, conditional on (X_2, d_2, and λ).

If

n_1+n_2 ≤ k,

A_1 and A_2 are consistent,

the distribution of (X_1,d_1) is continuous and isotropic,

(X_1, d_1) and (X_2, d_2, λ) are independent,

then the expected amount of FLL, conditional on (X_2, d_2, λ), is non-positive.

If 1.-3. of Theorem 1 hold then the expected amount of FLL is zero only in trivial cases, such as λ = 1 or d_2 = 0.

Corollary 1 says that (apart from trivial exceptions) the expected amount of FLL is negative.

To obtain Theorem 2, it is useful to have some moments of isotropic distributions; these are given in Equations 24 and 25.

The other tool used in proving Theorem 2 is the formula

Taking the expectation and variance of Equation 21 as only (X_1,d_1) varies and using Equation 24 gives

Taking the expectation of Equation 28 as only (X_1,d_1) varies and using Equation 24 gives

We now suppose that

Then taking the variance of Equation 27 as only (X_1,d_1) varies and using Equation 25 gives

Adding Equations 29 and 30 and using Equation 26 yields

To obtain an upper bound on the conditional probability of FLL (i.e., on Pr(δ > 0 | X_2, d_2, λ)), we use Chebyshev's inequality.

Applying Chebyshev's inequality to the conditional distribution of δ(w_1, w_2; X_1, d_1) given (X_2, d_2, λ) yields an upper bound on Pr(δ(w_1, w_2; X_1, d_1) > 0 | X_2, d_2, λ).

Substituting Equations 22 and 32 into Equation 33 gives

For any positive-definite symmetric matrix

Combining Equations 34 and 35 with the fact that

Taking the expectation of Equation 36 over (X_2, d_2) yields

Taking the expectation of Equation 37 over (X_2, d_2) and λ yields

If (a) conditions 1.-4. of Theorem 1 hold, (b) the columns of X_2 are in general position, (c) A_1 and A_2 are consistent, (d) the distribution of (X_2,d_2) is isotropic, and (e) E[∥d_2∥^{−2}] is finite then the probability of FLL tends to zero as n_1 approaches ∞.

If the conditions of Theorem 2 hold and

Thus the bound depends only on the proportions n_1/n and n_2/n of associations in the two subsets.

Thanks to David Sterratt for asking, “What would happen to free-lunch learning if weights decayed?” and to three anonymous reviewers for their detailed comments.