Skip to main content
  • Loading metrics

Falling towards Forgetfulness: Synaptic Decay Prevents Spontaneous Recovery of Memory


Long after a new language has been learned and forgotten, relearning a few words seems to trigger the recall of other words. This “free-lunch learning” (FLL) effect has been demonstrated both in humans and in neural network models. Specifically, previous work proved that linear networks that learn a set of associations, then partially forget them all, and finally relearn some of the associations, show improved performance on the remaining (i.e., nonrelearned) associations. Here, we prove that relearning forgotten associations decreases performance on nonrelearned associations; an effect we call negative free-lunch learning. The difference between free-lunch learning and the negative free-lunch learning presented here is due to the particular method used to induce forgetting. Specifically, if forgetting is induced by isotropic drifting of weight vectors (i.e., by adding isotropic noise), then free-lunch learning is observed. However, as proved here, if forgetting is induced by weight values that simply decay or fall towards zero, then negative free-lunch learning is observed. From a biological perspective, and assuming that nervous systems are analogous to the networks used here, this suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift, but not synaptic decay, yields free-lunch learning.

Author Summary

If you learn a skill, then partially forget it, does relearning part of that skill induce recovery of other parts of the skill? More generally, if you learn a set of associations, then partially forget them, does relearning a subset induce recovery of the remaining associations? In previous work, in which participants learned the layout of a scrambled computer keyboard, the answer to this question appeared to be “yes.” More recently, we modeled this “free-lunch learning” effect using artificial neural networks, in which the synaptic strength between each pair of model neurons is a connection weight. We proved that if forgetting is induced by allowing each weight value to drift randomly, then free-lunch learning is almost inevitable. However, if, after learning a set of associations, forgetting is induced by allowing each connection weight to decay or fall toward zero, then relearning a subset of associations decreases performance on the remaining associations. This suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift yields free-lunch learning, whereas decay does not.


The idea that structural changes underpin the formation of new memories can be traced to the 19th century [1]. More recently, Hebb proposed that “When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased” [2]. It is now widely accepted that learning involves some form of Hebbian adaptation, and a growing body of evidence suggests that Hebbian adaptation is associated with the long-term potentiation (LTP) observed in neuronal systems [3]. LTP is an increase in synaptic efficacy which occurs in the presence of pre-synaptic and post-synaptic activity, and can be specific to a single synapse. One consequence of Hebbian adaptation is that information regarding a specific association is distributed amongst many synaptic connections, and therefore gives rise to a distributed representation of each association.

In [4], participants learned the layout of letters on a “scrambled” keyboard. After a period of forgetting, participants relearned a subset of letter positions. Crucially, this improved performance on the remaining (i.e., nonrelearned) letter positions. However, whereas relearning some associations shows evidence of FLL in some studies [4][6], this is not found in not all studies [7]. This discrepancy may be because the many studies performed to investigate this general phenomenon use a wide variety of different materials and procedures, with some measuring recall and others measuring recognition performance, for example. However, within the realms of psychology, one relevant effect is known as part-set cueing inhibition.

Part-set cueing inhibition [8] occurs when a subject is exposed to part of a set of previously learned items, which is found to reduce recall of nonrelearned items. However, [9] showed that a learned row of words was better recalled if the cues consisted of a subset of words placed in their learned positions than if cue words were placed in other positions. In this case, part-set cueing seems to improve performance, but only if each “part” appears in the spatial position in which it was originally learned. This position-specificity is consistent with the FLL effect reported using the “scrambled keyboard” procedure in [4] but has no obvious concomitant in network models (e.g., [4],[10],[11]).

If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations. Therefore, relearning some old and partially forgotten associations should affect the integrity of other associations learned at about the same time. As noted above, previous work has shown that relearning some forgotten associations does not disrupt other associations, but partially restores them. This FLL effect has also been demonstrated in neural network models ([10],[12]), where it can accelerate evolution of adaptive behaviors [13]. Crucially, in [12], the proof that relearning some associations partially restores other associations assumes that forgetting is caused by the addition of isotropic noise to connection weights, which could result from the cumulative effect of small random changes in connection weights. In contrast, here we prove that if forgetting is induced by shrinking weights towards zero, so that weights “fall” towards the origin, then relearning some associations disrupts other associations.

The protocol used to examine FLL here is the same as that used in [4] and [12] and is as follows (see Figure 1). First, learn a set of n1+n2 associations A = A1A2 consisting of two subsets A1 and A2 of n1 and n2 associations, respectively. After all learned associations A have been partially forgotten, measure performance error on subset A1. Finally, relearn only subset A2 and then remeasure performance on subset A1. FLL occurs if relearning subset A2 improves performance on A1.

Figure 1. Free-lunch learning protocol.

Two subsets of associations A1 and A2 are learned. After partial forgetting (see text), performance error Epre on subset A1 is measured. Subset A2 is then relearned to pre-forgetting levels of performance, and performance error Epost on subset A1 is re-measured. If Epost<Epre then FLL has occurred, and the amount of FLL is δ = EpreEpost. Redrawn from [12].

In order to preclude a common misunderstanding, we emphasize that, for a network with n connection weights, it is assumed that nn1+n2 ; that is, the number of connection weights on each output unit is not less than the number n1+n2 of learned associations. Using the class of linear network models described below, up to n associations can be learned perfectly (see [12]).

The proofs below refer to a network with one output unit. However, these proofs apply to networks with multiple output units, because the n connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.

Definition of Performance Error

Each association consists of an input vector x and a corresponding target value d. For a network with weight vector w, the response to an input vector x is y = w·x. We define the performance error for input vectors x1,…,xk and desired outputs d1,…,dk to be(1)where yi = w·xi is the output response to the input vector xi. By putting X = (x1,…,xk)T, d = (d1,…,dk)T andwe can write Equation 1 succinctly as(2)

The two subsets A1 and A2 consist of n1 and n2 associations, respectively. Let w0 be the network weight vector after A1 and A2 are learned. When A1 and A2 are forgotten, the network weight vector changes to w1, say, and the performance error on A1 becomes Epre = E(X;w1,d). Finally, relearning A2 yields a new weight vector, w2, say, and the performance error on A1 is Epost = E(X;w2,d). Free-lunch learning has occurred if performance error on A1 is less after relearning A2 than it was before relearning A2 (i.e., if Epost<Epre).

Given weight vectors w1 and w2, a matrix X of input vectors, and a vector d of desired outputs, define(3)which we shall also refer to simply as δ.

In previous work [12], we assumed that the “forgetting vector” v (defined as v = w1w0) has an isotropic distribution. Here we shall assume instead that the post-forgetting weight vector w1 is given by(4)for some (possibly random) scalar r, so that(5)and therefore(6)The interpretation of Equation 6 is that forgetting consists of making the optimal weight vector w0 “fall” towards the origin by a falling factor 1−r.


We provide theoretical results, and compare these with results obtained using computer simulations. In essence, our theoretical and simulation results indicate that falling weights induce negative FLL, which decreases with the square of the falling factor 1−r.

Theoretical Results

Our two main theorems are summarised here, and proofs are provided in the Methods section. These theorems apply to a network with n weights which learns n1+n2 associations A = A1A2, and then after partial forgetting, relearns the n2 associations in A2.

We prove that if n1+n2n (so that, in general, the associations A1 and A2 are consistent) and the joint distribution of (X1,d1) is isotropic (where X1 and d1 are the matrix of inputs and the vector of desired outputs for subset A1 of associations) then the expected value of δ is negative (recall that δ is defined in Equation 3). We then prove that the probability P(δ<0) that δ is negative approaches unity as n1 approaches ∞.

Theorem 1

For every non-zero value of r, the expected value of δ given r is negative. More precisely,(7)with equality only in trivial cases, and where the constant of proportionality is guaranteed to be positive. Thus, the expected amount of FLL is negative (or zero).

From a physiological perspective, the case r<1 is obviously of interest because it represents synaptic weight decay. However, from a mathematical perspective, Theorem 1 applies to every value of r, and so it also holds for r>1. In other words, any movement of the weight vector w along the the line connecting w0 to the origin yields an expectation of negative FLL, in accordance with Theorem 1.

Theorem 2

Under mild conditions on the distributions of the input/output pairs (X1,d1) and (X2,d2),(8)where x and are any columns of and , respectively, and

Theorem 2 implies that, if (i) the number (n1) of associations in A1 is a fixed non-zero proportion ( n1/n ) of the number n of connection weights, (ii) E[∥d12]E[∥d2−2] is bounded as n → ∞, and (iii) γ(n) → 0 as n → ∞ then P(δ>0) → 0 as n → ∞, i.e., the amount of FLL is negative, with a probability which tends to 1 as n → ∞.

For example, if we assume that (i) each input vector x = (x1,…,xn) is chosen from an isotropic Gaussian distribution and (ii) the variance of xi is then γ(n) = 2/n, , and E[∥d12]E[∥d2−2] = n1/(n2−1). This ensures that P(δ>0) → 0 as n → ∞.

Simulation Results

Simulation was carried out on a network with n input units and one output unit. The set A of associations consisted of k input vectors (x1,…,xk) and k corresponding desired scalar output values (d1,…,dk). Each input vector comprised n elements x = (x1,…,xn). The values of xi and di were chosen from a Gaussian distribution with unit variance (i.e., ). A network's output yi is a weighted sum of input values , where xij is the jth component of the ith input vector xi, and each weight wj is the connection between the jth input unit and the output unit.

Given that the network error for a given set of k associations is , the derivative of E with respect to w yields the delta learning rule , where η is the learning rate, which is adjusted according to the number of weights.

However, in order to save time, we used an equivalent learning method. Learning of the k = n associations in A = A1A2 was performed by solving a set of n simultaneous equations using a standard method, after which the weight vector w0 was obtained; this provided perfect performance on all n associations. Partial forgetting was induced by making weights “fall” towards the origin w1 = rw0, after which performance error was Epre. Relearning the n2 = n/2 associations in A2 was implemented with k = n2 as above, after which performance error was Epost.

In each simulation, each value in each input vector xi, and each target value di was chosen from the same isotropic gaussian distribution with unit variance. There were 100 input units, and one output unit. The subsets A1 and A2 each consisted of 50 associations. The value of δ = EpreEpost was obtained in each of 100 simulations, using a different random seed for each simulation. In Figure 2, the mean of 100 values of δ is shown for various values of the falling factor 1−r.

Figure 2. Free-lunch learning decreases as the network's weight vector falls toward the origin.

A network with 100 input units and one output unit learns two subsets A1 and B2, each of which consists of 50 associations. After learning A1 and A2, the network has a weight vector w = w0, but after partial forgetting, the weight vector is w = w1. If forgetting consists of subtracting a proportion 1−r of w0 such that w1 = w0−(1−r)w0 then the weight vector “falls” towards the origin; the factor 1−r is called the falling factor. After forgetting, performance error on A1 is Epre, an error which changes to Epost after relearning A2, where this change is δ = EpreEpost. Given that there are A1 associations in A1, the expected free-lunch learning per association in A1 is therefore E[δ/n1|r]. Solid curve: the expected FLL, E[δ/n1|r], where this expectation is taken over 100 computer simulations. Dashed curve: theoretical prediction of E[δ/n1|r] (see Equation 7), using a constant of proportionality equal to unity, so that the predicted free-lunch learning is Epredict[δ/n1|r] = −(1−r)2. As predicted, free-lunch learning E[δ/n1|r] becomes more negative as the falling factor 1−r increases.

The Geometry of Forgetting

We present a brief account of the geometry which underpins the results reported here, for a network with two input units and one output unit, as shown in Figure 3A. This network learns two associations A1 = (X1,d1) and A2 = (X2,d2).

Figure 3. Geometric example of how relearning A2 increases the error on A1.

(A) A network with two input units and one output unit, with connection weights ωa and ωb defines a weight vector w = (ωa,ωb). The network learns two associations A1 and A2. For example, A1 is the mapping from input vector x1 = (x11,x12) to desired output value d1, and learning A1 consists of adjusting w until the network output y1 = w·x1 equals d1. (B) For a given association A2 = (X2,d2), the corresponding constraint line in the space defined by (ωa,ωb) is L2. Irrespective of the precise value of the target output value d1 in association A1, if d1 is distributed isotropically then +d1 is as probable as −d1. When averaged over +d1 and −d1, the change δ in error on A1 induced by relearning A2 can be shown to be −(1−r)2e2, where w1± = rw0±. Since this is less than zero, the expected change E[δ|r]<0. (Figure 3A redrawn from [12]).

Figure 3B provides a geometric example of how relearning A2 increases the error on A1. After learning A1 and A2, w = w0. The effects of forgetting and relearning can be seen by ignoring the ± superscripts and subscripts for now. After partial forgetting, w = w1, and performance error Epre = p2. Relearning A2 yields w2, the orthogonal projection of w1 on to L2, and performance error is Epost = q2. FLL occurs if δ = EpreEpost>0, or equivalently if p2q2>0 (see [12], Appendices A–C for proofs). Forgetting here consists of reducing w0 by a factor r<1, so that w1 = rw0.

The plus and minus signs in Figure 3B refer to two versions and of association A1, in which X1 is the same and the target d1 has the same magnitude, but opposite signs: and .

We now find the expected change in error induced by relearning a given association A2. After learning followed by forgetting, the change in error on after relearning A2 is . After learning followed by forgetting, the change in error on after relearning A2 is . Using similar triangles in Figure 3B,(9)(10)Therefore, the total change in error on and induced by relearning A2 (on different occasions) is(11)(12)(13)Irrespective of the precise value of the target output value d1 in A1, if the distribution of d1 is isotropic then +d1 is as probable as −d1. If the total change in error for two instances ( and ) of A1 is −2(1−r)2e2 then the expected change (conditional on e ) is E[δ|e] = −(1−r)2e2. Therefore, if forgetting is induced by falling weight values, then the expected change in error E[δ]<0.


We have proved and demonstrated that, in one of the simplest forms of neural network model, relearning part of a previously learned set of associations reduces performance on the remaining non-relearned associations. This result is in stark contrast to our previous results, which proved that relearning induced partial recovery of non-relearned items [12]. The only difference between these two studies is the way in which forgetting was induced.

An obvious physiological concomitant of Hebbian learning is long-term potentiation (LTP), which seems to underpin learned behaviors [14]. LTP can last for hours, days or even months, and usually follows an exponential decay [3]. However, some forms of LTP do not seem to decay [15], and have been shown to be stable for up to one year [16]. Such stability is remarkable, but from a statistical point of view, would almost certainly be accompanied by random fluctuations which would have a cumulative effect over time; and indeed, fluctuations are apparent in the stable LTP reported in [16]. Crucially, it is not known if the forgetting of learned behaviors is caused by decaying efficacy at many synapses, or by the cumulative effect of random fluctuations in stable LTP-induced synaptic efficacies. Here, decaying efficacy is analogous to weight values that fall toward zero in network models, whereas the cumulative effect of random fluctuations is analogous to the addition of random noise, or drifting, of weight values in network models.

Given a choice between forgetting via synaptic weights that fall towards zero and weights that drift isotropically, has evolution chosen drifting or falling? If all other things were equal then forgetting via synaptic drift would seem to be the obvious choice. This is because drifting ensures that relearning a subset of associations improves performance on other associations, whereas falling decreases performance. However, other things are rarely equal. The expected magnitude of weights increases with drifting but decreases with falling. (Consider a hypersphere centered on the origin, with radius ∥w0∥ . Simple geometry shows that more than half of all directions emanating from w0 yield a new weight vector w1 which lies outside the hypersphere, and therefore E[∥w1∥]>E[∥w0∥] (assuming, for example, that all vectors w1w0 have the same length).) This decrease in weight magnitudes effectively reduces neuronal firing rates, which reduces metabolic costs relative to costs incurred by synaptic drift. Synaptic drift therefore confers mnemonic benefits, but these benefits come at a metabolic price. Thus the increased fitness gained from the mnemonic benefits of synaptic drift must be offset against their metabolic costs. In essence, even free-lunch learning comes at a price.


We proceed by deriving expressions for Epre, Epost, and for δ = Epre−Epost. We prove that if n1+n2n then the expected value of δ is negative. We then prove that the probability P(δ<0) that δ is negative approaches unity as n1 approaches ∞.

Performance Errors

Given a c×n matrix X and a c -dimensional vector d, let LX,d be the affine subspaceof . If X and d are consistent (i.e., there is a w such that Xw = d) thenGiven weight vectors w1 and w2, a matrix X of input vectors, and a vector d of desired outputs, definewhere Epre = E(X;w1,d) and Epost = E(X;w2,d). Let be any element of LX,d. Then (14)

If Xi has rank ni then transposing the QR decomposition of (or, equivalently, using Gram–Schmidt orthonormalisation of the rows of Xi) givesfor unique ni×ni and ni×n matrices Ti and Zi with Ti lower triangular with positive diagonal elements, and . Simple calculation shows that, for any weight vector w, and are orthogonal. Since , it follows that the matrix represents the operator that projects orthogonally onto the image of . Because(15)the image of is contained in that of . As both these images have dimension ni, they must be equal, and so represents the operator which projects orthogonally onto the image of .

Now suppose that X and d are consistent, where

Then, after the network has learned A1 and A2, the weight vector w0 satisfies(16)(If, as below, n1+n2n, X2 and d2 are consistent, and (X1,d1) has a continuous distribution then Equation 16 holds with probability 1.)


We now assume that forgetting is induced by weight values “falling” towards the origin at zero, i.e., forgetting consists of shrinking the weight vector w0 by a (possibly random) factor r towards the “dead state” 0. Thus the post-forgetting weight vector w1 is given by(17)and so the “forgetting vector” v = w1w0 is(18)

The form of forgetting given by Equation 17 is very different from that investigated in [12], where v has an isotropic distribution and is independent of (X1,d1) and (X2,d2).

Let w2 be the orthogonal projection of w1 onto L2. Then

Manipulation gives(19)and so(20)

Then Equations 14, 16, and 18–20 yield(21)

The Case of Isotropic Random (X1,d1)

In this section we assume that the distribution of (X1,d1) is isotropic, i.e., that (UX1V,Ud1) has the same distribution as (X1,d1) for all orthogonal n1×n1 matrices U and all orthogonal n×n matrices V. Then taking the conditional expectation of Equation 21 for given X2, d2, and r gives the following theorem.

Theorem 1


  1. n1+n2n,
  2. X2 and d2 are consistent,
  3. the distribution of (X1,d1) is continuous and isotropic,
  4. X1, d1, and (X2,d2,r) are independent.

then(22)where x is any column of .

Corollary 1

If 1.-3. of Theorem 1 hold then(23)with equality if and only if either r = 1 or d2 = 0.

Corollary 1 says that (apart from trivial exceptions) the expected amount of FLL is negative.

To obtain Theorem 2, it is useful to have some moments of isotropic distributions. Let x be isotropically distributed on . Then Equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield(24)(25)as in Equations A.14 and A.15 of [12].

The other tool used in proving Theorem 2 is the formula(26)for any random variables X,Y,Z for which these quantities exist. Equation 26 is an application to the conditional distribution of Y|Z of the standard conditional variance formula that is given in Equation 2b.3.6 on page 97 of [17].

Taking the expectation and variance of Equation 21 as only d1 varies and using Equation 24 gives(27)(28)

Taking the expectation of Equation 28 as only X1 varies and using Equation 24 gives(29)

We now suppose that(30)

Then taking the variance of Equation 27 as only X1 varies and using Equation 25 gives(31)

Adding Equations 29 and 30 and using Equation 26 yields(32)

To obtain an upper bound on the conditional probability of FLL (i.e., on P(δ≥0|X2,d2,r)), we use Chebyshev's inequality, which states that, for any random variable Y and any positive value of t

Applying Chebyshev's inequality to the conditional distribution of δ(w1,w2,X1,d1) given (X2,d2,r), taking t = E[δ(w1,w2;X1,d1)|X2,d2,r], and noting that (by Equation 23) t≤0, we obtain(33)

Substituting Equations 22 and 32 into Equation 33 gives(34)where

For any positive-definite symmetric matrix A and vector x, diagonalization of A, together with the fact that x+1/x≥2 for positive x, yields(35)

Combining Equations 34 and 35 with the fact that gives(36)

Taking the expectation of Equation 36 over X2 yields(37)where x and are any columns of and , respectively.

Taking the expectation of Equation 37 over d2 and r yields the following theorem.

Theorem 2

If (a) conditions 1.-4. of Theorem 1 hold, (b) the columns of are distributed independently, (c) X2, d2, and r are independent, (d) the distribution of (X2,d2) is isotropic, and (e) E[∥d2−2] is finite then(38)where x and are any columns of and , respectively, and

Corollary 2

If the conditions of Theorem 2 hold andwhere x and are any columns of and , respectively, then

Thusprovided that n1/n and n2/n are bounded away from zero.


Thanks to David Sterratt for asking, “What would happen to free-lunch learning if weights decayed?” and to three anonymous reviewers for their detailed comments.

Author Contributions

Conceived and designed the experiments: JS. Performed the experiments: JS. Analyzed the data: JS. Contributed reagents/materials/analysis tools: JS. Wrote the paper: JS PEJ. Mathematical proofs: PEJ.


  1. 1. Tanzi E (1893) I fatti e le induzioni nellodierna isologia del sistema nervosa. Riv Sper Freniatr Med Leg 19: 419–472.
  2. 2. Hebb D (1949) The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.
  3. 3. Abraham W (2003) How long will long-term potentiation last? Philos Trans R Soc Lond B Biol Sci 358: 735–744.
  4. 4. Stone J, Hunkin N, Hornby A (2001) Predicting spontaneous recovery of memory. Nature 414: 167–168.
  5. 5. Coltheart M, Byng S (1989) A treatment for surface dyslexia. In: Seron X, editor. Cognitive Approaches in Neuropsychological Rehabilitation. London: Lawrence Erlbaum Associates.
  6. 6. Weekes B, Coltheart M (1996) Surface dyslexia and surface dysgraphia: treatment studies and their theoretical implications. Cogn Neuropsychol 13: 277–315.
  7. 7. Atkins P (2001) What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychol Res 65: 202–215.
  8. 8. Roediger H III (1973) Inhibition in recall from cueing with recall targets. J Verbal Learn Verbal Behav 12: 644–657.
  9. 9. Serra M, Nairne J (2000) Part-set cuing of order information: implications for associative theories. Mem Cognit 28: 847–855.
  10. 10. Hinton G, Plaut D (1987) Using fast weights to deblur old memories. Proceedings Ninth Annual Conference of the Cognitive Science Society. pp. 177–186.
  11. 11. Atkins P, Murre J (1998) Recovery of unrehearsed items in connectionist models. Connect Sci 10: 99–119.
  12. 12. Stone J, Jupp P (2007) Free-lunch learning: modelling spontaneous recovery of memory. Neural Comput 19: 194–217.
  13. 13. Stone J (2007) Distributed representations accelerate evolution of adaptive behaviours. PLoS Comput Biol 3: e147.
  14. 14. Whitlock J, Heynen A, Shuler M, Bear M (2006) Learning induces long-term potentiation in the hippocampus. Science 313: 1093–1097.
  15. 15. Staubli U, Lynch G (1987) Stable hippocampal long-term potentiation elicited by theta pattern stimulation. Brain Res 435: 227–234.
  16. 16. Abraham WC, Logan B, Greenwood JM, Dragunow M (2002) Induction and experience-dependent consolidation of stable long-term potentiation lasting months in the hippocampus. J Neurosci 22: 9626–9634.
  17. 17. Rao C (1973) Linear Statistical Inference and its Applications. 2nd edition. New York: Wiley.