Falling towards Forgetfulness: Synaptic Decay Prevents Spontaneous Recovery of Memory

Long after a new language has been learned and forgotten, relearning a few words seems to trigger the recall of other words. This “free-lunch learning” (FLL) effect has been demonstrated both in humans and in neural network models. Specifically, previous work proved that linear networks that learn a set of associations, then partially forget them all, and finally relearn some of the associations, show improved performance on the remaining (i.e., nonrelearned) associations. Here, we prove that relearning forgotten associations decreases performance on nonrelearned associations; an effect we call negative free-lunch learning. The difference between free-lunch learning and the negative free-lunch learning presented here is due to the particular method used to induce forgetting. Specifically, if forgetting is induced by isotropic drifting of weight vectors (i.e., by adding isotropic noise), then free-lunch learning is observed. However, as proved here, if forgetting is induced by weight values that simply decay or fall towards zero, then negative free-lunch learning is observed. From a biological perspective, and assuming that nervous systems are analogous to the networks used here, this suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift, but not synaptic decay, yields free-lunch learning.


Introduction
The idea that structural changes underpin the formation of new memories can be traced to the 19th century [1]. More recently, Hebb proposed that “When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased” [2]. It is now widely accepted that learning involves some form of Hebbian adaptation, and a growing body of evidence suggests that Hebbian adaptation is associated with the long-term potentiation (LTP) observed in neuronal systems [3]. LTP is an increase in synaptic efficacy which occurs in the presence of pre-synaptic and post-synaptic activity, and can be specific to a single synapse. One consequence of Hebbian adaptation is that information regarding a specific association is distributed amongst many synaptic connections, and therefore gives rise to a distributed representation of each association.
In [4], participants learned the layout of letters on a “scrambled” keyboard. After a period of forgetting, participants relearned a subset of letter positions. Crucially, this improved performance on the remaining (i.e., nonrelearned) letter positions. However, whereas relearning some associations shows evidence of FLL in some studies [4][5][6], this is not found in all studies [7]. This discrepancy may arise because the many studies of this general phenomenon use a wide variety of materials and procedures, with some measuring recall and others measuring recognition performance, for example. Within psychology, one relevant effect is known as part-set cueing inhibition.
Part-set cueing inhibition [8] occurs when a subject is exposed to part of a set of previously learned items, which is found to reduce recall of nonrelearned items. However, [9] showed that a learned row of words was better recalled if the cues consisted of a subset of words placed in their learned positions than if cue words were placed in other positions. In this case, part-set cueing seems to improve performance, but only if each “part” appears in the spatial position in which it was originally learned. This position specificity is consistent with the FLL effect reported using the “scrambled keyboard” procedure in [4] but has no obvious concomitant in network models (e.g., [4,10,11]).
If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations. Therefore, relearning some old and partially forgotten associations should affect the integrity of other associations learned at about the same time. As noted above, previous work has shown that relearning some forgotten associations does not disrupt other associations, but partially restores them. This FLL effect has also been demonstrated in neural network models ([10,12]), where it can accelerate evolution of adaptive behaviors [13]. Crucially, in [12], the proof that relearning some associations partially restores other associations assumes that forgetting is caused by the addition of isotropic noise to connection weights, which could result from the cumulative effect of small random changes in connection weights. In contrast, here we prove that if forgetting is induced by shrinking weights towards zero, so that weights “fall” towards the origin, then relearning some associations disrupts other associations.
The protocol used to examine FLL here is the same as that used in [4] and [12] and is as follows (see Figure 1). First, learn a set of $n_1 + n_2$ associations $A = A_1 \cup A_2$ consisting of two subsets $A_1$ and $A_2$ of $n_1$ and $n_2$ associations, respectively. After all learned associations $A$ have been partially forgotten, measure performance error on subset $A_1$. Finally, relearn only subset $A_2$ and then remeasure performance on subset $A_1$. FLL occurs if relearning subset $A_2$ improves performance on $A_1$.
In order to preclude a common misunderstanding, we emphasize that, for a network with $n$ connection weights, it is assumed that $n \ge n_1 + n_2$; that is, the number of connection weights on each output unit is not less than the number $n_1 + n_2$ of learned associations. Using the class of linear network models described below, up to $n$ associations can be learned perfectly (see [12]).
The proofs below refer to a network with one output unit. However, these proofs apply to networks with multiple output units, because the n connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.

Definition of Performance Error
Each association consists of an input vector $x$ and a corresponding target value $d$. For a network with weight vector $w$, the response to an input vector $x$ is $y = w \cdot x$. We define the performance error for input vectors $x_1, \ldots, x_k$ and desired outputs $d_1, \ldots, d_k$ to be

$$E(x_1, \ldots, x_k; w, d_1, \ldots, d_k) = \sum_{i=1}^{k} (w \cdot x_i - d_i)^2, \qquad (1)$$

where $y_i = w \cdot x_i$ is the output response to the input vector $x_i$. By putting $X = (x_1, \ldots, x_k)^T$ and $d = (d_1, \ldots, d_k)^T$, we can write Equation 1 succinctly as

$$E(X; w, d) = \|Xw - d\|^2. \qquad (2)$$

The two subsets $A_1$ and $A_2$ consist of $n_1$ and $n_2$ associations, respectively. Let $w_0$ be the network weight vector after $A_1$ and $A_2$ are learned. When $A_1$ and $A_2$ are forgotten, the network weight vector changes to $w_1$, say, and the performance error on $A_1$ becomes $E_{\mathrm{pre}} = E(X_1; w_1, d_1)$. Finally, relearning $A_2$ yields a new weight vector, $w_2$, say, and the performance error on $A_1$ is $E_{\mathrm{post}} = E(X_1; w_2, d_1)$. Free-lunch learning has occurred if performance error on $A_1$ is less after relearning $A_2$ than it was before relearning $A_2$ (i.e., if $E_{\mathrm{post}} < E_{\mathrm{pre}}$).
Given weight vectors $w_1$ and $w_2$, a matrix $X$ of input vectors, and a vector $d$ of desired outputs, define

$$\delta(w_1, w_2; X, d) = E(X; w_1, d) - E(X; w_2, d), \qquad (3)$$

which we shall also refer to simply as $\delta$.
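The two quantities just defined can be written as a pair of small functions. This is a minimal sketch that assumes the performance error is the sum of squared output errors, $\|Xw - d\|^2$; the function names `performance_error` and `fll_amount` and the numerical values are illustrative only and do not come from the paper.

```python
import numpy as np

def performance_error(X, w, d):
    """Sum of squared output errors ||Xw - d||^2 (assumed form of E)."""
    residual = X @ w - d
    return float(residual @ residual)

def fll_amount(w1, w2, X, d):
    """delta = E(X; w1, d) - E(X; w2, d); positive values mean free-lunch learning."""
    return performance_error(X, w1, d) - performance_error(X, w2, d)

# Tiny hand-checkable case: identity inputs, so the error reduces to ||w - d||^2.
X = np.eye(2)
d = np.array([1.0, 2.0])
w0 = np.array([1.0, 2.0])   # perfect weights: zero error
w1 = 0.5 * w0               # hypothetical weights after forgetting

if __name__ == "__main__":
    print(performance_error(X, w0, d))  # error before forgetting
    print(performance_error(X, w1, d))  # error after forgetting
```

With these values the error after forgetting is $(0.5 - 1)^2 + (1 - 2)^2 = 1.25$, which can be checked by hand.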
In previous work [12], we assumed that the “forgetting vector” $v$ (defined as $v = w_1 - w_0$) has an isotropic distribution. Here we shall assume instead that the post-forgetting weight vector $w_1$ is given by

$$w_1 = r w_0 \qquad (4)$$

for some (possibly random) scalar $r$, so that

$$v = (r - 1) w_0 \qquad (5)$$

and therefore

$$w_1 = w_0 - (1 - r) w_0. \qquad (6)$$

The interpretation of Equation 6 is that forgetting consists of making the optimal weight vector $w_0$ “fall” towards the origin by a falling factor $1 - r$.

Figure 1. Free-lunch learning protocol. Two subsets of associations $A_1$ and $A_2$ are learned. After partial forgetting (see text), performance error $E_{\mathrm{pre}}$ on subset $A_1$ is measured. Subset $A_2$ is then relearned to pre-forgetting levels of performance, and performance error $E_{\mathrm{post}}$ on subset $A_1$ is re-measured. If $E_{\mathrm{post}} < E_{\mathrm{pre}}$ then FLL has occurred, and the amount of FLL is $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$. Redrawn from [12]. doi:10.1371/journal.pcbi.1000143.g001

Author Summary
If you learn a skill, then partially forget it, does relearning part of that skill induce recovery of other parts of the skill? More generally, if you learn a set of associations, then partially forget them, does relearning a subset induce recovery of the remaining associations? In previous work, in which participants learned the layout of a scrambled computer keyboard, the answer to this question appeared to be “yes.” More recently, we modeled this “free-lunch learning” effect using artificial neural networks, in which the synaptic strength between each pair of model neurons is a connection weight. We proved that if forgetting is induced by allowing each weight value to drift randomly, then free-lunch learning is almost inevitable. However, if, after learning a set of associations, forgetting is induced by allowing each connection weight to decay or fall toward zero, then relearning a subset of associations decreases performance on the remaining associations. This suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift yields free-lunch learning, whereas decay does not.

Results
We provide theoretical results, and compare these with results obtained using computer simulations. In essence, our theoretical and simulation results indicate that falling weights induce negative FLL, which decreases with the square of the falling factor $1 - r$.

Theoretical Results
Our two main theorems are summarised here, and proofs are provided in the Methods section. These theorems apply to a network with $n$ weights which learns $n_1 + n_2$ associations $A = A_1 \cup A_2$, and then after partial forgetting, relearns the $n_2$ associations in $A_2$.
We prove that if $n_1 + n_2 \le n$ (so that, in general, the associations $A_1$ and $A_2$ are consistent) and the joint distribution of $(X_1, d_1)$ is isotropic (where $X_1$ and $d_1$ are the matrix of inputs and the vector of desired outputs for subset $A_1$ of associations) then the expected value of $\delta$ is negative (recall that $\delta$ is defined in Equation 3). We then prove that the probability $P(\delta < 0)$ that $\delta$ is negative approaches unity as $n_1$ approaches $\infty$.

Theorem 1
For every non-zero value of $r$, the expected value of $\delta$ given $r$ is negative. More precisely,

$$E[\delta \mid r] \le 0,$$

with equality only in trivial cases, and

$$E[\delta \mid r] \propto -(1 - r)^2,$$

where the constant of proportionality is guaranteed to be positive. Thus, the expected amount of FLL is negative (or zero). From a physiological perspective, the case $r < 1$ is obviously of interest because it represents synaptic weight decay. However, from a mathematical perspective, Theorem 1 applies to every value of $r$, and so it also holds for $r > 1$. In other words, any movement of the weight vector $w$ along the line connecting $w_0$ to the origin yields an expectation of negative FLL, in accordance with Theorem 1.

Theorem 2
Under mild conditions on the distributions of the input/output pairs $(X_1, d_1)$ and $(X_2, d_2)$, […] where $x$ and $\tilde{x}$ are any columns of $X_1^T$ and $X_2^T$, respectively, and […] Theorem 2 implies that, if (i) the number ($n_1$) of associations in $A_1$ is a fixed non-zero proportion ($n_1/n$) of the number $n$ of connection weights, (ii) $E[\|d_1\|^2]\,E[\|d_2\|^{-2}]$ is bounded as $n \to \infty$, and (iii) $c(n) \to 0$ as $n \to \infty$, then $P(\delta > 0) \to 0$ as $n \to \infty$, i.e., the amount of FLL is negative, with a probability which tends to 1 as $n \to \infty$.
For example, if we assume that (i) each input vector $x = (x_1, \ldots, x_n)$ is chosen from an isotropic Gaussian distribution and […] $E[\|d_1\|^2]\,E[\|d_2\|^{-2}] = n_1/(n_2 - 1)$. This ensures that $P(\delta > 0) \to 0$ as $n \to \infty$.
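The limiting behaviour described by Theorem 2 can be probed numerically. The sketch below is not the paper's code: it assumes Gaussian inputs and targets with unit variance, $n_1 = n_2 = n/2$, a fixed $r = 0.5$, and it implements relearning as an orthogonal projection onto the constraints of $A_2$ using the pseudoinverse. The names `delta_falling` and `fraction_negative` are hypothetical.

```python
import numpy as np

def delta_falling(rng, n, r=0.5):
    """One trial of learn / fall / relearn; returns delta = E_pre - E_post."""
    n1 = n // 2
    X = rng.standard_normal((n, n))
    d = rng.standard_normal(n)
    w0 = np.linalg.solve(X, d)                 # perfect learning of all n associations
    X1, d1, X2, d2 = X[:n1], d[:n1], X[n1:], d[n1:]
    w1 = r * w0                                # forgetting by falling
    # Relearning A2: orthogonal projection of w1 onto {w : X2 w = d2} (assumption).
    w2 = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)
    e_pre = np.sum((X1 @ w1 - d1) ** 2)
    e_post = np.sum((X1 @ w2 - d1) ** 2)
    return float(e_pre - e_post)

def fraction_negative(n, trials=200, seed=0):
    """Estimate P(delta < 0) for a network with n weights."""
    rng = np.random.default_rng(seed)
    return float(np.mean([delta_falling(rng, n) < 0 for _ in range(trials)]))

if __name__ == "__main__":
    for n in (10, 40, 160):
        print(n, fraction_negative(n))
```

Under these assumptions the estimated fraction of trials with negative $\delta$ should grow towards 1 as $n$ increases.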

Simulation Results
Simulation was carried out on a network with $n$ input units and one output unit. The set $A$ of associations consisted of $k$ input vectors $(x_1, \ldots, x_k)$ and $k$ corresponding desired scalar output values $(d_1, \ldots, d_k)$. Each input vector comprised $n$ elements $x = (x_1, \ldots, x_n)$. The values of $x_i$ and $d_i$ were chosen from a Gaussian distribution with unit variance (i.e., $\sigma_x^2 = \sigma_d^2 = 1$). A network's output $y_i$ is a weighted sum of input values $y_i = w \cdot x_i = \sum_{j=1}^{n} w_j x_{ij}$, where $x_{ij}$ is the $j$th component of the $i$th input vector $x_i$, and each weight $w_j$ is the connection between the $j$th input unit and the output unit.
Given that the network error for a given set of $k$ associations is $E(X; w, d) = \sum_{i=1}^{k} (w \cdot x_i - d_i)^2$, the derivative $\nabla E(w) = 2 \sum_{i=1}^{k} (w \cdot x_i - d_i)\, x_i$ of $E$ with respect to $w$ yields the delta learning rule $w_{\mathrm{new}} = w_{\mathrm{old}} - \eta \nabla E(w_{\mathrm{old}})$, where $\eta$ is the learning rate, which is adjusted according to the number of weights.
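The delta rule can be sketched as plain gradient descent on the sum of squared errors. The sketch below makes two assumptions that are not in the text: the step size is set from the largest singular value of the input matrix (the paper only says the rate depends on the number of weights), and the dimensions are small arbitrary values. It also checks that, when there are fewer associations than weights, descent from a starting vector converges to the orthogonal projection of that vector onto the constraint set.

```python
import numpy as np

def delta_rule(X, d, w_start, steps=5000):
    """Gradient descent on E(w) = ||Xw - d||^2 starting from w_start."""
    # Step size chosen from the spectral norm of X (an assumption) so descent is stable.
    eta = 0.5 / np.linalg.norm(X, 2) ** 2
    w = w_start.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - d)   # gradient of the squared error
        w = w - eta * grad
    return w

rng = np.random.default_rng(0)
X2 = rng.standard_normal((5, 20))        # 5 associations, 20 weights (illustrative sizes)
d2 = rng.standard_normal(5)
w1 = rng.standard_normal(20)             # arbitrary starting weights

w_gd = delta_rule(X2, d2, w1)
# Closed-form orthogonal projection of w1 onto {w : X2 w = d2}.
w_proj = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)

if __name__ == "__main__":
    print(np.max(np.abs(w_gd - w_proj)))
```

Because each gradient step moves the weights only within the row space of the input matrix, the component of the starting vector orthogonal to that space is left untouched, which is why the limit is an orthogonal projection.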
However, in order to save time, we used an equivalent learning method. Learning of the $k = n$ associations in $A = A_1 \cup A_2$ was performed by solving a set of $n$ simultaneous equations using a standard method, after which the weight vector $w_0$ was obtained; this provided perfect performance on all $n$ associations. Partial forgetting was induced by making weights “fall” towards the origin, $w_1 = r w_0$, after which performance error was $E_{\mathrm{pre}}$. Relearning the $n_2 = n/2$ associations in $A_2$ was implemented with $k = n_2$ as above, after which performance error was $E_{\mathrm{post}}$.
In each simulation, each value in each input vector $x_i$, and each target value $d_i$, was chosen from the same isotropic Gaussian distribution with unit variance. There were 100 input units, and one output unit. The subsets $A_1$ and $A_2$ each consisted of 50 associations. The value of $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$ was obtained in each of 100 simulations, using a different random seed for each simulation. In Figure 2, the mean of 100 values of $\delta$ is shown for various values of the falling factor $1 - r$.
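The simulation described above can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' code: relearning is assumed to be the orthogonal projection of $w_1$ onto the solutions of $X_2 w = d_2$ (computed with the pseudoinverse), a single seeded generator is used for all trials rather than one seed per trial, and the function names are hypothetical.

```python
import numpy as np

def sse(X, w, d):
    """Performance error ||Xw - d||^2."""
    residual = X @ w - d
    return float(residual @ residual)

def fll_trial(rng, r, n=100, n1=50):
    """Learn n associations, let the weights fall by factor r, relearn A2; return delta."""
    X = rng.standard_normal((n, n))       # unit-variance Gaussian inputs
    d = rng.standard_normal(n)            # unit-variance Gaussian targets
    w0 = np.linalg.solve(X, d)            # perfect performance on all n associations
    X1, d1 = X[:n1], d[:n1]               # subset A1
    X2, d2 = X[n1:], d[n1:]               # subset A2
    w1 = r * w0                           # partial forgetting: w1 = r * w0
    e_pre = sse(X1, w1, d1)
    w2 = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)   # relearn A2 (assumed projection)
    e_post = sse(X1, w2, d1)
    return e_pre - e_post

def mean_delta(r, trials=100, seed=0):
    """Mean of delta over independent trials for a given r."""
    rng = np.random.default_rng(seed)
    return float(np.mean([fll_trial(rng, r) for _ in range(trials)]))

if __name__ == "__main__":
    for falling_factor in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(falling_factor, mean_delta(1.0 - falling_factor))
```

Because the same seed is reused for every value of $r$, the trials differ only in the falling factor, so the mean of $\delta$ scales exactly with $(1 - r)^2$ in this sketch.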

The Geometry of Forgetting
We present a brief account of the geometry which underpins the results reported here, for a network with two input units and one output unit, as shown in Figure 3A. This network learns two associations $A_1 = (X_1, d_1)$ and $A_2 = (X_2, d_2)$. Figure 3B provides a geometric example of how relearning $A_2$ increases the error on $A_1$. After learning $A_1$ and $A_2$, $w = w_0$. The effects of forgetting and relearning can be seen by ignoring the $\pm$ superscripts and subscripts for now. After partial forgetting, $w = w_1$, and performance error $E_{\mathrm{pre}} = p^2$. Relearning $A_2$ yields $w_2$, the orthogonal projection of $w_1$ on to $L_2$, and performance error is $E_{\mathrm{post}} = q^2$. FLL occurs if $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}} > 0$, or equivalently if $p^2 - q^2 > 0$ (see [12], Appendices A-C for proofs). Forgetting here consists of reducing $w_0$ by a factor $r < 1$, so that $w_1 = r w_0$.
The plus and minus signs in Figure 3B refer to two versions $A_1^+$ and $A_1^-$ of association $A_1$, in which $X_1$ is the same and the target $d_1$ has the same magnitude, but opposite signs: […] In Figure 3B, […] Therefore, the total change in error on $A_1^+$ and $A_1^-$ induced by relearning $A_2$ (on different occasions) is $-2(1 - r)^2 e^2$. Irrespective of the precise value of the target output value $d_1$ in $A_1$, if the distribution of $d_1$ is isotropic then $+d_1$ is as probable as $-d_1$. If the total change in error for two instances ($A_1^+$ and $A_1^-$) of $A_1$ is $-2(1 - r)^2 e^2$ then the expected change (conditional on $e$) is $E[\delta \mid e] = -(1 - r)^2 e^2$. Therefore, if forgetting is induced by falling weight values, then the expected change in error $E[\delta] < 0$.
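The pairing argument can be checked numerically for a two-weight network. In the sketch below the vectors and targets are arbitrary illustrative values, and the quantity $(x_1 \cdot u)^2$, with $u$ the point of $L_2$ closest to the origin, is used as our reading of the role played by $e^2$ in the text; that identification is an assumption rather than something the source states.

```python
import numpy as np

def delta_after_relearning(x1, d1, x2, d2, r):
    """delta = E_pre - E_post on A1 for a two-weight network with falling factor 1 - r."""
    X = np.vstack([x1, x2])
    w0 = np.linalg.solve(X, np.array([d1, d2]))   # learn both associations exactly
    w1 = r * w0                                   # forgetting by falling
    # Orthogonal projection of w1 onto the line L2 = {w : x2 . w = d2}.
    w2 = w1 - x2 * (x2 @ w1 - d2) / (x2 @ x2)
    e_pre = (x1 @ w1 - d1) ** 2
    e_post = (x1 @ w2 - d1) ** 2
    return float(e_pre - e_post)

# Illustrative values (not from the paper).
x1 = np.array([1.0, 0.0])
x2 = np.array([0.6, 0.8])
d1, d2, r = 1.5, 1.0, 0.5

delta_plus = delta_after_relearning(x1, +d1, x2, d2, r)
delta_minus = delta_after_relearning(x1, -d1, x2, d2, r)

u = x2 * d2 / (x2 @ x2)                           # point of L2 closest to the origin
predicted_total = -2.0 * (1.0 - r) ** 2 * (x1 @ u) ** 2

if __name__ == "__main__":
    print(delta_plus, delta_minus, delta_plus + delta_minus, predicted_total)
```

In this example one sign of the target shows a positive change and the other a larger negative change, so an individual case can exhibit FLL even though the pair together does not.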

Discussion
We have proved and demonstrated that, in one of the simplest forms of neural network model, relearning part of a previously learned set of associations reduces performance on the remaining non-relearned associations. This result is in stark contrast to our previous results, which proved that relearning induced partial recovery of non-relearned items [12]. The only difference between these two studies is the way in which forgetting was induced.
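The contrast between the two forgetting mechanisms can be illustrated by running the same protocol with two different forgetting functions. This is a sketch under stated assumptions: the drift is isotropic Gaussian noise with an arbitrary standard deviation of 0.1, $r = 0.5$ for falling, relearning is an orthogonal projection computed with the pseudoinverse, and all function names are hypothetical.

```python
import numpy as np

def fll_delta(forget, rng, n=100, n1=50):
    """Run learn / forget / relearn once and return delta = E_pre - E_post on A1."""
    X = rng.standard_normal((n, n))
    d = rng.standard_normal(n)
    w0 = np.linalg.solve(X, d)
    X1, d1, X2, d2 = X[:n1], d[:n1], X[n1:], d[n1:]
    w1 = forget(w0, rng)
    w2 = w1 - np.linalg.pinv(X2) @ (X2 @ w1 - d2)   # relearn A2 (assumed projection)

    def error(w):
        return float(np.sum((X1 @ w - d1) ** 2))

    return error(w1) - error(w2)

def fall(w0, rng, r=0.5):
    """Forgetting by decay: weights shrink towards the origin."""
    return r * w0

def drift(w0, rng, sigma=0.1):
    """Forgetting by drift: isotropic Gaussian noise is added to the weights."""
    return w0 + sigma * rng.standard_normal(w0.shape)

def mean_delta(forget, trials=100, seed=0):
    rng = np.random.default_rng(seed)
    return float(np.mean([fll_delta(forget, rng) for _ in range(trials)]))

if __name__ == "__main__":
    print("drift:", mean_delta(drift))
    print("fall: ", mean_delta(fall))
```

Under these assumptions the mean of $\delta$ is positive for drift and negative for falling, in line with the two sets of results contrasted in the text.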
An obvious physiological concomitant of Hebbian learning is long-term potentiation (LTP), which seems to underpin learned behaviors [14]. LTP can last for hours, days or even months, and usually follows an exponential decay [3]. However, some forms of LTP do not seem to decay [15], and have been shown to be stable for up to one year [16]. Such stability is remarkable, but from a statistical point of view, would almost certainly be accompanied by random fluctuations which would have a cumulative effect over time; and indeed, fluctuations are apparent in the stable LTP reported in [16]. Crucially, it is not known if the forgetting of learned behaviors is caused by decaying efficacy at many synapses, or by the cumulative effect of random fluctuations in stable LTP-induced synaptic efficacies. Here, decaying efficacy is analogous to weight values that fall toward zero in network models, whereas the cumulative effect of random fluctuations is analogous to the addition of random noise, or drifting, of weight values in network models.
Given a choice between forgetting via synaptic weights that fall towards zero and weights that drift isotropically, has evolution chosen drifting or falling? If all other things were equal then forgetting via synaptic drift would seem to be the obvious choice. This is because drifting ensures that relearning a subset of associations improves performance on other associations, whereas falling decreases performance. However, other things are rarely equal. The expected magnitude of weights increases with drifting but decreases with falling. (Consider a hypersphere centered on the origin, with radius $\|w_0\|$. Simple geometry shows that more than half of all directions emanating from $w_0$ yield a new weight vector $w_1$ which lies outside the hypersphere, and therefore $E[\|w_1\|] > E[\|w_0\|]$ (assuming, for example, that all vectors $w_1 - w_0$ have the same length).) This decrease in weight magnitudes effectively reduces neuronal firing rates, which reduces metabolic costs relative to costs incurred by synaptic drift. Synaptic drift therefore confers mnemonic benefits, but these benefits come at a metabolic price. Thus the increased fitness gained from the mnemonic benefits of synaptic drift must be offset against their metabolic costs. In essence, even free-lunch learning comes at a price.
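The hypersphere argument can be checked with a short Monte Carlo sketch. The dimension (100), the displacement length (half of $\|w_0\|$), and the sample size are arbitrary choices made for illustration, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, step = 100, 0.5

w0 = rng.standard_normal(n)
w0 /= np.linalg.norm(w0)                      # normalise so that ||w0|| = 1

# Drift: displacements of fixed length `step` in uniformly random directions.
directions = rng.standard_normal((10000, n))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
drift_norms = np.linalg.norm(w0 + step * directions, axis=1)

mean_drift_norm = float(drift_norms.mean())
fraction_outside = float((drift_norms > 1.0).mean())

# Falling: a displacement of the same length, but directed at the origin.
fall_norm = float(np.linalg.norm((1.0 - step) * w0))

if __name__ == "__main__":
    print(mean_drift_norm, fraction_outside, fall_norm)
```

With a displacement of fixed length, drift leaves most samples outside the original hypersphere and raises the mean norm, whereas falling by the same distance always reduces it.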

Methods
We proceed by deriving expressions for $E_{\mathrm{pre}}$, $E_{\mathrm{post}}$, and for $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$. We prove that if $n_1 + n_2 \le n$ then the expected value of $\delta$ is negative. We then prove that the probability $P(\delta < 0)$ that $\delta$ is negative approaches unity as $n_1$ approaches $\infty$.

Performance Errors
Given a $c \times n$ matrix $X$ and a $c$-dimensional vector $d$, let $L_{X,d}$ be the affine subspace

$$L_{X,d} = \{ w : Xw = d \}$$

[…] of $L_{X,d}$. Then […]

If $X_i$ has rank $n_i$ then transposing the QR decomposition of $X_i^T$ (or, equivalently, using Gram-Schmidt orthonormalisation of the rows of $X_i$) gives

$$X_i = T_i Z_i$$

for unique $n_i \times n_i$ and $n_i \times n$ matrices $T_i$ and $Z_i$ with $T_i$ lower triangular with positive diagonal elements, and $Z_i Z_i^T = I_{n_i}$. Simple calculation shows that, for any weight vector $w$, […] it follows that the matrix $Z_i^T Z_i$ represents the operator that projects orthogonally onto the image of $Z_i^T Z_i$. Because $X_i^T X_i = Z_i^T T_i^T T_i Z_i$, the image of $X_i^T X_i$ is contained in that of $Z_i^T Z_i$. As both these images have dimension $n_i$, they must be equal, and so $Z_i^T Z_i$ represents the operator which projects orthogonally onto the image of $X_i^T X_i$.

Now suppose that $X$ and $d$ are consistent, where

$$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}, \qquad d = \begin{pmatrix} d_1 \\ d_2 \end{pmatrix}.$$

Then, after the network has learned $A_1$ and $A_2$, the weight vector $w_0$ satisfies

$$X_1 w_0 = d_1 \quad \text{and} \quad X_2 w_0 = d_2. \qquad (16)$$

(If, as below, $n_1 + n_2 \le n$, $X_2$ and $d_2$ are consistent, and $(X_1, d_1)$ has a continuous distribution then Equation 16 holds with probability 1.)

Figure 2. Free-lunch learning decreases as the network's weight vector falls toward the origin (axis label: falling factor $1 - r$). A network with 100 input units and one output unit learns two subsets $A_1$ and $A_2$, each of which consists of 50 associations. After learning $A_1$ and $A_2$, the network has a weight vector $w = w_0$, but after partial forgetting, the weight vector is $w = w_1$. If forgetting consists of subtracting a proportion $1 - r$ of $w_0$ such that $w_1 = w_0 - (1 - r) w_0$ then the weight vector “falls” towards the origin; the factor $1 - r$ is called the falling factor. After forgetting, performance error on $A_1$ is $E_{\mathrm{pre}}$, an error which changes to $E_{\mathrm{post}}$ after relearning $A_2$, where this change is $\delta = E_{\mathrm{pre}} - E_{\mathrm{post}}$. Given that there are $n_1$ associations in $A_1$, the expected free-lunch learning per association in […]
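The factorisation of a full-rank input matrix into a lower-triangular factor and a factor with orthonormal rows can be sketched with NumPy's QR routine. The matrix sizes and variable names below are illustrative assumptions; the sign adjustment is needed only because `numpy.linalg.qr` does not guarantee a positive diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n = 3, 6                                  # illustrative sizes: 3 associations, 6 weights
X = rng.standard_normal((n_i, n))              # full row rank with probability 1

# Reduced QR of X^T: X^T = Q R, with Q of shape (n, n_i) and R upper triangular.
Q, R = np.linalg.qr(X.T)

# Flip signs so that the diagonal of R (hence of T) is positive.
signs = np.sign(np.diag(R))
Q = Q * signs                                  # scale column j of Q by signs[j]
R = signs[:, None] * R                         # scale row i of R by signs[i]

T = R.T                                        # lower triangular, positive diagonal
Z = Q.T                                        # orthonormal rows: Z Z^T = I
P = Z.T @ Z                                    # orthogonal projector onto the row space of X

if __name__ == "__main__":
    print(np.allclose(T @ Z, X), np.allclose(P @ P, P))
```

The matrix `P` leaves every row of `X` unchanged and is idempotent and symmetric, which is what it means to project orthogonally onto the space spanned by the input vectors.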

Falling
We now assume that forgetting is induced by weight values “falling” towards the origin, i.e., forgetting consists of shrinking the weight vector $w_0$ by a (possibly random) factor $r$ towards the “dead state” $0$. Thus the post-forgetting weight vector $w_1$ is given by

$$w_1 = r w_0, \qquad (17)$$

and so the “forgetting vector” $v = w_1 - w_0$ is $v = (r - 1) w_0$. The form of forgetting given by Equation 17 is very different from that investigated in [12], where $v$ has an isotropic distribution and is independent of $(X_1, d_1)$ and $(X_2, d_2)$.

Let $w_2$ be the orthogonal projection of $w_1$ onto $L_2$. Then […] Manipulation gives […] and so […] Then Equations 14, 16, and 18-20 yield […]

Figure 3. […] The network learns two associations $A_1$ and $A_2$. For example, $A_1$ is the mapping from input vector $x_1 = (x_{11}, x_{12})$ to desired output value $d_1$, and learning $A_1$ consists of adjusting $w$ until the network output $y_1 = w \cdot x_1$ equals $d_1$. (B) For a given association $A_2 = (X_2, d_2)$, the corresponding constraint line in the space defined by $(v_a, v_b)$ is $L_2$. Irrespective of the precise value of the target output value $d_1$ in association $A_1$, if $d_1$ is distributed isotropically then $+d_1$ is as probable as $-d_1$. When averaged over $+d_1$ and $-d_1$, the change $\delta$ in error on $A_1$ induced by relearning $A_2$ can be shown to be $-(1 - r)^2 e^2$, where $w_1^{\pm} = r w_0^{\pm}$. Since this is less than zero, the expected change $E[\delta \mid r] < 0$. (Figure 3A redrawn from [12].)

The Case of Isotropic Random $(X_1, d_1)$
In this section we assume that the distribution of $(X_1, d_1)$ is isotropic, i.e., that $(U X_1 V, U d_1)$ has the same distribution as $(X_1, d_1)$ for all orthogonal $n_1 \times n_1$ matrices $U$ and all orthogonal $n \times n$ matrices $V$. Then taking the conditional expectation of Equation 21 for given $X_2$, $d_2$, and $r$ gives the following theorem.
where $x$ is any column of $X_1^T$.
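As a sketch of why falling yields a non-positive expectation, the change in error can be worked out directly from $w_1 = r w_0$ and Equation 16. The symbol $u$ and the use of the pseudoinverse $X_2^{+}$ are introduced here for illustration (assuming $X_2$ has full row rank); they are not the paper's notation, and the steps below are not claimed to match the paper's own intermediate equations.

```latex
\begin{align*}
u &:= X_2^{+} d_2
   && \text{(minimum-norm solution of } X_2 w = d_2\text{)},\\
w_2 &= w_1 - X_2^{+}\,(X_2 w_1 - d_2) = r\,w_0 + (1-r)\,u
   && \text{(since } X_2 w_1 = r\,d_2\text{)},\\
E_{\mathrm{pre}} &= \|X_1 w_1 - d_1\|^2 = (1-r)^2\,\|d_1\|^2
   && \text{(since } X_1 w_0 = d_1\text{)},\\
E_{\mathrm{post}} &= \|X_1 w_2 - d_1\|^2 = (1-r)^2\,\|X_1 u - d_1\|^2,\\
\delta &= E_{\mathrm{pre}} - E_{\mathrm{post}}
        = (1-r)^2\left( 2\,d_1^{T} X_1 u - \|X_1 u\|^2 \right).
\end{align*}
```

Because $u$ depends only on $(X_2, d_2)$, if $+d_1$ and $-d_1$ are equally probable given $X_1$, $X_2$, and $d_2$, the cross term averages to zero and the conditional expectation of $\delta$ is $-(1-r)^2 \|X_1 u\|^2 \le 0$, which is proportional to the square of the falling factor.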

Corollary 1
If 1.-3. of Theorem 1 hold then […] with equality if and only if either $r = 1$ or $d_2 = 0$. Corollary 1 says that (apart from trivial exceptions) the expected amount of FLL is negative.
To obtain Theorem 2, it is useful to have some moments of isotropic distributions. Let $x$ be isotropically distributed on $\mathbb{R}^n$. Then Equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield […] as in Equations A.14 and A.15 of [12]. The other tool used in proving Theorem 2 is the formula […] for any random variables $X, Y, Z$ for which these quantities exist. Equation 26 is an application to the conditional distribution of $Y \mid Z$ of the standard conditional variance formula that is given in Equation 2b.3.6 on page 97 of [17].
Taking the expectation and variance of Equation 21 as only $d_1$ varies and using Equation 24 gives […] $\operatorname{var}\left(\delta(w_1, w_2; X_1, d_1) \mid X_1, X_2, d_2, r\right)$