## Abstract

Long after a new language has been learned and forgotten, relearning a few words seems to trigger the recall of other words. This “free-lunch learning” (FLL) effect has been demonstrated both in humans and in neural network models. Specifically, previous work proved that linear networks that learn a set of associations, then partially forget them all, and finally relearn some of the associations, show improved performance on the remaining (i.e., nonrelearned) associations. Here, we prove that relearning forgotten associations *decreases* performance on nonrelearned associations; an effect we call negative free-lunch learning. The difference between free-lunch learning and the negative free-lunch learning presented here is due to the particular method used to induce forgetting. Specifically, if forgetting is induced by isotropic *drifting* of weight vectors (i.e., by adding isotropic noise), then free-lunch learning is observed. However, as proved here, if forgetting is induced by weight values that simply decay or *fall* towards zero, then negative free-lunch learning is observed. From a biological perspective, and assuming that nervous systems are analogous to the networks used here, this suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift, but not synaptic decay, yields free-lunch learning.

## Author Summary

If you learn a skill, then partially forget it, does relearning part of that skill induce recovery of other parts of the skill? More generally, if you learn a set of associations, then partially forget them, does relearning a subset induce recovery of the remaining associations? In previous work, in which participants learned the layout of a scrambled computer keyboard, the answer to this question appeared to be “yes.” More recently, we modeled this “free-lunch learning” effect using artificial neural networks, in which the synaptic strength between each pair of model neurons is a connection weight. We proved that if forgetting is induced by allowing each weight value to drift randomly, then free-lunch learning is almost inevitable. However, if, after learning a set of associations, forgetting is induced by allowing each connection weight to *decay* or *fall* toward zero, then relearning a subset of associations decreases performance on the remaining associations. This suggests that evolution may have selected physiological mechanisms that involve forgetting using a form of synaptic drift rather than synaptic decay, because synaptic drift yields free-lunch learning, whereas decay does not.

**Citation: **Stone JV, Jupp PE (2008) Falling towards Forgetfulness: Synaptic Decay Prevents Spontaneous Recovery of Memory. PLoS Comput Biol 4(8):
e1000143.
doi:10.1371/journal.pcbi.1000143

**Editor: **Karl J. Friston, University College London, United Kingdom

**Received:** December 7, 2007; **Accepted:** June 25, 2008; **Published:** August 22, 2008

**Copyright:** © 2008 Stone, Jupp. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding: **No funding was received for this work.

**Competing interests:** The authors have declared that no competing interests exist.

### Introduction

The idea that structural changes underpin the formation of new memories can be traced to the 19th century [1]. More recently, Hebb proposed that “When an axon of cell A is near enough to excite B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased” [2]. It is now widely accepted that learning involves some form of Hebbian adaptation, and a growing body of evidence suggests that Hebbian adaptation is associated with the long-term potentiation (LTP) observed in neuronal systems [3]. LTP is an increase in synaptic efficacy which occurs in the presence of pre-synaptic and post-synaptic activity, and can be specific to a single synapse. One consequence of Hebbian adaptation is that information regarding a specific association is distributed amongst many synaptic connections, and therefore gives rise to a distributed representation of each association.

In [4], participants learned the layout of letters on a “scrambled” keyboard. After a period of forgetting, participants relearned a subset of letter positions. Crucially, this improved performance on the remaining (i.e., nonrelearned) letter positions. However, whereas relearning some associations shows evidence of FLL in some studies [4]–[6], this is not found in not all studies [7]. This discrepancy may be because the many studies performed to investigate this general phenomenon use a wide variety of different materials and procedures, with some measuring recall and others measuring recognition performance, for example. However, within the realms of psychology, one relevant effect is known as part-set cueing inhibition.

Part-set cueing inhibition [8] occurs when a subject is exposed to part of a set of previously learned items, which is found to reduce recall of nonrelearned items. However, [9] showed that a learned row of words was better recalled if the cues consisted of a subset of words placed in their learned positions than if cue words were placed in other positions. In this case, part-set cueing seems to improve performance, but only if each “part” appears in the spatial position in which it was originally learned. This position-specificity is consistent with the FLL effect reported using the “scrambled keyboard” procedure in [4] but has no obvious concomitant in network models (e.g., [4],[10],[11]).

If the brain stores information as distributed representations, then each neuron contributes to the storage of many associations. Therefore, relearning some old and partially forgotten associations should affect the integrity of other associations learned at about the same time. As noted above, previous work has shown that relearning some forgotten associations does not disrupt other associations, but partially restores them. This FLL effect has also been demonstrated in neural network models ([10],[12]), where it can accelerate evolution of adaptive behaviors [13]. Crucially, in [12], the proof that relearning some associations partially restores other associations assumes that forgetting is caused by the addition of isotropic noise to connection weights, which could result from the cumulative effect of small random changes in connection weights. In contrast, here we prove that if forgetting is induced by shrinking weights towards zero, so that weights “fall” towards the origin, then relearning some associations disrupts other associations.

The protocol used to examine FLL here is the same as that used in [4] and [12] and is as follows (see Figure 1). First, learn a set of *n*_{1}+*n*_{2} associations *A* = *A*_{1}∪*A*_{2} consisting of two subsets *A*_{1} and *A*_{2} of *n*_{1} and *n*_{2} associations, respectively. After all learned associations *A* have been partially forgotten, measure performance error on subset *A*_{1}. Finally, relearn *only* subset *A*_{2} and then remeasure performance on subset *A*_{1}. FLL occurs if relearning subset *A*_{2} improves performance on *A*_{1}.

**Figure 1. Free-lunch learning protocol.**

Two subsets of associations *A*_{1} and *A*_{2} are learned. After partial forgetting (see text), performance error *E*_{pre} on subset *A*_{1} is measured. Subset *A*_{2} is then relearned to pre-forgetting levels of performance, and performance error *E*_{post} on subset *A*_{1} is re-measured. If *E*_{post}<*E*_{pre} then FLL has occurred, and the amount of FLL is *δ* = *E*_{pre}−*E*_{post}. Redrawn from [12].

In order to preclude a common misunderstanding, we emphasize that, for a network with *n* connection weights, it is assumed that *n*≥*n*_{1}+*n*_{2} ; that is, the number of connection weights on each output unit is not less than the number *n*_{1}+*n*_{2} of learned associations. Using the class of linear network models described below, up to *n* associations can be learned perfectly (see [12]).

The proofs below refer to a network with one output unit. However, these proofs apply to networks with multiple output units, because the *n* connections to each output unit can be considered as a distinct network, in which case our results can be applied to the network associated with each output unit.

#### Definition of Performance Error

Each association consists of an input vector **x** and a corresponding target value *d*. For a network with weight vector **w**, the response to an input vector **x** is *y* = **w·x**. We define the *performance error* for input vectors **x**_{1},…,**x*** _{k}* and desired outputs

*d*

_{1},…,

*d*to be(1)

_{k}where

*y*=

_{i}**w**·

**x**

*is the output response to the input vector*

_{i}**x**

*. By putting*

_{i}**X**= (

**x**

_{1},…,

**x**

*)*

_{k}*,*

^{T}**d**= (

*d*

_{1},…,

*d*)

_{k}*and*

^{T}we can write Equation 1 succinctly as(2)

The two subsets *A*_{1} and *A*_{2} consist of *n*_{1} and *n*_{2} associations, respectively. Let **w**_{0} be the network weight vector after *A*_{1} and *A*_{2} are learned. When *A*_{1} and *A*_{2} are forgotten, the network weight vector changes to **w**_{1}, say, and the performance error on *A*_{1} becomes *E*_{pre} = *E*(**X**;**w**_{1},**d**). Finally, relearning *A*_{2} yields a new weight vector, **w**_{2}, say, and the performance error on *A*_{1} is *E*_{post} = *E*(**X**;**w**_{2},**d**). Free-lunch learning has occurred if performance error on *A*_{1} is less after relearning *A*_{2} than it was before relearning *A*_{2} (i.e., if *E*_{post}<*E*_{pre}).

Given weight vectors **w**_{1} and **w**_{2}, a matrix **X** of input vectors, and a vector **d** of desired outputs, define(3)

which we shall also refer to simply as *δ*.

In previous work [12], we assumed that the “forgetting vector” **v** (defined as **v** = **w**_{1}−**w**_{0}) has an isotropic distribution. Here we shall assume instead that the post-forgetting weight vector **w**_{1} is given by(4)

for some (possibly random) scalar *r*, so that(5)

and therefore(6)

The interpretation of Equation 6 is that forgetting consists of making the optimal weight vector **w**_{0} “fall” towards the origin by a *falling factor* 1−*r*.

### Results

We provide theoretical results, and compare these with results obtained using computer simulations. In essence, our theoretical and simulation results indicate that falling weights induce negative FLL, which decreases with the square of the falling factor 1−*r*.

#### Theoretical Results

Our two main theorems are summarised here, and proofs are provided in the Methods section. These theorems apply to a network with *n* weights which learns *n*_{1}+*n*_{2} associations *A* = *A*_{1}∪*A*_{2}, and then after partial forgetting, relearns the *n*_{2} associations in *A*_{2}.

We prove that if *n*_{1}+*n*_{2}≤*n* (so that, in general, the associations *A*_{1} and *A*_{2} are consistent) and the joint distribution of (**X**_{1},**d**_{1}) is isotropic (where **X**_{1} and **d**_{1} are the matrix of inputs and the vector of desired outputs for subset *A*_{1} of associations) then the expected value of *δ* is negative (recall that *δ* is defined in Equation 3). We then prove that the probability *P*(*δ*<0) that *δ* is negative approaches unity as *n*_{1} approaches ∞.

#### Theorem 1

For every non-zero value of *r*, the expected value of *δ* given *r* is negative. More precisely,(7)

with equality only in trivial cases, and where the constant of proportionality is guaranteed to be positive. Thus, the expected amount of FLL is negative (or zero).

From a physiological perspective, the case *r*<1 is obviously of interest because it represents synaptic weight decay. However, from a mathematical perspective, Theorem 1 applies to every value of *r*, and so it also holds for *r*>1. In other words, any movement of the weight vector **w** along the the line connecting **w**_{0} to the origin yields an expectation of negative FLL, in accordance with Theorem 1.

#### Theorem 2

Under mild conditions on the distributions of the input/output pairs (**X**_{1},**d**_{1}) and (**X**_{2},**d**_{2}),(8)

where **x** and are any columns of and , respectively, and

Theorem 2 implies that, if (i) the number (*n*_{1}) of associations in *A*_{1} is a fixed non-zero proportion ( *n*_{1}/*n* ) of the number *n* of connection weights, (ii) **E**[∥**d**_{1}∥^{2}]**E**[∥**d**_{2}∥^{−2}] is bounded as *n* → ∞, and (iii) *γ*(*n*) → 0 as *n* → ∞ then *P*(*δ*>0) → 0 as *n* → ∞, i.e., the amount of FLL is negative, with a probability which tends to 1 as *n* → ∞.

For example, if we assume that (i) each input vector **x** = (*x*_{1},…,*x _{n}*) is chosen from an isotropic Gaussian distribution and (ii) the variance of

*x*is then

_{i}*γ*(

*n*) = 2/

*n*, , and

**E**[∥

**d**

_{1}∥

^{2}]

**E**[∥

**d**

^{2}∥

^{−2}] =

*n*

_{1}/(

*n*

_{2}−1). This ensures that

*P*(

*δ*>0) → 0 as

*n*→ ∞.

#### Simulation Results

Simulation was carried out on a network with *n* input units and one output unit. The set *A* of associations consisted of *k* input vectors (**x**_{1},…,**x*** _{k}*) and

*k*corresponding desired scalar output values (

*d*

_{1},…,

*d*). Each input vector comprised

_{k}*n*elements

**x**= (

*x*

_{1},…,

*x*). The values of

_{n}*x*and

_{i}*d*were chosen from a Gaussian distribution with unit variance (i.e., ). A network's output

_{i}*y*is a weighted sum of input values , where

_{i}*x*is the

_{ij}*j*th component of the

*i*th input vector

**x**

*, and each weight*

_{i}*w*is the connection between the

_{j}*j*th input unit and the output unit.

Given that the network error for a given set of *k* associations is , the derivative of *E* with respect to **w** yields the delta learning rule , where *η* is the learning rate, which is adjusted according to the number of weights.

However, in order to save time, we used an equivalent learning method. Learning of the *k* = *n* associations in *A* = *A*_{1}∪*A*_{2} was performed by solving a set of *n* simultaneous equations using a standard method, after which the weight vector **w**_{0} was obtained; this provided perfect performance on all *n* associations. Partial forgetting was induced by making weights “fall” towards the origin **w**_{1} = *r***w**_{0}, after which performance error was *E*_{pre}. Relearning the *n*_{2} = *n*/2 associations in *A*_{2} was implemented with *k* = *n*_{2} as above, after which performance error was *E*_{post}.

In each simulation, each value in each input vector **x*** _{i}*, and each target value

*d*was chosen from the same isotropic gaussian distribution with unit variance. There were 100 input units, and one output unit. The subsets

_{i}*A*

_{1}and

*A*

_{2}each consisted of 50 associations. The value of

*δ*=

*E*

_{pre}−

*E*

_{post}was obtained in each of 100 simulations, using a different random seed for each simulation. In Figure 2, the mean of 100 values of

*δ*is shown for various values of the falling factor 1−

*r*.

**Figure 2. Free-lunch learning decreases as the network's weight vector falls toward the origin.**

A network with 100 input units and one output unit learns two subsets *A*_{1} and *B*_{2}, each of which consists of 50 associations. After learning *A*_{1} and *A*_{2}, the network has a weight vector w = w_{0}, but after partial forgetting, the weight vector is w = w_{1}. If forgetting consists of subtracting a proportion 1−*r* of w_{0} such that w_{1} = w_{0}−(1−*r*)w_{0} then the weight vector “falls” towards the origin; the factor 1−*r* is called the *falling factor*. After forgetting, performance error on *A*_{1} is *E*_{pre}, an error which changes to *E*_{post} after relearning *A*_{2}, where this change is *δ* = *E*_{pre}−*E*_{post}. Given that there are *A*_{1} associations in *A*_{1}, the expected free-lunch learning per association in *A*_{1} is therefore E[*δ*/*n*_{1}|*r*]. *Solid curve*: the expected FLL, *E*[*δ*/*n*_{1}|*r*], where this expectation is taken over 100 computer simulations. *Dashed curve*: theoretical prediction of *E*[*δ*/*n*_{1}|*r*] (see Equation 7), using a constant of proportionality equal to unity, so that the predicted free-lunch learning is *E*_{predict}[*δ*/*n*_{1}|*r*] = −(1−*r*)^{2}. As predicted, free-lunch learning *E*[*δ*/*n*_{1}|*r*] becomes more negative as the falling factor 1−*r* increases.

#### The Geometry of Forgetting

We present a brief account of the geometry which underpins the results reported here, for a network with two input units and one output unit, as shown in Figure 3A. This network learns two associations *A*_{1} = (*X*_{1},*d*_{1}) and *A*_{2} = (*X*_{2},*d*_{2}).

**Figure 3. Geometric example of how relearning A_{2} increases the error on A_{1}.**

(A) A network with two input units and one output unit, with connection weights *ω _{a}* and

*ω*defines a weight vector w = (

_{b}*ω*,

_{a}*ω*). The network learns two associations

_{b}*A*

_{1}and

*A*

_{2}. For example,

*A*

_{1}is the mapping from input vector x

_{1}= (

*x*

_{11},

*x*

_{12}) to desired output value

*d*

_{1}, and learning

*A*

_{1}consists of adjusting w until the network output

*y*

_{1}= w·x

_{1}equals

*d*

_{1}. (B) For a given association

*A*

_{2}= (

*X*

_{2},

*d*

_{2}), the corresponding constraint line in the space defined by (

*ω*,

_{a}*ω*) is

_{b}*L*

_{2}. Irrespective of the precise value of the target output value

*d*

_{1}in association

*A*

_{1}, if

*d*

_{1}is distributed isotropically then +

*d*

_{1}is as probable as −

*d*

_{1}. When averaged over +

*d*

_{1}and −

*d*

_{1}, the change

*δ*in error on

*A*

_{1}induced by relearning

*A*

_{2}can be shown to be −(1−

*r*)

^{2}

*e*

^{2}, where w

_{1}

^{±}=

*r*w

_{0}

^{±}. Since this is less than zero, the expected change

*E*[

*δ*|

*r*]<0. (Figure 3A redrawn from [12]).

Figure 3B provides a geometric example of how relearning *A*_{2} increases the error on *A*_{1}. After learning *A*_{1} and *A*_{2}, **w** = **w**_{0}. The effects of forgetting and relearning can be seen by ignoring the ± superscripts and subscripts for now. After partial forgetting, **w** = **w**_{1}, and performance error *E*_{pre} = *p*^{2}. Relearning *A*_{2} yields **w**_{2}, the orthogonal projection of **w**_{1} on to *L*_{2}, and performance error is *E*_{post} = *q*^{2}. FLL occurs if *δ* = *E*_{pre}−*E*_{post}>0, or equivalently if *p*^{2}−*q*^{2}>0 (see [12], Appendices A–C for proofs). Forgetting here consists of reducing **w**_{0} by a factor *r*<1, so that **w**_{1} = *r***w**_{0}.

The plus and minus signs in Figure 3B refer to two versions and of association *A*_{1}, in which *X*_{1} is the same and the target *d*_{1} has the same magnitude, but opposite signs: and .

We now find the expected change in error induced by relearning a given association *A*_{2}. After learning followed by forgetting, the change in error on after relearning *A*_{2} is . After learning followed by forgetting, the change in error on after relearning *A*_{2} is . Using similar triangles in Figure 3B,(9)

(10)

Therefore, the total change in error on and induced by relearning *A*_{2} (on different occasions) is(11)

(12)

(13)

Irrespective of the precise value of the target output value *d*_{1} in *A*_{1}, if the distribution of *d*_{1} is isotropic then +*d*_{1} is as probable as −*d*_{1}. If the total change in error for two instances ( and ) of *A*_{1} is −2(1−*r*)^{2}*e*^{2} then the expected change (conditional on *e* ) is *E*[*δ*|*e*] = −(1−*r*)^{2}*e*^{2}. Therefore, if forgetting is induced by falling weight values, then the expected change in error **E**[*δ*]<0.

### Discussion

We have proved and demonstrated that, in one of the simplest forms of neural network model, relearning part of a previously learned set of associations reduces performance on the remaining non-relearned associations. This result is in stark contrast to our previous results, which proved that relearning induced partial recovery of non-relearned items [12]. The only difference between these two studies is the way in which forgetting was induced.

An obvious physiological concomitant of Hebbian learning is long-term potentiation (LTP), which seems to underpin learned behaviors [14]. LTP can last for hours, days or even months, and usually follows an exponential decay [3]. However, some forms of LTP do not seem to decay [15], and have been shown to be stable for up to one year [16]. Such stability is remarkable, but from a statistical point of view, would almost certainly be accompanied by random fluctuations which would have a cumulative effect over time; and indeed, fluctuations are apparent in the stable LTP reported in [16]. Crucially, it is not known if the forgetting of learned behaviors is caused by decaying efficacy at many synapses, or by the cumulative effect of random fluctuations in stable LTP-induced synaptic efficacies. Here, decaying efficacy is analogous to weight values that fall toward zero in network models, whereas the cumulative effect of random fluctuations is analogous to the addition of random noise, or drifting, of weight values in network models.

Given a choice between forgetting via synaptic weights that fall towards zero and weights that drift isotropically, has evolution chosen drifting or falling? If all other things were equal then forgetting via synaptic drift would seem to be the obvious choice. This is because drifting ensures that relearning a subset of associations improves performance on other associations, whereas falling decreases performance. However, other things are rarely equal. The expected magnitude of weights increases with drifting but decreases with falling. (Consider a hypersphere centered on the origin, with radius ∥**w**_{0}∥ . Simple geometry shows that more than half of all directions emanating from **w**_{0} yield a new weight vector **w**_{1} which lies outside the hypersphere, and therefore **E**[∥**w**_{1}∥]>**E**[∥**w**_{0}∥] (assuming, for example, that all vectors **w**_{1}−**w**_{0} have the same length).) This decrease in weight magnitudes effectively reduces neuronal firing rates, which reduces metabolic costs relative to costs incurred by synaptic drift. Synaptic drift therefore confers mnemonic benefits, but these benefits come at a metabolic price. Thus the increased fitness gained from the mnemonic benefits of synaptic drift must be offset against their metabolic costs. In essence, even free-lunch learning comes at a price.

### Methods

We proceed by deriving expressions for *E*_{pre}, *E*_{post}, and for *δ* = *E*pre−*E*_{post}. We prove that if *n*_{1}+*n*_{2}≤*n* then the expected value of *δ* is negative. We then prove that the probability *P*(*δ*<0) that *δ* is negative approaches unity as *n*_{1} approaches ∞.

#### Performance Errors

Given a *c*×*n* matrix **X** and a *c* -dimensional vector **d**, let *L*_{X}_{,d} be the affine subspace

of . If **X** and **d** are consistent (i.e., there is a **w** such that **Xw** = **d**) then

Given weight vectors **w**_{1} and **w**_{2}, a matrix **X** of input vectors, and a vector **d** of desired outputs, define

where *E*_{pre} = *E*(**X**;**w**_{1},**d**) and *E*_{post} = *E*(**X**;**w**_{2},**d**). Let be any element of *L*_{X}_{,d}. Then (14)

If **X*** _{i}* has rank

*n*then transposing the QR decomposition of (or, equivalently, using Gram–Schmidt orthonormalisation of the rows of

_{i}**X**

*) gives*

_{i}for unique

*n*×

_{i}*n*and

_{i}*n*×

_{i}*n*matrices

**T**

*and*

_{i}**Z**

*with*

_{i}**T**

*lower triangular with positive diagonal elements, and . Simple calculation shows that, for any weight vector*

_{i}**w**, and are orthogonal. Since , it follows that the matrix represents the operator that projects orthogonally onto the image of . Because(15)

the image of is contained in that of . As both these images have dimension

*n*, they must be equal, and so represents the operator which projects orthogonally onto the image of .

_{i}Now suppose that **X** and **d** are consistent, where

Then, after the network has learned *A*_{1} and *A*_{2}, the weight vector **w**_{0} satisfies(16)

(If, as below, *n*_{1}+*n*_{2}≤*n*, **X**_{2} and **d**_{2} are consistent, and (**X**_{1},**d**_{1}) has a continuous distribution then Equation 16 holds with probability 1.)

#### Falling

We now assume that forgetting is induced by weight values “falling” towards the origin at zero, i.e., forgetting consists of shrinking the weight vector **w**_{0} by a (possibly random) factor *r* towards the “dead state” **0**. Thus the post-forgetting weight vector **w**_{1} is given by(17)

and so the “forgetting vector” **v** = **w**_{1}−**w**_{0} is(18)

The form of forgetting given by Equation 17 is very different from that investigated in [12], where **v** has an isotropic distribution and is independent of (**X**_{1},**d**_{1}) and (**X**_{2},**d**_{2}).

Let **w**_{2} be the orthogonal projection of **w**_{1} onto *L*_{2}. Then

Manipulation gives(19)

and so(20)

Then Equations 14, 16, and 18–20 yield(21)

#### The Case of Isotropic Random (**X**_{1},**d**_{1})

In this section we assume that the distribution of (**X**_{1},**d**_{1}) is isotropic, i.e., that (**UX**_{1}**V**,**Ud**_{1}) has the same distribution as (**X**_{1},**d**_{1}) for all orthogonal *n*_{1}×*n*_{1} matrices **U** and all orthogonal *n*×*n* matrices **V**. Then taking the conditional expectation of Equation 21 for given **X**_{2}, **d**_{2}, and *r* gives the following theorem.

#### Theorem 1

If

*n*_{1}+*n*_{2}≤*n*,**X**_{2}and**d**_{2}are consistent,- the distribution of (
**X**_{1},**d**_{1}) is continuous and isotropic, **X**_{1},**d**_{1}, and (**X**_{2},**d**_{2},*r*) are independent.

then(22)

where **x** is any column of .

#### Corollary 1

If 1.-3. of Theorem 1 hold then(23)

with equality if and only if either *r* = 1 or **d**_{2} = 0.

Corollary 1 says that (apart from trivial exceptions) the expected amount of FLL is negative.

To obtain Theorem 2, it is useful to have some moments of isotropic distributions. Let **x** be isotropically distributed on . Then Equations 9.6.1 and 9.6.2 of Mardia and Jupp (2000), together with some algebraic manipulation, yield(24)

(25)

as in Equations A.14 and A.15 of [12].

The other tool used in proving Theorem 2 is the formula(26)

for any random variables *X*,*Y*,*Z* for which these quantities exist. Equation 26 is an application to the conditional distribution of *Y*|*Z* of the standard conditional variance formula that is given in Equation 2b.3.6 on page 97 of [17].

Taking the expectation and variance of Equation 21 as only **d**_{1} varies and using Equation 24 gives(27)

(28)

Taking the expectation of Equation 28 as only **X**_{1} varies and using Equation 24 gives(29)

Then taking the variance of Equation 27 as only **X**_{1} varies and using Equation 25 gives(31)

Adding Equations 29 and 30 and using Equation 26 yields(32)

To obtain an upper bound on the conditional probability of FLL (i.e., on *P*(*δ*≥0|**X**_{2},**d**_{2},*r*)), we use Chebyshev's inequality, which states that, for any random variable *Y* and any positive value of *t*

Applying Chebyshev's inequality to the conditional distribution of δ(**w**_{1},**w**_{2},**X**_{1},**d**_{1}) given (**X**_{2},**d**_{2},*r*), taking *t* = **E**[*δ*(**w**_{1},**w**_{2};**X**_{1},**d**_{1})|**X**_{2},**d**_{2},*r*], and noting that (by Equation 23) *t*≤0, we obtain(33)

Substituting Equations 22 and 32 into Equation 33 gives(34)

where

For any positive-definite symmetric matrix **A** and vector **x**, diagonalization of **A**, together with the fact that *x*+1/*x*≥2 for positive *x*, yields(35)

Combining Equations 34 and 35 with the fact that gives(36)

Taking the expectation of Equation 36 over **X**_{2} yields(37)

where **x** and are any columns of and , respectively.

Taking the expectation of Equation 37 over **d**_{2} and *r* yields the following theorem.

#### Theorem 2

If (a) conditions 1.-4. of Theorem 1 hold, (b) the columns of are distributed independently, (c) **X**_{2}, **d**_{2}, and *r* are independent, (d) the distribution of (**X**_{2},**d**_{2}) is isotropic, and (e) **E**[∥**d**_{2}∥^{−2}] is finite then(38)

where **x** and are any columns of and , respectively, and

#### Corollary 2

If the conditions of Theorem 2 hold and

where **x** and are any columns of and , respectively, then

Thus

provided that *n*_{1}/*n* and *n*_{2}/*n* are bounded away from zero.

### Acknowledgments

Thanks to David Sterratt for asking, “What would happen to free-lunch learning if weights decayed?” and to three anonymous reviewers for their detailed comments.

### Author Contributions

Conceived and designed the experiments: JS. Performed the experiments: JS. Analyzed the data: JS. Contributed reagents/materials/analysis tools: JS. Wrote the paper: JS PEJ. Mathematical proofs: PEJ.

### References

- 1. Tanzi E (1893) I fatti e le induzioni nellodierna isologia del sistema nervosa. Riv Sper Freniatr Med Leg 19: 419–472.
- 2.
Hebb D (1949) The Organization of Behavior: A Neuropsychological Theory. New York: Wiley.
- 3. Abraham W (2003) How long will long-term potentiation last? Philos Trans R Soc Lond B Biol Sci 358: 735–744.
- 4. Stone J, Hunkin N, Hornby A (2001) Predicting spontaneous recovery of memory. Nature 414: 167–168.
- 5.
Coltheart M, Byng S (1989) A treatment for surface dyslexia. In: Seron X, editor. Cognitive Approaches in Neuropsychological Rehabilitation. London: Lawrence Erlbaum Associates.
- 6. Weekes B, Coltheart M (1996) Surface dyslexia and surface dysgraphia: treatment studies and their theoretical implications. Cogn Neuropsychol 13: 277–315.
- 7. Atkins P (2001) What happens when we relearn part of what we previously knew? Predictions and constraints for models of long-term memory. Psychol Res 65: 202–215.
- 8. Roediger H III (1973) Inhibition in recall from cueing with recall targets. J Verbal Learn Verbal Behav 12: 644–657.
- 9. Serra M, Nairne J (2000) Part-set cuing of order information: implications for associative theories. Mem Cognit 28: 847–855.
- 10.
Hinton G, Plaut D (1987) Using fast weights to deblur old memories. Proceedings Ninth Annual Conference of the Cognitive Science Society. pp. 177–186.
- 11. Atkins P, Murre J (1998) Recovery of unrehearsed items in connectionist models. Connect Sci 10: 99–119.
- 12. Stone J, Jupp P (2007) Free-lunch learning: modelling spontaneous recovery of memory. Neural Comput 19: 194–217.
- 13. Stone J (2007) Distributed representations accelerate evolution of adaptive behaviours. PLoS Comput Biol 3: e147. doi:10.1371/journal.pcbi.0030147.
- 14. Whitlock J, Heynen A, Shuler M, Bear M (2006) Learning induces long-term potentiation in the hippocampus. Science 313: 1093–1097.
- 15. Staubli U, Lynch G (1987) Stable hippocampal long-term potentiation elicited by theta pattern stimulation. Brain Res 435: 227–234.
- 16. Abraham WC, Logan B, Greenwood JM, Dragunow M (2002) Induction and experience-dependent consolidation of stable long-term potentiation lasting months in the hippocampus. J Neurosci 22: 9626–9634.
- 17.
Rao C (1973) Linear Statistical Inference and its Applications. 2nd edition. New York: Wiley.