Efficient Partitioning of Memory Systems and Its Importance for Memory Consolidation

Long-term memories are likely stored in the synaptic weights of neuronal networks in the brain. The storage capacity of such networks depends on the degree of plasticity of their synapses. Highly plastic synapses allow for strong memories, but these are quickly overwritten. On the other hand, less labile synapses result in long-lasting but weak memories. Here we show that the trade-off between memory strength and memory lifetime can be overcome by partitioning the memory system into multiple regions characterized by different levels of synaptic plasticity and transferring memory information from more plastic to less plastic regions. The improvement in memory lifetime is proportional to the number of memory regions, and the initial memory strength can be orders of magnitude larger than in a non-partitioned memory system. This model provides a fundamental computational reason for memory consolidation processes at the systems level.


Brief Comment on Models
In this supplementary information we describe the models used in the paper in detail, together with all relevant analysis. We first discuss the homogeneous and heterogeneous models. Then there are two memory transfer models: 1 - the "synaptic model", in which the neuronal activity is not explicitly modeled, and 2 - the "neuronal model", which does include neuronal activity. Both models are Markov models, i.e. they are stochastically updated in discrete time, where the state of the variables at time t + 1 depends only on the state of the variables at time t. We consider the memory capacity of these models by adopting an ideal observer approach. That is, we track the mnemonic trace of one particular memory, encoded in the pattern of synaptic weights of the model, until it is unrecoverable. For the models considered here we can obtain an analytical formula for the memory trace, which allows us to determine how important measures, such as the initial memory strength and the memory lifetime, scale with the system size, i.e. with how many synapses are available. All analytical results are obtained by studying continuous-time approximations to these Markov processes. These approximations take the form of ordinary and partial differential equations. We have organized this supplementary information in order to accompany the figures in the paper.

2 A synaptic model with homogeneous transition probabilities (Fig. 1a and b)

The Markov model
We consider binary synapses, which can be in one of two states: depressed or potentiated. At each time step, each synapse is independently presented with a plasticity event (see top left of Fig.1a) which can be either potentiating or depressing with probability 1/2. A depressed (potentiated) synapse presented with a potentiating (depressing) event will change state with a probability q, which is called the transition probability. The binary nature of the synapses means that only one memory may be encoded at a time. This is illustrated in Fig.1a by a color code. In the model we consider the state of N synapses simultaneously, and the 'memories' are therefore the combined N plasticity events to which the synapses are subjected, see bottom left of Fig.1a. These memories are random and uncorrelated from one time step to the next. As time progresses, we keep track of how similar the state of the N synapses is to one particular memory. Since all memories are identically distributed, tracking one is equivalent to tracking any other one and will tell us how all memories decay in time.
We can formalize this model description by assigning the value 1 to a potentiated synapse and −1 to a depressed one. Similarly, a plasticity event is assigned a value 1 if it is potentiating and −1 if depressing. We then define a vector of length N, $J^t$, where $J^t_i \in \{-1, 1\}$ is the state of synapse i at time t. Similarly, the memories are also vectors of length N, $m^t$, where $m^t_i \in \{-1, 1\}$ is the plasticity event to which synapse i is subjected at time t. If we choose to track the memory presented at time $t^*$, then we define the memory trace as the signal at time t, which is just the dot product of the two vectors, $S^{t-t^*} = m^{t^*} \cdot J^t$. The signal itself is a stochastic variable, since the updating of the synaptic states is stochastic. This means that if one runs several simulations presenting exactly the same memories, the signal will be different each time, see right hand side of Fig. 1a. The mean signal, understood as the signal averaged over many realizations of the Markov process, can be computed analytically. We compare this signal to the background noise, which is defined as the standard deviation in the overlap between uncorrelated memories.
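The updating scheme just described is straightforward to simulate. Below is a minimal sketch in Python; the parameter values (N, q, number of time steps) and the function name are illustrative choices, not those used in the paper's figures.

```python
import numpy as np

def simulate_signal(N=10_000, q=0.1, T=50, seed=0):
    """Track the overlap S_t = m* . J^t with the memory encoded at t = 0."""
    rng = np.random.default_rng(seed)
    J = rng.choice([-1, 1], size=N)        # synaptic states
    m_star = rng.choice([-1, 1], size=N)   # the tracked memory
    # Encode the tracked memory: mismatched synapses switch with prob. q.
    flip = (J != m_star) & (rng.random(N) < q)
    J[flip] = m_star[flip]
    signals = []
    for _ in range(T):
        signals.append(int(J @ m_star))
        m = rng.choice([-1, 1], size=N)    # next, uncorrelated memory
        flip = (J != m) & (rng.random(N) < q)
        J[flip] = m[flip]
    return np.array(signals)

S = simulate_signal()  # mean behaviour: roughly q*N*(1-q)**t, plus noise
```

On a single run the signal fluctuates around the mean trace with a standard deviation of order $\sqrt N$, as described in the text.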
The probability of finding a synapse in the potentiated state at time t + 1 is $p^{t+1}_+$, where $p^t_+ = \Pr(J^t_i = 1)$. Using the fact that $p^t_+ + p^t_- = 1$ allows one to write

$p^{t+1}_+ = \frac{q}{2} + (1-q)\,p^t_+.$ (S.2)

In tracking one particular memory, we wish to know $p^t_+$ for those synapses which were subjected to a potentiating event when that memory was presented. Therefore, the initial condition is $p^0_+ = (1+q)/2$. An identical equation and initial condition govern the dynamics of $p^t_-$ for the synapses subjected to a depressing event.
The number of potentiated synapses $n^t_+$ among the N/2 synapses subjected to a potentiating event follows a Binomial distribution with probability $p^t_+$. It has mean $E(n^t_+) = \frac{N}{2}p^t_+$ and variance $\mathrm{var}(n^t_+) = \frac{N}{2}p^t_+(1-p^t_+)$. The mean signal or memory trace can be written

$\bar S^t = N\left(2p^t_+ - 1\right).$

Note that the term −1 in the parentheses removes the trivial correlation (synapses are potentiated or depressed with probability 1/2 irrespective of the memory). The variance in the signal is $\mathrm{var}(S^t) = 4Np^t_+(1-p^t_+)$. Writing $\bar p^t = 2p^t_+ - 1$, we can then write the signal-to-noise ratio as

$SNR^t = \frac{\sqrt N\,\bar p^t}{\sqrt{1-(\bar p^t)^2}},$

where $0 \le \bar p^t \le 1$. For simplicity, in this work we take $SNR^t = \sqrt N\,\bar p^t$, which is a lower bound for the true SNR and is the correct asymptotic expression for the SNR when $\bar p^t \ll 1$, e.g. at long times. This is so because at long times the probability of any synapse being potentiated is just one half, i.e. $p^t_+ \to 1/2$ as $t \to \infty$, and $p^t_+ = (1+\bar p^t)/2$. While for the simple case of homogeneous synapses one can calculate the signal analytically in the discrete case ($\bar S^t = qN(1-q)^t$), solving for the signal in the continuous-time approximation is, in general, much easier. For the memory transfer model only the continuous-time approximation will yield analytical results. For this reason we consider now the continuous-time formulation of the Markov process, in which Eq. S.2 becomes the differential equation $\dot{\bar S}(t) = -q\bar S(t)$ with $\bar S(0) = qN$, the solution of which is $\bar S(t) = qNe^{-qt}$; the signal-to-noise ratio is therefore $SNR(t) = qN^{1/2}e^{-qt}$. In judging the performance of a model, we consider three salient characteristics of the SNR: 1 - the initial SNR, 2 - the functional form of the decay of the SNR, and 3 - the lifetime of the SNR.
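The recursion for $p^t_+$ and the closed-form signal $qN(1-q)^t$ can be checked against each other numerically; a short sketch, with illustrative values of q and N:

```python
N, q, T = 1_000_000, 0.2, 30
p = (1 + q) / 2                     # p_+^0, just after encoding
max_err = 0.0
for t in range(T):
    S_recursion = N * (2 * p - 1)   # mean signal from the iterated recursion
    S_closed = q * N * (1 - q) ** t # closed form
    max_err = max(max_err, abs(S_recursion - S_closed))
    p = q / 2 + (1 - q) * p         # p_+^{t+1} = q/2 + (1-q) p_+^t
```

The two expressions agree to floating-point precision, since the recursion for $\bar p^t = 2p^t_+ - 1$ is exactly $\bar p^{t+1} = (1-q)\bar p^t$ with $\bar p^0 = q$.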

The continuous-time approximation
In the case of the homogeneous population of synapses, the decay is exponential, while the initial SNR is $qN^{1/2}$ and the lifetime, determined by setting the SNR equal to one, is $T = \frac{1}{q}\ln\left(qN^{1/2}\right)$. It is clear that, for fixed N, taking a q near one will lead to a large initial SNR but a short lifetime, while a small q leads to a weak initial SNR but a long lifetime, see Fig. 1b. This is a fundamental trade-off in populations of bounded synapses with homogeneous transition probabilities [2]. Significantly, adding synapses in order to increase memory lifetimes is extremely inefficient, since the lifetime scales as ln N.

3 A synaptic model with heterogeneous transition probabilities (Fig. 1c)

The Markov model
Each synapse is updated as in the homogeneous model, with the difference that the transition probability is not the same for all synapses. Specifically, we consider n ensembles of N/n synapses, such that the total number of synapses is N. Synapses in ensemble $k \in [1, n]$ have a transition probability $q_k = \bar q\,q^{(k-1)/(n-1)}$, so that synapses in ensemble 1 are the most plastic, with transition rate $\bar q$, and those in ensemble n are the least plastic, with transition rate $\bar q q$.

The continuous-time approximation
Following the same derivation as in the case of a population of homogeneous synapses leads to n distinct yet independent differential equations for the signal in each ensemble,

$\dot{\bar S}_k(t) = -q_k \bar S_k(t),$

the solution of which is $\bar S_k(t) = q_k\frac{N}{n}e^{-q_k t}$, and the total signal is then $\bar S(t) = \sum_{k=1}^n \bar S_k(t)$. The initial SNR is given by

$SNR(0) = \frac{1}{\sqrt N}\sum_{k=1}^n q_k\frac{N}{n} \approx \frac{\bar q N^{1/2}(1-q)}{\ln q^{-1}},$

where the last approximate formula holds as long as $\epsilon = \frac{\ln q^{-1}}{n-1} \ll 1$. Thus, compared to the homogeneous model, the initial SNR is reduced by a factor proportional to the inverse logarithm of the slowest transition rate, $1/\ln q^{-1}$. The lifetime of the memory is determined by the SNR of the slowest ensemble and is $T = \frac{1}{\bar q q}\ln\left(\bar q q (N/n)^{1/2}\right)$, i.e. there is a weak reduction in the lifetime due to the fact that more ensembles mean fewer synapses per ensemble, in particular in the slowest ensemble. Interestingly, the functional form of the SNR is approximately powerlaw with exponent one, with an exponential cut-off corresponding to the slowest ensemble. It is not hard to see how a sum of exponentials can generate a powerlaw dependence [1]. Consider the continuum approximation of this sum of exponentials, in which the rate is itself exponentially distributed, as in the synaptic model:

$SNR(t) \approx \frac{N^{1/2}}{\ln q^{-1}}\int_{\bar q q}^{\bar q} du\, e^{-ut} = \frac{N^{1/2}}{\ln q^{-1}}\,\frac{e^{-\bar q q t} - e^{-\bar q t}}{t}.$

In this continuum approximation it is clear that after an initial transient for times $t < 1/\bar q$, the decay is approximately powerlaw until a time $t > 1/(\bar q q)$, after which there is an exponential cutoff. In the intermediate regime $1/\bar q < t < 1/(\bar q q)$ the decay is approximately powerlaw with exponent one. In this regime, the lifetime scales as $N^{1/2}$ and not ln N. Therefore, memory lifetimes can be significantly extended by adding more synapses.
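The powerlaw behaviour of the summed exponentials can be illustrated numerically. In the sketch below (illustrative parameter values, using the ladder $q_k = \bar q\,q^{(k-1)/(n-1)}$), doubling t inside the intermediate window roughly halves the SNR, as expected for a 1/t decay:

```python
import numpy as np

qbar, q, n, N = 0.8, 1e-3, 50, 1e9
k = np.arange(1, n + 1)
rates = qbar * q ** ((k - 1) / (n - 1))   # q_k, from qbar down to qbar*q

def total_snr(t):
    """Sum of per-ensemble signals, divided by the sqrt(N) noise floor."""
    return float(np.sum(rates * (N / n) * np.exp(-rates * t)) / np.sqrt(N))

t = 100.0   # well inside the window 1/qbar << t << 1/(qbar*q)
ratio = total_snr(t) / total_snr(2 * t)   # close to 2 in the 1/t regime
```

Outside this window the ratio departs from 2: at very short times the fast ensembles dominate, and beyond $1/(\bar q q)$ the exponential cutoff takes over.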

Optimal readout of the memory signal
We have so far considered the case in which all synapses are read out simultaneously in order to determine the SNR of the memory. In the case of heterogeneous synapses, however, we can envisage an optimal readout in which only certain ensembles of synapses are read out, so as to maximize the SNR; specifically, we read out only those ensembles whose contribution increases the total SNR. In the case of heterogeneous ensembles of synapses, the optimal readout provides a small increase in SNR, as shown in Fig. S.1A. Furthermore, this increase is essentially independent of the total number of ensembles in the model. For heterogeneous synapses it is optimal, at intermediate times, to read out a large fraction of the total number of ensembles, as shown in Fig. S.1B.

Summary
In summary, allowing for heterogeneity in the transition rates, which is tantamount to having some synapses be very plastic and others less so, leads to small reductions in the initial SNR and memory lifetime for fixed N compared to the homogeneous case. However, the heterogeneity provides a significant boost to the SNR at intermediate times. Specifically, in the powerlaw regime lifetimes scale as $N^{1/2}$. The optimal readout for heterogeneous synapses leads only to a small improvement compared to reading out all ensembles, and requires reading out a significant fraction of ensembles in the powerlaw regime.

The synaptic memory transfer model (Figs.2 and 3)
For clarity we first list here the main model parameters, variables and functions.

[Figure S.1: A. The SNR for heterogeneous ensembles of synapses, both reading out all ensembles (dashed lines) and using an optimal readout (solid lines). The total number of synapses is $N = 10^9$, divided into n = 10, 20, 50 and 100 ensembles (black, red, green, blue). Here $q_i = \bar q\,q^{(i-1)/(n-1)}$ with $\bar q = 0.8$ and $q = 0.001$. B. The width of the optimal readout scaled by the total number of ensembles of synapses.]

Symbol | Description
N | Total number of synapses
n | Number of stages
$q_i = \bar q\,q^{(i-1)/(n-1)}$ | Learning rate of stage i
$q_1 = \bar q$ | Fastest learning rate
$q_n = \bar q q$ | Slowest learning rate
$SNR_i(t)$ | Signal-to-noise ratio in stage i at time t in the continuous-time model

The Markov model
In the memory transfer model synapses take on different transition probabilities, as in the heterogeneous model. However, unlike in the heterogeneous model, synapses from one ensemble in the consolidation model may affect the state of synapses in another ensemble. Specifically, N synapses are arranged into n stages of N/n synapses each, and the stages interact in a feedforward manner. Memories are encoded only in the state of synapses in stage 1. The states of synapses in stage 2 depend on the states of synapses in stage 1, and so on. Furthermore, the synapses in stage 1 are taken to be the most plastic, and synapses in each stage thereafter are progressively less plastic. In this way we seek to capture a consolidation process by which memories are initially encoded with a strong SNR in the input stage and then are transferred to deeper stages where the SNR lifetimes become progressively longer. It was shown previously that such a scheme, implemented via a cascade of metaplastic states at the level of a single synapse, can greatly improve memory capacity over models of simple synapses [1]. Here we implement a similar idea in a spatially segregated model with feedforward structure. Specifically, at time t, a memory $m^t$ of length N/n, consisting of a random pattern of potentiating ($m^t_i = 1$) and depressing ($m^t_i = -1$) events, is presented to the N/n synapses in stage 1, which have synaptic state $J^t_1$. Synapse i is subjected either to a potentiating or to a depressing event with probability 1/2, and is updated with probability $q_1$ as in the previous models. Therefore, the updating of synapses in stage 1 is identical to that of ensemble 1 in the synaptic model with heterogeneous transition probabilities in Section 3. We then assume that synapse i in stage 2 is influenced by the state of synapse i in stage 1 in the following way.
If synapse i in stage 1 is in a potentiated (depressed) state at time t, i.e. $J^t_{1,i} = 1$ ($J^t_{1,i} = -1$, respectively), then synapse i in stage 2 will potentiate (depress) at time t + 1 with probability $q_2$. The update rule for synapses in stage 3 proceeds analogously, but depends now on the state of synapses in stage 2, and so on. See Fig. 2a for a schematic of this update rule.
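The feedforward update rule can be sketched as a Monte-Carlo simulation. All sizes and rates below are illustrative (much smaller than the paper's figures), and `simulate_transfer` is our naming, not the paper's:

```python
import numpy as np

def simulate_transfer(n=5, per_stage=20_000, qbar=0.8, q=0.01, T=40, seed=1):
    """Feedforward chain: stage 1 is driven by memory events, stage k > 1
    copies the corresponding synapse of stage k-1 with probability q_k."""
    rng = np.random.default_rng(seed)
    rates = qbar * q ** (np.arange(n) / (n - 1))
    J = rng.choice([-1, 1], size=(n, per_stage))
    m_star = rng.choice([-1, 1], size=per_stage)   # the tracked memory
    overlaps = np.zeros((T, n), dtype=int)
    mem = m_star                                   # presented at t = 0
    for t in range(T):
        # Update deeper stages first, so each stage sees the previous
        # time step's state of the stage above it.
        for k in range(n - 1, 0, -1):
            flip = (J[k] != J[k - 1]) & (rng.random(per_stage) < rates[k])
            J[k, flip] = J[k - 1, flip]
        flip = (J[0] != mem) & (rng.random(per_stage) < rates[0])
        J[0, flip] = mem[flip]
        overlaps[t] = J @ m_star
        mem = rng.choice([-1, 1], size=per_stage)  # next, uncorrelated memory
    return overlaps

ov = simulate_transfer()
# Stage 1's overlap decays from the start; stage 2's overlap first rises
# and then decays (non-monotonic), as described in the text.
```

This reproduces qualitatively the consolidation behaviour discussed below: the trace in stage 2 is non-monotonic while the trace in stage 1 decays monotonically.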
We can formalize the update rule mathematically as before. The probability of finding a synapse in stage 1 in the potentiated state at time t + 1 is

$p^{t+1}_{1,+} = \frac{q_1}{2} + (1 - q_1)\,p^t_{1,+},$

which follows from Eq. S.2.
In order to derive the update rule for stage k > 1, we must take into consideration the fact that the probabilities in stage k > 1 depend on those in stage k − 1; the states may, in fact, be correlated. The probability of finding a synapse in stage k in the potentiated state at time t + 1 is

$p^{t+1}_{k,+} = p^t_{k,+} + q_k\left(p^t_{(k,k-1;-,+)} - p^t_{(k,k-1;+,-)}\right),$

where $p_{(k,k-1;a,b)}$ is the joint probability for a synapse in stage k to be in state $a \in \{-,+\}$ and the corresponding synapse in stage k − 1 to be in state $b \in \{-,+\}$. Again, because the numbers of potentiated and depressed synapses follow Binomial distributions, we can write $\bar S^t_k = \frac{N}{n}\left(2p^t_{k,+} - 1\right) \equiv \frac{N}{n}\bar p^t_k$. We approximate the SNR as $SNR^t_k = \bar S^t_k/\sqrt N$, which is a lower bound on the true SNR and the correct asymptotic form of the SNR for $\bar p^t_k \ll 1$. As before, these equations can be solved analytically, but become increasingly unwieldy for downstream stages. It is easier to do the analysis in the continuous-time approximation.

The continuous-time approximation
In the continuous-time approximation, discrete time differences become time derivatives, and one has the following set of equations (Eqs. S.14-S.21):

$\dot S_1(t) = -q_1 S_1(t),$
$\dot S_i(t) = q_i\left(S_{i-1}(t) - S_i(t)\right), \quad i = 2, \ldots, n,$

together with the initial condition $S_1(0) = q_1\frac{N}{n}$, $S_i(0) = 0$ for i > 1, where as before $S_i(t) = \frac{N}{n}(2p_i(t) - 1)$ and $q_{i+1} < q_i$. At this point, we consider how Eqs. S.14-S.21 reflect the phenomenon of consolidation. At time t = 0 a new 'memory' is encoded in stage 1. This memory trace drives an increase in the memory trace in stage 2, which will then decay at a rate $q_2$, and so on. Therefore, while in the heterogeneous model the memory traces in stages 1 and 2 both decrease monotonically in time, in the memory transfer model the memory trace in stage 2 is non-monotonic: it first increases, and then decreases, as shown in Fig. 2b. At this point we can ask if the memory transfer model outperforms the benchmark heterogeneous model, even in the simplest case of two stages (or ensembles in the heterogeneous case). First we note that $S^h_1 = S^{mt}_1 = q_1\frac{N}{n}e^{-q_1 t}$ is identical for the heterogeneous (h) and the memory transfer (mt) models. On the other hand, $S^h_2 = q_2\frac{N}{n}e^{-q_2 t}$, while

$S^{mt}_2 = \frac{q_1 q_2}{q_1 - q_2}\frac{N}{n}\left(e^{-q_2 t} - e^{-q_1 t}\right).$

These curves intersect at a time $T = \frac{1}{q_1 - q_2}\ln(q_1/q_2)$, and for times greater than T, i.e. $t = T + t'$, one finds that

$\frac{S^{mt}_2}{S^h_2} = \frac{q_1}{q_1 - q_2}\left(1 - \frac{q_2}{q_1}e^{-(q_1 - q_2)t'}\right),$

which is always greater than 1 for $t' > 0$. Therefore, after a time T the memory transfer model will always outperform the heterogeneous model. In fact, we will show that the improvement is largest when there are many stages, and the difference between transition probabilities in adjacent stages is small. If this is the case, we can write $q_2 = q_1 - \Delta q$ where $\Delta q \ll 1$, and we find that $T \sim 1/q_1$ to leading order. Since $q_1$ is close to one (very plastic synapses in stage 1), the improvement in performance of the consolidation model over the heterogeneous model comes about already at very short times.
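The two-stage comparison can be verified directly from the closed-form solutions of the continuous-time equations (the transfer-model expression follows from integrating $\dot S_2 = q_2(S_1 - S_2)$; parameter values are illustrative):

```python
import math

q1, q2, N, n = 0.5, 0.05, 1e6, 2

def S2_h(t):
    """Stage-2 signal in the heterogeneous model."""
    return q2 * (N / n) * math.exp(-q2 * t)

def S2_mt(t):
    """Stage-2 signal in the memory transfer model."""
    return q1 * q2 / (q1 - q2) * (N / n) * (math.exp(-q2 * t) - math.exp(-q1 * t))

T = math.log(q1 / q2) / (q1 - q2)   # predicted crossing time
```

Before T the heterogeneous stage 2 is larger; at T the two curves agree; afterwards the transfer model always wins.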

Testing the meanfield model against the full Markov model
In order to compare the meanfield model with the full Markov model, we conduct Monte-Carlo simulations of the stochastic Markov model with binary synapses and a learning rate in stage i of $q_i = 0.8 \times (0.01)^{(i-1)/9}$, for a total of n = 10 stages. Each stage has $10^6$ synapses, for a total of $10^7$ synapses. The total signal, averaged over 10 runs, is shown as black circles in the figure. There is a discrepancy between the continuous-time meanfield model and the full Markov model at short times (and hence in early stages). This discrepancy is due to the assumption of evolution in continuous time, whereas the Markov model evolves in discrete time. This can be seen by comparing the Markov model to the discrete-time meanfield model. In particular, we have solved the discrete-time meanfield model analytically for the first two stages. The solution for stage 1 is simply $\bar S^t_1 = q_1\frac{N}{n}(1-q_1)^t$, whereas for stage 2, after a lengthy calculation (details not shown), one arrives at a considerably more cumbersome closed-form expression. The signal in downstream stages can also be calculated, but the formulae become increasingly unwieldy.
The result for single runs is shown as solid lines in the figure, along with the result of the meanfield model (solid green line). We define the SNR as the signal divided by $N^{1/2}$ which, as we show above, is a lower bound on the SNR and a good approximation for stage k as long as $\bar p_k \ll 1$. It is clear that the fluctuations in the signal are indeed as large as the signal itself on a single run once the SNR approaches 1.

The continuous-time continuous-space approximation
If the number of stages is large enough, we can recast Eqs. S.14-S.21 in the form of a partial differential equation. In doing so, we think of the index of the stages as a spatial variable and assume that nearby stages have similar signals. Specifically, we write

$S_{i\pm 1}(t) = S(x, t) \pm dx\,\frac{\partial S}{\partial x} + \frac{dx^2}{2}\frac{\partial^2 S}{\partial x^2} + \ldots,$

where dx = 1/n (and so $x \in [0, 1]$). Then Eqs. S.14-S.21 become

$\frac{\partial S}{\partial t} = -\frac{q(x)}{n}\frac{\partial S}{\partial x} + \frac{q(x)}{2n^2}\frac{\partial^2 S}{\partial x^2},$ (S.23)

$S(x, 0) = \bar q\frac{N}{n}\delta(x),$ (S.24)

where $\delta(x)$ is the Dirac delta function and we have dropped all terms of order $1/n^3$ and higher. Furthermore, we choose $q(x) = \bar q q^x$ to be consistent with the Markov model. Eqs. S.23-S.24 represent an advection-diffusion process in which the initial condition is a pulse of amplitude $\bar q N/n$ at x = 0. The pulse travels to the right and slowly diffuses, eventually exiting the system at x = 1. Moreover, both the velocity of the pulse and the diffusion depend on the spatial location of the pulse, i.e. they are not homogeneous. Note that the pulse represents the correlation of the synaptic weights with a particular memory. Therefore, it is this correlation, i.e. the 'memory', which propagates.

The advection equation
We can determine the spatial dependence of the pulse velocity by ignoring the diffusion term and solving

$\frac{\partial S}{\partial t} = -\frac{q(x)}{n}\frac{\partial S}{\partial x},$ (S.26)

which is a pure advection equation. In this equation the initial pulse propagates to the right without changing shape. The curve along which it travels in space and time, c(x, t), can be found by the so-called method of characteristics. Since the signal pulse does not change along this curve, we can write $S(x, t) = f(c(x, t))$ for some function f, which, comparing to Eq. S.26, means that we can take $\frac{\partial c}{\partial t} = 1$ and $\frac{\partial c}{\partial x} = -\frac{n}{\bar q}q^{-x}$. From this we find that along a characteristic $\frac{\partial x}{\partial t} = \frac{\bar q}{n}q^x$, which, when integrated with the initial condition x(0) = 0, gives

$x(t) = \frac{1}{\ln q^{-1}}\ln\left(1 + \frac{\bar q \ln q^{-1}}{n}\,t\right).$

For advection or wave phenomena where the velocity of propagation is constant, the characteristics have the form x − vt. Here it is clear that the propagation is not at a constant speed, but rather slows down exponentially. The memory lifetime is just the time at which the memory pulse exits the system at x = 1, i.e.

$T \sim \frac{n}{\bar q q \ln q^{-1}}.$ (S.29)

From this we can see that the memory lifetime is boosted by a factor $n/\ln q^{-1}$ compared to the heterogeneous case (in the heterogeneous case the powerlaw regime ends for $t \sim 1/(\bar q q)$, see Section 3.2). Conspicuously missing is the scaling with the number of synapses N. This is because we have not actually solved the true advection-diffusion equation: for the pure advection equation the amplitude of the pulse does not change in time, so as long as the initial SNR is above one, nothing is gained by adding more synapses. Once we include diffusion this will change.
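The characteristic found above can be checked numerically: the sketch below (illustrative parameter values) verifies that $x(t)$ satisfies $dx/dt = (\bar q/n)q^x$ and exits the system at $T \approx n/(\bar q q \ln q^{-1})$:

```python
import math

qbar, q, n = 0.8, 1e-3, 100
eps = math.log(1 / q) / n

def x(t):
    """Pulse position along the characteristic, x(0) = 0."""
    return math.log(1 + qbar * eps * t) / math.log(1 / q)

# Finite-difference check of dx/dt = (qbar/n) * q**x
t0, h = 50.0, 1e-4
lhs = (x(t0 + h) - x(t0 - h)) / (2 * h)
rhs = (qbar / n) * q ** x(t0)

# Exit time: x(T_exit) = 1
T_exit = (1 / q - 1) / (qbar * eps)
```

For small q the exact exit time $(q^{-1}-1)/(\bar q\epsilon)$ agrees with the quoted scaling $n/(\bar q q \ln q^{-1})$ to within a fraction of a percent.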

The advection-diffusion equation
We can use the small parameter $\epsilon = \ln q^{-1}/n$ to find an approximate solution to the full equations, Eqs. S.23-S.24. Before doing this, we perform a change of variables to eliminate the spatially varying velocity, i.e. we stretch space so that the characteristics are straight lines. We already know the correct change of variables from the preceding section. Specifically, we define the new spatial variable

$y = \frac{n}{\bar q \ln q^{-1}}\left(q^{-x} - 1\right)$

and rewrite the PDE in terms of y. This requires using the chain rule for differentiation. The first and second derivatives are

$\frac{\partial S}{\partial x} = \frac{n}{\bar q}q^{-x}\frac{\partial S}{\partial y}$

and

$\frac{\partial^2 S}{\partial x^2} = \left(\frac{n}{\bar q}\right)^2 q^{-2x}\frac{\partial^2 S}{\partial y^2} + \frac{n\ln q^{-1}}{\bar q}q^{-x}\frac{\partial S}{\partial y}.$

The PDE becomes

$\frac{\partial S}{\partial t} = -\left(1 - \frac{\epsilon}{2}\right)\frac{\partial S}{\partial y} + \frac{q^{-x}}{2\bar q}\frac{\partial^2 S}{\partial y^2},$

with $q^{-x} = 1 + \bar q\epsilon y$. Since $\epsilon$ is small, we can ignore the small correction to the (now constant) velocity, but we cannot ignore the term proportional to y in the diffusion. The reason is that y itself ranges between 0 and values of order $1/\epsilon$. Rather, we define a new spatial variable $Y = \epsilon y$, which allows us to write

$\frac{\partial S}{\partial t} = -\frac{\partial S}{\partial y} + D(Y)\frac{\partial^2 S}{\partial y^2}, \qquad D(Y) = \frac{1 + \bar q Y}{2\bar q}.$ (S.34-S.35)

The separation of the variables y and Y has a physical interpretation here. The memory pulse propagates through the system, and as it does so, it slowly spreads out. The shape of the pulse depends on this diffusion, i.e. on the second spatial derivative, which itself depends on the location of the pulse. However, the diffusion coefficient changes over a length scale which is large compared to the pulse shape itself. Therefore, we can treat the system as if the diffusion coefficient were locally constant. The constant-coefficient advection-diffusion equation with delta-function initial condition has a classical solution which, given the parameters in Eqs. S.34-S.35, takes the form of a Gaussian pulse centred at y = t whose local width is set by D(Y):

$S(y, t) \propto \frac{1}{\sqrt{4\pi D(Y)t}}\exp\left(-\frac{(y - t)^2}{4D(Y)t}\right).$
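The advection-diffusion behaviour of Eqs. S.23-S.24 can be illustrated with a simple explicit upwind finite-difference scheme. This is only a qualitative sketch (grid, time step and parameters are illustrative, and the delta-function initial condition is approximated by a single narrow cell): the pulse drifts to the right while slowly spreading, as described in the text.

```python
import numpy as np

qbar, q, n = 0.8, 0.01, 50
M = 200                          # grid cells over x in [0, 1]
dx, dt = 1.0 / M, 0.05
x = (np.arange(M) + 0.5) * dx
v = qbar * q ** x / n            # local drift velocity  q(x)/n
D = qbar * q ** x / (2 * n**2)   # local diffusion       q(x)/(2 n^2)

S = np.zeros(M)
S[5] = 1.0 / dx                  # narrow pulse near x = 0 (unit area)

for _ in range(4_000):           # integrate to t = 200
    Sl = np.concatenate(([0.0], S[:-1]))   # left neighbours (0 outside)
    Sr = np.concatenate((S[1:], [0.0]))    # right neighbours
    adv = -v * (S - Sl) / dx               # upwind: the drift is rightward
    dif = D * (Sr - 2 * S + Sl) / dx**2
    S = S + dt * (adv + dif)

com = float(np.sum(x * S) / np.sum(S))     # pulse centre of mass
```

With these parameters the centre of mass ends up near the position predicted by the characteristic, $x(t) = \ln(1 + \bar q t \ln q^{-1}/n)/\ln q^{-1} \approx 0.6$ at t = 200.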

Readout of the memory signal: the role of correlations
We have already discussed the size of the fluctuations in any one memory stage. Specifically, if the total number of synapses is N in a system with n stages, then the variance of the signal in stage i is $(N/n)(1 - \bar p_i^2(t))$, which we have approximated as N/n. When reading out the memory signal from multiple stages we must be concerned with correlations in the fluctuations between different stages; these correlations will increase the variance of the signal.
We expect positive noise correlations in the signal between stages, since the state of a synapse in stage i depends on the state of the corresponding synapse in stage i − 1, whether or not it correctly encodes the signal. We can estimate this correlation by noting that the probability that both synapses are in the same state (above and beyond the chance probability of 1/4) at time t is proportional to $q_i(1 - q_{i-1})$. This is so because the synapse in stage i must have switched state to match the state of the synapse in stage i − 1, and the synapse in stage i − 1 cannot yet have switched state. Similar reasoning tells us that the correlation between stages i and i − 2 is proportional to $q_i q_{i-1}(1 - q_{i-2})$. The complete correlation matrix depends on such correlations at all orders.
In any case, we can write, for the total variance of the readout signal,

$\sigma^2_{tot} = \sum_i \sigma^2_i + \sum_{i \neq j}\sigma^2_{ij}.$

If the covariances $\sigma^2_{ij}$ were of the same order as the variances $\sigma^2_i$, then the total variance would be dominated by these covariances, since there are n(n − 1) of them. However, the covariances are proportional to the learning rates, as we explain above, and so are not of order one. How can we estimate them? It is not hard to estimate the covariance between adjacent stages using the arguments above; it can be calculated explicitly to give

$\sigma^2_{i,i-1} \propto \frac{N}{n}\,q_i\left(1 - q_{i-1}\right).$

However, how do we take into account the contribution from all other stages? A simple calculation can give us a hint. The correlation between stage 1 and stage i is proportional to $q_1 q_2 \cdots q_i$. In our model we use $q_i = q^{(i-1)/(n-1)}$, which means that the correlation is proportional to $q^{i(i-1)/(2(n-1))}$. The idea is to sum this quantity over all i to see how the total cross-correlation scales as a function of n. This can be done by taking i as a continuous variable x and integrating from 1 to n, which gives the integral

$\int_1^n dx\; q^{x(x-1)/(2(n-1))} \sim \sqrt n.$

This shows that adding up the small pairwise correlations between stages over all stages gives a total contribution of order $\sqrt n$. Therefore, we can try approximating the total covariance as the covariance between adjacent stages, magnified by a factor $\sqrt n$. This gives

$\sigma^2_{tot} \approx \sum_i \sigma^2_i + 2\sqrt n\sum_i \sigma^2_{i,i-1},$ (S.38)

where the factor 2 is because $\sigma^2_{ij} = \sigma^2_{ji}$.

[Figure caption: Numerically measured readout noise; there are 1000 synapses per stage. We track a memory which is never presented, and hence the signal here has mean zero. To generate one data point we perform 10 simulations of 1000 time steps each. For each simulation we calculate the noise (standard deviation), and then calculate the mean and standard deviation of this noise over the 10 simulations. We plot the mean standard deviation normalized by the expected standard deviation if the noise were uncorrelated, which is just $\sqrt N$, the square root of the total number of synapses. Error bars show one standard deviation in the measurement of this normalized noise level over the ten trials. The solid lines are from Eq. S.38.]
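The noise measurement described in the caption can be sketched as follows. All parameters are illustrative, `measure_noise` is our naming, and the measured standard deviation is only expected to be of the order of $\sqrt N$ (positive inter-stage correlations push it above the uncorrelated value, while slowly mixing deep stages reduce the within-run estimate):

```python
import numpy as np

def measure_noise(n=10, per_stage=1000, qbar=0.8, q=0.001, T=2000, seed=2):
    """Std. dev. over time of the summed overlap with a never-presented memory."""
    rng = np.random.default_rng(seed)
    rates = qbar * q ** (np.arange(n) / (n - 1))
    J = rng.choice([-1, 1], size=(n, per_stage))
    probe = rng.choice([-1, 1], size=per_stage)   # never-presented memory
    totals = np.zeros(T)
    for t in range(T):
        for k in range(n - 1, 0, -1):             # deeper stages copy upstream
            flip = (J[k] != J[k - 1]) & (rng.random(per_stage) < rates[k])
            J[k, flip] = J[k - 1, flip]
        m = rng.choice([-1, 1], size=per_stage)   # a new random memory
        flip = (J[0] != m) & (rng.random(per_stage) < rates[0])
        J[0, flip] = m[flip]
        totals[t] = (J @ probe).sum()
    return float(totals.std())

noise = measure_noise()   # compare with sqrt(N) = sqrt(10 * 1000) = 100
```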
Given this good agreement between Eq.S.38 and numerical simulation of the stochastic model, we now study Eq.S.38 further. Specifically, we will consider two cases which correspond to the two types of readout we will consider in subsequent sections: 1 -a naive readout of all stages and 2 -a readout of only a subset of stages.

Correlations in a naive readout
In this case, doing the sum in Eq. S.38 explicitly leads to a closed-form expression for the total variance.
If we assume that $\ln(q^{-1})/n \ll 1$, i.e. the scaling used in the continuous-space approximation, then this expression simplifies, to leading order, to Eq. S.40, which is used to make the dashed curve in Fig. S.6. To put some numbers to this expression, taking n = 10 and $q = 10^{-2}$ gives $\sigma = 1.3\sqrt N$, while n = 100 and $q = 10^{-4}$ gives $\sigma = 1.4\sqrt N$. In any case, the noise clearly increases as a function of the number of stages.

Correlations in a readout of only a subset of stages
If we only read out from stage k to stage k + w, then Eq. S.38 gives, for $\ln(q^{-1})/n \ll 1$, the expression in Eq. S.53, which tells us that the size of the correlation depends on the width of the readout. Furthermore, when k is close to 1 or to n, the correlations vanish. We will make use of this formula in Section 4.4, Optimal readout of the memory signal.

Naive readout of the memory signal
Naively, one can simply read out the signal from all of the synapses. This is equivalent to integrating SNR(x, t) over x. In this case one can perform the integral analytically in the limit $\epsilon = \ln q^{-1}/n \ll 1$, obtaining Eq. S.42, which also makes use of Eq. S.40.

Optimal readout of the memory signal
As for the heterogeneous model, we can maximize the SNR by only reading out a fraction of the stages. In the case of the consolidation model, it turns out that the optimal readout is to follow the memory trace as it propagates and read out the stages near the maximum of the pulse. Note that for fixed n the readout width reaches a constant value at early times, i.e. it is independent of time at long times (until the pulse exits the system). The fraction of stages to be read out as a function of the total system size decreases for increasing n, as shown in Fig. S.7. We can therefore calculate the SNR of the optimal readout by assuming that the bounds of the integral move along with the pulse. This also allows one to calculate the optimal width. For any bounds a(t) and b(t) which vary in time, the SNR is just

$SNR(t) = \frac{1}{\sqrt N}\int_{a(t)}^{b(t)} S(x, t)\, dx.$ (S.43)

We then choose the bounds in order to track the pulse. We take $a(t) = \frac{\ln(1 + \bar q\epsilon t)}{\ln q^{-1}} - \mu$ and $b(t) = \frac{\ln(1 + \bar q\epsilon t)}{\ln q^{-1}} + \mu$, and then perform a change of variables by defining $z = x - \frac{\ln(1 + \bar q\epsilon t)}{\ln q^{-1}}$, where $\epsilon = \ln q^{-1}/n$. This converts the integral into one over z alone, with fixed limits $\pm\mu$ (Eq. S.44).
Now one takes a large time limit, which simplifies things considerably and shows that the optimal width is independent of time at long times, as can be observed numerically in Fig. S.7B. Doing this yields a formula for the SNR as a function of µ (Eq. S.45). The integrand is not a symmetric function, so one would not in general be justified in integrating from −µ to µ. However, for small ϵ it is very close to symmetric. Fig. S.8 shows what this function looks like for several values of n. The position of the maximum is to the left of zero, but it turns out that this position scales like ϵ while the limits of integration µ will end up scaling like $\sqrt\epsilon$, so it is a higher-order effect. One can find the optimal µ by taking the derivative of the SNR with respect to µ and setting it to zero. Taking the integrand to be f(z), this gives a condition relating $\int_{-\mu}^{\mu}f(z)\,dz$ to $f(\mu) + f(-\mu)$ (Eqs. S.46-S.47). Now we assume that µ can be expanded in a series in ϵ. The idea is that as the system size grows, the optimal width will decrease as a fraction of the total system size, which is what is observed numerically, see Fig. S.7B. We have already normalized x by dividing by n, so to compare with the numerically determined optimal width we will have to multiply by n. We are only interested in the first term of the series, so we take

$\mu = \mu_0\,\epsilon^\alpha,$

and we will need to solve for the scaling exponent α and the value $\mu_0$. The right-hand side is easier: we expand f(µ) in powers of µ, keeping the leading term plus higher-order terms (h.o.t.). The left-hand side is trickier only in the sense that we should do a change of variables y = z/µ to put the small parameter in the integrand; then we proceed much as above. Finally, we end up with a balance condition, Eq. S.50. We still do not know the proper scaling, since α is unknown. However, we can make a guess and see what happens. If we choose α < 1/2, then we get something on the left which scales like $\epsilon^{1/2-\alpha}$, but on the right we get something exponentially small, so they cannot possibly balance.
If we take α > 1/2 we get something on the left which is large and goes like ϵ 1/2−α again, but on the right we have something which to leading order is just 4, so that also cannot be balanced. Therefore it must be that α = 1/2. With that choice the equation simplifies to where µ 0 = x/ ln q −1 . Amazingly, the solution x is nearly equal to 1 (it differs from 1 by about 1% ). Therefore, taking x = 1, the optimal width is as a fraction of the system size. In terms of the number of stages it is 2 √ n/ ln q −1 . Now, in order to evaluate the SNR we must determine the size of the correlations in the noise, given by Eq.S.53. Using the equation for the width of the optimal readout Eq.S.52, we find that the total variance is which is independent of the number of stages n. Furthermore, Eq.S.53 reaches a maximum of ) at stage k = n ln 2 ln (q −1 ) . This means, for example, that given the parameters in Fig.3 of the main text (q = 0.0001), the noise level reaches a maximum of 1.15 √ N when the signal is near stage 7, i.e. at very early times. Fig.S.10 shows numerical confirmation of this. It shows the SNR using the optimal readout for the same parameters as in Fig.3 of the main text (number of stages n=100) both assuming there are no correlations (solid line) and including the effect of correlations (dashed line). The correlations are clearly very small for the optimal readout. Specifically, they do not affect the scaling of the intitial SNR or the memory lifetime, and only very weakly the point at which the SNR crosses that of the heterogeneous model. Therefore, for simplicity, we will approximate the noise level simply as √ N . Now we can evaluate the SNR by plugging the formula for µ into Eq.S.45. Keeping only the first order term in ϵ gives This means that the decay of the SNR given the optimal readout is also a power law with power equal to one. 
However, unlike the naive readout, where at long times the SNR was independent of n, here the SNR actually increases as $n^{1/4}$. Eq.S.54 fits the numerical results very well.

A plausible read-out which is nearly optimal
The optimal readout we have implemented is computationally trivial to calculate, but it is unclear how it might be carried out in the brain. Here we show that it can, in fact, be implemented approximately through a simple rule. Specifically, for each stage there is a threshold value of the SNR below which the signal from that stage is no longer read out. We assume that this threshold can be learnt and is then held fixed for all memories.
In order to make the readout nearly optimal, we choose the threshold for stage k, located at position $x = x_k$, to be equal to the value of $SNR(x_k, T_k)$, where $T_k$ is the time at which, using the optimal readout, we just cease to read out the signal in stage k. This time is given by the lower bound a(t) in the integral Eq.S.43; that is, $T_k$ is defined implicitly through the relation $a(T_k) = x_k$. This can be solved for $T_k$, where $\mu$ is the optimal value we already calculated. Plugging this into the formula for $SNR(x_k, t)$ gives the threshold as a function of $x_k$ alone (S.56). This formula can be simplified for $x_k$ not too close to 0, which is equivalent to assuming long times; then $q^{-x_k} - 1 \sim q^{-x_k}$. We also plug in the optimal $\mu = \sqrt{\epsilon}/\ln q^{-1}$ and then expand in the small parameter $\epsilon$. The leading-order term is exponentially decreasing in $x_k$. Using this readout with optimally adjusted fixed thresholds would be exactly equivalent to the optimal readout if the integrand f(z) were symmetric. Since it is very nearly symmetric for small $\epsilon$ at long times, the readout is essentially optimal. We will therefore use the optimal readout in what follows.

The performance of the memory transfer model:
Crossing times, SNR and memory lifetimes (Fig.3) The curves of SNR versus time in the main panel of Fig.3 in the main text are generated via the meanfield model (the ODEs in continuous time, discrete stages), for both the heterogeneous system (dashed lines) and the consolidation model (solid lines). They represent the curves obtained using the optimal readout described in detail in the previous sections. Three measures of interest are indicated by colored circles in the main panel of Fig.3: 1 - the time at which the SNR of the consolidation model crosses (and thereafter exceeds) the SNR of an equivalent heterogeneous model, denoted $T_c$ for crossing time; 2 - the SNR of the consolidation model at long times in the powerlaw regime; and 3 - the lifetime of a memory, i.e. the time at which the SNR of a memory dips below 1, denoted $T_{LT}$. Here we describe how these measures are calculated.

The crossing time T c
The SNR for the heterogeneous model and the consolidation model with identical $\tilde{q}$, q, N, and n are calculated numerically from the meanfield model, using the optimal readout. The time at which the SNR of the consolidation model crosses and thereafter exceeds that of the heterogeneous model is called $T_c$. This is plotted versus n in the inset of Fig.3 of the main text. As shown in the section The continuous-time approximation, for n = 2 one can derive the crossing time analytically and find $T_c \sim 1/q$ as long as the learning rates are similar. For n > 2 it is no longer possible to calculate $T_c$ analytically, as the SNRs of many stages contribute to the total SNR in a nontrivial fashion. Nonetheless, numerically, $T_c$ appears to occur at very early times compared to the memory lifetimes for all parameter values we have explored.

The SNR in the powerlaw regime
The SNR shown in the lower lefthand panel of Fig.3 in the main text as a function of n for different slowest learning rates q, is calculated numerically from the meanfield equations (symbols) and given by the analytical formula Eq.S.54 (lines). The SNR is evaluated at different times for the three different learning rates. The upshot is that the SNR in the powerlaw regime is proportional to N 1/2 and n 1/4 .

Memory lifetime
The memory lifetime can be calculated analytically from Eqs.S.29 and S.54. Specifically, the powerlaw regime ends once the pulse reaches the last stage. This occurs at a time T given by Eq.S.29. The SNR at this point is then given by Eq.S.54 evaluated at time T, as long as the SNR is greater than one. The memory trace then decays exponentially with a learning rate equal to that of the last stage. Finally, the SNR reaches a value of one at a time $T_{LT} = T + T_{exp}$, which can be written as Eq.S.58. If the SNR drops below one already in the powerlaw regime, then the lifetime can be calculated from Eq.S.54 alone and is given by Eq.S.59. If the SNR drops below one even before the powerlaw regime is reached, then Eq.S.59 is no longer valid. Eqs.S.58 and S.59 are used to generate the curves in the lower right-hand panel of Fig.3. To reiterate: if the initial number of synapses is large, then for small n we expect the final decay of the last stage to occur before the SNR reaches 1, so Eq.S.58 is valid. As n increases, the curve moves downwards such that the SNR reaches 1 already in the powerlaw regime and Eq.S.59 is valid. Finally, as n increases further, at some point the SNR drops below 1 even before the powerlaw regime. This means that in this limit fewer and fewer stages have an SNR greater than 1. In fact, in the limit $n \to \infty$, the SNR of the first stage will already be less than 1 at time t = 0, at which point the lifetime will also be zero. Therefore, the lifetime must decrease with increasing n outside of the powerlaw regime. These trends can clearly be seen in Figs.S.12 and S.13, which show the lifetime as a function of the number of stages n for different learning rates q and different numbers of synapses N. In particular, note how the green curve in Fig.S.13 reaches a maximum and then decreases for large n.
Note that a scaling of the memory lifetime as $T_{LT} \sim nN^{1/2}$ can be achieved by taking $q \propto 1/N^{1/2}$ and ensuring that the SNR does not drop below one in the powerlaw regime, i.e. that Eq.S.58 is valid. This is illustrated in Figs.S.12 and S.13 for three different values of q which scale as $q \propto 1/N^{1/2}$. For each value of q three different numbers of synapses are considered: $N = 10^5$, $10^7$ and $10^9$ (green, red and black respectively). For each case the memory lifetime was determined numerically for the consolidation model (solid circles) and the heterogeneous model (open squares), both using the optimal readout. The lines are the predictions for the consolidation model, Eqs.S.58-S.59. As described above, there are several qualitatively distinct scaling regimes of the lifetime as a function of the number of stages. For small n the decay time of the slowest stage is of the same order as, or larger than, the time it takes for the pulse to reach the last stage, i.e. both terms in Eq.S.58 are of the same order. This can even lead to a minimum in the lifetime as a function of n, which would indicate a non-optimal number of stages. For larger n the lifetime is dominated by the propagation time of the pulse and ends, in fact, only when the pulse leaves the system through the slowest stage. In this regime the scaling of the memory lifetime is approximately linear in n. For $n > n_{max}$, the SNR drops below one even before the pulse leaves the system and lifetimes are given by Eq.S.59, i.e. the lifetime scales as $n^{1/4}$. Finally, as n increases further, the SNR drops below one already before the powerlaw regime and the lifetimes decrease again.
While it is clear from Fig.S.12 that the consolidation model outperforms the heterogeneous model without interactions for all n, the greatest improvement is achieved in the intermediate regime in which lifetimes scale as n.
As described in the previous paragraph, this regime is bounded below by $n_{min}$. For $n < n_{min}$ the propagation of the pulse to the last stage occurs too quickly compared to the slowest time scale; this only occurs for relatively small numbers of stages. The upper bound on the linear regime, $n_{max}$, is obtained by setting the lifetimes given by Eq.S.58 and Eq.S.59 equal. Doing this and taking the limit $N \to \infty$ yields the scaling of the maximum number of stages with the number of synapses, $n_{max} \propto \sqrt{\ln N}$. Fig.S.14 shows $n_{max}$ as a function of the square root of the log of the number of synapses N, from Eqs.S.58-S.59, confirming the scaling for large enough N. Shown is the case $q = 30/N^{1/2}$, for which $n_{min} \sim 5$. Therefore, there is a large range in n for which lifetimes scale as $nN^{1/2}$. Note that $n_{max}$ scales only weakly with N: if the number of synapses is increased from $10^4$, which would be the case for $\sim$300 neurons with ten percent sparseness, to $10^{14}$, which is approximately the total number of synapses in an adult human brain, the number of stages is only doubled.

Comparison between the multi-stage memory transfer model and homogenous (single stage) models
For a fair comparison of the memory capacity of multi-stage and single-stage homogeneous models it is important to take into account that the amount of information stored per memory differs between the two models. Indeed, in the multi-stage model the information is stored only in the first stage, which contains N/n synapses (n is the number of stages), whereas in the single-stage model all N synapses are available to store new memories. As we assumed that the memories are random and uncorrelated and that potentiating events are as probable as depressing events, each synapse can store up to one bit of information per memory. Hence, the first stage of the multi-stage model can store up to N/n bits of information, which can be significantly less than what is potentially storable in a single-stage model. In order to compare the single-stage and the multi-stage model we need to consider a situation in which the amount of information stored per memory is the same. To do so, consider a single-stage model in which the patterns of synaptic modifications are sparse; specifically, each synapse is modified with a probability $q_f$. In this case the amount of information contained in the pattern of synaptic modifications is $I_{sg} = N\left[-q_f \log_2 q_f - (1-q_f)\log_2(1-q_f)\right]$, which, in the limit $q_f \to 0$, can be approximated by $-N q_f \log_2 q_f$. The information stored in each pattern of synaptic modifications is $I_{mg} = N/n$ bits in the case of the multi-stage model. For a fair comparison, we should choose $q_f$ such that $I_{mg} = I_{sg}$. This means that $q_f \sim 1/n$.
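The information-matching condition can be checked numerically. Below is a minimal sketch (the function names are ours, not from the paper): the binary-entropy information per sparse pattern, and a bisection for the $q_f$ at which the single-stage information equals the N/n bits stored in the first stage of the multi-stage model.

```python
import math

def info_per_memory(N, q_f):
    """Shannon information (bits) in a sparse modification pattern:
    N synapses, each modified independently with probability q_f."""
    if q_f <= 0.0 or q_f >= 1.0:
        return 0.0
    h = -q_f * math.log2(q_f) - (1.0 - q_f) * math.log2(1.0 - q_f)
    return N * h

def matching_qf(N, n, tol=1e-12):
    """Bisect for the q_f at which the single-stage information equals
    the N/n bits stored in the first stage of the n-stage model."""
    target = N / n
    lo, hi = 1e-15, 0.5  # entropy is monotonically increasing on (0, 1/2]
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if info_per_memory(N, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

q_f = matching_qf(N=10**6, n=100)
# for small q_f, N*H(q_f) ~ -N*q_f*log2(q_f), so q_f ~ 1/n up to a
# logarithmic correction (here q_f is somewhat below 1/n = 0.01)
```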
As $q_f$ decreases, the memory lifetime extends at the expense of the initial signal-to-noise ratio. In particular, making the patterns of synaptic modifications sparser is equivalent to rescaling the learning rate. The longest memory lifetime can be achieved for $\tilde{q} = 1/\sqrt{N}$, which produces an initial SNR that is of order 1 (i.e. it does not scale with the number of synapses N).
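The strength-lifetime trade-off in a single-stage model can be sketched numerically. This is a hedged illustration assuming the standard single-stage form $SNR(t) = \tilde{q}\sqrt{N}e^{-\tilde{q}t}$ for binary synapses; this expression is an assumption for illustration, not a formula quoted from this supplement.

```python
import math

# Assumed single-stage form: SNR(t) = q*sqrt(N)*exp(-q*t) for binary
# synapses with learning rate q (illustrative, not from the supplement).
def snr(q, N, t):
    return q * math.sqrt(N) * math.exp(-q * t)

def lifetime(q, N):
    """Time at which the SNR drops to 1; zero if it starts below 1."""
    s0 = q * math.sqrt(N)
    return math.log(s0) / q if s0 > 1.0 else 0.0

N = 10**8
fast = 0.1                        # strong initial trace, short lifetime
slow = math.e / math.sqrt(N)      # of order 1/sqrt(N): initial SNR of
                                  # order 1, lifetime of order sqrt(N)
```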
The only difference from the single-stage model in which we do not introduce any correction for equalizing the amount of information per memory is that now $\tilde{q}$ can vary over a wider range: it can go from 1 (fastest) to $q_f q_n$, which is approximately $q_n/n$. This means that the memory lifetime can in principle be as large as in the multi-stage model. However, it should be noted that any reduction of the learning rate leads to a decrease of an already small initial signal-to-noise ratio. As shown in Fig.S.15, even if one considers single-stage models with timescales that are longer than $1/q_n$, all the SNR curves for single-stage models lie below the SNR of the multi-stage model. This is true for all times, which clearly indicates that there is a significant memory capacity advantage in using the more complex multi-stage model. In the neuronal model described next, an appropriate learning rule updates the recurrent synapses in the downstream stage such that they become more correlated with those in the upstream stage; that is, the upstream stage 'teaches' the downstream stage the correct synaptic weights.

The Markov model
There are n stages. Each stage is made up of $N_{neuron}$ all-to-all coupled neurons. Each of the $N = N_{neuron}^2 - N_{neuron}$ synapses (no self-coupling) can take on one of two non-zero values: the synapse from neuron j to neuron i satisfies $J_{ij} \in \{J_+, J_-\}$, where $J_+ > J_-$. Furthermore, there are one-to-one connections from neuron i in stage k to neuron i in stage k+1. These connections are so strong that any presynaptic activity elicits postsynaptic activity without fail. In the initial condition, the synaptic matrices of all n stages are in the equilibrium state in which each synapse is in the potentiated state $J_+$ or the depressed state $J_-$ with probability 1/2.

Encoding
A memory is encoded in stage 1. Specifically, one half of the neurons are randomly chosen to be activated ($s_i = 1$ if $i \in \{active\}$), while the remaining neurons are inactive ($s_i = 0$ if $i \in \{inactive\}$). A synapse $J_{ij}$ is then potentiated to $J_+$ with probability $q_1$ if $s_i = s_j$ and depressed with probability $q_1$ if $s_i \neq s_j$. This encoding scheme is illustrated in Fig.S.16 for a simple model of two stages of four neurons each. After encoding a single memory, transfer activity is simulated (see below), after which the next memory is encoded, and so on.
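The initialization and encoding steps can be sketched as follows. This is a simplified sketch: the helper names are ours, and the values $J_+ = 5$ and $J_- = 1$ are taken from the parameters quoted for Fig.S.19.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_stage(n_neuron, j_plus=5.0, j_minus=1.0):
    """Equilibrium synaptic matrix: each synapse is J+ or J- with
    probability 1/2; no self-coupling (diagonal held at 0)."""
    J = rng.choice([j_plus, j_minus], size=(n_neuron, n_neuron))
    np.fill_diagonal(J, 0.0)
    return J

def encode(J, q1, j_plus=5.0, j_minus=1.0):
    """Encode one memory: a random half of the neurons is active; with
    probability q1 a synapse is potentiated if pre and post agree
    (s_i == s_j) and depressed otherwise."""
    n = J.shape[0]
    s = np.zeros(n, dtype=int)
    s[rng.choice(n, size=n // 2, replace=False)] = 1
    agree = (s[:, None] == s[None, :])
    update = rng.random(J.shape) < q1       # which synapses are touched
    J[update & agree] = j_plus
    J[update & ~agree] = j_minus
    np.fill_diagonal(J, 0.0)
    return s

J1 = init_stage(1000)
s = encode(J1, q1=0.5)
```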

Transfer
During a replay, a fraction f of the neurons in stage 1 is activated at time t, so that each neuron receives the recurrent input $h^1_i(t) = \sum_j J^1_{ij}(t) s^1_j(t)$. If $h^1_i(t) > \theta_i$, where $\theta_i$ is a threshold, then neuron i is activated at time t+1, i.e. $s^1_i(t+1) = 1$. Again, because of the powerful feedforward connections, the same subset of neurons in stage 2 is activated, i.e. $s^2_i(t+2) = 1$. We take $\theta_i = \theta$ to be the same for all neurons and assume that it takes one of two values, $\theta \in \{\theta_l, \theta_h\}$, with equal likelihood during each replay; these values correspond to a low and a high threshold respectively. We assume that the recurrent connections in stage 2 have been 'dialed down' to the point where they do not influence postsynaptic activation, i.e. the recurrent connections in stage 2 are unimportant for this mode of dynamics. This is the case if all of the recurrent synapses in stage 2 are multiplied by a modulatory factor $\alpha$ which is very small. We now need a plasticity rule for the synapses in stage 2. If $\theta = \theta_l$, then $J^2_{ij} = J_-$ at time t+2 with probability $q_2$ if and only if $s^2_j(t+1) = 1$ and $s^2_i(t+2) = 0$; otherwise the synapse is unchanged. That is, if, despite the low threshold, the presynaptic activity did not elicit postsynaptic activity, then the synapses in stage 1 must have been weak, and so the corresponding synapses in stage 2 are depressed. If $\theta = \theta_h$, then $J^2_{ij} = J_+$ with probability $q_2$ if and only if $s^2_j(t+1) = 1$ and $s^2_i(t+2) = 1$; otherwise the synapse is unchanged. That is, if, despite the high threshold, the presynaptic activity did elicit postsynaptic activity, then the synapses in stage 1 must have been strong, and so the corresponding synapses in stage 2 are potentiated. At time t+2 all of the neurons in stage 1 are silenced, i.e. $s^1_i(t+2) = 0$, signalling the end of the current 'replay'. The interaction between stage 2 and stage 3 is analogous: the factor $\alpha$ is set to 1 for stage 2 and dialed down for stage 3, and the synaptic weights are transferred according to the process described above.
This is repeated until the synapses in the final stage are changed.
Then the entire replay process is repeated T times before a new memory is imprinted on stage 1. This transfer process is illustrated in Fig.S.16 for the simple case of two stages of four neurons each.
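A single replay of the transfer rule can be sketched as follows. This is a simplified sketch: the helper names are ours, the choice of replayed pattern (a random fraction f of upstream neurons) is an assumption, and the strong one-to-one feedforward copy is implicit in reusing the thresholded upstream activity as the downstream postsynaptic activity.

```python
import numpy as np

rng = np.random.default_rng(1)

def replay_step(J_up, J_down, q_down, theta_l, theta_h,
                j_plus=5.0, j_minus=1.0, f=0.5):
    """One replay: a random fraction f of upstream neurons fires; the
    upstream recurrent input, thresholded at theta (low or high with
    equal probability), sets the downstream postsynaptic activity."""
    n = J_up.shape[0]
    s_pre = np.zeros(n, dtype=bool)
    s_pre[rng.choice(n, size=int(f * n), replace=False)] = True
    h = J_up @ s_pre                        # upstream recurrent input
    theta = theta_h if rng.random() < 0.5 else theta_l
    s_post = h > theta                      # postsynaptic activity
    hit = rng.random(J_down.shape) < q_down
    pre = s_pre[None, :]                    # presynaptic neuron j active
    if theta == theta_h:
        # firing despite the high threshold: the upstream synapses were
        # strong, so potentiate the corresponding downstream synapses
        J_down[hit & pre & s_post[:, None]] = j_plus
    else:
        # silence despite the low threshold: the upstream synapses were
        # weak, so depress the corresponding downstream synapses
        J_down[hit & pre & ~s_post[:, None]] = j_minus
    np.fill_diagonal(J_down, 0.0)
    return s_post
```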

The fraction of synapses transferred during replay
The postsynaptic activation due to recurrent connections depends on the presynaptic input. This input depends on the number of neurons activated, which in this example will be exactly $f N_{neuron}$, and on the state of the synapses, which changes over time. The distribution of recurrent inputs is binomial but, for $f N_{neuron}$ large enough, is nearly Gaussian. The input due to potentiated synapses is $I_+ = \mu_+ + \sigma_+ c$, where c is a random variable with zero mean and unit variance, $\mu_+ = f N_{neuron} J_+/2$ and $\sigma_+ = J_+\sqrt{f N_{neuron}/4}$. The input due to depressed synapses has the analogous form, with $\mu_- = f N_{neuron} J_-/2$ and $\sigma_- = J_-\sqrt{f N_{neuron}/4}$. The total input is therefore approximately Gaussian distributed with mean $\mu = f N_{neuron}(J_+ + J_-)/2$ and standard deviation $\sigma = (J_+ - J_-)\sqrt{f N_{neuron}/4}$. A potentiating event will only occur in the downstream stage if $\theta = \theta_h$ and $h_i > \theta_h$, and a depressing event will only occur if $\theta = \theta_l$ and $h_i < \theta_l$. The probability that the fN synapses associated with the postsynaptic activity of a single neuron are changed is therefore $\phi = \frac{1}{2}P(h_i > \theta_h) + \frac{1}{2}P(h_i < \theta_l)$, which, if the low and high thresholds are equally spaced from the mean, can be written

$\phi = \frac{1}{2}\,\mathrm{erfc}\!\left(\frac{\theta_h - \mu}{\sqrt{2}\,\sigma}\right)$. (S.63)
From the analysis of the previous section it is clear that the probability of a synapse in stage k being updated during the transfer process is not simply equal to the intrinsic learning rate $q_k$. Rather, for a transfer process of T replays, it is equal to

$\tilde{q}_k = 1 - (1 - q_k \phi f)^T$, (S.64)

which, for $q_k \phi f T \ll 1$, can be written $\tilde{q}_k \simeq q_k \phi f T$. Eq.S.64 is compared to numerical simulation of the full neuronal model for various values of f in Fig.S.17.

Caption of Fig.S.16 -- Top: Plasticity events, in this case patterns of neuronal activation, cause memories to be encoded through a Hebbian plasticity rule. Before the next memory is encoded, a transfer process copies patterns of synaptic weights from one stage of the neuronal memory system to the next, downstream stage. Middle: A pattern of neuronal activation is imposed during encoding. A Hebbian learning rule leads to potentiation (depression) in the synapses connecting neurons with the same (different) activity. Bottom: An illustration of the transfer process for $f = 1/N_{neuron}$. LTP occurs whenever a presynaptic activation leads to postsynaptic firing; otherwise LTD occurs.
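The saturation of the effective learning rate with the number of replays can be sketched as follows, assuming the per-replay update probability $q_k \phi f$ (our reading of Eq.S.64; the function names are ours):

```python
def q_eff(q, phi, f, T):
    """Probability that a synapse is updated at least once over T
    replays, each touching it with probability q*phi*f."""
    return 1.0 - (1.0 - q * phi * f) ** T

def q_eff_linear(q, phi, f, T):
    """Leading-order approximation, valid when q*phi*f*T << 1."""
    return q * phi * f * T

q, phi, f, T = 0.01, 0.02, 0.5, 100
# here q*phi*f*T = 0.01, so the linear form is accurate to ~0.5%,
# and the exact form always lies below it (saturation)
```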

The fraction of synapses correctly transferred during replay
Of the total number of synapses transferred during replay, some fraction will be errors. The fraction of correctly potentiated synapses is just the fraction of the area under the curve above $\theta_h$ which is due to synapses in the potentiated state. It is also the ratio of the expected number of potentiated synapses to the total number of activated synapses per neuron, i.e. $\psi = E(k_+)/(fN)$. The expected number of potentiated synapses is the expectation of $k_+$ restricted to inputs exceeding the threshold, where $k_{min} = (\theta_h - fN J_-)/(J_+ - J_-)$ is the minimal number of potentiated synapses needed to exceed the high threshold $\theta_h$. Finally, we find Eq.S.67.
This formula also holds for depressed synapses if the thresholds are equally spaced from the mean input. Eq.S.67 reaches a value of one when $\theta = fN J_+$, which is the maximum possible input. Eq.S.67 is shown in Fig.S.18 for various values of f.
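The fraction of correctly transferred synapses can be estimated by conditioning on threshold crossings. A Monte Carlo sketch (our construction, consistent with the definition $\psi = E(k_+)/(fN)$; names and parameter values are ours):

```python
import numpy as np

rng = np.random.default_rng(3)

def psi_monte_carlo(fN, j_plus, j_minus, theta_h, trials=200000):
    """Among replays whose input exceeds the high threshold, the mean
    fraction of the fN activated synapses that are in the potentiated
    state, i.e. the fraction of transfer events copying the correct
    (potentiated) weight downstream."""
    k = rng.binomial(fN, 0.5, size=trials)   # potentiated inputs
    h = k * j_plus + (fN - k) * j_minus      # total input
    sel = k[h > theta_h]                     # condition on crossing
    return sel.mean() / fN

fN, jp, jm = 100, 5.0, 1.0
mu = fN * (jp + jm) / 2.0
psi = psi_monte_carlo(fN, jp, jm, theta_h=mu + 40.0)
# raising theta_h toward the maximal input fN*J+ drives psi toward 1
```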

The effective learning rate during transfer
From the previous sections it is clear that the effective learning rate is the intrinsic learning rate times the number of replays, times a constant which depends on the details of the transfer process. The probability of correctly updating a synapse is then $\psi q_k$, and an incorrect update occurs with rate $(1-\psi)q_k$.

A separate issue is the build-up of correlations during transfer. Consider the excess fraction of potentiated pre-synapses impinging on one neuron, $\epsilon = \delta/N_{neuron}$, where $\delta$ is the excess number of potentiated pre-synapses beyond $N_{neuron}/2$, so that $\epsilon \in [-1/2, 1/2]$. Of course, we expect $\epsilon$ to be small. In fact, since the synapses are updated essentially as a Poisson process with the probability of potentiation and depression both being 1/2, the expected number of potentiated synapses is $N_{neuron}/2$ and the standard deviation is $\sqrt{N_{neuron}}/2$, which means $\epsilon \sim \frac{1}{2\sqrt{N_{neuron}}}$. However, no matter how small $\epsilon$ is, it will always lead to large fluctuations in downstream synaptic states if: 1 - for a fixed number of stages there are sufficiently many replays, or 2 - for a fixed number of replays there are sufficiently many stages. The reason is that even a small excess of potentiated synapses leads to an imbalance of potentiation over depression in the next stage. This effect is accentuated by the fact that plasticity is driven entirely by the tails of the input distribution. Therefore sufficiently many replays will always make the $\epsilon$ in the next stage bigger than in the current one, making the system unstable. This can be seen in Fig.S.19, where the input to each neuron is plotted, averaged over 5000 replays. The inputs to neurons in stages 2, 4 and 6 are shown. The inputs to stage 2 are tightly peaked around 30, since here f = 0.01, $N_{neuron} = 1000$, $J_+ = 5$ and $J_- = 1$. The inputs to stages 4 and 6 show much greater variability, with those for stage 6 clearly grouping around 10 and 50, indicating that for each neuron all inputs are either depressed or potentiated respectively. The mean averaged over the network is still 30.
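The scale of these fluctuations is easy to verify numerically; a small sketch (the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(4)

def epsilon_std(n_neuron, samples=100000):
    """Standard deviation of the excess fraction of potentiated
    pre-synapses per neuron when each synapse is independently
    potentiated with probability 1/2; predicted: 1/(2*sqrt(n_neuron))."""
    k = rng.binomial(n_neuron, 0.5, size=samples)
    eps = k / n_neuron - 0.5
    return eps.std()

n = 1000
measured = epsilon_std(n)
predicted = 0.5 / np.sqrt(n)   # ~ 0.0158 for n = 1000
```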
A simple solution to avoid such a build-up of correlations is to allow for a variable threshold in the plasticity rule. Specifically, the threshold should compensate for changes in excitability, as in the BCM rule. The most trivial implementation of this is to subtract the value $\epsilon \frac{f N_{neuron}}{2}(J_+ - J_-)$ from the input to any cell, which is equivalent to shifting both thresholds by the same amount. This does not take into account the effect of correlations on the width of the input distribution, but that effect is of order $\epsilon^2$ and is negligible.

Recall that a memory is stored by imposing a particular pattern of neuronal activity in the first stage during the encoding phase. Therefore we can associate a pattern of neuronal activity with every pattern of synaptic weights (memory). This allows us to consider a simple means of 'reading out' memories, by which we mean detecting a pattern of neuronal activation which has been previously used during the encoding phase. For this particular type of readout we drive one half of the neurons in stage k. The pattern of activation is random and may be a novel (not learned) pattern or may coincide with a previously learned pattern. Our task is to design a read-out circuit which can distinguish between these two types of patterns, novel or learned; it is therefore a recognition task. Let us designate by the vector $s^k(t)$ the state of the neurons in stage k at time t, where for neuron i, $s^k_i(t) \in \{0, 1\}$ and 1 means active. The activity of neuron i at time t+1 depends on the recurrent synaptic connectivity. Specifically, the input current to neuron i is $h^k_i(t) = \sum_j J^k_{ij}(t) s^k_j(t)$, and if $h^k_i(t) > \theta$ then $s^k_i(t+1) = 1$. The question is: how does $h^k_i(t)$ depend on whether $s^k(t)$ is a novel or a learned pattern? If there is a systematic difference in this input current as a function of the novelty of the pattern of activation, then it is possible to recognize previously learned patterns.

Neuronal readout of memories
As an illustrative case, consider a stage with learning rate q = 1, i.e. every synapse is overwritten at every time step. Then $J_{ij}(t) = J_+$ if $s_i(t-1) = s_j(t-1)$ and $J_{ij}(t) = J_-$ otherwise, where we have dropped the superscript k. The input is then $h_i(t) = \sum_j J_{ij}(t) s_j(t)$ (S.79). The expected value of the input (S.80) clearly depends on the correlation between the patterns of activation at times t-1 and t. If they are perfectly correlated (anti-correlated) patterns, then $E_C(h) = \frac{N_{neuron}}{2} J_+$ if $s_i(t-1) = 1$ ($s_i(t-1) = 0$) and $E_C = \frac{N_{neuron}}{2} J_-$ if $s_i(t-1) = 0$ ($s_i(t-1) = 1$). If they are uncorrelated, then $E_U = \frac{N_{neuron}}{4}(J_+ + J_-)$. In this simple example we also have $Var_C(h_i) = 0$ and $Var_U(h_i) = \frac{3}{8}(J_+ - J_-)^2 N_{neuron}$. Therefore, in this simple case, the distribution of inputs given novel patterns of activation is approximately Gaussian with the mean and variance given above, while for a learned memory it is two delta functions centered to the left and right of the mean of the Gaussian. The threshold for the McCulloch-Pitts neurons can therefore be placed between the mean of the Gaussian and the rightmost delta function, $E_U < \theta < E_C$. This will lead to many more neurons being active at time t+1 whenever the pattern of activation is learned rather than novel. A simple readout would then be to recognize a learned pattern if the sum $\sigma = \sum_i s_i(t+1)$ is greater than a certain threshold. In the general case ($q \neq 1$) the separation between the input distributions given a learned or a novel pattern of activation is not as large. In fact, after a pattern is learned (memory encoded), the difference $E_C - E_U$ decreases exponentially in time with a time constant proportional to $1/q_k$, where $q_k$ is the learning rate for stage k. Furthermore, $Var_C > 0$. The task of recognition is therefore a signal-detection problem given Gaussian distributions. In any case, it is clear that given any $\theta$ and any threshold for recognition of a learned pattern, some errors will be made.
The probability of correct recognition will clearly decrease as a function of the age, or SNR, of the memory being tracked. This is illustrated in Fig.S.21 for the neuronal memory transfer model given the same parameter values as in Fig.6 of the main text. In Fig.S.21, the upper panel shows the SNR of each of the first five stages as well as the total SNR (black). The lower panel shows a recognition index $x_{recog}$ using each of the individual stages as well as all of them combined (black). To make these curves, after each encoding/transfer process (one time step), each stage is probed with both a novel pattern of activation and the pattern which was learned at t = 0. Both of these patterns produce a response (here $\theta = E_U + \mathrm{StDev}_U$). In this case we simply compare the responses and say that the larger response wins, i.e. that pattern is recognized. We then repeat the entire simulation 1000 times. The recognition index is the number of times the learned pattern led to a larger response, minus the number of times the novel pattern led to a larger response, divided by 1000. Therefore $x_{recog} \in [-1, 1]$, and $x_{recog} = 0$ is chance level. No effort was made to optimize the readout in any way. Clearly, no errors occur when the SNR is large, and the rate of errors increases with decreasing SNR.
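The q = 1 illustration can be reproduced directly. A minimal sketch (the helper names are ours; $J_+ = 5$, $J_- = 1$ as in Fig.S.19): probing with the stored pattern drives active neurons near $\frac{N_{neuron}}{2}J_+$, well above the novel-probe mean of $\frac{N_{neuron}}{4}(J_+ + J_-)$.

```python
import numpy as np

rng = np.random.default_rng(5)

def hebbian_weights(s_prev, j_plus=5.0, j_minus=1.0):
    """q = 1 limit: every synapse reflects the last encoded pattern,
    J+ where pre and post had the same activity and J- otherwise."""
    same = (s_prev[:, None] == s_prev[None, :])
    J = np.where(same, j_plus, j_minus)
    np.fill_diagonal(J, 0.0)
    return J

n = 1000
s_prev = rng.permutation(np.repeat([0, 1], n // 2))  # stored pattern
J = hebbian_weights(s_prev)

novel = rng.permutation(np.repeat([0, 1], n // 2))   # novel probe
h_learned = J @ s_prev        # probe with the stored pattern
h_novel = J @ novel           # probe with a novel pattern
# active neurons under the learned probe receive ~ (n/2)*J+ = 2500,
# while the novel probe gives inputs near (n/4)*(J+ + J-) = 1500
```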