A Three-Threshold Learning Rule Approaches the Maximal Capacity of Recurrent Neural Networks

Understanding the theoretical foundations of how memories are encoded and retrieved in neural populations is a central challenge in neuroscience. A popular theoretical scenario for modeling memory function is the attractor neural network scenario, whose prototype is the Hopfield model. The model simplicity and the locality of the synaptic update rules come at the cost of a poor storage capacity, compared with the capacity achieved with perceptron learning algorithms. Here, by transforming the perceptron learning rule, we present an online learning rule for a recurrent neural network that achieves near-maximal storage capacity without an explicit supervisory error signal, relying only upon locally accessible information. The fully-connected network consists of excitatory binary neurons with plastic recurrent connections and non-plastic inhibitory feedback stabilizing the network dynamics; the memory patterns to be memorized are presented online as strong afferent currents, producing a bimodal distribution for the neuron synaptic inputs. Synapses corresponding to active inputs are modified as a function of the value of the local fields with respect to three thresholds. Above the highest threshold, and below the lowest threshold, no plasticity occurs. In between these two thresholds, potentiation/depression occurs when the local field is above/below an intermediate threshold. We simulated and analyzed a network of binary neurons implementing this rule and measured its storage capacity for different sizes of the basins of attraction. The storage capacity obtained through numerical simulations is shown to be close to the value predicted by analytical calculations. We also measured the dependence of capacity on the strength of external inputs. Finally, we quantified the statistics of the resulting synaptic connectivity matrix, and found that both the fraction of zero weight synapses and the degree of symmetry of the weight matrix increase with the number of stored patterns.


Introduction
One of the fundamental challenges in neuroscience is to understand how we store and retrieve memories for a long period of time. Such long-term memory is fundamental for a variety of our cognitive functions. A popular theoretical framework for storing and retrieving memories in recurrent neural networks is the attractor network model framework [1][2][3]. Attractors, i.e. stable states of the dynamics of a recurrent network, are set by modification of synaptic efficacies in a recurrent network. Synaptic plasticity rules specify how the efficacy of a synapse is affected by pre-and post-synaptic neural activity. In particular, Hebbian synaptic plasticity rules lead to long-term potentiation (LTP) for correlated pre-and post-synaptic activities, and long-term depression (LTD) for anticorrelated activities. These learning rules build excitatory feedback loops in the synaptic connectivity, resulting in the emergence of attractors that are correlated with the patterns of activity that were imposed on the network through external inputs. Once a set of patterns become attractors of a network (in other words when the network "learns" the patterns), upon a brief initial activation of a subpopulation of neurons, the network state evolves towards the learned stable state (the network "retrieves" a past stored memory), and remains in that state after removal of the external inputs (and hence maintains the information in short-term memory). The set of initial network states leading to a memorized state is called the basin of attraction, whose size determines how robust a memory is. The attractor neural network scenario was originally explored in networks of binary neurons [1,2], and then extended from the 90s to networks of spiking neurons [4][5][6][7].
Experimental evidence in different areas of the brain, including inferotemporal cortex [8][9][10][11] and prefrontal cortex [12][13][14], has provided support for the attractor neural network framework, using electrophysiological recordings in awake monkeys performing delayed response tasks. In such experiments, the monkey has to maintain information in short-term (working) memory in a 'delay period' to be able to perform the task. Consistent with the attractor network scenario, some neurons exhibit selective persistent activity during the delay period. This persistent activity of ensembles of cortical neurons has thus been hypothesized to form the basis of the working memory of stimuli shown in these tasks.
One of the most studied properties of attractor neural network as a model of memory is its storage capacity, i.e. how many random patterns can be learned in a recurrent network of N neurons in the large N limit. Storage capacity depends both on the network architecture and on the synaptic learning rule. In many models, the storage capacity scales with N. In particular, the Hopfield network [1] that uses a Hebbian learning rule has a storage capacity of 0.138N in the limit of N ! 1 [15]. Later studies showed how the capacity depends on the connection probability in a randomly connected network [16,17] and on the coding level (fraction of active neurons in a pattern) [18,19]. A natural question is, what is the maximal capacity of a given network architecture, over all possible learning rules? This question was answered by Elizabeth Gardner, who showed that the capacity of fully connected networks of binary neurons with dense patterns scales as 2N [20], a storage capacity which is much larger than the one of the Hopfield model. The next question is what learning rules are able to saturate the Gardner bound? A simple learning rule that is guaranteed to achieve this bound is the perceptron learning rule (PLR) [21] applied to each neuron independently. However, unlike the rule used in the Hopfield model, the perceptron learning rule is a supervised rule that needs an explicit "error signal" in order to achieve the Gardner bound. While such an error signal might be available in the cerebellum [22][23][24], it is unclear how error signals targeting individual neurons might be implemented in cortical excitatory synapses. Therefore, it remains unclear whether and how networks with realistic learning rules might approach the Gardner bound.
The goal of the present paper is to propose a learning rule whose capacity approaches the maximal capacity of recurrent neural networks by transforming the original perceptron learning rule such that the new rule does not explicitly use an error signal. The perceptron learning rule modifies the synaptic weights by comparing the desired output with the actual output to obtain an error signal, subsequently changing the weights in the opposite direction of the error signal. We argue that the total synaptic inputs ('local fields') received by a neuron during the presentation of a stimulus contain some information about the current error (i.e. whether the neuron will end up in the right state after the stimulus is removed). We use this insight to build a field dependent learning rule that contains three thresholds separating no plasticity, LTP and LTD regions. This rule implements basic biological constraints: (a) it uses only information local to the synapse; (b) the new patterns can be learned incrementally, i.e. it is an online rule; (c) it does not need an explicit error signal; (d) synapses obey Dale's principle, i.e. excitatory synapses are not allowed to have negative weights. We studied the capacity and the size of the basins of attraction for a binary recurrent neural network in which excitatory synapses are endowed with this rule, while a global inhibition term controls the global activity level. We investigated how the strength of external fields and the presence of correlations in the inputs affect the memory capacity. Finally, we investigated the statistical properties of the connectivity matrix (distribution of synaptic weights, degree of symmetry).

The network
We simulated a network of N binary (McCulloch-Pitts) neurons, fully-connected with excitatory synapses (Fig 1A). All the neurons feed a population of inhibitory neurons which is modeled as a single aggregated inhibitory unit. This state-dependent global inhibition projects back onto all the neurons, stabilizing the network and controlling its activity level. At each time step, the activity (or the state) of neuron i (i = 1. . .N) is described by a binary variable s i 2 {0,1}. The state is a step function of the local field v i of the neuron: where Θ is the Heaviside function (Θ(x) = 1 if x > 0 and 0 otherwise) and θ is a neuronal threshold. The local field v i represents the overall input received by the neuron from its excitatory and inhibitory connections ( Fig 1B). The excitatory connections are of two kinds: recurrent connections from within the excitatory population, and external inputs.  The recurrent excitatory connections are mediated by synaptic weights, denoted by a matrix W whose elements w ij (the weight of the synapse from neuron j to i) are continuous non-negative variables (w ij 2 [0,1); w ii = 0). In the following, and in all our simulations, we assume that the weights are initialized randomly before the training takes place (see Materials and Methods).
Therefore, in the absence of external inputs, the local field of each neuron i is given by: where I 0 ðsÞ represents the inhibitory input. For the sake of simplicity, we simulated a synchronous update process, in which the activity of each neuron s i is computed from the local field v i at the previous time step, and all updates happen in parallel.
The network was designed so that, in absence of external input and prior to the training process, it should spontaneously stabilize itself to some fixed overall average activity level f (fraction of active neurons, or sparseness), regardless of the initial conditions. In particular, we aimed at avoiding trivial attractors (the all-off and all-on states). To this end, we model the inhibitory feedback (in absence of external inputs) as a linear function of the overall excitatory activity: The parameters H 0 and λ can be understood as follows: H 0 is the average inhibitory activity when the excitatory network has the desired activity level f, i.e. when P N i¼1 s i ¼ f N; λ measures the strength of the inhibitory feedback onto the excitatory network. This expression can be interpreted as a first-order approximation of the inhibitory activity as a function of the excitatory activity around some reference value fN, which is reasonable under the assumption that the deviations from fN are small enough. Indeed, by properly setting these two parameters in relation to the other network parameters (such as θ and the average connection strength) it is possible to achieve the desired goal of a self-stabilizing network.
In the training process, the network is presented a set of p patterns in the form of strong external inputs, representing the memories which need to be stored. We denote the patterns as fx m g (where μ = 1. . .p and x m i 2 f0; 1g), and assume that each entry x m i is drawn randomly and independently. For simplicity, the coding level f for the patterns was set equal to the spontaneous activity level of the network, i.e. x m i ¼ 1 with probability f, 0 otherwise. During the presentation of a pattern μ, each neuron i receives an external binary input x i ¼ Xx m i , where X denotes the strength of the external inputs, which we parameterized as . In addition, the external input also affects the inhibitory part of the network, eliciting a response which indirectly downregulates the excitatory neurons. We model this effect as an additional term H 1 in the expression for the inhibitory term (Eq 3), which therefore becomes: The general expression for the local field v i then reads: In the absence of external fields, x i = 0 for all i, and thus Eqs 4 and 5 reduce to the previous expressions Eqs 3 and 2.
The goal of the learning process is to find values of w ij 's such that the patterns fx m g become attractors of the network dynamics. Qualitatively, this means that, if the training process is successful, then whenever the network state gets sufficiently close to one of the stored patterns, i.e. whenever the Hamming distance d ¼ P N i¼1 j x m i À s i j between the current network state and a pattern μ is sufficiently small, the network dynamics in the absence of external inputs should drive the network state towards a fixed point equal to the pattern itself (or very close to it). The general underlying idea is that, after a pattern is successfully learned, some brief external input which initializes the network close to the learned state would be sufficient for the network to recognize and retrieve the pattern. The maximum value of d for which this property holds is then called the basin of attraction size (or just basin size hereafter for simplicity); indeed, there is generally a trade-off between the number of patterns which can be stored according to this criterion and the size of their basin of attraction.
More precisely, the requirement that a patternx m is a fixed point of the network dynamics in the absence of external fields can be reduced to a condition for each neuron i (cfr. Eqs 4 and 5): This condition only guarantees that, if the network is initialized into a states ¼x m , then it will not spontaneously change its state, i.e. it implements a zero-size basin of attraction. A simple way to enlarge the basin size is to make the requirement in Eq 6 more stringent, by enforcing a more stringent constraint for local fields: 8i : where ! 0 is a robustness parameter. When = 0, we recover the previous zero-basin-size scenario; increasing we make the neurons' response more robust towards noise in their inputs, and thus we enlarge the basin of attraction of the stored patterns (but then fewer patterns can be stored, as noted above).

The three-threshold learning rule (3TLR)
In the training phase, the network is presented with patterns as strong external fields x i . Patterns are presented sequentially in random order. For each pattern μ, we simulated the following scheme: Step 1: The pattern is presented (i.e. the external inputs x i are set to X x m i ). A single step of synchronous updating is performed (Eqs 1, 4 and 5). If the external inputs are strong enough, i.e. γ is large enough, this updating sets the network in a state corresponding to the presented pattern.
Step 2: Learning occurs. Each neuron i may update its synaptic weights depending on 1) their current value w t ij , 2) the state of the pre-synaptic neurons, and 3) the value of the local field v i . Therefore, all the information required is locally accessible, and no explicit error signals are used. The new synaptic weights w tþ1 ij are set to: where η is the learning rate, and θ 0 and θ 1 are two auxiliary learning thresholds set as We refer to this update scheme as the "three-threshold learning rule" (3TRL). After some number of presentations, we checked whether the patterns are learned by presenting a noisy version of these patterns, and checking whether the patterns (or network states which are very close to the patterns) are fixed points of the network dynamics.
When N ) 1, γ is large enough, and H 1 = fX, the update rule described by Eq 8 is essentially equivalent to the perceptron learning rule for the task described in Eq 7. This can be shown as follows (see also  The three-threshold learning rule (3TLR), and its relationship with the standard perceptron learning rule (PLR). The perceptron learning rule modifies the synaptic weights by comparing the desired output with the actual output to obtain an error signal, subsequently changing the weights in the opposite direction of the error signal (see the table in the left panel). For a pattern which is uncorrelated with the current synaptic weights, the distribution is Gaussian (in the limit of large N), due to the central limit theorem. H 0 is set such that, on average, a fraction f of the local fields are above the neuronal threshold θ; in the case of f = 0.5, this means that the Gaussian is centered on θ (left panel). In our model (Fig 1B), the desired output is given as a strong external input, whose distribution across the population is bimodal (with two delta functions on x i = 0 and x i = X); therefore, the distribution of the local fields during stimulus presentation becomes bimodal as well (right panel). The left and right bumps of this distribution correspond to cases where the desired outputs are zero and one, respectively. Note that, since the external input also elicits an inhibitory response, the neurons in the network which are not directly affected by the external input (i.e. those with desired output equal to zero) are effectively hyperpolarized. If X is sufficiently large, the two distributions do not overlap, and the four cases of the PLR can be mapped to the four regions determined from the three thresholds, indicated by vertical dashed lines (see text).
doi:10.1371/journal.pcbi.1004439.g002 one for which x i = X. The net effect of the stimulus presentation on the local field has to take into account the indirect effect through the inhibitory part of the network (see Eq 4), and thus is equal to −fX for the x i = 0 population and to (1 − f)X for the x i = X population. Before learning, the distribution of the local fields across the excitatory population, in the limit N ! 1, is a Gaussian whose standard deviation is proportional to ffiffiffiffi N p , due to the central limit theorem; moreover, the parameter H 0 is set so that the average activity level of the network is f, which means that the center of the Gaussian will be within a distance of order ffiffiffiffi N p from the neuronal threshold θ (this also applies if we use different values for the spontaneous activity level and the pattern activity level). Therefore, if is large enough, the state of the network during stimulus presentation will be effectively clamped to the desired output, i.e. s i ¼ x m i for all i. This fact has two consequences: 1) the local field potential can be used to detect the desired output by just comparing it to the threshold, and 2) each neuron i will receive, as its recurrent inputs . Furthermore, due to the choice of the secondary thresholds θ 0 and θ 1 in Eqs 9 and 10, the difference between the local field and θ 0 (or θ 1 ) during stimulus presentation for the x i = 0 population (or x i = X, respectively) is equal to the difference between the local field and y À f , respectively) in the absence of external stimuli, provided the recurrent inputs are the same. Therefore, the value of the local field v i during stimulus presentation in relation to the three thresholds θ, θ 0 and θ 1 is sufficient to determine whether an error is made with respect to the constraints of Eq 7, and which kind of error is made. Following these observations, it is straightforward to map the standard perceptron learning rule on the 4 different cases which may occur (see Fig 2), resulting in Eq 8.
In Fig 3 we demonstrate the effect of the learning rule on the distribution of the local field potentials as measured from a simulation (with f = 0.5 and = 1.2): the initial distribution of the local fields of the neurons, before the learning process takes place and in the absence of external fields, is well described by a Gaussian distribution centered on the neuronal threshold θ (see Fig 3A) with a standard deviation which scales as ffiffiffiffi N p . During a pattern presentation, the resulting distribution becomes a bimodal one; before learning takes place, the distribution is given by the sum of two Gaussians of equal width, centered around y 0 þ f ffiffiffiffi N p and y 1 À f ffiffiffiffi N p (Fig 3B). The left Gaussian corresponds to the cases where x i = 0 and the right one to the cases where x i = X. Having applied the learning rule, we observe that the depression region (i.e. the interval (θ 0 , θ)) and the potentiation region (i.e. (θ, θ 1 )) gets depleted (Fig 3C). In the testing phase, when the external inputs are absent, the left and right parts of the distribution come closer, such that the distance between the two peaks is equal to at least 2f ffiffiffiffi N p (Fig 3D). This margin between the local fields of the ON and OFF neurons makes the attractors more robust.

Storage capacity
Since our proposed learning rule is able to mimic (or approximate, depending on the parameters) the perceptron learning rule, which is known to be able to solve the task posed by Eq 7 whenever a solution exists, we expect that a network implementing such rule can get close to maximal capacity in terms of the number of memories which it can store at a given robustness level. The storage capacity, denoted by α = p/N, is measured as a ratio of the maximum number of patterns p which can successfully be stored to the number of neurons N, in the limit of large N. As mentioned above, it is a function of the basin size.
We used the following definition for the basin size: a set of p patterns is said to be successfully stored at a size b if, for each pattern, the retrieval rate when starting from a state in which a fraction b of the pattern was randomized is at least 90%. The retrieval rate is measured by the probability that the network dynamics is able to bring the network state to an attractor within 1% distance from the pattern, in at most 30 steps. The distance between the state of the network and a pattern μ is measured by the normalized Hamming distance 1 N P N i¼1 j s i À x m i j. Therefore, at coding level f = 0.5, reaching a basin size b means that the network can successfully recover patterns starting from a state at distance b/2. Distribution of local fields before and after learning for f = 0.5 and non-zero robustness. A. Before learning begins, the distribution of local field of neurons is a Gaussian distribution (due to central limit theorem) centered around neuronal threshold θ both for neurons with the desired output zero (OFF neurons) and with the desired output one (ON neurons). The goal is to have the local field distribution of ON neurons (red curve) to be above the threshold θ, and that of OFF neurons to be below θ. B. Once any of the to-be-stored patterns are presented as strong external fields, right before the learning process starts, the local field distribution of the OFF neuron shifts toward the left-side centered around y 0 þ f ffiffiffi ffi N p , whereas the distribution of the ON neurons moves toward the right-side, centered around y 1 À f ffiffiffi ffi N p , with a negligible overlap between the two curves if the external field is strong enough. Thanks to the strong external fields and global inhibition, the local fields of the ON and OFF neurons are well separated. C. Due to the learning process, the local fields within the depression region [i.e. (θ 0 , θ)] get pushed to the left-side, below θ 0 , whereas those within the potentiation region get pushed further to the right-side, above θ 1 . If the learning process is successful, it will result in a region (θ 0 , θ 1 ) which no longer contain local fields, with two sharp peaks on θ 0 and θ 1 . D. After successful learning, once the external fields are removed, the blue and red curves come closer, with a gap equal to 2f ffiffiffi ffi N p . The larger the robustness parameter , the more the gap between the left-and right-side of the distribution. Notice that now the red curve is fully above θ which means those neurons remain stably ON, while the the blue curve is fully below θ, which means those neurons are stably OFF. Therefore the corresponding pattern is successfully stored by the network.  for the calculation), as for a network with unconstrained synaptic weights [20]. In Fig 4A, we also compare our network with the Hopfield model. Our network stores close to the maximal capacity at zero basin size, at least eleven times more than the Hopfield model. Across the range of basin sizes, 3TLR achieves more than twice the capacity that can be achieved with the Hopfield model.
The enlargement of the basin of attraction was achieved by increasing the robustness parameter . We computed the maximal theoretical capacity as a function of at N ! 1 (see Materials and Methods) and compared it to our simulations, and to the maximal theoretical capacity of the Hopfield network. The results are shown in Fig 4B. For any given value of , the cyan curve shows the maximum α for which the success ratio with our network was at least 0.5 across different runs. The difference between the theory and the experiments in our model can be ascribed to several factors: the finite size of the network; the choice of the finite learning rate η, and the fact that we imposed a hard limit on the number of pattern presentations (see number of iterations in Table 1), while the perceptron rule for excitatory synaptic connectivity is only guaranteed to be optimal in the limit of η ! 0, with a number of presentations inversely proportional to η [25]. Note that the correspondence between the PLR and the 3TLR is only perfect in the large γ limit, and is only approximate otherwise, as can be shown by comparing explicitly the synaptic matrices obtained by both algorithms on the same set of patterns (see Materials and Methods).
A crucial ingredient of the 3TLR is having a strong external input which effectively acts as a supervisory signal. How strong do the external fields need to be? How much does the capacity depend on this strength? To answer these questions, we measured the maximum number of stored patterns as a function of the parameter γ which determines the strength of external fields as X ¼ g ffiffiffiffi N p . This parameter, in fact, determines how far the two Gaussian distributions of the local field are; as shown in Fig 2, the distance between the two peaks of the distribution is X. For large enough γ, the overlap of these two distributions is negligible and the capacity is maximal; but as we lower γ, the overlap increases, causing the learning rule to make mistakes, i.e. when it should potentiate, it depresses the synapses and vice versa. In our simulations with N = 1001 neurons in the dense regime f = 0.5 at a fixed epsilon = 0.3, we varied γ and computed the maximum α that can be achieved with a fixed number of iterations (1000). The capacity indeed gradually decreases as γ decreases, until it reaches a threshold, below which there is a sharp drop of capacity (see Fig 5). With the above values for the parameters, this transition occurs at γ % 2.4.
The 3TLR can also be adapted to work in a sparser regime, at a coding level lower than 0.5. However, the average activity level of the network is determined by H 0 , and their relationship also involves the variance of the distribution of the synaptic weights when f 6 ¼ 0.5 (see Materials and Methods). During the learning process, the variance of the weights changes, which implies that the parameter H 0 must adapt correspondingly. In our simulations, this adaptation was performed after each complete presentation of the whole pattern set. In practice, this additional self-stabilizing mechanism could still be performed in an unsupervised fashion along with (or in alternation with) the learning process. Using this adjustment, we simulated the network at f = 0.2 and compared the results with the theoretical calculations. As shown in Fig 6, we can achieve at least 70% of the critical capacity across different values of the robustness parameter . We also investigated numerically the effect of correlations in the input patterns. The PLR is able to learn correlated patterns as long as a solution to the learning problem exists. As the 3TLR approximates the PLR, we expect the 3TLR to be able to learn correlated patterns as well. As a simple model of correlation, we tested patterns organized in L categories [26,27]. Each category was defined by a randomly generated prototype. Prototypes were uncorrelated from category to category. For each category, we then generated p/L patterns independently with a specified correlation coefficient c with the corresponding prototype. We show in

Statistical properties of the connectivity matrix
We next investigated the statistical properties of the connectivity matrix after the learning process. Previous studies have shown that the distribution of synaptic weights in perceptrons with excitatory synapses becomes at maximal capacity a delta function at zero weight, plus a truncated Gaussian for strictly positive weights [25,[28][29][30]. Our model differs from this setting because of the global inhibitory feedback. Despite this difference, the distribution of weights in our network bear similarities with the results obtained in these previous studies: the distribution exhibits a peak at zero weight ('silent', or 'potential' synapses), while the distribution of strictly positive weights resembles a truncated Gaussian. Finally, the fraction of silent synapses increases with the robustness parameter (see Fig 8).
We have also computed the degree of symmetry of the weight matrix. The symmetry degree is computed as the Pearson correlation coefficient between the reciprocal weights in pairs of neurons. We observe a general trend towards an increasingly symmetric weight matrix as more patterns are stored, for all values of the robustness parameter (see Fig 9).

Discussion
We presented a biologically-plausible learning rule that is characterized by three thresholds, and is able to store memory patterns close to the maximal storage capacity in a recurrent neural networks without the need of an explicit "error signal". We demonstrated how the learning rule can be considered a transformed version of the PLR in the limit of a strong external field. Our network implements the separation between excitatory and inhibitory neurons, with learning occurring only at excitatory-to-excitatory synapses. We simulated a recurrent network with N = 1001 binary neurons, reaching to α c = 1.6 at zero basin size. We then used a robustness parameter to enlarge the basin size. The simulations showed that we are close to the theoretical capacity across the whole investigated range of values of . We expect that as N increases and the learning rate gets smaller, this difference would go to zero.  The degree of symmetry of the weight matrix. The Pearson correlation coefficient between w ij and w ji is computed at different values of α for three values of . As α increases the weight matrix tends to be more symmetric, but gets saturated for high α. For the same values of α, as the robustness increases, the correlation also increases, so the weight matrix becomes more symmetric. Error bars (across 10 runs) are smaller than the symbols. Two crucial ingredients of the 3TLR are necessary: (1) strong external inputs, (2) three learning thresholds which are set according to the statistics of inputs to the neuron. The learning rule only uses information that is local to a synapse and corresponding neurons. Like classic Hebbian learning rules, our 3TLR works in an online fashion. In addition, it can also perform as a 'palimpsest' [31][32][33]: in case the total number of patterns exceeds the maximal capacity (at a certain basin size) the network begins to forget patterns that are not being presented anymore.

Comparison with other learning rules
The 3TLR can be framed in the setting of the classic Bienenstock-Cooper-Munro (BCM) theory [34,35], with additional requirements to adapt it to the attractor network scenario. The original BCM theory uses firing-rate units, and prescribes that synaptic modifications should be proportional to (1) the synaptic input, and (2) a function ϕ(v) of the total input v (or, equivalently, of the total output). The function ϕ(v) is subject to two conditions: (1) ϕ(v) ! 0 (or 0) when v > θ (or < θ, respectively); (2) ϕ(0) = 0. The parameter θ is also assumed to change, but on a longer time scale (such that the changes reflect the statistics of the inputs); this (metaplastic) adaptation has the goal of avoiding the trivial situations in which all inputs elicit indistinguishable responses. This (loosely specified) framework ensures that, under reasonable conditions, the resulting units become highly selective to a subset of the inputs, and has been mainly used to model the developmental stages of primary sensory cortex. The arising selectivity is spontaneous and completely unsupervised: in absence of further specifications, the units become selective to a random subset of the inputs (e.g. depending on random initial conditions).
Our model is defined on simpler (binary) units; however, if we define , then ϕ behaves according to the prescriptions of the BCM theory. Furthermore, we have essentially assumed the same slow metaplastic adaptation mechanism of BCM, even though we have assigned this role explicitly to the inhibitory part of the network (see Materials and Methods). On the other hand, our model has additional requirements: (1) ϕ(v) = 0 when v < θ 0 or v > θ 1 , (2) plasticity occurs during presentation of external inputs, which in turn are strong enough to drive the network towards a desired state. The second requirement ensures that the network units become selective to a specific subset of the inputs, as opposed to a random subset as in the original BCM theory, and thus that they are able to collectively behave as an attractor network. The first requirement ensures that each unit operates close to critical capacity. Indeed, these additional requirements involve extra parameters with respect to the BCM theory, and we implicitly assume these parameters to also slowly adapt according to the statistics of the inputs during network formation and development.
A variant of the BCM theory, known as ABS rule [36,37] introduced a lower threshold for LTD, analogous to our θ 0 , motivated by experimental evidence; however, a high threshold for LTP, analogous to our θ 1 , was not used there, or-to our knowledge-in any other BCM variant. The idea of stopping plasticity above some value of the 'local field' has been introduced previously to stabilize the learning process in feed-forward networks with discrete synapses [38][39][40]. Our study goes beyond these previous works in generalizing such a high threshold to recurrent networks, and showing that the resulting networks achieve close to maximal capacity.

Comparison with data and experimental predictions
In vitro experiments have characterized how synaptic plasticity depends on voltage [41] and firing rate [42], both variables that are expected to have a monotonic relationship with the total excitatory synaptic inputs received by a neuron. In both cases, a low value of the controlling variable leads to no changes; intermediate values lead to depression; and high values to potentiation. These three regimes are consistent with the three regions for v < θ 1 in Fig 2. The 3TLR predicts that a fourth region should occur at sufficiently high values of the voltage and/or firing rates. Most of the studies investigating the dependence of plasticity on firing rate or voltage have not reported a decrease in plasticity at high values of the controlling variables, but these studies might have not increased sufficiently such variables. To our knowledge, a single study has found that at high rates, the plasticity vs rate curve is a decreasing function of the input rate [43].
Another test of the model consists in comparing the statistics of the synaptic connectivity with experimental data. As it has been argued in several recent studies [25,28,30,44,45], networks with plastic excitatory synapses are generically sparse close to maximal capacity, with a connection probability that decreases with the robustness of information storage, consistent with short range cortical connectivity [46,47]. Our network is no exception, though the fraction of silent synapses that we observe is significantly lower than in models that lack inhibition. Furthermore, network that are close to maximal capacity tends to have a connectivity matrix that has a significant degree of symmetry, as illustrated by the over-representation of bidirectionally connected pairs of neurons, and the tendency of bidirectionally connected pairs to form stronger synapses than unidirectionally connected pairs as observed in cortex [47,48], except in barrel cortex [49]. Again, the 3TLR we have proposed here reproduces this feature (Fig 9), consistent with the fact that the rule approaches the optimal capacity.

Future directions
Our network uses the simplest possible single neuron model [50]. One obvious direction for future work would be to implement the learning rule in a network of more realistic neuron models such as firing rate models or spiking neuron models. Another potential direction would be to understand the biophysical mechanisms leading to the high threshold in the 3TLR. In any case, we believe the results discussed here provide a significant step in the quest for understanding how learning rules in cortical networks can optimize information storage capacity.

Materials and Methods Simulation
The main equations of the network, the neuron model, the learning rule, and the criteria for stopping the learning algorithm are outlined in the Results section, Eqs 1-7. We present here additional details about network simulations.
Network setup before learning process. Before applying the learning rule, we required the network to have stable dynamics around a desired activity level f. A network with only excitatory neurons is highly unstable and typically converges towards the trivial all-off and allon states; therefore, we implemented a global inhibition such that the network operates around activity level f. The basal inhibitory term (H 0 ) and the inhibitory reaction term (H 1 ) are defined as: where H x ð Þ ¼ 1 2 erfc x ffi ffi and H −1 is the inverse of H, ψ is defined as θ = (N − 1)ψ; w and σ w are the mean and standard deviation of the synaptic weights, respectively. With these definitions the network dynamics is stable in the sense that the activity level converges to f very fast, regardless of the initial condition. In Eq 11, we see that H 0 depends on the activity level f and on the standard deviation of the weights σ w . In the dense regime, f = 0.5, we have H −1 (0.5) = 0, therefore the rightmost term of Eq 11 vanishes, which means that in this regime H 0 is independent of σ w . However, in sparser regimes, the network must be endowed with a mechanism to adjust for the changes in standard deviation, otherwise the learning process would bring the network out of the stable state, changing the basal activity level. In contrast, the mean synaptic efficacy w does not change significantly during the learning process.
In all our simulations, the initial values for {w ij } were sampled from a Gaussian distribution with mean and standard deviation equal to one, after which negative values were set to zero. This has the effect the w init ij is slightly higher than one. We also set w ii = 0 for all i. Table 1 shows the values of the parameters used in the simulations, in the dense and sparse regimes.
Direct comparison between the 3TLR and the PLR. In order to determine the degree to which the 3TLR is able to mimic the PRL, and the effect of deviations from the latter rule, we tested both rules on the same tasks. In these simulations, every part of the simulation code was kept identical-including the pseudo-random numbers used to choose the initial state and the arbitrary permutations for the update order of the units-except for the learning rule. We tested the network in the dense case f = 0.5, at = 3, varying the storage load α, using 10 samples for each point. We compared the probability of solving the learning task and the distribution of the discrepancies (absolute value of the differences) in the values of the resulting synaptic weights. We tested two values of the parameter γ, 6 (as in Fig 4) and 12. We found that at γ = 12 there was absolutely no difference between the two rules, while at γ = 6 the 3TLR performed slightly worse, and significant deviations from the PLR started to appear close to the maximal capacity of the 3TLR (see Fig 10).
Analytical calculation of the storage capacity at infinite N Entropy calculation. In this section, we present the details of the calculations for the typical storage capacity of our network in the limit of N ! 1, using the Gardner analysis [20,28].
The capacity is defined as the maximum value of α = p/N such that a solution to Eq 7 can typically be found.
We can rewrite Eq 7 as where l ¼ w ð15Þ Eq 13 becomes: Let us now consider a single unit i. We write s m i ¼ 2x m i À 1 À Á , and re-parametrize the weights as W ij ¼ w ij w À 1 2 À1; 1 ½ Þ, and also define Dropping the index i and neglecting terms of order 1, we obtain: Our goal is to compute the quenched entropy of this problem, i.e. the scaled average of the logarithm of the volume of W which satisfies the above equation: The computation proceeds along the lines of [20,28], by using the so-called replica trick to perform the average of the logarithm of V, exploiting the identity: performing the computation for integer values of n and using an analytical continuation to perform the limit n ! 0. We perform the calculation using the replica-symmetric (RS) Ansatz, which is believed to give exact results in the case of perceptron models with continuous weights. The final expression for the entropy depends on six order parameters; the first three are Q, q and M, whose meaning is where we used W a and W b to denote two different replicas of the system, which can simply be interpreted as two independent solutions to the constraint equation. Q is called the self-overlap, and is equal to s w w À Á 2 in our case, while q is the mutual-overlap. The remaining order parameters are the conjugate quantitiesQ,q andM. The entropy expression is: . In the following, we will also use the shorthand GðxÞ ¼ GðxÞ HðxÞ . We also used the notation h Á i σ to denote the average over the output σ, i.e. h'(σ)i σ = f'(1) + (1 − f ) ' (−1) for any function '. The value of the order parameters is found by extremizing S. conjugate order parameters diverge as:q Using these conditions, and calling α c the critical value of α, the saddle point equations, 29 to 34, become: C ¼ a c Qhð1 þ t 2 s ÞHðt s Þ À t s Gðt s Þi s ð44Þ A ¼ a c hHðt s Þi s ð45Þ 0 ¼ hsðGðt s Þ À t s Hðt s ÞÞi s ð46Þ where we defined These equations can be solved numerically to find the six parameters α c , Q, A, B, C and M. Note that in the special case K = 0 these equations have a degenerate solution with Q = 0 and the same α c as in the case of unbounded synaptic weights (e.g. α c = 2 for f = 0.5). This is because in that case the original problem has the property that scaling all weights by a factor of x is equivalent to scaling the boundary w by a factor of x −1 (see Eq 16); therefore, the optimal strategy is to exploit this property by setting x ! 0, i.e. effectively reducing the problem to the unbounded case. Of course, this strategy can only be pursued up to the available precision in a practical setting.

Author Contributions
Conceived and designed the experiments: CB NB RZ. Performed the experiments: AA CB. Analyzed the data: AA CB. Contributed reagents/materials/analysis tools: AA CB. Wrote the paper: AA CB NB.