Correcting the Hebbian mistake: Toward a fully error-driven hippocampus

The hippocampus plays a critical role in the rapid learning of new episodic memories. Many computational models propose that the hippocampus is an autoassociator that relies on Hebbian learning (i.e., “cells that fire together, wire together”). However, Hebbian learning is computationally suboptimal as it does not learn in a way that is driven toward, and limited by, the objective of achieving effective retrieval. Thus, Hebbian learning results in more interference and a lower overall capacity. Our previous computational models have utilized a powerful, biologically plausible form of error-driven learning in hippocampal CA1 and entorhinal cortex (EC) (functioning as a sparse autoencoder) by contrasting local activity states at different phases in the theta cycle. Based on specific neural data and a recent abstract computational model, we propose a new model called Theremin (Total Hippocampal ERror MINimization) that extends error-driven learning to area CA3—the mnemonic heart of the hippocampal system. In the model, CA3 responds to the EC monosynaptic input prior to the EC disynaptic input through dentate gyrus (DG), giving rise to a temporal difference between these two activation states, which drives error-driven learning in the EC→CA3 and CA3↔CA3 projections. In effect, DG serves as a teacher to CA3, correcting its patterns into more pattern-separated ones, thereby reducing interference. Results showed that Theremin, compared with our original Hebbian-based model, has significantly increased capacity and learning speed. The model makes several novel predictions that can be tested in future studies.


Implementational Details
The model was implemented using the Leabra framework, which is described in detail at https://github.com/emer/leabra and in [1] and [2], and summarized here. These same equations and default parameters have been used to simulate over 40 different models in [1] and [2], as well as a number of other research models. Thus, the model can be viewed as an instantiation of a systematic modeling framework using standardized mechanisms, rather than one that constructs new mechanisms for each model.
The basic activation dynamics are based on standard electrophysiological principles of real neurons, as captured by the adaptive exponential integrate-and-fire (AdEx) model of Gerstner and colleagues [3], using a rate-code approximation that produces a graded activation signal matching the actual instantaneous rate of spiking across a population of AdEx neurons. We generally conceive of a single rate-code neuron as representing a microcolumn of roughly 100 spiking pyramidal neurons in the neocortex.
The excitatory synaptic input conductance (i.e., net input) is computed as an average, not a sum, over connections, based on normalized, sigmoidally transformed weight values, which are subject to projection-level scaling to alter relative contributions. Automatic scaling is performed to compensate for differences in expected activity level across the different projections.
Inhibition is computed using a feed-forward (FF) and feed-back (FB) inhibition function (FFFB) that closely approximates the behavior of inhibitory interneurons in the neocortex. FF inhibition is based on a multiplicative factor applied to the average net input coming into a layer, and FB inhibition is based on a multiplicative factor applied to the average activation within the layer. These simple linear functions do an excellent job of controlling overall activation levels in bidirectionally connected networks, producing behavior very similar to the more abstract k-winners-take-all (kWTA) dynamics implemented in previous versions of the framework.
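To make the FFFB dynamic concrete, here is a minimal Python sketch of one inhibition update. The parameter values (FF = 1, FB = 1, a feed-forward zero point FF0 = 0.1, feedback time constant FBTau = 1.4, overall gain Gi = 1.8) follow the defaults documented in the emer/leabra README; the function name and calling convention are ours, not the library's.

```python
def fffb_inhibition(ge_avg, act_avg, fbi_prev, gi=1.8, ff=1.0, fb=1.0,
                    ff0=0.1, fb_tau=1.4):
    """One step of FFFB inhibition for a layer (illustrative sketch).

    ge_avg: average excitatory net input into the layer (drives FF)
    act_avg: average activation within the layer (drives FB)
    fbi_prev: previous feedback inhibition state (FB is time-integrated)
    Returns (total inhibitory conductance, updated feedback state).
    """
    # Feed-forward: multiplicative factor on average net input, with a
    # zero point ff0 below which no feed-forward inhibition is produced.
    ffi = ff * max(ge_avg - ff0, 0.0)
    # Feed-back: multiplicative factor on average activation, integrated
    # over time to approximate inhibitory interneuron dynamics.
    fbi = fbi_prev + (1.0 / fb_tau) * (fb * act_avg - fbi_prev)
    # Overall gain gi scales the combined inhibition.
    return gi * (ffi + fbi), fbi

# One update for a layer with moderate input and sparse activity:
g_i, fb_state = fffb_inhibition(ge_avg=0.5, act_avg=0.2, fbi_prev=0.0)
```

Because both terms are simple linear functions of layer averages, stronger input or activity directly produces stronger inhibition, which is what keeps overall activation levels stable in bidirectionally connected networks.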
There is a single learning equation, derived from a detailed model of spike-timing-dependent plasticity (STDP) developed by [4], that produces a combination of Hebbian associative and error-driven learning. For historical reasons, we call this the XCAL equation (eXtended Contrastive Attractor Learning); it is functionally very similar to the BCM learning rule developed by [5]. The essential learning dynamic involves a Hebbian co-product of sending-neuron activation times receiving-neuron activation, which biologically reflects the amount of calcium entering through NMDA channels; this co-product is then compared against a floating threshold value. To produce the Hebbian learning dynamic, the floating threshold is based on a long-term running average of the receiving neuron's activation, which is the key idea behind the BCM algorithm. To produce error-driven learning, the floating threshold is instead based on a much faster running average of activation co-products, which reflects an expectation or prediction, against which the instantaneous, later outcome is compared.
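The XCAL function itself is a simple piecewise-linear "checkmark" function; the following Python sketch follows the form and default parameters (DThr, DRev) given in the emer/leabra README. The same function implements both learning modes, differing only in what is passed as the floating threshold th.

```python
def xcal(x, th, d_thr=0.0001, d_rev=0.1):
    """Piecewise-linear XCAL 'checkmark' function (illustrative sketch).

    x: sender * receiver activity co-product (proxy for NMDA calcium)
    th: floating threshold (fast co-product average for error-driven
        learning, or long-term receiver average for BCM-style Hebbian)
    """
    if x < d_thr:
        return 0.0  # below threshold: no plasticity at all
    if x > th * d_rev:
        return x - th  # linear region, LTD below th, LTP above it
    return -x * ((1.0 - d_rev) / d_rev)  # LTD reversal region near zero
```

Passing th = srm (the medium-term expectation co-product) yields the error-driven component, while th = AvgL (the long-term receiver average) yields the Hebbian component, matching the two floating thresholds described above.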
Weights are subject to a contrast enhancement function, which compensates for the soft (exponential) weight bounding that keeps weights within the normalized 0-1 range. Contrast enhancement is important for enhancing the selectivity of self-organizing learning, and generally results in faster learning with better overall results. Learning operates on the underlying internal linear weight value. Biologically, we associate this underlying linear weight value with internal synaptic factors such as actin scaffolding and CaMKII phosphorylation levels, while the contrast enhancement operates at the level of AMPA receptor expression.
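The mapping from the internal linear weight to the effective weight can be sketched as the sigmoidal contrast-enhancement function used in Leabra (Gain = 6, Off = 1 defaults per the README); the function name here is illustrative.

```python
def sig_wt(lwt, gain=6.0, off=1.0):
    """Sigmoidal contrast enhancement from the internal linear weight
    (lwt, in 0-1) to the effective weight used for net input (sketch)."""
    if lwt <= 0.0:
        return 0.0
    if lwt >= 1.0:
        return 1.0
    # Weights above 0.5 are amplified toward 1, weights below 0.5 are
    # suppressed toward 0, increasing selectivity.
    return 1.0 / (1.0 + (off * (1.0 - lwt) / lwt) ** gain)
```

Note that 0.5 is a fixed point of this function: weak and strong weights are pushed apart around it, which is what sharpens the selectivity of self-organizing learning.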
The following are the main equations used to simulate neural activity and learning (see the README.md at https://github.com/emer/leabra for complete details and discussion).

Activation Equations
• GeRaw += Sum_(recv) Prjn.GScale * Send.Act * Wt -- Prjn.GScale is the input scaling factor, which includes 1/N to compute an average rather than a sum, along with the WtScaleParams Abs (absolute) and Rel (relative) scaling factors, which allow one to easily modulate the overall strength of different input projections.
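As a concrete illustration of averaging rather than summing, here is a minimal Python sketch of the net input from a single projection (variable names ours):

```python
def ge_raw(send_acts, wts, gscale):
    """Excitatory net input from one projection as a scaled average
    (sketch). gscale is assumed to fold in the 1/N normalization along
    with the Abs/Rel projection scaling factors."""
    total = 0.0
    for act, wt in zip(send_acts, wts):
        total += act * wt
    return gscale * total

# Four sending neurons; gscale carries the 1/N averaging factor.
acts = [0.0, 0.5, 1.0, 0.25]
wts = [0.3, 0.8, 0.5, 0.9]
g = ge_raw(acts, wts, gscale=1.0 / len(acts))
```

Because the result is an average, the net input stays in a comparable range regardless of how many connections a projection has, which is what makes the automatic cross-projection scaling workable.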

Learning Equations
• AvgSS += (1 / SSTau) * (Act - AvgSS) -- super-short time scale running average, SSTau = 2 default; this is the first pass in a sequence of running-average integrations of activity that drive temporal-difference learning.
• AvgS += (1 / STau) * (AvgSS - AvgS) -- short time scale running average, STau = 2 default, reflecting the most recent (plus-phase, outcome) state.
• AvgM += (1 / MTau) * (AvgS - AvgM) -- medium time scale running average, MTau = 10 default, reflecting the earlier (minus-phase, expectation) state.
• AvgL += (1 / Tau) * (Gain * AvgM - AvgL); AvgL = MAX(AvgL, Min) -- long-term running average that serves as the floating threshold for the Hebbian (BCM-like) component of learning.
• AvgLLrn = ((Max - Min) / (Gain - Min)) * (AvgL - Min) -- learning strength factor for how much to learn based on the AvgL floating threshold; this is dynamically modulated by the strength of AvgL itself, which turns out to be critical: the amount of this learning increases as units are more consistently active all the time (i.e., "hog" units). Params on AvgLParams, Min = 0.0001, Max = 0.5. Note that this depends on having a clear max to AvgL, which is an advantage of the exponential running-average form above.
• AvgLLrn *= MAX(1 - layCosDiffAvg, ModMin) -- also modulate by the time-averaged cosine (normalized dot product) between the minus- and plus-phase activation states in the given receiving layer (layCosDiffAvg; time constant 100). If error signals are small in a given layer, then Hebbian learning should also be relatively weak so that it does not overpower them; conversely, layers with higher levels of error signals can handle (and benefit from) more Hebbian learning. The MAX with ModMin (ModMin = .01) ensures a minimum level of .01 Hebbian learning (multiplying the previously computed factor above). The .01 * .05 factors give an upper-level value of .0005 to use for a fixed constant AvgLLrn value; just slightly less than this (.0004) seems to work best if not using these adaptive factors.
• AvgSLrn = (1 - LrnM) * AvgS + LrnM * AvgM -- mix some of the medium-term factor into the short-term factor; this is important for ensuring that when a neuron turns off in the plus phase (short term), enough of a trace of the earlier minus-phase activation remains to drive it into the LTD weight-decrease region. LrnM = .1 default.
• srs = Send.AvgSLrn * Recv.AvgSLrn -- short-term sender-receiver co-product, reflecting the most recent (outcome) state.
• srm = Send.AvgM * Recv.AvgM -- medium-term sender-receiver co-product, reflecting the earlier (expectation) state.
• DWt = Lrate * (XCAL(srs, srm) + Recv.AvgLLrn * XCAL(srs, Recv.AvgL)) -- the weight change combines the error-driven component (outcome co-product relative to the expectation co-product) and the Hebbian, BCM-like component (co-product relative to the long-term floating threshold AvgL), weighted by AvgLLrn as computed above.
• if DWt > 0: DWt *= Wb.Inc * (1 - LWt); else: DWt *= Wb.Dec * LWt -- soft weight bounding: weight increases exponentially decelerate toward the upper bound of 1, and decreases toward the lower bound of 0, based on the linear, non-contrast-enhanced LWt weights. The Wb factors are how the weight-balance term shifts the overall magnitude of weight increases and decreases.
• LWt += DWt -- increment the linear weights with the bounded DWt term.
• Wt = SIG(LWt) -- LWt is the linear, non-contrast-enhanced version of the weight value, and Wt is the sigmoidal contrast-enhanced version, which is used for sending net input to other neurons. One can compute LWt from Wt and vice versa, but numerical errors can accumulate in going back and forth more than necessary, so it is generally faster to just store both weight values.
• DWt = 0 -- reset the weight changes now that they have been applied.
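The sequence of cascaded running averages and the soft-bounded weight update can be sketched together in Python as follows; time constants follow the defaults stated above, and the function names are illustrative rather than part of the library.

```python
def update_avgs(act, avg_ss, avg_s, avg_m,
                ss_tau=2.0, s_tau=2.0, m_tau=10.0):
    """Cascaded exponential running averages of activation (sketch):
    each stage integrates the one before it, yielding the short-term
    (outcome-like) and medium-term (expectation-like) signals that
    drive temporal-difference learning."""
    avg_ss += (1.0 / ss_tau) * (act - avg_ss)
    avg_s += (1.0 / s_tau) * (avg_ss - avg_s)
    avg_m += (1.0 / m_tau) * (avg_s - avg_m)
    return avg_ss, avg_s, avg_m

def avg_s_lrn(avg_s, avg_m, lrn_m=0.1):
    # Mix some medium-term signal into the short-term one so a neuron
    # that turns off in the plus phase still reaches the LTD region.
    return (1.0 - lrn_m) * avg_s + lrn_m * avg_m

def apply_dwt(lwt, dwt, wb_inc=1.0, wb_dec=1.0):
    """Soft-bounded update of the linear weight lwt: increases
    decelerate toward 1, decreases toward 0; wb_inc/wb_dec stand in
    for the weight-balance (Wb) factors (1.0 = neutral)."""
    if dwt > 0:
        dwt *= wb_inc * (1.0 - lwt)
    else:
        dwt *= wb_dec * lwt
    return lwt + dwt
```

For example, a neuron held at full activation drives all three averages toward 1, with the medium-term average lagging behind the short-term one; and a large positive weight change applied near the upper bound (e.g., apply_dwt(0.9, 0.5)) moves the linear weight only a tenth of the way it would at the bottom of the range, keeping it inside 0-1 without a hard clip.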