Flexible Kernel Memory

This paper introduces a new model of associative memory, capable of both binary and continuous-valued inputs. Based on kernel theory, the memory model is on one hand a generalization of Radial Basis Function networks and, on the other, is in feature space, analogous to a Hopfield network. Attractors can be added, deleted, and updated on-line simply, without harming existing memories, and the number of attractors is independent of input dimension. Input vectors do not have to adhere to a fixed or bounded dimensionality; they can increase and decrease it without relearning previous memories. A memory consolidation process enables the network to generalize concepts and form clusters of input data, which outperforms many unsupervised clustering techniques; this process is demonstrated on handwritten digits from MNIST. Another process, reminiscent of memory reconsolidation is introduced, in which existing memories are refreshed and tuned with new inputs; this process is demonstrated on series of morphed faces.

In this paper, we introduce a memory model, in which its memory attractors do not lie in the input or neural space as in classical models but rather in a feature space with large or infinite dimension. This model is isomorphic to a symmetric Hopfield network in the kernels' -space, giving rise to a Lyapunov function in the dynamics of associative recalls, which enables the analogy to be drawn between memories and attractors in the kernel space.
There are several advantages to our novel kernel approach to attractor memory. The input space can be composed of either continuous-valued or binary vectors. The number of attractors m is independent of the input dimension n, thus posing a saturatedfree model, which does not suffer corrupted memories with memory overload. Attractors can be efficiently loaded, deleted, and updated on-line, something that has previously been only a property of symbolic computer-memory models. Furthermore, for the first time in neural memory models, we have demonstrated a method allowing input dimension not to be constrained to a fixed size or be a priori bounded; dimension can change with time, similar to organic memory allocation for memories of greater importance or increased detail. These attributes may be very beneficial in psychological modeling.
The process of kernel memory consolidation results in attractors in feature space and Voronoi-like diagrams that can be projected efficiently to the input space. The process can also describe clusters, which enables the separation of even very close-by attractors. Another re-consolidation process enables tracking monotonic updates in inputs including moving and changing objects.

Generalizing Radial-Basis-Function Networks
Our network can be thought of as generalizing Radial Basis Function (RBF) networks [23]. These are 2-layered feed-forward networks with the first layer of neurons having linear activation and the second layer consisting of neurons with RBF activation function. Recurrent versions of the RBF networks [24,25] add time-delayed feedback from the second to the first layer. Our network enables a more generalized structure, both in terms of number of layers and in allowing for many more general activation functions.
Unlike previous RBF networks, our activation functions are chosen from a large variety of kernels that allow us to distinguish attractors that are similar or highly correlated. Furthermore, unlike any previous RBF network with its fixed architecture and activation functions, our selected neural kernel functions can change during learning to reflect the memory model's changing properties, dimension, or focus. We go on to prove that the attractors are either fixed points or 2-cycles, unlike general recurrent RBF networks that may have arbitrary chaotic attractors [26,27]; regular attractors are advantageous for a memory system.

Synaptic plasticity and Memory Reconsolidation
Reconsolidation is a process occurring when memory becomes liable during retrieval and can then be updated. This process is implicated in learning and flexible memories when healthy; it leads to amnesia and compulsive disorders when corrupted. Reconsolidation is observed both in neurophysiological and psychological studies ( [28]- [33]) and has been modeled in artificial neural systems as well ( [19,34]). While the actual processes underlying reconsolidation are still being studied, the property of dependance on sample ordering has been established in both electrophysiology of CA3 neurons [13] and in psychophysics [35]. In reconsolidation, memory representations are sensitive to the order of sample data: when samples change in an orderly manner, the reconsolidation process learns and updates effectively. When samples are shuffled and consistent direction of change is lost, existing memories do not update. We show here that the importance of input ordering is inherent in any update processes, reminiscent of reconsolidation. We also demonstrate how reconsolidation works in flexible environments and with large-scale data beyond the model shown in [19].
Our flexible model assumes global memory update. This is an interesting approach for a few reasons. First, it results in more stable and robust updates: in other models the ''closest'' attractor may be selected incorrectlyly due to noise. Second, it enables a direct analogy to an existing neural model of reconsolidation [19] since there the whole synaptic matrix is adjusted, not simply a chosen attractor. Moreover with global updates our memory can demonstrate phenomena analogous to the gang effect [36]. While we have taken a global update approach our model retains the property in which the retrieved attractor (the attractor closest to the current input) is most affected.

Kernel Based Algorithms and Memories
The memory system introduced here takes advantage of developments introduced in Support Vector Machine (SVM) [37], Least-Square SVM [38] and Support Vector Clustering [39], where kernel functions enable data handling in higher feature spaces for richer boundaries, yet do so efficiently and cheaply. Our support-vector-like memory system incorporates the realistic property of flexible attractors with high dimensional feature spaces, while being tractable and implementable by neural units.
Zhang et al. [40] introduced a feedforward network with particular kernels in the form of wavelet and Radial Basis Functions (RBF) that were fit to perform a face recognition task efficiently. The kernel heteroassociative memories were organized into the modular network. Our architecture can be recurrent, which is more powerful than the feedforward method, can handle discrete and analog inputs, and the kernels we use can change online adding increased flexibility and capability.
Caputo [41] explored analogies from associative memory to ''kernel spin glass'' and demonstrated an attractor network, loading bipolar inputs and using generalized Hebbian learning to load non-flexible memories with greater memory capacity than the Hopfield network. In this work, a kernel algorithm generalized the Hebbian rule and the energy function of Hopfield networks, while capacity estimations generalized Amit's approach [1]. This method built in the free energy function in addition to the Hamiltonian. Our system, by comparison, allows for both binary and continuous inputs, is far more flexible in that the kernels adapt themselves over time and that attractors and features can be added and removed. Further, our system is more practical in that it has the added capability to cluster data. Support vector memory by Casali et al. [42] utilized support vectors to find the optimal symmetric Hopfield-like matrix for a set of binary input vectors. Their approach is very different from ours despite the similar title, in that it considers only binary symmetric case and has bounded attraction space. Support-vector optimization is used to find optimal matrix W for given m|n matrix X of etalons. This matrix must satisfy relationship sign(WX)~sign(X). Support vectors are found to provide optimal margins of WX. Kernels are not used in this work and hence the name is somewhat confusing. Our kernel memory is far richer: the number of memory attractors is not bounded by input dimensionaccomplished by varying the input space; our encoding is more efficient, our memory can use discrete or analog space, one-shot learning, and overall is more flexible.
In support vector clustering [39], clusters are formed when a sphere in the -space spanned by the kernels is projected efficiently to input space. Here the clustering is a side effect of the created memories that are formed as separated fixed points in the -space, and where the Voronoi polyhedron is projected on a formation of clusters in the input space. Formation of memories is local, sharing this concept with the Localist Attractor Network [43] and the Reconsolidation Attractor Network [34].

Organization
This work will be presented as follows: At first, the model of kernel heteroassociative memory is introduced, followed by the special case of auto-associative memory where attractors emerge. A neural representation is layered, and robustness (attraction radius) is estimated. We then introduce a technique that allows adding and removing attractors to the existing kernel associative network, and follow by introducing another technique that adds or removes input dimensionality on line. We next show a procedure of consolidating data into representing attractors, and demonstrate clusters emerging on handwritten digits and conclude by introducing the functional level of reconsolidation in memory and applications to morphed faces.

Our Kernel Heteroassociative Memory Model
A general framework of heteroassociative memory can be defined on the input E x and output E y spaces, with dimensionalities n and p respectively, and with m pairs of vectors . . m to be stored. The input vectors in the space E x can be written as columns of matrix X (n|m) and associated vectors in the output space E y as columns of matrix Y (p|m). A projective operatorB B : E X ?E Y such thatB Bx i~yi can be written in a matrix formB B with and be solved as with '' z '' stands for the Moore-Penrose pseudoinverse of X [44]. If the columns of X are linearly independent, the pseudoinverse matrix can be calculated by Let us define matrixS S, (m|m), where the elementss s ij are the pairwise scalar products of the memorized vectors, that is s s ij~( x i ,x j ), or in matrix notation:

S S~X T X:
ThenB B can be written as:B We propose to formulate the pseudoinverse memory association (recall) by calculating for each input vector x[E x the output by: This is a ''one-pass'' non-iterative linear associative memory. It has the property that if two input samples are close to each other, then the two outputs will be close to each other as well: In order to overcome the common dependence of memory capacity on input dimension, we transform the input space E X to a new input space E which we call feature space, whose dimensionality can be far greater than the dimension of E X , n (it could even be an infinite-dimensional Hilbert space). The transformation : E X ?E is considered to be transferring from input to feature space.
The respective associative memory algorithm can now be defined as follows: B~Y½ (X) z ð7Þ Analogously writing S as the memory loads by: and the association (recall) procedure is calculated by: Remark 1 Linear independence of the vectors (x i ) in thespace is required in order to use identity (3) for the Moore-Penrose pseudoinverse (see [44]). It is achieved as we will see below by using piece-wise Mercer kernels, and does not limit the number of attractors. This identity is used to bring equation (7) to the form of (8) and to introduce S. We note that during both loading (10) and recall (11) procedures, the function appears in the pair ( (x i ), (x j )). We can thus define a Kernel function over E x |E x and gain computational advantage.
Let us denote a scalar product in the feature space E by K(u,v)~( (u), (v)). This is a symmetric, real-valued, and nonnegative-definite function over E x |E x called a kernel [37]. We now can write S and z using the Kernel K: The value of the Kernel function is scalar. Thus even if was a function of high dimension the calculation of the multiplication is a scalar and thus fast to calculate. Mercer kernels as used in Support Vector Machines [37] are not sufficient for creating the associative memory we introduce, since our memories also require that all attractors are linearly independent in the feature space. To enable such independence we define the piece-wise Mercer kernels and in Section ''Piecewise Mercer Kernels'' of Materials and Methods (MM) we prove that they can always be found and always lead to independence.
As opposed to Hebbian learning that requires O(n 2 m) multiplications, we need O(m 3 znm 2 ) arithmetic operations over real scalars. The loading algorithm is displayed in Fig. 1. The memory is proven below to associate loaded pairs correctly and to associate close by values otherwise (see Materials and Methods (MM)).
2.1.2 Memory Independent on Input Dimension. The kernel heteroassociative memory has no a priori bound on capacity in the following sense: for any given learning sample there exists a kernel such that the memory with this kernel will provide the desired association.
To specify this we formulate the following theorem, which is proven in MM Section ''Correctness of Association'': Theorem 1 For any memory size m, let (x i ,y i )[E x |E y , i~1 . . . m be a learning sample consisting of m input-output pairs. Then there exists a piece-wise Mercer kernel K such that the associative memory that has this kernel and governed by equations (9)-(11) assigns x i to y i for all i~1 . . . m.
Remark 2 For the correct association, the memories have to be linearly independent in the feature space. As we have shown here this does not pose a memory limit, because for any given learning sample we can find a (piece-wise Mercer) kernel that guarantees such independence.

The Kernel Autoassociative Memory Model
We next focus on the special case where E x~Ey , and the stored vectors x m~ym :j m , m~1 . . . m. Here the loading algorithm is the same as in Fig. 1, and recall is facilitated by the iterative form: The activation function f (x) is applied by coordinates and constitutes a bounded monotonically increasing real-valued function over R such that lim x?{?
f (x)~a, and lim x?z?
The scheme of kernel auto-associative memory working in recall mode is shown in Fig. 2. We prove (in lemma 4 in MM Section ''Proving Convergence of the Autoassociative Recall Algorithm'') that the recall procedure always converges and that the attractors are either fixed points or 2-limit cycles. Joining all operations to a single equation we get: Here byŝ s m,n we denote the elements of S {1 . In coordinate form this equation is: The pseudocode of the associative recall is shown in the Fig. 3. As will be shown in the next section, the double nonlinearity of the recall dynamics does not reduce the biological plausibility since the kernel memory can be designed as a layered neural network with only one nonlinear operation per neuron.
In Materials and Methods, Section ''Example of the associative recall'', we provide an explicit example of kernel autoassociative memory with E x~R 3 and E~R 6 . We demonstrate there how a set of five vectors is memorized and how the iterative recall works.

Kernel Associative Memory as a Neural Network
The autoassociative kernel memory can be directly implemented in a recurrent layered neural network (Fig. 4a): The network has n inputs. The first layer has m neurons that perform kernel calculations; the i-th neuron computes z i~K (x,x i ). In the special case where the kernel is a radial-basis function K(u,v)~R( u{v k k) these neurons are the common RBF neurons [23]. The second layer has m neurons, its weight matrix is S {1 . The neurons of the second layer can be either linear or have a generalized sigmoid activation function.
The third layer also has n neurons, its weight matrix is X T . Its activation function can be linear, generalized sigmoid or the even more general sigmoid from Equation (15) above. The network has ''one-to-one'' feedback connections from the last layer to the inputs. In recall mode it works in discrete time, like Hopfield networks.
Theorem 2 Suppose that the kernel associative memory has a generalized sigmoid activation function in the second layer. Then the attractors emerging by the iterative recall procedure are either fixed points or 2-cycles.
Proof appears in Material and Methods Section ''Proving Convergence of the Autoassociative Recall Algorithm''.

The Attraction Radius.
A key question for any neural network or learning machine is how robust it is in the presence of noise. In attractor networks, the stability of the associative retrieval and the robustness to noise can be measured by the attraction radius.
Definition 2 Suppose the input to an attractor network belongs to the metric space with distance r. For an attractor j m of the network let R m be the largest positive real number such that if r(j m ,x)vR m the dynamics of the associative recall with starting point x will converge to j m . The value R~min m R m is called the attraction radius of the network (AR).
When inputs and memorized patterns belong to a normed vector space, if the additive noise does not exceed the attraction radius in this norm then all memories will be retrieved correctly during the associative recall.
The attraction radius can be estimated in the following special case: Theorem 3 Suppose that a kernel associative memory has identity activation function in the output layer and a generalized sigmoid f in the hidden layer. Suppose also the kernel K satisfies the global Lipschitz condition on a certain domain D5R n , i.e., there exists L such that Vx 1 ,x 2 , y[EK(x 1 ,y){K(x 2 ,y)DƒLEx 1 {x 2 E. Then the stored patterns are attractors, and the memory attraction radius is where c is a constant that depends only on f .
The proof of this theorem is given in Materials and Methods Sec. Proof of Theorem 3.
We further made a series of experiments of direct measurement of attraction radius for a dataset of 30|30 gray-scale face images. Results of this experiment are represented in Fig. 5. Remark 3 We have also proven that under the conditions of Theorem 3 all stored patterns are attractors. Yet, the memory was not proven to be free from spurious equilibria. However, spurious attractors were never observed in numerical experiments. The typical situation is that all the input domain is divided into attraction basins of the memorized vectors. The basins look like Voronoi polyhedra as depicted in Fig. 6. 2.3.2 Bounded Synaptic Weight Range. There is a connection between a bound on the values of the synapses (see [45]) and the kernel function defining the network.
Proposition 1 Let the kernel memory have input data such that every two inputs Ex i {x j E §d, Vi,j~1 . . . m, i=j and the piece-wise Mercer kernel K with this d and certain m. Then the synapses defined by matrix S {1 are bounded and the following bound holds: Proof. From the proof of Lemma 3.2 in Material and Methods we know that S could be written as S~S 0 zmI, where S 0 §0 is a positive semidefinite matrix. By linear algebra we have The kernel associative memory works as a symmetric network in an abstract feature space is used only in implicit way. Any implementation of the kernel associative memory with neural computational units requires a recurrent layered structure (Fig. 4). We can maximize the network capacity by using the approximation S&I. This approximation is suitable if the stored patterns are sufficiently distant in the kernel view, see Remark 5. With this approximation one can save m 3 connections without significant loss of association quality by eliminating the middle layer in Fig 4a and the other two layers will have weight matrices X and X T , identical with respect to transposition; see Fig. 4b. So, to store m vectors of R n we would need m|n real numbers only (lossless coding).
The definition of memory capacity and connections/neurons ratio now leads to max(m,n) min(m,n) §1:

Remark 4
The approximation S&I is suitable if the stored patterns are almost orthogonal in the feature space. For localized kernels (e.g. RBF) this means that the patterns are distant enough from each other (comparing to the characteristic scale of the kernel). Because the condition of orthogonality is applied in the feature space, not in the input space, this condition does not imply any relative size of m versus n: With this optimization the kernel-memory network can be made arbitrarily sparse by choosing a sparse kernel, i.e., a kernel that explicitly depends only on a (small) portion of the coordinates of its argument vectors. The non-zero weights will correspond to the inputs that the sparse kernel depends on. The attractors will have a sparse structure analogous to the kernel as well. If our goal is to memorize arbitrary dense data we can still use the sparse network as long as encoding and decoding layers are externally added to it. To add a new attractor to the network we create a new neuron in the S matrix layer. The dimension of matrix S is increased from m to mz1. To do so we compute s i,mz1~K (x i ,x mz1 ) and we update the inverse S {1 efficiently using the linear-algebra identity [46]:

Flexibility of the Memory
where A~S m 0 0 s mz1,mz1 and    To receive S m{1 , the last column and last row from matrix AzB will be removed.
This results in the two algorithms of Fig. 7.
Remark 5 In the case where S is approximately a diagonal matrix, its inverse can be calculated by the approximation S~lIzeS 1 for small e, and S&lI, which does not change during updates.
Remark 6 The procedure Add-Attractor (see Fig. 7) is local in time. To store a new pattern in the memory we only have to know the new attractor and the current connection matrix S {1 .
In Fig. 6 we display an example of adding an attractor. 2.4.2 Flexibility in Input and Feature Spaces. External inputs may come with more or fewer features than previously, causing the input dimensionality to change with time. We propose a mechanism that enables the network to handle such heterogeneity of dimension with no need to relearn the previously learned inputs.
Assume that the current dimension in the input space consists of the ''initial dimension'' n and ''new'' q dimensions; denote this as x~x a ; x b ð Þ. We will allow the change of dimensionality by changing the kernel itself: from the kernel K n that considers the first n dimensions to kernel K nzq that depends on all dimensions.
The change of kernel requires the recalculation of S {1 . However, this need not require O(m 3 ) operations if we constrain to kernels that can be written in an additive form: where K int describes the interaction of n and q. An explicit kernel with this property is the polynomial kernel (see Section ''Variable Kernel and Dimensionality''). Algorithms for dimensionality control appear in Fig. 8. An example is given in Example 3 in the next section. We also prove Lemma 4 in Materials and Methods Section ''Variable Kernel and Dimensionality'', stating that a small alteration to the kernels enables changing input dimensionality without loosing previously learnt attractors.

Memory Consolidation and Clustering
The memory system with its loading algorithm enables consolidation of inputs into clusters using the competitive learning method. Suppose we have a learning sample of N vectors x 1 , . . . x N and m clusters have to be created. Random vectors initiate the m attractors. When a new input is provided, the recall procedure is performed and the attractor of convergence x Ã k is updated by Parameter sequence a n is selected in order to provide better convergence of attractors: for instance, we can take a n such that lim n?? a n~0 but P n a n ??. This step is repeated until all attractors stabilize.
We tested the consolidation algorithm using the MNIST database of handwritten digits [47]. The data consists of ten different classes of grayscale images (from '0' to '9', each of 28|28 pixels in size) together with their class labels.
Experiment 1: Clustering with the Kernel Memory. The goal of this experiment is to demonstrate performance of memory clustering. For this purpose memory was trained on the learning sample in order to form attractors. Then attractors were assigned to classes, and classification rate was measured on an independent test sample.
For the MNIST data we used principal-component (PC) preprocessing. We took the first d~200 PCs which contain 96.77% of the variance. The learning sample contained 10000 digits, 1000 from each class. The kernel was chosen in the form: where R 2~P d k~1 w k , and a is a parameter. This is a Gaussian kernel depending on weighted metric. Weights were chosen as: We also tried the formula: where E l and STD l are expectation and standard deviation over the l-th class, and Q is the quantity of classes. However formula (19) gave better results. Because of the complexity of the MNIST data, we chose to have multiple clusters per class. Table 1 summarizes the classification rates for different amounts of attractors in the memory. The classification is slightly superior to other unsupervised clustering techniques (even that the goal of the memory system is not directly in clustering). The number of memory attractors required for good clustering is also smaller than other techniques, e.g. [48]. Figure 9.a) provides an example of typical memory attractors of each class.
We also made a series of experiments with w k~s k R ; R~ffi ffiffiffiffiffiffiffiffiffiffiffi ffi X s 2 k q , where s k is an STD of k-th principal component. This weighting does not depend on class labels in any way. We can see (last row of table 1) that results are poor for small number of attractors per class, but for higher number of attractors classification rate is even better. Experiment 2. Clustering under changing input dimensionality. This experiment demonstrates clustering while input dimensionality increases and the kernel is being changed. For this purpose, the resolution of the original images was reduced twice, to 14|14 (Fig. 9). Then the images were passed through a linear transformation in order to use the kernel (19). The memory was trained on 10,000 such digit images, forming 100 attractors. The recognition quality obtained was 76.4%. Then the kernel was extended in order to work with the original size (28|28), and another 10,000 digits were added, now in full-size. This second session of learning started from the previous set of attractors, without retraining. The final classification rate was enhanced to 85.4%. Experiment 3. Explicit example of adding input dimension. Consider the R 2 data where points lie on two Archimedes' spirals: ), an interaction term is not necessary in this example. We took s 1~0 :25 and s 2~0 :2. At first, the network was loaded with the 40 data points in R 2 . Each point was labeled and a classification was executed. The recognition quality on an independent test sample was 86:4%. Then the training was continued with the additional 40 inputs in R 3 and the final classification rate increased to 97:5%.

Synaptic Plasticity and Memory Reconsolidation
Reconsolidation is a storage process distinct from the one time loading by consolidation. It serves to maintain, strengthen and modify existing memories shortly after their retrieval [49]. Being a key process in learning and adaptive knowledge, problems in reconsolidation have been implicated in disorders such as Post Traumatic Stress disorder (PTSD), Obsessive Compulsive disorder (OCD), and even addiction. Part of the recent growing interest in the reconsolidation process is the hope that controlling it may assist in psychiatric disorders such as PTSD [50] or in permanent extinction of fears [51].

Current Model of Reconsolidation in Hopfield
Networks. A model of reconsolidation was introduced in [19]. It contains a learning mechanism that involves novelty-facilitated modifications, accentuating synaptic changes proportionally to the difference between network input and stored memories. The formula updating the weight matrix C is based on the Hebbian rule: Here t is the time of the reconsolidation process, w m is a weight parameter defining learning rate, x t is the current input stimulus, H is a Hamming distance from x t to the set of network's attractors, and g is the sensitivity to the novelty of stimuli. This formula differs from the original Hebbian rule by having both weight decay and Hamming-distance terms affecting the learning.
The model predicts that memory representations should be sensitive to learning order.
2.6.2 Our Reconsolidation Algorithm. In the case of Hebbian learning, the network's synaptic matrix is composed of a linear space. In our kernel associative memory, on the other hand, the corresponding space is no longer linear but rather is a Riemannian manifold, see Materials and Methods Section 3.7. Additions and multiplications by a scalar are not defined in this space and thus formula (23) cannot no longer be applied.
To remedy the situation we define a Riemannian distance (see Material and Methods Section 3.7) and a geodesics which enables the memory to change gradually as new but close stimuli arrive [52]: a point on a geodesic between x 1 and x 2 that divides the path in ratio a=(1{a) is a generalization of the convex combination ax 1 z(1{a)x 2 . Suppose, initially we have a memory X 0 that contains m attractors x 0,1 ,x 0,2 . . . x 0,m . Then we obtain X 1 by replacing one attractor by a new stimulus: x 0,1 ?x 1,1 . The distance between X 0 and X 1 can be thought of as a measure of ''surprise'' that the memory experience when meets new stimuli. To track the changes, the memory moves slightly on the manifold from X 0 to X 1 . See algorithm in Fig. 10.
2.6.3 Numerical Experiments. We exemplify the power enabled to us by the reconsolidation with the following experiments. Experiment 4. Morphed faces. The goal of this experiment is both to show the performance of the reconsolidation process we describe on large-scale data and to compare its properties with the recent psychological study [35].
Morphed faces were created by Joshua Goh at the University of Illinois. The faces were obtained from the Productive Aging Lab Face Database [53] and other volunteers (All the face images used in our research were taken from the Productive Aging Lab's Face   rectangle 320|240, images were cropped to this size before entering to the network. The database contains 150 morph sequences, each of them consists of 100 images.
In our simulations we created a network with 16 attractors representing 16 different faces; it had 76800 input and output neurons, and two middle layers of 16 neurons each. Four arbitrarily selected network's attractors are depicted in Fig. 11. A Gaussian kernel was chosen in order to simplify calculations with large scale data.
Attractors were initialized with first images from 16 arbitrarily selected morph sequences. When the learning order followed image order in the morphing sequence, attractors changed gradually and consistently. The ability to recognize the initial set of images gradually decreased when attractors tended to the final set. In case of random learning order attractors quickly became senseless, and the network was not able to distinguish faces. This experiment generalizes the result shown in [19] but is done on real images demonstrating the efficiency of the reconsolidation process in kernel memories for high dimension and multi-scale data. In accordance with [35], the formation of ''good'' dynamic attractors occurred only when morphed faces were presented in order of increasing distance from the source image. Also, as shared also with [34,19]: the magnitude of the synaptic update due to exposure to a stimulus depends not only on the current stimulus (as in Hebbian learning) but also on the previous experience, captured by the existing memory representation. Experiment 5. Tracking Head Movement. This example focuses on rotating head images for reconsolidation based on the VidTIMIT dataset [54], and it demonstrates our algorithm on a more applied example of faces and computer vision. The VidTIMIT dataset is comprised of video and corresponding audio recordings of 43 people. The recordings include head rotation sequences. The recording was done in an office environment using a broadcast quality digital video camera. The video of each person is stored as a numbered sequence of JPEG images with a resolution of 512|384 pixels.
The ability to track and recognize faces was tested on the sets of 15 last frames from each sequence. Example of attractors during the reconsolidation in this experiments is depicted in the Fig. 12. With reconsolidation and ordered stimuli the obtained recognition rate was 95.2%. If inputs were shuffled randomly, attractors got messy after 30-50 updates, and the network did not demonstrate significant recognition ability. Experiment 6. Tracking The Patriot Missiles. The following experiment takes the reconsolidation model into a practical technology that follows trajectories in real time in complex dynamic environments.
We analyzed videos of Patriot missile launches with resolution 320|240, originally in RGB color, and transformed them to grayscale. The memory was loaded with vector composed of two 40|40-pixel regions (windows) around the missile taken from two consequent frames and a two-dimensional shift vector indicating how the missile center has moved between these frames. Optimal number of attractors was found to be 16-20. Using memory reconsolidation algorithm we were able to calculate velocity vector every time, and therefore track the missile with great precision, with only average error of 5.2 pixels (see Fig. 13).

Conclusions
We have proposed a novel framework of kernel associative memory as an attractor neural network with a high degree of flexibility. It has no explicit limitation, either on the number of stored concepts or on the underlying metric for association, since the metric is based on kernel functions. Kernels can be slightly changed as needed during memory loading without damaging existing memories. Also due to the kernel properties, the input dimension does not have to be fixed. Unlike most other associative memories our model can both store real -valued patterns and allow for analogous attractor-based dynamic associative recall.
We endowed our memory with a set of algorithms that insure flexibility, enabling it to add/delete attractors as well as features (dimensions) without need to retrain the network. Current implementation of our memory is based on a simple competitive clustering algorithm and consolidates memories in a spirit similar to the localist attractor networks [43]. We have experimentally tested the memory algorithms on the MNIST database of handwritten digits, a common benchmark for learning machines. The obtained clustering rate for this database (91.2%) is slightly better than the best known result for unsupervised learning   algorithms on this benchmark. The model further allows the process of reconsolidation after memory is stored when retrieval by similar patterns is activated. We demonstrated the properties of reconsolidation on gray scale large image faces in morphing experiments. Based on the theoretical and experimental research made in the present paper we conclude that the proposed kernel associative memory is promising both as a biological model and a computational method.

Piece-wise Mercer Kernels
The classical Kernels K(x,y) introduced by Vapnik to the field of Machine Learning had the Mercer condition. That is, for all square integrable functions g(x) the kernel satisfied: The Mercer theorem states that if K satisfies the Mercer condition there exists a Hilbert space H with basis e 1 ,e 2 , . . . e n . . . and a function : and all a n w0. That is, K is a scalar product of (u) and (v) General Mercer kernels are not sufficient for creating the associative memory since our kernel memories require that all attractors are linearly independent in the feature space, and some Mercer kernels do not assure it, such as the basic scalar-product kernel K(u,v)~vu,vw. As shown in Lemma 2, linear independence of attractors in the feature space is needed to provide correct association in our system. The piece-wise Mercer kernels as we define next have this desired property. Definition 3 A kernel K(u,v) is said to satisfy the piece-wise Mercer condition if there exist dw0, mw0, and a Mercer kernel K 0 (u,v) such that: and K'(u,v)ƒK(u,v), u{v k kƒd: ð27Þ The piece-wise Mercer kernel is an extension of strictly positive definite (SPD) kernels. These kernels have some internal property of regularization. For usual Mercer kernels, e.g. common scalar product there are still situations when Gram matrix of certain vector set in the feature space will be degenerate. In contrast to this, the m{d piece-wise Mercer kernel can guarantee that for any finite set of vectors their Gramian will be full-rank and even greater than mI.
Lemma 1 If K is a piece-wise Mercer Kernel, it also satisfies the Mercer condition.
Proof. Indeed, for all square integrable functions g(x) : E x ?R, the Mercer condition and inequality (26)(27) give: and the Mercer condition for K is fulfilled. Remark 7 There is following relation between our definition of piece-wise Mercer kernel and standard notion of strictly positive definite (SPD) kernel (which is formulated by replacing '' §'' with ''w'' in (24): any piece-wise Mercer kernel is also SPD and for any continuous SPD kernel K there exist certain m and d such that K is m{d piece-wise Mercer. Example 1 We demonstrate how to build a piece-wise Mercer kernel that would fit a given sample to be loaded in memory. Consider Gaussian Kernel K(u,v)~exp({ (u{v) 2 2s 2 ):R( u{v k k). We can construct the kernel: The kernel K 0 is constructed in that way that it is continuous and convex as a function of u{v k k for any m and d such that 1{m{R(d)w0. According to the Polya criterion (see, [37], sec. 10.8.4) we get that K 0 fulfills the Mercer condition. This shows that Gaussian kernel K is a piecewise Mercer kernel. Proof. Since K is a Mercer kernel according to lemma 1, statement 1 is true by the Mercer theorem.

Correctness of Association
To prove 2) note that by definition 1 there exists a kernel K 0 that satisfies the inequality (26).   ) that shows how to construct piece-wise Gaussian kernels. Then, the conditions of Lemma 2 are fulfilled by kernel K and input set x i . Therefore, S, is invertible, (x i ) are independent in the feature space, and association (9) -(11) is well defined.

Proving Convergence of the Autoassociative Recall Algorithm
Lemma 4 Suppose we have an autoassociative memory with kernel K, stored vectors x 1 …x m forming columns of matrix X, and a matrix S with elements: The dynamical system corresponding to the associative recall is: Suppose kernel K(u,v) is continuous, and it satisfies the piece-wise Mercer condition for a certain dw0 such that the stored vectors x 1 x 2 … x m satisfy x i {x j wd for i=j. Then attractors of the dynamical system (29) are either fixed points or 2-cycles.

Proof:
We will prove that lim( t?? x tz1 {x t{1 )~0. For this we construct an energy function in a way that is analogous to Hopfield-like networks: Step 1. Show that the energy has lower bound. Because f is bounded, the closure of the co domain of f in (29) is a certain compact set Q. So, x t will remain in Q for all t §1. The energy (30) is bounded over Q|Q as a continuous function over a compact set.
Step 2. Show that there exists a projective selfconjugated operator C : E ?E such that C (x i ) = (x i ), i = 1…m. By theorem 1 there exists a feature space H:E such that the kernel gives a scalar product in this space.
For every finite set of vectors (x i ),i~1 . . . m in H we can construct a projective operator C that projects to the subspace spanned with (x i ). Indeed, applying Gram-Schmidt orthogonalization to (x i ) (see, e.g. [55]) we can build an orthonormal set (basis) of vectors x i ,i~1 . . . m.
Then define an operator as follows: This C is projective, and it projects to the finite-dimensional subspace vj i w~v (x i )w. (Here by vj i w we denote a subspace spanned with all x i ,i~1 . . . m).
Step 3. Show that energy decreases monotonically every 2 steps.
Applying properties of the scalar product in E x and symmetry of C we get: Because x tz1~f (y t ), x tz1 is closer to y t than any other point on the trajectory, and the kernel is monotonic with respect to distance between x and y, the expression (31) will be non-negative. Moreover, is E t {E tz1 §0; it is zero if and only if a fixed point is reached.
Step 4. Show that the total amount of energy decrease is finite if and only if sequences x 2k and x 2kz1 converge. Suppose the energy lower bound is {E Ã .
So, there exists mw0 such that Then The sum on the left hand of this equation will be finite if and only if sequences x 2k and x 2kz1 converge, and we are done. Remark 8 By Theorem 3 and Remark 2 we have proven that (all) stored patterns are attractors. This means that E t~c onst if x equals one of stored vectors.
We next prove Theorem 2.
Proof. The proof is analogous to the proof of lemma 4. Note that if the k-th attractor is reached z k z i~0 , i=k. Activation function in the form of generalized sigmoid brings z closer to an attractor. So, where in the proof of lemma 4 the linear activation function is replaced with a generalized sigmoid, convergence to an attractor can only be fasted reached.

Variable Kernel and Dimensionality
For polynomial kernel of degree d we have: K(x,y)~1 3 ((1zvx a ,y a w) d z(1zvx b ,y b w) d z z((1zvx a ,y a w)(1zvx b ,y b w)) d' ), For such decomposed kernels, the matrix S {1 can be efficiently (O(m 2 )) updated using formula (16). Lemma 5 If the kernel K is a linear combination of basis functions, there exists an additive kernel K 1 having the same feature space E (x) .
Having the same feature space is important because it leads to identical behavior of two kernel memories with these to kernels. We note that as before, if S is an approximately diagonal matrix, the inverse S {1 can be estimated efficiently. A diagonal matrix appears for example in the Radial-Basis-Function-like kernels. In this situation the approximations are given by S~lIzgS 1 .
3.5.1 Memory Stability with Changing Kernels. Suppose all vectors submitted for learning and recall belong to a certain compact set Q. Let us define the norm for kernels: Proposition 2 For given space E nzq and a compact set Q in it there exists a constant Mw0 such that if K nzq {K n ƒM K n k k for any set of vectors x 1 …x m stored in the memory with K n these vectors remain in attraction basins of corresponding attractors for the memory with kernel K nzq whose attractors expand x 1 . . . x m to dimension nzq.
Proof. The proof directly follows from the norm definition and direct estimation of attractor shift when the kernel is changed. iterations of the recall will make f converge to f 0 with exponential velocity, that immediately implies convergence of x t to x 0 , and we are done.

Reconsolidation and Riemannian Distance
3.7.1 Riemannian Metric. Riemannian manifold is a smooth, at least twice differentiable manifold (see [52,56], or any textbook on Riemannian Geometry), which has a scalar product at every point. The scalar product, or Riemannian Metric, is defined in a tangent space of a point as a positive quadratic form.
Riemannian metrics enables to measure curve length and introduce geodesics: trajectories of a free particle attached to the manifold. Between every two points there exist at least one geodesic that have minimal length among all curves joining these two points. Length of this geodesic gives Riemannian distance over the manifold.
There is following Riemannian distance between two kernel associative memories: Here X and Y are two memories with the same m,n, and K. They have S-matrices S and T respectively and Q is the ''cross-matrix'': q ij~K (x i ,y j ), i,j~1 . . . m. 3.7.2 Geodesic Update. To find a point analogous to convex combination of X 0 and X 1 we build a geodesic c X1 X0 joining these to points, and take a point X Ã 1~c X1 X0 (t). Here t[½0,1 is a step parameter related to size of a shift during each update. For t~0 we stay at X 0 , for t = 1 the point is changed to X 1 . Repeatedly, we track from X 1 to X 2 when a stimulus x 1,2 appears, etc.
The algorithm of memory update using geodesics uses the property of kernel memory that an arbitrary attractor can be added or removed to the network in O(m 2 zn) operations with no impact to all other attractors.