Anisotropic Gaussian kernel adaptive filtering by Lie-group dictionary learning

The present paper proposes a novel kernel adaptive filtering algorithm in which each Gaussian kernel is parameterized by a center vector and a symmetric positive definite (SPD) precision matrix, regarded as a generalization of the scalar width parameter. In contrast to conventional kernel adaptive systems, the proposed filter is structured as a superposition of non-isotropic Gaussian kernels, whose non-isotropy makes the filter more flexible: the adaptation algorithm searches for optimal parameters in a wider parameter space. This generalization requires special treatment of parameters that possess a geometric structure. Indeed, the main contribution of this paper is to establish update rules for precision matrices on the Lie group of SPD matrices in order to ensure their symmetry and positive definiteness. The parameters of the filter are adapted on the basis of a least-squares criterion, which minimizes the filtering error, together with an ℓ1-type regularization criterion that avoids overfitting and prevents growth of the dictionary. Experimental results confirm the validity of the proposed method.


Introduction
Adaptive filtering is a technique for updating the parameters of a signal/data processing structure [1]. In this paper, we deal with kernel-based adaptive filtering. A kernel adaptive filter is a kind of nonlinear filter that exploits a kernel method, a technique to construct effective nonlinear systems [2]. Kernel adaptive filters have found widespread application in diverse fields, ranging from stock market prediction [3] to acoustic echo cancellation [4] and visual object tracking [5].
In kernel adaptive filtering, most kernels take the following Gaussian form [6]:
$$\kappa(\cdot\,; c, \gamma) := \exp\!\left(-\gamma\, \|\cdot - c\|^2\right), \qquad (1)$$
where the parameters $c \in \mathbb{R}^L$ and $\gamma > 0$ represent the center and the width of a Gaussian kernel, respectively. In other words, this kind of kernel has only two parameters, namely a center (mean) and a scalar width (also referred to as a scalar precision).

Related work
Several instances of nonlinear adaptive filtering have been reported in the scientific literature. Among them, kernel adaptive filtering developed in a reproducing kernel Hilbert space (RKHS) is known as an efficient online nonlinear approximation approach [7,8]. Well-known kernel adaptive filtering algorithms are kernel least mean square (KLMS) [9-12], kernel normalized least mean square (KNLMS), kernel affine projection algorithms (KAPA) [13,14], and kernel recursive least squares (KRLS) [15]. In this context, it is worth citing fractional adaptive signal processing [16-20], as these modern filtering algorithms outperform their counterparts in terms of accuracy and convergence, for example in active noise control systems. A distinguishing feature of kernel-based adaptive filtering is the ability to adjust the values of the parameters of each kernel so as to minimize the filtering error. Regarding kernel centers, research endeavors have suggested moving all the center vectors in the dictionary to minimize the squared filtering error [21-23]. Regarding kernel widths, it is known that the widths of the kernels are important parameters that contribute to improving the performance of kernel machines [24-28], and some attempts to adaptively estimate the widths of the kernels have been reported [27,28]. Moreover, in a recent work [6], Wada et al. proposed a method to adaptively update the Gaussian center and width concurrently. The above-mentioned papers address the problem of estimating the precision parameter of the Gaussian model in (1), which, however, is a special case of the multivariate Gaussian kernel function.
The structure of most kernel adaptive filtering algorithms grows linearly with each new input sample. A solution to cope with this problem is to build a dictionary. Well-known criteria for dictionary learning are novelty [29], approximate linear dependency (ALD) [15], surprise [30], and the coherence-based criterion [31]. Another known criterion is ℓ1-regularization [32,33], which sets some coefficients to zero and discards the corresponding entries. In this case the model changes dynamically, in that new members may be added to the dictionary and old members may be removed from it.

Innovative contribution
It should be noted that a kernel of the form (1) implicitly assumes uncorrelatedness between the components of the sample vector, which makes the kernel isotropic. However, observed samples usually present some sort of mutual correlation [34,35].
In this paper, we employ a generalized Gaussian kernel defined as
$$\kappa(\cdot\,; c, \Gamma) := \exp\!\left(-(\cdot - c)^\top \Gamma\, (\cdot - c)\right), \qquad (2)$$
where $\Gamma \in \mathbb{R}^{L \times L}$ is a symmetric positive definite (SPD) matrix. We refer to Γ as a precision matrix; it is subject to no constraint other than symmetry and positive definiteness, while the model (1) can be regarded as the special case Γ = γI. In other words, in (1) the precision matrix is only allowed to be isotropic. Unlike (1), the general form (2) has more degrees of freedom and is therefore more flexible in modeling sample distributions; however, an adaptive method for finding the precision matrix is not straightforward. We will establish a dictionary learning method for generalized Gaussian kernel adaptive filtering, in which each dictionary entry consists of a pair formed by a center vector and a precision matrix. The main contributions of the proposed method are (a) a filter model consisting of kernels, each with its own precision matrix, (b) an update rule for each center vector, and (c) a learning rule for each precision matrix on the SPD manifold that ensures symmetry and positive definiteness during adaptation.
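To make the kernel (2) concrete, the following Python sketch (ours, assuming NumPy; the function name is illustrative) evaluates a generalized Gaussian kernel with an anisotropic precision matrix:

```python
import numpy as np

def generalized_gaussian_kernel(x, c, Gamma):
    """Evaluate kappa(x; c, Gamma) = exp(-(x - c)^T Gamma (x - c)), cf. (2)."""
    d = x - c
    return np.exp(-d @ Gamma @ d)

# A 2-D anisotropic example: the off-diagonal entry of Gamma encodes
# correlation between the input components, which the isotropic
# kernel (1) (Gamma = gamma * I) cannot represent.
c = np.array([0.5, -0.2])
Gamma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])   # symmetric with positive eigenvalues, hence SPD
print(generalized_gaussian_kernel(np.array([0.7, 0.1]), c, Gamma))
```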

Organization and list of abbreviations
Section 2 presents general concepts in kernel adaptive filtering. Section 3 proposes a dictionary learning method for the generalized Gaussian kernel adaptive filtering. Section 4 shows the results of numerical experiments to evaluate the efficacy of the proposed method. Section 5 concludes the paper. A list of abbreviations used within this paper is presented in Table 1.

Kernel adaptive filters
Kernel adaptive filters possess noteworthy features [8], such as universal approximation ability, absence of local minima, and moderate complexity in terms of computational burden and memory. In this section, we first discuss the modeling of sample distributions in the context of kernel adaptive filtering and then briefly review kernel adaptation algorithms. In kernel adaptive filtering, an input sequence $u^{(n)} \in U \subseteq \mathbb{R}^L$ is mapped to a RKHS $(\mathcal{H}, \langle\cdot,\cdot\rangle)$ on $U$ induced by a positive definite kernel $\kappa(\cdot,\cdot) : U \times U \to \mathbb{R}$. Here, $U$ denotes a multidimensional input space, while $\langle\cdot,\cdot\rangle : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ denotes the inner product in the RKHS. A RKHS $\mathcal{H}$ can implicitly increase the dimensionality of the feature space, which enables us to represent non-linear signals generated by a non-linear system [36]. The scalar output of the filter at time $n$ is computed as
$$y^{(n)} := \langle P^{(n)}, \varphi(u^{(n)}) \rangle,$$
where $P^{(n)} \in \mathcal{H}$ denotes a filter weight vector at time $n$ and $\varphi : U \to \mathcal{H}$ denotes a nonlinear mapping. In general, the inner product in a high-dimensional space is not given in an explicit form. Rather, the inner product in a RKHS can be calculated by using the properties of the RKHS, namely: (i) all elements in a RKHS are constructed from a kernel κ(·, ·), (ii) it is convenient to choose φ(u) = κ(·, u), (iii) it holds that ⟨κ(·, u_i), κ(·, u_j)⟩ = κ(u_i, u_j) [2,31]. Fig 1 shows a schematic of the adaptive filter.
We consider the problem of adaptively estimating the weights $P^{(n)}$. It is known [31] that $P^{(n)}$ can be written as
$$P^{(n)} = \sum_{j \in \mathcal{J}^{(n)}} h_j^{(n)}\, \kappa(\cdot, c_j),$$
where the $h_j^{(n)} \in \mathbb{R}$ are scalar weight coefficients for κ(·, c_j). Here, $\{c_j\}_{j \in \mathcal{J}^{(n)}}$ is a set of input samples, termed the dictionary, and $\mathcal{J}^{(n)}$ denotes the index set of dictionary elements at time $n$.
Notice that, in the above equations, for the sake of conciseness the kernel functions have been written without reference to the kernel width parameter γ, which takes the same value in each kernel.
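As an illustration, the filter output under the expansion above reduces to a weighted sum of Gaussian bumps; a minimal sketch (ours, NumPy assumed):

```python
import numpy as np

def filter_output(u, centers, coeffs, gamma):
    """y^(n) = sum_j h_j * exp(-gamma * ||u - c_j||^2): the kernel expansion
    of the weights evaluated at the current input u, with a shared width gamma."""
    return sum(h_j * np.exp(-gamma * np.sum((u - c_j) ** 2))
               for h_j, c_j in zip(coeffs, centers))
```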
In kernel adaptive filtering algorithms, obsolete kernel functions cannot be discarded, which is a serious limitation, in particular in a non-stationary environment. In order to fix this issue, a dictionary may be constructed by means of ℓ1-regularization [32,33], which promotes sparsity and hence improves the efficiency of the dictionary. Let $d^{(n)} \in \mathbb{R}$ denote the desired output signal at time $n$. The cost function for the adaptive kernel algorithm is written as follows:
$$A^{(n)}(h) := |d^{(n)} - y^{(n)}|^2 + \mu\, \beta^{(n)}(h), \qquad \beta^{(n)}(h) := \sum_{j \in \mathcal{J}^{(n)}} w_j^{(n)} |h_j|, \qquad (7)$$
where $\beta^{(n)}$ and μ play the role of a weighted ℓ1 norm and of a regularization parameter, respectively. Here, the weights $\{w_j^{(n)}\}_{j \in \mathcal{J}^{(n)}}$ are dynamically adjusted as $w_j^{(n)} := 1/(|h_j^{(n)}| + \rho)$, with a small constant ρ > 0 that prevents the denominator from vanishing.
It is worth noticing that a conventional stochastic gradient descent method would be ineffective for seeking the minimum of the cost function (7), since the weighted ℓ1 norm is not smooth. However, since the cost function $A^{(n)}$ is convex, a forward-backward splitting scheme [37] may be applied, which reads:
$$h^{(n+1)} := \mathrm{prox}_{\lambda\mu\beta^{(n)}}\!\left(\tilde{h}^{(n)} + \lambda\, \frac{d^{(n)} - \tilde{k}^{(n)\top} \tilde{h}^{(n)}}{\sigma + \|\tilde{k}^{(n)}\|^2}\, \tilde{k}^{(n)}\right), \qquad (8)$$
where $\tilde{h}^{(n)} := [h^{(n)\top}, 0]^\top$, $\tilde{k}^{(n)} := [k^{(n)\top}, \kappa(u^{(n)}, u^{(n)})]^\top$, the coefficient λ > 0 denotes a step size, the coefficient σ denotes a stabilization parameter, and ‖·‖ denotes the standard vector 2-norm. The symbol 'prox' denotes the proximal operator [37], defined as follows: given a vector $q := [q_1, q_2, \ldots, q_r]^\top \in \mathbb{R}^r$, it holds that
$$(\mathrm{prox}_{\lambda\mu\beta^{(n)}}(q))_j := \mathrm{sgn}(q_j)\, \max\{|q_j| - \lambda\mu w_j^{(n)},\, 0\}. \qquad (9)$$
For further details on this technique, interested readers may consult [6,38,39]. This learning rule promotes the sparsity of the $h_j^{(n)}$'s, which results in some coefficients $h_j^{(n)}$ reaching zero and the corresponding center vectors $c_j$ being removed from the dictionary.
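A compact sketch of the soft-thresholding proximal operator (9) and of one forward-backward iteration of the form (8) is given below (ours; variable names follow the text, with k standing for the extended kernel vector k̃):

```python
import numpy as np

def prox_weighted_l1(q, lam, mu, w):
    """Proximal operator (9): entrywise sgn(q_j) * max(|q_j| - lam*mu*w_j, 0)."""
    return np.sign(q) * np.maximum(np.abs(q) - lam * mu * w, 0.0)

def fb_splitting_step(h, k, d, lam, mu, sigma, w):
    """One forward-backward step in the spirit of (8): a normalized gradient
    (forward) step on the squared error, then the proximal (backward) step
    that drives some coefficients exactly to zero."""
    h_ext = np.append(h, 0.0)                  # h~ = [h^T, 0]^T
    e = d - k @ h_ext                          # a-priori filtering error
    h_fwd = h_ext + lam * e * k / (sigma + k @ k)
    return prox_weighted_l1(h_fwd, lam, mu, w)
```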

Model and dictionary learning for generalized Gaussian kernel adaptive filtering
As recalled in the introduction, most kernel machines using Gaussian kernel functions implicitly assume uncorrelatedness among the sample variables, even though observed samples usually present correlation. In the following, a flexible filtering structure based on a superposition of generalized Gaussian functions is proposed, and algorithms for learning its parameters are established.

Adaptive kernel filter based on a superposition of generalized Gaussian functions
The proposed model is structured as a superposition of generalized Gaussian kernels of the form (2) with time-varying centers $c_j^{(n)}$ and precision matrices $\Gamma_j^{(n)}$. The output sequence of such a kernel adaptive filter is computed as
$$y^{(n)} = \sum_{j \in \mathcal{J}^{(n)}} h_j^{(n)}\, \exp\!\left(-(u^{(n)} - c_j^{(n)})^\top \Gamma_j^{(n)} (u^{(n)} - c_j^{(n)})\right). \qquad (10)$$
For the sake of completeness, let us discuss how a multikernel adaptive filter fits within the general theory of RKHS. An extended discussion on multikernel adaptive filtering may be found in [38,40].
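In code, the filter (10) differs from the standard expansion only in that each dictionary entry carries its own precision matrix; a sketch (ours):

```python
import numpy as np

def generalized_filter_output(u, dictionary, coeffs):
    """y^(n) as in (10): each dictionary entry is a pair (c_j, Gamma_j)."""
    y = 0.0
    for h_j, (c_j, Gamma_j) in zip(coeffs, dictionary):
        d = u - c_j
        y += h_j * np.exp(-d @ Gamma_j @ d)
    return y
```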
Let $\mathcal{H}_1$ and $\mathcal{H}_2$ denote two reproducing kernel Hilbert spaces and let $\mathcal{H} := \mathcal{H}_1 \oplus \mathcal{H}_2$ denote their direct sum. The norm of the direct sum of $f_1 \in \mathcal{H}_1$ and $f_2 \in \mathcal{H}_2$ is defined as
$$\|f\|_{\mathcal{H}}^2 := \min\left\{ \|f_1\|_{\mathcal{H}_1}^2 + \|f_2\|_{\mathcal{H}_2}^2 \,:\, f = f_1 + f_2,\ f_1 \in \mathcal{H}_1,\ f_2 \in \mathcal{H}_2 \right\}.$$
In particular, if the two Hilbert spaces are non-overlapping, namely $\mathcal{H}_1 \cap \mathcal{H}_2 = \{0\}$, the decomposition $f = f_1 + f_2$ is unique. Consequently, the norm in $\mathcal{H}$ may be defined as
$$\|f\|_{\mathcal{H}}^2 = \|f_1\|_{\mathcal{H}_1}^2 + \|f_2\|_{\mathcal{H}_2}^2.$$
Also, take a kernel $\kappa_1 \in \mathcal{H}_1$ and a kernel $\kappa_2 \in \mathcal{H}_2$. An element $f \in \mathcal{H}$ can be evaluated through the sum kernel $\kappa := \kappa_1 + \kappa_2$ [2]:
$$f(u) = \langle f, \kappa(\cdot, u) \rangle_{\mathcal{H}} = f_1(u) + f_2(u). \qquad (14)$$
The above construction may be generalized to an arbitrary number of Hilbert spaces without difficulty.
Assume now that $M$ different kernels $\{\kappa_m(\cdot,\cdot)\}_{m=1}^M$ are available. Denote by $\mathcal{H}_m$ the RKHS determined by the $m$-th kernel and define $\mathcal{H}$ as the corresponding sum space. In analogy with the simpler case (14), the output of the filter is obtained by combining a weight $P \in \mathcal{H}$ and the 'sum kernel' $\kappa \in \mathcal{H}$ as
$$y^{(n)} = \langle P, \kappa(\cdot, u^{(n)}) \rangle_{\mathcal{H}} = \sum_{m=1}^M \langle P_m, \kappa_m(\cdot, u^{(n)}) \rangle_{\mathcal{H}_m},$$
where each weight $P_m \in \mathcal{H}_m$ and $P$ is identified with the (direct) sum of the individual weights $P_m$. Since the index sets of the dictionaries in the individual RKHSs need not coincide [38], the filter structure (10) may be identified as a multikernel adaptive filter with time-varying weights
$$P^{(n)} = \sum_{j \in \mathcal{J}^{(n)}} h_j^{(n)}\, \kappa(\cdot\,; c_j^{(n)}, \Gamma_j^{(n)}),$$
with the convention that $P^{(n)}$ denotes the direct sum of the weights associated with the individual kernels.

Center vectors adaptation
In this subsection, a dictionary learning method for generalized Gaussian kernel adaptive filtering is proposed. To update the center vectors, we chose the loss function
$$F^{(n)}(\mathcal{D}^{(n)}) := |e^{(n)}|^2 = |d^{(n)} - y^{(n)}|^2. \qquad (17)$$
This criterion is a function of the dictionary elements, namely of the center vectors as well as of the precision matrices.
The adaptation of each center vector may be achieved by a steepest gradient descent algorithm:
$$c_j^{(n+1)} := c_j^{(n)} - \eta_c\, \frac{\partial F^{(n)}}{\partial c_j}, \qquad (18)$$
where $\eta_c > 0$ denotes a step size and
$$\frac{\partial F^{(n)}}{\partial c_j} = -4\, e^{(n)} h_j^{(n)}\, \kappa(u^{(n)}; c_j^{(n)}, \Gamma_j^{(n)})\, \Gamma_j^{(n)} (u^{(n)} - c_j^{(n)}).$$
Let us remark that adaptation rules to move center vectors for standard Gaussian kernel adaptive filters were also proposed in [21-23].
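A direct transcription of the center update (18), with the gradient written out above, reads as follows (a sketch, ours):

```python
import numpy as np

def update_center(c_j, Gamma_j, h_j, u, e, eta_c):
    """Steepest-descent step (18) on F^(n) = |e|^2; the minus sign of the
    gradient -4 e h_j kappa_j Gamma_j (u - c_j) turns into the '+' below."""
    d = u - c_j
    kappa_j = np.exp(-d @ Gamma_j @ d)
    return c_j + 4.0 * eta_c * e * h_j * kappa_j * (Gamma_j @ d)
```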

Precision matrices adaptation
In order to update the precision matrices, we consider two types of data-driven adaptation methods. One consists in applying the update rule for SPD matrices proposed in [41]. The other is a novel update rule in which an effective normalization is employed. Fig 2 illustrates these update rules in a schematic way. To update the precision matrices, the same loss function (17) may be invoked.

Matrix Exponentiated Gradient (MEG) adaptation.
To update the precision matrices in the dictionary $\{\Gamma_j\}_{j \in \mathcal{J}^{(n)}}$ while preserving their SPD structure, a matrix exponentiated gradient (MEG) update [41] may be applied. The update rule for $\Gamma_j$ is derived to minimize the loss function (17):
$$\Gamma_j^{(n+1)} := \exp\!\left(\log \Gamma_j^{(n)} - \eta_w\, \mathrm{sym}\!\left(\frac{\partial F^{(n)}}{\partial \Gamma_j}\right)\right), \qquad (20)$$
where $\eta_w > 0$ denotes a step size and
$$\frac{\partial F^{(n)}}{\partial \Gamma_j} = 2\, e^{(n)} h_j^{(n)}\, \kappa(u^{(n)}; c_j^{(n)}, \Gamma_j^{(n)})\, (u^{(n)} - c_j^{(n)})(u^{(n)} - c_j^{(n)})^\top.$$
For a square matrix $X$, $\mathrm{sym}(X) := \frac{1}{2}(X + X^\top)$ denotes the symmetric part of $X$, while $\exp(X)$ and $\log(X)$ denote the matrix exponential and the principal matrix logarithm, respectively [41].
It is interesting to observe that the adaptation rule (20) may be re-interpreted in the framework of manifold calculus. In fact, let us define a retraction [42] in the SPD space as $R_X(V) := \exp(\log X + V)$, where $X$ denotes any symmetric positive-definite matrix and $V$ any symmetric matrix of the same size. Also, let us define $\nabla_X F := \mathrm{sym}\!\left(\frac{\partial F}{\partial X}\right)$ as the Riemannian gradient of a loss function $F$ with respect to the SPD matrix $X$ [43]. Then, the adaptation rule (20) may be re-framed as
$$\Gamma_j^{(n+1)} = R_{\Gamma_j^{(n)}}\!\left(-\eta_w\, \nabla_{\Gamma_j^{(n)}} F^{(n)}\right).$$
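The MEG step (20) may be sketched as follows (ours; the eigendecomposition-based logarithm assumes an SPD input, and scipy.linalg.expm provides the matrix exponential):

```python
import numpy as np
from scipy.linalg import expm

def logm_spd(X):
    """Principal logarithm of an SPD matrix via eigendecomposition:
    log X = W diag(log lambda_i) W^T."""
    lam, W = np.linalg.eigh(X)
    return (W * np.log(lam)) @ W.T

def meg_update(Gamma_j, grad, eta_w):
    """MEG step (20): additive step in the log-domain, then exponential map
    back, so the iterate stays symmetric positive definite."""
    sym_grad = 0.5 * (grad + grad.T)
    return expm(logm_spd(Gamma_j) - eta_w * sym_grad)
```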

Normalized Matrix Exponentiated Gradient (NMEG) adaptation.
Even though the matrix exponentiated gradient updates each precision matrix Γ while preserving its SPD structure, the computation of log Γ can be unstable when the eigenvalues of Γ are too close to zero. A symmetric positive-definite matrix Γ with $L$ all-distinct eigenvalues may be decomposed as $W \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_L) W^\top$, with $W$ orthogonal; therefore, $\log \Gamma = W \mathrm{diag}(\log\lambda_1, \log\lambda_2, \ldots, \log\lambda_L) W^\top$. If an eigenvalue lies too close to zero, the matrix logarithm becomes numerically unstable; in general, a matrix logarithm is well-behaved only in a neighborhood of the identity matrix $I$. To overcome this problem, the following normalizing function by the current value $\Gamma_j^{(n)}$ is proposed:
$$\Lambda_j^{(n)}(X) := (\Gamma_j^{(n)})^{-1/2}\, X\, (\Gamma_j^{(n)})^{-1/2},$$
where $X$ denotes any symmetric positive-definite matrix and $(\cdot)^{-1/2}$ denotes the combination of matrix inversion and symmetric square-rooting. On the basis of the observations recalled in footnote 1, the inverse symmetric square root of a SPD matrix Γ may be computed rather inexpensively as $W \mathrm{diag}(\lambda_1^{-1/2}, \lambda_2^{-1/2}, \ldots, \lambda_L^{-1/2}) W^\top$. Since a precision matrix $\Gamma_j^{(n)}$ is symmetric and positive-definite, its inverse always exists and its matrix square root is always a symmetric, real-valued matrix. Let us remark that the introduced normalization preserves both the symmetry and the positive-definiteness of its argument. In fact, concerning symmetry,
$$(\Lambda_j^{(n)}(X))^\top = \left((\Gamma_j^{(n)})^{-1/2} X (\Gamma_j^{(n)})^{-1/2}\right)^\top = (\Gamma_j^{(n)})^{-1/2} X^\top (\Gamma_j^{(n)})^{-1/2} = \Lambda_j^{(n)}(X),$$
while positive-definiteness follows since $v^\top \Lambda_j^{(n)}(X)\, v = ((\Gamma_j^{(n)})^{-1/2} v)^\top X\, ((\Gamma_j^{(n)})^{-1/2} v) > 0$ for every $v \neq 0$. Define the normalized matrix $\tilde{\Gamma}_j := \Lambda_j^{(n)}(\Gamma_j)$, so that $\Gamma_j = \Gamma_j(\tilde{\Gamma}_j)$ is to be thought of as a compound function, with $\Gamma_j(\tilde{\Gamma}_j) := (\Lambda_j^{(n)})^{-1}(\tilde{\Gamma}_j) = (\Gamma_j^{(n)})^{1/2}\, \tilde{\Gamma}_j\, (\Gamma_j^{(n)})^{1/2}$. Applying the MEG rule (20) to the normalized matrix gives
$$\tilde{\Gamma}_j^{(n+1)} := \exp\!\left(\log \tilde{\Gamma}_j^{(n)} - \eta_w\, \mathrm{sym}\!\left(\frac{\partial F^{(n)}}{\partial \tilde{\Gamma}_j}\right)\right). \qquad (25)$$
Notice that $\tilde{\Gamma}_j^{(n)}$ can be written as
$$\tilde{\Gamma}_j^{(n)} = \Lambda_j^{(n)}(\Gamma_j^{(n)}) = I,$$
where $I \in \mathbb{R}^{L \times L}$ is the identity matrix. Since $\log I = 0$, the adaptation rule (25) simplifies to
$$\tilde{\Gamma}_j^{(n+1)} = \exp\!\left(-\eta_w\, \mathrm{sym}\!\left(\frac{\partial F^{(n)}}{\partial \tilde{\Gamma}_j}\right)\right). \qquad (27)$$
To find the derivative of the function $F^{(n)}(\Gamma_j)$ with respect to $\tilde{\Gamma}_j$, the following chain rule [44] is used:
$$\left(\frac{\partial F^{(n)}(\Gamma_j(\tilde{\Gamma}_j))}{\partial \tilde{\Gamma}_j}\right)_{kl} = \mathrm{Tr}\!\left[\left(\frac{\partial F^{(n)}}{\partial \Gamma_j}\right)^{\!\top} \frac{\partial \Gamma_j}{\partial (\tilde{\Gamma}_j)_{kl}}\right], \qquad \frac{\partial \Gamma_j}{\partial (\tilde{\Gamma}_j)_{kl}} = (\Gamma_j^{(n)})^{1/2}\, S^{kl}\, (\Gamma_j^{(n)})^{1/2}, \qquad (28)$$
where the notation $(X)_{kl}$ indicates the $(k,l)$-th entry of the matrix $X$, $\mathrm{Tr}(\cdot)$ denotes the matrix trace, and $S^{kl}$ is the single-entry matrix [44], whose $(k,l)$-th entry is 1 and whose other entries are 0. From the property (28) we get
$$\frac{\partial F^{(n)}(\Gamma_j(\tilde{\Gamma}_j))}{\partial \tilde{\Gamma}_j} = (\Gamma_j^{(n)})^{1/2}\, \frac{\partial F^{(n)}}{\partial \Gamma_j}\, (\Gamma_j^{(n)})^{1/2}, \qquad (29)$$
thanks to the symmetry of the involved matrices and expressions. Using the formula (29), the adaptation rule (27) can be written as
$$\tilde{\Gamma}_j^{(n+1)} = \exp\!\left(-\eta_w\, (\Gamma_j^{(n)})^{1/2}\, \mathrm{sym}\!\left(\frac{\partial F^{(n)}}{\partial \Gamma_j}\right) (\Gamma_j^{(n)})^{1/2}\right).$$
Thanks to the normalizing function, the precision matrices can be updated stably. The (n + 1)-th precision matrix is then obtained by applying the inverse normalizing function, so that the following update rule is derived:
$$\Gamma_j^{(n+1)} = (\Gamma_j^{(n)})^{1/2} \exp\!\left(-\eta_w\, (\Gamma_j^{(n)})^{1/2}\, \mathrm{sym}\!\left(\frac{\partial F^{(n)}}{\partial \Gamma_j}\right) (\Gamma_j^{(n)})^{1/2}\right) (\Gamma_j^{(n)})^{1/2}. \qquad (31)$$
From (31) we see that, unlike (20), this adaptation rule does not require the computation of log Γ. We call this adaptation rule normalized matrix exponentiated gradient (NMEG).
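The NMEG step (31) avoids the matrix logarithm entirely; a sketch (ours), with the symmetric square root computed by eigendecomposition as described above:

```python
import numpy as np
from scipy.linalg import expm

def sqrtm_spd(X):
    """Symmetric square root of an SPD matrix: X^(1/2) = W diag(sqrt(lambda_i)) W^T."""
    lam, W = np.linalg.eigh(X)
    return (W * np.sqrt(lam)) @ W.T

def nmeg_update(Gamma_j, grad, eta_w):
    """NMEG step (31):
    Gamma^(1/2) expm(-eta * Gamma^(1/2) sym(grad) Gamma^(1/2)) Gamma^(1/2).
    No matrix logarithm is required, which is the point of the normalization."""
    S = sqrtm_spd(Gamma_j)
    sym_grad = 0.5 * (grad + grad.T)
    return S @ expm(-eta_w * (S @ sym_grad @ S)) @ S
```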
As a special instance, let us consider the case L = 1. The NMEG update rule for L = 1 is derived by replacing each precision matrix Γ with a scalar parameter γ > 0 in (31):
$$\gamma_j^{(n+1)} = \gamma_j^{(n)} \exp\!\left(-\eta_w\, \gamma_j^{(n)}\, \frac{\partial F^{(n)}}{\partial \gamma_j}\right), \qquad (32)$$
which evidently keeps each parameter $\gamma_j$ in the positive half-line during adaptation. The partial derivative of the cost function, in this case, reads
$$\frac{\partial F^{(n)}}{\partial \gamma_j} = 2\, e^{(n)} h_j^{(n)}\, \kappa(u^{(n)}; c_j^{(n)}, \gamma_j^{(n)})\, \|u^{(n)} - c_j^{(n)}\|^2. \qquad (33)$$
This special case was proposed and discussed in the contributions [6,28]. The adaptation rule (31) was derived on the basis of matrix normalization; it is therefore legitimate to wonder whether it constitutes a valid algorithm to update a matrix in the space of SPD tensors. The answer is positive: the rule (31) may be regarded as an application of a general geodesic-based stepping rule on the manifold of symmetric positive-definite matrices endowed with the canonical metric, namely
$$\Gamma_j^{(n+1)} = g_{\Gamma_j^{(n)}}\!\left(-\eta_w\, \Gamma_j^{(n)}\, \mathrm{sym}\!\left(\frac{\partial F^{(n)}}{\partial \Gamma_j}\right) \Gamma_j^{(n)}\right), \qquad (34)$$
where the function $g_X(V)$ denotes a geodesic arc in the SPD space departing from the point $X$ in the direction $V$ and is given by
$$g_X(V) := X^{1/2} \exp\!\left(X^{-1/2}\, V\, X^{-1/2}\right) X^{1/2},$$
as explained, for example, in [43] and [45]. Notice, in addition, that the argument of the function $g$ in (34) is proportional to the Riemannian gradient of the criterion function $F$ with respect to the canonical metric, as defined in the previous subsection.
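For L = 1 the rule collapses to a scalar multiplicative update; a sketch of (32)-(33) (ours):

```python
import numpy as np

def nmeg_update_scalar(gamma_j, h_j, u, c_j, e, eta_w):
    """Scalar NMEG (32) with the derivative (33):
    gamma <- gamma * exp(-eta * gamma * 2 e h_j kappa_j ||u - c_j||^2)."""
    sq = np.sum((u - c_j) ** 2)
    kappa_j = np.exp(-gamma_j * sq)
    dF = 2.0 * e * h_j * kappa_j * sq
    return gamma_j * np.exp(-eta_w * gamma_j * dF)
```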

Sparse KNLMS incorporated with generalized Gaussian kernel parameters
To avoid overfitting and to prevent monotonic growth of the dictionary, the proposed adaptation rules for the generalized Gaussian parameters are applied jointly with ℓ1-regularization [33]. The proposed method is summarized in Algorithm 1.
7:  for j ← 0 to (size of $\mathcal{D}^{(n)}$) − 1 do
8:      Update the center vector $c_j^{(n)}$ using (18).
9:      Update the precision matrix $\Gamma_j^{(n)}$ using MEG (20) or NMEG (31).
10:     Update the filter coefficients $h_j$ according to the forward-backward splitting scheme (8).
11: end for
12: for j such that $h_j = 0$ do
13:     Remove the j-th element from the dictionary $\mathcal{D}^{(n)}$.
14: end for
15: n ← n + 1
16: end for
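Since the listing above is only a fragment (steps 1-6 are not reproduced here), the following end-to-end sketch shows one plausible realization of the whole loop, reusing prox_weighted_l1 and nmeg_update from the earlier sketches; the admission policy (appending every new input to the dictionary before pruning) is our assumption, suggested by the extended vectors h̃ and k̃ in (8):

```python
import numpy as np

def run_filter(inputs, desired, gamma0, eta_c, eta_w, lam, mu, sigma, rho):
    """Online sketch of Algorithm 1 (our reading): append each new input as a
    tentative dictionary entry, update all parameters, prune zero coefficients."""
    L = inputs.shape[1]
    centers, precisions, h = [], [], np.zeros(0)
    for u, d in zip(inputs, desired):
        centers.append(u.copy())                    # tentative new entry
        precisions.append(gamma0 * np.eye(L))
        k = np.array([np.exp(-(u - c) @ G @ (u - c))
                      for c, G in zip(centers, precisions)])
        h_ext = np.append(h, 0.0)                   # h~ = [h^T, 0]^T
        e = d - k @ h_ext                           # a-priori error
        for j in range(len(centers)):               # steps 7-11
            dj = u - centers[j]
            kap = np.exp(-dj @ precisions[j] @ dj)
            grad = 2.0 * e * h_ext[j] * kap * np.outer(dj, dj)
            precisions[j] = nmeg_update(precisions[j], grad, eta_w)
            centers[j] = centers[j] + 4.0 * eta_c * e * h_ext[j] * kap * (precisions[j] @ dj)
        w = 1.0 / (np.abs(h_ext) + rho)             # reweighted l1 weights
        h = prox_weighted_l1(h_ext + lam * e * k / (sigma + k @ k), lam, mu, w)
        keep = h != 0                               # steps 12-14: prune
        centers = [c for c, m in zip(centers, keep) if m]
        precisions = [G for G, m in zip(precisions, keep) if m]
        h = h[keep]
    return centers, precisions, h
```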

Numerical experiments
In this section, we compare the KNLMS-ℓ 1 [33], the NMEG (L = 1) [6,28] in (32), the MEG in (20), and the NMEG in (31) through three types of simulations. The first simulation is a time series prediction in a toy model defined by Gaussian functions with scalar widths. The second simulation is an online prediction in a toy model defined by Gaussian functions with precision matrices. The last simulation consists in an online prediction of the state of a Lorenz chaotic system. In these simulations, mean squared error (MSE) and mean dictionary size were adopted as the evaluation criteria. Both indices were averaged over 200 independent trials to compensate for statistical fluctuations in each single trial.

Time series prediction in a toy signal model constructed by standard Gaussian functions
Consider a synthetic signal model constructed by standard Gaussian functions with scalar widths. The NMEG (L = 1) converges faster than the other algorithms in this comparison; however, when the iteration index n reaches about 100,000, the NMEG (L = 1) and the NMEG exhibit almost the same mean MSE, even though the NMEG uses generalized Gaussian kernels. Fig 4 shows that the NMEG (L = 1), the MEG, and the NMEG are all able to keep a small dictionary size.

Time series prediction in a toy signal model constructed by generalized Gaussian functions
Further, consider a synthetic signal model constructed by generalized Gaussian functions with precision matrices, corrupted by the same kind of noise and driven by the same input sequence as in the previous experiment. The parameter values pertaining to the learning schemes utilized in this experiment are given in Table 3; in addition, the parameter values for the forward-backward splitting scheme are λ = 0.09 and σ = 0.03. We tested the behavior of the proposed adaptive kernel filter on two different cases, characterized by two instances of Λ. In both cases the MEG update (20) is difficult to compute, while the NMEG is able to perform well. Fig 5c and 5d confirm that the NMEG produces the smallest dictionary. The above results clearly confirm the efficacy of the proposed normalization method for updating precision matrices.

Modeling of a Lorenz chaotic system
Adaptive kernel filters are widely used in time-series prediction [46]. We tested the devised algorithm in modeling a Lorenz chaotic system [30]:
$$\dot{x} = \delta(y - x), \qquad \dot{y} = x(\theta - z) - y, \qquad \dot{z} = xy - \alpha z,$$
where α = 8/3, δ = 10, and θ = 28 [11]. The continuous-time equations were sampled by subdividing each unitary interval into 100 sub-intervals. The x component was used to test the algorithm's prediction ability; the x time series was normalized to zero mean and unit variance. A segment of this time series is displayed in Fig 6. The input signal to the modeling algorithm was constructed as $u^{(n)} = [x^{(n-5)}, x^{(n-4)}, \ldots, x^{(n-1)}]^\top$ and the current value $x^{(n)}$ was taken as the desired response. The values of the learning parameters in this experiment are given in Table 4; in addition, the parameters for the forward-backward splitting scheme are λ = 0.5 and σ = 0.05. Figs 7 and 8 show the MSE and the mean dictionary size at each iteration, respectively.
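For reproducibility, a sketch of the data-generation step (ours; a simple Euler integrator with dt = 1/100, which matches the '100 sub-intervals per unit interval' statement, though the authors' integrator is not specified):

```python
import numpy as np

def lorenz_x_series(n_steps, dt=0.01, alpha=8.0/3.0, delta=10.0, theta=28.0):
    """Euler integration of the Lorenz system with the paper's constants;
    returns regressors u^(n) = [x^(n-5), ..., x^(n-1)] and targets x^(n)."""
    x, y, z = 1.0, 1.0, 1.0
    xs = np.empty(n_steps)
    for n in range(n_steps):
        dx = delta * (y - x)
        dy = x * (theta - z) - y
        dz = x * y - alpha * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        xs[n] = x
    xs = (xs - xs.mean()) / xs.std()                # zero mean, unit variance
    U = np.array([xs[n - 5:n] for n in range(5, n_steps)])
    return U, xs[5:]
```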

Simulation results indicate that the proposed MEG and NMEG exhibit much better performance; namely, they achieve a much smaller mean dictionary size and much smaller MSE values than the other algorithms used for comparison. Comparing the MEG algorithm with the NMEG, the NMEG exhibits better performance in terms of both MSE and mean dictionary size, although their parameters are set to the same values. Fig 9 shows the result of a short-term prediction of the Lorenz time series. It can be seen that the NMEG has higher tracking ability than the NMEG (L = 1). This result confirms the validity of the proposed model in the case that the components of the input signals are mutually correlated.

Conclusions
This paper proposed a flexible dictionary learning strategy in the context of generalized Gaussian kernel adaptive filtering, in which all kernel parameters are adaptive and data driven. We introduced a novel update rule for precision matrices that, thanks to an effective normalization, allows each precision matrix to be updated stably. The main advantage of the proposed approach is that the number of parameters in the generalized Gaussian kernels is larger than in conventional kernel functions, which makes the filter more expressive. The adaptation rules for the kernel parameters were established within a Lie-group theoretic setting. In addition, combined with ℓ1-regularized least squares, the overall kernel adaptive filtering algorithm avoids overfitting and the monotonic inflation of the dictionary. Numerical tests confirmed that the proposed algorithm attains smaller mean squared error and smaller dictionary sizes in modeling nonlinear systems.