Orthogonal Procrustes Analysis for Dictionary Learning in Sparse Linear Representation

In the sparse representation model, the design of overcomplete dictionaries plays a key role for the effectiveness and applicability in different domains. Recent research has produced several dictionary learning approaches, proving that dictionaries learnt from data examples significantly outperform structured ones, e.g. wavelet transforms. In this context, learning consists in adapting the dictionary atoms to a set of training signals in order to promote a sparse representation that minimizes the reconstruction error. Finding the best fitting dictionary remains a very difficult task, leaving the question still open. A well-established heuristic method for tackling this problem is an iterative alternating scheme, adopted for instance in the well-known K-SVD algorithm. Essentially, it consists in repeating two stages: the former promotes sparse coding of the training set and the latter adapts the dictionary to reduce the error. In this paper we present R-SVD, a new method that, while maintaining the alternating scheme, adopts the Orthogonal Procrustes analysis to update the dictionary atoms suitably arranged into groups. Comparative experiments on synthetic data prove the effectiveness of R-SVD with respect to well-known dictionary learning algorithms such as K-SVD, ILS-DLA and the online method OSDL. Moreover, experiments on natural data such as ECG compression, EEG sparse representation, and image modeling confirm R-SVD's robustness and wide applicability.


Introduction
In many application domains, such as denoising, classification and compression of signals [1][2][3], it is often convenient to use a compact signal representation following Occam's Razor principle. Dimensionality reduction can be accomplished either with feature selection [4,5] or sparse decomposition techniques [6].
Sparsity is a classical linear algebra approach leading to parsimonious representations. Consider an overcomplete dictionary matrix D ∈ ℝ^{n×m} (n < m) with columns d_i, i = 1,…,m, called atoms, and a signal vector y ∈ ℝ^n; the sparsity approach consists in expressing y as a linear combination ∑_i x_i d_i with as few as possible non-zero coefficients x_i ∈ ℝ. Formally, the sparse approximation problem consists in finding x ∈ ℝ^m minimizing the least squares error ‖y − Dx‖₂² subject to x having at most k non-zero entries.

The method proposed in this paper, R-SVD, maintains the iterative alternating scheme but updates the dictionary atoms arranged into groups, applying the Orthogonal Procrustes analysis simultaneously to all the atoms in each group, thus capturing more complex data structures and being more efficient. The technique is able to find an optimal dictionary after a few iterations of the scheme. Notice that the proposed method differs from K-SVD [28], which instead updates one atom at a time together with the corresponding sparse coefficients. Several experimental sessions show that R-SVD is effective and behaves better than several well-known dictionary learning algorithms such as K-SVD, ILS-DLA and the online method OSDL.
In Sec. 2 we describe the problem and the proposed R-SVD algorithm. In Sec. 3 we conduct an experimental analysis studying the group size parameter of the method (sec. 3.1), showing results on synthetic data (sec. 3.2), investigating the role of the sparse decomposition method (sec. 3.3), and comparing with other dictionary learning algorithms (sec. 3.4). In Sec. 4 we report applications to ECG signal compression (sec. 4.1), EEG signal representation (sec. 4.2), and image modeling (sec. 4.3). Finally, we draw some conclusions in Sec. 5.

Method
In this section we use the notation A = {a_i}_{i=1}^q ∈ ℝ^{p×q} to indicate a p × q real-valued matrix with columns a_i ∈ ℝ^p, i = 1,…,q. Suppose we are given the training dataset Y = {y_i}_{i=1}^L ∈ ℝ^{n×L}. The dictionary learning problem consists in finding an overcomplete dictionary matrix D = {d_i}_{i=1}^m ∈ ℝ^{n×m} (n < m) which minimizes the least squares errors ‖y_i − Dx_i‖₂², so that all coefficient vectors x_i ∈ ℝ^m are k-sparse. Formally, by letting X = {x_i}_{i=1}^L ∈ ℝ^{m×L} denote the coefficient matrix, this problem can be precisely stated as

  argmin_{D,X} ‖Y − DX‖_F²  subject to ‖x_i‖₀ ≤ k, i = 1,…,L.  (1)

One can multiply the i-th column of D and divide the i-th row of X by a common non-null constant to obtain another solution attaining the same value. Hence, w.l.o.g. the atoms in D are constrained to have unit ℓ₂-norm, corresponding to vectors d_i on the unit (n − 1)-sphere S^{n−1} centered at the origin.
The search for the optimal solution is a difficult task due both to the combinatorial nature of the problem and to the strong non-convexity given by the ℓ₀ conditions. We tackle this problem adopting the well-established alternating optimization scheme [21], which consists in repeatedly executing two steps:

Step 1. Sparse coding: solve problem (1) for X only (fixing the dictionary D).
Step 2. Dictionary update: solve problem (1) for D only (fixing X).
In particular, for sparse decomposition in Step 1 we use the greedy algorithm OMP because of its simplicity and efficiency. Clearly, other sparse recovery methods could be adopted (e.g. BP, Lasso, k-Limaps, SL0). Experimentally, we observe that this choice does not substantially affect the performance of R-SVD in comparison with K-SVD.
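For reference, the following is a minimal NumPy sketch of OMP, the greedy sparse coder used in Step 1. The function name and the stopping rule (exactly k selected atoms) are our own illustrative choices, not the authors' implementation; unit-norm dictionary columns are assumed.

```python
import numpy as np

def omp(D, y, k):
    """Greedy Orthogonal Matching Pursuit: find a k-sparse x with y ≈ D x.

    Minimal sketch; assumes the columns of D have unit norm.
    """
    n, m = D.shape
    residual = y.copy()
    support = []
    x = np.zeros(m)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x[support] = coef
    return x
```

For a signal that is an exact multiple of one atom, a single OMP step recovers it exactly, since the unit-norm atom itself maximizes the correlation with the residual.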
Step 2 represents the core of the R-SVD method as detailed in the following.

Dictionary learning by Procrustes analysis
Let us first recall the idea of the Procrustes analysis. It consists in applying affine transformations (e.g., moving, stretching and rotating) to a given geometrical object in order to best fit the shape of another one. When the admissible transformations are restricted to orthogonal ones, it is referred to as Orthogonal Procrustes analysis [25].
Basically, in the proposed method R-SVD, after splitting the dictionary D into atom groups, the Orthogonal Procrustes analysis is applied to each group to find the best rotation (either proper or improper) that minimizes the total least squares error. Consequently, each group is updated by the optimal orthogonal transformation thus obtained. Formally, let us denote by [m] ≔ {1,…,m} the set of the first m positive integers and let I ⊆ [m] denote a set of indices for matrix columns or rows. Given any index set I of size s = |I|, let D_I ∈ ℝ^{n×s} be the submatrix (subdictionary) of D formed by the columns indexed by I, that is D_I = {d_i}_{i∈I}, and let X_I ∈ ℝ^{s×L} be the submatrix of X formed by the rows indexed by I; hence s is the size of the atom group D_I. In this setting, we can decompose the product DX into the sum DX = D_I X_I + D_{I^c} X_{I^c} of a matrix D_I X_I dependent on the group I and a matrix D_{I^c} X_{I^c} dependent on the complement I^c = [m] \ I. Therefore, the objective function in Eq (1) can be written as ‖Y − DX‖_F² = ‖E − D_I X_I‖_F², where E ≔ Y − D_{I^c} X_{I^c}, which leads to the subproblem of Step 2 restricted to the group D_I of unit ℓ₂-norm atoms:

  argmin_{D_I} ‖E − D_I X_I‖_F².  (2)
Our method aims at yielding a new atom group S = D′_I, in general suboptimal for problem (2), by an orthogonal transformation matrix R (i.e. RᵀR = I) applied to D_I, namely D′_I = R D_I. The set of orthogonal matrices R of order n, called the orthogonal group O(n) (not to be confused with a group of atoms), can be partitioned into the special orthogonal subgroup SO(n) formed by proper rotations, i.e. those with det R = 1, and the set O(n) \ SO(n) of improper rotations (or rotoreflections), i.e. those with det R = −1. Therefore, the search for such an optimal transformation can be stated as the following minimization problem

  R̂ = argmin_{R ∈ O(n)} ‖E − RH‖_F²,  (3)

where H ≔ D_I X_I ∈ ℝ^{n×L}. Notice that in denoting E and H we omit the dependence on I. Problem (3) is known as the Orthogonal Procrustes problem [25] and can be interpreted as finding the rotation of a subspace matrix Hᵀ to closely approximate a subspace matrix Eᵀ [29, §12.4.1].
The Orthogonal Procrustes problem admits (at least) one optimal solution R̂, which is [29] the transposed orthogonal factor Qᵀ of the polar decomposition HEᵀ = QP, and can be effectively computed as R̂ = Qᵀ = VUᵀ from the orthogonal matrices U and V of the singular value decomposition HEᵀ = USVᵀ. Hence the rotation matrix we seek is R̂ = VUᵀ; the new dictionary D′ keeps the old columns of D in the positions I^c and has the new submatrix D′_I = R̂ D_I in the positions I, while the new, non-increased value of the reconstruction error is ‖E − R̂H‖_F² ≤ ‖E − H‖_F² (the identity R = I being feasible in problem (3)). At this point the idea of the whole algorithm is quite straightforward. At each dictionary update iteration (Step 2):

1. partition the set of column indices [m] = I_1 ⊔ I_2 ⊔ … ⊔ I_G into G subsets;
2. split D accordingly into atom groups D_{I_g}, g = 1,…,G;
3. update every atom group D_{I_g}.
These updates can be carried out either in parallel or sequentially in some order. We have chosen the sequential update in ascending order of atom popularity, i.e. sorting the indices i ∈ [m] w.r.t. the usage of atom d_i, computable as the ℓ₀-norm of the i-th row of X. For the sake of simplicity we set the group size uniformly to s = |I_g| for all g, possibly except the last group (with G = ⌈m/s⌉) when m is not a multiple of s: |I_G| = m − (G − 1)s. Regarding this choice, we have seen experimentally that the performance of our method is insensitive to the grouping criterion, having also tried random balanced grouping, cumulative coherence based partitioning, and clustering by absolute cosine similarity.
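The popularity-based grouping just described can be sketched in a few lines of NumPy; the function name is our own, and the last group is simply left smaller when m is not a multiple of s.

```python
import numpy as np

def popularity_groups(X, s):
    """Partition atom indices into groups of size s in ascending order of
    atom popularity, i.e. the l0-norm of each row of the coefficient
    matrix X (a sketch of the grouping rule described in the text;
    the last group may be smaller)."""
    usage = np.count_nonzero(X, axis=1)        # how often each atom is used
    order = np.argsort(usage, kind="stable")   # least popular atoms first
    return [order[i:i + s] for i in range(0, len(order), s)]
```

For instance, with five atoms of usages (3, 0, 2, 1, 0) and s = 2, the groups come out as (1, 4), (3, 2), (0): the two unused atoms first, the most used atom alone in the last group.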
After processing all G groups, the method moves to the next iteration, and goes on until a stop condition is reached (e.g. a maximum number of iterations, as commonly chosen, or an empirical convergence criterion based on successive iterates). The main steps are summarized in Algorithm 1 (the Matlab code implementing the algorithm is available on the website http://phuselab.di.unimi.it/resources.php).

Algorithm 1 R-SVD
Input: Y ∈ ℝ^{n×L}: column-vector signals for training the dictionary
Output: D ∈ ℝ^{n×m}: trained dictionary; X ∈ ℝ^{m×L}: sparse encoding of Y

1: Initialize the dictionary D picking m examples from Y at random
2: repeat
3:   Sparse coding: compute X solving problem (1) with D fixed (by OMP)
4:   Partition [m] into groups I_1, …, I_G in ascending order of atom popularity
5:   for g = 1, …, G do
6:     J = I_g
7:     E = Y − D_{J^c} X_{J^c};  H = D_J X_J
8:     Compute the (truncated) SVD: HEᵀ = USVᵀ
9:     R = VUᵀ
10:    D_J = R D_J
11:  end for
12: until stop condition
13: return D, X

Notice that in our method the renormalization of the atoms to unit length at each iteration is not necessary, since the Procrustes analysis inherently yields atoms with this property; hence, in practice, a large part of the renormalization computations required in ILS-DLA [22] and K-SVD [28] can be avoided.
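The group rotation update at the core of Algorithm 1 can be sketched as follows. This is an illustrative NumPy re-implementation of the update described above (the function name is ours), not the released Matlab code; since R = I is feasible in problem (3), the reconstruction error can never increase, and rotated atoms keep unit norm.

```python
import numpy as np

def procrustes_group_update(D, X, Y, I):
    """One R-SVD group update (sketch): rotate the atom group D_I by the
    solution of the Orthogonal Procrustes problem (3)."""
    Ic = np.setdiff1d(np.arange(D.shape[1]), I)
    E = Y - D[:, Ic] @ X[Ic, :]          # error matrix with group I removed
    H = D[:, I] @ X[I, :]                # contribution of group I
    U, _, Vt = np.linalg.svd(H @ E.T)    # HE^T = U S V^T
    R = Vt.T @ U.T                       # optimal rotation R = V U^T
    D_new = D.copy()
    D_new[:, I] = R @ D[:, I]            # atoms stay unit-norm (R orthogonal)
    return D_new
```

On random data one can verify numerically that the Frobenius error ‖Y − DX‖_F does not increase after the update and that all atoms remain on the unit sphere.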

Computational time analysis
A useful computational speedup in the update of every group D_I can be described as follows. Let us pre-compute XYᵀ ∈ ℝ^{m×n} and XXᵀ ∈ ℝ^{m×m} at the beginning of each dictionary update (Step 2). The matrix HEᵀ undergoing the SVD can be computed as

  HEᵀ = D_I (X_I Yᵀ − X_I (Xᵀ)_{I^c} D_{I^c}ᵀ),

where D_I and D_{I^c} come from the previous update step, the term X_I Yᵀ is the submatrix formed by the rows I of XYᵀ, and X_I (Xᵀ)_{I^c} is the submatrix formed by the rows I and the columns I^c of XXᵀ. With elementary matrix products this computation requires O(sn(n + m − s)) flops, which is lower than O(nmL) since s < n < m ≪ L.
Notice that, since rank(HEᵀ) ≤ rank(H) ≤ s, it is not necessary to obtain the full SVD of HEᵀ; it suffices to truncate the decomposition to the first s singular vectors u_i, v_i. We thus use the truncated SVD algorithm by [30], based on a structured random matrix, which requires O(n² log s) flops. The computation of HEᵀ and its SVD is repeated G = ⌈m/s⌉ times. The computational time of R-SVD is dominated by the pre-computation of XYᵀ and XXᵀ; taking into account the k-sparsity of the columns of X, these cost O(knL) and O(k²L) flops respectively, so the asymptotic estimate for one iteration of R-SVD is T_R-SVD(k, n, m, L, s) = O(kL(n + k)).
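The factorized computation of HEᵀ from the precomputed products can be checked numerically against the direct definition; the toy dimensions below are our own choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, L, s = 8, 16, 200, 4
D = rng.standard_normal((n, m))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms
X = rng.standard_normal((m, L))
Y = rng.standard_normal((n, L))

XY = X @ Y.T                            # precomputed once per Step 2
XX = X @ X.T

I = np.arange(s)                        # one atom group
Ic = np.arange(s, m)                    # its complement

# Direct computation: H E^T with E = Y - D_{I^c} X_{I^c}, H = D_I X_I
direct = (D[:, I] @ X[I]) @ (Y - D[:, Ic] @ X[Ic]).T
# Factorized computation using only submatrices of XY^T and XX^T
fast = D[:, I] @ (XY[I] - XX[np.ix_(I, Ic)] @ D[:, Ic].T)
assert np.allclose(direct, fast)
```

The factorized form never touches the L-dimensional axis inside the group loop, which is where the speedup comes from.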

Experimental analysis
In this section we test the proposed R-SVD algorithm, devoting first an in-depth analysis to the choice of the group size s defined above. Then we apply the method to synthetic data, conducting extensive experiments on both R-SVD and K-SVD using OMP as sparsifier. A further investigation is conducted on two different sparse decomposition methods, namely k-Limaps and SL0, and on alternative dictionary learning methods, namely ILS-DLA by Engan et al. [20] and OSDL by Sulam et al. [24].
Following [28], the dictionary D ∈ ℝ^{n×m} is randomly drawn, with i.i.d. standard Gaussian entries and each column normalized to unit ℓ₂-norm. The training set Y ∈ ℝ^{n×L} is generated column-wise by L linear combinations of k dictionary atoms selected at random, adding a white Gaussian noise matrix N with various signal-to-noise ratios (SNR), i.e. Y = DX + N. We measure the performance of the algorithms in terms of the reconstruction error (or quality) expressed as E_SNR = 20 log₁₀(‖Y‖_F / ‖Y − D̂X̂‖_F) dB, where D̂ and X̂ are the learned dictionary and the sparse encoding matrix, respectively.
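The data generation protocol and the quality measure can be sketched as follows; the function names and the exact noise-scaling convention (global Frobenius-norm SNR) are our own assumptions.

```python
import numpy as np

def e_snr(Y, Y_hat):
    """Reconstruction quality E_SNR = 20 log10(||Y||_F / ||Y - Y_hat||_F) in dB."""
    return 20 * np.log10(np.linalg.norm(Y) / np.linalg.norm(Y - Y_hat))

def synthetic_data(n, m, L, k, snr_db, rng):
    """Y = DX + N as described above: random dictionary with unit-norm
    i.i.d. Gaussian columns, k-sparse coefficient columns, and white
    Gaussian noise scaled to the requested SNR (a sketch)."""
    D = rng.standard_normal((n, m))
    D /= np.linalg.norm(D, axis=0)
    X = np.zeros((m, L))
    for j in range(L):
        X[rng.choice(m, size=k, replace=False), j] = rng.standard_normal(k)
    S = D @ X
    N = rng.standard_normal((n, L))
    N *= np.linalg.norm(S) / (np.linalg.norm(N) * 10 ** (snr_db / 20))
    return D, X, S + N
```

By construction, evaluating e_snr(Y, D @ X) on the generated data returns a value close to the requested SNR.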

Setting the group size
It is naturally expected that the group size s affects both reconstruction quality and running time. In order to give some insight on this parameter, we run the R-SVD algorithm on synthetic training sets by setting L = 8000, D ∈ ℝ^{50×100}, k = 5 and SNR = 30 dB for the noise N, letting s range from 1 to 25. Notice that when s = 1 our method is similar to K-SVD [28], except that K-SVD also recovers the sparse coefficients from the SVD decomposition.
In Fig 1 we report the reconstruction error E_SNR (solid curve) and the computational times of both R-SVD (dashed curve) and K-SVD (dotted line), all averaged over 100 trials. It can be noticed that the R-SVD method behaves best near the value s = 10. We thus choose this setting as a tradeoff for the experimental assessments of the method in the following sections.

Comparative results on synthetic data
In order to test the R-SVD method and compare it to the K-SVD algorithm, we first run experiments on random training instances. We consider dictionaries of size 50 × 100 and 100 × 200, datasets of size up to L = 10000 and sparsity k ∈ {5, 10}. The algorithms K-SVD and R-SVD are run for T = 200 dictionary update iterations, which turns out to be sufficient to achieve empirical convergence of the performance measure. For each experimental setting we report the average error over 100 trials.
In Fig 2 we highlight the learning trends of the two methods, plotting at each iteration count the E_SNR values on synthetic vectors Y = DX + N, varying the additive noise level SNR = 10, 30, 50, ∞ (no noise) dB. It can be seen that, after an initial transient, the gap between R-SVD and K-SVD increases with the iteration count, establishing a final gap of 2 dB or more in conditions of middle-low noise power (SNR ≥ 30 dB).
In order to explore the behavior of R-SVD and K-SVD over a fairly wide range of parameter values, we report the final E_SNR values in Fig 3. Moreover, it is useful to evaluate the number of correctly identified atoms, in order to measure the ability of the learning algorithms to recover the original dictionary D from the noise-affected data Y. This is accomplished by maximizing the matching between the atoms d_i of the original dictionary and the atoms d̂_j of the dictionary D̂ yielded by the algorithm: two atoms (d_i, d̂_j) are considered matched when their cosine distance is small [28], precisely when 1 − |d_iᵀ d̂_j| < δ ≔ 0.01. In Table 1 we report the average number of recovered atoms on random instances. Notice that R-SVD performs slightly better independently of the additive noise power.
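The atom recovery count can be sketched as below. For simplicity this sketch uses a greedy one-to-one matching rather than an exact maximum matching (a simplifying assumption of ours); with exact copies of the atoms, the two coincide.

```python
import numpy as np

def recovered_atoms(D_true, D_learn, delta=0.01):
    """Count original atoms d_i matched by some learned atom d_j, i.e.
    1 - |d_i^T d_j| < delta (unit-norm columns assumed). Greedy sketch:
    each learned atom is allowed to match at most one original atom."""
    C = np.abs(D_true.T @ D_learn)     # |cosine| between all atom pairs
    count = 0
    for i in range(D_true.shape[1]):
        j = int(np.argmax(C[i]))
        if 1 - C[i, j] < delta:
            count += 1
            C[:, j] = -1               # this learned atom is now taken
    return count
```

A dictionary that is a column permutation of the original, with arbitrary sign flips, is counted as fully recovered, while an unrelated random dictionary yields (almost surely) essentially no matches.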

Choice of the sparse decomposition method
So far we have used OMP as sparsifier for both the R-SVD and K-SVD methods. Here we investigate R-SVD's performance adopting the sparse decomposition techniques k-Limaps [9] and SL0 [10]. The former (developed by the authors) has proven its ability to perform better than other sparsity methods well known in the literature (see the experimental sections in [9]), while the latter is particularly suitable for fast applications.
Each method has been applied to compute line 3 of Algorithm 1 for R-SVD, and the corresponding operation in the K-SVD algorithm. SL0 (available on the authors' webpage http://ee.sharif.edu/~SLzero/) is set up with scale parameter μ₀ = 2 and σ decrease factor equal to 4/5, while k-Limaps is initialized setting the maximum number of iterations to 100. The experiments on R-SVD and K-SVD incorporating k-Limaps and SL0 were conducted under the same conditions as in subsection 3.2. The resulting gap between the final E_SNR values is shown in Fig 4. Notice that, using k-Limaps as sparsifier, R-SVD provides performances that are almost uniformly and moderately better than K-SVD, unless the additive noise is very high; in that case there is further evidence that the two algorithms are equally misled by noise-affected data. Moreover, while the behavior of the two algorithms incorporating SL0 is more mixed, R-SVD with SL0 still usually performs better.

Alternative dictionary learning methods
Here we extend the comparative experiments, considering R-SVD (integrating OMP) versus the iterative alternating scheme method ILS-DLA and the online dictionary learning method OSDL by Sulam et al. [24].
In these experiments we refer to random training sets of size L = 5000, 10000, dictionaries of size 64 × 128 (where the atom size is a perfect square, as required by OSDL), various levels of additive noise (10, 30, 50 dB and the case of no noise), and sparsity set to 10% of the atom size. The methods are run for 200 dictionary update iterations and the obtained results are averaged over 100 trials. In Table 2 we report the average gaps between the E_SNR of R-SVD and that of ILS-DLA and OSDL, respectively. We can notice that R-SVD systematically demonstrates better performance than both ILS-DLA and OSDL. More specifically, ILS-DLA shows a regular behavior, with gaps that rise considerably as the SNR of the additive noise increases. Concerning OSDL, the gaps in favor of R-SVD are even more significant, but with a less regular distribution, which is not easy to interpret. This may be in part due to OSDL having been originally conceived for learning on large-dimension image patches.

Experiments on natural data
In order to assess the applicability of the R-SVD method to various domains and tasks, we test it on ECG compression, EEG sparse representation, and image modeling. All the experiments are conducted on publicly available data, comparing the R-SVD and K-SVD performances, and adopting OMP as sparse decomposition method.

ECG compression
Sparsity techniques have already been applied to the compression of electrocardiogram (ECG) signals [31,32]. To highlight the benefit of the dictionary learning approach in this task, we tested the two methods K-SVD and R-SVD, recasting compression as a problem of sparse approximation with a dictionary. In this experiment, the compression process is broken down into three stages. The first is a preprocessing step consisting in R-peak detection of the signal and its normalization (i.e., filtering, zero-padding and centering), so as to split it into n-length normalized RR-segments. The second stage focuses on dictionary learning, where either R-SVD or K-SVD is used to train a dictionary on a group of RR-segments taken from an initial transient of the signal (train chunk). The third stage concerns the encoding, via sparse reconstruction, of all RR-segments belonging to a broader interval of the signal (test chunk); this step is carried out with the learnt dictionaries, applying the OMP algorithm. Naturally, in order to make the coding step easier, magnitudes and positions of the non-null sparse coefficients are handled separately.
The compression level achieved by each method is measured as the compression rate (CR), that is the ratio between the number of bits of the original signal and that of the compressed representation. Assuming that the test chunk y is composed of N q-bit resolution samples forming M RR-segments, we have

  CR = Nq / (M(k q̃ + log₂ C(m, k))),

where k is the sparsity level, q̃ denotes the bit resolution of the quantized coefficients, m is the number of atoms in the dictionary, and log₂ C(m, k) (with C(m, k) the binomial coefficient) is the binary coding length of the coefficient positions. Denoting by ŷ the reconstructed version of y, the overall reconstruction quality measure is here specialized for the ECG signal as E_SNR = 20 log₁₀(‖y‖₂ / ‖ŷ − y‖₂) dB (this technical change is due to the necessity of measuring the quality of signals formed by variable-length segments).
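The CR formula above (as reconstructed here, without rounding the position-coding length up to an integer) translates directly into code:

```python
from math import comb, log2

def compression_rate(N, q, M, k, q_tilde, m):
    """CR = (bits of the original signal) / (bits of the compressed
    representation): N q-bit samples versus M segments, each carrying
    k quantized coefficients plus log2(C(m, k)) position bits.
    Sketch of the formula in the text; no ceiling applied to log2."""
    bits_original = N * q
    bits_per_segment = k * q_tilde + log2(comb(m, k))
    return bits_original / (M * bits_per_segment)
```

For example, 1000 samples at q = 12 bits, compressed into 10 segments with k = 5 coefficients at q̃ = 12 bits over an m = 100 atom dictionary, give a CR of roughly 14.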
We conducted the experiments on two ECG records taken from the Long-Term ST Database on PhysioNet [33]. They are 24-hour-long recordings taken from different subjects with the aid of Holter portable devices, all sampled at f_s = 250 Hz with q = 12-bit resolution. Performances are reported in Fig 5. The upper plots show the E_SNR vs CR obtained by OMP, referring to dictionaries learnt by R-SVD or K-SVD, and to an untrained dictionary (i.e. randomly picking m RR-segments from the train chunk). The lower plots report the computational time spent by the two techniques in the learning stage. To make the experiment realistic, we set n = f_s, m = 5n and q̃ = q; the dictionary training is carried out on L = 5000 RR-segments (about 120 minutes), while the test chunk is composed of M = 15000 RR-segments (about 4 hours). In order to analyze the methods at several rates, we varied k from 5 to 80 with step size 1.
Such experiments, besides confirming that trained dictionaries behave better than an untrained one, prove the effectiveness of the R-SVD training method, which outperforms K-SVD both in training ability and in learning time.

EEG sparse representation
Electroencephalogram (EEG) signals are exploited in several brain-computer interfaces (BCI), mainly due to their high temporal resolution. An element of difficulty when dealing with this kind of signal is its non-stationary temporal behavior, an unwanted property that has negative consequences on common tasks such as classification. Several adaptive techniques and updating rules have been proposed to produce compelling dictionaries [34,35]. Here we touch upon how dictionary learning can help in tackling these undesired aspects, allowing us to produce compact and faithful signal representations.
Operatively, we split the task into three stages. The first aims at producing a pool P of n-length normalized EEG chunks corresponding to a given motor imagery (e.g. left hand, right foot). The second stage concerns dictionary learning in the strict sense: starting from a random dictionary D_init of dimensions n × 2n, both the R-SVD and K-SVD methods are adopted to specialize D_init on the basis of a pool of EEG chunks randomly taken from P. Finally, the quality of the learnt dictionaries, say D_K-SVD and D_R-SVD, is evaluated by applying the OMP algorithm on a distinct pool of EEG chunks also taken from P.
We conducted the experiments referring to dataset IVa from BCI competition III [36] (http://www.bbci.de/competition/iii/desc_IVa.html): portions of the EEG signal corresponding to one motor imagery (either right foot or right hand) are extracted according to the given signal labelling, normalized to zero mean and set into the pool P. The learning phase is carried out by setting the atom dimension n = 150 or 300, the number of iterations T = 150, and expressing the sparsity k as various percentages of n. The cardinalities of the training set and the test set are L = 2n and M = 4n respectively. In Fig 6 we plot the E_SNR trends of the learning phase on the training sets. We can observe that R-SVD has a markedly fast gain, showing a significant gap with respect to K-SVD from the very first iterations. In Table 3 we report the reconstruction quality in terms of E_SNR attained in the test phase using the dictionaries D_K-SVD and D_R-SVD for sparse coding (with OMP). All results in the two phases are averaged over 50 trials. It is evident that the R-SVD method systematically outperforms K-SVD on both EEG chunk sizes. Moreover, the reconstruction quality increases with k, as does the gap between the performances of K-SVD and R-SVD.

Image modeling
Sparse representation of images via learnt overcomplete dictionaries is a well-known topic in the fields of image processing and computer vision [37]. In particular, in problems such as image denoising or compression [2,38], we are interested in constructing efficient representations of patches (i.e. small portions of images) as combinations of as few as possible typical patterns (atoms) learnt from the data themselves. Naturally, the dictionary used is crucial for the effectiveness of the method. Here, we compare the performances achieved referring to the patch dictionaries learnt by either the well-known K-SVD method or the R-SVD method. Given a pool of image patches P, a pre-processing is first applied, aiming at both removing the mean patch intensity and reshaping all the patches to vectors in ℝ^n.

Table 3. Average E_SNR obtained on the EEG test sets. n: chunk dimension. %n: sparsity expressed as a percentage of n. k: sparsity level. D_K-SVD: E_SNR obtained referring to the dictionary learnt by K-SVD. D_R-SVD: E_SNR obtained referring to the dictionary learnt by R-SVD.

The learning phase is carried out by setting the number of iterations to 50 and varying the sparsity k from 5 to 30. The initial dictionaries for the two algorithms have size n × 1.5n, while the training and test sets have size L = 2n and M = 4n respectively; they are all made up of patches randomly selected from P.
Experimentally, we considered patches of size 9 × 9 (n = 81) and 16 × 16 (n = 256). In both cases we randomly extracted 100,000 patches from 500 images of Caltech 101 [39] and the Berkeley segmentation image database [40]. In Fig 7 we plot the E_SNR trends of the learning phase on the training sets. We can observe that, from the first iterations, R-SVD behaves better than K-SVD, maintaining a positive gap at convergence, especially in the cases of lower sparsity. In Table 4 we report the reconstruction quality in terms of E_SNR attained in the test phase using the dictionaries learnt by K-SVD and R-SVD and then applying OMP. All results in the two phases are averaged over 50 trials. As we can observe, the R-SVD method systematically outperforms K-SVD for both patch sizes and several sparsity degrees. These results further confirm the effectiveness and generality of the R-SVD method.

Table 4. Average E_SNR obtained on the image test sets. n: linear patch dimension. k: sparsity level. The last three columns are the E_SNR achieved with the initial untrained dictionary (D_init), the dictionary learnt by K-SVD (D_K-SVD) and the dictionary learnt by R-SVD (D_R-SVD).

Conclusions
In this paper we have proposed R-SVD, a new dictionary learning technique for sparse coding. It preserves the well-established iterative alternating scheme adopted, for example, in the K-SVD algorithm: one step codes the sparse coefficients, and the other optimizes the dictionary to promote sparsity. The main novelty of R-SVD concerns how it tackles the dictionary optimization step: instead of updating single atoms via SVD, it transforms groups of atoms through the best rotations found in the spirit of the Orthogonal Procrustes analysis, so as to minimize the representation error. Extensive experiments have been conducted on both synthetic and natural data. In the former case, we investigated the behavior of R-SVD varying the atom group size and the sparse decomposition method, and we set up extensive simulations to assess the robustness and feasibility of the method, also in comparison with alternative dictionary learning algorithms. In the latter case, we considered the signal and image processing domains, showing good performances on the tasks of ECG compression, EEG sparse coding, and image modeling.
Some open issues remain to be studied. The main one is how to tackle problem (2), i.e. finding the best atom group by adopting transformations more general than rotations within the Procrustes shape analysis approach. Another question concerns the resolution of problem (3), possibly through approximation techniques guaranteeing better computational efficiency.