Robust Generalized Low Rank Approximations of Matrices

In recent years, the intrinsic low rank structure of some datasets has been extensively exploited to reduce dimensionality, remove noise and complete the missing entries. As a well-known technique for dimensionality reduction and data compression, Generalized Low Rank Approximations of Matrices (GLRAM) claims its superiority on computation time and compression ratio over the SVD. However, GLRAM is very sensitive to sparse large noise or outliers and its robust version does not have been explored or solved yet. To address this problem, this paper proposes a robust method for GLRAM, named Robust GLRAM (RGLRAM). We first formulate RGLRAM as an l 1-norm optimization problem which minimizes the l 1-norm of the approximation errors. Secondly, we apply the technique of Augmented Lagrange Multipliers (ALM) to solve this l 1-norm minimization problem and derive a corresponding iterative scheme. Then the weak convergence of the proposed algorithm is discussed under mild conditions. Next, we investigate a special case of RGLRAM and extend RGLRAM to a general tensor case. Finally, the extensive experiments on synthetic data show that it is possible for RGLRAM to exactly recover both the low rank and the sparse components while it may be difficult for previous state-of-the-art algorithms. We also discuss three issues on RGLRAM: the sensitivity to initialization, the generalization ability and the relationship between the running time and the size/number of matrices. Moreover, the experimental results on images of faces with large corruptions illustrate that RGLRAM obtains the best denoising and compression performance than other methods.


Introduction
In the community of pattern recognition, machine learning and computer vision, a commonly-used tenet is that the interested datasets lie in single or multiple linear subspaces. This low rank structure can be exploited to reduce dimensionality, remove noise and complete the missing entries. In view of this, the low rank subspace methods have been attracting broad attentions in the past few years. These methods include not only the classical paradigms such as Principal Component Analysis (PCA) [1], Singular Value Decomposition (SVD) [2] and Linear Discriminant Analysis (LDA) [3], but also the recently emerged Low Rank Matrix Recovery (LRMR) composed mainly by Matrix Completion (MC) [4,5], Robust Principal Component Analysis (RPCA) [6,7] and Low Rank Representation (LRR) [8,9].
Conventionally, each sample is modeled by a vector and the collection of data samples is represented by a matrix. However, some real-world samples such as gray-level images are twodimensional in essence. Under this circumstance, the traditional vector subspace model can not keep the bilinear structure of samples. To remedy this deficiency, one-dimensional subspace methods are successively generalized to the two-dimensional case. The resulting two-dimensional subspace methods mainly include two-dimensional PCA (2dPCA) [10], two-dimensional SVD (2dSVD) [11], two-dimensional LDA (2dLDA) [12], two-directional two-dimensional PCA ((2d) 2 PCA) [13], Generalized Low Rank Approximations of Matrices (GLRAM) [14,15] and so on. Among them, the latter two methods have the equivalent tri-factorization formulations, that is, they use two-sided transformations rather than single-sided ones. Compared to SVD, GLRAM is verified experimentally to have better compression performance and consume less computation time [14].
Recently, Liu et al. [15] revealed the relationship between GLRAM and SVD, and gave a lower-bound of GLRAM's objective function; Shi et al. [16] proposed a method of matrix completion via GLRAM. The existing GLRAM is designed to handle Gaussian noise and the Frobenius norm is adopted to evaluate the magnitude of approximation errors. However, GLRAM does not work well in the presence of large sparse noise or outliers. As far as we know, the robustness of GLRAM does not have been explored or solved yet. To this end, this paper proposes a novel robust approach to GLRAM, named Robust GLRAM (RGLRAM). Mathematically, RGLRAM is formulated as an l 1 -norm minimization problem and the Augmented Lagrange Multiplier (ALM) method [17] is applied to solve this non-smooth optimization problem. We also extend RGLRAM to the tensor case and present an iterative scheme to robust low rank tensor approximations.
The rest of this paper is organized as follows. Section 2 briefly reviews GLRAM. Model and algorithm of RGLRAM are proposed respectively in Section 3 and 4. Section 5 presents the weak convergence results for the proposed algorithm under mild conditions. In Section 6, we discuss the special case of RGLRAM for a single matrix and generalize RGLRAM to the case of tensors. Experimental results on synthetic data and images of faces are reported in Section 7. Conclusions and directions for future work can be found in Section 8.

Review of Generalized Low Rank Approximations of Matrices
As the generalization of SVD, Generalized Low Rank Approximations of Matrices (GLRAM) treats each sample as a two-dimensional matrix pattern and it simultaneously carries out trifactorizations on multiple matrices. Given a collection of N two-dimensional training samples fD i 2 R mÂn g N i¼1 , we decompose each sample into the product of three matrices: where both L 2 R mÂr 1 and R 2 R nÂr 2 are orthogonal transformation matrices, M i 2 R r 1 Âr 2 , max(r 1 , r 2 ) < min(m, n). The goal of GLRAM is to seek N+2 matrices L, R and fM i g N i¼1 such that LM i R T approximates D i for all i.
Mathematically, GLRAM can be formulated as a minimization problem: where I r 1 is an r 1 ×r 1 identify matrix and kÁk F is the Frobenius norm of a matrix. If {L, R, M 1 ,. . .,M N } is the optimal solution of problem (2), then it holds that M i = L T D i R. Based on this property, we only need to compute two transformations L and R by minimizing the reconstruction errors: Generally speaking, the above optimization problem does not have a closed-form solution. An iterative procedure for computing L and R was proposed in [14]. Specifically, for fixed L, the optimal R is computed by stacking the top r 2 singular vectors of Similarly, for given R, the optimal L is constructed by the top r 1 singular vectors of The above iterative procedure is repeated until convergence. In this paper, the stopping condition is set as where ε is a small positive number. GLRAM has good compression performance due to the fact that D i is approximated by LM i R T . To store {L, R, M 1 ,. . .,M N }, mr 1 + nr 2 + Nr 1 r 2 scalars are required. Hence, the compression ratio via GLRAM is Nmn/(mr 1 + nr 2 + Nr 1 r 2 ). Moreover, formulation (1) is equivalent to: where denotes the Kronecker product and vec(Á) is a vectorization operator which stacks the columns of a matrix into a long column vector. Based on formulation (5), GLRAM can be interpreted as a linear representation problem, where R L is the basis matrix.

Robust Model of Generalized Low Rank Approximations of Matrices
We rewrite the tri-factorization formulation (1) as where E i 2 R mÂn are noise matrices. If the entries of E i obey independent normal distributions with zero means, then their maximum likelihood estimator is equivalent to minimizing the objective function of problem (2). As is known to all, this objective function, also called cost function, is very sensitive to large sparse corruptions or outliers. For this purpose, we now assume that the entries of E i are generated by independent Laplacian distributions. According to the method of maximum likelihood estimation, the hidden factors {L, R, M 1 ,. . .,M N } can be obtained by solving the following l 1 -norm minimization problem: where kÁk 1 is the component-wise l 1 -norm of a matrix (i.e. the sum of absolute values of a matrix). To simplify the representation of the objective function in problem (7), we introduce N additional variables E 1 ,. . ., E N . At this moment, the above minimization problem is equivalent to: We call problem (7) or (8) as Robust Generalized Low Rank Approximations of Matrices (RGLRAM).
Compared with problem (2), problem (7) or (8) is more robust to large sparse noise or outliers on account of the fact that the component-wise l 1 -norm of a matrix is the tightest convex relaxation of the component-wise l 0 -norm (i.e. the number of non-zero elements). Although problem (8) belongs to the l 1 -norm minimization, it is non-convex. Therefore, we can not directly use the available methods of compressed sensing to solve it. In the next section, we will apply the Augmented Lagrange Multipliers (ALM) method to solving problem (8).

Algorithm to Robust Generalized Low Rank Approximations of Matrices
The method of Lagrange multipliers is a strategy for converting the optimization problem with equality constraints into an unconstrained problem. In view of this, we propose an iterative algorithm based on ALM to solve problem (8).
For the sake of simplicity, we set Without considering the orthogonal constraints in problem (8), we construct its partial augmented Lagrange function: where hÁ,Ái indicates the inner product between matrices, the penalty factor μ is positive, Y i 2 R mÂn are Lagrange multiplier matrices and Y ¼ fY i g N i¼1 . For given Y, a block-based gradient descent search technique is proposed to minimize f μ (L, R, M, E, Y). Concretely speaking, we let one block of variables be unknown and other blocks of variables be fixed at each iteration, and then the unknown block of variables is updated by minimizing f μ (L, R, M, E, Y). The specific update formulations are derived as follows.
Computing L. If L is unknown and other variables are fixed, we update L by minimizing f μ (L, R, M, E, Y) with respect to L. Considering the orthogonal constraints in problem (8), we have Hence, it holds that arg min L T L¼I r 1 f m ðL; R; M; E; YÞ (11) is known as the Stiefel manifold. For the optimization problems on Stiefel manifolds, Edelman et al. developed new Newton and conjugate gradient algorithms [18]. These two algorithms have heavy computation burden because they require the Hessian of the objective function. To reduce the computation complexity, we relax the orthogonal constraint into a sphere constraint and thus have It is obvious that the optimal solution of max kLk 2 F ¼r 1 hL; Pi is ffiffiffi ffi does not satisfy the orthogonal constraint. Then we set where QR(Á) means the thin QR decomposition on a matrix and L is an orthonormal basis of the range space of P. It can not be guaranteed that the iteration formulation (13) decreases the value of f μ (L, R, M, E, Y). Since ffiffiffi ffi the value of f μ (L, R, M, E, Y) surely does not increases after updating M i . Computing M. When M j is unknown and other blocks of variables are given, the update formulation for M j is calculated as: Computing R. For each j 2 {1, 2,. . ., N}, we have This equation means that the iterative formulation of R is similar with that of L. Let Then R is updated as It is worth noting that Eq (17) is followed by the calculation procedure of M in order to decrease the value of f μ (L, R, M, E, Y). Computing E. Fix L, R, M and Y, and minimize f μ (L, R, M, E, Y) with respect to E j : where S d ðÁÞ : R mÂn ! R mÂn is an absolute value shrinkage operator defined by ( for arbitrary real matrix X = (x ij ) m×n and δ > 0.
Computing Y. For given L, R, M and E, we calculate Y j according to the following formulation Denote the m×n zero matrix by O m×n . The whole iterative procedure is outlined as follows. Algorithm: ALM algorithm for RGLRAM Input: Data matrices fD i g N i¼1 , r 1 and r 2 .
Step 1: while not converged do Update L according to (13); Update M according to (15); Update R according to (17); Update M according to (15); Update E according to (18); Update Y according to (20); Update μ as μ:= min(ρμ, μ max ); End while. Output: L; R; fM i g N i¼1 and fE i g N i¼1 . During the initialization, we can stipulate L = orth(randn(m,r 1 )) and R = orth(randn(n,r 2 )), where orth(Á) is an orthonormal basis operator for the range of a matrix and randn(m, r 1 ) is an m×r 1 random matrix generated by the standard normal distribution. The stopping condition of can be set as ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi or the maximum number of iterations is reached, where ε is a sufficiently small positive number. Most of the time spent by the above algorithm is in the matrix multiplication and the thin QR decomposition. For the convenience of description, we suppose that m ! n > r 1 ! r 2 . Under the assumption, the computation complexity of GLRAM at each loop is O(Nmnr 2 ). Subsequently, we discuss the computation complexity of the ALM algorithm. It takes O(Nmnr 2 ) time for computing P and Oðmr 2 1 Þ time for performing the QR decomposition on P. Hence, the computation time of L is O(Nmnr 2 ). Similarly, the computation time of R is also O(Nmnr 2 ). Moreover, the computation times of M, E and Y are all O(Nmnr 2 ). In conclusion, the total computation complexity of the ALM algorithm at each loop is O(Nmnr 2 ), which means that RGLRAM has the same computation complexity as RGLRAM.

Convergence Analysis on the Proposed Algorithm
To the best of our knowledge, the convergence theory of ALM algorithms has not been established on non-convex problems or convex problems with more than two blocks of variables. Consequently, it is very difficult for us to prove the convergence of the proposed ALM algorithm. However, empirical evidence in Section 7 shows the strong convergence behavior of our algorithm. This section will discuss the weak convergence results of the proposed algorithm under mild conditions. If both L and R are known, then problem (8) is reformulated as The augmented Lagrange function of the above problem is To solve problem (22), we can simply revise the ALM algorithm by deleting the iterative formulations of L and R. In such a case, we have the following statement.
Theorem 1. Let {M (k) , E (k) , Y (k) } be the sequences produced by the revised ALM algorithm. = 0 and hðEÞ ¼ X N i¼1 kE i k 1 . It is obvious that g(M) and h(E) are two closed, proper and convex functions. Let is the diagonal block matrix operator. Then the equality constraints in problem (22) can be re-expressed as Bx + y = d. We denote g(x) = g(M) and h(y) = h (E). According to the basic convergence result given in [19], we have lim under the constraints of Bx + y = d. This ends the proof. Moreover, problem (8) is essentially equivalent to the following minimization problem without orthogonal constraints: Its un-augmented Lagrange function is On the basis of the differentials or sub-differentials of φ(L, R, M, E, Y) with respect to each block of variables, we have the following KKT conditions for problem (24): where @kÁk 1 is the set of sub-differentials of the component-wise l 1 -norm of a matrix. For arbitrary μ > 0, the condition Y j 2 @kE j k 1 is equivalent to . Hence, we arrive at according to the equation D j = LM j R T + E j . The above analysis means that the condition Y j 2 @kE j k 1 in (26) can be replaced by (28). be an iterative sequence produced by the ALM algorithm. If lim Proof. By the iterative formulations (18) and (20), we have The term lim k!1 X ðkþ1Þ À X ðkÞ F ¼ 0 implies that both sides of (29) and (30) tend to zeros as k approaches to infinity. Therefore, it holds that Moreover, by we have lim This completes the proof.

Special Case and Tensor Extension of RGLRAM
This section will conduct a further discussion on RGLRAM from two aspects. Firstly, we consider the robust low rank approximations of a single matrix. Secondly, we study the robust low rank tensor approximations by generalizing the proposed method to higher order tensors.
The Case for N = 1 Now, we consider a special case of RGLRAM, that is, N = 1. Under this case, problem (8) is rewritten as We call this problem as l 1 -norm SVD. In other words, problem (34) is the robust version of SVD. Recently, Wright et al. [6] and Candès et al. [7] proposed a principal component pursuit method to recover simultaneously the low rank and the sparse components. The proposed method, also referred to as Robust PCA (RPCA), is described mathematically as a convex optimization problem: where kAkÃ is the nuclear norm of A (i.e. the sum of singular values of A) and the tradeoff parameter λ > 0. A good rule of thumb for determining the parameter λ is that l ¼ 1= ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ffi maxðm; nÞ p [7]. The main algorithms for solving the nuclear minimization problem (35) suffer from high computation burden of SVD at each iteration. If the rank of the low rank component is properly estimated, then RPCA is boiled down to problem (34).
We further set r 1 = r 2 = r. Then problem (34) is transformed into where U 2 R mÂr ; V 2 R nÂr . This problem is also named as l 1 -norm PCA. An alternative optimization method was proposed to solve it [21]. Specifically, we minimize one argument U or V while keeping the other one fixed and then the optimization problem is decomposed into m or n independent linear programmings. Kwak [22] proposed another method of PCA based on l 1norm maximization. This method changes problem (36) into the following formulation Although V is absent in the above problem, it is still very hard to directly solve it. A greedy strategy [22] was proposed to solve problem (37). However, the successive greedy solutions might not be ideal.

Extensions to Higher Order Tensors
A K-order tensor is defined as X 2 R I 1 ÂI 2 ÂÁÁÁÂI K whose (i 1 The inner product, component-wise l 1 -norm and Frobenius norm of matrices can be easily generalized to tensors. Refer to the survey [23] for further understanding on tensor algebra.
In the following, we first consider the tensor expression of RGLRAM. Three-order tensors D, M and E are formed by concentrating the collections of fD i g The corresponding partial augmented Lagrange function is where Y 2 R mÂnÂN is the Lagrange multiplier tensor. Next, we extend RGLRAM to the general case of tensors and propose the corresponding iterative scheme to robust low rank tensor approximations. Consider a K-order data tensor D 2 R I 1 ÂI 2 ÂÁÁÁÂI K corrupted by large sparse noise. We assume that the tensor D is intrinsically low rank. Thus, D is decomposed into the sum of a low rank tensor and a noise component, that is, where M 2 R r 1 Âr 2 ÂÁÁÁÂr K , L i 2 R I i Âr i , E 2 R I 1 ÂI 2 ÂÁÁÁÂI K , r i < I i , i = 1,2,. . ., K. We can further impose the orthogonal constraints on all mode matrices L i . To recover the low rank and the sparse terms, we solve the following minimization problem min L;M;E kEk 1 s:t: where L ¼ fL i g K i¼1 . If K = 3 and L 3 ¼ I r 3 , then problem (41) is changed into problem (38). Without considering the orthogonal constraints in problem (41), we construct its partial augmented Lagrange function: Similarly, we propose an ALM method to solve problem (41). Concretely speaking, if Y is fixed, we alternately update each block of variables by minimizing f m ðL; M; E; YÞ with respect to one argument.
Computing L. For fixed j 2 {1, 2,. . ., K}, let L j be unknown and other blocks of variables be known. Then, the update formulation of L j is in the following: In view of the orthogonal property of L j , we take Computing M. Once L j is updated, we immediately compute M. The calculation procedure is as follows Computing E. Let E be unknown and other variables be fixed. We update E according to the following formulation: where S 1=m ðÁÞ is the tensor generalization of absolute value shrinkage operator (19).
Computing Y. The update for Y is as follows The complete algorithm description of the robust low rank tensor approximations is omitted due to its similarity with RGLRAM.

A. Ethics Statement
Some face datasets were used in this paper to verify the performance of our method. These face datasets are publicly available for face recognition research, and the consent was not needed. The face images and the experimental results are reported in this paper without any commercial purpose.

B. Experimental Results
In this section, we carry out experiments on synthetic data and real-world face datasets, and illustrate the feasibility and effectiveness of the proposed method. The experimental results of RGLRAM are compared with previous state-of-the-art methods: PCAL1 [22], RPCA and GLRAM. Both PCAL1 and RPCA are implemented on each training sample D i and the inexact ALM algorithm [17] is employed to solve RPCA. For the aforementioned four methods, the tolerance error ε is set to be 10 −8 and the maximum number of iterations is 1000. All experiments are performed using Matlab R2012a on an Intel Core i3-3220 3.30 GHz machine with 3.48 GB RAM.
Synthetic data. In this subsection, we synthesize N data matrices fD i 2 R mÂn g N i¼1 according to the following formulations: D i = A i + E i , where A i are low rank and E i are sparse. Concretely speaking, A i are generated randomly as follows: where U 2 R mÂr 0 1 , V 2 R nÂr 0 2 , S i 2 R r 0 1 Âr 0 2 and their entries are independently drawn from the standard normal distribution. As for each E i , we first define N projection operators ( for arbitrary X ¼ ðx jk Þ mÂn 2 R mÂn , where O i are produced by uniformly sampling from {1,2,. . ., m}×{1,2,. . ., n} with probability p, i = 1,2,. . ., N. Then we generate E 0 i 2 R mÂn by a uniform distribution on the interval (−a, a). Finally, we set The magnitude of noise is measured by the Inverse Signal-to-Noise Ratio (ISNR) defined as Large value of ISNR means large noise and it is probably disadvantageous to recovering the low rank components.
For each D i , we can obtain its low rank approximation A _ i and noise component E _ i by some method discussed in this paper. The Relative Errors (REs) are adopted to evaluate the recovery performance of the low rank and the sparse components respectively and they are defined as follows: The smaller the relative error is, the better the recovery performance is. In detailed implementation, the parameters setting is stipulated as follows: m = n = 100, N = 50, r 1 = r 2 = r, r 0 1 ¼ r 0 2 ¼ r 0 . We initialize the parameter r to be r 0 and vary it from 10 to 50 with an interval of 10. Three groups of experiments are designed according to the magnitude of noise, that is, we take a = 0.5,5 and 15, respectively. In each group of experiments, we consider two different sampling rates for O i : p = 0.1 and 0.2. For fixed r, a and p, the experiments are repeated 10 times and the average results are reported. The experimental results are shown in Tables 1, 2 and 3, respectively. From these three tables, we have the following observations. (I) For given levels of noise, PCAL1 cannot effectively recover the low rank and the sparse components although it has superiority in running time. The smallest relative error of PCAL1 is 0.154 and the values of RE1 is larger than or equal to 1.58 when a = 5 or 15. To some extent, the failure of PCAL1 is resulted in the relatively large ISNR. (II) RPCA achieves good recovery performance for small r and has strong robustness to large sparse noise. When r = 10, the largest relative error of RPCA is 0.00644, which indicates that RPCA obtains satisfactory recovery performance. However, RPCA deteriorates drastically with the increasing of r. Besides, it also has the longest running time. (III) For relatively small noise corruptions (i.e. a = 0.5), the RE1 values of GLRAM are grudgingly acceptable while the RE2 values are unacceptable. In the presence of medium or large noise corruptions (i.e. a = 5 or 15), GLRAM can not effectively recover both the low rank and the sparse matrices. (IV) If a 2 {0.5, 5} or p = 0.1, the smallest relative error of RGLRAM is 1.46×10 −7 , which means RGLRAM almost exactly recovers both the low rank and the sparse components. The ISNR corresponding to the case (p,a) = (0.2,15) is so large that RGLRAM can not effectively the low rank and the sparse terms. With regard to the running time, RGLRAM  is generally longer than GLRAM and shorter than RPCA. In summary, RGLRAM is the most effective method for recovering the low rank components among these four methods. Sensitivity to initialization and parameter choice. This subsection will evaluate the sensitivity of the ALM algorithm to the initialization of {L, R, E, Y} and the choice of {r 1 , r 2 }. We still use the artificial datasets generated by the manner of the previous subsection.
A simple initialization strategy for {L, R, E, Y} is adopted in Subsection 7.1, that is, both L and R are orthogonalized randomly while E i and Y i are initialized to be zero matrices. For the special synthetic data, we observe from Tables 1, 2 and 3 that RGLRAM successfully recover the low rank and the sparse components if a 2 {0.5, 5} or p = 0.1. Under this situation, the standard deviation of running time varies in the interval [0.76, 4.58], that of RE1 varies in the interval [1.77×10 −9 , 7.52×10 −8 ] and that of RE2 varies in the interval [9.02×10 −10 , 1.43×10 −8 ]. These observations on some datasets illustrate that the ALM algorithm is not very sensitive to the random initialization of L and R. Subsequently, we will discuss the sensitivity of the ALM algorithm to the initialization of E and Y.
In previous experiments, we initialize E i and Y i with zero matrices. Nowadays, a new initialization method is considered as below: E i = λ 1 × randn(m, n), Y i = λ 2 × randn(m, n), where λ 1 and λ 2 are two given real numbers, i = 1,2,. . ., N. It is obvious that E i and Y i are zero matrices if λ 1 = λ 2 = 0. We consider 11 different values for λ 1 and λ 2 , that is, λ 1 , λ 2  Robust GLRAM experimental results are described in Fig 1, where the left shows the relative errors under different λ, and the right plots the error bar of running time with varying λ. From Fig 1, we can see that the relative errors lie in the range from 10 −9 to 10 −8 , which means RGLRAM obtains the relatively stable values. In addition, RGLRAM has small fluctuation in the average running time and the corresponding standard deviation is no more than 1.22 seconds. The above experiments implies that for our synthetic data, the ALM algorithm is also not too sensitive to the random initialization of both E and Y.
Next, we investigate the influence of r on the recovery performance. For this purpose, we consider the case that p = 0.1, r 0 = 20 and a 2 {0.5, 5}. We vary the value of r from 18 to 30. Fig 2 illustrates the comparison of relative errors with different r. We can see from this figure that both RE1 and RE2 of RGLRAM are less than 0.0068 when a = 0.5 and 20 r 24, and less than 0.0080 when a = 5 and 20 r 28. These observations on synthetic data show RGLRAM can successfully recover the low rank and the sparse components when the rank r belongs to some interval. Furthermore, r should be large than or equal to r 0 to obtain better recovery performance. It is probable that the parameter r has large taking value interval in the presence of large noise. In a word, RGLRAM is not very sensitive to the choice of r on the synthetic data generated by our paper.
Model evaluation and computation complexity verification. In this subsection, we will evaluate the generalization performance of RGLRAM and verify its computation complexity through extensive experiments.
The S-fold cross validation is employed to evaluate the generalization ability of RGLRAM. We first randomly partition the given samples into S equal sized subsamples. Then one subsample is chosen as the validation for testing the model, and the remaining S-1 subsamples are used as the training set. This validation process is repeated S times. Finally, we average the S results from the folds. In implementation, we first obtain two orthogonal matrices L and R by learning from the training set, and then apply them to the testing set.
We still use the synthetic data in Subsection 7.1 and set N = 50, p = 0.1, a = 5, m = n = r, S = 10. For a test sample D test = A test + E test , we hope to obtain its low rank approximation LM test R T , where A test and E test are the low rank and the sparse components of D test respectively. The optimal M test can be obtained by solving the minimization problem: This optimization problem can be regarded as the special case of problem (8) and resolved by simply revising the ALM algorithm. The relative error RE1 is employed to evaluate the model, and the training error and the test error are listed in Table 4. It can be seen from this table that RGRLAM not only has very small training error, but also almost exactly recover the low rank components in the test procedure. Hence, RGRLAM has good generalization ability. Section 4 discusses the computation complexity of the ALM algorithm in one loop. Now, we verify the relationship of running time with the size of matrices and the sample size. To this end, we design two groups of experiments. In the first group of experiments, we study how the total running time changes with the increasing of size of matrices. For the sake of convenience, we let N = 1, a = 0.5, p = 0.1, m = n = 100i and r 1 ¼ 25. For fixed parameters, the experiments are repeated only once to save time and the experimental results are illustrated in Fig 3, where the right performs a parabola fitting on the discrete running times as a reference curve. We can draw two conclusions from this figure: RGLRAM almost exactly recovers both RE1 and RE2 in most cases; the running time is about a parabola function relation with m. The second conclusion is identical with the result of Section 4.
The second group of experiments aims to explore the relationship of total running time with N. Let m ¼ n ¼ 100; r 1 ¼ r 2 ¼ r 0 1 ¼ r 0 2 ¼ 20; a ¼ 0:5; p ¼ 0:1 and N = 50i, where i = 1,2,. . .,40. We conduct the experiments only once for given parameters and outline the  From the left subplot, we can see both RE1 and RE2 approach to zeros. A linear fitting on the discrete running times is shown in the right subplot. From this plot, we find out that the running time has the linear relationship with N on the whole. This conclusion is also in accordance with the result in Section 4.
Applications in face images datasets. We carry out the experiments on two well-known face image datasets: Olivetti Research Laboratory (ORL) dataset [24] and Yale face dataset [3].  We first compare the performance of low rank recovery by adding salt and pepper noise to each image with a density of 0.1. At this moment, the aforementioned two face datasets corrupted by sparse noise are represented by two collections of matrices. We consider four low rank recovery methods and one denoising method listed as follows: PCAL1, GLRAM, RPCA, RGLRAM and Total Variation (TV) [25]. Let r 1 = r 2 = 20 in both GLRAM and RGLRAM, and the low rank to be 20 in PCAL1. Compared with PCAL1, RPCA and TV, both GLRAM and RGLRAM have better compression performance. The compression rate of GLRAM or RGLRAM is 10.08 on ORL, and that of GLRAM or RGLRAM is 9.86 on Yale. The experimental results are partially shown in Figs 5 and 6, respectively.
From the above two figures, we can see that the PCAL1, GLRAM and TV can not efficiently remove the noise and recover facial contour features. In contrast, both RPCA and RGLRAM have better recovery performance. In terms of image reconstruction, RPCA seems to have better quality than RGLRAM, which maybe interpreted as that we choose the relatively small ranks in RGLRAM. Furthermore, RGLRAM has superiority over RPCA in running time and compression ratio. For instance, RGLRAM requires 82.79 seconds on ORL and 32.93 seconds on Yale, and the running time of RPCA is 212.30 seconds on ORL and 86.65 seconds on Yale. To sum it up, RGLRAM has the best performance in recovering low rank components.
Secondly, we compare the values of Peak Signal to Noise Ratio (PSNR) with different rank and noise density, where PSNR is defined as follows: where D i indicate the image matrices corrupted by noise and A _ i are the recovered matrices. We set r 1 = r 2 = r and vary r from 10 to 30 with an interval of 5. The density of salt and pepper noise is denoted by p. The experiments are repeated 10 times for fixed parameters and the average values of PSNR are recorded. The PSNR comparison among PCAL1, GLRAM and RGLRAM is shown in Table 5, and the comparison between TV and RPCA is shown in Table 6.
By combining Tables 5 with 6, we can see that RGLRAM achieves the best PSNR in all cases. These five methods are sorted in order of decreasing of PSNR as follows: RGLRAM, GLRAM, RPCA, PCAL1 and TV. On the average, the PSNR of RGLRAM is 4.55 large than that of GLRAM on ORL, and 5.28 on Yale. These results show that RGLRAM has the best recovery performance.

Conclusions
In this paper, we study the model of robust GLRAM and present an iterative algorithm based on the technique of ALM. The proposed method can also be employed to solve the problem of robust SVD or l 1 -norm PCA. Then we consider the tensor version of robust GLRAM and propose an iterative scheme to robust tensor approximations. It is illustrated experimentally that our method can recover perfectly both the low rank and the sparse components under some conditions.
Our study so far is limited to recovering the low rank and the sparse components for the collections of low rank matrices with a fraction of their entries being arbitrarily corrupted. It would be interesting to investigate the situation that the noise is the superposition of dense small perturbation and sparse gross errors. Moreover, it still needs further research on robust tensor approximations and robust GLRAM with missing entries.