Learning Low-Rank Class-Specific Dictionary and Sparse Intra-Class Variant Dictionary for Face Recognition

Face recognition is challenging, especially when images from different persons are similar to each other due to variations in illumination, expression, and occlusion. If we have sufficient training images of each person that span the facial variations of that person under testing conditions, sparse representation based classification (SRC) achieves very promising results. However, in many applications face recognition encounters the small sample size problem arising from the small number of available training images per person. In this paper, we present a novel face recognition framework that combines low-rank and sparse error matrix decomposition with sparse coding techniques (LRSE+SC). Firstly, the low-rank matrix recovery technique is applied to decompose the face images of each class into a low-rank matrix and a sparse error matrix. The low-rank matrix of each individual is a class-specific dictionary that captures the discriminative features of this individual. The sparse error matrix represents the intra-class variations, such as illumination and expression changes. Secondly, we combine the low-rank part (representative basis) of each person into a supervised dictionary and integrate the sparse error matrices of all individuals into a within-individual variant dictionary, which can represent the possible variations between the testing and training images. These two dictionaries are then used to code the query image. The within-individual variant dictionary is shared by all subjects and only contributes to explaining the lighting conditions, expressions, and occlusions of the query image, not to discrimination. Finally, a reconstruction-based scheme is adopted for face recognition. Since the within-individual dictionary is introduced, LRSE+SC can handle corrupted training data as well as the situation that not all subjects have enough samples for training.
Experimental results show that our method achieves the state-of-the-art results on AR, FERET, FRGC and LFW databases.

Introduction

In addition, these methods are problematic when the training set is unbalanced, in the sense that certain individuals have very few training samples compared to others.
The low-rank matrix recovery technique has been successfully applied to various fields, for instance multimedia [15], document analysis [16], salient object detection [17], and image processing [18]. One representative is robust principal component analysis (RPCA) [19], which decomposes a corrupted matrix into a sparse component and a low-rank component. RPCA can be exactly solved via a nuclear norm regularized minimization problem. Considering that the training images are corrupted, low-rank matrix recovery has been used for denoising. Ma et al. [20] exploit rank minimization and propose a discriminative low-rank dictionary learning for sparse representation (DLRD_SR). DLRD_SR separates the noise in the training images by minimizing the rank of the sub-dictionary of each class. A low-rank matrix recovery algorithm with structural incoherence for robust face recognition has been presented by Chen et al. [21]. Chen's method accounts for noise in the training images and achieves good results when the training images are corrupted by occlusion and disguise. In [22], a discriminative and reconstructive dictionary is constructed and a discriminative low-rank representation for image classification is obtained. These algorithms separate the sparse noise from the training images and are robust to severe illumination variations or occlusions.
In practical scenarios, facial images contain uncertain and noisy information, such as illumination conditions, expression conditions, or occlusions. If given training images of each subject that cover the facial variations of that person under testing conditions, face recognition becomes an easy task. Obviously, this situation is not practical, and face recognition is a small sample size problem in general [23]. However, the original SRC algorithm assumes that there is a sufficient number of training samples for each class. Therefore, Wagner et al. [24] extend SRC and introduce a method to obtain a set of training images of each subject covering all possible illumination changes. In order to cope with the small sample size problem under the SRC framework, Extended SRC (ESRC) is proposed by Deng et al. [25], which utilizes images collected from external data to construct an intra-class variant dictionary. The variant dictionary is applied to represent the possible variations between the training and testing images. With the help of the intra-class variant dictionary, ESRC outperforms SRC, especially when there is a single training image per class. However, there are two shortcomings in ESRC. Firstly, ESRC needs an external dataset and requires that the external data be highly relevant to the training and testing data, which may not be readily available in real applications. Moreover, images collected from external data might contain noisy, redundant, or undesirable information which degrades their ability to cover intra-class variations [26]. Secondly, ESRC cannot deal well with cases where the training data are corrupted. Given the training images of each class, ESRC and SRC do not consider the difference between the subject-specific feature, also known as the discriminative vector of each subject, and the intra-class variant feature.
The intra-class variant feature, capturing image-specific details such as expression conditions, is non-discriminative and can be shared by all subjects. Fig 1 shows an example, in which one training image of Person A is occluded by sunglasses. SRC may treat the occluded region (the sunglasses) as an inherent feature of Person A and make a wrong decision. For this reason, Mi et al. [27] have proposed two robust face recognition methods based on linear regression (RLRC 1 and 2). They consider that each class-specific subspace is spanned by two kinds of basis vectors: common basis vectors shared by many classes, and class-specific basis vectors owned by one class only.
In this paper, we consider face recognition from frontal views only. Hence, the facial images of the same person correlate with each other, and if we stack the training images of the same subject into a matrix, this matrix should be approximately low-rank. To build a classifier that is robust to the small sample size problem and to unbalanced training sets, we propose a novel face recognition framework using low-rank and sparse matrix decomposition together with sparse coding techniques (LRSE+SC). First, the training images of each individual are decomposed into a low-rank representation basis matrix and a sparse error matrix. The representation basis matrix determines the class-specific subspace. Many methods, for example DLRD_SR, ignore the interesting information contained in the sparse large noise. The sparse error matrix, which represents the gross corruption of the training images, such as expression, occlusion, or illumination conditions, is very important for face recognition. It consists of the noise or within-individual variance and can explain why two images of the same subject do not look identical. Second, the representation basis matrices of all subjects are collected into a supervised dictionary. Meanwhile, we integrate the sparse error matrices of all subjects into a within-individual variant dictionary shared by all classes. We then combine the supervised dictionary with the within-individual variant dictionary to encode a query image under a sparsity constraint. In this way, the class-specific dictionary differentiates the subjects, and the within-individual variant dictionary provides the essential reconstruction of the query image. Fig 2 presents the motivation of the proposed approach. Finally, as in SRC, a reconstruction-based scheme for classification is adopted. As Fig 2 shows, the query image is successfully recognized by our method. The experiments demonstrate that our method achieves very promising performance.
The three main contributions of this paper are as follows. Firstly, we decompose the training images of each class into a low-rank part and a sparse part by low-rank matrix recovery. The low-rank part is a representation basis matrix of each class and determines the class-specific subspace. The class-specific dictionary captures the discriminative features of each class and is owned by only one class. The sparse part accounts for intra-class variance and can be shared by other subjects. Hence, each image can be decomposed into a vector from the intra-class variant subspace and a discriminative vector from the class-specific subspace. Secondly, we analyze the reason why SRC does not work effectively when there are not enough training samples for each class. Thirdly, a supervised dictionary and a within-individual variant dictionary are built to sparsely encode the query image. Our method differs from traditional dictionary learning methods (e.g. MFL, LC-KSVD), which do not consider the problem that not all individuals have plenty of training samples. Most importantly, we separate the within-individual variance information from the training images and introduce an auxiliary dictionary built from the sparse error matrix of each class.

Background
Firstly, let us present a typical face recognition problem. Assume there are $n$ training images from $c$ distinct classes, with $n_i$ training images from the $i$th subject, $i = 1, 2, \ldots, c$. Each image is represented as a vector $x_j^i \in \mathbb{R}^{m \times 1}$, denoting the $j$th image from the $i$th class, where $m$ is the dimension of the feature space for all images. $X_i = [x_1^i, x_2^i, \ldots, x_{n_i}^i] \in \mathbb{R}^{m \times n_i}$ consists of the training images belonging to the $i$th subject, and $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{m \times n}$ denotes the training image matrix obtained by concatenating all training samples. Given the training samples in $X$, the aim of face recognition is to classify a query image $y \in \mathbb{R}^m$.
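As a concrete toy illustration of this notation (all dimensions below are made up, not taken from the paper), the per-class image vectors are stacked as columns of $X_i$ and then concatenated into $X$:

```python
import numpy as np

# Toy illustration of the notation above: m-dimensional image vectors,
# n_i columns per class in X_i, concatenated into X of size m x n.
rng = np.random.default_rng(0)
m, counts = 100, [3, 5, 4]                      # feature dim and n_i per class (illustrative)
X_per_class = [rng.standard_normal((m, n_i)) for n_i in counts]
X = np.hstack(X_per_class)                      # m x n, with n = sum(n_i) = 12
print(X.shape)                                  # -> (100, 12)
```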
Recent research has proved that linear regression based algorithms, e.g. Nearest Feature Subspace (NFS) [28], Linear Regression Classification (LRC) [29], and Sparse Representation Classification (SRC), are extremely easy to use and powerful for face recognition. The linear regression based algorithms assume that the images of one individual subject lie on a class-specific linear subspace.
Sparse representation has attracted broad interest in various domains due to its great success in image processing. The basic idea of sparse representation is to represent a given test sample correctly with as few training samples as possible. SRC assumes that training samples from each subject lie on a linear subspace spanned by the training images of that subject. Therefore, for a test sample $y \in \mathbb{R}^m$ belonging to class $i$, given sufficient training samples of class $i$, we have

$$y = X_i \beta_i, \qquad (1)$$

where $\beta_i = [\beta_1^i; \beta_2^i; \ldots; \beta_{n_i}^i] \in \mathbb{R}^{n_i}$ is the coefficient vector corresponding to $X_i$. Of course, this class-specific subspace is embedded in the linear space spanned by all the training images. Hence, $y$ can be cast as a linear combination of all training samples, i.e.,
$$y = X\beta, \qquad (2)$$

where $\beta \in \mathbb{R}^n$ is a coefficient vector. Actually, we don't know which class the query image $y$ comes from. Hence, the goal of sparse representation is to represent $y$ using as few training images as possible, which is computed by solving the following minimization problem:

$$\hat{\beta} = \arg\min_{\beta} \|\beta\|_0 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \varepsilon, \qquad (3)$$

where $\varepsilon$ is a pre-specified small constant and $\|\cdot\|_0$ denotes the $\ell_0$-norm. The above problem is NP-hard, and it can be solved by approximating the $\ell_0$-norm with an $\ell_1$-norm based convex relaxation. Hence, problem Eq (3) can be transformed into minimizing the reconstruction error with an $\ell_1$-norm regularizer, i.e.,
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad (4)$$

where $\lambda$ is a scalar constant. In the ideal case, the entries of $\hat{\beta}$ are zero except those associated with the columns of $X$ from the $i$th class. In practice this does not hold exactly: the recovered coefficient vector $\hat{\beta}$ has most of its non-zero entries corresponding to the atoms belonging to the ground-truth class of the query image, while a few non-zero values are distributed elsewhere. Therefore, the query image $y$ is assigned to the class with the minimum reconstruction residual.
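To make this pipeline concrete, here is a small self-contained sketch: Eq (4) is solved with a plain ISTA (proximal gradient) loop rather than the faster solvers used in the literature, and the query is assigned by the minimum class-wise residual. Function names and parameter values are ours, not the authors':

```python
import numpy as np

def ista_l1(X, y, lam=0.01, n_iter=500):
    """Minimize 0.5*||y - X b||_2^2 + lam*||b||_1 by ISTA (proximal gradient)."""
    L = np.linalg.norm(X, 2) ** 2               # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        g = b + X.T @ (y - X @ b) / L           # gradient step
        b = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # soft-thresholding
    return b

def src_classify(X_list, y, lam=0.01):
    """SRC sketch: code y over all training images, then pick the class
    whose own atoms give the smallest reconstruction residual."""
    X = np.hstack(X_list)
    b = ista_l1(X, y, lam)
    residuals, start = [], 0
    for X_i in X_list:
        b_i = b[start:start + X_i.shape[1]]
        residuals.append(np.linalg.norm(y - X_i @ b_i))
        start += X_i.shape[1]
    return int(np.argmin(residuals))
```

With well-separated toy classes, a query lying in the span of class 0's images is assigned to class 0.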
For SRC, the original training images act as a dictionary to represent the query image. Because the original face images may contain noisy, uncertain, or redundant information that can be detrimental to recognition, learning a dictionary from training images has become an active topic. Yang et al. [8] propose a metaface learning (MFL) algorithm to represent the query image by a series of dictionaries learnt from each class. In order to achieve good performance, many discriminative dictionary learning methods have been presented [9] [10] [11] [12] [13]. However, these dictionary learning methods need a sufficient number of training samples per class.

Proposed Method
In practice, training images which are corrupted (e.g., by occlusions, lighting variations, or facial expressions) violate the linear subspace assumption. Furthermore, due to an insufficient number of training images, the query images of the $i$th class may not lie in the subspace spanned by the training images $X_i$. Therefore, the performance of SRC deteriorates in these two situations. In order to handle the small sample size problem, the leave-one-class-out subspace model has been proposed [27]. The leave-one-class-out subspace of each class consists of all the common vectors and the class-specific basis vectors of the other classes, but does not include any class-specific basis vectors of the class itself. In this section, we propose a novel face recognition framework utilizing low-rank and sparse error matrix decomposition together with sparse coding techniques. Unlike the leave-one-class-out subspace model, our method can explicitly extract class-specific basis vectors owned by only one class and separate within-individual variant basis vectors from the original training images.

Basic Assumption
In this paper, we do not consider the impact of variations in pose and age. Because images are affected by variability in illumination, expression, and occlusion, images of the same individual do not look identical to each other. We assume that $x_j^i$ comes from the $i$th individual and can be represented as

$$x_j^i = x^i + e_j^i, \qquad (5)$$

where $x^i$ is the clean and neutral image of the $i$th individual, and the term $e_j^i$ consists of noise or within-individual variance and is sparse. $e_j^i$ may contain information about illumination conditions, expression conditions, and even occlusions in the image $x_j^i$. That is, a facial image can be decomposed into a neutral component and a sparse component pertaining to details on the face, such as expressions or occlusions (see Fig 3). Under this assumption, for another image $x_k^i$ ($j \ne k$) of the $i$th subject, the difference between $e_j^i$ and $e_k^i$ explains why the two images $x_j^i$ and $x_k^i$ both belong to the $i$th subject but do not look identical. On the other hand, two images from different subjects may have the same within-individual variance $e$. For example, in Fig 1, the query image looks like the training image with sunglasses from Person A. Hence, many methods classify the query image as Person A due to the sunglasses. If the sunglasses are separated from the query image, we may make the right decision. For an image $x_j^i$, it can be decomposed into the signal $x^i$ and the noise component $e_j^i$.
$x^i$ captures the structured patterns of the $i$th subject and thus can be used for classification, while the within-individual variance $e_j^i$ only contributes to the essential representation of the image $x_j^i$.

Face Recognition by Using Low-rank Matrix Recovery and Sparse Coding Techniques
Wright et al. choose the training samples as the dictionary for sparse coding. If the training images are corrupted, SRC fails to extract the class-specific features of each subject from the original training images and cannot handle such cases. For example, in Fig 1, a training image of Person C is occluded by a scarf, and the occluded region (the scarf) might be regarded as a structural pattern of Person C. According to Eq (5), there exist common patterns and within-individual variance among images of the same class. The variability caused by unbalanced lighting changes, variable expressions, and occlusions can be shared by many subjects. On the other hand, SRC requires a large number of training samples of each subject to span the complete class-specific subspace. In this paper, we try to mitigate the negative effects of this specific variance and, at the same time, utilize it.
Recently, the low-rank matrix decomposition technique has received significant attention. As is well known, principal component analysis (PCA) has been widely used for extracting low-dimensional information from high-dimensional data. However, classical PCA lacks robustness to grossly corrupted observations [30]. In order to robustify PCA, many approaches have been proposed in the literature [31] [32] [19]. In particular, Wright et al. [19] have recently proposed a robust PCA method which is a powerful tool for various applications, such as image processing [18]. The training images we collect are often affected by expression, pose, occlusion, or illumination. For dictionary learning methods, a dictionary learned from the original images might contain information about image-specific details, such as expressions and occlusions, which has a negative effect on classification. The facial images of the same subject are correlated with each other, and natural high-dimensional data often lie on a low-dimensional linear subspace. Meanwhile, each image contains image-specific details, such as specularities and cast shadows, or noise with sparse support in the image. Therefore, the training images of each subject are decomposed into a low-rank matrix and a sparse matrix by the low-rank matrix recovery technique. The sparse matrix captures the gross corruption, such as occlusion, pose, or illumination changes, that the images within each class undergo.
For the noisy training images of the $i$th class, according to Eq (5), they can be modeled as

$$X_i = D_i + E_i, \qquad (6)$$

where each column of $D_i$ represents the neutral image of the $i$th subject and $E_i$ is the noise matrix of the $i$th class ($i \in \{1, 2, \ldots, c\}$). Because the neutral images of each subject are correlated with each other, $D_i$ is a low-rank matrix. $E_i$ represents expressions, occlusions, specularities, and cast shadows in the training images of the $i$th individual and is a sparse matrix. Therefore, this decomposition can be obtained by solving the following optimization problem:

$$\min_{D_i, E_i} \; \mathrm{rank}(D_i) + \lambda_1 \|E_i\|_0 \quad \text{s.t.} \quad X_i = D_i + E_i, \qquad (7)$$

where $\|\cdot\|_0$ denotes the $\ell_0$-norm (the number of nonzero entries in the matrix) and $\lambda_1$ is the parameter that trades off the rank term against the sparsity term. However, Eq (7) is non-convex and NP-hard to solve. Wright et al. [19] indicate that, under broad conditions, the aforementioned low-rank matrix recovery problem Eq (7) can be exactly solved via the following convex optimization problem:

$$\min_{D_i, E_i} \; \|D_i\|_* + \lambda_1 \|E_i\|_1 \quad \text{s.t.} \quad X_i = D_i + E_i, \qquad (8)$$

where the nuclear norm $\|D_i\|_*$ approximates the rank of $D_i$. To solve the optimization problem Eq (8), the augmented Lagrange multiplier method proposed by Lin et al. [33] can be adopted.
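As an illustration of how a problem of the form of Eq (8) can be solved, the following is a minimal sketch of an inexact augmented Lagrange multiplier solver in the spirit of Lin et al. [33]: singular value thresholding updates the low-rank part and elementwise soft-thresholding updates the sparse part. The parameter schedule (`mu`, `rho`) and the default `lam` are our own illustrative choices, not the authors' settings:

```python
import numpy as np

def rpca_ialm(X, lam=None, tol=1e-7, max_iter=500):
    """Sketch of inexact ALM for: min ||D||_* + lam*||E||_1  s.t.  X = D + E."""
    m, n = X.shape
    lam = lam if lam is not None else 1.0 / np.sqrt(max(m, n))  # common default
    norm_x = np.linalg.norm(X, "fro")
    sigma1 = np.linalg.norm(X, 2)                # largest singular value
    Y = X / max(sigma1, np.abs(X).max() / lam)   # dual variable initialization
    mu, rho = 1.25 / sigma1, 1.5                 # penalty and its growth factor
    E = np.zeros_like(X)
    for _ in range(max_iter):
        # D-step: singular value thresholding of X - E + Y/mu with threshold 1/mu
        U, s, Vt = np.linalg.svd(X - E + Y / mu, full_matrices=False)
        D = U @ np.diag(np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # E-step: elementwise soft-thresholding with threshold lam/mu
        G = X - D + Y / mu
        E = np.sign(G) * np.maximum(np.abs(G) - lam / mu, 0.0)
        R = X - D - E                            # constraint residual
        Y = Y + mu * R
        mu *= rho
        if np.linalg.norm(R, "fro") / norm_x < tol:
            break
    return D, E
```

Applying such a solver to each class matrix $X_i$ yields the pair $(D_i, E_i)$ used in the rest of the method.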
After the low-rank matrix $D_i$ and the sparse error matrix $E_i$ of each subject have been learned, the two dictionaries can be constructed. $D_i$ contains the structured patterns and discriminative features of the $i$th subject. Therefore, $D_i$ has a better representative ability than the original data $X_i$ in describing the face images of the $i$th subject [21]. The class-specific sub-dictionaries $D_i$ of all subjects are combined to build the supervised dictionary $D$. On the other hand, the non-class-specific dictionary $E$ only contributes to the essential representation of the images, such as expression and illumination conditions, rather than to discrimination. Since the dictionary $E$ represents non-class-specific variations, random noise needs to be reduced. This is done by removing dictionary atoms whose norm is less than a chosen threshold (e.g. $10^{-3}$).
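Assembling the two dictionaries and pruning near-zero error atoms might look as follows (a sketch; the threshold `eta` mirrors the $10^{-3}$ value mentioned above, and the function name is ours):

```python
import numpy as np

def build_dictionaries(decompositions, eta=1e-3):
    """Stack per-class low-rank parts into D and error parts into E,
    dropping error atoms whose column norm is below eta."""
    D = np.hstack([Di for Di, _ in decompositions])
    E = np.hstack([Ei for _, Ei in decompositions])
    keep = np.linalg.norm(E, axis=0) >= eta      # prune near-zero variation atoms
    return D, E[:, keep]
```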
According to the basic assumption, a query image $y$ can be represented as

$$y = x + e = D\alpha + E\beta, \qquad (9)$$

where $x$ is the neutral image, which can be represented by $D\alpha$. The sparse error matrix $E$ usually represents lighting changes, exaggerated expressions, or occlusions, so $e = E\beta$ represents the image details of $y$, such as expression conditions or noise with sparse support in the image. We can therefore use the within-individual variant dictionary $E$ and the supervised dictionary $D$ to represent $y$. If there are redundant and over-complete facial variant bases in $E$, the combination coefficients in $\beta$ are naturally sparse. Hence, the sparse representations $\alpha$ and $\beta$ can be recovered simultaneously by $\ell_1$-norm minimization.
Based on Eqs (6) and (9), we propose a face recognition framework by using low-rank and sparse error matrix decomposition and sparse coding techniques (LRSE+SC). Our method treats the face recognition problem as finding a sparse coding of the query image in term of the supervised dictionary as well as the within-individual variant dictionary.
After introducing the two phases of the proposed method, the main steps of LRSE+SC are summarized in Algorithm 1.
Algorithm 1 Low-rank matrix recovery and sparse coding for face recognition (LRSE+SC)
Inputs: A matrix of training images $X = [X_1, X_2, \ldots, X_c] \in \mathbb{R}^{m \times n}$ for $c$ subjects, the query image $y$, and parameters $\lambda_1$, $\lambda_2$.
Output: Class label of the query image y.
Step 1: Learning the class-specific dictionary and the intra-class variant dictionary by low-rank matrix recovery: for each class $i = 1, \ldots, c$, solve Eq (8) to obtain $(D_i, E_i)$.
Step 2: Building the supervised dictionary and the within-individual variant dictionary. The supervised dictionary $D = [D_1, D_2, \ldots, D_c]$ is built by combining the class-specific sub-dictionaries, and $E$ is built by integrating the sparse error matrices of all subjects. However, any column of $E$ whose norm is less than a threshold $\eta$ (e.g. $\eta = 10^{-3}$) is removed from $E$.
Step 3: Finding the sparse representation of the query image $y$ in terms of the new dictionary $[D, E]$:

$$[\hat{\alpha}; \hat{\beta}] = \arg\min_{\alpha, \beta} \left\| y - [D, E] \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \right\|_2^2 + \lambda_2 \left\| \begin{bmatrix} \alpha \\ \beta \end{bmatrix} \right\|_1.$$

Step 4: Classification: assign $y$ to the class with the minimum reconstruction residual $r_i(y) = \|y - D_i \hat{\alpha}_i - E \hat{\beta}\|_2$, where $\hat{\alpha}_i$ denotes the coefficients in $\hat{\alpha}$ associated with class $i$.
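Steps 3 and 4 can be sketched as below. A plain ISTA loop stands in for the $\ell_1$ solver (the paper uses the feature-sign search algorithm), and the residual of class $i$ is computed with the variant-dictionary contribution $E\hat{\beta}$ shared by all classes. All names and defaults are ours:

```python
import numpy as np

def soft(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def l1_code(A, y, lam=0.01, n_iter=500):
    """ISTA for min 0.5*||y - A w||_2^2 + lam*||w||_1 (stand-in solver)."""
    L = np.linalg.norm(A, 2) ** 2
    w = np.zeros(A.shape[1])
    for _ in range(n_iter):
        w = soft(w + A.T @ (y - A @ w) / L, lam / L)
    return w

def lrse_sc_classify(D_list, E, y, lam=0.01):
    """Steps 3-4 sketch: code y over [D, E], then pick the class minimizing
    ||y - D_i a_i - E b||_2, with b shared across all classes."""
    D = np.hstack(D_list)
    A = np.hstack([D, E])
    w = l1_code(A, y, lam)
    a, b = w[:D.shape[1]], w[D.shape[1]:]
    residuals, start = [], 0
    for D_i in D_list:
        a_i = a[start:start + D_i.shape[1]]
        residuals.append(np.linalg.norm(y - D_i @ a_i - E @ b))
        start += D_i.shape[1]
    return int(np.argmin(residuals))
```

On toy data where the query is a class-0 image plus a contribution from the variant dictionary, the residual scheme recovers class 0.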

Analysis of the Proposed Method
In this section, the justification of our method and its difference from SRC are discussed. Linear regression based algorithms assume that the images from a single subject lie on a linear subspace. We denote the subspace spanned by the training images from the $i$th subject as $S_i$. Thus, we have

$$S_i = \mathrm{span}\{x_1^i, x_2^i, \ldots, x_{n_i}^i\}. \qquad (10)$$

Given a query image $y$, we assume that it comes from the $i$th class. If we have observed a sufficient number of training images per subject, then the query image can be well reconstructed by the training images belonging to its ground-truth class; therefore, $y \in S_i$. Obviously, SRC achieves very promising results in this situation. In practice, facial images might suffer from expression and illumination variations and even occlusions, and there is not an adequate number of training images of the $i$th class to cover the variations of the test image $y$, i.e., $y \notin S_i$. However, for large-scale face recognition problems where the training set contains a large number of subjects, some training images from other subjects can be used to describe the test image $y$. Therefore, in this paper, we suppose that the test sample lies in the subspace spanned by all training samples, i.e., $y \in S = \mathrm{span}\{X_1, \ldots, X_c\}$ but $y \notin S_i$. For SRC, the training samples most similar to the query image $y$ are selected to represent $y$. Since $y \notin S_i$, there exist training samples $x'_1, \ldots, x'_p$ from other classes that can be used to represent $y$. Certainly, $y$ cannot be modeled accurately by training images from others alone, because there must be some unique patterns owned by the $i$th class. So we have

$$y \in \mathrm{span}\{x_1^i, \ldots, x_{n_i}^i, x'_1, \ldots, x'_p\} \quad \text{and} \quad y \notin \mathrm{span}\{X_{-i}\}, \qquad (11)$$

where $X_{-i}$ represents all training samples except those of the $i$th subject. According to Eq (11), the linear representation of $y$ can be written as

$$y = \sum_{j=1}^{n_i} \alpha_j^i x_j^i + \sum_{j=1}^{p} \beta_j x'_j, \qquad (12)$$

where $\alpha_j^i$ and $\beta_j$ are the coefficients. Obviously, $\alpha = [\alpha_1^i; \alpha_2^i; \ldots; \alpha_{n_i}^i] \ne 0$ and $\beta = [\beta_1; \beta_2; \ldots; \beta_p] \ne 0$.
Without loss of generality, assume that $x'_1, \ldots, x'_p$ all come from the $k$th ($k \ne i$) subject. If the contribution of the training data belonging to the $i$th class is small, it is possible that

$$\left\| y - \sum_{j=1}^{p} \beta_j x'_j \right\|_2 < \left\| y - \sum_{j=1}^{n_i} \alpha_j^i x_j^i \right\|_2. \qquad (13)$$

Then SRC may classify $y$ as the $k$th class. Take, for example, the query image in Fig 1: due to the sunglasses, it looks like the image bounded by the green rectangle, and SRC recognizes it as Person A. In fact, the training samples of other classes are used to represent the regions of the test image $y$ that are caused by illumination, expressions, or occlusions. However, SRC cannot separate these components (such as illumination and expression variations) from the original training samples; hence, SRC treats them as discriminative features of each subject.
According to the theory of linear subspaces and Eq (5), $S_i$, i.e., the subspace of Subject $i$, can be modeled as

$$S_i = \mathrm{span}\{x^i, e_1^i, e_2^i, \ldots, e_{n_i}^i\}. \qquad (14)$$

In Eq (5), only one vector $x^i$ is used to represent the class-specific information of the images of Subject $i$, and it is the discriminative component of this subject, while $e_1^i, e_2^i, \ldots, e_{n_i}^i$ are basis vectors of the within-individual subspace and explain why the training images of Subject $i$ do not look identical. Hence, the basis vectors of subspace $S_i$ can be divided into two categories: the first is the discriminative vector of each class; the other is the within-individual variant vectors. However, in practical scenarios facial images are affected by many factors, and it is not appropriate to describe the class-specific information of Subject $i$ by just one basis vector $x^i$. From Eq (6), utilizing the low-rank matrix recovery technique, we can use a matrix $D_i$ as the representation basis matrix of Subject $i$, and $S_i$ is a subset of the space spanned by $D_i$ and $E_i$, i.e.,

$$S_i \subseteq \mathrm{span}\{d_1^i, \ldots, d_{n_i}^i, e_1^i, \ldots, e_{n_i}^i\}, \qquad (15)$$

where $d_j^i$ and $e_j^i$ represent the $j$th ($j = 1, \ldots, n_i$) column of $D_i$ and $E_i$, respectively. We denote $S_i' = \mathrm{span}\{d_1^i, d_2^i, \ldots, d_{n_i}^i\}$. Therefore, $S_i'$ is a class-specific subspace, and $\mathrm{span}\{e_1^i, e_2^i, \ldots, e_{n_i}^i\}$ represents the noise or within-individual variance of $X_i$. We combine the sparse error matrices of all subjects into a within-individual variant dictionary $E = [e_1^1, \ldots, e_{n_1}^1, \ldots, e_{n_c}^c]$, which can be used to model intra-class variations such as lighting conditions, expressions, or occlusions. Therefore, $W = \mathrm{span}\{e_1^1, \ldots, e_{n_1}^1, \ldots, e_{n_c}^c\}$ is a within-individual subspace. From Eqs (14) and (15), we have $S_i \subseteq S_i' + W$. A query image from the $i$th subject thus lies in the linear subspace $S_i' + W$. Therefore, the query image $y$ in Eq (12) cannot be represented by training samples from Subject $i$ alone, but it may lie in $S_i' + W$. Fig 2 shows an example. The query image $y$ belongs to Subject 84. However, $y \notin S_{84}$ due to the occluded regions.
By taking advantage of low-rank matrix recovery, we obtain the within-individual subspace $W$. The query image $y$ can be represented by the class-specific subspace $S_{84}'$ and the within-individual subspace $W$. Hence, LRSE+SC can alleviate the small sample size problem and the problem of corrupted data.
From Eq (8), when $\lambda_1$ tends to infinity, all atoms of the within-individual dictionary $E$ become zero, so each $D_i$ reduces to the original training images $X_i$. In this situation, our method LRSE+SC is equivalent to SRC.

Results
In this section, several experiments are conducted to demonstrate the effectiveness of the proposed LRSE+SC algorithm by comparing it with the state of the art on the AR, FERET, FRGC, and LFW databases. Besides SRC, we compare our method with linear regression for classification (LRC) [29], Extended SRC (ESRC) [25], MFL, and RLRC1 and RLRC2 [27]. In ESRC, we construct the intra-class variant dictionary by subtracting the class centroid from the images of the same class. As is well known, many algorithms [34] [35] [36] [37] can solve the $\ell_1$-regularized least squares problem. The feature-sign search algorithm is very fast and achieves high performance [35]. For fair comparison, LRSE+SC, SRC, and ESRC all use the feature-sign search algorithm to solve the $\ell_1$ minimization problem. The regularization parameters in all algorithms are tuned empirically. The Matlab code of the LRSE+SC algorithm can be downloaded from http://www.researchgate.net/publication/264556568_LRAESC?ev=prf_pub.

AR Database
The AR face database is employed because it is one of the very few that include natural occlusions. The AR database consists of over 4,000 frontal images of 126 individuals [38]. For each individual, 26 images were taken under different variations, including illumination, expression, and facial occlusion, in two different sessions. All the images are cropped to 44 × 40 pixels and converted to gray scale. In this paper, we select a subset of 50 male subjects and 50 female subjects for our experiments (as in [7]). For each individual, there are six occluded images per session; the remaining seven exhibit only illumination and expression variations.
A. Face Recognition Without Occlusion. In this part, we consider face recognition without occlusion, so the occluded face images are excluded. Hence, for each class, the fourteen images with only illumination and expression changes are used in the experiments.
In the first experiment, $n_i$ images per class are randomly selected from Session 1 for training, and the rest ($14 - n_i$ per class) are used as query samples. This partition procedure is repeated 5 times and the average results are computed. $n_i$ denotes the number of training samples of the $i$th class and may differ per class. To test the undersampled effect, the number of training samples per class $n_i$ is kept small. We set $n_i$ = 2, 3, 4, and rand([2,5]), respectively, where rand([2,5]) means that the number of training samples per class is a random number between 2 and 5. Hence, for each class, there are obviously insufficient training samples to span the variations of expression and illumination under testing conditions. The average recognition rates are shown in Fig 4. Since the training data size is small, the recognition rates of all methods are poor. Compared with the other methods, the recognition rates of LRC and RLRC2 are unacceptable and are not enumerated in Fig 4. For example, when $n_i$ = 2, the classification accuracies of LRC, RLRC2, SRC, MFL, ESRC, RLRC1, and LRSE+SC are 36.52%, 45.15%, 71.65%, 69.03%, 71.66%, 73.46%, and 74.82%, respectively. For all methods, the recognition rates rise as the number of training samples increases. As can be seen, our algorithm LRSE+SC outperforms all the other methods. When the number of training samples per class is unequal ($n_i$ = rand([2, 5])), the improvement of LRSE+SC is especially significant. For example, the recognition rate of MFL is 78.76% while LRSE+SC achieves 83.69%.
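The random partition protocol described above can be sketched as follows (the function name, seed, and defaults are ours; the 14 images per class correspond to the unoccluded Session 1 images):

```python
import numpy as np

def random_split(n_images=14, lo=2, hi=5, n_classes=100, seed=0):
    """Per class, draw n_i in [lo, hi] training indices out of n_images;
    the remaining indices become the query samples for that class."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_classes):
        n_i = rng.integers(lo, hi + 1)           # rand([2,5]) training images
        perm = rng.permutation(n_images)
        splits.append((perm[:n_i], perm[n_i:]))  # (train indices, test indices)
    return splits

# One repetition of the protocol; in the experiment this is repeated 5 times
# with different seeds and the recognition rates are averaged.
splits = random_split()
assert all(2 <= len(tr) <= 5 and len(tr) + len(te) == 14 for tr, te in splits)
```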
In the foregoing experiment, there is not enough training data for each class. For the second experiment, we consider the scenario that there are sufficient training samples for some subjects. Hence, we randomly choose p classes ({i 1 , i 2 , . . ., i p }) and for each of these p classes, seven images with illumination and expression changes at Session 1 are selected for training. On the other hand, for the remaining classes ({j 1 , j 2 , . . ., j 100−p }), only one image of each class at Session 1 is randomly selected and used as training sample. For each subject, seven images (without occlusion) from Session 2 are used for testing. In other words, for any class from {i 1 , i 2 , . . ., i p }, sufficient training samples are used for training, meanwhile, for others, there is only one training sample. We repeat this procedure for 5 times and report average recognition accuracy.
It is important to note that the scenario considered here is difficult. Fig 5 shows the average recognition accuracy versus the number of subjects ($p$), each of which has seven images for training. For all the methods, the performance in this scenario is not good; it is clear that LRSE+SC performs best. There are seven training images for each person, shown in Fig 6(a). Fig 6(b) presents the dictionary learned by MFL. For MFL, the dictionary is learned individually per class. Compared with the original training images, the dictionary in MFL mitigates the over-fitting problem. SRC treats the original training images as the dictionary; hence, MFL performs better than SRC. The low-rank class-specific dictionaries $\{D_i\}_{i=1}^c$ are presented in Fig 6(c). By utilizing low-rank matrix recovery, we extract the subject-specific features used as the supervised dictionary and separate out the common features (as shown in Fig 6(d)) caused by illumination and expression from the original training samples. In order to show the importance of the intra-class variant dictionary $E$ for recognition, we also use only the low-rank class-specific dictionaries $\{D_i\}_{i=1}^c$ to represent the query image and then adopt a reconstruction-based scheme for classification. This method is denoted as LR+SC. From Fig 5, it can be seen that LRSE+SC greatly outperforms LR+SC.
The basic assumption in SRC is that there are sufficient training images of each subject, and this assumption is violated when most subjects have only one training sample. Fig 7 shows such an example. There is only one training sample for Subject 54, shown in Fig 7(b). Due to the lighting variation, the query image (Fig 7(a)) is very similar to the training sample of Subject 38. Hence, as presented in Fig 7(c), the training sample with the largest weight is from Subject 38, and Fig 7(d) indicates that SRC fails to recognize the subject to which the query image belongs. Different from SRC, we decompose the original training samples into a low-rank class-specific dictionary and a sparse intra-class variant dictionary. The intra-class variant dictionary represents the illumination variations sparsely, as shown in Fig 7(f). By separating the illumination and expression variations from the original training samples, the subject-specific features are extracted; hence, our approach mitigates the over-fitting problem. As shown in Fig 7(e), the coefficient of the training sample from Subject 54 is the largest, so the query image is successfully recognized by LRSE+SC.
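The two-dictionary coding and reconstruction-based classification can be sketched as follows. The plain ISTA solver, the sparsity weight, and the function names are illustrative assumptions, not the paper's implementation: the query is coded over the concatenation of all class dictionaries plus the shared variant dictionary, and the class with the smallest reconstruction residual wins.

```python
import numpy as np

def ista(B, y, lam=1e-3, iters=500):
    """Plain ISTA for min_x 0.5*||y - Bx||^2 + lam*||x||_1."""
    L = np.linalg.norm(B, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(B.shape[1])
    for _ in range(iters):
        z = x - B.T @ (B @ x - y) / L        # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # shrinkage
    return x

def classify(y, D_list, E):
    """Code y over [D_1 ... D_c, E]; assign y to the class whose low-rank
    dictionary (plus the shared variant part) best reconstructs it."""
    D = np.hstack(D_list)
    x = ista(np.hstack([D, E]), y)
    alpha, beta = x[:D.shape[1]], x[D.shape[1]:]
    residuals, start = [], 0
    for D_i in D_list:
        k = D_i.shape[1]
        a_i = alpha[start:start + k]
        start += k
        residuals.append(np.linalg.norm(y - D_i @ a_i - E @ beta))
    return int(np.argmin(residuals))
```

Note that the shared coefficients beta explain the variation part for every candidate class, which is exactly why E contributes to representation rather than discrimination.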
B. Face Recognition With Occlusion. One of the most important characteristics of SRC is its robustness to face occlusion. However, most current face recognition methods do not consider that occlusions may also exist in the training data. When occluded images are used for training, SRC might over-fit the extreme noise introduced by the occlusion. In this subsection, we consider training sets that may contain occluded face images. In the AR database, there are 26 images for each subject, taken in two separate sessions. For each session, there are seven clean images without occlusion, three images with sunglasses, and three images with a scarf. In our experiments, not all subjects have occluded images for training. We first randomly select p (p = 1, 10, 20, 30, 40) classes, each of which contains occluded training images; the remaining classes have no occluded images for training. Specifically, the following three scenarios are considered.
• Sunglasses: In this scenario, for each of the p chosen classes, seven neutral images plus one randomly chosen image with sunglasses at Session 1 are used for training. For the remaining subjects, only the seven neutral images at Session 1 are used for training. The test set contains, for each subject, the seven images without occlusion at Session 2 and the remaining images with sunglasses from both sessions. In total, each of the p classes has 7 neutral images plus 5 images with sunglasses (two taken at Session 1 and three at Session 2) for testing, and each of the remaining classes has 7 neutral images plus 6 images with sunglasses for testing.
• Scarf: Here the training images are occluded by a scarf. Similar to the first scenario, one randomly chosen image with a scarf and seven neutral images at Session 1 are used for training for each of the p classes, while for the other classes only the seven neutral images at Session 1 are used as training samples. The test set consists of, for each subject, the seven neutral images without occlusion at Session 2 and the remaining images with a scarf from both sessions.
• Sunglasses+scarf: In the third scenario, two corrupted images (one with sunglasses and one with a scarf) are randomly selected from Session 1 for each of the p subjects and used for training, in addition to the seven neutral images at Session 1 for each of the 100 subjects. For each subject, the seven images without occlusion at Session 2 and all the images with a scarf or sunglasses from both sessions that were not selected for training are used for testing.
In all three scenarios, some subjects have occluded images for training but the rest do not; hence, the occluded regions may be mistaken for subject-specific features by SRC. The experiments are repeated five times and the recognition accuracies are averaged. Tables 1, 2 and 3 list the results for sunglasses, scarf and sunglasses+scarf, respectively. It is clear that LRSE+SC outperforms SRC, ESRC, MFL and RLRC1. For example, when p = 10, we achieve recognition rates of 75.2%, 81.51% and 73.64% for the sunglasses, scarf and sunglasses+scarf scenarios, respectively, while the corresponding rates of SRC are 71.11%, 78.29% and 67.81%. From these three tables, both RLRC1 and MFL perform poorly, since these two methods cannot separate the occlusion information from the original images. ESRC and LRSE+SC introduce an intra-class variant dictionary to represent the variations between the query image and the training images. Fig 8 shows the sparse coding of a query image with sunglasses. The query image comes from Subject 24 and is presented in the first row of Fig 8(a); the remaining seven unoccluded images of Fig 8(a) are training samples of Subject 24. There is one image from Subject 10 in the training set which is partially occluded by sunglasses, and due to this occlusion it looks like the query image. From Fig 8(b), the solution recovered by SRC is not sparse and the largest coefficient corresponds to the occluded image of Subject 10. Fig 8(c) shows the corresponding residuals with respect to the 100 subjects; the green bar indicates that SRC fails to identify the subject. To deal with occlusion, ESRC uses an intra-class variant dictionary to represent the possible variations. Fig 8(d) and 8(e) plot the coefficients corresponding to the training sample dictionary and the intra-class variant dictionary.
However, ESRC cannot separate the occluded part from the query image; therefore, the value of the coefficient corresponding to the image with sunglasses is again the largest. Different from ESRC, LRSE+SC decomposes the original training images into a low-rank supervised dictionary and a sparse intra-class variant dictionary. Hence, the low-rank supervised dictionary does not contain occlusion variations, and the coefficient corresponding to the basis from Subject 24 is the largest, as presented in Fig 8(f) and 8(g); LRSE+SC identifies the subject correctly. C. Face Recognition With Random Faces. Recently, Random Projection [39] has emerged as a powerful tool for dimensionality reduction. In order to evaluate the performance of the LRSE+SC algorithm on different feature extraction methods, we use random faces as feature descriptors on the AR face database. The dataset used in this experiment is provided by Jiang [40]; it is a subset of the AR database consisting of 2,600 images from 50 female and 50 male subjects. Each image is projected onto a 540-dimensional subspace with a randomly generated matrix drawn from a zero-mean normal distribution. The recognition results are listed in Table 4.
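The random-faces feature extraction described above can be sketched as a single matrix multiplication; normalizing the columns of the projection matrix is our assumption (a common convention for random faces), and the function name is illustrative.

```python
import numpy as np

def random_faces(X, d=540, seed=0):
    """Project raw image vectors (rows of X) onto a d-dimensional subspace
    using a random matrix with zero-mean normal entries ("random faces")."""
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((X.shape[1], d))
    R /= np.linalg.norm(R, axis=0)   # unit-norm projection directions (assumed)
    return X @ R

X = np.random.default_rng(2).standard_normal((5, 2000))  # 5 mock image vectors
F = random_faces(X, d=540)
print(F.shape)  # (5, 540)
```

Because the projection is data-independent, the same matrix R must be applied to both training and test images.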
It is clear that the proposed LRSE+SC consistently outperforms the other methods. Since the number of training samples per class is small, the recognition rate of LRC is relatively low. RLRC1 and RLRC2 were proposed to overcome the small sample size issue; the basic idea behind these two methods is that the basis vectors of each class-specific subspace are composed of class-specific basis vectors owned by one class only and common basis vectors shared by many classes. From Table 4, the recognition rates of RLRC1 are significantly higher than those of LRC. MFL learns a dictionary independently for each class, and some useful information may be lost in this procedure; therefore, MFL does not necessarily perform better than SRC. ESRC and LRSE+SC utilize the intra-class variations, such as lighting changes, and both achieve good performance. However, compared with ESRC, LRSE+SC can separate the intra-class variations from the original training samples.

FERET Database
The FERET database [41] is one of the most widely used databases for face recognition. A subset of the FERET database containing 720 images of 120 subjects (six images per subject) is used to test the performance of LRSE+SC. The images in this subset exhibit different illuminations and expressions, and all are cropped to 44 × 40 pixels.
In this experiment, a few images of each subject are selected as training samples. We also consider the situation in which the number of training samples per subject is unequal. Hence, five groups of experiments are designed: 2, 3, rand([1,3]), rand([2,4]) and rand([1,4]) images of each subject are chosen for training, respectively, and the remaining images are used as test samples. In each condition, the images are selected randomly and the selection is repeated five times. The average recognition rates over the 5 runs of each method are presented in Fig 9. LRSE+SC achieves the highest recognition rates in all five experiments, which demonstrates its capability to deal with the small sample size problem. Since SRC and MFL need sufficient training samples, they do not work well. When the number of training samples per subject is unequal, LRSE+SC mitigates the over-fitting problem and performs better than the other methods.

FRGC Database
The FRGC v2.0 database [42], a large-scale face database designed with uncontrolled indoor and outdoor settings, is used to evaluate real-world recognition performance. In this experiment, the outdoor-lighting subset is used as an unconstrained face recognition benchmark. It contains 275 subjects, each with five uncontrolled-lighting images, cropped and normalized to 50 × 50. The uncontrolled images are taken under varying illumination conditions, e.g., in hallways, atriums, or outdoors. We randomly select n (n = 2, 3) images per person for training and use the remainder for testing, and the experiment is repeated 20 times. The average accuracies and standard deviations are listed in Table 5. LRSE+SC outperforms all the other methods; when the number of training samples per class is two, it obtains the top result of 85.88±0.87, followed by the ESRC approach. It can also be seen that the recognition rates of SRC are almost the same as those of ESRC.

LFW Database
In this section, we test the effectiveness of LRSE+SC in handling unconstrained face recognition. The LFW database [43] consists of images of 5,749 individuals captured in uncontrolled environments. Following [44], a subset of aligned LFW [45] is chosen for testing, which includes 143 subjects with no fewer than 11 samples each. For each subject, the first 10 images are used as training samples and the remaining images for testing. To represent each face image, Gabor magnitude [46] and Local Binary Pattern (LBP) [47] features are extracted: each image is partitioned into 2 × 2 blocks, a discrimination-enhanced feature is obtained by performing LDA on each block, and the features of all blocks are concatenated, giving a feature dimension of 560. It can be seen from Table 6 that LRSE+SC achieves the best performance.

Statistical Evaluation
In order to find significant differences in performance across the six classifiers (SRC, ESRC, MFL, RLRC1, RLRC2 and LRSE+SC), rank-based statistics are employed. Following [48], we use the Friedman test with the corresponding post-hoc test for this statistical comparison.
Firstly, the Friedman test is employed to assess whether the average ranks of the different approaches deviate from the mean rank. This test ranks the methods separately for each dataset: the best-performing algorithm gets rank 1, the second best gets rank 2, and so on. Since six approaches are compared, the mean rank is (1 + ··· + 6)/6 = 3.5. The Friedman test then compares the average ranks of the methods. Under the null hypothesis, which states that the algorithms are equivalent and their average ranks should all equal 3.5, we calculate the refined Friedman statistic F_F = (N − 1)χ²_F / (N(k − 1) − χ²_F) with k = 6 algorithms and N = 22 test sets, where χ²_F is the Friedman statistic computed from the average ranks [48]. The critical value of F(5, 105) at α = 0.05 is 2.3. Since F_F > 2.3, the null hypothesis is rejected, which means that the performance differences among the approaches are statistically significant. Therefore, the next step is a post-hoc test comparing the proposed method with the others. Under this test, the critical difference (CD) measures whether two algorithms differ significantly in terms of their average ranks; it is defined as CD = q_α √(k(k + 1)/(6N)), where the critical values q_α are given in [48]. For our results, CD = 2.576 × 0.5641 = 1.4531 at α = 0.05. Table 7 lists the differences between the average rank of LRSE+SC and those of the other methods. It is clear that the proposed LRSE+SC is better than SRC, MFL, RLRC1 and RLRC2 at α = 0.05, since the corresponding differences in average ranks are larger than CD = 1.45. Owing to its intra-class variant dictionary, ESRC also alleviates the small sample size problem and obtains better performance than SRC; however, at significance level α = 0.05 we cannot say that LRSE+SC is significantly different from ESRC.
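The refined Friedman statistic and the critical difference above can be computed directly from the average ranks; the helper names below are ours, and the formulas follow the test procedure in [48]. The example ranks passed to `friedman_F` are illustrative, while the CD computation reproduces the values quoted in the text (k = 6, N = 22, q_0.05 = 2.576).

```python
import math

def friedman_F(avg_ranks, N):
    """Refined Friedman statistic F_F from average ranks over N datasets."""
    k = len(avg_ranks)
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks)
                                     - k * (k + 1) ** 2 / 4)
    return (N - 1) * chi2 / (N * (k - 1) - chi2)

def critical_difference(k, N, q_alpha):
    """Nemenyi critical difference CD = q_alpha * sqrt(k(k+1)/(6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))

# Six classifiers, 22 test sets, q_0.05 = 2.576 from the tables in [48]
cd = critical_difference(k=6, N=22, q_alpha=2.576)
print(round(cd, 4))  # 1.4531
```

Two methods are then declared significantly different whenever the gap between their average ranks exceeds `cd`.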

Parameter Selection
There are two parameters, λ1 and λ2, in our method LRSE+SC. Both have a clear physical meaning, which guides their setting. The parameter λ1 trades off the rank of D_i against the sparsity of the error E_i. According to the theoretical considerations in [19], the correct scaling is λ1 = O(m^{−1/2}); for example, if the dimension of the feature vector is m = 1760, we search for the optimal λ1 in the neighborhood of 1/√m ≈ 0.023. The parameter λ2 balances the sparsity of the representation against the reconstruction error; a good value lies in the range λ2 ∈ [10^{−4}, 10^{−1}]. To study the influence of λ1 and λ2 on recognition accuracy, we perform two experiments on the AR database. Fig 10 shows the evaluation results of LRSE+SC with different values of λ1 and λ2. These two examples demonstrate that the optimal values of λ1 and λ2 depend on the training and test samples. Nevertheless, LRSE+SC achieves good performance when λ2 = 0.001 or 0.0005, and when λ1 lies in the neighborhood of 0.023.
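A minimal sketch of this parameter-setting rule, assuming a simple multiplicative neighborhood for λ1 and a log-spaced grid for λ2 (the specific grid values are our illustrative choices; only the scaling rule and the range come from the text):

```python
import numpy as np

# Rule of thumb: lambda_1 = O(m^{-1/2}), so search around 1/sqrt(m);
# sweep lambda_2 over a log-spaced grid inside [1e-4, 1e-1].
m = 1760                                   # feature dimension from the text
lam1_center = 1.0 / np.sqrt(m)
lam1_grid = lam1_center * np.array([0.5, 1.0, 2.0])   # assumed neighborhood
lam2_grid = np.logspace(-4, -1, 7)                    # assumed grid density
print(round(lam1_center, 4))  # 0.0238
```

Each (λ1, λ2) pair would then be evaluated by cross-validation on the training set, which is the usual way to realize the grid search implied by Fig 10.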

Conclusions
In this paper, we have introduced a novel face recognition framework that utilizes low-rank matrix recovery. In this framework, we decompose the original images into class-specific features and within-class variant features. We have demonstrated theoretically and experimentally that the within-class variant features separated from the original images are very important for dealing with the small sample size problem.
Our method leverages the low-rank and sparse error matrix decomposition technique together with a sparse representation scheme. Firstly, we recover the low-rank matrix of each subject by removing the sparse error in the training images; this low-rank matrix is then used as a supervised dictionary to code the test samples. Our algorithm treats the sparse error matrices of all subjects as a sparse within-class variant dictionary to represent the variations between a test image and the training images of the same class, which may be caused by illumination, expression and disguise. Experiments confirm that the LRSE+SC approach outperforms SRC and ESRC on the small sample size problem.