General Regression and Representation Model for Classification

Recently, regularized coding-based classification methods (e.g., SRC and CRC) have shown great potential for pattern classification. However, most existing coding methods assume that the representation residuals are uncorrelated. In real-world applications, this assumption often does not hold. In this paper, we take the correlations of the representation residuals into account and develop a general regression and representation model (GRR) for classification. GRR not only retains the advantages of CRC, but also makes full use of the prior information (e.g., the correlations between representation residuals and representation coefficients) and the specific information (the weight matrix of image pixels) to enhance classification performance. GRR uses generalized Tikhonov regularization and K Nearest Neighbors to learn the prior information from the training data. Meanwhile, the specific information is obtained by an iterative algorithm that updates the feature (or image pixel) weights of the test sample. With the proposed model as a platform, we design two classifiers: the basic general regression and representation classifier (B-GRR) and the robust general regression and representation classifier (R-GRR). The experimental results demonstrate the performance advantages of the proposed methods over state-of-the-art algorithms.


Introduction
As is well known, the nearest neighbor (NN) classifier is one of the most popular classifiers due to its simplicity and efficiency. However, NN uses only one training sample to represent a test sample. To address this problem, the nearest feature line (NFL) classifier uses two training samples of each class to represent the test sample [1], and the nearest feature plane (NFP) classifier uses three [2]. Furthermore, some classifiers leverage more training samples for test sample representation, such as the local subspace classifier (LS) [3] and the nearest subspace classifier (NS) [4,27], which represent the test sample via all training samples of each class. All these methods can be considered variants of linear regression based methods. To prevent over-fitting, the L2-regularizer is generally used in the linear regression model. In recent years, the L1-regularizer, which is closely linked to sparse representation, has become a hot topic in information theory, signal/image processing and related areas. Meanwhile, numerous findings in neuroscience and biology form a physiological basis for sparse representation [5][6][7].
Recently, many efforts have been made to apply sparse representation methods to pattern classification tasks, including signal/image classification and face recognition etc. Labusch et al. presented a simple sparse-coding strategy for digit recognition and achieved state-of-the-art results on the MNIST benchmark [8]. Yang et al. addressed the problem of generating a super-resolution (SR) image from a single low-resolution input image via sparse representation [9]. Mairal et al. elaborated a framework for learning multi-scale sparse representations of images with applications to image denoising and inpainting [10]. Yang et al. employed sparse coding instead of vector quantization to capture the significant properties of local image descriptors for image classification [11]. Particularly, Wright et al. introduced a sparse representation based classification (SRC) and successfully applied it to identify human faces with varying illumination changes, occlusion and real disguise [12]. In their method, a test sample image is coded as a sparse linear combination of the training images, and then the classification is achieved by identifying which class yields the least residual. Theodorakopoulos et al. introduced a face recognition method based on sparse representation of facial image patches [37]. Subsequently, Gao et al. proposed Kernel Sparse Representation for image classification and face recognition. KSR actually is the sparse coding technique in a high dimensional feature space via some implicit feature mapping [39]. Yang and Zhang constructed a Gabor occlusion dictionary for SRC to reduce the computation cost by using Gabor feature [13].
Although the newly emerging SRC shows great potential for pattern classification, it lacks theoretical justification. Yang et al. provided an insight into SRC and analyzed the role of the L1-optimizer [14]. They argue that the L1-optimizer has two properties, sparsity and closeness, whereas the L0-optimizer achieves only sparsity. Sparsity yields a small number of nonzero representation coefficients, and closeness makes the nonzero representation coefficients concentrate on the training samples with the same class label as the given test sample. Wright et al. gave an overview of sparse representation for computer vision and pattern recognition [15]. Yang et al. presented a robust regularized coding model to enhance the robustness of face recognition to occlusion, pixel corruption and real disguises [16,31]. He et al. proposed an effective sparse representation algorithm based on the maximum correntropy criterion for robust face recognition [17]. To unify the existing robust sparse regression models, namely the additive model represented by SRC for error correction and the multiplicative model represented by CESR and RSC for error detection, He et al. [38] built a half-quadratic framework by defining different half-quadratic functions. The framework can perform both error correction and error detection. Furthermore, He et al. also leveraged the half-quadratic framework to address feature selection and subspace clustering problems [48,49]. In addition, Zhou et al. incorporated the Markov Random Field model into the sparse representation framework to model the spatial continuity of the occlusion [40]. Li et al. explored the intrinsic structure of continuous occlusion and proposed the structured sparse error coding (SSEC) model [41]. Ou et al. proposed a novel structured occlusion dictionary learning method for robust face recognition [42]. Apart from these methods, many related works have been reported [18][19][20][32][33][34][35].
With the wide use of sparse representation for classification, some scholars have questioned the role of sparseness in image classification [21,22]. Zhang et al. analyzed the working principle of SRC and argued that it is the collaborative representation, rather than the L1-norm sparsity, that improves the image classification accuracy. Consequently, Zhang et al. presented collaborative representation based classification with regularized least square (CRC) [23]. Compared with SRC, CRC delivers very competitive classification results with far less computation time. Subsequently, Yang et al. proposed a relaxed collaborative representation model (RCR), which effectively captures the similarity and distinctiveness of different features for pattern classification [24]. Theodorakopoulos et al. gave a collaborative sparse representation model in dissimilarity space for visual classification tasks [43].
Most previous works assume that representation residuals are mutually uncorrelated [31]. This assumption is difficult to satisfy in real-world applications, where it is common to have data whose representation residuals are correlated. Thus, in this paper, we aim to eliminate the correlations between representation residuals and present a novel model named General Regression and Representation (GRR) for pattern classification. GRR takes account of the prior information (e.g., the correlations between representation residuals and representation coefficients) and the specific information (a weight for each image pixel) so as to enhance the classification performance under different conditions. Specifically, GRR selects one image from the training set and codes it on its K nearest neighbors among the remaining images. In this way, every training image is coded on its K nearest neighbors. Subsequently, we calculate the correlations between representation residuals and representation coefficients from the reconstruction error and representation coefficients of each training image. For each test sample, we apply an iterative algorithm to obtain the weight of each image pixel. An overview of GRR is shown in Fig. 1. Compared with other regression based classification methods, the novelty of the proposed model is threefold. First, we take into account the correlations of the representation residuals and develop a general regression and representation model (GRR) for sample coding. Second, GRR captures the prior information from the training set via generalized Tikhonov regularization in conjunction with the K Nearest Neighbor method and a leave-one-out procedure. Third, with the GRR model as a platform, we design two classifiers, Basic GRR (B-GRR) and Robust GRR (R-GRR), by combining the prior information and the specific information with different strategies.
To evaluate the proposed model, we finally use four databases which involve different recognition tasks: the CENPARMI dataset for handwritten numerical recognition, the NUST603 dataset for handwritten Chinese character recognition, the AR dataset for face recognition and face occlusion recognition, and the Extended Yale B dataset for face recognition with extreme lighting changes and face recognition with random block occlusion. Experimental results demonstrate the effectiveness of the proposed model. This paper is the extended version of our conference paper [36]. In this paper, we provide a more in-depth analysis and more extensive experiments on the proposed model.

A. Current Methods
Suppose there are c known pattern classes. Let $A_i$ be the matrix formed by the training samples of Class i, i.e., $A_i = [y_{i1}, y_{i2}, \cdots, y_{iM_i}] \in \mathbb{R}^{N \times M_i}$, where $M_i$ is the number of training samples of Class i. Let us define the matrix $A = [A_1, A_2, \cdots, A_c] \in \mathbb{R}^{N \times M}$, where $M = \sum_{i=1}^{c} M_i$. The matrix A is thus composed of all the training samples.

Sparse Representation based Classification
Given a test sample y, we represent y in an over-complete dictionary whose basis vectors are the training samples themselves, i.e., $y = Ax$. The sparse solution to $y = Ax$ can be sought by solving the following optimization problem:

$$\hat{x} = \arg\min_x \|x\|_0 \quad \text{s.t.} \quad Ax = y. \quad (1)$$

However, solving the L0 optimization problem is NP-hard. Fortunately, recent research reveals that L0 optimization and L1 optimization are equivalent when the solution is sparse enough. In general, the sparse representation problem can be formulated as

$$\hat{x} = \arg\min_x \|x\|_1 \quad \text{s.t.} \quad Ax = y. \quad (2)$$

The problem is equivalent to $\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \lambda \|x\|_1$. The classification rule is then

$$\text{identify}(y) = \arg\min_i \{ r_i(y) \}, \quad (3)$$

where $r_i(y) = \|y - \hat{y}_i\|_2 = \|y - A\delta_i(\hat{x})\|_2$ and $\delta_i(\hat{x})$ is the coefficient vector that keeps only the nonzero coefficients associated with class i.
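As an illustration of the SRC pipeline described above, the following is a minimal NumPy sketch. It is not the solver used in the experiments: the L1 problem is solved here by plain ISTA (iterative soft thresholding) on the objective $\frac{1}{2}\|y - Ax\|_2^2 + \lambda\|x\|_1$, which matches the unconstrained form up to a rescaling of $\lambda$. The function names, the `labels` array (one class label per column of A) and the default parameter values are assumptions made for this example.

```python
import numpy as np

def ista_l1(A, y, lam=0.01, n_iter=500):
    """Minimize 0.5*||y - A x||_2^2 + lam*||x||_1 by ISTA (illustrative solver)."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        z = x - A.T @ (A @ x - y) / L        # gradient step on the quadratic term
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft thresholding
    return x

def src_classify(A, labels, y, lam=0.01):
    """Assign y to the class with the smallest residual ||y - A delta_i(x)||_2."""
    x = ista_l1(A, y, lam)
    classes = np.unique(labels)
    residuals = []
    for c in classes:
        mask = labels == c                   # delta_i: keep coefficients of class c only
        residuals.append(np.linalg.norm(y - A[:, mask] @ x[mask]))
    return classes[int(np.argmin(residuals))]
```

The columns of A are assumed to be normalized to unit L2-norm before coding.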

Correntropy-based Sparse Representation
Correntropy-based Sparse Representation (CESR) leverages the maximum correntropy criterion to design a classifier for robust face recognition [17]. Similar to SRC, CESR aims to reconstruct a test sample y from the existing training samples as well as possible. The correntropy-based sparse model, Eq. (4), is built on the Gaussian kernel function $g(x) = \exp(-x^2/\sigma^2)$. This nonlinear objective function can be solved by the half-quadratic optimization technique. Then, the test sample is assigned to the class i for which the kernel value between y and $\hat{y}_i$ is maximal, i.e.,

$$\text{identify}(y) = \arg\max_i \{ c(y_i) \}, \quad (5)$$

where $c(y_i) = g(y - \hat{y}_i) = g(y - A\delta_i(\hat{x}))$ and $\sigma^2 = \frac{1}{2c}$.

Robust Sparse Representation
The robust sparse representation (RSR) problem can be reformulated as the minimization problem in Eq. (6), where $\phi(\cdot)$ is a robust M-estimator that can be optimized by half-quadratic (HQ) optimization and $(\cdot)_j$ denotes the j-th dimension of the input data. In the HQ framework, the RSR problem can be treated as an iterative regularization problem and solved through a sequence of unconstrained quadratic problems. The classification rule is

$$\text{identify}(y) = \arg\min_i \{ \phi(y - A\delta_i(\hat{x})) \}. \quad (7)$$

CRC with Regularized Least Square
CRC represents the test sample by the regularized least square method, which leads to results similar to those of L1-norm regularization but with a much lower computational burden. The model is formulated as

$$\hat{x} = \arg\min_x \|y - Ax\|_2^2 + \lambda \|x\|_2^2, \quad (8)$$

where $\lambda$ is the regularization parameter. The regularization term helps to achieve a stable solution. Meanwhile, it also imposes a slight sparsity constraint on $\hat{x}$, which is much weaker than that of SRC. The regularized least square solution of (8) is

$$\hat{x} = (A^T A + \lambda I)^{-1} A^T y.$$

The classification rule of CRC is similar to that of SRC, except that $r_i(y) = \|y - A\delta_i(\hat{x})\|_2 / \|\delta_i(\hat{x})\|_2$. We classify y by checking the reconstruction error of each class.
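The CRC coding step and decision rule can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: `labels` holds one class label per column of A, the value of `lam` is arbitrary, and a small constant is added to the denominator only to avoid division by zero.

```python
import numpy as np

def crc_classify(A, labels, y, lam=0.01):
    """CRC sketch: ridge-regularized coding, then normalized class residuals."""
    n_atoms = A.shape[1]
    # Closed-form solution x = (A^T A + lam*I)^{-1} A^T y
    x = np.linalg.solve(A.T @ A + lam * np.eye(n_atoms), A.T @ y)
    classes = np.unique(labels)
    scores = []
    for c in classes:
        mask = labels == c
        xc = x[mask]
        residual = np.linalg.norm(y - A[:, mask] @ xc)
        scores.append(residual / (np.linalg.norm(xc) + 1e-12))
    return classes[int(np.argmin(scores))], x
```

Because the projection matrix $(A^T A + \lambda I)^{-1} A^T$ does not depend on y, it can be computed once and reused for every test sample, which is the source of the low computational burden.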

Linear Regression Classification
It is assumed that patterns from the same class lie on a linear subspace. On this basis, LRC represents the test sample image as a linear combination of the class-specific training samples: $y = A_i x_i$. The least-squares solution is $\hat{x}_i = (A_i^T A_i)^{-1} A_i^T y$, and the class-specific reconstruction is $\hat{y}_i = A_i \hat{x}_i$. The decision of LRC is made in favor of the class with the minimum distance $d_i(y) = \|y - \hat{y}_i\|_2$.
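A minimal NumPy sketch of the LRC decision rule follows. Instead of forming $(A_i^T A_i)^{-1}$ explicitly, it calls `np.linalg.lstsq`, which returns the same least-squares coefficients when $A_i$ has full column rank and still works when it does not. The function name and the `labels` array are assumptions of this example.

```python
import numpy as np

def lrc_classify(A, labels, y):
    """LRC sketch: project y onto each class subspace and pick the closest one."""
    classes = np.unique(labels)
    distances = []
    for c in classes:
        Ac = A[:, labels == c]                        # class-specific training matrix A_i
        xc, *_ = np.linalg.lstsq(Ac, y, rcond=None)   # least-squares coefficients
        distances.append(np.linalg.norm(y - Ac @ xc)) # d_i(y) = ||y - y_hat_i||_2
    return classes[int(np.argmin(distances))]
```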

B. Problems
Most previous works, including RSC, SRC, CRC and CESR, assume that the representation residuals are homoskedastic and mutually uncorrelated. In real-world applications, these assumptions often do not hold. In particular, when the elements of the representation residual have unequal variances and are correlated, the variance-covariance matrix of the representation residuals is no longer a scalar matrix (a multiple of the identity), and hence there is no guarantee that the least square estimator is the most efficient within the class of linear unbiased estimators [46,47]. We give an example to illustrate this point. In Fig. 2, 200 samples of each class are selected from the CENPARMI dataset, and each sample is coded on its top 200 neighbors among the remaining samples. The correlation matrix map of the representation residuals in Fig. 2 shows that these representation residuals are indeed correlated.

General Regression and Representation Model for Classification (GRR)
This section introduces two classifiers, Basic GRR and Robust GRR, which are designed with the GRR model as a platform. Basic GRR is built to address the correlation problem of representation residuals in other regression models. Robust GRR is an extended version of the Basic GRR model that provides a mechanism to deal with noise in test samples.

A. Basic GRR
Let A be the matrix formed by the K nearest neighbors of the test sample from the training set, M be the number of training samples, and y be the test sample. Our model is

$$\hat{x} = \arg\min_x \|y - Ax\|_P^2 + \|x\|_Q^2,$$

where P is the matrix introduced to eliminate the correlations between representation residuals (also called reconstruction errors), and Q is used to refine the regularization term.
We call the above model the basic general regression and representation model (B-GRR). This model can be reformulated as

$$\hat{x} = \arg\min_x (y - Ax)^T P (y - Ax) + x^T Q x.$$

If P and Q are known, generalized Tikhonov regularization [30,44] gives the closed-form solution

$$\hat{x} = (A^T P A + Q)^{-1} A^T P y.$$

However, P and Q are unknown beforehand. We therefore employ a generative method to estimate these two correlation matrices in the training stage. Basically, we assume that the representation residual $e = y - Ax$ and the representation coefficient vector x follow multivariate normal distributions. Then, P can be estimated by the inverse covariance matrix of e [30,44]. To explain why P can be estimated in this way, we first let R be a non-stochastic transformation matrix and ignore the regularization term for the moment. Eq. (11) can then be reformulated as

$$\hat{x} = \arg\min_x \|Ry - RAx\|_2^2,$$

where Ry denotes the transformed dependent variable and RA is the matrix of the transformed explanatory variables. RA also has full column rank provided that R is nonsingular. The solution is

$$\hat{x} = (A^T R^T R A)^{-1} A^T R^T R y,$$

so that $P = R^T R$. The natural question is then how to find a transformation matrix that yields the most efficient estimator among all linear unbiased estimators. Generally speaking, one should choose R as a non-stochastic and nonsingular matrix such that $R^T \Sigma_e R = \sigma_e^2 I$, where $\Sigma_e$ denotes the covariance matrix of e. Note that $\Sigma_e$ is symmetric and positive definite, so it can be orthogonally diagonalized as $C^T \Sigma_e C = \Lambda$, where C is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of the corresponding eigenvalues. For $\Sigma_e^{-1/2} = C \Lambda^{-1/2} C^T$, we have $\Sigma_e^{-1/2} \Sigma_e (\Sigma_e^{-1/2})^T = I$. This result suggests that the transformation matrix R should be proportional to $\Sigma_e^{-1/2}$. Given this choice of R, we have $P = \Sigma_e^{-1}$. The matrix Q can be estimated by the inverse covariance matrix of x [30,45]. Q is introduced to generate a Mahalanobis distance based regularization term.
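The whitening argument can be checked numerically. The sketch below, with illustrative function names, builds $R = \Sigma_e^{-1/2}$ from the eigendecomposition of a symmetric positive definite covariance matrix and compares two estimators with the regularization term ignored: the P-weighted solution with $P = R^T R$, and ordinary least squares on the transformed data (Ry, RA).

```python
import numpy as np

def inverse_sqrt(Sigma):
    """Sigma^{-1/2} = C diag(eigvals^{-1/2}) C^T for symmetric positive definite Sigma."""
    eigvals, C = np.linalg.eigh(Sigma)       # C^T Sigma C = diag(eigvals)
    return C @ np.diag(eigvals ** -0.5) @ C.T

def gls_two_ways(A, y, Sigma_e):
    """Return the P-weighted solution and the OLS solution on whitened data."""
    R = inverse_sqrt(Sigma_e)
    P = R.T @ R                              # equals the inverse of Sigma_e
    x_weighted = np.linalg.solve(A.T @ P @ A, A.T @ P @ y)
    x_transformed, *_ = np.linalg.lstsq(R @ A, R @ y, rcond=None)
    return x_weighted, x_transformed
```

Both routes return the same coefficient vector up to rounding, which is the content of the identity $P = R^T R = \Sigma_e^{-1}$.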
The main difference between ridge regression and the proposed method is that ridge regression uses the Euclidean distance to constrain the representation coefficients, whereas the proposed method uses the Mahalanobis distance. The Mahalanobis distance is expected to provide a better regularization term than the Euclidean distance since there exist correlations between representation coefficients. Ideally, we should simultaneously maximize the correlation of the representation coefficients of the homo-class samples and minimize the correlation of the representation coefficients of the hetero-class samples in the training process. However, this is difficult to model because the numbers of representation coefficients corresponding to homo-class and hetero-class samples differ. A feasible way is to eliminate the correlations of all representation coefficients. This affects the representation coefficients of hetero-class samples more than those of homo-class samples, since in multi-class classification problems there are many more representation coefficients of hetero-class samples than of homo-class samples.
Based on the above analysis, we give the details of estimating P and Q as follows.
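Let $y_i^{(train)}$ be the i-th sample of the training set, and let $A_i^{(train)}$ be the matrix formed by the K nearest neighbors of $y_i^{(train)}$ among the remaining training samples. We set $P_0 = I$ and $Q_0 = I$. The coding coefficient vector $x_i$ of $y_i^{(train)}$ is computed from the closed-form B-GRR solution with $P_0$ and $Q_0$ in place of P and Q. Let $e_i = y_i^{(train)} - A_i^{(train)} x_i$ be the corresponding representation residual, and let $\mu_1$ be the mean of the residuals $e_i$ over the training set. P is then estimated as the inverse of the regularized covariance matrix of the residuals, $P = (\Sigma_e + \lambda_1 I)^{-1}$, where $\Sigma_e$ is the sample covariance matrix built from the terms $(e_i - \mu_1)(e_i - \mu_1)^T$ and $\lambda_1$ is a regularization parameter. Note that $\lambda_1 I$ is introduced to avoid the singularity of the covariance matrix.
Similarly, let $\mu_2$ be the mean of the coefficient vectors $x_i$. Q is estimated as $Q = (\Sigma_x + \lambda_2 I)^{-1}$, where $\Sigma_x$ is the sample covariance matrix built from the terms $(x_i - \mu_2)(x_i - \mu_2)^T$. Here $\lambda_2$ is a regularization parameter, and $\lambda_2 I$ is likewise used to avoid the singularity of the covariance matrix.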
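The leave-one-out estimation of P and Q can be sketched as follows. This is an illustration under several assumptions that the text does not fix: neighbors are found by Euclidean distance, the k-th entry of every coefficient vector refers to the k-th nearest neighbor (so that the coefficient covariance is K x K), the sample covariance uses NumPy's default normalization, and the values of `lam1` and `lam2` are arbitrary. The function names are hypothetical.

```python
import numpy as np

def knn_indices(X, i, K):
    """Indices of the K columns of X closest to column i, excluding i itself."""
    d = np.linalg.norm(X - X[:, [i]], axis=0)
    d[i] = np.inf                            # leave the sample itself out
    return np.argsort(d)[:K]

def estimate_P_Q(X, K, lam1=1e-3, lam2=1e-3):
    """Estimate P and Q from training samples stored as the columns of X."""
    N, M = X.shape
    P0, Q0 = np.eye(N), np.eye(K)
    residuals = np.empty((N, M))
    coeffs = np.empty((K, M))
    for i in range(M):
        Ai = X[:, knn_indices(X, i, K)]
        # Closed-form coding with the initial matrices P0 and Q0
        xi = np.linalg.solve(Ai.T @ P0 @ Ai + Q0, Ai.T @ P0 @ X[:, i])
        coeffs[:, i] = xi
        residuals[:, i] = X[:, i] - Ai @ xi
    Sigma_e = np.cov(residuals)              # N x N covariance of the residuals
    Sigma_x = np.cov(coeffs)                 # K x K covariance of the coefficients
    P = np.linalg.inv(Sigma_e + lam1 * np.eye(N))
    Q = np.linalg.inv(Sigma_x + lam2 * np.eye(K))
    return P, Q
```

Both matrices are computed once, offline, and then reused for every test sample.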
In the testing stage, for a given test sample y, we find its K nearest neighbors in the training set to form the matrix A. Then, we calculate the representation coefficient vector $\hat{x}$ using Eq. (14). We can reconstruct the test sample y as $\hat{y}_c = A\delta_c(\hat{x})$ by employing the representation coefficients associated with the c-th class, and we denote the corresponding reconstruction error of the c-th class by $r_c(y)$. The decision rule is: if $r_l(y) = \min_c r_c(y)$, y is assigned to Class l. B-GRR makes full use of the prior information of the training set. It works well when the test samples share the same probability distribution as the training samples. The B-GRR algorithm for classification is summarized in Algorithm 1.

B-GRR for Classification
Input: Dictionary A, test sample y, initial values $P_0$ and $Q_0$.
Output: y is assigned to the class which yields the minimum residuals.
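The testing stage of B-GRR can be sketched as below. The sketch assumes the generalized Tikhonov closed form $\hat{x} = (A^T P A + Q)^{-1} A^T P y$ and, since the class residual is not spelled out here, uses the plain Euclidean residual $\|y - A\delta_c(\hat{x})\|_2$ as a stand-in. P and Q are taken as given (learned offline), with Q of size K x K; all names are illustrative.

```python
import numpy as np

def bgrr_classify(X, labels, y, P, Q, K):
    """B-GRR sketch: code y on its K nearest training samples, then compare class residuals."""
    d = np.linalg.norm(X - y[:, None], axis=0)
    nn = np.argsort(d)[:K]                   # K nearest neighbors of the test sample
    A, nn_labels = X[:, nn], labels[nn]
    # Generalized Tikhonov solution (assumed form)
    x = np.linalg.solve(A.T @ P @ A + Q, A.T @ P @ y)
    best_class, best_residual = None, np.inf
    for c in np.unique(nn_labels):
        mask = nn_labels == c
        residual = np.linalg.norm(y - A[:, mask] @ x[mask])  # assumed residual definition
        if residual < best_residual:
            best_class, best_residual = c, residual
    return best_class
```

Only classes that appear among the K neighbors can be selected, which is one practical effect of the KNN-based dictionary.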

B. Robust GRR
In image classification problems, illumination, expression or pose changes may cause significant differences between test samples and training samples. Therefore, it is necessary to introduce the test sample specific information to alleviate the effect caused by the differences between test samples and training samples. This specific information is to give a weight to each feature (or image pixel) of the sample, which can be learned online via the iteratively reweighted algorithm.
Based on this idea, we present a robust general regression and representation model (R-GRR) for classification. Compared with B-GRR, R-GRR includes not only the prior information P and Q but also the specific information, a weight matrix W. If P, Q and W are known, the R-GRR model can be solved explicitly in closed form. Since P and Q can be learned offline by the same method as in Basic GRR, the remaining problem is to learn the specific information W online. Specifically, given a test sample y, we first compute its representation residual e to initialize the weights. The residual is initialized as $e = y - y_{ini}$, where $y_{ini}$ is the initial estimate of the true image from the observed samples. In this study, we simply set $y_{ini}$ to the mean image of all samples in the coding dictionary, since we do not know which class the test image y belongs to. Starting from $y_{ini}$, our method estimates the weight matrix W iteratively. W is a diagonal matrix whose entry $W_{k,k} = \omega_\theta(e_k)$ is the weight assigned to the k-th pixel of the test image. The weight function $\omega_\theta$ follows [16] and depends on two positive scalars a and b. In addition, Eq. (22) is the explicit solution of Eq. (21). The process terminates when the difference between the weights of adjacent iterations satisfies the condition in Eq. (24). The R-GRR algorithm for classification is summarized in Algorithm 2.

R-GRR for Classification
Input: Dictionary A, test sample y. Initial values P 0 , Q 0 and y ini .
1. Normalize the columns of A to have unit L2-norm, normalize the test sample y to unit L2-norm, and initialize $y^{(t)}$ as $y_{ini}$.
2. Learn the prior information matrices P and Q from the training set by using generalized Tikhonov regularization and KNN.
3. Code the test sample y on its K nearest neighbors A:
d) Compute the reconstructed test sample $y^{(t)} = Ax^{(t)}$, and let t = t + 1.
e) Go back to step a) until the maximal number of iterations is reached or the convergence condition in Eq. (24) is met.
4. Compute the residuals of each class.
Output: y is assigned to the class which yields the minimum residuals.
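The iterative reweighting loop can be sketched as follows. Several ingredients are assumptions rather than the definitions used in this paper: the weight function is taken to be the logistic form commonly used in robust coding, with a controlling the slope and b selecting a quantile of the squared residuals; W is combined with P as $W^{1/2} P W^{1/2}$ in the closed-form solve; and the stopping rule is a relative change in the weights. Passing `P=None` uses the identity, which corresponds to dropping P as described for occluded test images.

```python
import numpy as np

def logistic_weights(e, a=8.0, b=0.8):
    """Per-pixel weights in [0, 1]; large residuals get small weights (assumed form)."""
    e2 = e ** 2
    k = max(int(np.floor(b * e2.size)) - 1, 0)
    delta = max(np.sort(e2)[k], 1e-12)           # b-quantile of the squared residuals
    z = np.clip((a / delta) * (e2 - delta), -50.0, 50.0)  # clip to avoid overflow in exp
    return 1.0 / (1.0 + np.exp(z))

def rgrr_code(A, y, Q, P=None, a=8.0, b=0.8, max_iter=20, tol=1e-3):
    """R-GRR sketch: alternate between updating the weights and the coefficients."""
    N = A.shape[0]
    P = np.eye(N) if P is None else P
    y_rec = A.mean(axis=1)                       # y_ini: mean of the dictionary samples
    w_old = None
    for _ in range(max_iter):
        w = logistic_weights(y - y_rec, a, b)
        s = np.sqrt(w)
        WPW = (s[:, None] * P) * s[None, :]      # W^{1/2} P W^{1/2} (assumed combination)
        x = np.linalg.solve(A.T @ WPW @ A + Q, A.T @ WPW @ y)
        y_rec = A @ x                            # reconstructed test sample
        if w_old is not None and np.linalg.norm(w - w_old) <= tol * np.linalg.norm(w_old):
            break
        w_old = w
    return x, w
```

The returned weight vector w plays the role of the diagonal of W and can be reshaped to the image size to inspect which pixels were treated as outliers.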

C. Robust GRR for Occlusion Cases
In real-world image recognition tasks, occlusion is one of the most challenging problems. To overcome this problem, we combine the advantages of the prior information Q and the specific information W to enhance the classification performance. As we know, P reflects the correlations between representation residuals. If there are great differences between the test sample and the training samples, the resulting reconstruction error does not follow the original distribution. In this case, we cannot employ the matrix P to eliminate the correlations between the representation residuals of test images. So, the matrix P is removed from the model when there is occlusion, real disguise or noise in the test image. In contrast, the matrix Q is not affected by whether the test image is occluded, since it is mainly used as a regularization term. Therefore, we keep the matrix Q in the model, which can have a positive effect on the performance. Eq. (19) is then reformulated without P, and the resulting model again admits a closed-form solution.

Further Analysis of GRR
In this section, we further analyze the role of P and Q in GRR. P is a symmetric matrix that is learned from the training set and can be decomposed as $R^T R$, where R is a nonsingular transformation matrix used to eliminate the correlations between representation residuals. The matrix Q in the regularization term is also learned from the training set. The proposed model uses the Mahalanobis distance instead of the Euclidean distance to constrain the representation coefficients. The Mahalanobis distance is expected to provide a better regularization than the Euclidean distance since there exist correlations between representation coefficients. Fig. 3(a) gives an example to show the role of P and Q.
In this example, we represent the test sample "1" from the CENPARMI database and illustrate the reconstruction residual of each class. Based on the minimal class residual criterion, B-GRR, which uses the prior information contained in P and Q, achieves the right result, while CRC fails without this information.
We then compare the obtained representation coefficients of CRC and B-GRR. Fig. 3 (b) shows the representation coefficients of CRC and B-GRR for the same test image as shown in Fig. 3 (a). The representation coefficients of the homoclass samples are highlighted in blue. CRC provides a very dense representation, while B-GRR gives a sparser representation due to the KNN based dictionary selection. In comparison with CRC, the representation coefficients of B-GRR seem to be more congregated on the homo-class samples.
We also give an example to compare our methods with some state-of-the-art methods in handling occlusions. In the example, two classes of face images from the AR database, as shown in Fig. 4, are used for training. We test two cases of real-world disguise images: images with sunglasses and images with scarves. In Fig. 5 (a) and Fig. 5 (b), the left column contains the disguise images. In our test, we use R-GRR, RRC_L2, RSC, B-GRR, SRC and CRC to deal with occlusion. For each occluded image, the reconstructed images (recovered clean image) and the residual images (recovered occlusion) are shown in Fig. 5. From Fig. 5, we can see that R-GRR achieves results comparable to those of RRC_L2 and RSC and significantly outperforms the other methods. From the viewpoint of the weight maps, however, R-GRR is slightly better than RRC_L2 and RSC.

Experiments
In this section, we perform experiments on four benchmark databases and compare the proposed GRR model with state-of-the-art models. Note that in SRC and RSC, the MATLAB function "l1-ls" [25] is used to calculate the sparse representation coefficients. In the following experiments, the parameter a is set to 8 and b to 0.8 for image classification; b is set to 0.5 when dealing with occlusion cases [16].

CENPARMI Database
The experiment was done on the Concordia University CENPARMI handwritten numeral database. The database contains 6000 samples of 10 numeral classes (each class has 600 samples). Some samples of "0" from the CENPARMI database are shown in Fig. 6.
In the first experiment, we choose the first 200 samples of each class for training and the remaining 400 samples for testing. Thus, the total number of training samples is 2000 and the total number of testing samples is 4000. PCA is used to transform the original 121-dimensional Legendre moment features [28] into D-dimensional features, where D varies from 10 to 100 in steps of 10. Based on the PCA-transformed features, NN, NFL, LRC, SRC, CRC, RSC, RRC_L2 and B-GRR are employed for classification. The parameter K is set to 200. The recognition results of each method versus the dimension are shown in Fig. 7 (a).
In the second experiment, we let the number of training samples per class vary from 100 to 500 in steps of 100 and use the remaining samples for testing. Then, PCA is used to transform the original Legendre moment features into low-dimensional features. We select the optimal dimension of each method based on the above experiments as shown in Fig. 7 (b). The recognition rates of each method versus the number of training samples are shown in Fig. 7 (b). From Fig. 7 (b), we can see that B-GRR still gives better results than the other competing methods.

NUST603 Database
The experiment was performed on the NUST603 handwritten Chinese character database which was built in Nanjing University of Science and Technology. The database contains 19 groups of Chinese characters that are collected from bank checks, each group with 400 samples. Some images from the NUST603HW database are shown in Fig. 8.
In this experiment, the first 200 samples of each class are used for training and the remaining samples for testing. Following the experimental methodology of the previous experiment, PCA is used to transform the original 128-dimensional peripheral feature [29] into D-dimensional features, where D varies from 10 to 100 in steps of 10. The parameter K is set to 300. Then NN, NFL, LRC, SRC, CRC, RSC, RRC_L2 and B-GRR are employed for classification. The performance of each method versus the dimension is shown in Fig. 9 (a). Additionally, we let the number of training samples per class vary from 100 to 300 in steps of 50 and use the remaining samples for testing. PCA is then used to transform the original features into low-dimensional features. We select the optimal dimension of each method based on the above experiments as shown in Fig. 9 (b). The recognition rates of each method are illustrated in Fig. 9 (b). The results in Fig. 9 are basically consistent with those in Fig. 7.

B. Face Recognition without Occlusion
We evaluate the performance of R-GRR on the AR and the Extended Yale B database with illumination and expression changes but without occlusion. In these experiments, PCA is first used to reduce the dimensionality of face image.

AR Database
The AR face database [26] contains over 4000 color face images of 126 persons, including frontal views of faces with different facial expressions, lighting conditions and occlusions. The pictures of 120 individuals were taken in two sessions (separated by two weeks), and each session contains 13 color images. Fourteen face images (7 per session) of 100 individuals are selected and used in our experiment. The face portion of each image is manually cropped and then normalized to 60 × 43 pixels.
In this experiment, images from the first session are used for training, and images from the second session are used for testing. Then NFL, LRC, SRC, CRC, B-GRR, RSC, RRC_L 2 and the proposed R-GRR are employed for classification. The NN classifier is also used to provide a baseline. The parameter K of R-GRR means we choose the K nearest neighbors of the test image from training set to form the coding dictionary. K is set to 650 here. The recognition rates of each classifier versus the variation of dimensions are listed in Table 1. From Table 1, we can see that our model R-GRR outperforms state-of-the-art methods in all dimensions except that R-GRR is slightly worse than RSC when dimension is 54. However, it's difficult to achieve better performance when dimension is low for all the methods. The maximal recognition rates of NN, NFL, LRC, SRC, CRC, B-GRR, RSC, RRC_L 2 and R-GRR are achieved when the dimension is 300.

Extended Yale B Database
The Extended Yale B face image database [27] contains 38 human subjects under 9 poses and 64 illumination conditions. The 64 images of a subject in a particular pose are acquired at a camera frame rate of 30 frames/second, so there are only small changes in head pose and facial expression across those 64 images. All frontal-face images marked with P00 are used in our experiment, and each is resized to 48 × 42 pixels.
In our experiment, we use the first 32 images of each individual for training and the remaining images are used for testing. Based on the PCA-transformed features, NN, NFL, LRC, SRC, CRC, B-GRR, RSC, RRC_L2 and R-GRR are employed for classification. The parameter K is 800. The recognition rates of each classifier corresponding to the variation of feature dimensions are listed in Table 2. Table 2 shows that the proposed model R-GRR achieves the best recognition results in all dimensions for face recognition. When the feature dimension is 100, R-GRR gives about 3% improvement of recognition rate over LRC, SRC and CRC, respectively.

C. Face Recognition with Occlusion
In this section, we examine the robustness of R-GRR when face images suffer different occlusions, such as real disguise, block occlusion or pixel corruption. In the following experiments, we mainly compare our method with CRC, SRC, RSC, RRC_L 2 , correntropy-based sparse representation (CESR) [17] and Gabor-SRC [13].

Face Recognition with Real Disguise
A subset of the AR face image database is used in our experiment. The subset includes 100 individuals, 50 males and 50 females. Each individual has images from two sessions, and each session contains 13 images. The face portion of each image is manually cropped and then normalized to 42 × 30 pixels.
In the first experiment, we choose the first four images (with various facial expressions) from session 1 and session 2 of each individual to form the training set, giving 800 training images in total. There are two image sets (with sunglasses and with scarves) for testing. Each set contains 200 images (one image per session of each individual, with neutral expression). The parameter K is 300 for the test set with sunglasses and 760 for the test set with scarves. The face recognition results of each method on the two test sets are listed in Table 3. From Table 3, we can see that R-GRR achieves the best recognition result among all the methods on the images with scarves and gives a result comparable to that of the best-performing method on the images with sunglasses. Additionally, the performances
In the second experiment, four neutral images with different illumination from the first session of each individual are used for training. The disguised images with various illumination and sunglasses or scarves of each individual in session 1 and session 2 are used for testing. We set the parameter K to 220, 300, 240 and 320 for the four test sets, respectively. The recognition rates of each method are shown in Fig. 10. From Fig. 10, we can see clearly that R-GRR gives better performance than CRC, SRC, GSRC, CESR, RSC and RRC_L2 on the different test subsets. Both SRC and CESR do well on the subsets with sunglasses but poorly on those with scarves, whereas GSRC achieves better results on the subsets with scarves and worse results on the subsets with sunglasses. Compared with RSC, R-GRR achieves an improvement of at least 4.3% on every test set. Meanwhile, it is worth noting that the recognition rate of R-GRR is 67.6% and 59.6% higher than those of SRC and CESR, respectively, on the test images with scarves from session 2, and 43.7% higher than that of GSRC on the test images with sunglasses from session 2. On the first two subsets, from session 1, the performances of R-GRR and RRC_L2 are similar.
However, R-GRR significantly outperforms RRC_L2 on the last two subsets (the more challenging tasks), from session 2. Compared with RRC_L2, R-GRR uses $\|x\|_Q^2$ instead of $\|x\|_2^2$ to refine the regularization term, which can further improve the classification performance.
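For reference, the Q-weighted regularizer that R-GRR uses in place of the squared L2 norm can be written out explicitly. The expansion below assumes that Q is a symmetric positive-definite weighting matrix, which is the usual setting for generalized Tikhonov regularization; how Q is constructed is not restated in this section.

```latex
\|x\|_Q^2 = x^{\top} Q\, x ,
\qquad
\|x\|_2^2 = x^{\top} x = \|x\|_I^2 .
```

Under this assumption, the L2 regularizer is the special case Q = I, so any difference between the two methods that stems from the regularization term comes from the information encoded in Q.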

Face Recognition with Block Occlusion
In this experiment, we use the same experimental setting as in [12,16] to test the robustness of R-GRR. Subsets 1 and 2 of the Extended Yale B database are used for training and Subset 3 is used for testing. The face images are resized to 96 × 84 pixels. The parameter K is 500. Fig. 11 shows the recognition rate curves of SRC, GSRC, CESR, RSC, RRC_L2 and R-GRR versus the level of occlusion (from 0 percent to 50 percent). From Fig. 11, we can see that the proposed R-GRR overall outperforms SRC, GSRC, CESR, RSC and RRC_L2. When the occlusion percentage is 50%, R-GRR achieves the best recognition rate of 91.9%, compared with 65.3% for SRC, 87.4% for GSRC, 57.4% for CESR, 87.6% for RSC, and 87.8% for RRC_L2. The performance of CESR is surprisingly poor; it is probably not suited to this block occlusion case.

Face Recognition with Pixel Corruption
In this experiment, we choose the images from Subsets 1 and 2 of the Extended Yale B database for training, and images from Subset 3 with random pixel corruption (the image is corrupted by uniformly distributed random values within [0, 255]) for testing. The face images are resized to 96 × 84 pixels. The corrupted pixels are randomly chosen for each test image and their locations are unknown to the algorithm. We vary the percentage of corrupted pixels from 0% to 90%. Since most of the competing methods perform well from 0% to 40% corruption, we only report the recognition rates for 50%-90% corruption. Fig. 12 plots the recognition rates of five methods under different levels of corruption. From Fig. 12, we can see that R-GRR, RRC_L2 and RSC give similar results at 50%, 60%, 70% and 80% corruption. R-GRR achieves the best recognition rate when the percentage of corrupted pixels is 90%. However, the performance of SRC is poor when the percentage of corrupted pixels is more than 70%.
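The corruption procedure described above can be sketched as follows: a given fraction of pixel locations is drawn at random for each image and overwritten with uniformly distributed values in [0, 255]. The function name `corrupt_pixels`, the use of integer pixel values and the constant toy image are assumptions made for illustration.

```python
import numpy as np

def corrupt_pixels(image, fraction, rng):
    """Overwrite a random `fraction` of pixels with uniform values in [0, 255].

    Returns the corrupted image and the flat indices of the corrupted pixels.
    The indices are returned only for inspection; a classifier under test
    would not be given them.
    """
    out = image.astype(np.float64).copy()
    n_corrupt = int(round(fraction * out.size))
    idx = rng.choice(out.size, size=n_corrupt, replace=False)   # distinct locations
    out.flat[idx] = rng.integers(0, 256, size=n_corrupt)         # integers in 0..255
    return out, idx

# Toy 96 x 84 image with a constant gray level (not a real face image).
rng = np.random.default_rng(0)
image = np.full((96, 84), 128, dtype=np.uint8)
corrupted, idx = corrupt_pixels(image, fraction=0.9, rng=rng)
```

Sampling locations without replacement makes the corrupted fraction exact, which matters when comparing methods at fixed corruption levels such as 50% to 90%.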

D. Discussion
In this section, we first discuss the influence of the parameters K, $\lambda_1$ and $\lambda_2$ in our experiments. We then compare the running time of the proposed R-GRR with that of state-of-the-art methods.
The performance of the proposed method R-GRR (or B-GRR) with different parameters is evaluated in different recognition scenarios. The experimental settings are the same as those of the experiments in Sections 5.2 and 5.3. In our experiments, we change one parameter at a time while fixing the others. Fig. 13 plots the recognition rates versus the parameter K on the CENPARMI database and the NUST603 database. From Fig. 13, we can see that B-GRR achieves better recognition rates with a smaller K. Fig. 14 plots the recognition rates versus the parameter K in the different face recognition experiments. From Fig. 14 (a) and (b), we can see that a relatively large K that is still smaller than the total number of training samples leads to higher performance for face images without occlusion. Fig. 14 (c) and (e) show that the recognition rates are not sensitive to variations of K. In Fig. 14 (f), the proposed method achieves its best results when K is 200 for the test images with block occlusion. However, R-GRR gives the best performance when K is set to 550 for the test images with pixel corruption, as shown in Fig. 14 (g). Generally speaking, the best K is relatively small when the feature dimension is much lower than the number of training samples, and relatively large when the feature dimension is much higher than the number of training samples.
In this paper, we employ a cross-validation strategy to determine the parameter K in the training stage. Specifically, each training sample is selected in turn as the query sample, with the remaining training samples forming the gallery set, so that a recognition rate over all training samples is obtained. We choose the K that achieves the best recognition rate.
Fig. 15 plots the recognition rates versus the regularization parameters $\lambda_1$ and $\lambda_2$, respectively. From Fig. 15, we can see that the proposed model always achieves its optimal or nearly optimal performance when $\lambda_2 = 1$ under the different face recognition scenarios, whereas its performance is insensitive to the variation of $\lambda_1$. Thus, it is easy to set the regularization parameters of the proposed methods in real-world applications.
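The cross-validation strategy for choosing K can be sketched as a leave-one-out loop over the training set. In the sketch below, `classify_knn_ridge` is a simple stand-in classifier (ridge coding over the K nearest samples followed by a class-wise residual rule) and is not the GRR model itself; the function names, the value of `lam`, the candidate list and the toy data are all assumptions made for illustration.

```python
import numpy as np

def classify_knn_ridge(train, labels, y, k, lam=1e-2):
    """Stand-in classifier (NOT the paper's GRR): code y over its k nearest
    training samples with ridge regression, then pick the class whose
    samples give the smallest reconstruction residual."""
    k = min(k, train.shape[1])
    idx = np.argsort(np.linalg.norm(train - y[:, None], axis=0))[:k]
    D, d_labels = train[:, idx], labels[idx]
    coef = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T @ y)
    classes = np.unique(d_labels)
    residuals = [np.linalg.norm(y - D[:, d_labels == c] @ coef[d_labels == c])
                 for c in classes]
    return classes[int(np.argmin(residuals))]

def select_k_leave_one_out(train, labels, candidates, classify=classify_knn_ridge):
    """Each training sample is the query once; the rest form the gallery.
    Returns the candidate K with the highest recognition rate (first one on ties)."""
    n = train.shape[1]
    best_k, best_rate = None, -1.0
    for k in candidates:
        correct = 0
        for i in range(n):
            keep = np.arange(n) != i
            pred = classify(train[:, keep], labels[keep], train[:, i], k)
            correct += int(pred == labels[i])
        rate = correct / n
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k, best_rate

# Toy data: 3 classes, 10 samples each, 20 features (random, illustrative only).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 10)
means = rng.normal(scale=5.0, size=(20, 3))
train = means[:, labels] + rng.normal(size=(20, 30))

best_k, best_rate = select_k_leave_one_out(train, labels, candidates=[5, 10, 20])
```

Because every candidate K requires one classification per training sample, the cost of this search grows with both the number of candidates and the size of the training set, which is why it is carried out once in the training stage.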
The running times of the competing methods, including SRC, GSRC, CESR, RSC, RRC_L2 and R-GRR, are evaluated on the AR database (with sunglasses). The programming environment is Matlab version 11b. The desktop has a 2.93 GHz CPU and 4 GB of RAM. Table 4 lists the computation time for one recognition operation of the various methods under the same experimental setting. From Table 4, we can see that R-GRR is superior to SRC, GSRC and RSC owing to its lower computation cost and better performance. RRC_L2 requires the least computation time. SRC has a rather high computational burden. In addition, RSC is very time-consuming since it must solve an L1 optimization problem in each iteration. Although CESR is also fast, its performance is not stable. R-GRR has a computation time comparable to that of RRC_L2 and achieves better performance than RRC_L2 in most cases.

Conclusions and Future Works
In this section, we first conclude the paper and then discuss potential future work. This paper presents a general regression and representation (GRR) model for pattern classification. In GRR, we learn the prior information from the training set by using the generalized Tikhonov regularization and KNN, and obtain the specific information from the test sample by using an iteratively reweighted algorithm. We provide two classifiers, B-GRR and R-GRR, which combine the prior information and the specific information with different strategies. Experiments on character datasets and face datasets demonstrate the validity of our model and its performance advantages over state-of-the-art classification methods. In particular, R-GRR achieves encouraging recognition rates in the different cases at a lower computational cost.
Although our model has demonstrated promising performance, there are still many issues requiring in-depth investigation in the future. Here, two improvements can be made to GRR. (1) Most classification methods perform well only under the assumption that the training and testing data are drawn from the same feature space and the same distribution. However, this assumption is difficult to satisfy in real-world applications. To address this problem, transfer learning has been proposed; it aims to improve the target predictive function by using the knowledge in the source domain [50]. Deng et al. presented the generalized hidden-mapping ridge regression method for various types of classical intelligent methods [51]. We can borrow the idea of transfer learning to improve the robustness of our model. (2) With the ever-increasing size of training data sets, a challenge for our model is how to design an efficient learning algorithm. Many studies have been reported that address similar problems. Ivor W. Tsang et al. presented the core vector machine (CVM) to handle larger datasets. CVM not only preserves the performance of SVM but also runs much faster than existing scale-up methods [52]. Deng et al. also developed effective learning algorithms for fuzzy models when facing large datasets [53,54].

Ethics Statement
Some face image datasets were used in this paper to verify the performance of our methods. These face image datasets are publicly available for face recognition research, and consent was not needed. The face images and the experimental results are reported in this paper without any commercial purpose.