Robust Face Recognition via Multi-Scale Patch-Based Matrix Regression

In many real-world applications such as smart card solutions, law enforcement, surveillance and access control, the limited number of training samples is one of the most fundamental problems. By exploiting the low-rank structure of the reconstruction error image, the so-called nuclear norm-based matrix regression has been demonstrated to be effective for robust face recognition with contiguous occlusions. However, the recognition performance of nuclear norm-based matrix regression degrades greatly in the face of the small sample size problem. An alternative solution to this problem is to perform matrix regression on each patch and then integrate the outputs from all patches. However, it is difficult to set an optimal patch size across different databases. To fully utilize the complementary information from different patch scales for the final decision, we propose a multi-scale patch-based matrix regression scheme in which the multi-scale outputs are ensembled optimally. Extensive experiments on benchmark face databases validate the effectiveness and robustness of our method, which outperforms several state-of-the-art patch-based face recognition algorithms.


Introduction
Object classification is an active topic in the area of pattern recognition [1][2][3][4][5][6][7][8]. Due to its non-intrusive nature and pronounced uniqueness, face recognition has been an active research topic and has been incorporated into many multimedia applications [9][10][11][12][13][14], such as surveillance, human-machine interaction, access control and photo album management in social networks. Recently, linear regression-based face recognition approaches have led to state-of-the-art performance [15][16][17][18], with representative examples being sparse representation-based classification (SRC) [15] and linear regression-based classification (LRC) [16]. In SRC, the query image is coded as a sparse linear combination of all the training images, and the classification is then made by checking which class yields the least reconstruction error. Many extensions of SRC have been developed for vision applications, e.g., super-resolution [19,20], facial expression recognition [21] and human gait recognition [22,23]. Alternatively, Naseem et al. [16] proposed LRC for face recognition. Based on the assumption that samples from a specific object class lie on a linear subspace, LRC represents a query image as a linear combination of the training images of each class. Yang et al. [24] provided an insight into SRC and sought reasonable support for its effectiveness. They viewed the L1-regularizer as having two properties, sparseness and closeness: sparseness selects a small number of nonzero representation coefficients, and closeness makes the nonzero representation coefficients concentrate on the training samples with the same class label as the test sample. Zhang et al. [18] discussed the working mechanism of SRC and demonstrated that it is collaborative representation rather than L1-norm sparseness that improves the classification performance.
In their work, a collaborative representation-based classification (CRC) model was presented with squared L2-regularization, which achieves competitive classification performance with significantly lower complexity than the sparse representation method.
It is worth noting that the majority of studies assume that the testing images are taken under well-controlled settings (e.g., reasonable illumination, poses and variations, without occlusion or disguise), and their performance degrades when the testing images are contaminated. By introducing an identity matrix I as a dictionary to code the outliers (e.g., corrupted or occluded pixels), SRC [15] exhibits excellent robustness and promising performance. However, SRC is not robust to contiguous occlusion such as sunglasses or scarves once the occlusion level exceeds the breakdown point of the algorithm. Yang et al. [25] modified the SRC framework to handle outliers such as occlusions in face recognition by modeling sparse coding as a sparsity-constrained robust regression problem. He et al. [26] unified the algorithms for error correction and detection through their additive and multiplicative forms, respectively, and established a half-quadratic framework to solve the robust sparse representation problem. From the viewpoint of dictionary learning, Yang et al. [27] constructed a feature pattern dictionary that captures structured information and prior knowledge of image features to represent the unknown feature pattern weight of a query image. Similarly, Ou et al. [28] simultaneously learned a clean dictionary and a noise dictionary, and applied the learned clean dictionary for classification. Observing the distribution of the reconstruction error image, Yang et al. [29][30][31][32] used the nuclear norm to characterize the structural information of the error image and proposed a nuclear norm-based matrix regression model that has achieved state-of-the-art performance for face recognition with occlusion and illumination changes.
In spite of the aforementioned achievements, the small sample size (SSS) problem still remains one of the most fundamental and challenging issues in the face recognition community. In many real-world applications such as smart card solutions, law enforcement, surveillance and access control, the available training samples per subject may be very limited [33]. Thus, the performance of these regression-based methods is greatly degraded because the query sample cannot be well represented by the few training samples. To tackle the SSS problem, many efforts have been made in the past few decades, and existing methods mainly fall into three categories. The first are patch-based methods, which generally consist of local patch representation, local feature extraction and the combination of classification results [34][35][36]; however, the patch size has a great impact on the performance of patch-based methods [37,38]. The second integrate local and global features for classification [39,40], because the two provide complementary information for the final results. The third employ different feature extractors to extract multiple types of features and then utilize a decision-level fusion scheme for the final classification [41,42]. We mainly focus on patch-based methods in the sequel.
To improve the recognition performance of matrix regression in the SSS problem while preserving its outstanding ability to deal with occlusion and illumination changes, in this paper we propose performing matrix regression on patches. The resulting patch-based matrix regression (PMR) classifies each query matrix patch and then integrates the recognition outputs of all patches for the final decision. Nevertheless, the patch size plays an important role in the final performance of PMR, and the optimal patch size varies greatly across different databases. If the patch size is too small, each patch carries little information and cannot capture the geometric structure of the image; if it is too large, only a limited number of patches is available. To fully exploit the classification ability and appearance information of different patch sizes, we devise a multi-scale PMR (MSPMR) scheme that integrates the complementary information from different scales. MSPMR first performs PMR on each scale and then learns optimal scale weights to adaptively fuse the multi-scale outputs. To evaluate the performance of the proposed method, we use four databases that involve different recognition tasks: the Extended Yale B, AR and LFW datasets for face recognition without occlusion, the AR database for face recognition with real disguise, and the Extended Yale B dataset for face recognition with block occlusion. The experimental results demonstrate the effectiveness and robustness of the proposed method.
The remainder of the paper is organized as follows. Section 2 briefly reviews two related works. The proposed multi-scale PMR via margin distribution optimization is presented in Section 3. Section 4 conducts extensive experiments, and Section 5 concludes this paper.

Nuclear norm based matrix regression
By observing the distribution of the reconstruction error image, a nuclear norm-based matrix regression (NMR) [29] model was proposed that uses the nuclear norm to characterize the whole structure of the error image. Here, we define N_i as the number of images from the i-th class and N = Σ_{i=1}^{c} N_i as the total number of training samples from c classes. Given a set of N training image matrices A_1, A_2, ..., A_N ∈ R^{row×col} and a query image matrix B ∈ R^{row×col}, the NMR model can be represented as

x* = argmin_x ||B − A(x)||_* + (λ/2)||x||_2^2,   (1)

where λ is the regularization parameter, and x and A(x) = x_1 A_1 + x_2 A_2 + ... + x_N A_N are the representation coefficient vector and the reconstructed image, respectively. The query image is then classified into the class that yields the minimal reconstruction error, i.e.,

Identity(B) = argmin_i r_i(B),

where r_i(B) = ||B − B̂_i||_* with B̂_i = A(δ_i(x*)), x* is the optimal solution of Eq (1), and δ_i(x) is a vector whose only nonzero entries are the entries in x that correspond to Class i. NMR is thus much more robust and effective for face recognition, particularly with respect to occlusion and illumination changes. To alleviate the small sample size problem, the patch-based CRC (PCRC) [36] model was proposed. For a given query image y, it is first divided into multiple overlapped patches {y_1, y_2, ..., y_p}. Then, each patch y_i is represented as a linear combination over a local dictionary D_i. Finally, one can apply a plurality or linear-weighted combination scheme to the recognition outputs for a final decision.
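For illustration, the class-wise decision rule above can be sketched in a few lines of NumPy. The sketch assumes the coefficient vector x_star has already been obtained by some solver for Eq (1); the names (`nuclear_norm`, `nmr_classify`) and the toy data are ours, not part of the NMR implementation.

```python
import numpy as np

def nuclear_norm(M):
    # Nuclear norm: the sum of the singular values of M.
    return np.linalg.svd(M, compute_uv=False).sum()

def nmr_classify(B, A_list, labels, x_star):
    """Classify query matrix B given training matrices A_list, their class
    labels, and a coefficient vector x_star (assumed to solve Eq (1)).
    For each class c, keep only that class's coefficients (delta_c) and
    measure the nuclear-norm residual r_c(B) = ||B - A(delta_c(x*))||_*."""
    residuals = {}
    for c in sorted(set(labels)):
        mask = np.array([lab == c for lab in labels], dtype=float)
        B_hat_c = sum(x * m * A for x, m, A in zip(x_star, mask, A_list))
        residuals[c] = nuclear_norm(B - B_hat_c)
    return min(residuals, key=residuals.get), residuals
```

The dictionary of residuals is returned alongside the predicted class so one can inspect how well each class explains the query.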

Patch-based CRC
For each patch y_i, its representation weights can be obtained by minimizing the following error:

β̂_i = argmin_{β_i} ||y_i − D_i β_i||_2^2 + λ||β_i||_2^2,

where D_i = [D_{i1}, D_{i2}, ..., D_{ic}] denotes the local dictionary located at the same position as y_i, and D_{ik} is the sub-dictionary of the k-th class. The recognition result of patch y_i is Identity(y_i) = argmin_k r_{ik} [43], where r_{ik} = ||y_i − D_{ik} β̂_{ik}||_2 / ||β̂_{ik}||_2 and β̂_{ik} is the sub-vector of coefficients associated with the k-th class. For clarity, four key components (i.e., the multi-scale trick, the local patch strategy, structural error characterization and pixel error characterization) of several related methods are compared in Table 1.
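As a minimal sketch of the per-patch CRC step (illustrative names, not the PCRC authors' code): the ridge problem has the closed-form solution β = (DᵀD + λI)⁻¹Dᵀy, after which the regularized class residuals decide the patch's identity.

```python
import numpy as np

def pcrc_patch_identity(y, D, labels, lam=0.001):
    """CRC on a single patch: solve the ridge system for beta, then pick the
    class k minimizing r_k = ||y - D_k beta_k|| / ||beta_k||."""
    n = D.shape[1]
    beta = np.linalg.solve(D.T @ D + lam * np.eye(n), D.T @ y)
    best, best_r = None, np.inf
    for c in sorted(set(labels)):
        idx = [j for j, lab in enumerate(labels) if lab == c]
        # Small epsilon guards against an all-zero class coefficient vector.
        r = np.linalg.norm(y - D[:, idx] @ beta[idx]) / (np.linalg.norm(beta[idx]) + 1e-12)
        if r < best_r:
            best, best_r = c, r
    return best
```

Note that the expensive part, the matrix (DᵀD + λI)⁻¹Dᵀ, depends only on the dictionary, so in practice it can be precomputed once per patch position and reused for every query.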

Multi-Scale Patch-Based Matrix Regression (MSPMR)

Motivation
In PCRC, each local patch matrix is first converted to a vector, and the L2-norm is then used to characterize the reconstruction error. However, the L2-norm (or L1-norm) is based on pixel values and thus ignores the structural information of the error image. The nuclear norm is the sum of all singular values of a matrix, which can also be regarded as the L1-norm of the singular value vector. Based on the above analysis, we believe that the nuclear norm is better suited to describing the structural error.
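The claim that the nuclear norm equals the L1-norm of the singular value spectrum, and that it favors structured (low-rank) errors, can be checked numerically; the stripe-versus-noise comparison below is our own toy construction, not an experiment from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((8, 8))
s = np.linalg.svd(E, compute_uv=False)
nuc = s.sum()                       # nuclear norm: sum of singular values
assert np.isclose(nuc, np.linalg.norm(s, 1))  # = l1-norm of the spectrum

# A rank-1 "structured" error (e.g., a contiguous occlusion stripe) has a much
# smaller nuclear norm than an unstructured error of equal Frobenius energy.
stripe = np.outer(np.ones(8), np.eye(8)[0])              # rank 1
noise = rng.standard_normal((8, 8))
noise *= np.linalg.norm(stripe) / np.linalg.norm(noise)  # match Frobenius norm
assert np.linalg.svd(stripe, compute_uv=False).sum() < \
       np.linalg.svd(noise, compute_uv=False).sum()
```

For a rank-1 matrix the nuclear norm coincides with the Frobenius norm, while for a full-rank matrix it is strictly larger, which is exactly why nuclear-norm penalties prefer structured errors.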

Patch-based matrix regression (PMR)
To make the model robust and efficient for face recognition with occlusion and illumination changes, matrix regression [29,30,32] was proposed, using the nuclear norm to characterize the structure of the error image. In our patch-based matrix regression, all local patches are kept in matrix form. Given a set of N local patches X_{i1}, X_{i2}, ..., X_{iN} ∈ R^{p×q} and a query patch Y_i ∈ R^{p×q} located at position i, Y_i can be represented linearly over X_{i1}, X_{i2}, ..., X_{iN}, i.e.,

Y_i = α_{i1} X_{i1} + α_{i2} X_{i2} + ... + α_{iN} X_{iN} + E_i,   (4)

where α_i = (α_{i1}, α_{i2}, ..., α_{iN})^T is the representation coefficient vector and E_i is the representation error. Generally, α_i can be determined by the following regularized model:

α̂_i = argmin_{α_i} ||Y_i − F(α_i)||_* + λ||α_i||_1,   (5)

where F(α_i) = α_{i1} X_{i1} + ... + α_{iN} X_{iN} and ||·||_* denotes the nuclear norm (the sum of the singular values) on R^{p×q}. The problem is equivalent to

min_{α_i, u_i, E_i} ||E_i||_* + λ||u_i||_1  s.t.  E_i = Y_i − F(α_i),  u_i = α_i.   (6)

Problem (6) can be solved by the alternating direction method of multipliers (ADMM), which minimizes the following augmented Lagrangian function:

L(α_i, u_i, E_i, Z_1, z_2) = ||E_i||_* + λ||u_i||_1 + ⟨Z_1, Y_i − F(α_i) − E_i⟩ + ⟨z_2, α_i − u_i⟩ + (μ/2)(||Y_i − F(α_i) − E_i||_F^2 + ||α_i − u_i||_2^2),   (7)

where Z_1 and z_2 are the Lagrange multipliers and μ > 0 is a penalty parameter. That is, each variable is updated in turn with the others fixed. The entire algorithm is briefly summarized in Algorithm 1, and mainly consists of two key steps: a soft-thresholding operator [44] and a singular value thresholding operator [45].
Based on the optimal solution α_i*, we can obtain the reconstructed image of Y_i as Ŷ_i = F(α_i*). Let δ_k : R^N → R^N be the characteristic function that selects the coefficients associated with the k-th class: for α ∈ R^N, δ_k(α) is a vector whose only nonzero entries are the entries in α associated with Class k. Using the coefficients associated with the k-th class, one can obtain the reconstruction of Y_i in Class k as Ŷ_{ik} = F(δ_k(α_i*)).

Algorithm 1. Solving problem (6) via ADMM
Input: A set of N patches X_{i1}, X_{i2}, ..., X_{iN} ∈ R^{p×q} and a query patch Y_i ∈ R^{p×q}, parameters λ and μ, and the termination condition parameter ε.
1. Fix the others and update u^{k+1} via the soft-thresholding operator [44].
2. Fix the others and update α^{k+1} by solving a regularized least squares problem.
3. Fix the others and update E^{k+1} via the singular value thresholding operator [45].
4. Update the Lagrange multipliers.
5. If the termination condition is satisfied, go to 6; otherwise, go to 1.
6. Output: the optimal coding vector α^{k+1}.
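A compact numerical sketch of Algorithm 1 follows, under our reading that the regularizer is an l1 penalty handled through the splitting u = α (hence the soft-thresholding step). The function names, the penalty μ = 1 and the iteration budget are illustrative choices, not the authors' settings.

```python
import numpy as np

def svt(M, tau):
    # Singular value thresholding: proximal operator of the nuclear norm.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def soft(v, tau):
    # Soft thresholding: proximal operator of the l1-norm.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def pmr_admm(Y, X_list, lam=0.05, mu=1.0, n_iter=200, eps=1e-8):
    """Sketch of Algorithm 1: min ||E||_* + lam*||u||_1
    s.t. E = Y - sum_j alpha_j X_j and u = alpha, via ADMM."""
    p, q = Y.shape
    X = np.stack([Xj.ravel() for Xj in X_list], axis=1)   # (p*q, N)
    y = Y.ravel()
    N = X.shape[1]
    alpha, u, z2 = np.zeros(N), np.zeros(N), np.zeros(N)
    E, Z1 = np.zeros(p * q), np.zeros(p * q)
    G = np.linalg.inv(X.T @ X + np.eye(N))   # factor once, reuse each sweep
    for _ in range(n_iter):
        # alpha-update: ridge-type least squares coupling both constraints
        alpha = G @ (X.T @ (y - E + Z1 / mu) + (u - z2 / mu))
        # u-update: soft thresholding (l1 prox)
        u = soft(alpha + z2 / mu, lam / mu)
        # E-update: singular value thresholding (nuclear-norm prox)
        E = svt((y - X @ alpha + Z1 / mu).reshape(p, q), 1.0 / mu).ravel()
        # dual updates
        r1, r2 = y - X @ alpha - E, alpha - u
        Z1 += mu * r1
        z2 += mu * r2
        if np.linalg.norm(r1) < eps and np.linalg.norm(r2) < eps:
            break
    return u   # the sparse coding vector
```

When the query patch coincides with one of the dictionary patches, the recovered coding vector should concentrate its mass on that patch.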
The corresponding class reconstruction error is defined as r_{ik} = ||Y_i − Ŷ_{ik}||_*. The recognition output z_i of query patch Y_i is then Identity(Y_i) = argmin_k r_{ik}. We can combine the classification outputs of all patches by linear weighted combination [37], a probabilistic model [40], kernel plurality [34] or majority voting [35]. In this paper, we use majority voting for the final decision.
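The majority-voting fusion over per-patch outputs is a one-liner; `majority_vote` is an illustrative helper name.

```python
from collections import Counter

def majority_vote(patch_labels):
    """Fuse the per-patch identities by majority voting. Counter.most_common
    breaks ties by first-encountered order among labels of equal count."""
    return Counter(patch_labels).most_common(1)[0][0]
```

For example, if seven patches vote 3, 5, 3, 3, 8, 5, 3, the fused identity is class 3.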

Multi-scale ensemble
From the previous introduction to PMR, we can see that the patch size plays an important role in the final performance. In addition, how to set an optimal scale in advance for various databases remains unclear. Fig 3 shows the recognition rate curves versus different training sample sizes and patch sizes on the LFW and Extended Yale B databases, respectively. From Fig 3, two observations can be made. First, the optimal patch size varies greatly between databases. Second, the optimal patch size also varies considerably with the training sample size per person. To tackle these difficulties, the recognition outputs of multi-scale PMR can be fused optimally, so that the complementary information from different scales is fully applied to further enhance the recognition performance. Motivated by [36], we incorporate an ensemble learning scheme into our method to integrate the multi-scale outputs.
The diagram of the proposed multi-scale PMR is shown in Fig 4. In the following text, we first formulate the multi-scale ensemble problem, and then introduce a margin distribution optimization to obtain the optimal solution.
Problem formulation. Suppose that we have two scales and two sample classes labeled +1 and -1. For any query sample, we can obtain its classification result +1 or -1 on each scale.
Definition 1 [36]: Let S = {(x_1, z_1), (x_2, z_2), ..., (x_n, z_n)} denote the sample set, where z_i is the label of the sample x_i. Definition 2 [36]: For a query sample x_i ∈ S, let f_j(x_i) (j = 1, 2, ..., s) be the classification results on the s different scales. Then the ensemble margin of x_i is denoted as

ε(x_i) = z_i Σ_{j=1}^{s} w_j f_j(x_i),

where w_j is the fusion weight of the j-th scale. The ensemble loss of x_i can be denoted as [36]

l(x_i) = (1 − ε(x_i))^2,

where ε(x_i) is the ensemble margin of sample x_i. The square loss applied in CRC [18], SRC [15] and least squares regression [16] can be used here to evaluate the ensemble loss. For a sample set S, its ensemble square loss can be formulated as

||e_1 − Dw||_2^2,   (13)

where e_1 is a column vector whose entries are 1, w = (w_1, w_2, ..., w_s)^T is the scale weight vector, and D is the decision matrix whose (i, j)-th entry is z_i f_j(x_i).
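Under the definitions above, the margin and the ensemble square loss are straightforward to compute; building the decision matrix as D_ij = z_i f_j(x_i) is our reading of the reconstruction, and the names are illustrative.

```python
import numpy as np

def ensemble_margin(z, f, w):
    # z: true label (+1/-1); f: per-scale outputs (+1/-1); w: scale weights.
    return z * float(np.dot(w, f))

def ensemble_square_loss(labels, F, w):
    """||e_1 - D w||^2 with decision matrix D_ij = z_i * f_j(x_i).
    The loss is zero iff every sample's ensemble margin is exactly 1."""
    D = labels[:, None] * F            # (n_samples, n_scales)
    e1 = np.ones(len(labels))
    return float(np.sum((e1 - D @ w) ** 2))
```

A sample classified correctly on every scale contributes margin 1 (given Σ w_j = 1) and therefore zero loss; disagreeing scales pull the margin below 1 and are penalized quadratically.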
Algorithm of MSPMR. To obtain the optimal scale fusion weights, the ensemble square loss in Eq (13) should be minimized. Nevertheless, the solutions of this linear system may be non-unique. Intuitively, we should impose a constraint on the objective function in Eq (13) to make the solution unique and stable. Moreover, Shawe-Taylor [46] provided a bound on the generalization error and pointed out that both the norm of w and the ensemble square loss should be optimized simultaneously to enhance the generalization ability.
As in [36], the following constrained l1-regularized least squares optimization can be used to obtain the optimal scale weights [47]:

ŵ = argmin_w ||e_1 − Dw||_2^2 + τ||w||_1  s.t.  Σ_{j=1}^{s} w_j = 1,

where τ is a regularization parameter and the regularization term helps to achieve a stable solution. The constraint Σ_{j=1}^{s} w_j = 1 can be converted to e_2 w = 1, where e_2 = [1, 1, ..., 1] is a row vector of length s. Then we have ||e_1 − Dw||_2^2 + ||1 − e_2 w||_2^2 = ||ê − D̂w||_2^2, where ê = [e_1; 1] and D̂ = [D; e_2]; thus [36]

ŵ = argmin_w ||ê − D̂w||_2^2 + τ||w||_1.

Due to the fact that the decision matrix is usually very small, the scale fusion weights w can be obtained by commonly used l1-minimization solvers; in our method, l1_ls [48] is employed. Based on the above description, the proposed multi-scale PMR (MSPMR) scheme is summarized in Algorithm 2. Once the optimal scale fusion weights are obtained, the recognition output for an arbitrary sample is the class that receives the largest weighted vote over the s scale outputs.
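Since the l1_ls solver [48] may not be at hand, the following sketch solves the same l1-regularized least squares problem over the stacked system (ê, D̂) with plain ISTA; this is a stand-in solver under our assumptions, not the one used in the paper.

```python
import numpy as np

def scale_weights(e_hat, D_hat, tau=0.1, step=None, n_iter=2000):
    """ISTA for w = argmin ||e_hat - D_hat w||^2 + tau * ||w||_1."""
    n = D_hat.shape[1]
    w = np.zeros(n)
    if step is None:
        # Step 1/L, with L = 2 * sigma_max(D_hat)^2 the gradient's
        # Lipschitz constant for the smooth squared-error term.
        step = 1.0 / (2 * np.linalg.norm(D_hat, 2) ** 2)
    for _ in range(n_iter):
        grad = 2 * D_hat.T @ (D_hat @ w - e_hat)
        v = w - step * grad
        # Soft thresholding enforces the l1 penalty.
        w = np.sign(v) * np.maximum(np.abs(v) - step * tau, 0.0)
    return w
```

In the intended use, D̂ stacks the decision matrix D on top of the all-ones row e_2, and ê stacks e_1 on top of the scalar 1, so the sum-to-one constraint is absorbed as one extra least squares row.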

Computational complexity
In this subsection, we will evaluate the computational complexity of the proposed method.
Since the multi-scale fusion weights can be learned off-line, we only discuss the computational complexity of the on-line recognition process of the proposed method. As illustrated in Algorithm 2, the proposed face recognition method spends most of its cost on the patch-based matrix regression process. Four factors affect this cost: the training sample size N, the dimension of one patch m = p×q, the number of iterations k in Algorithm 1, and the number of patches per image M. As described in [30], the matrix regression of each patch costs O(k(m^1.5 + mN + N^2)) (in the case that p = q). For M image patches, the computational cost is O(k(m^1.5 + mN + N^2)M). In addition, the scale number s also affects the final running time. Therefore, the computational cost of the proposed method is about O(sk(m^1.5 + mN + N^2)M). In Section 4, we will further compare the proposed method with the state-of-the-art approaches in terms of CPU runtime.

Experimental Results and Discussion
In this section, we conduct experiments on benchmark face databases and compare the proposed method with state-of-the-art models. For each method, we perform 20 test runs on each database, and the average recognition rates and the corresponding standard deviations are reported. As in [36], seven scales are adopted in our MSPMR, with patch sizes of 4×4, 6×6, 8×8, 10×10, 12×12, 14×14 and 16×16. In the single scale-based PNN, PSRC, PCRC and PMR, the patch size is 10×10, and the patches overlap with their neighbors by 5 pixels. For PMR and MSPMR, we choose the optimal λ ∈ [0.01, 0.1]. Parameter τ is set to 0.1 for MSPMR. It should be mentioned that all experiments are conducted on the original face images, without any feature extraction or image preprocessing step.
As in [36], to learn the optimal scale weights, the training set is divided into subset1 (one image per person) and subset2 (the remainder of the training set). Then, PMR is used to classify the samples from subset1 using subset2 as the gallery set, and the optimal weights on the seven scales can be learned. It should be noted that at least two samples per person are required to find the optimal scale fusion weights.
Extended Yale B database. The first experiment was conducted on the Extended Yale B database, which includes 38 subjects in 9 poses under 64 illumination conditions [49]. The 64 images of a subject in a particular pose are acquired at a camera frame rate of 30 frames per second, so the variations in head pose and facial expression are small. All the frontal images marked with P00 are utilized in this experiment, and each is reshaped to 32×32. Some examples are shown in Fig 5. For each subject, 2~5 samples are randomly selected from the first 32 images for training, and another 5 samples are randomly chosen from the remaining 32 images for testing. Table 2 tabulates the experimental results.
It can easily be seen that MSPMR obtains the best recognition performance for all tests. Compared with PCRC and MSPCRC, PMR and MSPMR lead to much better results, thus verifying the effectiveness of characterizing the reconstruction error by the nuclear norm.
AR database. The AR database [50] gathers over 4,000 color face images from 126 subjects, containing frontal facial images with different lighting conditions, facial expressions and occlusions. Pictures of 120 subjects were taken in two sessions (separated by two weeks), each yielding 13 color images per subject. As in [18], in this experiment we choose a subset with only illumination and expression changes, which includes 50 male and 50 female subjects. Fourteen face images (seven from each session) of each of these 100 individuals are selected and used. For each subject, 2~5 samples from session 1 are randomly chosen for training, and another 3 samples from session 2 are randomly chosen for testing. All the images are manually cropped and then resized to 32×32 pixels. Some sample images of one person are presented in Fig 6. The recognition results of the different methods are listed in Table 3. The proposed methods always achieve better performance than the other methods. We can observe that on the AR database, the multi-scale ensemble learning in MSPMR leads to limited improvement over PMR. As described in [36], the reason may be that in this database the average weight value for the 10×10 scale is approximately 0.9, indicating that a patch size of 10×10 is a proper choice for PMR on the AR database.
LFW database. Labeled Faces in the Wild (LFW) [43] is a large-scale database of face photographs designed for unconstrained face recognition with variations in pose, illumination, expression, misalignment and occlusion; it contains images of 5,749 subjects. LFW-a is an extension of LFW in which commercial face alignment software has been applied. As in [36], the subjects with more than ten samples are gathered to form a dataset of 158 subjects from LFW-a. All the images are manually cropped and then resized to 32×32 pixels. Fig 7 shows some sample images from this database. For each subject, we randomly choose 2~5 samples for training and another 2 samples for testing. Table 4 shows the face recognition results of each method on the LFW dataset. From Table 4, we can clearly see that the performance of our PMR and MSPMR is superior to that of all the other methods. Meanwhile, the recognition performance is greatly improved by MSPMR.

Face recognition with occlusion
In the following experiments, we evaluate the robustness and effectiveness of the proposed method when face images are subject to different occlusions, such as real disguise or block occlusion. In this subsection, our method is compared with CRC [18], SRC [15], NSC [30], HQ_A and HQ_M [26], PSRC [15], PCRC and MSPCRC [36].
Face recognition with real disguise. As in [29,32], a subset of the AR face database is applied, containing 50 males and 50 females. Each face image is manually cropped and normalized to a size of 42×30. Fig 8 shows the sample images for one person from the AR database. In our experiment, for each individual, the first four images (with various facial expressions) from session 1 and session 2 are chosen to form the training set. Two image sets with sunglasses and scarves are used for testing, each of which includes 600 images (three images per session of each individual). For each individual, 2~5 samples are randomly chosen from the training set and another 3 samples from the testing set to evaluate the performance of each method.
The recognition results of each method are shown in Tables 5 and 6, from which we can see that the patch-based methods achieve better performance than the corresponding holistic ones. PMR also gives better results than PCRC or MSPCRC. MSPMR obtains the best performance among all the competing methods on the test images with sunglasses and achieves comparable results on the test images with scarves.
Face recognition with block occlusions. In this subsection, we evaluate the robustness of our method against block occlusions. We adopt Subsets 1 and 2 of the Extended Yale B database for training and Subset 3 for testing. All the face images are normalized to 48×42 pixels. The testing images are corrupted by a randomly located square block of a "baboon" image with an occlusion level of 40%. Fig 9 shows the training and testing sample images for one person from the Extended Yale B database. For each individual, 2~5 samples are randomly chosen from the training set and another 5 samples from the testing set to evaluate the performance of each method.
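The block-occlusion protocol can be sketched as pasting a randomly located square covering a given fraction of the image area. Here a random block stands in for the "baboon" patch, and `occlude` is an illustrative helper, not the paper's test harness.

```python
import numpy as np

def occlude(img, ratio=0.4, block=None, rng=None):
    """Paste a square occluder covering `ratio` of the image area at a
    random location. The side length is chosen so that side^2 is roughly
    ratio * h * w (assumes the square fits inside the image)."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    side = int(round(np.sqrt(ratio * h * w)))
    if block is None:
        block = rng.random((side, side))   # stand-in for the baboon patch
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    out = img.copy()
    out[top:top + side, left:left + side] = block[:side, :side]
    return out
```

For a 48×42 image and ratio 0.4, the occluder is a 28×28 square, covering roughly 39% of the pixels.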
The face recognition results of each method are tabulated in Table 7. We can see that by characterizing the reconstruction error with the nuclear norm, NSC overall outperforms CRC, SRC, HQ_A and HQ_M. By virtue of the patch trick, our PMR always outperforms PCRC and PSRC. By incorporating the multi-scale ensemble learning trick, the proposed MSPMR achieves the best performance among all the competing methods.

Parameter discussion
In this subsection, we mainly discuss how the regularization parameter λ affects the performance of our PMR and MSPMR in different face recognition scenarios. The experimental settings are the same as in the aforementioned experiments in Sections 4.1 and 4.2, except that the number of training samples per person is fixed at 3. Fig 10 plots the recognition results of PMR and MSPMR versus the regularization parameter λ on the different face image databases. We can observe that PMR and MSPMR always achieve their optimal or nearly optimal performance in the range [0.01, 0.1]. Thus, we can set the regularization parameter of the proposed method within this range for real-world scenarios.

Running time comparisons
In this subsection, the CPU runtime of the proposed method is compared with that of the state-of-the-art methods. The comparison results on the AR face database, testing with scarves, are listed below.

Evaluation of the experimental results
The aforementioned experimental results have shown that the proposed method always obtains better performance than several state-of-the-art methods. However, is this superiority statistically significant? In this subsection, we assess the experimental results with a null hypothesis statistical test [51]. If the evaluated p-value is under the desired significance level (i.e., 0.05), the performance difference between the compared approaches is deemed statistically significant. The evaluation results are summarized as follows: 1. For face recognition without occlusion, such as on the LFW database, MSPMR outperforms MSPCRC significantly for all tests (p = 0.014, 0.013, 0.016 and 0.020). On the other databases, although MSPMR performs better than the other state-of-the-art methods, the performance discrepancies between MSPMR and the other approaches are not statistically significant.
2. For face recognition with occlusion, MSPMR performs significantly better than the other approaches in the cases of real disguise and block occlusion (p < 0.001).

Conclusions and Future Work
To improve the performance of matrix regression in the face of the small sample size problem while preserving its desirable behavior in the presence of occlusion and illumination changes, in this paper we proposed a patch-based matrix regression (PMR) method. PMR first performs matrix regression on each raw patch (without matrix-to-vector conversion) and then combines the recognition outputs of all patches by majority voting. However, it is difficult to pre-define an optimal patch size across different databases. Fortunately, the complementary information across multiple patch scales can be exploited to further enhance the recognition performance. To this end, we proposed the multi-scale version of PMR, i.e., MSPMR, to optimally combine the multi-scale outputs. Our extensive experimental results have demonstrated that the proposed methods are more effective and robust than the state-of-the-art methods. Although our proposed method has achieved good performance, there are still many issues to be addressed in the future. Generally, two main improvements can be made to our method. (1) With the development of storage devices, large numbers of images can be collected in real-world applications; one challenge for our method is its expensive computational cost. We will try to design more efficient matrix regression algorithms to further improve the practicality of our method. (2) In our method, we have to predefine several specific scale sizes in advance, whereas different databases may exhibit scale transformations in real-world applications. We can borrow the idea of scale-selective local binary patterns [52] to design an adaptive scale selection strategy to further improve the flexibility of our method.

Ethics Statement
Some face image datasets were used in this paper to verify the performance of our methods. These face image datasets are publicly available for face recognition research, and the consent was not needed. The face images and the experimental results are reported in this paper without any commercial purpose.

Author Contributions
Conceived and designed the experiments: GWG JY XYJ.