Face Recognition with Multi-Resolution Spectral Feature Images

The one-sample-per-person problem has become an active research topic for face recognition in recent years because of its challenges and its significance for real-world applications. However, achieving high recognition accuracy remains difficult because too few training samples are usually available and because of variations in illumination and expression. To alleviate the negative effects of these unfavorable factors, in this paper we propose a more accurate spectral feature image-based 2DLDA (two-dimensional linear discriminant analysis) ensemble algorithm for face recognition with one sample image per person. In our algorithm, multi-resolution spectral feature images are constructed to represent the face images; this can greatly enlarge the training set. The proposed method is inspired by our finding that, among these spectral feature images, features extracted from some orientations and scales using 2DLDA are not sensitive to variations of illumination and expression. In order to maintain the positive characteristics of these filters and to make correct category assignments, the strategy of classifier committee learning (CCL) is designed to combine the results obtained from different spectral feature images. Using the above strategies, the negative effects caused by those unfavorable factors can be alleviated efficiently in face recognition. Experimental results on the standard databases demonstrate the feasibility and efficiency of the proposed method.


Introduction
Over the past decades, face recognition technology has become one of the most important biometric fields [1,2]. Due to its relatively high recognition accuracy and low intrusiveness, it has been widely applied in various scenarios, such as information security, law enforcement, and surveillance. Many algorithms have been developed to address various problems in face recognition [3][4][5], such as expression variation, pose variation, 3D face recognition, multi-modal 2D+3D face recognition, and multi-biometric feature fusion.
In recent years, face recognition for the one-sample-per-person problem has attracted many researchers to this research branch. There are two main reasons for this. On the one hand, this problem is very common in some existing application scenarios, such as law enforcement, driver's license, passport and identity card identification, where only a single frontal-view image per person is available. Therefore, it is necessary to develop some more efficient and effective algorithms to make face recognition techniques applicable to these situations. On the other hand, storing only one sample per person in a database can very effectively reduce the costs of sample collection, storage and computation [6].
Different approaches have been proposed for the one-sample-per-person face recognition problem [6,7]. Principal component analysis (PCA) is a widely used statistical signal processing technique [8,9]. Various extensions of PCA have been proposed to solve the one-sample-per-person problem [10][11][12]. Instead of using global features, a representation extracted from patches is proposed in [13] for face recognition with a single exemplar image per person. A prominent advantage of local representations is their fair robustness to variations in lighting, expression and occlusion. Multiple-feature fusion is also an effective approach for the one-sample-per-person face recognition problem. A combination of the frequency invariant features and the moment invariant features [14], and a fusion of the directionality of edges and the intensity facial features [15], have been proposed for face recognition with a single training sample. Instead of using a 2D representation, a 3D model-based method is another important approach to the one-sample-per-person face recognition problem. In [7], a good review of state-of-the-art 3D facial reconstruction methods [16,17] for face recognition based on a single 2D training image per person is provided. Generally speaking, a common approach to the one-sample-per-person face recognition problem is to enlarge the training set by constructing new representations [18][19][20] or by generating novel views [21].
Linear discriminant analysis (LDA) is a well-known technique for feature extraction and dimensionality reduction that has been used widely in numerous applications. To overcome the so-called singularity problem, a new type of LDA, called two-dimensional LDA (2DLDA), has been proposed and applied to image recognition in recent years [22][23][24]. Compared to classical LDA, an obvious difference with 2DLDA is that the data are represented in matrix form instead of vector form. 2DLDA and its variants have attracted much attention in the past several years because of their advantages in dealing with the singularity problem and in computational cost. Although 2DLDA represents data in matrix form, it cannot be directly applied to the one-sample-per-person problem because the within-class scatter matrix is a zero matrix, which makes it unstable. In [25], the difference between the original image and the reconstructed image obtained using singular value decomposition (SVD) was found to reflect, to an extent, the variations in the within-class images. Therefore, the original image and the reconstructed image, instead of the training images only, are used together to compute the within-class scatter matrix and the between-class scatter matrix. The discriminant feature obtained by 2DLDA has been demonstrated to be superior to some existing methods [10,26,27].
Information in the frequency domain is useful in image classification. In [28], a global feature of a scene, named "spatial envelope", is proposed by exploring the dominant spatial structure of a scene. For this global feature, the global energy spectrum is used to develop spectral signatures for each scene category. To capture the textural characteristics of the image in the frequency domain, a variant of the global energy feature is presented further in [29], which explores the statistics of the co-occurrence matrix.
Although the spectral feature is specially designed for scene classification, in this paper we present a spectral representation of face images and apply this representation to the one-sample-per-person problem. One issue with the one-sample-per-person problem is that too few training samples are available. In this paper, multi-resolution spectral images are extracted and used as representations of training face images by means of a method similar to [28], thereby enlarging the size of the training set greatly. We find that, among these spectral feature images, features extracted from some specific orientations and scales using 2DLDA are not sensitive to variations of illumination and expression. Inspired by this finding, in our algorithm the spectral features are used as a robust representation of faces. As we do not know exactly which orientations and scales are robust for all testing images, an alternative approach is to use all of these filters in the decision-making process. In our method, each of the filters forms one weak classifier. The strategy of classifier committee learning (CCL) is designed further to combine the results obtained from different spectral feature images to determine the classes of the testing images. With the strategy of CCL, on the one hand, most of the correct categorizations can be retained. On the other hand, it is not necessary for us to choose the optimal filters, which is a very difficult task for the one-sample-per-person problem. Using the above strategies, the negative effects caused by unfavorable factors, such as variations of illumination and facial expression, can be alleviated greatly in face recognition. Experimental results on some standard databases demonstrate the feasibility and efficiency of the proposed method. Fig. 1 shows the flowchart of our multi-resolution spectral feature image-based 2DLDA ensemble algorithm.
There are three main parts to the proposed method: spectral feature image extraction, discriminant feature extraction, and the combination of weak classifiers. A detailed description of each of these three parts is presented in the following subsections.

Spectral Feature Image Representation
Assume that there are $C$ training images $I_i$ ($i = 1, \ldots, C$) of size $m \times n$, and that each belongs to one subject. We first extract the spectral feature images of each training image. The image is first pre-filtered to reduce the effect of illumination, using a local normalization of the intensity variance as follows [28]: where $I(x,y)$ and $I'(x,y)$ are the pixel intensities before and after pre-filtering, respectively, $G(x,y)$ is an isotropic low-pass Gaussian spatial filter with a radial cut-off frequency at 0.015 cycles/pixel, and $h(x,y) = 1 - G(x,y)$. The constant term in this normalization helps suppress noise in low-frequency regions. Next, a set of Gabor filters with $n_s$ scales and $n_o$ orientations is applied to the Fourier transform of the pre-filtered image [28]: Finally, the amplitude of the resulting image is computed as the spectral feature image. As a result, for the given $N_f$ (i.e., $n_s \times n_o$) filters, $N_f$ spectral feature images can be obtained for each training sample. Given the filter shown in Fig. 2
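The extraction steps above can be sketched in Python with NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the normalization is assumed to have the form $I' = (I * h) / (\epsilon + \sqrt{(I * h)^2 * G})$ usually associated with [28], the Gaussian is assumed to fall to half amplitude at the cut-off frequency, and the function names, the value of `eps`, and the Gabor centre frequencies and bandwidths are hypothetical choices.

```python
import numpy as np

def frequency_grid(m, n):
    """Radial frequency (cycles/pixel) and angle for an m x n FFT grid."""
    fy = np.fft.fftfreq(m)[:, None]
    fx = np.fft.fftfreq(n)[None, :]
    return np.hypot(fx, fy), np.arctan2(fy, fx)

def prefilter(img, cutoff=0.015, eps=0.2):
    """Local normalization of intensity variance (assumed form, see lead-in)."""
    radius, _ = frequency_grid(*img.shape)
    # Gaussian low-pass transfer function G; half amplitude at `cutoff` (assumption).
    lowpass = np.exp(-np.log(2.0) * (radius / cutoff) ** 2)
    # I * h with h = 1 - G, computed in the frequency domain.
    high = np.real(np.fft.ifft2(np.fft.fft2(img) * (1.0 - lowpass)))
    # Local standard deviation: sqrt((I * h)^2 * G).
    local_var = np.real(np.fft.ifft2(np.fft.fft2(high ** 2) * lowpass))
    return high / (eps + np.sqrt(np.abs(local_var)))

def gabor_bank(shape, orientations=(8, 8, 4), f_max=0.25):
    """Frequency-domain Gabor-like transfer functions, one per scale/orientation.

    Centre frequencies and bandwidths are hypothetical illustration values.
    """
    radius, angle = frequency_grid(*shape)
    bank = []
    for scale, n_o in enumerate(orientations):
        f0 = f_max / (2 ** scale)
        radial = np.exp(-((radius - f0) ** 2) / (2 * (0.5 * f0) ** 2))
        sigma_theta = np.pi / n_o
        for k in range(n_o):
            theta = np.pi * k / n_o
            # Wrap the angular difference into (-pi, pi].
            d = np.angle(np.exp(1j * (angle - theta)))
            bank.append(radial * np.exp(-(d ** 2) / (2 * sigma_theta ** 2)))
    return np.stack(bank)

def spectral_feature_images(img, bank):
    """Amplitude of the pre-filtered image after each filter is applied."""
    spectrum = np.fft.fft2(prefilter(img))
    return np.abs(np.fft.ifft2(spectrum[None, :, :] * bank))
```

With the orientation setting [8 8 4], `gabor_bank` returns 20 transfer functions, so a single 32×32 face image yields an array of shape (20, 32, 32).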

Discriminant Feature Extraction
Having generated the spectral feature images based on the $N_f$ Gabor filters for all the training face images, we can obtain $N_f$ optimal projection subspaces via 2DLDA [25]. Subsequently, $N_f$ sets of discriminant features can be derived by projecting the feature images onto the optimal projection subspaces. Denote $F_{ij}$ ($i = 1, \ldots, C$; $j = 1, \ldots, N_f$) as the spectral feature image of the training image $I_i$ obtained by using the $j$th filter. The unitary matrices $U$, $V$ and the diagonal matrix $S$ constitute the SVD of $F_{ij}$, i.e.,

$$F_{ij} = U S V^{T}.$$

If the first $k1$ SVD basis images are used, the corresponding reconstructed feature image can be given as follows:

$$\hat{F}_{ij} = \sum_{l=1}^{k1} s_l u_l v_l^{T},$$

where the singular values $s_l$ are the diagonal elements of $S$, and $u_l$ and $v_l$ are the $l$th columns of $U$ and $V$, respectively. Given the spectral feature images $F_{ij}$ and the reconstructed feature images $\hat{F}_{ij}$, the mean feature image $\bar{F}_{ij}$ and the global mean $\bar{F}_{j}$ of the $j$th 2DLDA are defined as follows:

$$\bar{F}_{ij} = \frac{1}{2}\left(F_{ij} + \hat{F}_{ij}\right)$$

and

$$\bar{F}_{j} = \frac{1}{C} \sum_{i=1}^{C} \bar{F}_{ij}.$$

Then, the between-class matrix $S_b^j$ and the within-class scatter matrix $S_w^j$ can be computed as follows: Denote $w_i$ as the eigenvectors of the following generalized eigenvalue problem:

$$S_b^j w_i = \lambda_i S_w^j w_i,$$

where $\lambda_i$ are the eigenvalues. The optimal projection matrix $W_j$ is composed of the eigenvectors associated with the first $k2$ largest eigenvalues, i.e.,

$$W_j = [w_1, w_2, \ldots, w_{k2}],$$

which maximize the following criterion: The discriminant features $Z_{ij}$ can be computed by projecting the spectral feature image $F_{ij}$ onto the subspace spanned by $W_j$, i.e., As a result, we can obtain $N_f$ discriminant features
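The following NumPy sketch illustrates this step for a single filter. It is a hedged illustration, not the method of [25] verbatim: the scatter matrices are assumed to take the right-sided 2DLDA form $S_b = \sum_i (\bar{F}_{i} - \bar{F})^T (\bar{F}_{i} - \bar{F})$ and $S_w = \sum_i [(F_i - \bar{F}_i)^T (F_i - \bar{F}_i) + (\hat{F}_i - \bar{F}_i)^T (\hat{F}_i - \bar{F}_i)]$, the projection is assumed to be $Z = F W$, and the small `ridge` term and all function names are hypothetical additions made so that the example runs.

```python
import numpy as np

def svd_reconstruct(F, k1):
    """Reconstruct F from its first k1 SVD basis images."""
    U, s, Vt = np.linalg.svd(F, full_matrices=False)
    return (U[:, :k1] * s[:k1]) @ Vt[:k1, :]

def fit_2dlda_single_sample(features, k1=5, k2=20, ridge=1e-6):
    """Projection matrix for one filter.

    features: array of shape (C, m, n), one spectral feature image per class.
    Returns W with shape (n, min(k2, n)).
    """
    recon = np.stack([svd_reconstruct(F, k1) for F in features])
    class_mean = 0.5 * (features + recon)      # mean of original and reconstruction
    global_mean = class_mean.mean(axis=0)
    n = features.shape[2]
    Sb = np.zeros((n, n))
    Sw = np.zeros((n, n))
    for F, R, M in zip(features, recon, class_mean):
        d = M - global_mean
        Sb += d.T @ d
        Sw += (F - M).T @ (F - M) + (R - M).T @ (R - M)
    Sw += ridge * np.eye(n)                    # keeps Sw positive definite (assumption)
    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    evals, vecs = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(evals)[::-1][:k2]       # largest eigenvalues first
    return L_inv.T @ vecs[:, order]

def project(F, W):
    """Discriminant feature of a spectral feature image (assumed form Z = F W)."""
    return F @ W
```

In the full algorithm this fit would be repeated once per filter, giving $N_f$ projection matrices and $N_f$ discriminant features per image.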

Combining the Weaker Classifiers
From Fig. 2, we cannot directly observe whether an extracted spectral feature image is sensitive to the variations of illumination or expression. As an alternative, we investigate the sensitivity by checking the predicted labels of test samples for different filters. Given a test image $I_t$, we first extract its spectral feature images $F_{tj}$ ($j = 1, \ldots, N_f$), and then compute the discriminant features $Z_{tj}$ using (12). With the discriminant features of the training images $Z_{ij}$ and that of the test image $Z_{tj}$, the nearest-neighbor classifier is used here to assign a class label to $I_t$. The test image $I_t$ belongs to the $k$th class if, among all training features, $Z_{kj}$ is the one nearest to $Z_{tj}$. For a subject with different expression, illumination and occlusion, Table 1 shows the predicted labels of test samples for different filters when the first image is used as the training sample. We see that the labels can be predicted correctly for the features extracted from spectral feature images at particular scales and orientations. Unfortunately, these orientations and scales are not consistent across the different test samples. That is to say, we cannot predict which scales and orientations are insensitive to variations of illumination and expression for different face images.
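A per-filter nearest-neighbor decision of this kind can be sketched as follows. The Frobenius norm is used as the distance here, which is an assumption; the function name and the array layout are likewise hypothetical.

```python
import numpy as np

def nearest_neighbor_label(Z_test, Z_train):
    """Index of the training class whose feature is closest to Z_test.

    Z_train: array of shape (C, p, q), one discriminant feature per class
    for a single filter. Z_test: array of shape (p, q).
    """
    # Frobenius distance between the test feature and each training feature.
    dists = np.linalg.norm(Z_train - Z_test[None, :, :], axis=(1, 2))
    return int(np.argmin(dists))
```

Calling this once per filter gives $N_f$ predicted labels for a test image, which is the kind of per-filter output tabulated in Table 1.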
Since we cannot select the optimal scales and orientations, an alternative approach is to use all of these filters in the decision process. We construct one weaker classifier for each filter, as shown in (13). As a result, for each test sample, $N_f$ weaker classifiers are formed by means of the spectral features extracted via the $N_f$ filters. Finally, a classifier-combination strategy is adopted to determine the class label of the test image. The max rule, min rule, median rule, and majority-vote rule are commonly used classifier-combination strategies [30]. As the outputs of the weaker classifiers are the class labels of the test images, the majority-vote rule is the most suitable strategy to combine these outputs. To count the votes received from the weaker classifiers, a binary-valued vector is defined. If $I_t$ belongs to the $k$th class, the class label vector of $I_t$ obtained via the $j$th weaker classifier can be given as follows: It can be seen from (14) that one binary-valued vector can be obtained for each weaker classifier. Further, we sum these vectors to obtain the number of votes for each class as follows: Each element $L^t_i$ of the summed vector denotes the number of votes received by class $i$. The test sample $I_t$ belongs to the class with the maximum number of votes. For example, the label of $I_t$ is $q$ if the $q$th element is the maximum.
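The vote counting described above can be illustrated with a short sketch. The binary vector is assumed to be a one-hot vector with a 1 in the position of the predicted class; the source does not say how ties are broken, so the lowest class index winning a tie is a choice made here, and the function name is hypothetical.

```python
import numpy as np

def majority_vote(labels, num_classes):
    """Combine per-filter labels by summing one-hot class label vectors.

    labels: iterable of class indices, one per weaker classifier.
    Returns (winning class index, vote counts per class).
    """
    votes = np.zeros(num_classes, dtype=int)
    for k in labels:
        one_hot = np.zeros(num_classes, dtype=int)
        one_hot[k] = 1            # binary-valued class label vector
        votes += one_hot
    # np.argmax returns the lowest index on ties (tie rule is an assumption).
    return int(np.argmax(votes)), votes

winner, counts = majority_vote([2, 2, 0, 1, 2], num_classes=4)
print(winner, counts.tolist())    # 2 [1, 1, 3, 0]
```

A test image is therefore labeled correctly whenever the correct class collects more votes than any other class, even if many individual filters are wrong.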
The labels (L(MR_2DLDA)) determined via the majority-vote rule are tabulated in Table 1. As a comparison, the labels (L(2DLDA)) predicted through 2DLDA are also shown in Table 1. It can be seen that, compared to 2DLDA, more labels are predicted correctly using the proposed method. There are two main reasons for this. On the one hand, the spectral features extracted on some scales and orientations are not sensitive to variations of illumination and expression. As shown in Table 1, 2DLDA cannot predict the labels correctly for some test samples, while these labels are correctly assigned by some weaker classifiers. This makes it possible to predict the labels of these test samples correctly. On the other hand, although we have no way of choosing the optimal filters, as discussed previously, the majority-vote rule can find the correct class of a test sample when the spectral features extracted on a large percentage of scales and orientations are not sensitive to variations of illumination and expression. Certainly, we can also see that not all the labels of the test images are predicted correctly using our proposed method. Therefore, the strategies adopted in the proposed method can only alleviate the negative effects caused by variations of illumination and expression to some extent.

Databases and Experiment Set-Up
We evaluate the performance of our proposed method on seven standard databases: Yale face database [31], ORL face database [32], Extended Yale Face database B, PIE database, FERET face database, AR database, and LFWA database [33,34].
The Yale face database contains 165 grayscale images of 15 individuals. Each individual has 11 images that differ in expression (happy, normal, sad, sleepy, surprised, and winking), in lighting conditions (left-light, center-light, right-light), and in facial details (with/without glasses) [31]. LFWA is a database of face photographs designed for studying the problem of unconstrained face recognition. All the face images have been aligned via commercial face-alignment software by the provider. Further, like other datasets, the face images are manually cropped to remove the backgrounds. As the whole database is fairly large, and the face images are manually cropped, we have selected only a subset from the database in the experiments. Those individuals with 30 or more images were selected for the experiments. In the selected dataset, there are 34 individuals in total. Except for the LFWA database, all the face images were manually aligned and cropped by other researchers. To investigate the influence of the image size on the recognition performance, we use both the Yale face database and the ORL face database, which have two sets of data with different image sizes. From the Yale face database, the images of size 32×32 (denoted as Yale_32×32) and 64×64 (denoted as Yale_64×64) were used in the experiments. For the ORL face database, the images of size 112×92 (denoted as ORL_112×92) and 32×32 (denoted as ORL_32×32) were utilized in the evaluation. The Extended Yale Face database B, the PIE database, the FERET face database, and the AR database have far larger numbers of face images than the other two databases. Therefore, they can be used to investigate the performance of the algorithms on large databases. The image sizes of the Extended Yale Face database B, the PIE database, the FERET face database, and the AR database are 32×32, 32×32, 40×40, and 165×120, respectively (denoted as YaleB_32×32, PIE_32×32, FERET_40×40, and AR_165×120, respectively). The above datasets are publicly available from [32,34,37].

Experiments
To verify the performance of our proposed method (denoted as MR_2DLDA), we compare it to four other face recognition methods designed for the one-sample-per-person problem. The four methods are the E(PC)²A method [10], the block-based Fisher LDA method (denoted as BFLDA) [26], the generalized eigenface method (denoted as GE) [27], and 2DLDA [25].
As in [38], three scales are employed for the filter transfer functions. The respective numbers of orientations for the three scales (NOS) are set at 8, 8, and 4 (denoted as [8 8 4]). Our experiments have shown that a satisfactory performance can generally be achieved when the parameters of the filter transfer functions are set around the values suggested in [38]. Table 2 shows the recognition rates of MR_2DLDA on the datasets Yale_32×32 and ORL_32×32 when the first image of each subject is used as the training sample and different numbers of Gabor filters are used. As only one training sample is available for each subject, traditional parameter-selection methods, such as cross validation, cannot be used to choose the optimal parameters. It can be seen from Table 2 that the classification results are close for different parameters, i.e., parameter variation around [8 8 4] has only a slight influence on the classification performance. Therefore, for simplicity, in all the following experiments, the respective numbers of orientations for the three scales are set at 9, 8, and 4 (denoted as [9 8 4]), i.e., twenty-one filters are used in our proposed method. Similar results can be obtained when other parameters are adopted in the experiments. It is also difficult to find the optimal values for the parameters k1 and k2 in the one-sample problem. Taking the dataset Yale_32×32 as an example, Fig. 3(a) shows the recognition rates when both k1 and k2 are varied. Fig. 3(b) shows the results when k1 is fixed at 5 and k2 is varied. It can be seen that the performance becomes stable when k2 is larger than 15. Fig. 3(c) shows the results when k2 is fixed at 20 and k1 is varied. We can see that the recognition rate increases gradually as k1 increases from 1 to 5, and then decreases with some fluctuations as k1 increases from 6 to 31. Fig. 4 shows the corresponding results based on the dataset ORL_32×32.
We can conclude that a better result is obtained when k1 and k2 are set at 5 and 20, respectively. Since we are unable to identify the optimal values of the parameters k1 and k2 via the parameter-selection methods, in all the following experiments, k1 and k2 are set at 5 and 20, respectively.
The performance of our proposed method is compared to four different face recognition methods. We follow the same experimental set-up as used in [25]: the first image of each subject is used as the training sample, while the remaining images are used as the test samples. We first perform a set of experiments on the datasets Yale_32×32 and ORL_112×92 to compare the recognition performances of the five face recognition methods. Table 3 shows the top-1 recognition rates (%) of the five methods on the two datasets. Note that the experimental results of E(PC)²A, BFLDA, GE and 2DLDA are taken from [25]. For the dataset Yale_32×32, Table 3 shows that our proposed method achieves much higher recognition accuracy than the other four methods. For the dataset ORL_112×92, the recognition rate of our proposed method is between 8% and 40% higher than those of the other four methods.
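The evaluation protocol described above (first image for training, remaining images for testing, top-1 rate in percent) can be sketched in plain Python. The function names and the dictionary layout of the data are hypothetical and only serve the illustration.

```python
def first_image_split(images_by_subject):
    """Use the first image of each subject for training, the rest for testing.

    images_by_subject: dict mapping a subject id to an ordered list of images.
    Returns two lists of (subject, image) pairs.
    """
    train, test = [], []
    for subject, images in images_by_subject.items():
        train.append((subject, images[0]))
        test.extend((subject, img) for img in images[1:])
    return train, test

def top1_rate(predicted, actual):
    """Top-1 recognition rate in percent."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)
```

For a database such as Yale, with 11 images for each of 15 individuals, this split gives 15 training images and 150 test images.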
As 2DLDA has been demonstrated to have a superior performance compared to the other three methods, we compare the performances of 2DLDA and MR_2DLDA only, based on the datasets Yale_64×64, ORL_32×32, YaleB_32×32, PIE_32×32, FERET_40×40, AR_165×120, and LFWA_110×80 (see Table 4). For 2DLDA, the parameter k1 is set at 3, as in [25]. The parameter k2 is set at 6 in terms of the experimental results shown in Fig. 8 of [25]. It can be seen from Table 4 that our proposed method achieves much higher recognition accuracy than 2DLDA on Yale_64×64 and YaleB_32×32. Also, our proposed method has recognition rates about 10%, 17%, 20%, 31% and 13% higher than 2DLDA on ORL_32×32, PIE_32×32, FERET_40×40, AR_165×120, and LFWA_110×80, respectively. Furthermore, we can see from Tables 3 and 4 that, with our proposed method, the larger the image size, the higher the recognition rates will generally be, and vice versa.
We noticed that some classification results on the ORL database and the Yale database are also reported for a multiple-feature method (denoted as MFM) [14]. Here, we can present only a rough comparison because the image sizes and the experimental set-ups are different for our MR_2DLDA method and the MFM method. It can be seen from Tables 9 and 10 in [14] that the classification rates on the ORL database and the Yale database are 71% and 69%, respectively, when the first image of each individual is used as the training sample. However, we can see from Table 4 that the corresponding classification rates of the MR_2DLDA method are 71.39% and 80%, respectively. Moreover, we noted that the image sizes of the ORL database and the Yale database in [14] are 92×92 and 128×128, respectively, which are larger than the image sizes (32×32 and 64×64) used in evaluating the MR_2DLDA method. In terms of the conclusion we drew from Tables 3 and 4, a higher recognition accuracy can generally be achieved for the MR_2DLDA method if databases with larger image sizes are available. In general, the MR_2DLDA method has a classification performance that is competitive with the MFM method on both the ORL database and the Yale database.
Furthermore, to investigate the influence of different training samples on the recognition performance, each face image of every class is used in turn as the training sample for the datasets Yale_32×32, ORL_112×92, Yale_64×64 and ORL_32×32. For the datasets YaleB_32×32, PIE_32×32, FERET_40×40, AR_165×120, and LFWA_110×80, one face image is randomly selected from every class and used as the training sample. The trials are performed ten times. Fig. 5 shows the recognition rates of 2DLDA and MR_2DLDA on the 9 datasets when different face images are used as the training samples. It can be seen that MR_2DLDA has a better recognition performance than 2DLDA on these datasets. Tables 5 and 6 show the mean ($\mu$), standard deviation ($\sigma$), and the ratio ($\sigma/\mu$) of the top-1 recognition accuracies (%) for 2DLDA and MR_2DLDA, respectively, when different face images are used as the training samples. It can be seen that the mean recognition rates of MR_2DLDA are higher than those of 2DLDA by about 7% to 49%. These two methods have similar $\sigma$ and $\sigma/\mu$ values on the datasets ORL_112×92 and ORL_32×32. However, MR_2DLDA has lower $\sigma$ and $\sigma/\mu$ values than 2DLDA on the datasets Yale_32×32, Yale_64×64, and LFWA_110×80. Moreover, MR_2DLDA has lower $\sigma/\mu$ values than 2DLDA on the datasets YaleB_32×32, PIE_32×32, FERET_40×40, and AR_165×120. Therefore, we can conclude that MR_2DLDA is more robust than 2DLDA to the choice of training samples. We can also see that the performance of MR_2DLDA is clearly better than that of 2DLDA on the four large datasets YaleB_32×32, PIE_32×32, FERET_40×40, and AR_165×120. In addition, the experimental results of the two methods in Tables 5 and 6 again verify that the larger the image size, the higher the recognition rates will generally be, and vice versa.
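The three summary statistics reported in Tables 5 and 6 can be computed as in the sketch below. Whether the source uses the population or the sample standard deviation is not stated, so the population form is an assumption here; the function name and the example rates are hypothetical and are not results from the paper.

```python
import statistics

def summarize_rates(rates):
    """Mean, standard deviation and their ratio for a list of recognition rates (%)."""
    mu = statistics.mean(rates)
    sigma = statistics.pstdev(rates)   # population standard deviation (assumption)
    return mu, sigma, sigma / mu

# Hypothetical rates from three trials with different training images.
mu, sigma, ratio = summarize_rates([70.0, 80.0, 90.0])
```

A smaller ratio $\sigma/\mu$ means the recognition rate varies less, relative to its level, when the training image changes, which is the sense in which robustness is compared here.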

Discussion
Pre-filtering is an important step in the MR_2DLDA method. Fig. 6 shows the recognition rates of MR_2DLDA with and without pre-filtering when different face images are used as the training samples. Table 7 shows the mean ($\mu$), the standard deviation ($\sigma$), and the ratio ($\sigma/\mu$) of the recognition rates (%) for MR_2DLDA when pre-filtering is not employed. We can see from Tables 6 and 7 and from Fig. 6 that the performances decrease greatly on the datasets Yale_32×32, Yale_64×64, YaleB_32×32, PIE_32×32, FERET_40×40, AR_165×120, and LFWA_110×80, especially for the five large datasets. Although there is little change on the two ORL datasets, we can conclude, in general, that pre-filtering is an important step in the MR_2DLDA method.
Traditional parameter-selection methods, such as cross validation, cannot be used to choose the optimal parameters for face recognition in the case of the one-sample-per-person problem. For our proposed method, the parameters k1, k2 and $N_f$ can only be determined experimentally. This problem is also encountered by other existing face recognition algorithms in the one-sample-per-person case. How to find the optimal parameter values is still to be investigated in our future work.
A heavy computation burden is a common problem in the CCL algorithms. The proposed method also has a higher computation cost than does 2DLDA. The reason for this is that the feature extraction and classification are performed based on each individual filter, i.e., N f times in all. The computation time can be reduced by selecting only some of the filters, instead of using all the filters in the experiments. However, it remains a difficult problem to find an efficient criterion to select those filters that are efficient for all datasets.
One argument about CCL is that the number of samples is increased when the number of weaker classifiers is increased via the random subspace method or a similar method. For our proposed method, we can likewise increase the training set size by constructing the spectral images of the training samples, extracting the features with conventional LDA algorithms, and classifying the test samples with the nearest-neighbor algorithm. However, the results of this variant are poor for the cases considered in this paper. Another possible variant of the proposed method is to replace 2DLDA with other LDA algorithms, such as the well-known regularized discriminant analysis [39]. As 2DLDA has been demonstrated to have a superior performance compared to the other methods, there is no need to present more experimental results here.
Although our proposed method is specifically designed for face recognition in the one-sample-per-person problem, it can also be extended to deal with cases with more than one sample. When multiple training images are available, as shown in Fig. 1, we can construct one set of weaker classifiers for each sample. Correspondingly, the label of the test image can be determined by integrating the outputs of all weaker classifiers. We will not discuss in this paper the case of multiple training samples because numerous algorithms have been developed for this.

Conclusions
In this paper, we propose an efficient multi-resolution spectral feature image-based 2DLDA ensemble algorithm for the one-sample-image-per-person problem of face recognition. Experimental results have demonstrated that our proposed method has higher recognition accuracy and robustness than some recently reported methods. Further, the experimental results also indicate that, for the proposed method, the larger the image size, the higher the recognition rates will generally be, and vice versa. In addition, pre-filtering is found to be an important step in the MR_2DLDA method. Compared to the 2DLDA method, the computation time required by the proposed method is higher. How to determine an efficient criterion for selecting a subset of the filters, so as to reduce the computation burden while maintaining the performance level, is to be investigated in our future work.