Discriminant Projective Non-Negative Matrix Factorization

Projective non-negative matrix factorization (PNMF) projects high-dimensional non-negative examples $X$ onto a lower-dimensional subspace spanned by a non-negative basis $W$ and considers $W^T X$ as their coefficients, i.e., $X \approx WW^T X$. Since PNMF learns a natural parts-based representation $W$ of $X$, it has been widely used in fields such as pattern recognition and computer vision. However, PNMF does not perform well in classification tasks because it completely ignores the label information of the dataset. This paper proposes a Discriminant PNMF method (DPNMF) to overcome this deficiency. In particular, DPNMF incorporates Fisher's criterion into PNMF to utilize the label information. Similar to PNMF, DPNMF learns a single non-negative basis matrix and incurs a lower computational burden than NMF. In contrast to PNMF, DPNMF maximizes the distance between the centers of any two classes of examples while minimizing the distance between any two examples of the same class in the lower-dimensional subspace, and thus has more discriminant power. We develop a multiplicative update rule to solve DPNMF and prove its convergence. Experimental results on four popular face image datasets confirm its effectiveness compared with representative NMF and PNMF algorithms.


Introduction
Dimension reduction uncovers the low-dimensional structures hidden in high-dimensional data and removes data redundancy, and thus significantly enhances performance and reduces the subsequent computational cost. Due to its effectiveness, dimension reduction has been widely used in many areas such as pattern recognition and computer vision. Some data, such as image pixels and video frames, are non-negative, but conventional dimension reduction approaches like principal component analysis (PCA, [1]) and Fisher's linear discriminant analysis (FLDA, [2]) do not maintain this non-negativity property, and thus lead to a holistic representation that is inconsistent with the intuition of learning parts to form a whole.
Non-negative matrix factorization (NMF, [3]) decomposes a non-negative data matrix $X$ into the product of two lower-rank non-negative factor matrices, i.e., $X \approx WH$. Due to the non-negativity constraints on both factor matrices $W$ and $H$, NMF learns a parts-based representation and has attracted much attention in practical tasks such as image processing [4] and data mining [5][6][7][8].
To utilize the label information of a dataset, Zafeiriou et al. [9] proposed Discriminant NMF (DNMF) by incorporating Fisher's criterion into NMF. Guan et al. [43] [44] proposed a Non-negative Patch Alignment Framework (NPAF) that incorporates margin-maximization-based discriminative information into NMF. Recently, Guan et al. [42] extended NMF to a novel low-rank and sparse matrix decomposition method termed Manhattan NMF (MahNMF). Nevertheless, NMF, DNMF, NPAF, and MahNMF suffer from the out-of-sample deficiency [10] [11], namely that the coefficient of a new example can only be obtained indirectly. Usually, after obtaining the basis $W$ by NMF, we calculate the coefficient of a new example $x$ as $y = W^{\dagger} x$, where $W^{\dagger}$ denotes the pseudo-inverse of $W$. However, this strategy violates the non-negativity of the coefficients because the pseudo-inverse operator induces negative entries. Conventional dimension reduction methods such as PAF [35], NPE [12] and LPP [13] overcome the out-of-sample deficiency by linearization, i.e., by learning a projection matrix. They project a new example into the lower-dimensional subspace by directly multiplying it with the learned projection matrix.
To overcome the out-of-sample deficiency of NMF, Yuan et al. [14] proposed projective NMF (PNMF) based on the linearization method. In particular, PNMF learns a non-negative basis of the lower-dimensional subspace and considers its transpose as the projection matrix, i.e., $X \approx WW^T X$. Since the learned projection matrix is non-negative, PNMF obtains a non-negative coefficient for any new example, because the product of a non-negative matrix and a non-negative vector is a non-negative vector. In addition, since PNMF implicitly induces $WW^T \approx I$, rows of $W$ are approximately orthogonal. Moreover, since $W$ is non-negative, such orthogonality implies that each column of $W$ contains few nonzero entries. Therefore, PNMF implicitly learns a parts-based representation. In contrast, NMF never guarantees such a parts-based representation [15]. Furthermore, PNMF involves fewer parameters than NMF, and thus it has been widely used in dimension reduction.
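The contrast between the two out-of-sample strategies can be illustrated with a minimal numerical sketch. The basis `W` and the example `x` below are hypothetical toy values chosen for illustration, not taken from any experiment in this paper; the point is only that the pseudo-inverse may produce negative coefficients, whereas multiplying by the transposed non-negative basis cannot.

```python
import numpy as np

# Hypothetical toy basis (columns are basis vectors) and a new non-negative example.
W = np.array([[1.0, 1.0],
              [0.0, 1.0]])
x = np.array([1.0, 2.0])

# NMF-style out-of-sample coefficient: multiply by the pseudo-inverse of W.
y_pinv = np.linalg.pinv(W) @ x

# PNMF-style coefficient: multiply by the transpose of the non-negative basis.
y_proj = W.T @ x

has_negative_pinv = bool((y_pinv < -1e-12).any())
has_negative_proj = bool((y_proj < 0).any())
print(y_pinv, has_negative_pinv)
print(y_proj, has_negative_proj)
```

Here `W` is invertible, so its pseudo-inverse coincides with its inverse, which contains a negative entry even though `W` itself is non-negative.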
Recently, PNMF has been well studied and extended to various tasks. Liu et al. [10] proposed projective non-negative graph embedding (PNGE), which learns two factor matrices, a non-negative basis matrix and a non-negative projection matrix, whereas PNMF learns a single one. PNGE incorporates both the geometric structure and the label information of a dataset based on graph embedding [16]. Wen et al. [17] proposed orthogonal projective non-negative matrix factorization based on NPE (NPOPNMF) for hyperspectral image feature extraction. However, PNGE and NPOPNMF have two unknown variables, like NMF, and therefore do not fully benefit from PNMF. To handle non-linear dimension reduction problems, Yang et al. [18] proposed non-linear PNMF; they theoretically analyzed the convergence of the multiplicative update rule (MUR) of PNMF and applied MUR to optimize non-linear PNMF. Since the objective function of PNMF contains a fourth-order term, MUR suffers from a serious non-convergence problem. To remedy this problem, Hu et al. [19] approximated PNMF with a high-order Taylor expansion of the objective function and developed an MUR with proven convergence. To guarantee the convergence of PNMF, Zhang et al. [20] solved PNMF by a new adaptive MUR without normalizing the basis matrix in each iteration round.
Although PNMF and its variants have been successfully applied in many fields such as face recognition and document clustering, they share the following problems. PNMF and most of its variants ignore the label information of the dataset, and thus they cannot perform well in classification tasks. PNGE considers the label information based on the graph embedding framework [16], but it introduces an additional unknown variable and increases the computational complexity. In this paper, we propose a Discriminant PNMF (DPNMF) to overcome the aforementioned problems. In particular, DPNMF incorporates Fisher's criterion into PNMF to push examples of different classes as far apart as possible while pulling examples of the same class as close together as possible in the lower-dimensional subspace. It has been verified that label information enhances recognition performance in practical applications [21][22][23][24]. Therefore, DPNMF benefits from the label information and significantly boosts performance in classification tasks. To avoid the singularity problem in conventional FLDA, DPNMF utilizes a carefully chosen parameter to trade off the two aforementioned objectives. To solve DPNMF, we develop an MUR-based algorithm and prove its convergence. Experimental results on four popular face image datasets, Yale [25], ORL [26], UMIST [27] and FERET [28], confirm the effectiveness of DPNMF compared with NMF, PNMF and their extensions.

Analysis
This section surveys non-negative matrix factorization (NMF) and projective non-negative matrix factorization (PNMF) and analyzes their strengths and shortcomings.

NMF
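Given $n$ examples in $m$-dimensional space arranged in a non-negative data matrix $V \in \mathbb{R}_+^{m \times n}$, NMF seeks two lower-rank non-negative factor matrices, $W \in \mathbb{R}_+^{m \times r}$ and $H \in \mathbb{R}_+^{r \times n}$, whose product reconstructs $V$. The objective of NMF is to minimize the Kullback-Leibler (KL) divergence between $V$ and $WH$, i.e.,

$$\min_{W \ge 0,\, H \ge 0} D(V \,\|\, WH) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right), \tag{1}$$

where log signifies the natural logarithm. Although NMF is jointly non-convex with respect to $W$ and $H$, it is convex with respect to $W$ and $H$ separately. Therefore, NMF can be solved by alternately updating the two factor matrices. Lee and Seung [3] proposed an efficient multiplicative update rule (MUR) to solve NMF:

$$W_{ik} \leftarrow W_{ik} \sum_{j} \frac{V_{ij}}{(WH)_{ij}} H_{kj}, \tag{2}$$

$$W_{ik} \leftarrow \frac{W_{ik}}{\sum_{l} W_{lk}}, \tag{3}$$

$$H_{kj} \leftarrow H_{kj} \sum_{i} W_{ik} \frac{V_{ij}}{(WH)_{ij}}, \tag{4}$$

where (2) updates $W$ followed by a normalization (3), and (4) updates $H$.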
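The alternating scheme described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the classical Lee and Seung form of the KL-divergence updates (update $W$, normalize its columns to unit sum, update $H$); the function names, the random initialization, and the small constant `eps` that guards against division by zero are our own choices for illustration and are not part of the original algorithm description.

```python
import numpy as np

def kl_divergence(V, WH, eps=1e-12):
    """Generalized KL divergence D(V || WH), summed over all entries."""
    return float(np.sum(V * np.log((V + eps) / (WH + eps)) - V + WH))

def nmf_kl(V, r, n_iter=200, seed=0, eps=1e-12):
    """Sketch of KL-divergence NMF with multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, r)) + 0.1   # strictly positive random initialization
    H = rng.random((r, n)) + 0.1
    history = [kl_divergence(V, W @ H)]   # divergence of the initial factors
    for _ in range(n_iter):
        # Multiplicative update of W.
        W *= (V / (W @ H + eps)) @ H.T
        # Normalize every column of W to unit sum.
        W /= W.sum(axis=0, keepdims=True) + eps
        # Multiplicative update of H with the normalized W.
        H *= W.T @ (V / (W @ H + eps))
        history.append(kl_divergence(V, W @ H))
    return W, H, history
```

Because every step multiplies non-negative quantities, `W` and `H` stay non-negative throughout without any explicit projection.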
Since NMF ignores the label information of a dataset, it does not perform well in classification tasks. In addition, NMF suffers from the out-of-sample problem because it is non-trivial to calculate a non-negative coefficient for a new example.

PNMF
To overcome the out-of-sample deficiency of NMF, PNMF [14] learns a non-negative projection matrix that directly projects $V$ onto the lower-dimensional subspace. Let $W$ denote the basis matrix; PNMF then treats $W^T V$ as the coefficients and utilizes $WW^T V$ to reconstruct $V$. The objective function of PNMF is

$$\min_{W \ge 0} J_{PNMF}(W) = \| V - WW^T V \|_F^2, \tag{5}$$

where $\|\cdot\|_F$ denotes the Frobenius norm. Since $J_{PNMF}$ is non-convex [19], it is non-trivial to find the global minimum of PNMF. Yuan et al. [14] developed a multiplicative update rule (MUR), Eq. (6), which iteratively updates $W$ until $J_{PNMF}$ no longer changes. In each iteration round, PNMF normalizes $W$ by dividing it by its spectral norm, i.e., $W \leftarrow W / \|W\|_2$, where $\|\cdot\|_2$ signifies the spectral norm of a matrix, for the following reason. According to (5), PNMF implicitly induces the constraint $WW^T \approx I$, which is not guaranteed by (6). The normalization shrinks $W$ to make $WW^T$ close to $I$ in terms of the spectral norm.
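A PNMF iteration with spectral-norm normalization can be sketched as follows. The multiplicative step used here is an assumption: it is the gradient-ratio rule obtained by splitting the gradient of $\|V - WW^T V\|_F^2$ into its negative part $4VV^T W$ and positive part $2WW^T VV^T W + 2VV^T WW^T W$, which is how the Euclidean PNMF update is commonly written; constant factors in the original rule may differ. The function name `pnmf` and the guard `eps` are hypothetical.

```python
import numpy as np

def pnmf(V, r, n_iter=200, seed=0, eps=1e-12):
    """Sketch of Euclidean PNMF: multiplicative step, then spectral normalization."""
    rng = np.random.default_rng(seed)
    m = V.shape[0]
    W = rng.random((m, r)) + 0.1   # strictly positive random initialization
    G = V @ V.T                    # m x m Gram matrix, computed once
    for _ in range(n_iter):
        GW = G @ W
        numer = 2.0 * GW                                      # negative gradient part / 2
        denom = W @ (W.T @ GW) + G @ (W @ (W.T @ W)) + eps    # positive gradient part / 2
        W *= numer / denom
        # Divide W by its spectral norm (largest singular value).
        W /= np.linalg.norm(W, 2)
    return W
```

Given the learned basis, the coefficient of a new non-negative example `x` is simply `W.T @ x`, which is non-negative by construction.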
PNMF overcomes the out-of-sample deficiency of NMF and learns parts-based representation because it implicitly induces the orthogonality of the learned basis. However, since PNMF ignores the label information of a dataset, like NMF, PNMF does not work well in classification tasks.

Discriminant PNMF
The above analysis yields two observations on NMF and its extensions: 1) both NMF and DNMF suffer from the out-of-sample deficiency, and 2) although PNMF overcomes the out-of-sample deficiency, it does not utilize the label information in a dataset. To further understand these observations, we [...] Figure 1.B shows that these coefficients contain negative entries caused by the pseudo-inverse operator over the basis matrix, i.e., DNMF suffers from the out-of-sample deficiency, which weakens its discriminant power. Figure 1.C shows that PNMF overcomes the out-of-sample deficiency but has weak discriminant power because it completely ignores the label information.
These observations motivate us to combine the advantages of DNMF and PNMF and propose the Discriminant PNMF (DPNMF) algorithm. In particular, we assume that examples can be projected onto a lower-dimensional subspace and that the transpose of the basis serves as the projection matrix. As in PNMF, this assumption implicitly induces a parts-based representation of the training examples and overcomes the out-of-sample deficiency. To utilize the label information of a dataset like DNMF, DPNMF incorporates Fisher's criterion to enhance the discriminant ability of PNMF. Given training examples arranged in $V \in \mathbb{R}^{m \times n}$, DPNMF learns the basis matrix $W \in \mathbb{R}^{m \times r}$ ($r \le m$ and $r \le n$) and projects $V$ from $\mathbb{R}^m$ to $\mathbb{R}^r$ by $W^T$, i.e., the coefficients are $Y = W^T V$. Following [2], DPNMF expects examples of the same class to be as close as possible and examples of different classes to be as far apart as possible in the lower-dimensional subspace. Since $Y = W^T V$, these two objectives are equivalent to

$$\min_{W} \mathrm{Tr}\left(W^T S_w W\right) \tag{7}$$

and

$$\max_{W} \mathrm{Tr}\left(W^T S_b W\right), \tag{8}$$

where $S_w$ and $S_b$ denote the within-class and between-class scatter matrices of $V$, computed from the $C$ classes with $n_c$ examples in class $c$. Combining (7) and (8) with the loss function of PNMF gives the objective of DPNMF:

$$\min_{W \ge 0} J_{DPNMF}(W) = \| V - WW^T V \|_F^2 + \mu\, \mathrm{Tr}\left(W^T (\lambda S_w - S_b) W\right), \tag{9}$$

where $\lambda$ balances objectives (7) and (8), and $\mu$ controls the weight of Fisher's criterion. The tradeoff parameter $\lambda$ is critical in DPNMF (9). According to [29], we choose $\lambda$ as the largest eigenvalue of $S_w^{-1} S_b$ to guarantee the convexity of Fisher's criterion. Although the second term of (9) is convex, the objective function of (9) is non-convex because the loss function of PNMF is non-convex. The following section presents an efficient algorithm to find a local minimum. The other tradeoff parameter $\mu$ is tuned in the experiments.
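The role of the tradeoff parameter can be checked numerically with a short sketch. We assume here the classical FLDA definitions of the within-class and between-class scatter matrices, which may differ from the exact definitions used in this paper (for example in scaling or in how class centers are paired); the helper names and the small `ridge` term, added only so that the linear solve is well posed when $S_w$ is nearly singular, are hypothetical.

```python
import numpy as np

def scatter_matrices(V, labels):
    """Classical FLDA scatter matrices for examples stored as columns of V (m x n)."""
    labels = np.asarray(labels)
    m = V.shape[0]
    mean_all = V.mean(axis=1, keepdims=True)
    S_w = np.zeros((m, m))
    S_b = np.zeros((m, m))
    for c in np.unique(labels):
        Vc = V[:, labels == c]                  # examples of class c
        mean_c = Vc.mean(axis=1, keepdims=True)
        D = Vc - mean_c
        S_w += D @ D.T                          # spread around the class center
        d = mean_c - mean_all
        S_b += Vc.shape[1] * (d @ d.T)          # spread of class centers, weighted by n_c
    return S_w, S_b

def fisher_tradeoff(S_w, S_b, ridge=1e-8):
    """Largest eigenvalue of (S_w + ridge * I)^{-1} S_b."""
    m = S_w.shape[0]
    M = np.linalg.solve(S_w + ridge * np.eye(m), S_b)
    return float(np.max(np.linalg.eigvals(M).real))
```

With $\lambda$ chosen this way, $\lambda S_w - S_b$ is positive semidefinite (up to rounding), so the Fisher term $\mathrm{Tr}(W^T(\lambda S_w - S_b)W)$ is convex in $W$, which is the property the text relies on.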

MUR for DPNMF
Since the objective function $J_{DPNMF}(W)$ is non-convex, finding its global minimum is intractable in general. Fortunately, it is differentiable with respect to $W$, and thus the gradient descent method can be used to find a local minimum of (9). By simple algebra, (9) can be rewritten in trace form as problem (10), which is a constrained minimization problem and can be solved by the Lagrangian multiplier method [30]. In the Lagrangian function (11) of problem (10), $\varphi$ denotes the Lagrangian multiplier for the constraint $W \ge 0$. According to the K.K.T. conditions [31], the minimizer of (9) satisfies conditions (12) to (14), where $W_{ik}$ stands for the entry in the $i$-th row and $k$-th column of $W$. Substituting (12) into (14) yields (15). Any real matrix $A$ can be written as its positive part minus its negative part, i.e., $A = [A]_+ - [-A]_+$, where the operator $[X]_+$ keeps the non-negative entries of $X$ and sets the negative entries to zero. Applying this decomposition and simple algebra, (15) is equivalent to (16). Eq. (16) gives us the multiplicative update rule (MUR) for DPNMF, Eq. (17). Since the MUR involves only products of non-negative matrices, the obtained minimizer naturally satisfies the non-negativity constraint. Although the MUR is derived from the K.K.T. conditions [31], it does decrease the objective function $J_{DPNMF}(W)$ of DPNMF. Theorem 1 establishes the convergence of the MUR.
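To make the positive/negative splitting concrete, the following worked derivation sketches how a multiplicative rule of this kind arises. It assumes the objective has the form $\|V - WW^TV\|_F^2 + \mu\,\mathrm{Tr}(W^T(\lambda S_w - S_b)W)$; constant factors in the paper's own equations may differ, and the shorthand $G = VV^T$ and $A = \lambda S_w - S_b$ is introduced here only for illustration.

```latex
% Assumed objective: J(W) = \|V - WW^T V\|_F^2 + \mu \mathrm{Tr}(W^T A W),
% with G = VV^T and A = \lambda S_w - S_b (both symmetric).
\begin{align*}
J(W) &= \mathrm{Tr}(G) - 2\,\mathrm{Tr}(W^T G W) + \mathrm{Tr}(W^T G W W^T W)
        + \mu\,\mathrm{Tr}(W^T A W), \\
\nabla J(W) &= -4\,GW + 2\,WW^T G W + 2\,G W W^T W + 2\mu\,AW .
\end{align*}
% Split A = [A]_+ - [-A]_+ and collect non-negative terms:
\begin{align*}
\nabla^{+} &= 2\,WW^T G W + 2\,G W W^T W + 2\mu\,[\lambda S_w - S_b]_+ W, \\
\nabla^{-} &= 4\,GW + 2\mu\,[S_b - \lambda S_w]_+ W, \qquad
\nabla J = \nabla^{+} - \nabla^{-} .
\end{align*}
% The K.K.T. condition (\nabla J)_{ik} W_{ik} = 0 suggests the fixed-point rule
\begin{equation*}
W_{ik} \leftarrow W_{ik}\,
\frac{\bigl(2\,GW + \mu\,[S_b - \lambda S_w]_+ W\bigr)_{ik}}
     {\bigl(WW^T G W + G W W^T W + \mu\,[\lambda S_w - S_b]_+ W\bigr)_{ik}} .
\end{equation*}
```

At a fixed point with $W_{ik} > 0$ the ratio equals one, which is exactly $(\nabla J)_{ik} = 0$, and every factor in the rule is non-negative, so non-negativity of $W$ is preserved automatically.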
We leave the proof of Theorem 1 to Materials. Similar to PNMF, DPNMF also implicitly induces the constraint $WW^T \approx I$, which cannot be satisfied by the MUR. Therefore, DPNMF normalizes $W$ by dividing it by its spectral norm in each iteration round to remedy this deficiency. The DPNMF algorithm is summarized in Algorithm 1 (see Table 1), where the operator $\circ$ in line 5 signifies element-wise multiplication. Algorithm 1 stops when condition (18) is satisfied, where $t$ is the iteration counter and $\varepsilon$ is a predefined tolerance.
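The overall iteration (multiplicative step, spectral normalization, stopping test) can be sketched as below. This is an illustrative sketch under stated assumptions rather than a transcription of Algorithm 1: the objective is assumed to be $\|V - WW^TV\|_F^2 + \mu\,\mathrm{Tr}(W^T A W)$ with $A = \lambda S_w - S_b$ supplied by the caller, the update is the gradient-ratio rule for that objective, and the relative-change stopping test is a stand-in for condition (18). All function and parameter names are hypothetical.

```python
import numpy as np

def pos(A):
    """[A]_+ : keep non-negative entries, set negative entries to zero."""
    return np.maximum(A, 0.0)

def dpnmf_objective(V, W, A, mu):
    """Assumed objective: reconstruction error plus weighted Fisher term."""
    R = V - W @ (W.T @ V)
    return float(np.sum(R * R) + mu * np.trace(W.T @ A @ W))

def dpnmf(V, A, r, mu=1.0, n_iter=500, tol=1e-7, seed=0, eps=1e-12):
    """Sketch of a DPNMF-style solver; A = lam * S_w - S_b is m x m and symmetric."""
    rng = np.random.default_rng(seed)
    m = V.shape[0]
    W = rng.random((m, r)) + 0.1
    W /= np.linalg.norm(W, 2)
    G = V @ V.T
    A_pos, A_neg = pos(A), pos(-A)          # A = A_pos - A_neg
    prev = dpnmf_objective(V, W, A, mu)
    for _ in range(n_iter):
        GW = G @ W
        numer = 2.0 * GW + mu * (A_neg @ W)
        denom = W @ (W.T @ GW) + G @ (W @ (W.T @ W)) + mu * (A_pos @ W) + eps
        W = W * numer / denom               # element-wise multiplicative step
        W /= np.linalg.norm(W, 2)           # divide by the spectral norm
        cur = dpnmf_objective(V, W, A, mu)
        if abs(prev - cur) <= tol * max(abs(prev), 1.0):   # assumed stopping test
            break
        prev = cur
    return W
```

The products `W.T @ GW` and `W.T @ W` are formed first so that no intermediate larger than $m \times r$ or $r \times r$ is created beyond the precomputed $m \times m$ matrices.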
The main time cost of Algorithm 1 is spent on lines 1, 2, and 5. Line 1 constructs both the within-class and between-class scatter matrices in $O(m^2 n)$ time. Line 2 calculates the inverse of $S_w$ and its product with $S_b$ in $O(m^3)$ time. Line 5 dominates the time complexity because it involves multiplications between high-dimensional matrices and the number of iterations is usually large. On closer inspection, the time cost of line 5 can be decreased by updating $W_{t+1}$ in the following two steps: [...] and [...], where (19) [...]

Experiments
This section evaluates DPNMF through a comprehensive study of its data representation ability and of its effectiveness in face recognition on four datasets: Yale [25], ORL [26], UMIST [27] and FERET [28].

A Comprehensive Study
To validate the data representation ability of DPNMF, we conducted a simple experiment before the practical tasks. We randomly selected two individuals from the UMIST dataset. For each individual, 15 images were chosen, of which 7 were used for training and the remaining 8 for testing. Each image was cropped to a $40 \times 40$ pixel array and reshaped to a 1600-dimensional vector. We marked the images of the two individuals by "*" and "o", respectively, and distinguished training images from test images by color. In total, we obtained 14 training images painted in red and 16 test images painted in blue in Figure 2. In this experiment, DPNMF, DNMF, PNMF and NMF were applied to the training images to learn a 2-dimensional subspace. Then, the test images were projected onto the learned subspace to depict the data representation ability of each method. Figure 2 shows the coefficients of both training and test images in the subspaces learned by DPNMF, DNMF, PNMF and NMF. Figure 2.B shows that the coefficients in the DNMF subspace contain negative entries, i.e., DNMF suffers from the out-of-sample deficiency on the test examples. Figure 2.C shows that PNMF overcomes the out-of-sample deficiency but has weak discriminant power because it ignores the label information of the training images. In addition, NMF both suffers from the out-of-sample deficiency and ignores the label information of the training images (see Figure 2.D). Figure 2.A shows that DPNMF simultaneously overcomes the aforementioned drawbacks and separates the images of the two individuals perfectly.

Face Recognition
In this section, we validate the effectiveness of DPNMF by comparing it with the most related methods, including NMF, PNMF, PNGE and DNMF, on four datasets: Yale [25], ORL [26], UMIST [27] and FERET [28]. For each dataset, all face images are aligned according to the eye positions. Different numbers of images of each subject were randomly selected to construct the training set, and the remaining images constitute the test set. In this experiment, we used the nearest neighbor (NN) rule as the classifier and calculated the accuracy as the percentage of test face images that are correctly classified. To eliminate the effect of randomness, we repeated each trial 5 times and compared the representative algorithms based on the average accuracy. For DNMF, we set $\gamma = 10$ and $\delta = 0.0001$ for the within-class scatter term and the between-class scatter term, respectively. For PNGE, we set the trade-off parameter $\mu = 0.5$ and the other parameters according to [10]. For all algorithms, the maximum number of loops is set to 2000 and the tolerance $\varepsilon$ of the stopping criterion is set to $10^{-7}$. Given the training set $V_{tr}$, both NMF and DNMF learn a basis $W$ and coefficients $Y_{tr}$ such that $V_{tr} \approx W Y_{tr}$. To classify each image $v_{ts}$, we first calculate its coefficient $y_{ts} = W^{\dagger} v_{ts}$ and then assign it to the class of the training image whose coefficient has the smallest Euclidean distance to $y_{ts}$, i.e., $i = \arg\min_{y_i \in Y_{tr}} \| y_i - y_{ts} \|_2$. Since both PNMF and DPNMF learn a basis $W$ and consider its transpose as a projection matrix, the coefficient of a test image $v_{ts}$ is instead calculated as $y_{ts} = W^T v_{ts}$. We keep the remaining classification procedure identical for fairness of comparison. Figure 3 gives the basis images learned by DPNMF, DNMF, PNGE, NMF, and PNMF on the Yale, ORL, UMIST, and FERET datasets. It shows that DPNMF learns a parts-based representation. In the following, we validate the effectiveness of this representation.

Table 1. Algorithm 1 (DPNMF).
Output: Basis matrix $W$.
1. Calculate $S_w$ and $S_b$ with $V$ and $L$, according to (1) and (2), respectively.
2. Calculate the largest eigenvalue $\lambda$ of $S_w^{-1} S_b$.
[...]
6. Normalize $W_{t+1} \leftarrow W_{t+1} / \| W_{t+1} \|_2$ and update $t \leftarrow t+1$.
Until {stopping criterion (18) is satisfied}.
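The nearest neighbor protocol used in these experiments can be sketched as follows. The function operates directly on coefficient matrices, so the same code serves both cases: for PNMF and DPNMF one would pass `W.T @ V_tr` and `W.T @ V_ts`, while for NMF and DNMF one would pass the learned training coefficients and `np.linalg.pinv(W) @ V_ts`. The names `nn_classify` and `accuracy` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def nn_classify(Y_tr, labels_tr, Y_ts):
    """1-NN rule on coefficients; columns of Y_tr (r x n_tr) and Y_ts (r x n_ts) are examples."""
    labels_tr = np.asarray(labels_tr)
    # Squared Euclidean distances, shape (n_ts, n_tr).
    diff = Y_ts[:, :, None] - Y_tr[:, None, :]
    d2 = (diff ** 2).sum(axis=0)
    nearest = d2.argmin(axis=1)     # index of the closest training coefficient
    return labels_tr[nearest]

def accuracy(pred, truth):
    """Fraction of test examples that are correctly classified."""
    return float(np.mean(np.asarray(pred) == np.asarray(truth)))
```

Minimizing the squared distance selects the same neighbor as minimizing the Euclidean distance itself, so the square root is omitted.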
Yale Dataset. The Yale face image database [25] consists of 165 grayscale images taken from 15 subjects. Eleven images were taken from each subject under different settings, such as varying facial expressions (sleepy or surprised) and other configurations. Each image is cropped to $32 \times 32$ pixels and reshaped to a 1024-dimensional vector. For each subject, 2, 4, 6, and 8 images were randomly selected as the training images and the remaining images were used as test images. In this experiment, we set the parameter $\mu = 1$ for DPNMF (9). Figure 4 reports the average accuracies of DPNMF, DNMF, PNGE, PNMF and NMF on the Yale dataset under different settings. It shows that DPNMF significantly outperforms the representative algorithms because it utilizes the label information in representing the training images, and because its parts-based representation (cf. row A of Figure 3) effectively inhibits the influence of noise.

ORL Dataset. The Cambridge ORL database [26] is composed of 400 face images taken from 40 individuals with varying facial expression, lighting and occlusions, such as with and without glasses. For each individual, 2, 4, 6, and 8 images were randomly selected as the training images and the remaining images were used as test images. Each image is cropped to $32 \times 32$ pixels and reshaped to a 1024-dimensional vector. For DPNMF, the parameter in (9) [...]

UMIST Dataset. The UMIST database [27] includes 575 face images collected from 20 individuals from different views and poses. Each image was resized to a $40 \times 40$ pixel array and reshaped to a 1600-dimensional vector. In this experiment, a subset of 300 images, composed of 15 left-profile images per subject, was tested. We randomly selected 4, 6, 8, and 10 images from each individual for training and used the remaining images for testing. For DPNMF, we set the parameter $\mu = 1$ in (9) empirically.
Figure 6 compares the average accuracies of DPNMF, DNMF, PNGE, PNMF and NMF on the UMIST dataset under different settings. It shows that DPNMF significantly outperforms the other algorithms, especially when four or six images of each individual are selected for training. When eight or ten images of each individual are selected for training, DPNMF performs almost perfectly.
FERET Dataset. The FERET database [28] contains 13,539 face images taken from 1,565 subjects varying in size, pose, illumination, facial expression and age. We randomly selected 100 individuals and 7 images for each individual to build the FERET dataset. Each image was cropped to a $40 \times 40$ pixel array and reshaped to a 1600-dimensional vector. For each individual, 2, 3, 4, and 5 images were randomly selected for training and the remaining images were used for testing. For DPNMF (9), we set the parameter $\mu = 1$ when 2 or 3 images of each individual are selected for training, and $\mu = 0.1$ when 4 or 5 images of each individual are selected for training. Figure 7 reports the average accuracies of DPNMF, DNMF, PNGE, PNMF and NMF on the FERET dataset under different settings. It shows that DPNMF significantly outperforms NMF, PNMF, and PNGE because it utilizes the label information in the training set. Figure 7 also shows that DNMF performs well on this dataset, especially when 3, 4, or 5 images of each individual are selected for training. However, DNMF performs poorly when only two images of each individual are used for training, because the training examples are rather limited in this case and the pseudo-inverse operator over its learned basis greatly reduces the discriminant power of DNMF. DPNMF overcomes this problem and thus performs well in this case (see Figure 7.A). This observation confirms the effectiveness of DPNMF.

Discussion
This section shows how to tune the tradeoff parameter in DPNMF. In addition, we also give an empirical validation of both convergence and efficiency of the MUR algorithm for DPNMF.

Parameter Selection
In the proposed DPNMF, the trade-off parameter $\mu$ controls the discriminant power. It is usually tuned by grid search over a wide range. In our experiments, we tuned this parameter over the grid $\{10^{-10}, 10^{-7}, 10^{-3}, 0.01, 0.1, 1, 3, 5, 10, 50, 100, 500, 10^{3}, 10^{7}, 10^{10}\}$ on the Yale, ORL, UMIST and FERET datasets. To study the consistency of the selected parameter, we randomly selected 4 and 8 images from each individual of the Yale and ORL datasets for training, 6 and 10 images from each individual of the UMIST dataset, and 3 and 5 images from each individual of the FERET dataset. Each trial was independently conducted five times to eliminate the randomness of the training set, and the average accuracies are reported in Figure 8.A to Figure 8.H, respectively. Figure 8.A and Figure 8.E show that DPNMF performs stably when $\mu$ is selected from $10^{-10}$ to 1 on the Yale dataset and reaches its peak when $\mu = 1$. Figure 8.B and Figure 8.F show that DPNMF performs stably when $\mu$ varies from $10^{-10}$ to 0.1 on the ORL dataset and reaches its peak when $\mu = 0.1$. Figure 8.C and Figure 8.G show that DPNMF performs stably when $\mu$ is selected from $10^{-10}$ to 50 on the UMIST dataset and reaches its peak when $\mu = 3$. Figure 8.D and Figure 8.H show that DPNMF performs stably when $\mu$ is selected from $10^{-10}$ to 1 on the FERET dataset and reaches its peak when $\mu = 0.01$. From Figure 8, we can see that DPNMF performs stably when the parameter $\mu$ is selected from a wide range, but its discriminant power might decrease as $\mu$ is gradually increased. Therefore, we empirically set the parameter $\mu = 1$, and this parameter should be tuned to obtain satisfactory classification performance on other datasets.

Convergence Study
In this section, we verify the convergence of DPNMF on the four tested face datasets. We randomly selected 8, 8, 10 and 5 images from each individual of the Yale, ORL, UMIST and FERET datasets for training, and report the objective values versus the number of iterations in Figure 9.A to Figure 9.D, respectively. In this experiment, we set the tradeoff parameter $\mu$ to 10, 0.1, 3, and 0.01, according to the above analysis, and the reduced dimensionalities to 116, 304, 186, and 496 on the Yale, ORL, UMIST, and FERET datasets, respectively. The maximum number of iterations is set to 500.
From Figure 9.A to Figure 9.D, we can see that the MUR gradually reduces the objective function of DPNMF and converges rapidly within 500 iteration rounds on the four tested datasets.

Efficiency Study
We also evaluated the computational cost of DPNMF compared with the representative algorithms on the Yale, ORL, UMIST, and FERET datasets. Similarly, we randomly selected 8, 8, 10 and 5 images from each individual of the Yale, ORL, UMIST and FERET datasets for training and repeated each trial five times to eliminate the effect of randomness. The parameter settings are the same as those in the above section. We implemented all algorithms in MATLAB on a workstation with a 3.4 GHz Intel(R) Core(TM) processor and 8 GB of RAM. Figure 10 compares the average CPU time per iteration round spent by DPNMF with that spent by PNMF and PNGE on the four test datasets. Figure 10 shows that DPNMF costs more CPU time than the other algorithms because it uses two time-consuming operators, i.e., $[\lambda S_w - S_b]_+ W$ and $[S_b - \lambda S_w]_+ W$ in line 5 of Algorithm 1, whose time complexities are both $O(m^2 r)$. However, DPNMF can achieve higher accuracy than the other algorithms (see Figure 4 to Figure 7) due to the incorporated Fisher's criterion. Several efficient NMF optimization algorithms such as NeNMF [45], Online RSA-NMF [46], and L-FGD [47] can be applied to optimize DPNMF more efficiently than MUR.
From the above analysis, DPNMF is an effective dimension reduction method. In future work, we will apply it to vision tasks such as color-to-gray image transformation [32], 3-D face reconstruction [33], and 3-D facial expression analysis [34]. In addition, we will extend DPNMF to tensor analysis [37] for gait recognition [36] and to Bayesian models based on covariance learning [38][39] [40] [41].

Conclusion
This paper proposes an effective Discriminant Projective Nonnegative Matrix Factorization (DPNMF) method to overcome the out-of-sample deficiency of NMF and boost its discriminant power by incorporating the label information in a dataset based on Fisher's criterion. We developed a multiplicative update rule to solve DPNMF and proved its convergence. Experimental results on popular face image databases demonstrate that DPNMF outperforms NMF and PNMF as well as their extensions.

Proof of Theorem 1
Given the current solution $W'$, we approximate $J_{DPNMF}(W)$ by its Taylor [...] We construct an auxiliary function $G(W, W')$ of $J_{DPNMF}(W)$ as follows: [...] It is easy to verify that $J_{DPNMF}(W') = G(W', W')$. In the following, we prove that $J_{DPNMF}(W) \le G(W, W')$ to complete the proof. For any $z > 0$, we have $z \ge 1 + \log z$. By substituting $z = W_{ik} / W'_{ik}$ into the above inequality, we have [...] Since $\mathrm{Tr}(W^T VV^T W') = \sum_{ik} (VV^T W')_{ik} W_{ik}$ and (23), we have [...] By substituting (24) and (25) into (21), we prove that $J_{DPNMF}(W) \le G(W, W')$.
Assuming $W''$ is the minimizer of $G(W, W')$, we have the following inequalities:

$$J_{DPNMF}(W'') \le G(W'', W') \le G(W', W') = J_{DPNMF}(W'). \tag{26}$$

What remains is to calculate $W''$ and verify that it satisfies the non-negativity constraint. To this end, we set the gradient of $G(W, W')$ with respect to $W$ to zero, which is Eq. (27). Eq. (27) gives the closed-form minimizer, Eq. (28). Since (28) contains only multiplications and divisions of non-negative entries, $W''$ is a non-negative matrix.
It is obvious that (28) is equivalent to (17), and thus (26) implies that (17) decreases the objective function of DPNMF. This completes the proof.