An NMF-L2,1-Norm Constraint Method for Characteristic Gene Selection

Recent research has demonstrated that characteristic gene selection based on gene expression data remains faced with considerable challenges, primarily because gene expression data are typically high dimensional, nonnegative, non-sparse and noisy. Existing methods for data analysis, however, cope with only some of these challenges. In this paper, we address all of them with a unified method: nonnegative matrix factorization via the L2,1-norm (NMF-L2,1). Because L2,1-norm minimization is applied to both the error function and the regularization term, our method is robust to outliers and noise in the data and generates sparse results. The application of our method to plant and tumor gene expression data demonstrates that NMF-L2,1 can extract more characteristic genes than other state-of-the-art methods.


Introduction
The development of microarray technologies makes the study of complex biological gene expression networks possible. Microarray datasets typically contain expression data for the thousands of genes profiled on each chip, and the number of replicates is much smaller than the number of genes, making the selection of genes difficult [1]. In addition, the inclusion of irrelevant or noisy variables may decrease the accuracy of selection [2]. The problem of how to select genes associated with the target terms has become a challenge for scientists [3]. For example, plants are able to cope with environmental challenges such as cold, heat, and salt, which are referred to as abiotic stresses; there must therefore exist specific interacting genes that respond to each abiotic stress. Another typical example is that of cancer, an important cause of human morbidity; the identification of genes that are frequently mutated in cancers and play an essential role in cancer development is critical. Many methods have been proposed for processing gene expression data collected by DNA microarray profiling [4][5][6][7][8][9]. For example, Liu et al. [10] used a method based on penalized matrix decomposition (PMD) to extract characteristic plant genes, and Zheng et al. [11] applied nonnegative matrix factorization (NMF) to tumor gene selection. Principal component analysis (PCA) and singular value decomposition (SVD) have also been used to analyze gene expression data [12]. Liu et al. [13] proposed the CIPMD algorithm (a Class-Information-based Penalized Matrix Decomposition) for identifying plant core genes responding to abiotic stresses; this method is the PMD method augmented with class-label information. Liu et al. [14] proposed the PRFE algorithm (a P-Norm Robust Feature Extraction) for identifying differentially expressed genes. Although these methods are all feature selection methods in widespread use, they present some disadvantages:
1. Although the elements of the initial data matrix are entirely nonnegative, traditional low-rank algorithms [15] cannot guarantee nonnegative values in the projection matrix, thereby complicating its biological interpretation.
2. The high dimensionality of the data poses challenges, such as the so-called curse of dimensionality [16,17].
3. Faced with millions of individual data points, it is difficult to interpret gene expression data without sparse constraints.
4. Gene expression data often contain numerous outliers and abundant noise, which traditional methods do not effectively address.
NMF has been widely used in various fields because it can generate low-rank and nonnegative results. The ability to generate a low-rank nonnegative matrix that approximates a given nonnegative data matrix is a significant advantage [18], but the lack of sparsity in data processed via NMF makes this method less than ideal for characteristic gene selection. In high-throughput datasets, gene expression data are high dimensional and always contain some redundant information (i.e., not all features are relevant). To address these problems, we sought to incorporate sparsity, i.e., the reduction of certain vector elements to zero. The inclusion of sparsity has played a significant role in dimensionality reduction and feature selection [19]. For example, Journée et al. [20] proposed a sparse principal component analysis (SPCA) method using the generalized power method, and Witten et al. [21] proposed a penalized matrix decomposition (PMD) method, which has proven useful in microarray analysis by imposing penalization on the factor matrices. Nonnegative matrix factorization with sparse constraints (NMFSC), first introduced by Patrik O. Hoyer in 2004 [22], accurately controls sparsity. NMFSC has been applied to the problems of imaging and gene selection, among others. However, it does not guarantee that entire rows of a matrix are sparse, which can lead to difficulties during feature selection. To address these issues, the L2,1 version of NMF, which favors a small number of non-zero rows in the factor matrix, has been proposed to generate row-sparse results.
However, these methods for generating sparsity apply the least-squares error function, which does not reliably address noise and outliers [23]. When faced with these complications, the errors for both features and samples are squared [24], amplifying the effect of large noise values or outliers [25]. As a result, the L2,1 version of the error function has been proposed to address noisy data [26].
In light of these problems, we propose a novel method called Nonnegative Matrix Factorization with L2,1-norm (NMF-L2,1), which imposes an L2,1-norm constraint on both the error function and a regularization term to solve the aforementioned problems simultaneously. The sparse regularization term avoids the potential problem of over-fitting and selects a sparse subset of features. Rather than using an L2-norm-based error function, the L2,1-norm-based error function diminishes the impact of outliers and noise in a dataset [25,27].
The main contributions of this paper are the following. First, the L2,1-norm is employed for regularization in our method to generate sparse results, making them easier to interpret.
Second, nonnegative matrix factorization is utilized to generate low-rank results with nonnegative values.
Third, the L2,1-norm-based error function is used to diminish the effect of the outliers and noise inherent in gene expression data.
This paper is organized as follows. The methodology section introduces the NMF-L2,1 method and provides an efficient algorithm for its estimation. The results and discussion section compares our method with five other methods: PMD, NMFSC, SPCA, CIPMD and PRFE. Our conclusions are presented in the final section.

Mathematical Definition of the L2,1-norm
This subsection briefly introduces the L2,1-norm proposed in [28]. For an n × s matrix M, where m_i denotes the i-th row of M and m_{ij} the (i, j)-th entry, it is defined as

||M||_{2,1} = Σ_{i=1}^{n} ||m_i||_2 = Σ_{i=1}^{n} ( Σ_{j=1}^{s} m_{ij}^2 )^{1/2}.   (1)

In other words, we first compute the L2-norm of each row m_i and then compute the L1-norm of the vector b(M) = (||m_1||_2, ||m_2||_2, ..., ||m_n||_2). The amplitudes of the components of b(M) indicate how important each dimension is. The L2,1-norm favors a small number of nonzero rows in M, ensuring that an appropriate dimensionality reduction is achieved [29].
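As a concrete check on this definition, the norm can be computed in a few lines (a sketch using NumPy; the function name is ours):

```python
import numpy as np

def l21_norm(M):
    """L2,1-norm: the L2-norm of each row of M, summed
    (i.e., an L1-norm over the vector of row norms)."""
    row_norms = np.sqrt((np.asarray(M, dtype=float) ** 2).sum(axis=1))
    return row_norms.sum()

# Rows (3, 4), (0, 0), (5, 12) have L2-norms 5, 0 and 13,
# so the L2,1-norm of this matrix is 18.
```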

Extracting Characteristic Genes by NMF-L2,1
In this paper, the matrix X denotes the initial gene expression dataset, whose size is n × c. Each column of X represents the transcriptional response of the n genes in one sample, and each row represents the expression level of one gene across all samples. Thus, X can be approximated as

X ≈ AY,   (2)

where A is an n × d matrix, Y is a d × c matrix, and d < min(n, c). The matrices Y and A are called the coefficient and basis matrices, respectively. Given suitable parameters for NMF-L2,1, a sparse matrix A can be obtained, and characteristic genes can then be extracted according to the non-zero entries in A [30].
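This row-wise reading of A suggests a simple selection rule: rank the genes by the L2-norm of their rows in A and keep the non-zero rows. A minimal sketch (the function name and zero threshold are ours, not from the paper):

```python
import numpy as np

def select_genes(A, k=500, tol=1e-8):
    """Rank the n genes by the L2-norm of their rows in the basis
    matrix A and return (at most) the k top-ranked genes whose rows
    are not identically zero."""
    A = np.asarray(A, dtype=float)
    scores = np.sqrt((A ** 2).sum(axis=1))   # one score per gene (row)
    order = np.argsort(scores)[::-1]         # most active genes first
    return [int(i) for i in order[:k] if scores[i] > tol]
```

Rows that the L2,1 regularizer has driven to zero drop out automatically, which is exactly why row-sparsity in A is what makes the factorization interpretable gene by gene.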

NMF based on L2,1-norm (NMF-L2,1)
In standard NMF, the error for each data point is calculated as a squared residual of the form ||x_i − Ay_i||^2. As a result, a few outliers with large errors can easily dominate the objective function because of the squaring. It is therefore reasonable to propose an NMF-L2,1 formulation that reduces the influence of outliers and errors.
The error function of the NMF-L2,1 formulation is

Σ_{i=1}^{c} ||x_i − Ay_i||_2.   (3)

In this formulation, the error for each data point, ||x_i − Ay_i||, is not squared; thus, large errors caused by outliers do not dominate the objective function [32].
The NMF-L2,1 optimization problem is formulated as

min_{A, Y} ||X − AY||_{2,1} + λ||A||_{2,1},  s.t. A ≥ 0, Y ≥ 0.   (4)

By introducing the residual matrix E = X − AY, the problem in Eq (4) is equivalent to minimizing ||E||_{2,1} + λ||A||_{2,1} subject to AY + E = X and A, Y ≥ 0. Collecting the unknown blocks into a single matrix U, with the corresponding coefficient blocks in a matrix B, the problem can then be reformulated in the canonical form

min_U ||U||_{2,1},  s.t. BU = X.   (7)

How can this optimization problem handle high dimensional, nonnegative, noisy and sparse data simultaneously? The reasons are as follows:
1. The L2,1-norm error function term is designed to diminish the impact of noise and outliers contained in the original data. As a result, we can expect to obtain cleaner data for subsequent analyses.
2. The cleaned data may still not be sparse (some features may be irrelevant to the learning procedure), so the L2,1-norm regularization term [33] is used to generate a sparse solution.
3. In the next section, we will demonstrate that the above conditions produce more effective models, especially for datasets that are sparse, nonnegative, high dimensional and noisy.

An Efficient Algorithm for NMF-L2,1
To solve the constrained optimization problem in Eq (7), Nie et al. [34] have provided an efficient algorithm, which we briefly introduce here. By introducing the Lagrangian multiplier Λ, we first form the Lagrangian function

L(U, Λ) = ||U||_{2,1} − Tr(Λ^T (BU − X)),   (8)

where Tr(·) is the trace of a matrix. We then introduce the augmented cost function

J(U, q) = Tr(U^T Q U) − Tr(Λ^T (BU − X)),   (9)

where q ∈ R^b is an auxiliary vector and Q = diag(q) is a diagonal matrix with diagonal elements

q_i = 1 / ( 2 (||u^i||_2^2 + ε)^{1/2} ),   (10)

in which u^i is the i-th row of U and ε is a positive number infinitely close to, but not equal to, zero. Setting the derivative of J(U, q) with respect to U to zero, we obtain

2QU − B^T Λ = 0.   (11)

Multiplying both sides of Eq (11) by BQ^{-1} and using the constraint BU = X, we obtain

Λ = 2 (B Q^{-1} B^T)^{-1} X.   (12)

Then, we obtain

U = Q^{-1} B^T (B Q^{-1} B^T)^{-1} X.   (13)

More details of the algorithm can be found in [34]. We summarize our method in Box 1. In each iteration, U is calculated with the current Q; then, Q is updated based on the current U. This procedure is repeated until the algorithm converges.
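The inner solver described above alternates a closed-form update of U with a refresh of the diagonal weights Q. The following NumPy sketch implements only this inner loop, min_U ||U||_{2,1} s.t. BU = X, for a given B (it is not the full alternating NMF-L2,1 procedure, and the names are ours):

```python
import numpy as np

def l21_constrained_min(B, X, eps=1e-10, n_iter=200):
    """Iteratively reweighted solver for min ||U||_{2,1} s.t. BU = X
    (the scheme of Nie et al.): alternate the closed-form update
    U = Q^{-1} B^T (B Q^{-1} B^T)^{-1} X with the weight refresh
    q_i = 1 / (2 * sqrt(||u^i||^2 + eps))."""
    B, X = np.asarray(B, dtype=float), np.asarray(X, dtype=float)
    q = np.ones(B.shape[1])                   # Q starts as the identity
    U = np.zeros((B.shape[1], X.shape[1]))
    for _ in range(n_iter):
        Qinv = np.diag(1.0 / q)               # Q^{-1}
        U = Qinv @ B.T @ np.linalg.solve(B @ Qinv @ B.T, X)     # closed-form U
        q = 1.0 / (2.0 * np.sqrt((U ** 2).sum(axis=1) + eps))   # weight refresh
    return U
```

Each pass keeps the constraint BU = X satisfied exactly while shrinking whole rows of U toward zero, which is what produces the row-sparse solutions the method relies on.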
In this paper, the characteristic genes are extracted via the basis matrix A. We summarize the NMF-L2,1 method for extracting core genes as follows: 1. Create the data matrix X based on the gene expression data.
2. Obtain the basis matrix A by using the NMF-L 2,1 method.
3. Extract characteristic genes from non-zero entries in matrix A.
4. Exploit the Gene Ontology (GO) tool to investigate the extracted genes.

Results and Discussion
In this section, several experiments are carried out. In the first subsection, the NMF-L2,1 method is compared with the following methods on a gene expression dataset obtained from plants responding to abiotic stresses: (a) the PMD method (proposed by Witten et al. [15]); (b) the SPCA method (proposed by Journée et al. [35]); (c) the NMFSC method (proposed by Patrik O. Hoyer [36]); (d) the CIPMD method (proposed by Liu et al. [13]); and (e) the PRFE method (proposed by Liu et al. [14]). In the second subsection, the six methods are compared on the medulloblastoma and leukemia tumor datasets.

Results for Plant Gene Expression Data
Plants are continually challenged by environmental stresses such as drought, salt, cold, osmotic pressure, and UV-B light [37]. Among plant genes, there must exist a specific set of interacting genes that responds to each abiotic stress. Thus, it is important but challenging to extract the genes responding to each abiotic stress from plant gene expression data. Table 1 lists the reference numbers and sample numbers for each stressor.
3.1.2 Selection of the parameter λ. To obtain the most effective results, we used the gene expression data to tune the parameter λ. For each sample, the parameter was varied from 0 to 1 in steps of 0.1, and GO terms were used to select the most appropriate value. The results are provided in Table 2.
3.1.3 Gene Ontology (GO) analysis. In this paper, GO terms are used to evaluate the genes that responded to plant abiotic stressors [38]. GO Term Finder analysis provides information to aid the biological interpretation of high-throughput experiments. GO Term Finder is publicly available at http://go.princeton.edu/cgi-bin/GOTermFinder [39]; it characterizes the genes in a query/input set and finds the GO terms they may have in common.
For the sake of simplicity, 500 genes were selected from the gene expression data by each of the NMFSC, PMD, SPCA, CIPMD, PRFE and NMF-L2,1 methods. The threshold parameters used were a maximum P-value of 0.01 and a minimum of 2 genes.
3.1.4 Response to stimulus. Table 3 summarizes the results for the response to stimulus, whose background frequency in the TAIR (A. thaliana (common wallcress)) set was 6617/30320 (21.8%). The results are presented according to P-value and sample frequency. The P-value was calculated using a hypergeometric distribution (details can be found in [40]). The sample frequency denotes the number of characteristic genes selected; for example, 330/500 denotes that, of the 500 genes selected by the method, 330 corresponded to the GO term.

Box 1. The NMF-L2,1 algorithm.
Input: X ∈ R^{n×c} and the parameter λ.
Output: Y ∈ R^{d×c}, A ∈ R^{n×d}.
1: Initialize Q_t ∈ R^{b×b} as an identity matrix and A ∈ R^{n×d} as a nonnegative matrix; set t = 0.
2: Repeat: calculate U with the current Q, then update Q based on the current U, until convergence.
3: Obtain A and Y from U according to Eq (7).

As listed in Table 3, the six methods were compared by sample frequency and P-value. NMFSC, NMF-L2,1, PRFE, SPCA and PMD are unsupervised methods, so we first compare these five algorithms. Across the 12 terms, our method extracted more characteristic genes than the other unsupervised methods for eight of the twelve samples. For example, for shoot samples exposed to UV-B stress, the sample frequency was 76.4% with our method, 72% with NMFSC, 72.4% with PMD, 66.4% with SPCA and 61.8% with PRFE. This shows that our method markedly improves on PRFE, PMD and SPCA. When compared with the supervised method CIPMD, our method performs worse except under the salt and UV-B stresses. Generally speaking, since supervised methods take the class labels into consideration, they usually perform better than unsupervised methods.
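The hypergeometric P-value used in these tables can be reproduced from the counts alone. A minimal sketch of the raw upper-tail probability (GO Term Finder additionally applies multiple-testing corrections, and the function name is ours):

```python
from math import comb

def go_pvalue(N, K, n, k):
    """P(X >= k) under a hypergeometric model: the chance of seeing at
    least k annotated genes among n selected genes, when K of the N
    background genes carry the annotation."""
    total = comb(N, n)
    tail = sum(comb(K, x) * comb(N - K, n - x)
               for x in range(k, min(K, n) + 1))
    return tail / total
```

For instance, a sample frequency of 330/500 against the 6617/30320 background corresponds to go_pvalue(30320, 6617, 500, 330), which is vanishingly small, in line with the magnitudes reported in the tables.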
3.1.5 Response to abiotic stimulus. Table 4 summarizes the results of the six methods for datasets describing the response to abiotic stimulus, whose background frequency in the TAIR (A. thaliana (common wallcress)) set is 1539/29556 (5.2%). The numbers of characteristic genes and the P-values of genes responding to abiotic stimulus (GO:0009628) in root and shoot samples are listed in Table 4.
As in the 'Response to stimulus' section, we first compare the five unsupervised algorithms NMFSC, NMF-L2,1, PRFE, SPCA and PMD. The results show that our method extracted more characteristic genes than the PMD and SPCA methods for all datasets. For four of the sample datasets (drought, salt, UV-B and osmotic), our method performed worse than NMFSC but better than PMD and SPCA. When compared with the supervised method CIPMD, CIPMD performs better than the other methods except under the salt, heat and UV-B stresses.

3.1.6 Characteristic terms. In Table 5, we list the characteristic terms. Our method outperformed SPCA and PMD for all 12 items and outperformed NMFSC for seven items. PRFE outperformed our method in only one item (shoot samples under UV-B), and for one of the twelve items (cold in root) our method produced the same result as NMFSC. From these results, it can be concluded that our method is more effective than the other unsupervised methods. The response to water deprivation (GO:0009414) for shoot samples is also analyzed in Table 5; its background frequency is 1.4%. NMF-L2,1 extracted more characteristic genes than the other methods: the sample frequency in response to water deprivation was 18.1% with our method, versus 13.2% for NMFSC, 11.9% for PMD, 16.8% for CIPMD, 11.2% for PRFE and 8.2% for SPCA, indicating that our method performs 6.2% better than PMD, 7% better than PRFE and almost 10% better than SPCA.
3.2 Results for tumor datasets. Two tumor datasets were also analyzed to verify the performance of the proposed method. The medulloblastoma dataset contains 34 samples (25 tumor and 9 normal tissue samples) and assesses the expression of 5893 genes [41]. The leukemia dataset consists of 5000 genes and 38 samples (27 tumor and 11 normal tissue samples) [42].
To make a fair comparison, all six methods extracted 100 genes as characteristic genes from each of the two tumor datasets, and Gene Ontology (GO) enrichment and functional annotation were performed. Tables 6 and 7 list the top 10 most closely related terms, with P-values, for each method on the two tumor datasets. From Table 6, it can be seen that NMF-L2,1 outperforms the other methods for all terms. For example, for the term M11197 in Table 6, the P-value from NMF-L2,1 is 7.36E-129, and 70/389 denotes that NMF-L2,1 extracted 70 of the 389 genes corresponding to M11197, whereas NMFSC, PMD, CIPMD, PRFE and SPCA extracted 62, 62, 48, 52 and 55 such genes, respectively. In Table 7, our method outperforms the other methods in seven terms; only in the remaining three terms (17092989, 19755675, 11108479) does PRFE achieve a lower P-value than our method. Thus, we can conclude that our method extracts more characteristic genes than the others.
In summary, we conclude that our method is generally superior to others and is effective for the extraction of genes.

Conclusions
In this paper, we proposed an effective method to select characteristic genes with L2,1-norm minimization of both the error function and the regularization term. The L2,1-norm-based error function is robust to outliers and noise in the data points and is computationally efficient. Furthermore, the L2,1-norm-based regularization term is used to generate a sparse solution. We also used nonnegative matrix factorization to avoid the problems stemming from the high-dimensional and nonnegative nature of the data. In summary, our method copes with high dimensionality, non-negativity, sparseness and noise simultaneously. The genes selected by our method and by the other methods from both plant and tumor datasets were compared using GO enrichment; these results indicate that the proposed NMF-L2,1 method is superior to SPCA and PMD for selecting characteristic genes.

Supporting Information
S1