A Framework for Regularized Non-Negative Matrix Factorization, with Application to the Analysis of Gene Expression Data

Non-negative matrix factorization (NMF) condenses high-dimensional data into lower-dimensional models subject to the requirement that data can only be added, never subtracted. However, the NMF problem does not have a unique solution, creating a need for additional constraints (regularization constraints) to promote informative solutions. Regularized NMF problems are more complicated than conventional NMF problems, creating a need for computational methods that incorporate the extra constraints in a reliable way. We developed novel methods for regularized NMF based on block-coordinate descent with proximal point modification and a fast optimization procedure over the alpha simplex. Our framework has important advantages in that it (a) accommodates for a wide range of regularization terms, including sparsity-inducing terms like the penalty, (b) guarantees that the solutions satisfy necessary conditions for optimality, ensuring that the results have well-defined numerical meaning, (c) allows the scale of the solution to be controlled exactly, and (d) is computationally efficient. We illustrate the use of our approach on in the context of gene expression microarray data analysis. The improvements described remedy key limitations of previous proposals, strengthen the theoretical basis of regularized NMF, and facilitate the use of regularized NMF in applications.


Introduction
Given a data matrix A of size m6n, the aim of NMF is to find a factorization A~WH T where W is a non-negative matrix of size m6k (the component matrix), H is a non-negative matrix of size n6k (the mixing matrix), and k is the number of components in the model. Because exact factorizations do not always exist, common practice is to compute an approximate factorization by minimizing a relevant loss function, typically where E : E F is the Frobenius norm. Other loss functions include Kullback-Leibler's, Bregman's, and Csiszar's divergences [1][2][3][4]. Problem 1 has been well studied and several solution methods proposed, including methods based on alternating non-negative least squares [5,6], multiplicative updates [1,3,7,8], projected gradient descent [9][10][11], and rank-one residue minimization [12] (reviews in refs. [9,13]). The NMF problem is computationally hard. Particularly, an important property is that the factorization is not unique, as every invertible matrix S satisfying WS §0 and S {1 H T §0 will yield another non-negative factorization (WS)(S {1 H T ) of the same matrix as WH T (simple examples of S matrices include diagonal re-scaling matrices) [14]. To reduce the problem of nonuniqueness, additional constraints can be included to find solutions that are likely to be informative/relevant with respect to problemspecific prior knowledge. While prior knowledge can be expressed in different ways, the extra constraints often take the form of regularization constraints (regularization terms) that promote qualities like sparseness, smoothness, or specific relationships between components [13]. At the same time, the computational problem becomes more complicated, creating a need for computation methods that are capable of handling the regularization constraints in a robust and reliable way.
We developed a novel framework for regularized NMF. This framework represents an advancement in several respects: first, our starting point is a general formulation of the regularized NMF problem where the choice of regularization term is open. Our approach is therefore not restricted to a single type of regularization, but accommodates for a wide range of regularization terms, including popular penalties like the L 1 norm; second, we use an optimization scheme based on blockcoordinate descent with proximal point modification. This scheme guarantees that the solution will always satisfy necessary conditions for optimality, ensuring that the results will have a well-defined numerical meaning; third, we developed a computationally efficient procedure to optimize the mixing matrix subject to the constraint that the scale of the solution can be controlled exactly, enabling standard, scale-dependent regularization terms to be used safely. We evaluate our approach on high-dimensional data from gene expression profiling studies, and demonstrate that it is numerically stable, computationally efficient, and identifies biologically relevant features. Together, the improvements described here remedy important limitations of earlier proposals, strengthen the theoretical basis of regularized NMF and facilitate its use in applications.

Regularized Non-negative Matrix Factorization with Guaranteed Convergence and Exact Scale Control
We consider the regularized NMF problem where R(W ) is a regularization term, lw0 determines the impact of the regularization term, and 1 T H~a1 T is an extra equality constraint that enforces additivity to a constant aw0 in the columns H. While we have chosen to regularize W and scale H, it is clear that the roles of the two factors can be interchanged by transposition. We assume that R(W ) is convex and continuously differentiable, but do not make any additional assumptions about R at this stage. Thus, we consider a general formulation of regularized NMF where one factor is regularized, the scale of the solution is controlled exactly, and the choice of regularization term still open. The equality constraint that locks the scale of H is critical. The reason is that common regularization terms are scaledependent. For example, this is the case for R( : )~E : E 1 (L 1 / LASSO regularization), R( : )~E : E 2 F (L 2 /Tikhonov regularization), and R( : )~EC : E 2 F (L 2 regularization with an inner operator C that encodes spatial or temporal relationships between variables). Scale-dependent regularization terms will pull W towards zero, and indirectly inflate the scale of H unboundedly. Locking the scale of the unregularized factor prevents this phenomenon.
To solve Problem 2, we explored an approach based on block coordinate descent (BCD). In general, the BCD method is useful for minimizing a function F when the coordinates can be partitioned into N blocks such that, at each iteration, F can be minimized (at low computational cost) with respect to the coordinates of one block while the coordinates in the other blocks are held fixed. The method can be expressed as the update rule where x i and V i denote the coordinates and domain of the ith block, respectively. The updates are applied to all coordinate blocks in cyclic order. In the case of NMF, there are three natural ways to define blocks: per-column, per-row, or per-matrix. We partition the coordinates of H per column whereas the partitioning of W depends on the anatomies of R and the subproblem solver (details below).
Regarding the convergence of BCD procedures, it can be shown that if the domain for the ith coordinate block, V i , is compact and all subproblems are strictly convex (that is, V i is convex and F is strictly convex over V i ), the sequence generated by a BCD procedure has at least one limit point and each limit point is a critical point of the original function F [15]. In this context, we say that an algorithm has converged if the current point is within a tolerance from a critical point (that is, a point (W ,H) where the derivative of the objective function is non-negative in all feasible directions; the first-order necessary condition for optimality). If F is convex but no longer strictly convex in V i , limit points are still guaranteed to exist but are not necessarily critical points (that is, the solution may not satisfy the first-order criterion for optimality).
In Problem 2, the clamping of the scale bounds H and, indirectly, also W . Hence, all V i 's are bounded. Because they are also closed, they are compact. However, subproblems that are not strictly convex may still occur. To guarantee solutions that represent critical points, we therefore need to safeguard against non-strict convexity in the BCD subproblems. To this end, we add a proximal point term to objective functions of subproblems that are not known to be strictly convex beforehand. A proximal point term penalizes the Euclidean distance to the previous point in V i , makes the subproblems strictly convex, and guarantees that limit points of the generated sequence are critical points of the original function F [16]. The BCD updates change to where d i Ex{x i E 2 is the proximal point term and d i §0 a small number which can be zero if F is known to be strictly convex in V i (in this case the proximal point term is not needed).

Optimizing the Mixing Coefficients
We developed an efficient procedure to optimize each block (column) of the mixing matrix H. The procedure itself is given in Algorithm 1. This section describes the proof.
The constraints H §0 and 1 T H~a1 imply that columns of H must lie in the a-simplex, defined as Geometrically, this is the intersection of the non-negative orthant and a hyperplane with normal vector 1 T and offset a from the origin. The set is convex and also compact, meaning the conditions for a BCD to converge discussed in the previous section are satisfied.
We first derive general optimality criteria for convex functions on D a . Let f : D a ?R be convex and differentiable. By definition, x[D a is a minimum of f , if and only if the directional derivative at x is non-negative in every feasible direction Considering the special cases y~(0, . . . ,0,y i~a ,0, . . . ,0) T , we see that must hold if x is a minimum. However, the converse also holds. Assuming that Equation 4 holds and letting y be an arbitrary point in D a , we have Hence, Equation 3 and Equation 4 are equivalent. Moreover, Equation 4 can be rephrased as This is interesting because the fact that x[D a implies that the reversed inequality also holds The right-hand side of this equation is a weighted average of partial derivatives. Because the weights are non-negative and the smallest partial derivative is included when forming this average, all partial derivatives that correspond to non-zero coordinates of x must equal the smallest partial derivative at x. Taken together, x is a minimum of a convex function f : D a ?R if and only if where P denotes the indices of the non-zero coordinates in x. This somewhat surprising result sets the stage for the development of an efficient way to minimize the columns of H. We next connect Equations 2 and 5 using a rank-one residue approach. Rewriting WH T , we have the subproblem of updating a column h i becomes which is the same as where v denotes the constant vector The key to solving this problem efficiently lies in the observation that h can be solved directly when the indices of the non-zero variables are known. To see this, assume for a while that P is given and let f be the above objective function of Problem 6. Because f is convex, Equation 5 implies that all its partial derivatives with respect to the non-zero variables share a common value, that is h j {v j~C , j[P for some C at the minimum. Summing over j and using the fact that h [ D a , we can solve for C Thus, all that remains is a way to find P. Although this may seem like a problem with a complexity of O(2 n ) at first sight, it turns out that P must correspond to the indices of the #P largest coordinates of v. To see this, assume that h is a minimum and that there exist indices a[P and b6 [P such that v a vv b . Then, the entries h a and h b could be swapped to obtain another feasible vector that would yield a smaller objective function value in Equation 6, contradicting that h is a minimum. Hence, the only remaining question is how many coordinates are non-zero at the minimum. This question can be resolved by computing C and the partial derivatives for different values of #P until Equation 5 is satisfied. This procedure can be implemented as a linear O(n) search (Algorithm 1) and is amenable to speed-ups when used iteratively (Discussion).

Optimizing the Components
Unlike the optimization of H, which is independent of R, the optimization of W depends on the choice of R. We next give W optimization procedures for three common types of regularization: Sparseness regularization. A common way to enforce sparsity is to penalize the L 1 norm, the closest convex L p relaxation of the L 0 penalty (the number of non-zero elements). To optimize W with R(W )~EW E 1 , one possibility is to use the rankone residue approach. Rewriting WH T as a sum of rank-one matrices and considering the Karush-Kuhn-Tucker (KKT) conditions, it is easy to show that the BCD update for the column/block w i is given by where ½ : z denotes truncation of vector elements at zero. Another possibility is to view W as a single block, in which case the minimization can be rewritten as a non-negative least squares problem (this follows directly from the KKT conditions) that can be solved efficiently using for example the Fast Non-Negative Least Squares algorithm (FNNLS) [17]. Tikhonov regularization. We next consider L 2 regularization with R~ECW E 2 F where C is an m6m filter matrix. This type of regularization is used to impose various types of smoothing, for example by using C~I or various difference operators, like C (i,i)~1 , C (i,iz1)~{ 1, and C (i,j)~0 elsewhere. Partitioning the coordinates per column and using a rank-one residue approach, the column-wise BCD updates become Expanding the norm and removing constant terms, we get which is a non-negative least squares problem. To see this, let L T L be the Cholesky decomposition of Eh i E 2 IzlC T C and consider Expanding Equation 8 and removing the constant term, we recover Equation 7. Hence, we can solve Equation 8 which can be done using non-negative least squares algorithms that start from the normal equations and do not require explicit Cholesky decomposition [17][18][19].
Related base vector regularization. In some applications, certain base vectors are known to be closer to each other. For example, this type of regularization may be motivated in the reconstruction of cell type-specific gene expression profiles from gene expression profiles of compound tissues, where the gene expression patterns of related cell types can be expected to be similar. One way to incorporate such information is to penalize the squared distance between base vectors that are known to be related. The objective function becomes where the set N defines pairs of adjacent vectors, encoded as a matrix N where each column defines a pair (i, j) by having elements that are l at position i and j and 0 elsewhere. The objective function can then be written as the minimum of which with respect to W can again be found using FNNLS or other non-negative least squares algorithms.

Computational Efficiency
To illustrate its use, we implemented our method with L 1 norminduced sparseness regularization (Algorithm 2; denoted rNMF), and applied it to sets of gene expression profiles of blood disorders (Table 1). For comparison, we considered two previously published methods [20,21]. These methods are relevant as control methods as they also seek to perform NMF with L 1 regularization and exact scale control. Other sparse NMF methods have been published (Discussion), but solve different formulations and, hence, are less relevant as controls in this context. Out of the two selected control methods, we found the method in [21] to be the most efficient, making it a representative control method. Each data set was analyzed with different numbers of components (k = 5,10, and 15) and regularization parameter values (l selected to yield 25%, 50%, and 75% zeroes in W ; the value needed to achieve a specific degree of sparsity varies between data sets).
Throughout, rNMF was 1.5 to 3.0 times faster per iteration and converged considerably faster (Table 1 and Figure 1a). The method also exhibited robust closing of the KKT conditions, illustrating that the theoretical prediction that solutions represent critical points holds numerically in practice (Figure 1b).

Analysis of Gene Expression Data
To illustrate the use of our approach in a practical situation, we applied rNMF to the Microarray Innovations in LEukemia (MILE) data set [22,23], containing 2096 gene expression profiles of bone marrow samples from patients with a range of blood disorders (Affymetrix Human U133 Plus 2.0 arrays; 54612 genes expression values/probes per sample). We applied rNMF to the MILE data with varying numbers of components (k = 10, 20 and 30) and varying degrees of sparsity (l chosen to yield 50%, 75%, and 90% sparsity in W ). To illustrate the effect of sparsity regularization, we also analyzed the data using conventional NMF (equivalent to setting l~0). Now, it is well known that the bone marrow morphology varies considerably between disorders and between patients, especially in terms of the abundances of various classes of blood cells. It is also known that different classes of blood cells exhibit distinct gene expression patterns [24]. Much of the variation in the data will therefore be caused by fluctuations in cell type abundances and by differences in gene expression between cell types. Because rNMF and NMF are driven by variation, it is reasonable to assess the biological relevance of the results by testing whether the components contain gene expression features belonging to specific classes of blood cells. To this end, we used gene set enrichment testing, a statistical technique that is widely used in genomics to annotate high-dimensional patterns. In essence, statistically significant enrichment for a gene/probe set in a component means that the genes/probes comprising the set have higher coordinate values (at the set level) in a component than would be expected by chance (c.f., ref. [25,26]). We defined sets of marker genes for all major classes of blood cells (Materials and Methods), and tested for enrichment of each of these sets in each component using the program RenderCat [25].
As illustrated in Figure 2a, enrichments of cell type markers were identified in all rNMF components except the weakest ones. Enrichments of markers for almost all major cell types in the bone marrow were detected in at least one component. In some components, enrichments of markers belonging to multiple cell types were detected. In these cases, the detected cell types belonged to the same developmental lineages (and hence have similar gene expression patterns). For example, this can be seen in Figure 2a where w 0 , w 1 , and w 2 are enriched for features from multiple myeloid cell types and w 10 and w 14 enriched for features from multiple lymphoid cell types. Together, the results support that the components are biologically relevant.
Conventional NMF also generated components with enrichments of cell type-specific markers. Interestingly, however, we observed differences as to which components did and did not capture cell type-specific features. As shown in Figure 2a, strong components generated by rNMF could usually be annotated and components that could not be annotated were usually the weakest ones. With conventional NMF, this pattern was generally not seen. Instead, as shown in Figure 2b, strong components could often not be annotated, suggesting that conventional NMF did not enrich for cell type-specific features. A likely explanation could be that there are relatively few cell type-specific markers compared to the number of genes in the genome, and that limiting the cardinality of components by including L 1 regularization promotes the identification of small sets instead of broader features that are less specific.

Discussion
Non-negative matrix factorization has been previously suggested as a valuable tool for analysis of various types of genomic data, particularly gene expression data [27][28][29][30][31]. The rationale is that gene expression is an inherently non-negative quantity. In this case, NMF allows the data to be expressed in their natural scale, thereby avoiding re-normalization by row-centering as is needed by dimension-reduction techniques based on correlation matrices (e.g., principal component analysis).
We developed methods that enable robust and efficient solution of a range of regularized NMF problems and tested these methods in the context of gene expression data analysis. The key component of our approach is an efficient procedure to optimize the mixing coefficients H over the a-simplex, enabling the scale of the solution to be explicitly controlled. Further, our approach separates the task of optimizing H and optimizing W . This has three advantages. First, the optimization of H becomes independent of the regularization term, meaning the same algorithm (Algorithm 1) can always be used. Second, as exemplified by the L 1 regularization case, the optimization of W is simplified, at least with standard regularization terms. Third, a proximal point term can be included, guaranteeing convergence towards critical points, ensuring that the results will always have welldefined numerical meaning [16]. Experimentally, we have illustrated that our method is computationally efficient and capable of enhancing the identification of biologically relevant features from gene expression data by incorporating prior knowledge.
Previous work on regularized NMF is limited compared with previous work on conventional NMF. A straightforward formu-lation is where are the functions R 1 and R 2 enforce the regularization constraints, and the parameters l 1 ,l 2 w0 control the impact of the regularization terms [13]. This formulation allows regularization of both factors and basic computation methods can be derived for some choices of R 1 and R 2 by extending conventional NMF methods [32,33]. However, balancing l 1 and l 2 against each other is often difficult and simultaneous regularization of both factors is rarely wanted. More commonly, the goal is to regularize one of the factors. For example, to get sparse component vectors, an L 1 penalty can be imposed on W whereas H does not have to be regularized. In Equation 9, single-factor regularization would correspond to setting l 1 or l 2 to zero. Again, with standard scaledependent regularization terms, this will pull the regularized factor towards zero and inflate the unregularized factor unboundedly. Scale-independent penalty terms have been proposed [34], but these are non-convex and therefore complicate optimization with respect to the regularized factor. One could also attempt to control  . However, this again requires balancing of l 1 against l 2 which is difficult, and, moreover, the scale can only be controlled approximately. Another ad hoc approach could be to compensate for the pull of the regularization term by standardizing the column norms of W or H between iterations. This is equivalent to inserting a diagonal matrix D and its inverse between the factors. This operation is safe in conventional NMF because the value of the objective function will not change. With a regularization term, however, column standardization is unsafe: although the value of the fitting term EA{WH T E 2 F will not change, the value of the regularization term may, meaning the objective function may increase between iterations. To control the scale exactly, [20] proposed a truncated gradient descent method and [21] a multiplicative update method, and studied regularization with respect to sparsity. These methods represent the closest predecessors of our approach and were therefore used as control methods.
When it comes to the convergence, the strongest proved result for conventional NMF is guaranteed convergence to critical points. Some conventional NMF methods always find critical points, for example alternating non-negative least squares. By contrast, regularized NMF methods are less well characterized. To our knowledge, the only regularized NMF method that is known to guarantee critical point solutions is an alternating non-negative least squares method that solves Problem 9 when R 1 is the squared L 1 norm and R 2 is the L 2 -norm [32]. Methods based on Lee-Seung's multiplicative descent method do not guarantee critical points [13], nor do current exact-scale methods [20,21].
In conclusion, we have presented a new framework for regularized NMF. Our approach has advantages in that it accommodates for a wide range of regularization terms, guarantees solutions that satisfy necessary conditions for optimality, allows the scale of the solution to be controlled exactly, is computationally efficient, and enables decomposition of gene expression data subject to knowledge priors. Hopefully, this study, along with other efforts, will further the development of methods to analyze complex high-dimensional data.

Materials and Methods
Microarray data sets generated on Affymetrix microarrays were retrieved from NCBI Gene Expression Omnibus (http://www. ncbi.nlm.nih.gov/gds; accession numbers GSE1159, GSE6891, GSE12417, GSE13159, GSE19784, and GSE28497). Because NMF assumes an additive model, non-log transformed gene expression values were used throughout the experiments. Sets of cell type-specific markers were inferred by use of the d-map compendium containing gene expression profiles of all major classes of blood cells sorted by flow cytometry (Affymetrix U133A arrays) [24]. One set per cell type was inferred by comparing dmap profiles belonging to this cell type to all others using Smyth's moderated t-test [35], selecting the top 100 probes as markers (results agreeing with those shown were obtained using the top 50, 150, 200 and 250 probes). Gene set enrichment testing was performed using the program RenderCat [25].

Algorithm 1
Pseudocode to optimize a column h i in H, given R i , w i , the current h i , and the proximal point parameter d. Note that in the if clause, the first condition j~~n asserts that the program never tries to reach v nz1 whereas the second asserts that C is the minimal value of the partial derivatives. Because v is sorted in descending order, and Lf =Lh j equals C if j[P and {v j otherwise, it is sufficent to compare C with {v jz1 .
v~(R T i w i zdh i ) = (Ew i E 2 2 zd) ½v,J~sort(v, 0 descend 0 ) a~1:0 for j~1 to n do a~a{v i C~a=j if j~~n or Cƒ{v(jz1) then break end if end for h(jz1 : n)~0 h(J)~h return h

Algorithm 2
Pseudocode for the complete rNMF procedure with R(W ) jjW jj 1 . To change the type of regularization, change the w i update. Note that the rank-one residual R is updated cumulatively to save computations. Blue cells indicate significant enrichment of cell type-specific markers (as detected by gene set enrichment testing; pv10 {5 ) in the component generated by rNMF with 90% sparsity (a) and conventional NMF (b). The components have been ordered by strength (defined as L 2 norm of w k h T k ) with w 0 denoting the strongest component. As discussed in detail in Results, strong components generated by rNMF capture cell type-related gene expression features more clearly than conventional NMF. doi:10.1371/journal.pone.0046331.g002