Fast optimization of non-negative matrix tri-factorization

Non-negative matrix tri-factorization (NMTF) is a popular technique for learning low-dimensional feature representations of relational data. Currently, NMTF learns a representation of a dataset through an optimization procedure that typically uses multiplicative update rules. This procedure has had limited success, and its failure cases have not been well understood. Here we perform an empirical study on six large datasets, comparing multiplicative update rules with three alternative optimization methods: alternating least squares, projected gradients, and coordinate descent. We find that methods based on projected gradients and coordinate descent converge up to twenty-four times faster than multiplicative update rules. Furthermore, the alternating least squares method can quickly train NMTF models on sparse datasets but often fails on dense datasets. Coordinate descent-based NMTF converges up to sixteen times faster than well-established methods.


A Derivation of Multiplicative Update Rules
Multiplicative update rules for non-negative matrix tri-factorization (NMTF) and an orthogonal variant were introduced by Long et al. (2005) and Ding et al. (2006). We focus on the non-orthogonal variant (Long et al., 2005) to make the approach comparable with alternative optimization techniques without added constraints. Both methods and all derivations in the following sections are based on the squared Frobenius norm objective function

F = ||X − U S V^T||²_Fro,   (1)

where X is the input data and U, S and V are the latent factors. The squared Frobenius norm can be written as ||A||² = Tr(A^T A); therefore the objective equals

F = Tr(X^T X − 2 X^T U S V^T + V S^T U^T U S V^T).

Its partial derivatives with respect to each factor U, V, S are

∂F/∂U = −2 X V S^T + 2 U S V^T V S^T,   (3)
∂F/∂V = −2 X^T U S + 2 V S^T U^T U S,   (4)
∂F/∂S = −2 U^T X V + 2 U^T U S V^T V.   (5)

Using the Karush-Kuhn-Tucker (KKT) complementarity conditions, we can find a stationary point. The KKT condition for factor U is as follows:

(∂F/∂U)_ij U_ij = 0.

This condition holds if ∂F/∂U has reached a stationary point, which yields the update rule

U_ij ← U_ij (X V S^T)_ij / (U S V^T V S^T)_ij.

Similarly, we take the KKT complementarity condition for V and derive the second update rule:

V_ij ← V_ij (X^T U S)_ij / (V S^T U^T U S)_ij.

The third update rule is computed with respect to factor S:

S_ij ← S_ij (U^T X V)_ij / (U^T U S V^T V)_ij.

These rules can be expressed in matrix form, where ∘ denotes the Hadamard product and ⊘ denotes Hadamard division:

U ← U ∘ (X V S^T) ⊘ (U S V^T V S^T),
V ← V ∘ (X^T U S) ⊘ (V S^T U^T U S),
S ← S ∘ (U^T X V) ⊘ (U^T U S V^T V).   (11)
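The matrix-form rules translate directly into a few lines of linear algebra. The following is a minimal NumPy sketch of one MUR sweep; the variable names, problem sizes and the small ε guard against division by zero are our own choices, not part of the original formulation:

```python
import numpy as np

def mur_step(X, U, S, V, eps=1e-12):
    """One sweep of multiplicative updates for min ||X - U S V^T||_F^2."""
    U = U * (X @ V @ S.T) / (U @ S @ V.T @ V @ S.T + eps)
    V = V * (X.T @ U @ S) / (V @ S.T @ U.T @ U @ S + eps)
    S = S * (U.T @ X @ V) / (U.T @ U @ S @ V.T @ V + eps)
    return U, S, V

rng = np.random.default_rng(0)
X = rng.random((30, 20))                      # non-negative input data
U, S, V = rng.random((30, 4)), rng.random((4, 3)), rng.random((20, 3))
losses = []
for _ in range(100):
    U, S, V = mur_step(X, U, S, V)
    losses.append(np.linalg.norm(X - U @ S @ V.T) ** 2)
```

Because each factor is multiplied by a ratio of non-negative matrices, non-negativity is preserved automatically and no explicit projection is needed.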

B Derivation of Alternating Least Squares
Let us calculate the gradient of the objective in Eq. (1) with respect to each factor matrix U, V and S. We follow the derivations in Eqs. (3), (4) and (5) and simplify the equations. We equate the gradient for each factor to zero, starting with ∂F/∂U = 0, which gives

X V S^T = U S V^T V S^T.

After this step, we calculate the inverse to obtain the update rule for U:

U ← X V S^T (S V^T V S^T)^(−1).

This step can introduce negative values, so we must enforce non-negativity by setting all negative elements to zero:

U ← [X V S^T (S V^T V S^T)^(−1)]_+.

Following the derivative for V from Eq. (4), we can similarly formulate the update rule for V. We equate the gradient to zero, ∂F/∂V = 0, calculate the inverse and enforce non-negativity:

V ← [X^T U S (S^T U^T U S)^(−1)]_+.

The derivative for S is shown in Eq. (5). We set the gradient ∂F/∂S to zero and enforce the non-negativity constraint:

S ← [(U^T U)^(−1) U^T X V (V^T V)^(−1)]_+.
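These closed-form solves can be transcribed directly into NumPy. In this sketch we use the pseudo-inverse for numerical safety (our choice; the derivation above assumes the Gram matrices are invertible):

```python
import numpy as np

def als_step(X, U, S, V):
    """One ALS sweep: solve each least-squares subproblem exactly,
    then clip negative entries to zero."""
    pinv = np.linalg.pinv
    U = np.maximum(X @ V @ S.T @ pinv(S @ V.T @ V @ S.T), 0)
    V = np.maximum(X.T @ U @ S @ pinv(S.T @ U.T @ U @ S), 0)
    S = np.maximum(pinv(U.T @ U) @ U.T @ X @ V @ pinv(V.T @ V), 0)
    return U, S, V

rng = np.random.default_rng(0)
X = rng.random((40, 4)) @ rng.random((4, 25))   # low-rank non-negative data
U, S, V = rng.random((40, 4)), rng.random((4, 4)), rng.random((25, 4))
loss0 = np.linalg.norm(X - U @ S @ V.T) ** 2
losses = []
for _ in range(30):
    U, S, V = als_step(X, U, S, V)
    losses.append(np.linalg.norm(X - U @ S @ V.T) ** 2)
```

Note that clipping after an exact solve means the objective is not guaranteed to decrease monotonically, which is consistent with the failures on dense datasets reported in the main text.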

C Derivation of Projected Gradients
Projected gradient methods are based on the general gradient descent scheme, where a variable x takes a step in the descent direction P, scaled by the learning rate η:

x^(t+1) = x^t − η P.

We derive a projected gradient algorithm for NMTF under the squared Frobenius norm objective (1). We add a projection onto the non-negative values, [·]_+, to enforce non-negativity of the factors in case a step crosses into negative values. The projected gradient updates take the form

U ← [U − η_u P_U]_+,   V ← [V − η_v P_V]_+,   S ← [S − η_s P_S]_+,   (18)

where P_U, P_V and P_S are descent directions and η_u, η_v and η_s are learning rates. The descent directions are the partial derivatives from Eqs. (3), (4) and (5), scaled by factors D_U, D_V and D_S:

P_U = (∂F/∂U) / D_U,   P_V = (∂F/∂V) / D_V,   P_S = (∂F/∂S) / D_S.   (19)

The scaling factors were chosen based on a study comparing the convergence of four different scaling factors (Merritt, 2005); the choice is also inspired by an existing projected gradient algorithm for non-negative matrix factorization (Lin, 2007). The learning rate, or step size, is chosen dynamically, using the same form as a related study of classical non-negative matrix factorization (Lin, 2007). Inserting the scaling factor D_U and the partial derivative (3) into Eq. (19) gives the descent direction P_U; the directions P_V and P_S follow analogously by inserting D_V with the derivative in Eq. (4) and D_S with the derivative in Eq. (5), with the learning rates η_v and η_s defined in the same way as η_u.

We insert the descent directions and learning rates into Eq. (18). This yields the update rules for factor U; the procedures for updating factors V and S follow in the same way.
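A minimal sketch of the projected gradient scheme. For simplicity it replaces the paper's dynamically chosen step size (Lin, 2007) with plain backtracking: each factor's step is halved until the projected step no longer increases the objective. The helper names and constants are our own:

```python
import numpy as np

def loss(X, U, S, V):
    return np.linalg.norm(X - U @ S @ V.T) ** 2

def projected_step(f, W, G, eta=1.0, beta=0.5, tries=25):
    """Backtrack: shrink eta until [W - eta*G]_+ does not increase f."""
    f0 = f(W)
    for _ in range(tries):
        Wn = np.maximum(W - eta * G, 0)     # projection [.]_+ onto >= 0
        if f(Wn) <= f0:
            return Wn
        eta *= beta
    return W                                 # no acceptable step found

def pg_sweep(X, U, S, V):
    """One projected-gradient sweep using the gradients from Eqs. (3)-(5)."""
    gU = -2 * X @ V @ S.T + 2 * U @ S @ (V.T @ V) @ S.T
    U = projected_step(lambda W: loss(X, W, S, V), U, gU)
    gV = -2 * X.T @ U @ S + 2 * V @ S.T @ (U.T @ U) @ S
    V = projected_step(lambda W: loss(X, U, S, W), V, gV)
    gS = -2 * U.T @ X @ V + 2 * (U.T @ U) @ S @ (V.T @ V)
    S = projected_step(lambda W: loss(X, U, W, V), S, gS)
    return U, S, V

rng = np.random.default_rng(0)
X = rng.random((30, 20))
U, S, V = rng.random((30, 4)), rng.random((4, 3)), rng.random((20, 3))
losses = []
for _ in range(40):
    U, S, V = pg_sweep(X, U, S, V)
    losses.append(loss(X, U, S, V))
```

Since a step is only accepted when it does not increase the objective, this simplified variant is monotone by construction, at the cost of extra objective evaluations during backtracking.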

D Derivation of Coordinate Descent
First, we define the following residual matrices:

X^(i) = X − Σ_{l≠i} u_l (S V^T)_l·,   (26)
X^(j) = X − Σ_{l≠j} (U S)_·l v_l^T,   (27)
X^(i,j) = X − Σ_{(p,q)≠(i,j)} s_pq u_p v_q^T.   (28)

The idea is to iterate over the following set of subproblems for i ∈ {1, . . . , k_1} and j ∈ {1, . . . , k_2}:

F_U^(i)(u_i) = ||X^(i) − u_i (S V^T)_i·||²,
F_V^(j)(v_j) = ||X^(j) − (U S)_·j v_j^T||²,
F_S^(i,j)(s_ij) = ||X^(i,j) − s_ij u_i v_j^T||².

We use ||A||² = Tr(A^T A) and (AB)_·i = A b_·i, where b_·i is the i-th column of B. First, we derive the update rule for the F_U^(i)(u_i) subproblem. We equate the derivative ∂F/∂u_i to zero:

u_i = X^(i) (S V^T)_i·^T / ((S V^T)_i· (S V^T)_i·^T).

Next, we derive the update for the F_V^(j)(v_j) subproblem.

We equate the derivative ∂F/∂v_j to zero:

v_j = X^(j)T (U S)_·j / ((U S)_·j^T (U S)_·j).

Finally, we find the stationary point of the F_S^(i,j)(s_ij) subproblem. We equate the derivative ∂F/∂s_ij to zero:

s_ij = u_i^T X^(i,j) v_j / (u_i^T u_i · v_j^T v_j).

For an efficient implementation, we replace the residual matrices X^(i), X^(j) and X^(i,j) with their definitions in equations (26), (27) and (28). We also enforce non-negativity by projecting onto non-negative values. The resulting update rules are

u_i ← [u_i + (X − U S V^T)(S V^T)_i·^T / (S V^T V S^T)_ii]_+,   (36)
v_j ← [v_j + (X − U S V^T)^T (U S)_·j / (S^T U^T U S)_jj]_+,
s_ij ← [s_ij + u_i^T (X − U S V^T) v_j / ((U^T U)_ii (V^T V)_jj)]_+.

These rules appear in additive form, where the step size η_u is implicitly defined within the update rules. From equations (36) and (3) we can observe the step size

η_u = 1 / (2 (S V^T V S^T)_ii).

Similarly, the step sizes for v_j and s_ij are

η_v = 1 / (2 (S^T U^T U S)_jj),   η_s = 1 / (2 (U^T U)_ii (V^T V)_jj).

We observed that such a step size effectively prevents the values from crossing into the negative space. The loss of information associated with the projection onto non-negative values is reduced, which keeps convergence stable throughout the factorization procedure.
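The additive rules are straightforward to implement. A NumPy sketch of one full coordinate sweep, recomputing the residual before each block update for clarity rather than speed (names and sizes are our own):

```python
import numpy as np

def cod_sweep(X, U, S, V, eps=1e-12):
    """One coordinate-descent sweep: columns of U, columns of V, entries of S."""
    SVt = S @ V.T
    for i in range(U.shape[1]):
        r = SVt[i]                              # row (S V^T)_i.
        R = X - U @ SVt                         # residual X - U S V^T
        U[:, i] = np.maximum(U[:, i] + R @ r / (r @ r + eps), 0)
    US = U @ S
    for j in range(V.shape[1]):
        c = US[:, j]                            # column (U S)_.j
        R = X - US @ V.T
        V[:, j] = np.maximum(V[:, j] + R.T @ c / (c @ c + eps), 0)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            R = X - U @ S @ V.T
            denom = (U[:, i] @ U[:, i]) * (V[:, j] @ V[:, j]) + eps
            S[i, j] = max(S[i, j] + U[:, i] @ R @ V[:, j] / denom, 0.0)
    return U, S, V

rng = np.random.default_rng(0)
X = rng.random((30, 20))
U, S, V = rng.random((30, 4)), rng.random((4, 3)), rng.random((20, 3))
losses = []
for _ in range(20):
    U, S, V = cod_sweep(X, U, S, V)
    losses.append(np.linalg.norm(X - U @ S @ V.T) ** 2)
```

Each block update minimizes a quadratic that is separable across the coordinates being updated, so clipping the exact minimizer to zero still gives the constrained minimum and the objective never increases.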

E Derivation of Quasi-Newton update rules
In this section we show the equivalence of the quasi-Newton update rules and ALS. The quasi-Newton update rules are formulated as follows, where H_U, H_V and H_S are the Hessians with respect to each factor:

vec(U) ← [vec(U) − H_U^(−1) vec(∂F/∂U)]_+,   (43)
vec(V) ← [vec(V) − H_V^(−1) vec(∂F/∂V)]_+,   (44)
vec(S) ← [vec(S) − H_S^(−1) vec(∂F/∂S)]_+.   (45)

Let us calculate the gradient and the Hessian of the objective in Eq. (1) with respect to U:

∂F/∂U = −2 X V S^T + 2 U S V^T V S^T,
H_U = 2 (S V^T V S^T ⊗ I).

The Kronecker product is denoted by ⊗, the vec-operator by vec, and I is the identity matrix. By placing the gradient and the Hessian into equation (43), we get

vec(U) ← vec(U) − (1/2) (S V^T V S^T ⊗ I)^(−1) vec(−2 X V S^T + 2 U S V^T V S^T),

which, by converting from the vec-operator to matrix form using the rule vec(ABC) = (C^T ⊗ A) vec(B), simplifies to

U ← X V S^T (S V^T V S^T)^(−1).   (49)

Further, we derive the gradient and the Hessian matrix of the factor V:

∂F/∂V = −2 X^T U S + 2 V S^T U^T U S,
H_V = 2 (S^T U^T U S ⊗ I).

We place the gradient and Hessian into equation (44) and convert to matrix form:

V ← X^T U S (S^T U^T U S)^(−1).   (52)

The gradient and the Hessian of the S matrix are

∂F/∂S = −2 U^T X V + 2 U^T U S V^T V,
H_S = 2 (V^T V ⊗ U^T U).

We place the gradient and the Hessian into equation (45) and again use the rule vec(ABC) = (C^T ⊗ A) vec(B) to convert from vector to matrix form:

S ← (U^T U)^(−1) U^T X V (V^T V)^(−1).   (56)

Simplifying equations (49), (52) and (56) with the distributive rule (A + B)C = AC + BC shows that these rules are equivalent to the ALS update rules. Note that a similar equivalence was already shown for NMF by Cichocki.¹
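The equivalence for the U factor can also be checked numerically: one exact Newton step from an arbitrary U lands directly on the ALS solution. A small sketch, where B is our shorthand for S V^T V S^T:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((15, 12))
U = rng.random((15, 3))
S = rng.random((3, 4))
V = rng.random((12, 4))

B = S @ V.T @ V @ S.T                      # shorthand for S V^T V S^T
G = -2 * X @ V @ S.T + 2 * U @ B           # gradient of F w.r.t. U, Eq. (3)
# The Hessian 2 (B kron I) acts as right-multiplication by 2B, so applying
# its inverse to vec(G) corresponds to G @ inv(B) / 2 in matrix form.
U_newton = U - 0.5 * G @ np.linalg.inv(B)
U_als = X @ V @ S.T @ np.linalg.inv(B)     # ALS closed form from Appendix B
```

Because the objective is quadratic in U, the Newton step is exact and the two expressions agree up to floating-point error.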

F Reconstruction on testing data
In this section we study the reconstruction error of different optimization approaches on held-out data. We split the data X row-wise into two parts: X_train is composed of the first 80 percent and X_new of the remaining 20 percent of the data. Then, we run the factorization on X_train, resulting in U_train, S and V. We transform the remaining data X_new into the latent space defined by S and V, such that the following objective function is minimized:

min_{U_new ≥ 0} ||X_new − U_new S V^T||²_Fro.   (58)

We keep the S and V factors from the training step and initialize U_new with random values. We iteratively apply updates on U_new, while S and V remain unchanged. Figure A shows the objective value (Eq. 58) of the four optimization techniques at factorization rank k_1 = k_2 = 20. Each experiment is repeated ten times until the convergence criterion with parameter ε = 6 is reached. The highlighted area shows the span from lowest to highest objective value across the ten repeated runs. MUR, multiplicative update rules; ALS, alternating least squares; PG, projected gradients; COD, coordinate descent.
Convergence using a pre-trained model is orders of magnitude faster than training from scratch (Fig. 2 in the main text). Consistent with the optimization on training data, we observe that alternating least squares and coordinate descent are the fastest, projected gradients is slower, and multiplicative updates is the slowest. The ALS approach did not converge on the Coil20 training data and was therefore not used on the testing data. Figure B shows the lowest objective value from multiple runs (solid lines). We compare the objective function at convergence for factorization ranks k ∈ {10, 20, . . . , 100}. For the majority of datasets (Coil20, STRING, Mutations, Newsgroups), all methods converge to a similar solution. Coordinate descent and alternating least squares appear more sensitive to random initialization on the AlphaDigit and MovieLens datasets, and the difference between the best and worst solution increases with the factorization rank. In such cases, coordinate descent needs to be repeated a few times in order to ensure results comparable to projected gradients and multiplicative update rules. Alternatively, a different initialization technique² may overcome this drawback.
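Folding new rows into a trained model only requires iterating a single factor's update. The sketch below uses multiplicative updates on U_new (the function name, iteration budget and sizes are our own; any of the four optimizers could be used in the same role):

```python
import numpy as np

def transform(X_new, S, V, iters=200, eps=1e-12, seed=0):
    """Minimize ||X_new - U_new S V^T||_F^2 over U_new >= 0, keeping the
    trained S and V fixed (multiplicative updates on U_new only)."""
    rng = np.random.default_rng(seed)
    U_new = rng.random((X_new.shape[0], S.shape[0]))
    B = S @ V.T @ V @ S.T              # constant while S, V are fixed
    XVS = X_new @ V @ S.T              # also constant
    for _ in range(iters):
        U_new = U_new * XVS / (U_new @ B + eps)
    return U_new

rng = np.random.default_rng(1)
S_tr, V_tr = rng.random((4, 3)), rng.random((20, 3))
X_new = rng.random((10, 4)) @ S_tr @ V_tr.T     # rows consistent with the model
U_new = transform(X_new, S_tr, V_tr)
err = np.linalg.norm(X_new - U_new @ S_tr @ V_tr.T) ** 2
```

With S and V fixed, the problem in U_new is convex, which is one reason the fold-in converges much faster than training from scratch.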

G Stochastic mini-batch updates
We have implemented a stochastic mini-batch approach for NMTF. In each iteration we randomly permute the rows of the dataset into b batches such that each row is included in exactly one batch. We iterate over all batches and update the model using a subset of the data, X_i, and the corresponding subset of the U factor, U_i, where i ∈ {1, 2, . . . , b}. We split the data into b = 20 batches. The objective function is evaluated over the entire dataset.
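One epoch of this scheme can be sketched as follows, here built on multiplicative updates (our choice of base update; the batch partition is re-drawn each epoch as described above):

```python
import numpy as np

def minibatch_epoch(X, U, S, V, b=20, eps=1e-12, rng=None):
    """One epoch: permute rows into b batches; each batch updates its own
    rows of U, then S and V, from that batch alone."""
    if rng is None:
        rng = np.random.default_rng()
    for batch in np.array_split(rng.permutation(X.shape[0]), b):
        Xb, Ub = X[batch], U[batch]
        Ub = Ub * (Xb @ V @ S.T) / (Ub @ S @ V.T @ V @ S.T + eps)
        U[batch] = Ub
        S = S * (Ub.T @ Xb @ V) / (Ub.T @ Ub @ S @ V.T @ V + eps)
        V = V * (Xb.T @ Ub @ S) / (V @ S.T @ Ub.T @ Ub @ S + eps)
    return U, S, V

rng = np.random.default_rng(0)
X = rng.random((60, 20))
U, S, V = rng.random((60, 4)), rng.random((4, 3)), rng.random((20, 3))
for _ in range(5):
    U, S, V = minibatch_epoch(X, U, S, V, b=6, rng=rng)
full_loss = np.linalg.norm(X - U @ S @ V.T) ** 2   # evaluated on the full dataset
```

Note that each per-batch update of the shared factors S and V sees only a fraction of the rows, so the shared factors are pulled toward each batch in turn, which is consistent with the oscillations visible in Figure C.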
We show the convergence of the mini-batch versions together with their non-batch counterparts in Figure C. Each experiment was run for a maximum of 1000 iterations. The mini-batch versions of ALS (dashed yellow line) and COD (dashed red line) do not converge. The projected gradient (dashed green line) and MUR (dashed blue line) mini-batch approaches show faster convergence at the beginning of some experiments (AlphaDigit, Coil20, MovieLens). However, the final solution is in all cases worse than that of the non-batch counterpart, and large oscillations (larger than ε = 0.01) make it difficult to define a stopping criterion.