S3CMTF: Fast, accurate, and scalable method for incomplete coupled matrix-tensor factorization

  • Dongjin Choi,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, United States of America

  • Jun-Gi Jang,

    Roles Validation, Writing – original draft

    Affiliation Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea

  • U Kang

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing – review & editing

    Affiliation Department of Computer Science and Engineering, Seoul National University, Seoul, Republic of Korea


Abstract

How can we extract hidden relations from tensor and matrix data simultaneously in a fast, accurate, and scalable way? Coupled matrix-tensor factorization (CMTF) is an important tool for this purpose. Designing an accurate and efficient CMTF method has become more crucial as the size and dimension of real-world data are growing explosively. However, existing methods for CMTF suffer from lack of accuracy, slow running time, and limited scalability. In this paper, we propose S3CMTF, a fast, accurate, and scalable CMTF method. In contrast to previous methods which do not handle large sparse tensors and are not parallelizable, S3CMTF provides parallel sparse CMTF by carefully deriving gradient update rules. S3CMTF asynchronously updates partial gradients without expensive locking. We show, theoretically and empirically, that our method is guaranteed to converge to a quality solution. S3CMTF further boosts the performance by carefully storing intermediate computations and reusing them. We theoretically and empirically show that S3CMTF is the fastest, outperforming existing methods. Experimental results show that S3CMTF is up to 930× faster than existing methods while providing the best accuracy. S3CMTF shows linear scalability on the number of data entries and the number of cores. In addition, we apply S3CMTF to Yelp rating tensor data coupled with 3 additional matrices to discover interesting patterns.


Introduction

Given tensor data and related matrix data, how can we analyze them efficiently? Tensors (i.e., multi-dimensional arrays) and matrices are natural representations for various real-world high-order data [1, 2, 3]. For instance, the online review site Yelp provides rich information about users (name, friends, reviews, etc.) and businesses (name, city, Wi-Fi, etc.). One popular representation of such data includes a 3-way rating tensor with (user ID, business ID, time) triplets and an additional friendship matrix with (user ID, user ID) pairs. Coupled matrix-tensor factorization (CMTF) is an effective tool for joint analysis of coupled matrices and a tensor. The main purpose of CMTF is to integrate matrix factorization [4] and tensor factorization [5] to efficiently extract the factor matrices of each mode. The extracted factors have many useful applications such as latent semantic analysis [6, 7, 8], recommendation systems [9, 10], network traffic analysis [11], and completion of missing values [12, 13, 14].

However, existing CMTF methods do not provide good performance in terms of time, accuracy, and scalability. CMTF-Tucker-ALS [15], a method based on Tucker decomposition [16], is applicable only to dense data and is not parallelizable. For sparse real-world data, it treats empty entries as zeros and outputs highly skewed results, which leads to high reconstruction error. Moreover, CMTF-Tucker-ALS does not scale to large data because of the high memory requirement caused by the M-bottleneck problem [17]. CMTF-OPT [12] is a CMTF method based on CANDECOMP/PARAFAC (CP) decomposition [18]. SDF [19] provides quasi-Newton and nonlinear least squares optimization techniques for general coupled factorization problems where factors may have certain structures such as Toeplitz, orthogonal, and nonnegative. CMTF-Tucker-ALS and CMTF-OPT suffer from high reconstruction error since the former is not applicable to sparse data, and the latter focuses only on the CP model and thus cannot be generalized to the Tucker model. Furthermore, both methods are sequential and cannot easily benefit from multi-core parallelization.

In this paper, we propose S3CMTF (Sparse, lock-free SGD based, and Scalable CMTF), a CMTF method which resolves the problems of previous methods. S3CMTF provides parallel, sparse CMTF based on Tucker factorization unlike previous methods which do not support sparse tensors or cannot be parallelized. We also show that asynchronously parallel stochastic gradient descent (SGD) is useful for S3CMTF in multi-core shared memory systems without expensive locking. S3CMTF further boosts the performance by storing intermediate computation and reusing them. Table 1 shows the comparison of S3CMTF and other existing methods. The main contributions of our study are as follows:

  • Algorithm: We propose S3CMTF, a coupled tensor-matrix factorization algorithm for matrix-tensor joint datasets. S3CMTF is designed to efficiently extract factors from the joint datasets by taking advantage of sparsity, exploiting intermediate data. We propose a method which resolves conflicts of parallelization and leads to a solution with guaranteed convergence.
  • Performance: S3CMTF shows the best performance on accuracy, speed, and scalability. S3CMTF runs up to 930× faster and is more scalable than existing methods as shown in Fig 1A. For real-world datasets, S3CMTF converges faster to the better optimum as shown in Fig 1B.
  • Discovery: Applying S3CMTF on Yelp review dataset with a 3-mode tensor (user, business, time) coupled with 3 additional matrices ((user, user), (business, category), and (business, city)), we observe interesting patterns and clusters of businesses and suggest a process for personal recommendation.

Table 1. Comparison of our proposed S3CMTF and the existing CMTF methods.

S3CMTF outperforms all other methods in terms of time, accuracy, scalability, memory usage, and parallelizability.

Fig 1. Comparison of our proposed S3CMTF and the existing methods.

(a) For a fixed number of nonzeros, S3CMTF takes constant time as dimensionality grows, while existing methods become slower. Our sequential method S3CMTF-opt1 is 930× and 54× faster than CMTF-OPT and CMTF-Tucker-ALS, respectively. (b) S3CMTF-opt20 shows the best convergence rate and accuracy on the real-world Yelp dataset. CMTF-Tucker-ALS shows O.O.M. in both experiments. (O.O.M.: out of memory error).

Preliminaries and related works

In this section, we describe preliminaries for tensor and coupled matrix-tensor factorization. We list all symbols used in this paper in Table 2.


A tensor is a multi-dimensional array. Each ‘dimension’ of a tensor is called a mode or way. The length of each mode is called its ‘dimensionality’ and is denoted by $I_1, \cdots, I_N$. In this paper, an N-mode or N-way tensor is denoted by a boldface Euler script capital (e.g. $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$), and matrices are denoted by boldface capitals (e.g. $\mathbf{A}$). $x_\alpha$ and $a_\beta$ denote the entries of $\mathcal{X}$ and $\mathbf{A}$ with indices $\alpha$ and $\beta$, respectively.

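We describe tensor operations used in this paper. A mode-n fiber is a vector which has fixed indices except for the n-th index in a tensor. The mode-n matrix product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ with a matrix $\mathbf{A} \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{X} \times_n \mathbf{A}$ and has the size of $I_1 \times \cdots I_{n-1} \times J \times I_{n+1} \cdots \times I_N$. It is defined as:

$$(\mathcal{X} \times_n \mathbf{A})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N}\, a_{j i_n} \tag{1}$$

where $a_{j i_n}$ is the $(j, i_n)$-th entry of $\mathbf{A}$. For brevity, we use the following shorthand notation for multiplication on every mode as in [20]:

$$\mathcal{X} \times \{\mathbf{A}\} := \mathcal{X} \times_1 \mathbf{A}^{(1)} \times_2 \mathbf{A}^{(2)} \cdots \times_N \mathbf{A}^{(N)} \tag{2}$$

where $\{\mathbf{A}\}$ denotes the ordered set $\{\mathbf{A}^{(1)}, \mathbf{A}^{(2)}, \cdots, \mathbf{A}^{(N)}\}$.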

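We use the following notation for multiplication on every mode except the n-th mode:

$$\mathcal{X} \times_{-n} \{\mathbf{A}\} := \mathcal{X} \times_1 \mathbf{A}^{(1)} \cdots \times_{n-1} \mathbf{A}^{(n-1)} \times_{n+1} \mathbf{A}^{(n+1)} \cdots \times_N \mathbf{A}^{(N)}$$

We examine the case that an ordered set of row vectors $\{\mathbf{a}^{(1)}, \mathbf{a}^{(2)}, \cdots, \mathbf{a}^{(N)}\}$, denoted by $\{\mathbf{a}\}$, is multiplied to a tensor $\mathcal{X}$. First, consider the multiplication for every corresponding mode. By Eq (1),

$$\mathcal{X} \times \{\mathbf{a}\} = \sum_{\alpha} x_\alpha \prod_{m=1}^{N} a^{(m)}_{i_m}$$

where $a^{(m)}_k$ denotes the k-th element of $\mathbf{a}^{(m)}$. Then, consider the multiplication for every mode except the n-th mode. Such multiplication results in a vector of length $I_n$. The k-th entry of the vector is

$$\left[\mathcal{X} \times_{-n} \{\mathbf{a}\}\right]_k = \sum_{\alpha \in \Omega^{n,k}_{\mathcal{X}}} x_\alpha \prod_{m \neq n} a^{(m)}_{i_m} \tag{3}$$

where $\Omega^{n,k}_{\mathcal{X}}$ denotes the index set of $\mathcal{X}$ having its n-th index as k, and $\alpha = (i_1 i_2 \cdots i_N)$ denotes the index for an entry.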

Tucker decomposition

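Tucker decomposition is one of the most popular tensor factorization models. Tucker decomposition factorizes an N-mode tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ into a core tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$ and factor matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times J_n}$ satisfying $\mathcal{X} \approx \tilde{\mathcal{X}} = \mathcal{G} \times \{\mathbf{U}\}$. The element-wise formulation of the Tucker model is

$$x_\alpha \approx \tilde{x}_\alpha = \mathcal{G} \times \{\mathbf{u}\}_\alpha = \sum_{\beta = (j_1 j_2 \cdots j_N)} g_\beta \prod_{n=1}^{N} u^{(n)}_{i_n j_n} \tag{4}$$

where $\alpha$ is a tensor index $(i_1 i_2 \cdots i_N)$, and $\mathbf{u}^{(n)}_{i_n}$ denotes the $i_n$-th row of factor matrix $\mathbf{U}^{(n)}$. $\{\mathbf{u}\}_\alpha$ denotes the set of factor rows $\{\mathbf{u}^{(1)}_{i_1}, \mathbf{u}^{(2)}_{i_2}, \cdots, \mathbf{u}^{(N)}_{i_N}\}$. The core tensor indicates the relation between the factors in the Tucker formulation. When the core tensor size is restricted as $J_1 = J_2 = \cdots = J_N$ and the core tensor structure is hyper-diagonal, it is equivalent to CANDECOMP/PARAFAC (CP) decomposition. An orthogonality constraint can optionally be imposed on the Tucker decomposition by forcing the factor matrices to have orthonormal columns (e.g. $\mathbf{U}^{(n)\top} \mathbf{U}^{(n)} = \mathbf{I}$ for $n = 1, \cdots, N$ where $\mathbf{I}$ is an identity matrix).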
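To make the element-wise Tucker model concrete, the following is a minimal Python sketch of how one entry could be estimated from a core tensor and one factor row per mode. The function name `tucker_entry` and the dictionary-based core layout are illustrative assumptions, not part of the authors' implementation.

```python
from itertools import product

def tucker_entry(core, factor_rows):
    """Estimate one tensor entry from a Tucker model (element-wise form).

    core        -- dict mapping core index tuples (j1, ..., jN) to values
                   (an assumed storage layout chosen for readability)
    factor_rows -- list of N rows; factor_rows[n] is the i_n-th row of U^(n)
    """
    ranks = [len(row) for row in factor_rows]
    total = 0.0
    # Sum over every core index beta = (j1, ..., jN).
    for beta in product(*(range(r) for r in ranks)):
        term = core.get(beta, 0.0)
        for n, j in enumerate(beta):
            term *= factor_rows[n][j]
        total += term
    return total

# 2-mode example: with two modes the model reduces to u1 * G * u2^T.
core = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 0.0, (1, 1): 3.0}
u1 = [1.0, 2.0]
u2 = [0.5, 1.0]
print(tucker_entry(core, [u1, u2]))

# A hyper-diagonal core gives the CP special case: sum_j u1[j] * u2[j].
cp_core = {(0, 0): 1.0, (1, 1): 1.0}
print(tucker_entry(cp_core, [u1, u2]))
```

The second call illustrates the remark that a hyper-diagonal core with equal ranks recovers the CP model.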

Coupled matrix-tensor factorization

Coupled matrix-tensor factorization (CMTF) is proposed for joint factorization of a tensor and matrices. CMTF integrates matrix factorization and tensor factorization.

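Definition 1. (Coupled Matrix-Tensor Factorization) Given an N-mode tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ and a matrix $\mathbf{Y} \in \mathbb{R}^{I_c \times K}$ where c is the coupled mode, $\tilde{\mathcal{X}} = \mathcal{G} \times \{\mathbf{U}\}$ and $\tilde{\mathbf{Y}} = \mathbf{U}^{(c)} \mathbf{V}^\top$ are the coupled matrix-tensor factorization. $\mathbf{U}^{(c)} \in \mathbb{R}^{I_c \times J_c}$ is the c-th mode factor matrix, and $\mathbf{V} \in \mathbb{R}^{K \times J_c}$ denotes the factor matrix for the coupled matrix. Finding the factor matrices and core tensor for CMTF is equivalent to solving

$$\underset{\mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}, \mathbf{V}, \mathcal{G}}{\arg\min}\ \left\| \mathcal{X} - \mathcal{G} \times \{\mathbf{U}\} \right\|^2 + \left\| \mathbf{Y} - \mathbf{U}^{(c)} \mathbf{V}^\top \right\|^2 \tag{5}$$

where ‖ • ‖ denotes the Frobenius norm.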

Various methods have been proposed to efficiently solve the CMTF problem. An alternating least squares (ALS) method, CMTF-Tucker-ALS [15], was proposed. CMTF-Tucker-ALS is based on Tucker-ALS (HOOI) [21], which is a popular method for fitting the Tucker model. Tucker-ALS suffers from a crucial intermediate memory-bottleneck problem known as the M-bottleneck problem [17], which arises from materialization of a large dense tensor as intermediate data. Generalized coupled tensor factorization frameworks [22, 23] have been proposed, and they propose multiplicative methods for non-negative factorization. SDF [19] provides quasi-Newton and nonlinear least squares optimization techniques for general coupled factorization problems where factors may have certain structures such as Toeplitz, orthogonal, and nonnegative. A Bayesian method [24] has also been proposed. It suggests a generative model for tensor factorization and estimates parameters with Gibbs sampling. Most methods for CMTF use the CP decomposition model, where $J_1 = J_2 = \cdots = J_N$ and the core tensor is hyper-diagonal [12, 25, 26, 27, 28, 19]. CMTF-OPT [12] is a representative algorithm for this problem which uses a nonlinear conjugate gradient method to find factors. HaTen2 [26, 29] and SCouT [25] propose distributed methods for CMTF using the CP decomposition model based on the MapReduce framework. Turbo-SMT [27] provides a time-boosting technique for CP-based CMTF methods.

Note that Eq (5) requires all data entries of $\mathcal{X}$ and $\mathbf{Y}$ to be observed. Unobserved values are set to zeros when $\mathcal{X}$ and $\mathbf{Y}$ are sparse, which results in low accuracy. However, most real-world data sets are highly sparse. For example, the density of the real-world tensors we use for experiments varies from $10^{-7}$ to $10^{-4}$. For this reason, the above methods show low accuracy for real-world sparse data; what we focus on in this paper is solving CMTF for sparse data.

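Definition 2. (Sparse CMTF) When $\mathcal{X}$ and $\mathbf{Y}$ are sparse, sparse CMTF aims to find factors only considering the observed entries. Let $\mathcal{W}^{(1)}$ indicate the observed entries of $\mathcal{X}$ such that $w^{(1)}_\alpha = 1$ if $x_\alpha$ is observed and $w^{(1)}_\alpha = 0$ otherwise. Let $\mathbf{W}^{(2)}$ indicate the observed entries of $\mathbf{Y}$ analogously. We modify Eq (5) as

$$\underset{\mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}, \mathbf{V}, \mathcal{G}}{\arg\min}\ \left\| \mathcal{W}^{(1)} * \left( \mathcal{X} - \mathcal{G} \times \{\mathbf{U}\} \right) \right\|^2 + \left\| \mathbf{W}^{(2)} * \left( \mathbf{Y} - \mathbf{U}^{(c)} \mathbf{V}^\top \right) \right\|^2 \tag{6}$$

where * denotes the Hadamard product (element-wise product).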
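As an illustration of Definition 2, the sketch below evaluates the sparse CMTF error by visiting only observed entries, so no dense tensor is ever materialized. The function name `sparse_cmtf_loss`, the dictionary storage of observed entries, and the 0-based coupled mode index are assumptions made for this example only.

```python
from itertools import product

def sparse_cmtf_loss(tensor_obs, matrix_obs, core, factors, V, c):
    """Sum of squared errors over observed entries only (no regularization).

    tensor_obs -- dict {(i1, ..., iN): value} of observed tensor entries
    matrix_obs -- dict {(i, k): value} of observed coupled-matrix entries
    core       -- dict {(j1, ..., jN): value}
    factors    -- list of N factor matrices, each a list of rows
    V          -- factor matrix of the coupled matrix, a list of rows
    c          -- 0-based index of the coupled mode (assumed convention)
    """
    ranks = [len(U[0]) for U in factors]
    loss = 0.0
    for alpha, x in tensor_obs.items():
        est = 0.0
        for beta in product(*(range(r) for r in ranks)):
            term = core.get(beta, 0.0)
            for n, j in enumerate(beta):
                term *= factors[n][alpha[n]][j]
            est += term
        loss += (x - est) ** 2
    for (i, k), y in matrix_obs.items():
        est = sum(a * b for a, b in zip(factors[c][i], V[k]))
        loss += (y - est) ** 2
    return loss

# Tiny rank-1 example with two modes and one coupled matrix on mode 0.
U1 = [[1.0], [2.0]]
U2 = [[3.0], [4.0]]
V = [[1.0], [0.5]]
core = {(0, 0): 1.0}
tensor_obs = {(0, 1): 4.0, (1, 0): 5.0}   # estimates are 4.0 and 6.0
matrix_obs = {(1, 1): 2.0}                # estimate is 2.0 * 0.5 = 1.0
loss = sparse_cmtf_loss(tensor_obs, matrix_obs, core, [U1, U2], V, c=0)
print(loss)
```

The cost of this evaluation grows with the number of observed entries rather than with the product of the mode lengths.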

CMTF-Tucker-ALS does not support sparse CMTF since it calculates a singular vector of full and dense matrix. CMTF-OPT provides single machine approach for sparse CMTF for CP model, and CDTF [30] and FlexiFaCT [28] provide distributed methods for sparse CMTF for CP model. Note that all existing methods are based on CP model. Our method is for more general setting, Tucker decomposition, and also easily applied to CP model.

Proposed method


S3CMTF provides an algorithm for the joint factorization of Tucker decomposition. The major challenge of parallel Tucker decomposition is to avoid the race condition, and design an efficient algorithm for updating factors.

In this section, we describe S3CMTF (Sparse, lock-free SGD based, and Scalable CMTF), our proposed method for fast, accurate, and scalable CMTF. Our purpose is to minimize the number of race conditions with probabilistic guarantee by exploiting problem characteristic and minimize calculations by exploiting intermediate data.

We first propose a lock-free parallel method S3CMTF-base; then, we propose a time-improved version S3CMTF-opt. Fig 2 shows the overall scheme of S3CMTF. S3CMTF-base employs asynchronous parallel SGD for the parallel update with proper workload distribution, and S3CMTF-opt further improves the speed of S3CMTF-base by exploiting intermediate data and reusing them.

Objective function & gradient

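We discuss the improved formulation of the sparse CMTF problem defined in Definition 2. For simplicity, we consider the case that one matrix $\mathbf{Y} \in \mathbb{R}^{I_c \times K}$ is coupled to the c-th mode of a tensor $\mathcal{X}$. Naive calculation of Eq (6) takes excessive time and memory since it includes materialization of the dense tensor $\mathcal{G} \times \{\mathbf{U}\}$. Therefore, we re-formulate the new CMTF objective function f to exploit the sparsity of data and add regularization. f is the weighted sum of two functions $f_t$ and $f_m$ which are element-wise sums of squared reconstruction error and regularization terms of tensor $\mathcal{X}$ and matrix $\mathbf{Y}$, respectively.

$$f = \frac{1}{2} f_t + \frac{\lambda_m}{2} f_m \tag{7}$$

where $\lambda_m$ is a balancing factor of the two functions.

$$f_t = \sum_{\alpha \in \Omega_{\mathcal{X}}} \left( x_\alpha - \mathcal{G} \times \{\mathbf{u}\}_\alpha \right)^2 + \lambda_{reg} \left( \|\mathcal{G}\|^2 + \sum_{n=1}^{N} \|\mathbf{U}^{(n)}\|^2 \right)$$

where $\alpha = (i_1 \cdots i_N)$, $\Omega_{\mathcal{X}}$ is the observable index set of $\mathcal{X}$, and $\lambda_{reg}$ denotes the regularization parameter for factors. We rewrite the equation so that it is amenable to SGD update:

$$f_t = \sum_{\alpha \in \Omega_{\mathcal{X}}} \left[ \left( x_\alpha - \mathcal{G} \times \{\mathbf{u}\}_\alpha \right)^2 + \frac{\lambda_{reg}}{|\Omega_{\mathcal{X}}|} \|\mathcal{G}\|^2 + \lambda_{reg} \sum_{n=1}^{N} \frac{\|\mathbf{u}^{(n)}_{i_n}\|^2}{|\Omega^{n,i_n}_{\mathcal{X}}|} \right]$$

where $\alpha = (i_1 \cdots i_N)$. Note that $\Omega^{n,i_n}_{\mathcal{X}}$ is the subset of $\Omega_{\mathcal{X}}$ having $i_n$ as the n-th index. Now we formulate $f_m$, the sum of squared errors of the coupled matrix and the regularization term corresponding to the coupled matrix.

$$f_m = \sum_{\beta = (j_1 j_2) \in \Omega_{\mathbf{Y}}} \left[ \left( y_\beta - \mathbf{u}^{(c)}_{j_1} \mathbf{v}_{j_2}^\top \right)^2 + \frac{\lambda_{reg}}{|\Omega^{2,j_2}_{\mathbf{Y}}|} \|\mathbf{v}_{j_2}\|^2 \right]$$

We calculate the gradient of f (Eq (7)) with respect to factors and core for stochastic gradient descent update. Consider that we pick one tensor index $\alpha = (i_1 \cdots i_N) \in \Omega_{\mathcal{X}}$ and matrix index $\beta = (j_1 j_2) \in \Omega_{\mathbf{Y}}$. We calculate the corresponding partial derivatives of f with respect to the factors and the core tensor as follows.

$$\left.\frac{\partial f}{\partial \mathbf{u}^{(n)}_{i_n}}\right|_\alpha = -\left( x_\alpha - \mathcal{G} \times \{\mathbf{u}\}_\alpha \right) \left[ \mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha \right] + \frac{\lambda_{reg}}{|\Omega^{n,i_n}_{\mathcal{X}}|} \mathbf{u}^{(n)}_{i_n} \tag{8a}$$

$$\left.\frac{\partial f}{\partial \mathcal{G}}\right|_\alpha = -\left( x_\alpha - \mathcal{G} \times \{\mathbf{u}\}_\alpha \right) \times \{\mathbf{u}^\top\}_\alpha + \frac{\lambda_{reg}}{|\Omega_{\mathcal{X}}|} \mathcal{G} \tag{8b}$$

$$\left.\frac{\partial f}{\partial \mathbf{u}^{(c)}_{j_1}}\right|_\beta = -\lambda_m \left( y_\beta - \mathbf{u}^{(c)}_{j_1} \mathbf{v}_{j_2}^\top \right) \mathbf{v}_{j_2} \tag{8c}$$

$$\left.\frac{\partial f}{\partial \mathbf{v}_{j_2}}\right|_\beta = -\lambda_m \left( y_\beta - \mathbf{u}^{(c)}_{j_1} \mathbf{v}_{j_2}^\top \right) \mathbf{u}^{(c)}_{j_1} + \frac{\lambda_m \lambda_{reg}}{|\Omega^{2,j_2}_{\mathbf{Y}}|} \mathbf{v}_{j_2} \tag{8d}$$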

Note that our formulated coupled matrix-tensor factorization model is easily generalized to the case that multiple matrices are coupled to a tensor. We couple multiple matrices to a tensor for experiments in Sections for experiments and discovery.
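The following Python sketch illustrates a single SGD step for one coupled-matrix entry. It assumes a per-entry loss of the form (λm/2)·[(y − u·v)² + (λreg/n_obs_v)·‖v‖²], which is one plausible reading of the objective described above; the function name `sgd_matrix_step` and the parameter `n_obs_v` (number of observed entries sharing the same row of V) are hypothetical.

```python
def sgd_matrix_step(u, v, y, eta, lam_m, lam_reg, n_obs_v):
    """One SGD step on a coupled-matrix entry y ~ u . v.

    Assumed per-entry loss (an illustration, not the authors' code):
        (lam_m / 2) * ((y - u.v)^2 + (lam_reg / n_obs_v) * |v|^2)
    """
    residual = y - sum(a * b for a, b in zip(u, v))
    # Gradient with respect to the coupled-mode factor row u.
    grad_u = [-lam_m * residual * b for b in v]
    # Gradient with respect to the coupled-matrix factor row v.
    grad_v = [-lam_m * residual * a + lam_m * lam_reg / n_obs_v * b
              for a, b in zip(u, v)]
    new_u = [a - eta * g for a, g in zip(u, grad_u)]
    new_v = [b - eta * g for b, g in zip(v, grad_v)]
    return new_u, new_v

u, v, y = [1.0, 0.0], [1.0, 1.0], 3.0
new_u, new_v = sgd_matrix_step(u, v, y, eta=0.1, lam_m=1.0,
                               lam_reg=0.0, n_obs_v=1)
before = abs(y - sum(a * b for a, b in zip(u, v)))
after = abs(y - sum(a * b for a, b in zip(new_u, new_v)))
print(before, after)
```

With a small step size the residual of the sampled entry shrinks after the update, which is the behavior an SGD step is expected to show.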

Multi-core parallelization

How can we parallelize the SGD updates for CMTF on multiple cores? In CMTF, SGD is hard to parallelize without conflicts since updates may collide when they attempt to write the core tensor to memory concurrently [31]. One solution for this problem is memory locking and synchronization. However, locking incurs substantial overhead. Therefore, we use a lock-free strategy to parallelize S3CMTF. We develop a parallel update scheme for S3CMTF by adopting the HOGWILD! update scheme [32]. For any SGD problem, a hypergraph is induced where its nodes represent parameters and edges represent the set of parameters related to a data point.

Definition 3. (Induced Hypergraph) The objective function in Eq (7) induces a hypergraph G = (V, E) whose nodes represent factor rows and the core tensor. Each entry of $\mathcal{X}$ and $\mathbf{Y}$ induces a hyperedge $e \in E$ consisting of the corresponding factor rows or core tensor. Fig 3A shows an example induced graph of S3CMTF.

Fig 3. Example hypergraphs induced by S3CMTF objective function (Eq (7)).

A matrix $\mathbf{Y}$ is coupled to the second mode of $\mathcal{X}$ with a coupled factor matrix $\mathbf{V}$. Each node represents a factor row or the core tensor. Each hyperedge includes the factors involved in one SGD update. (a) Induced hypergraph with the core tensor. Every hyperedge corresponding to tensor entries includes $\mathcal{G}$. (b) Induced hypergraph without the core tensor. The graph has a sparse structure as every node is shared by only a few hyperedges.

Lock-free parallel updates often converge nearly linearly for a sparse SGD problem in which conflicts between different updates rarely occur [32]. However, in CMTF with Tucker formulation, every update of tensor entries includes the core tensor as shown in Fig 3A. We allocate the update of the core tensor to one dedicated CPU core and multiply its step size by the number of cores P to keep the expected step size unchanged, which leads to line 7 of Algorithm 1 described in the next section. Then we obtain a new induced hypergraph in Fig 3B. The previous induced hypergraph (Fig 3A) implies that every factor update (red, blue, and orange hyperedges) is in conflict with each other on updating the core tensor, resulting in unexpected behaviors. In contrast, the new induced hypergraph shows that the update of factors is independent of that of the core tensor.
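A small Python sketch can make the effect of removing the core tensor from the hyperedges tangible. It builds one hyperedge per observed tensor entry and reports how many hyperedges share the busiest node; the node labels and helper names are illustrative assumptions.

```python
from collections import Counter

def hyperedges(tensor_indices, include_core):
    """Build one hyperedge (a set of parameter nodes) per observed entry."""
    edges = []
    for alpha in tensor_indices:
        # One node per factor row touched by this entry: (label, mode, row).
        edge = {("U", n, i) for n, i in enumerate(alpha)}
        if include_core:
            edge.add("G")  # the core tensor is a single shared node
        edges.append(edge)
    return edges

def max_node_degree(edges):
    """Largest number of hyperedges that share a single node."""
    degree = Counter(node for edge in edges for node in edge)
    return max(degree.values())

entries = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (0, 1, 2)]
with_core = hyperedges(entries, include_core=True)
without_core = hyperedges(entries, include_core=False)
print(max_node_degree(with_core), max_node_degree(without_core))
```

With the core node included, every hyperedge overlaps with every other one; without it, overlap is limited to entries that share a factor row, which is the sparse regime where lock-free updates rarely conflict.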

Note that our problem with this induced hypergraph is a general case of matrix completion problem in [32] which provides convergence guarantee of lock-free parallelism; each edge in our hypergraph entails N vertices, while that in [32] entails only 2 vertices.

Algorithm 1 S3CMTF-base

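Require: Tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_N}$, rank $(J_1, \cdots, J_N)$, number of parallel cores P, initial learning rate $\eta_0$, decay rate $\mu$, coupled mode c, and coupled matrix $\mathbf{Y} \in \mathbb{R}^{I_c \times K}$

Ensure: Core tensor $\mathcal{G} \in \mathbb{R}^{J_1 \times \cdots \times J_N}$, factor matrices $\mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}, \mathbf{V}$

1: Initialize $\mathcal{G}$, $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times J_n}$ for $n = 1, \cdots, N$, and $\mathbf{V} \in \mathbb{R}^{K \times J_c}$ randomly

2: repeat

3:  for $\forall \alpha = (i_1 \cdots i_N) \in \Omega_{\mathcal{X}}$, $\forall \beta = (j_1 j_2) \in \Omega_{\mathbf{Y}}$ in random order do in parallel

4:   if $\alpha$ is picked then

5:    $\left( \frac{\partial f}{\partial \mathbf{u}^{(1)}_{i_1}}, \cdots, \frac{\partial f}{\partial \mathbf{u}^{(N)}_{i_N}}, \frac{\partial f}{\partial \mathcal{G}} \right) \leftarrow$ compute_gradient($\alpha$, $x_\alpha$, $\mathcal{G}$)

6:    $\mathbf{u}^{(n)}_{i_n} \leftarrow \mathbf{u}^{(n)}_{i_n} - \eta_t \frac{\partial f}{\partial \mathbf{u}^{(n)}_{i_n}}$, (for $n = 1, \cdots, N$)

7:    $\mathcal{G} \leftarrow \mathcal{G} - \eta_t \cdot P \cdot \frac{\partial f}{\partial \mathcal{G}}$ (executed by one dedicated CPU core)

8:   end if

9:   if $\beta$ is picked then

10:    $\tilde{y}_\beta \leftarrow \mathbf{u}^{(c)}_{j_1} \mathbf{v}_{j_2}^\top$

11:    compute $\frac{\partial f}{\partial \mathbf{u}^{(c)}_{j_1}}$ and $\frac{\partial f}{\partial \mathbf{v}_{j_2}}$ from $y_\beta - \tilde{y}_\beta$

12:    $\mathbf{u}^{(c)}_{j_1} \leftarrow \mathbf{u}^{(c)}_{j_1} - \eta_t \frac{\partial f}{\partial \mathbf{u}^{(c)}_{j_1}}$, $\mathbf{v}_{j_2} \leftarrow \mathbf{v}_{j_2} - \eta_t \frac{\partial f}{\partial \mathbf{v}_{j_2}}$

13:   end if

14:  end for

15:  $\eta_t = \eta_0 (1 + \mu t)^{-1}$

16: until convergence conditions are satisfied

17: for $n = 1, \cdots, N$ do

18:  $\mathbf{Q}^{(n)}, \mathbf{R}^{(n)} \leftarrow$ QR decomposition of $\mathbf{U}^{(n)}$

19:  $\mathbf{U}^{(n)} \leftarrow \mathbf{Q}^{(n)}$, $\mathcal{G} \leftarrow \mathcal{G} \times_n \mathbf{R}^{(n)}$

20: end for

21: $\mathbf{V} \leftarrow \mathbf{V} \mathbf{R}^{(c)\top}$

22: return $\mathcal{G}$, $\mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}, \mathbf{V}$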


We present our method, S3CMTF-base, a combination of the aforementioned techniques. S3CMTF-base solves the sparse CMTF problem by the parallel SGD techniques explained above. Algorithm 1 shows the procedure of S3CMTF-base. In the beginning, S3CMTF-base initializes factor matrices and the core tensor randomly (line 1 of Algorithm 1). The outer loop (lines 2-16) repeats until the factor variables converge. The inner loop (lines 3-14) is performed by several cores in parallel. In each inner loop, S3CMTF-base selects an index which belongs to $\Omega_{\mathcal{X}}$ or $\Omega_{\mathbf{Y}}$ in random order (line 3). If a tensor index $\alpha$ is picked, then the algorithm calculates the partial gradients of corresponding factor rows using compute_gradient (Algorithm 2) in line 5, and updates factor row vectors (line 6). The core tensor $\mathcal{G}$ is updated by one dedicated CPU core (line 7). Note that if line 7 is run by multiple cores, a core may interrupt another core’s update of $\mathcal{G}$ by overwriting the gradient $\frac{\partial f}{\partial \mathcal{G}}$, which leads to an unexpected update of $\mathcal{G}$ and hinders convergence; thus, we eliminate the possibility of such conflict by allocating the update of $\mathcal{G}$ to the dedicated CPU core. The update of line 7 is done independently by the dedicated CPU core, but concurrently with gradient calculation (line 5) and factor updates (line 6) of other CPU cores. The number P of cores is multiplied to the gradient to compensate for the one-core update so that SGD uses the same expected learning rate for all the parameters. If a coupled matrix index $\beta$ is picked, then the gradient update is performed on corresponding factor row vectors (lines 9-13). At the end of the outer loop, the learning rate $\eta_t$ of the t-th iteration is monotonically decreased [33] (line 15). QR decomposition is applied on factors to satisfy the orthogonality constraint of factor matrices (lines 17-20). QR decomposition of $\mathbf{U}^{(n)}$ generates $\mathbf{Q}^{(n)}$, an orthogonal matrix of the same size as $\mathbf{U}^{(n)}$, and a square matrix $\mathbf{R}^{(n)} \in \mathbb{R}^{J_n \times J_n}$. Substituting $\mathbf{U}^{(n)}$ by $\mathbf{Q}^{(n)}$ (line 19) and $\mathcal{G}$ by $\mathcal{G} \times \{\mathbf{R}\}$ (after the N-th execution of line 19) results in orthogonal factors with equivalent factorization quality [5]. In the same manner, we substitute $\mathbf{V}$ by $\mathbf{V} \mathbf{R}^{(c)\top}$ (line 21) since $\mathbf{U}^{(c)} \mathbf{V}^\top = \mathbf{Q}^{(c)} \mathbf{R}^{(c)} \mathbf{V}^\top = \mathbf{Q}^{(c)} (\mathbf{V} \mathbf{R}^{(c)\top})^\top$.
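The learning rate schedule of line 15 and the step-size compensation of line 7 can be sketched in a few lines of Python. The helper names are hypothetical; the values η0 = 0.001 and μ = 0.1 are used only as an example.

```python
def learning_rate(eta0, mu, t):
    """Learning rate of the t-th outer iteration: eta0 / (1 + mu * t)."""
    return eta0 / (1.0 + mu * t)

def core_step_size(eta_t, num_cores):
    """Step size used by the single core that owns the core-tensor update.

    Only one of num_cores workers updates the core tensor, so its step is
    scaled by num_cores to keep the expected step size unchanged.
    """
    return eta_t * num_cores

for t in (0, 10, 90):
    eta_t = learning_rate(0.001, 0.1, t)
    print(t, eta_t, core_step_size(eta_t, 20))
```

The schedule decreases monotonically in t, and the scaled core step follows the same decay.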

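Algorithm 2 compute_gradient($\alpha$, $x_\alpha$, $\mathcal{G}$)

Require: Tensor entry $x_\alpha$, $\alpha = (i_1 \cdots i_N) \in \Omega_{\mathcal{X}}$, core tensor $\mathcal{G}$

Ensure: Gradients $\frac{\partial f}{\partial \mathbf{u}^{(1)}_{i_1}}, \frac{\partial f}{\partial \mathbf{u}^{(2)}_{i_2}}, \cdots, \frac{\partial f}{\partial \mathbf{u}^{(N)}_{i_N}}, \frac{\partial f}{\partial \mathcal{G}}$

1: $\tilde{x}_\alpha \leftarrow \mathcal{G} \times \{\mathbf{u}\}_\alpha$

2: for $n = 1, \cdots, N$ do

3:  $\frac{\partial f}{\partial \mathbf{u}^{(n)}_{i_n}} \leftarrow -(x_\alpha - \tilde{x}_\alpha) \left[ \mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha \right] + \frac{\lambda_{reg}}{|\Omega^{n,i_n}_{\mathcal{X}}|} \mathbf{u}^{(n)}_{i_n}$

4: end for

5: $\frac{\partial f}{\partial \mathcal{G}} \leftarrow -(x_\alpha - \tilde{x}_\alpha) \times \{\mathbf{u}^\top\}_\alpha + \frac{\lambda_{reg}}{|\Omega_{\mathcal{X}}|} \mathcal{G}$

6: return $\frac{\partial f}{\partial \mathbf{u}^{(1)}_{i_1}}, \frac{\partial f}{\partial \mathbf{u}^{(2)}_{i_2}}, \cdots, \frac{\partial f}{\partial \mathbf{u}^{(N)}_{i_N}}, \frac{\partial f}{\partial \mathcal{G}}$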

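Algorithm 3 compute_gradient_opt($\alpha$, $x_\alpha$, $\mathcal{G}$)

Require: Tensor entry $x_\alpha$, $\alpha = (i_1 \cdots i_N) \in \Omega_{\mathcal{X}}$, core tensor $\mathcal{G}$

Ensure: Gradients $\frac{\partial f}{\partial \mathbf{u}^{(1)}_{i_1}}, \frac{\partial f}{\partial \mathbf{u}^{(2)}_{i_2}}, \cdots, \frac{\partial f}{\partial \mathbf{u}^{(N)}_{i_N}}, \frac{\partial f}{\partial \mathcal{G}}$

1: $\tilde{x}_\alpha \leftarrow 0$

2: for $\forall \beta = (j_1 \cdots j_N) \in \mathcal{G}$ do

3:  $\delta_\alpha(\beta) \leftarrow g_\beta \prod_{n=1}^{N} u^{(n)}_{i_n j_n}$

4:  $\tilde{x}_\alpha \leftarrow \tilde{x}_\alpha + \delta_\alpha(\beta)$

5: end for

6: for $n = 1, \cdots, N$ do

7:  $\frac{\partial f}{\partial \mathbf{u}^{(n)}_{i_n}} \leftarrow -(x_\alpha - \tilde{x}_\alpha)\, \text{Collapse}(\delta_\alpha, n) \oslash \mathbf{u}^{(n)}_{i_n} + \frac{\lambda_{reg}}{|\Omega^{n,i_n}_{\mathcal{X}}|} \mathbf{u}^{(n)}_{i_n}$

8: end for

9: $\frac{\partial f}{\partial \mathcal{G}} \leftarrow -(x_\alpha - \tilde{x}_\alpha)\, \delta_\alpha \oslash \mathcal{G} + \frac{\lambda_{reg}}{|\Omega_{\mathcal{X}}|} \mathcal{G}$

10: return $\frac{\partial f}{\partial \mathbf{u}^{(1)}_{i_1}}, \frac{\partial f}{\partial \mathbf{u}^{(2)}_{i_2}}, \cdots, \frac{\partial f}{\partial \mathbf{u}^{(N)}_{i_N}}, \frac{\partial f}{\partial \mathcal{G}}$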


There is much room for improvement in the calculations of S3CMTF-base. The computational bottleneck of S3CMTF-base is compute_gradient. There are implicitly redundant calculations during multiple tensor-matrix products. For example, the calculation of $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha$ is repeated N times for every execution of compute_gradient (Algorithm 2) in line 5 of Algorithm 1. The calculation of $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha$ for the n-th mode is equivalent to a special case of a well-studied operation, matricized tensor times Khatri-Rao product (MTTKRP). MTTKRP is an operation to compute $\mathbf{X}_{(n)} \bigodot_{k \neq n} \mathbf{A}^{(k)}$ where $\mathbf{X}_{(n)}$ is a matricized tensor along the n-th mode, and ⊙ denotes the Khatri-Rao product [34]. $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha$ is equivalent to an MTTKRP $\mathbf{G}_{(n)} \bigodot_{k \neq n} \mathbf{u}^{(k)}$ where the matrix $\mathbf{A}^{(k)}$ is replaced by the vector $\mathbf{u}^{(k)}$.

Calculating MTTKRP along all modes is known as the CP gradient problem. In compute_gradient, we need to calculate $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha$ for all N modes (line 3 of Algorithm 2), raising a special case of the CP gradient problem. To solve this particular CP gradient problem faster, we propose a method to avoid redundant computations by reusing the intermediate calculations of previous steps. Calculation of $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha$ is equivalent to a summation of $g_\beta \prod_{m \neq n} u^{(m)}_{i_m j_m}$ (Eq 3), which is a product of the core value and N − 1 related factor values. Before the calculation of the CP gradient, $\tilde{x}_\alpha$ is calculated in line 1 of Algorithm 2. We exploit the fact that $\tilde{x}_\alpha$ is the summation of the product $g_\beta \prod_{n=1}^{N} u^{(n)}_{i_n j_n}$ (Eq 4), the product of a core value and all N related factor values. In S3CMTF-opt, we save time by storing the intermediate calculations for $\tilde{x}_\alpha$ and reusing them.

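Definition 4. (Intermediate Data) When updating the factor rows for a tensor entry $x_\alpha$ with $\alpha = (i_1 i_2 \cdots i_N)$, we define the $(j_1 j_2 \cdots j_N)$-th element of the intermediate data $\delta_\alpha \in \mathbb{R}^{J_1 \times \cdots \times J_N}$:

$$\delta_\alpha(j_1 j_2 \cdots j_N) := g_{j_1 j_2 \cdots j_N} \prod_{n=1}^{N} u^{(n)}_{i_n j_n}$$

There is no extra time required for calculating $\delta_\alpha$ because $\delta_\alpha$ is generated while calculating $\tilde{x}_\alpha$. Lemma 1 shows that $\tilde{x}_\alpha$ is calculated by summing all entries of $\delta_\alpha$.

Lemma 1. For a given tensor index $\alpha$, the estimated tensor entry $\tilde{x}_\alpha = \sum_{\forall \beta \in \mathcal{G}} \delta_\alpha(\beta)$.

Proof. The proof is straightforward by Eq (4).

We use $\delta_\alpha$ with the following Collapse operation to calculate gradients efficiently.

Definition 5. (Collapse) The Collapse operation of the intermediate tensor $\delta_\alpha$ on the n-th mode outputs a row vector of length $J_n$ defined as:

$$\text{Collapse}(\delta_\alpha, n) := \left[ \sum_{\forall \beta:\, j_n = 1} \delta_\alpha(\beta),\ \sum_{\forall \beta:\, j_n = 2} \delta_\alpha(\beta),\ \cdots,\ \sum_{\forall \beta:\, j_n = J_n} \delta_\alpha(\beta) \right]$$

The Collapse operation aggregates the elements of the intermediate tensor with respect to a fixed mode. We re-express the calculation of gradients for tensor factors in Eqs (8a)–(8d) in an efficient manner.

Lemma 2. (Efficient Gradient Calculation) The following statements are equivalent calculations of the gradients as in Eqs (8a)–(8d).

$$\tilde{x}_\alpha = \mathcal{G} \times \{\mathbf{u}\}_\alpha = \sum_{\forall \beta \in \mathcal{G}} \delta_\alpha(\beta) \tag{9a}$$

$$\left.\frac{\partial f}{\partial \mathbf{u}^{(n)}_{i_n}}\right|_\alpha = -(x_\alpha - \tilde{x}_\alpha)\, \text{Collapse}(\delta_\alpha, n) \oslash \mathbf{u}^{(n)}_{i_n} + \frac{\lambda_{reg}}{|\Omega^{n,i_n}_{\mathcal{X}}|} \mathbf{u}^{(n)}_{i_n} \tag{9b}$$

$$\left.\frac{\partial f}{\partial \mathcal{G}}\right|_\alpha = -(x_\alpha - \tilde{x}_\alpha)\, \delta_\alpha \oslash \mathcal{G} + \frac{\lambda_{reg}}{|\Omega_{\mathcal{X}}|} \mathcal{G} \tag{9c}$$

where $\alpha = (i_1 i_2 \cdots i_N)$, and $\oslash$ denotes element-wise division.

Proof. In Lemma 1, Eq (9a) is proved. To prove the equivalence of Eq (9b) and Eq (8a), it suffices to show $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha = \text{Collapse}(\delta_\alpha, n) \oslash \mathbf{u}^{(n)}_{i_n}$. We use Eq (3) for the proof. $\left[ \mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha \right]_k = \sum_{\forall \beta:\, j_n = k} g_\beta \prod_{m \neq n} u^{(m)}_{i_m j_m}$ and $\left[ \text{Collapse}(\delta_\alpha, n) \oslash \mathbf{u}^{(n)}_{i_n} \right]_k = \frac{1}{u^{(n)}_{i_n k}} \sum_{\forall \beta:\, j_n = k} g_\beta \prod_{m=1}^{N} u^{(m)}_{i_m j_m} = \sum_{\forall \beta:\, j_n = k} g_\beta \prod_{m \neq n} u^{(m)}_{i_m j_m}$. Next, to show the equivalence of Eq (9c) and the second equation of Eq (8), it suffices to show that the $\beta$-th entry of $\delta_\alpha \oslash \mathcal{G}$ equals $\prod_{n=1}^{N} u^{(n)}_{i_n j_n}$, which follows from Definition 4.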
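The reuse idea behind Lemmas 1 and 2 can be checked numerically with a short Python sketch. It computes the intermediate data once, derives the estimate and the per-mode vector from it, and compares the latter against a direct computation that skips the mode. The function names and the dictionary-based core are assumptions for illustration, and the element-wise division assumes nonzero factor values.

```python
from itertools import product

def intermediate_data(core, rows):
    """delta[beta] = g_beta * prod_n rows[n][j_n] for every core index beta."""
    ranks = [len(r) for r in rows]
    delta = {}
    for beta in product(*(range(r) for r in ranks)):
        term = core[beta]
        for n, j in enumerate(beta):
            term *= rows[n][j]
        delta[beta] = term
    return delta

def collapse(delta, n, rank_n):
    """Sum delta over all modes except n, giving a vector of length rank_n."""
    out = [0.0] * rank_n
    for beta, value in delta.items():
        out[beta[n]] += value
    return out

def direct_product_except(core, rows, n):
    """Reference computation: multiply the core by every row except mode n."""
    out = [0.0] * len(rows[n])
    for beta, g in core.items():
        term = g
        for m, j in enumerate(beta):
            if m != n:
                term *= rows[m][j]
        out[beta[n]] += term
    return out

core = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
rows = [[2.0, 5.0], [0.5, 4.0]]
delta = intermediate_data(core, rows)
x_hat = sum(delta.values())                      # Lemma 1
via_collapse = [s / u for s, u in zip(collapse(delta, 0, 2), rows[0])]
print(x_hat, via_collapse, direct_product_except(core, rows, 0))
```

Both routes give the same vector, but the Collapse route reuses `delta`, which was already needed for the estimate.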

S3CMTF-opt replaces compute_gradient (Algorithm 2) of S3CMTF-base with compute_gradient_opt (Algorithm 3), the time-optimized alternative. We prove that the new calculation scheme is faster than the previous one.

Lemma 3. compute_gradient_opt is faster than compute_gradient. The theoretical time complexity of compute_gradient is $O(N^2 J^N)$ and the time complexity of compute_gradient_opt is $O(N J^N)$ where $J_1 = J_2 = \cdots = J_N = J$.

Proof. We assume that $I_1 = I_2 = \cdots = I_N = I$ for brevity. First, we calculate the time complexity of compute_gradient (Algorithm 2). Given a tensor index $\alpha$, computing $\tilde{x}_\alpha$ (line 1 of Algorithm 2) takes $O(N J^N)$. Computing $\mathcal{G} \times_{-n} \{\mathbf{u}\}_\alpha$ (line 3) takes $O(N J^N)$. Thus, calculating the row gradient for all modes (lines 2-4) takes $O(N^2 J^N)$ in aggregate. Calculating $\frac{\partial f}{\partial \mathcal{G}}$ (line 5) takes $O(N J^N)$. In total, compute_gradient takes $O(N^2 J^N)$ time. Next, we calculate the time complexity of compute_gradient_opt (Algorithm 3). Computing an entry of the intermediate data $\delta_\alpha$ (line 3 of Algorithm 3) takes $O(N)$. The aggregate time for getting $\delta_\alpha$ (lines 2-5) is $O(N J^N)$ since $|\mathcal{G}| = J^N$. Calculating the row gradient for all modes (lines 6-8) takes $O(N J^N)$ since the Collapse operation takes $O(J^N)$. Calculating the gradient for the core tensor (line 9) takes $O(J^N)$. In total, compute_gradient_opt takes $O(N J^N)$ time.


We analyze the proposed method in terms of time complexity per iteration. For simplicity, we assume that I1 = I2 = ⋯ = IN = I, and J1 = J2 = ⋯ = JN = J. Table 3 summarizes the time complexity (per iteration) and memory usage of S3CMTF and other methods. Note that the memory usage refers to the auxiliary space for temporary variables used by a method.

Table 3. Comparison of time complexity (per iteration) and memory usage of our proposed S3CMTF and other CMTF algorithms.

S3CMTF-opt shows the lowest time complexity and S3CMTF-base shows the lowest memory usage. For simplicity, we assume that all modes are of size I, of rank J, and an I × K matrix is coupled to one mode. P is the number of parallel cores. (* indicates the lowest time or memory).

Lemma 4. The time complexity (per iteration) of S3CMTF-base is and the time complexity (per iteration) of S3CMTF-opt is where P denotes the number of parallel cores.

Proof. First, we check the time complexity of S3CMTF-base. When a tensor index α is picked in the inner loop (line 4 of Algorithm 1), calculating gradients with respect to tensor factors (line 5) takes as shown in Lemma 3. Updating factor rows (line 6) takes , and updating core tensor (line 7) takes . If a coupled matrix index β is picked (line 9), calculating (line 10) takes . Calculating and updating the factor rows corresponding to coupled matrix entry (lines 10-12) take . All calculations except updating core tensor (line 7) are conducted in parallel. Finally, for all and β ∈ ΩY, S3CMTF-base takes for one iteration. S3CMTF-opt uses compute_gradient_opt instead of compute_gradient in line 5 of Algorithm 1, whose time complexity is shown in Lemma 3. Overall running time per iteration for S3CMTF-opt is .


In this and the next sections, we experimentally evaluate S3CMTF. Especially, we answer the following questions.

Q1: Performance How accurate and fast is S3CMTF compared to competitors?

Q2: Scalability How do S3CMTF and other methods scale in terms of dimensionality, the number of observed entries, and the number of cores?

Q3: Discovery What are the discoveries of applying S3CMTF on real-world data?

The source codes of our method and datasets used in this paper are available at

Experimental settings


Table 4 shows the data we used in our experiments. We use three real-world datasets, MovieLens, Netflix, and Yelp, as well as synthetic data to evaluate S3CMTF. Each entry of the real-world datasets represents a rating, which consists of (user, ‘item’, time; rating) where ‘item’ indicates ‘movie’ for MovieLens and Netflix, and ‘business’ for Yelp. We use (movie, genre) and (movie, year) as coupled matrices for MovieLens and Netflix, respectively. We use the (user, user) friendship matrix and the (business, category) and (business, city) matrices for Yelp. For the scalability experiments in particular, we generate 3-mode synthetic random tensors with dimensionality I and corresponding coupled matrices to observe how running time changes as the size varies. We vary I in the range of 1K∼100M and the number of tensor entries in the range of 1K∼100M. We set the number of entries as for synthetic coupled matrices. We generate observed indices randomly, and draw their entries from a uniform random distribution between 0 and 1.

Table 4. Summary of the data used for experiments.

‘K’ means thousand, and ‘M’ million. Tensors and matrices of density 1 are fully observed.


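We use test RMSE as the measure for tensor reconstruction error.

$$\text{test RMSE} = \sqrt{\frac{1}{|\Omega_{test}|} \sum_{\alpha \in \Omega_{test}} \left( x_\alpha - \tilde{x}_\alpha \right)^2}$$

where $\Omega_{test}$ is the index set of the test data tensor, $x_\alpha$ stands for each test tensor entry, and $\tilde{x}_\alpha$ is the corresponding reconstructed value.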
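As a minimal sketch of this measure, the following Python function computes the root mean squared error over held-out entries. The function name `rmse` and the `predict` callback are hypothetical and stand in for whatever reconstruction routine is used.

```python
import math

def rmse(test_entries, predict):
    """Root mean squared error over a dict {index: true_value}.

    predict -- callable mapping an index tuple to the reconstructed value
    """
    squared = [(x - predict(alpha)) ** 2 for alpha, x in test_entries.items()]
    return math.sqrt(sum(squared) / len(squared))

held_out = {(0, 1, 2): 4.0, (3, 0, 1): 2.0}
# A constant predictor of 3.0 misses both entries by exactly 1.0.
print(rmse(held_out, lambda alpha: 3.0))
```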


For fair comparison, we compare single core run of S3CMTF-base and S3CMTF-opt with other single machine CMTF methods: CMTF-Tucker-ALS and CMTF-OPT (described in Section). To examine multi-core performance, we run two versions of S3CMTF-opt: S3CMTF-opt1 (1 core), and S3CMTF-opt20 (20 cores). We exclude distributed CMTF methods [25, 26, 28] since they are designed for Hadoop with multiple machines, and thus take too much time for single machine environment. For example, [17] reported that HaTen2 [26] takes 10,700s to decompose 4-way tensor with I = 10K and , which is almost 7,000× slower than our single machine implementation of S3CMTF-opt. For CMTF-Tucker-ALS, we implemented a C++ version based on Tucker-MET [20], and for CMTF-OPT, we implemented a C++ version of CMTF-OPT [12]. Our implementation for CMTF-OPT solves Eq (6) by sparse matrix operations. We implement S3CMTF with C++. For all of our C++ implementations, we used C++11 with O2 flag. We used Armadillo 7.700 with LAPACK 3.7.0 and BLAS 3.7.0 for matrix operations such as eigenvector calculations. We used OpenMP 4.0 library for multi-core parallelization of S3CMTF.

We conduct all experiments on a machine equipped with Intel Xeon E5-2630 v4 2.2GHz CPU and 256GB RAM. We mark out-of-memory (O.O.M.) error when the memory usage exceeds the limit.


We set pre-defined hyperparameters that resulted in the best reconstruction error on a 10% validation set by random grid search: tensor rank J, regularization factor λreg, λm, the initial learning rate η0, and decay rate μ. We set λreg to 0.1, λm = 10, and μ = 0.1 for all datasets. For rank and initial learning rate, MovieLens: J = 12, η = 0.001, Netflix: J = 11, η = 0.001, and Yelp: J = 10, η = 0.0005. For synthetic datasets, we use J = 10 for all experiments.

Performance of S3CMTF

We observe the performance of S3CMTF to answer Q1. As seen in Figs 1B and 4, S3CMTF converges faster to the optimum with the lowest test error than existing methods with the following details.

Fig 4. Test RMSE of S3CMTF and other CMTF methods over iterations.

S3CMTF-opt20 shows the best convergence rate and accuracy.


We divide each data tensor into 80%/20% for train/test sets. Specifically, 80% of the tensor entries are regarded as the train set and the remaining 20% as the test set. A lower error for the same elapsed time implies better accuracy and faster convergence. Figs 1B and 4 show the changes of test RMSE of each method on three datasets over elapsed time, which answer Q1. S3CMTF achieves the lowest error compared to others for the same elapsed time. For Yelp, CMTF-Tucker-ALS yielded an O.O.M. error. S3CMTF-opt20 achieves the lowest errors of 1.253, 0.9147, and 0.8037, while the best competing method, CMTF-OPT, gives errors of 1.370, 1.018, and 0.8125 for the Yelp, Netflix, and MovieLens datasets, respectively. Note that the competing method CMTF-Tucker-ALS either gives an out of memory error or results in the highest error.

Running time.

We compare our method with the multi-core version of SALS-single [30], a parallel CP decomposition algorithm, to demonstrate the high performance of S3CMTF compared to state-of-the-art decomposition algorithms. We used the non-coupled CP version of our method, S3CMTF-CP-opt, by setting $\mathcal{G}$ to be hyper-diagonal and not coupling any matrices. Fig 5 shows that S3CMTF is better than SALS-single in terms of both error and time for the MovieLens dataset. S3CMTF-TUCKER explicitly denotes S3CMTF-opt for the Tucker model.

Fig 5. Comparison with SALS-single for the MovieLens dataset.

We compare two non-coupled versions of S3CMTF, S3CMTF-CP-opt and S3CMTF-TUCKER-opt, with the parallel CP decomposition method SALS-single. In (a), we plot one mark per 20 iterations for clarity. (a) S3CMTF converges faster to a lower error than SALS-single does. (b) S3CMTF-CP-opt is 2.3× faster than SALS-single.

Scalability analysis

We present scalability of our proposed S3CMTF and competitors to answer Q2, in terms of two aspects: data scalability and parallel scalability. We use synthetic data of varying size for evaluation. As a result, we show the running time (for one iteration) of S3CMTF follows our theoretical analysis in Section.

Data scalability.

The time complexities of CMTF-Tucker-ALS and CMTF-OPT have and as their dominant terms, respectively. In contrast, S3CMTF exploits the sparsity of input data: its time complexity is linear in the number of entries (|ΩX|, |ΩY|) and independent of the dimensionality (I), as shown in Lemma 4. Figs 1A and 6A show that the running time (for one iteration) of S3CMTF on real world data sets follows our theoretical analysis in Section. First, we fix |ΩX| to 1M and |ΩY| to 100K, and vary the dimensionality I from 1K to 100M. Fig 1A shows the running time (for one iteration) of all methods with J = 10. Note that all of our proposed methods achieve constant running time as dimensionality increases because they exploit the sparsity of data by updating only the factors related to observed data entries. In contrast, CMTF-Tucker-ALS and CMTF-OPT show exponentially increasing running time, and CMTF-OPT shows O.O.M. when I = 10M. Next, we investigate the data scalability over the number of entries, as shown in Fig 6A. We fix I to 10K and raise |ΩX| from 10K to 100M. CMTF-Tucker-ALS shows O.O.M. when , and CMTF-OPT shows near-linear scalability. For S3CMTF, all three versions of our approach show a linear relation between running time and |ΩX|.

Fig 6. Comparison of scalability.

(a) S3CMTF shows linear scalability as the number of entries increases. (b) S3CMTF-base and S3CMTF-opt show linear speed up as the number of cores grows. O.O.M.: out of memory error.

Parallel scalability.

We conduct experiments to examine the parallel scalability of S3CMTF on shared memory systems. For measurement, we define speed up as (iteration time on 1 core)/(iteration time on the given number of cores). Fig 6B shows the linear speed up of S3CMTF-base and S3CMTF-opt. The slope of the parallel scalability curve is not one (which would mean perfectly parallelizable) because a growing number of cores leads to more concurrent read accesses to memory, which in turn leads to conflicts. S3CMTF-opt shows higher speed up than S3CMTF-base because it reduces read accesses to the core tensor by utilizing intermediate data.
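The speed-up measure defined above is simple to compute from per-iteration timings. The sketch below uses made-up timings purely for illustration; they are not measurements from the paper, and the per-core efficiency shown alongside is an additional quantity introduced here, not one the text reports.

```python
def speed_up(time_one_core, time_p_cores):
    """Speed-up as defined in the text: 1-core iteration time over p-core time."""
    if time_one_core <= 0 or time_p_cores <= 0:
        raise ValueError("iteration times must be positive")
    return time_one_core / time_p_cores


# Hypothetical per-iteration timings in seconds, keyed by number of cores.
# These numbers are invented for the example, not taken from Fig 6B.
timings = {1: 8.0, 2: 5.0, 4: 2.5, 8: 1.6}

speed_ups = {cores: speed_up(timings[1], t) for cores, t in timings.items()}

# Efficiency (speed-up per core) equals 1.0 only for perfectly parallel code;
# memory conflicts between cores push it below 1.0.
efficiency = {cores: s / cores for cores, s in speed_ups.items()}
```

A speed-up curve that grows linearly but with efficiency below 1.0 corresponds to the situation described above, where the slope of the curve is not one.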


In this section, we use S3CMTF to mine real-world Yelp data, answering question Q3 posed at the beginning of the previous section. First, we demonstrate that S3CMTF discerns business entities better than the naive decomposition method by jointly capturing spatial and categorical prior knowledge. Second, we show how S3CMTF can be applied to real recommender systems. It is an open challenge to jointly capture the spatio-temporal context along with user preference data [35]. We exemplify a personal recommendation for a specific user. For discovery, we use the total Yelp data tensor along with the coupled matrices explained in Table 4. For better interpretability, we find a non-negative factorization by applying the projected gradient method [36]. An orthogonality condition is not imposed, in order to keep non-negativity, and each column of the factors is normalized.


First, we compare the discernment of S3CMTF and the Tucker decomposition. We use the business factor U(2). Fig 7A shows the gap statistic values obtained by clustering business entities with the k-means clustering algorithm. The gap statistic is a theoretical tool to measure separability between k-means clusters [37]. A higher gap statistic means higher separability between clusters. S3CMTF shows higher gap statistic values than the Tucker decomposition, which means S3CMTF outperforms the naive Tucker decomposition for entity clustering with respect to the gap statistic.

Fig 7.

(a) Gap statistics on U(2) of S3CMTF and the Tucker decomposition for Yelp dataset. S3CMTF outperforms the naive Tucker decomposition for its clustering ability. (b) Visualization of the personal recommendation scenario.

As the difference between S3CMTF and the Tucker decomposition is in the existence of coupled matrices, the high performance of S3CMTF is attributed to the unified factorization using spatial and categorical data as prior knowledge. Table 5 shows the found clusters of business entities. Note that each cluster represents a certain combination of spatial and categorical characteristics of business entities.

Table 5. Clustering results on business factor U(2) found by S3CMTF.

We found dominant spatial and categorical characteristics from each cluster. Businesses in the same cluster tend to be in adjacent cities and are included in similar categories.

User-specific recommendation

Commercial recommendations are one of the most important applications of factorization models [4, 9]. Here we illustrate how factor matrices are used for personalized recommendations with a real example. Fig 7B shows the process for recommendation. Below, we illustrate the process in detail.

  • An example user, Tyler, has a factor vector u, namely the user profile, which has been calculated from previous review histories.
  • We then calculate the personalized profile matrix, which measures the amount of interaction of the user profile with the business and time factors.
  • Norm values of the rows of the personalized profile matrix indicate the influence of latent business concepts on Tyler. Dominant and weak concepts are found based on the calculated norm values. In the example, B4 is the dominant latent concept and B7 is the weak one.
  • We inspect the corresponding columns of the business factor matrix U(2) and find relevant business entities with high values for the found concepts (B4 and B7).
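The steps above can be sketched in code. This is a hedged illustration, not the authors' implementation: it assumes the personalized profile matrix is obtained by contracting the user mode of the core tensor with u, which the text does not spell out, and the function names and all toy values are hypothetical.

```python
import math


def personalized_profile(core, user_vec):
    """Contract the user mode of the core tensor with the user profile.

    Assumed definition: profile[q][r] = sum_p user_vec[p] * core[p][q][r],
    giving a (business concepts x time concepts) matrix.
    """
    n_business = len(core[0])
    n_time = len(core[0][0])
    return [
        [sum(user_vec[p] * core[p][q][r] for p in range(len(user_vec)))
         for r in range(n_time)]
        for q in range(n_business)
    ]


def row_norms(matrix):
    """Euclidean norm of each row (one row per latent business concept)."""
    return [math.sqrt(sum(x * x for x in row)) for row in matrix]


def top_entities(business_factor, concept, k=2):
    """Indices of the k entities with the largest value in one concept column."""
    order = sorted(range(len(business_factor)),
                   key=lambda e: business_factor[e][concept], reverse=True)
    return order[:k]


# Toy model: 2 user concepts, 3 business concepts, 2 time concepts.
core = [
    [[1.0, 0.0], [0.0, 0.0], [3.0, 4.0]],
    [[0.0, 2.0], [0.0, 0.0], [0.0, 0.0]],
]
user_profile = [1.0, 0.5]
# Non-negative business factor: 4 entities x 3 business concepts.
business_factor = [
    [0.1, 0.9, 0.0],
    [0.2, 0.1, 0.7],
    [0.7, 0.0, 0.3],
    [0.0, 0.8, 0.6],
]

profile = personalized_profile(core, user_profile)
norms = row_norms(profile)
dominant = max(range(len(norms)), key=norms.__getitem__)
weak = min(range(len(norms)), key=norms.__getitem__)

strong_entities = top_entities(business_factor, dominant)
weak_entities = top_entities(business_factor, weak)
```

The entities returned for the dominant concept play the role of the strong entities for the user, and those returned for the weak concept play the role of the weak entities.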

We found both strong and weak entities by the above process. The strong and weak entities provide recommendation information by themselves, in the sense that the user is likely to like strong entities and unlikely to like weak ones, and they also give extended user preference information. For example, strong entities for Tyler are related to ‘spa & health’ and located in neighboring cities of Arizona, US. Weak entities are related to ‘grill & restaurants’ and located in Toronto, Canada. The captured user preference information can make commercial recommender systems interpretable when combined with additional user-specific information such as address and current location.


We propose S3CMTF, a fast, accurate, and scalable CMTF method. S3CMTF provides up to 930× faster running times and the best accuracy through sparse CMTF with carefully derived update rules, lock-free parallel SGD, and reuse of intermediate computation results. S3CMTF shows linear scalability in the number of data entries and the number of parallel cores. Moreover, we show the usefulness of S3CMTF for cluster analysis and recommendation by applying it to real-world Yelp data. As a future improvement, recent advances in CP gradient algorithms [38, 39] could be applied to our method. Future work also includes extending the method to a distributed setting.


  1. Park N, Jeon B, Lee J, Kang U. BIGtensor: Mining Billion-Scale Tensor Made Easy. In: Proceedings of the International Conference on Information and Knowledge Management. ACM; 2016.
  2. Park N, Oh S, Kang U. Fast and Scalable Distributed Boolean Tensor Factorization. In: Data Engineering (ICDE), 2017 IEEE 33rd International Conference on. IEEE; 2017. p. 1071–1082.
  3. Oh S, Park N, Sael L, Kang U. Scalable Tucker Factorization for Sparse Tensors - Algorithms and Discoveries. In: Data Engineering (ICDE), 2018 IEEE 34th International Conference on. IEEE; 2018. p. 1120–1131.
  4. Koren Y, Bell R, Volinsky C. Matrix factorization techniques for recommender systems. Computer. 2009;42(8).
  5. Kolda TG, Bader BW. Tensor decompositions and applications. SIAM review. 2009;51(3):455–500.
  6. Ding C, Li T, Peng W. On the equivalence between non-negative matrix factorization and probabilistic latent semantic indexing. Computational Statistics & Data Analysis. 2008;52(8):3913–3927.
  7. Peng W, Li T. On the equivalence between nonnegative tensor factorization and tensorial probabilistic latent semantic analysis. Applied Intelligence. 2011;35(2):285–295.
  8. Xu W, Liu X, Gong Y. Document clustering based on non-negative matrix factorization. In: Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval. ACM; 2003. p. 267–273.
  9. Karatzoglou A, Amatriain X, Baltrunas L, Oliver N. Multiverse recommendation: n-dimensional tensor factorization for context-aware collaborative filtering. In: Proceedings of the fourth ACM conference on Recommender systems. ACM; 2010. p. 79–86.
  10. Rendle S, Schmidt-Thieme L. Pairwise interaction tensor factorization for personalized tag recommendation. In: Proceedings of the third ACM international conference on Web search and data mining. ACM; 2010. p. 81–90.
  11. Sael L, Jeon I, Kang U. Scalable tensor mining. Big Data Research. 2015;2(2):82–86.
  12. Acar E, Kolda TG, Dunlavy DM. All-at-once optimization for coupled matrix and tensor factorizations. arXiv preprint arXiv:1105.3422. 2011.
  13. Acar E, Rasmussen MA, Savorani F, Næs T, Bro R. Understanding data fusion within the framework of coupled matrix and tensor factorizations. Chemometrics and Intelligent Laboratory Systems. 2013;129:53–63.
  14. Narita A, Hayashi K, Tomioka R, Kashima H. Tensor factorization using auxiliary information. Data Mining and Knowledge Discovery. 2012;25(2):298–324.
  15. Ozcaglar C. Algorithmic data fusion methods for tuberculosis. Rensselaer Polytechnic Institute; 2012.
  16. Tucker LR. Some mathematical notes on three-mode factor analysis. Psychometrika. 1966;31(3):279–311. pmid:5221127
  17. Oh J, Shin K, Papalexakis EE, Faloutsos C, Yu H. S-HOT: Scalable High-Order Tucker Decomposition. In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM; 2017. p. 761–770.
  18. Hitchcock FL. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics. 1927;6(1-4):164–189.
  19. Sorber L, Van Barel M, De Lathauwer L. Structured data fusion. IEEE Journal of Selected Topics in Signal Processing. 2015;9(4):586–600.
  20. Kolda TG, Sun J. Scalable tensor decompositions for multi-aspect data mining. In: Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on. IEEE; 2008. p. 363–372.
  21. De Lathauwer L, De Moor B, Vandewalle J. On the best rank-1 and rank-(R1, R2, …, RN) approximation of higher-order tensors. SIAM journal on Matrix Analysis and Applications. 2000;21(4):1324–1342.
  22. Ermiş B, Acar E, Cemgil AT. Link prediction in heterogeneous data via generalized coupled tensor factorization. Data Mining and Knowledge Discovery. 2015;29(1):203–236.
  23. Yılmaz KY, Cemgil AT, Simsekli U. Generalised coupled tensor factorisation. In: Advances in neural information processing systems; 2011. p. 2151–2159.
  24. Khan SA, Leppäaho E, Kaski S. Bayesian multi-tensor factorization. Machine Learning. 2016;105(2):233–253.
  25. Jeon B, Jeon I, Sael L, Kang U. Scout: Scalable coupled matrix-tensor factorization-algorithm and discoveries. In: Data Engineering (ICDE), 2016 IEEE 32nd International Conference on. IEEE; 2016. p. 811–822.
  26. Jeon I, Papalexakis EE, Kang U, Faloutsos C. Haten2: Billion-scale tensor decompositions. In: Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE; 2015. p. 1047–1058.
  27. Papalexakis EE, Faloutsos C, Mitchell TM, Talukdar PP, Sidiropoulos ND, Murphy B. Turbo-smt: Accelerating coupled sparse matrix-tensor factorizations by 200x. In: Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM; 2014. p. 118–126.
  28. Beutel A, Talukdar PP, Kumar A, Faloutsos C, Papalexakis EE, Xing EP. Flexifact: Scalable flexible factorization of coupled tensors on hadoop. In: Proceedings of the 2014 SIAM International Conference on Data Mining. SIAM; 2014. p. 109–117.
  29. Jeon I, Papalexakis EE, Faloutsos C, Sael L, Kang U. Mining billion-scale tensors: algorithms and discoveries. The VLDB Journal. 2016;25(4):519–544.
  30. Shin K, Sael L, Kang U. Fully scalable methods for distributed tensor factorization. IEEE Transactions on Knowledge and Data Engineering. 2017;29(1):100–113.
  31. Bradley JK, Kyrola A, Bickson D, Guestrin C. Parallel coordinate descent for l1-regularized loss minimization. arXiv preprint arXiv:1105.5379. 2011.
  32. Recht B, Re C, Wright S, Niu F. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In: Advances in neural information processing systems; 2011. p. 693–701.
  33. Bottou L. Stochastic gradient descent tricks. In: Neural networks: Tricks of the trade. Springer; 2012. p. 421–436.
  34. Bader BW, Kolda TG. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing. 2007;30(1):205–231.
  35. Gao H, Tang J, Hu X, Liu H. Exploring temporal effects for location recommendation on location-based social networks. In: Proceedings of the 7th ACM conference on Recommender systems. ACM; 2013. p. 93–100.
  36. Lin CJ. Projected gradient methods for nonnegative matrix factorization. Neural computation. 2007;19(10):2756–2779. pmid:17716011
  37. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001;63(2):411–423.
  38. Vannieuwenhoven N, Meerbergen K, Vandebril R. Computing the gradient in optimization algorithms for the CP decomposition in constant memory through tensor blocking. SIAM Journal on Scientific Computing. 2015;37(3):C415–C438.
  39. Phan AH, Tichavský P, Cichocki A. Fast alternating LS algorithms for high order CANDECOMP/PARAFAC tensor factorizations. IEEE Transactions on Signal Processing. 2013;61(19):4834–4846.