Generic, network schema agnostic sparse tensor factorization for single-pass clustering of heterogeneous information networks

Heterogeneous information networks (e.g., bibliographic networks and social media networks), which consist of multiple types of interconnected objects, are ubiquitous. Clustering analysis is an effective method for understanding the semantic information and interpretable structure of heterogeneous information networks, and it has attracted the attention of many researchers in recent years. However, most studies assume that heterogeneous information networks follow some simple schema, such as a bi-typed network or star network schema, and they can cluster only one type of object in the network at a time. In this paper, a novel clustering framework based on sparse tensor factorization is proposed for heterogeneous information networks, which can cluster multiple types of objects simultaneously in a single pass without any network schema information. The object types and the relations between them are modeled as a sparse tensor. The clustering issue is modeled as an optimization problem similar to the well-known Tucker decomposition. An Alternating Least Squares (ALS) algorithm and a feasible initialization method are then proposed to solve the optimization problem. Based on the tensor factorization, we partition the different types of objects into different clusters simultaneously. Experimental results on both synthetic and real-world datasets demonstrate that our proposed clustering framework, STFClus, can model heterogeneous information networks efficiently and, as a generally applicable, network-schema-agnostic, single-pass clustering method, outperforms state-of-the-art clustering algorithms.


Introduction
Information networks are widely used to describe realistic applications in the cyber domain. Vertices in information networks map to the objects in real-world applications, and edges map to the relations between them. While the mining of information networks has been studied extensively, most existing clustering methods assume simple network schemas and handle only one object type at a time. To solve the problem of clustering heterogeneous information networks with general network schemas, or even without any network schema information (e.g., the Douban Movie network in Fig 1(b)), while clustering all types of objects simultaneously in a single pass, we propose a sparse tensor factorization based method called STFClus (Sparse Tensor Factorization based Clustering). We model a heterogeneous information network as a multi-way array, i.e., a tensor. Each object type maps onto one mode of the tensor, and the relations between different types of objects map onto the elements of the tensor. The main contributions of this paper are as follows:

1. We propose a novel clustering framework based on sparse tensor factorization, namely STFClus, which can cluster heterogeneous information networks with general network schemas or even without network schema information. A further advantage is that STFClus can cluster all types of objects simultaneously in a single pass.
2. The clustering issue based on tensor factorization is modeled as an optimization problem similar to the well-known Tucker decomposition [24,25]. We propose an Alternating Least Squares (ALS) [26] algorithm to solve the clustering problem.
3. In STFClus, only the nonzero tensor elements together with their tensor indices are handled, and no distance function is needed to measure the similarity between pairs of objects.
4. We discuss the implementation bottleneck of STFClus and propose a performance improvement that avoids calculating large-scale intermediate results. We also propose a feasible initialization method to start STFClus.
5. STFClus is tested on both synthetic and real-world networks. Experimental results show that STFClus outperforms state-of-the-art baselines on key performance indicators such as accuracy and efficiency.

Preliminaries
First, we introduce some related concepts and tensor notation used throughout this paper. More details on tensor algebra can be found in [27-29]. A tensor is a multi-dimensional array. The order of a tensor is its number of dimensions, also known as ways or modes. Following the convention in [27], scalars are denoted by lowercase letters, e.g., $a, b, c$; vectors (one mode) by boldface lowercase letters, e.g., $\mathbf{a}, \mathbf{b}, \mathbf{c}$; matrices (two modes) by boldface capital letters, e.g., $\mathbf{A}, \mathbf{B}, \mathbf{C}$; and tensors (three or more modes) by boldface calligraphic letters, e.g., $\mathcal{X}, \mathcal{Y}, \mathcal{Z}$. $\mathbf{a}_{r:}$ denotes the $r$th row of matrix $\mathbf{A}$, and $\mathbf{a}_{:r}$ denotes its $r$th column. Elements of a matrix or a tensor are denoted by lowercase letters with subscripts, i.e., the $(i_1, i_2, \cdots, i_N)$th element of an $N$th-order tensor $\mathcal{X}$ is denoted by $x_{i_1 i_2 \cdots i_N}$. Some common definitions for tensors, as used in [28], are set out below.

Definition 1 (Matricization) [28]. Matricization transforms an $N$th-order tensor into a matrix by arranging its elements in a particular order.
For example, the matricization of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ along the $n$th mode is denoted by $\mathbf{X}_{(n)} \in \mathbb{R}^{I_n \times (I_1 \cdots I_{n-1} I_{n+1} \cdots I_N)}$. A special case of matricization is vectorization, which transforms a tensor into a vector, i.e., all modes of the tensor become row modes. The vectorization of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted by $\vec{x} \in \mathbb{R}^{\prod_{n=1}^{N} I_n}$.
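As a concrete illustration (ours, not the paper's MATLAB code), mode-$n$ matricization and vectorization can be sketched in a few lines of NumPy. Note that different conventions order the columns of $\mathbf{X}_{(n)}$ differently; this sketch simply uses NumPy's default row-major flattening:

```python
import numpy as np

def matricize(X, n):
    """Mode-n matricization X_(n): mode n indexes the rows and the
    remaining modes are flattened into the columns."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

X = np.arange(24).reshape(2, 3, 4)   # a small 3rd-order tensor
X1 = matricize(X, 1)                 # shape (3, 2*4) = (3, 8)
x_vec = X.reshape(-1)                # vectorization: a length-24 vector
```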
Definition 2 (Hadamard product) [28]. The Hadamard product of two tensors with the same dimensions is the element-wise product. For $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, it is denoted by $\mathcal{X} * \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, with elements $(\mathcal{X} * \mathcal{Y})_{i_1 i_2 \cdots i_N} = x_{i_1 i_2 \cdots i_N}\, y_{i_1 i_2 \cdots i_N}$.

Definition 3 (Kronecker product) [28]. The Kronecker product of two matrices $\mathbf{A} \in \mathbb{R}^{I \times J}$ and $\mathbf{B} \in \mathbb{R}^{K \times L}$ is denoted by $\mathbf{A} \otimes \mathbf{B}$, a matrix of size $(IJ) \times (KL)$ defined as the block matrix whose $(i, j)$th block is $a_{ij}\mathbf{B}$.

Definition 4 (Inner product) [28]. The inner product of two tensors with the same dimensions, $\mathcal{X}, \mathcal{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, is denoted by $\langle \mathcal{X}, \mathcal{Y} \rangle$. It is the sum of all elements of their Hadamard product: $\langle \mathcal{X}, \mathcal{Y} \rangle = \sum_{i_1} \sum_{i_2} \cdots \sum_{i_N} (\mathcal{X} * \mathcal{Y})_{i_1 i_2 \cdots i_N}$.

Definition 5 (Frobenius norm) [28]. The Frobenius norm of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is defined as $\| \mathcal{X} \|_F = \sqrt{\langle \mathcal{X}, \mathcal{X} \rangle}$.

Definition 6 (Mode-n matrix product) [28]. The mode-$n$ matrix product of a tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ with a matrix $\mathbf{U} \in \mathbb{R}^{J \times I_n}$ is denoted by $\mathcal{X} \times_n \mathbf{U}$ and is of size $I_1 \times \cdots \times I_{n-1} \times J \times I_{n+1} \times \cdots \times I_N$. Its elements are given by $(\mathcal{X} \times_n \mathbf{U})_{i_1 \cdots i_{n-1}\, j\, i_{n+1} \cdots i_N} = \sum_{i_n=1}^{I_n} x_{i_1 i_2 \cdots i_N}\, u_{j i_n}$. The mode-$n$ matrix product is equivalent to first matricizing $\mathcal{X}$ along the $n$th mode, then computing the matrix product of $\mathbf{X}_{(n)}$ with $\mathbf{U}$, and finally folding the result back into a tensor.
Given an $N$th-order tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$, the Tucker decomposition [24] of $\mathcal{X}$ yields a core tensor $\mathcal{G}$ of specified size $J_1 \times J_2 \times \cdots \times J_N$, $J_n \le I_n$, and factor matrices $\mathbf{U}^{(n)} \in \mathbb{R}^{I_n \times J_n}$, $n = 1, 2, \cdots, N$, such that $\mathcal{X} \approx \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \times_3 \cdots \times_N \mathbf{U}^{(N)} = [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \cdots, \mathbf{U}^{(N)}]\!]$. The Tucker decomposition thus approximates a tensor as a series of mode-$n$ matrix products of a smaller core tensor with a factor matrix along each mode. In traditional Tucker decomposition, the factor matrices $\{\mathbf{U}^{(n)}\}_{n=1}^{N}$ are assumed to be orthogonal. We now give the definition of an information network, based on the work of Y. Sun and J. Han [3,5].
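The mode-$n$ product and the resulting Tucker reconstruction $[\![\mathcal{G}; \mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(N)}]\!]$ can be sketched in NumPy as follows (our illustration, not the paper's code):

```python
import numpy as np

def mode_n_product(X, U, n):
    """Mode-n product X x_n U: matricize along mode n, multiply by U,
    and fold the result back into a tensor."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    Y = (U @ Xn).reshape((U.shape[0],) + tuple(np.delete(X.shape, n)))
    return np.moveaxis(Y, 0, n)

def tucker_reconstruct(G, Us):
    """[[G; U1, ..., UN]]: apply each factor matrix along its mode."""
    X = G
    for n, U in enumerate(Us):
        X = mode_n_product(X, U, n)
    return X
```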
Definition 7 (Information network) [3]. An information network is a weighted graph defined on a set of objects belonging to $T$ types, denoted by $\mathcal{V} = \{\mathcal{V}_t\}_{t=1}^{T}$, a set of binary relations on $\mathcal{V}$, denoted by $E$, and a weight mapping function $W : E \to \mathbb{R}^{+}$. The information network is denoted by $G = (\mathcal{V}, E, W)$. In particular, when $T \ge 2$, the information network is called a heterogeneous information network; otherwise, it is called a homogeneous information network.
We denote the objects of type $\mathcal{V}_t$ by $\{u^t_n\}_{n=1}^{N_t}$, where $N_t$ is the number of objects of type $\mathcal{V}_t$, i.e., $N_t = |\mathcal{V}_t|$, for $t = 1, 2, \cdots, T$. The total number of objects in the network is $N = \sum_{t=1}^{T} N_t$. In particular, we need to place some restrictions on heterogeneous information networks in our work. Firstly, each edge in $E$ only connects objects of different types, i.e., an edge $e^{(t_a, t_b)}$ between types $\mathcal{V}_{t_a}$ and $\mathcal{V}_{t_b}$ requires $t_a \neq t_b$. Secondly, we assume that the heterogeneous information network $G = (\mathcal{V}, E, W)$ is undirected, i.e., $e^{(t_a, t_b)} = e^{(t_b, t_a)}$. It is noteworthy that many edges in real-world applications connect objects of the same type. An example is the friendship relation between users in the Douban Movie network, as shown in Fig 1(b). In this case, we can take a copy of this object type and let such edges appear only between the original type and its copy. In the Douban Movie network, we take a copy of users, denoted user_copy, and let the friendship relations appear only between user and user_copy. The revised network schema of the Douban Movie network is shown in Fig 2. In the following sections, the heterogeneous information network $G = (\mathcal{V}, E, W)$ complies with these restrictions unless otherwise stated.

Sparse tensor factorization based clustering
Tensor construction and sparse representation. The relationships in a heterogeneous information network carry semantic links in real-world applications, defined as follows:

Definition 8 (Relationship in a heterogeneous information network). Given a heterogeneous information network $G = (\mathcal{V}, E, W)$, a relationship $R$ is a connected sub-graph of $G$ that contains one object of each type.

For example, a semantic relation in a real-world bibliographic network (Fig 1(a), containing four types of objects {A, P, V, T}), "an Author $u^A_i$ writes a Paper $u^P_j$ published in the Venue $u^V_m$ and containing the Term $u^T_n$", can be represented by a relationship over these four objects. We use the subscripts of the objects in $R$ to mark the corresponding relationship; in this example, the relationship is marked $R_{i,j,m,n}$.
Let $\mathcal{X}$ be a $T$th-order tensor of size $N_1 \times N_2 \times \cdots \times N_T$, each mode of $\mathcal{X}$ representing one type of object in the network $G$. An arbitrary element $x_{n_1 n_2 \cdots n_T} \ge 0$, for $n_t = 1, 2, \cdots, N_t$, is the weight of the corresponding relationship $R_{n_1, n_2, \cdots, n_T}$ if it exists, and $0$ otherwise, i.e., $x_{n_1 n_2 \cdots n_T} = \boxtimes_{e \in R_{n_1, n_2, \cdots, n_T}} W(e)$, where $\boxtimes$ is an operation on the weights of all edges in $R_{n_1, n_2, \cdots, n_T}$; in the simplest case, $\boxtimes$ can be taken as, for example, the product of the edge weights. The network $G = (\mathcal{V}, E, W)$ can then be represented in tensor form as $\mathcal{X}$. The method for determining whether the relationship $R_{n_1, n_2, \cdots, n_T}$ actually exists is a matter of graph theory and is not discussed here.

Fig: An example of tensor construction from a given heterogeneous information network. On the left is the original network with three types of objects (yellow circle, blue square and red triangle); the cube on the right is the constructed 3rd-order tensor. The number within each object is the object identifier. Each element (black dot in the cube) of the tensor represents a relationship in the network (black dashed circle on the left).

To handle the sparse tensor $\mathcal{X}$, we use the coordinate format proposed in [30]. Assuming there are $J$ non-zero elements in $\mathcal{X}$, a vector $\mathbf{z} \in \mathbb{R}^{J}$ and a matrix $\mathbf{M} \in \mathbb{N}^{J \times T}$ represent, respectively, the value and the coordinates of each non-zero element of $\mathcal{X}$. The $j$th non-zero value is $z_j$, and its subscript is the $j$th row of $\mathbf{M}$, i.e., $\mathbf{m}_{j:}$. If $x_{n_1 n_2 \cdots n_T}$ is the $j$th non-zero element of $\mathcal{X}$, we have $\mathbf{m}_{j:} = [m_{j_1}, m_{j_2}, \cdots, m_{j_T}] = [n_1, n_2, \cdots, n_T]$ and $z_j = x_{n_1 n_2 \cdots n_T}$. In other words, $m_{j_t} = n_t$ means that the $t$th coordinate of the $j$th non-zero element of $\mathcal{X}$ is $n_t$, and the value of the $j$th non-zero element is $z_j$. This sparse representation of tensors is the same as the format implemented in the MATLAB Tensor Toolbox [31].
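To make the coordinate format concrete, here is a tiny NumPy sketch with made-up toy data (not from the paper) of the pair $(\mathbf{z}, \mathbf{M})$ and its dense counterpart:

```python
import numpy as np

# Toy 3rd-order tensor of size 4 x 3 x 2 (T = 3 object types).
# Each row of M is the coordinate vector of one non-zero element
# (one object index per type); z holds the corresponding weights.
M = np.array([[0, 0, 0],
              [1, 2, 1],
              [3, 1, 0]])        # J = 3 non-zero elements
z = np.array([1.0, 2.0, 0.5])

def to_dense(M, z, shape):
    """Expand the coordinate representation into a dense tensor."""
    X = np.zeros(shape)
    X[tuple(M.T)] = z
    return X

X = to_dense(M, z, (4, 3, 2))
```

Note that the paper (and the MATLAB Tensor Toolbox) uses 1-based indices, while this sketch uses NumPy's 0-based indexing.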
Problem formulation. Given a heterogeneous information network $G = (\mathcal{V}, E, W)$, the tensor representation $\mathcal{X}$ is usually large and sparse. We can use the sparse representation of $\mathcal{X}$, i.e., the vector of $J$ non-zero weights $\mathbf{z} \in \mathbb{R}^{J}$ and the corresponding coordinate matrix $\mathbf{M} \in \mathbb{N}^{J \times T}$. Each row of $\mathbf{M}$ can be treated as a relationship in the network, and the corresponding element of $\mathbf{z}$ is the weight of that relationship.
All the rows of the coordinate matrix $\mathbf{M} = [\mathbf{m}_{1:}^\top, \mathbf{m}_{2:}^\top, \cdots, \mathbf{m}_{J:}^\top]^\top$, $\mathbf{m}_{j:} = [m_{j_1}, m_{j_2}, \cdots, m_{j_T}]$, for $j = 1, 2, \cdots, J$, represent the input relationships in the network, which we want to partition into $K$ sub-tensors (clusters) $\{C_1, C_2, \cdots, C_K\}$. The vector $\mathbf{z} = [z_1, z_2, \cdots, z_J]$ is the weight vector of the input relationships. The centre of cluster $C_k$ is denoted by $\mathbf{c}_k = [c_{k_1}, c_{k_2}, \cdots, c_{k_T}]$, for $k = 1, 2, \cdots, K$. Let $y_j \in \{1, 2, \cdots, K\}$ be the associated unknown cluster label: $y_j = k$ means that $\mathbf{m}_{j:}$ belongs to the $k$th cluster, and $y_{j_t} = k'$ means that the component $m_{j_t}$ of $\mathbf{m}_{j:}$ (i.e., the $j_t$th object of type $\mathcal{V}_t$ in $G$) belongs to the $k'$th cluster. Generally speaking, a relationship (or sub-graph) in a heterogeneous information network may belong to several clusters; likewise, the objects in a relationship may belong to more than one cluster. We assume that there is a way to measure the probability that an object or relationship belongs to a specific cluster: let $p_{j_t,k} = P(y_{j_t} = k \mid m_{j_t})$ denote the probability that the $t$th component of the point $\mathbf{m}_{j:}$ belongs to the $k$th cluster, and $p_{j,k} = P(y_j = k \mid \mathbf{m}_{j:})$ the probability that the point $\mathbf{m}_{j:}$ belongs to the $k$th cluster.
A basic clustering approach minimizes the sum of differences between the individual relationships in each cluster and the corresponding cluster centres. The heterogeneous information network clustering problem can thus be formalized in the vectorized form as

$$\min \sum_{j=1}^{J} \sum_{k=1}^{K} p_{j,k}\, z_j^2\, \| \mathbf{m}_{j:} - \mathbf{c}_k \|^2, \quad \text{s.t.} \ \sum_{k=1}^{K} p_{j,k} = 1, \ p_{j,k} \in [0, 1], \ \forall j. \qquad (3)$$

In Eq (3), $z_j^2 > 0$ weights the contribution of each relationship, and $\mathbf{m}_{j:} - \mathbf{c}_k$ measures the deviation of the relationship from the cluster centre. Object-level clustering is formalized analogously:

$$\min \sum_{j=1}^{J} \sum_{k=1}^{K} p_{j_t,k}\, z_j^2\, ( m_{j_t} - c_{k_t} )^2, \quad \text{s.t.} \ \sum_{k=1}^{K} p_{j_t,k} = 1, \ p_{j_t,k} \in [0, 1], \ \forall t, \forall j. \qquad (4)$$

Eq (3) aims to cluster relationships in the heterogeneous information network, while Eq (4) partitions the different types of objects into $K$ clusters. Now we form $T$ matrices, denoted $\mathbf{U}^{(t)} \in \mathbb{R}^{N_t \times K}$, for $t = 1, 2, \cdots, T$. The element $u^{(t)}_{i,k} \in \mathbf{U}^{(t)}$, for $i = 1, 2, \cdots, N_t$; $t = 1, 2, \cdots, T$; $k = 1, 2, \cdots, K$, is defined as $u^{(t)}_{i,k} = p_{j_t,k}$ if $i = j_t$, and $u^{(t)}_{i,k} = 0$ otherwise. Then $u^{(t)}_{i,k}$ represents the probability that the $i$th object of type $\mathcal{V}_t$, i.e., $u^t_i$, belongs to the $k$th cluster, and the matrix $\mathbf{U}^{(t)} \in \mathbb{R}^{N_t \times K}$ is the projection matrix for the corresponding mode of $\mathcal{X}$. A new small tensor $\mathcal{G} \in \mathbb{R}^{K \times K \times \cdots \times K}$ ($T$ modes) is then used as the mixture coefficient among the different modes and clusters.
Let $\mathcal{G}$ be the core tensor and $\mathbf{U}^{(t)}$, $t = 1, 2, \cdots, T$, be the factor matrices; we can use $[\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \cdots, \mathbf{U}^{(T)}]\!]$ to approximate $\mathcal{X}$, i.e., $\mathcal{X} \approx [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \cdots, \mathbf{U}^{(T)}]\!]$. We can then formalize the clustering problem in a way that is similar to the Tucker decomposition in [32]:

$$\min \left\| \mathcal{X} - [\![\mathcal{G}; \mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \cdots, \mathbf{U}^{(T)}]\!] \right\|_F^2, \quad \text{s.t.} \ \sum_{k=1}^{K} u^{(t)}_{i,k} = 1, \ 0 \le u^{(t)}_{i,k} \le 1, \ \operatorname{rank}(\mathbf{U}^{(t)}) = K, \ \forall t. \qquad (5)$$

In Eq (5), $i = 1, 2, \cdots, N_t$; $t = 1, 2, \cdots, T$; $k = 1, 2, \cdots, K$; and $K < \min\{N_1, N_2, \cdots, N_T\}$ is the total number of clusters. The first constraint in Eq (5) guarantees that the probabilities of each object belonging to all clusters sum to 1. The second constraint stipulates that each probability lies in the range [0, 1]. The last constraint ensures that each factor matrix has full column rank, i.e., for any mode there is no empty cluster and no two clusters are identical.
In fact, Eq (5) achieves the results of Eqs (3) and (4) simultaneously; that is, it clusters the different types of objects and the relationships in a heterogeneous information network at the same time. The factor matrices $\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \cdots, \mathbf{U}^{(T)}$ are the cluster indication matrices for the $T$ object types, and the probability of relationship $R_{n_1, n_2, \cdots, n_T}$ belonging to the $k$th cluster is given by $g_{k,k,\cdots,k}\, u^{(1)}_{n_1,k} u^{(2)}_{n_2,k} \cdots u^{(T)}_{n_T,k}$, where $g_{k,k,\cdots,k} \in \mathcal{G}$ and $u^{(t)}_{n_t,k} \in \mathbf{U}^{(t)}$.

Algorithm for STFClus. The Alternating Least Squares (ALS) method is a common approach for solving the Tucker decomposition problem. It iteratively updates one factor matrix at each round while keeping the other factor matrices fixed. The proposed algorithm for STFClus is also an ALS method, consisting of two stages: factor matrix updating and core tensor updating. After the core tensor and all factor matrices are initialized, in the factor matrix updating stage all variables in Eq (5) are fixed except the mode-$t$ factor matrix $\mathbf{U}^{(t)}$; an approach similar to NMF is then applied to search for the optimal $\mathbf{U}^{(t)}$ minimizing the objective function. In the core tensor updating stage, the core tensor is updated using the optimal factor matrices obtained in the first stage. The two stages are iterated until the approximation error in the objective function no longer changes. Details of the tensor algebra and properties used in the algorithm can be found in [29].
In the factor matrix updating stage, each mode-$t$ factor matrix $\mathbf{U}^{(t)}$ is obtained while the core tensor and the other factor matrices are fixed. The objective function in Eq (5) can be rewritten by matricization of $\mathcal{X}$ along the $t$th mode as

$$\min \left\| \mathbf{X}_{(t)} - \mathbf{U}^{(t)} \mathbf{G}_{(t)} \left( \mathbf{U}^{(T)} \otimes \cdots \otimes \mathbf{U}^{(t+1)} \otimes \mathbf{U}^{(t-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^\top \right\|_F^2. \qquad (6)$$

If we assume that the optimal solution $\mathbf{U}^{(t)}$ satisfies all the constraints in Eq (5), then Eq (6) can be written as the following linear equation:

$$\mathbf{X}_{(t)} = \mathbf{U}^{(t)} \mathbf{S}_{(t)}. \qquad (7)$$

We denote by $\mathcal{S}$ the tensor $[\![\mathcal{G}; \mathbf{U}^{(1)}, \cdots, \mathbf{U}^{(t-1)}, \mathbf{U}^{(t+1)}, \cdots, \mathbf{U}^{(T)}]\!]$. Then $\mathcal{S} \in \mathbb{R}^{N_1 \times \cdots \times N_{t-1} \times K \times N_{t+1} \times \cdots \times N_T}$, and the matricization of $\mathcal{S}$ along the $t$th mode is

$$\mathbf{S}_{(t)} = \mathbf{G}_{(t)} \left( \mathbf{U}^{(T)} \otimes \cdots \otimes \mathbf{U}^{(t+1)} \otimes \mathbf{U}^{(t-1)} \otimes \cdots \otimes \mathbf{U}^{(1)} \right)^\top, \qquad (8)$$

where $\mathbf{S}_{(t)} \in \mathbb{R}^{K \times (N_1 \cdots N_{t-1} N_{t+1} \cdots N_T)}$. Eq (7) is now similar to the NMF problem in [33,34], i.e.,

$$\min \left\| \mathbf{V} - \mathbf{W}\mathbf{H} \right\|_F^2. \qquad (9)$$

Thus, we can use the NMF update rule in [34] to update $\mathbf{U}^{(t)}$ as follows:

$$\mathbf{U}^{(t)} \leftarrow \mathbf{U}^{(t)} * \frac{\mathbf{X}_{(t)} \mathbf{S}_{(t)}^\top}{\mathbf{U}^{(t)} \mathbf{S}_{(t)} \mathbf{S}_{(t)}^\top}, \qquad (10)$$

where $\frac{(\cdot)}{(\cdot)}$ denotes element-wise division of two matrices with the same size. Note that the factor matrices derived by Eq (10) do not satisfy the first and second constraints in Eq (5); to satisfy these two constraints, we normalize each row of the factor matrices.
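A minimal NumPy sketch of the multiplicative update followed by row normalization (our illustration; a small epsilon is added to the denominator to avoid division by zero, a detail the paper does not discuss):

```python
import numpy as np

def update_factor(Xt, U, St, eps=1e-12):
    """One multiplicative update of a factor matrix, in the spirit of Eq (10):
    U <- U * (Xt St^T) / (U St St^T), then normalize each row so the
    probability constraints of Eq (5) hold."""
    U = U * (Xt @ St.T) / (U @ (St @ St.T) + eps)
    return U / U.sum(axis=1, keepdims=True)
```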
In the core tensor updating stage, we keep all the factor matrices fixed and rewrite the objective function in Eq (5) by vectorization of $\mathcal{X}$ as

$$\min \left\| \vec{x} - \left( \mathbf{U}^{(T)} \otimes \cdots \otimes \mathbf{U}^{(2)} \otimes \mathbf{U}^{(1)} \right) \vec{g} \right\|_F^2, \qquad (12)$$

where $\vec{x}$ and $\vec{g}$ are the vectorizations of $\mathcal{X}$ and $\mathcal{G}$ respectively. We assume that all the factor matrices satisfy the constraints in Eq (5). The core tensor $\mathcal{G}$ in Eq (12) can then be obtained by solving the linear equation

$$\vec{x} = \mathbf{Q} \vec{g}, \qquad (13)$$

where we set

$$\mathbf{Q} = \mathbf{U}^{(T)} \otimes \cdots \otimes \mathbf{U}^{(2)} \otimes \mathbf{U}^{(1)}, \qquad (14)$$

with $\mathbf{Q} \in \mathbb{R}^{(\prod_{t=1}^{T} N_t) \times K^T}$. We can thus transform Eq (13) into an NMF model and use the NMF update rule in [34] to update $\vec{g}$:

$$\vec{g} \leftarrow \vec{g} * \frac{\mathbf{Q}^\top \vec{x}}{\mathbf{Q}^\top \mathbf{Q}\, \vec{g}}. \qquad (16)$$

By the properties of Kronecker products and vectorization operators (see [35]), $\mathbf{Q}^\top \vec{x}$ is the vectorization of $\mathcal{X} \times_1 \mathbf{U}^{(1)\top} \times_2 \cdots \times_T \mathbf{U}^{(T)\top}$, and $\mathbf{Q}^\top \mathbf{Q}\, \vec{g}$ is the vectorization of $\mathcal{G} \times_1 (\mathbf{U}^{(1)\top}\mathbf{U}^{(1)}) \times_2 \cdots \times_T (\mathbf{U}^{(T)\top}\mathbf{U}^{(T)})$. Eq (16) is therefore equivalent to the following update rule for the core tensor $\mathcal{G}$:

$$\mathcal{G} \leftarrow \mathcal{G} * \frac{\mathcal{X} \times_1 \mathbf{U}^{(1)\top} \times_2 \cdots \times_T \mathbf{U}^{(T)\top}}{\mathcal{G} \times_1 (\mathbf{U}^{(1)\top}\mathbf{U}^{(1)}) \times_2 \cdots \times_T (\mathbf{U}^{(T)\top}\mathbf{U}^{(T)})}. \qquad (20)$$

Feasibility and convergence analysis. First, we discuss the feasibility of STFClus.

Theorem 1: The STFClus optimization problem is equivalent to the optimization problem in Eq (4).
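The core update can be sketched with mode-n products only (our NumPy illustration of the rule in Eq (20), with a small epsilon guarding the element-wise division):

```python
import numpy as np

def mode_n_product(X, U, n):
    """Mode-n product: matricize along mode n, multiply by U, fold back."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    Y = (U @ Xn).reshape((U.shape[0],) + tuple(np.delete(X.shape, n)))
    return np.moveaxis(Y, 0, n)

def update_core(X, G, Us, eps=1e-12):
    """Multiplicative core update in the spirit of Eq (20):
    numerator   = X x_1 U1^T ... x_T UT^T
    denominator = G x_1 (U1^T U1) ... x_T (UT^T UT)."""
    num, den = X, G
    for n, U in enumerate(Us):
        num = mode_n_product(num, U.T, n)
        den = mode_n_product(den, U.T @ U, n)
    return G * num / (den + eps)
```

When $\mathcal{X}$ equals the exact Tucker reconstruction, the numerator and denominator coincide, so $\mathcal{G}$ is a fixed point of the update.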
Before giving the proof of Theorem 1, we first review the clustering problem defined in Eq (4). Eq (4) is a sparse form that partitions each object into different clusters; $p_{j_t,k}$ is the cluster indicator, giving the probability of an object belonging to the corresponding cluster. In matrix form, clustering the $t$th type of objects can be formalized as

$$\min \left\| \mathbf{X}_{(t)} - \mathbf{P}\mathbf{C} \right\|_F^2,$$

where $\mathbf{P}$ is the cluster indication matrix for the $t$th type of objects and $\mathbf{C}$ holds the cluster centres.

Proof. Since Eq (6) is the form of Eq (5) used for updating $\mathbf{U}^{(t)}$, Eq (5) can be rewritten as

$$\min \left\| \mathbf{X}_{(t)} - \mathbf{U}^{(t)} \mathbf{S}_{(t)} \right\|_F^2,$$

where $\mathbf{U}^{(t)}$ is the cluster indication matrix for the $t$th type of objects and $\mathbf{S}_{(t)}$ holds the cluster centres.
$\mathbf{X}_{(t)}$ is the matricization of $\mathcal{X}$ along the $t$th mode, and $\mathbf{M}$ is its sparse representation. Letting $\mathbf{P} = \mathbf{U}^{(t)}$ and $\mathbf{C} = \mathbf{S}_{(t)}$, Eq (5) has the same form as Eq (4). Next, we give the convergence analysis of STFClus. Since Lee and Seung have proven the convergence of NMF in [34], we cite Theorem 1 of [34] as Theorem 2 in this paper.
By extending Theorem 2 to the high-dimensional case, we can show that STFClus is stable: under the update rule in Eq (10), the objective function in Eq (5) is non-increasing, i.e.,

$$\left\| \mathcal{X} - [\![\mathcal{G}; \cdots, \mathbf{U}^{(t)}_{iter+1}, \cdots]\!] \right\|_F^2 \le \left\| \mathcal{X} - [\![\mathcal{G}; \cdots, \mathbf{U}^{(t)}_{iter}, \cdots]\!] \right\|_F^2,$$

where equality holds if and only if $\mathbf{U}^{(t)}_{iter+1} = \mathbf{U}^{(t)}_{iter}$ and $\mathbf{U}^{(t)}_{iter}$ is at a local minimum. This follows by substituting Eq (8) into the matricized objective, applying Theorem 2, and folding the result back into a tensor. By reversing the roles of $\mathbf{U}^{(t)}$ and $\mathcal{G}$, the update rule of the core tensor in Eq (20) can be proven similarly.

Implementation issues
Performance improvement. The bottleneck of STFClus lies in the calculation of $\mathbf{S}_{(t)}$. According to Eq (8), this requires the Kronecker product of $T - 1$ dense factor matrices. In fact, the Kronecker products need not be calculated. Using Eq (8), the terms $\mathbf{X}_{(t)} \mathbf{S}_{(t)}^\top$ and $\mathbf{S}_{(t)} \mathbf{S}_{(t)}^\top$ in Eq (10) can be rewritten as

$$\mathbf{X}_{(t)} \mathbf{S}_{(t)}^\top = \left( \mathcal{X} \times_1 \mathbf{U}^{(1)\top} \cdots \times_{t-1} \mathbf{U}^{(t-1)\top} \times_{t+1} \mathbf{U}^{(t+1)\top} \cdots \times_T \mathbf{U}^{(T)\top} \right)_{(t)} \mathbf{G}_{(t)}^\top, \qquad (21)$$

$$\mathbf{S}_{(t)} \mathbf{S}_{(t)}^\top = \mathbf{G}_{(t)} \left( \left( \mathbf{U}^{(T)\top}\mathbf{U}^{(T)} \right) \otimes \cdots \otimes \left( \mathbf{U}^{(t+1)\top}\mathbf{U}^{(t+1)} \right) \otimes \left( \mathbf{U}^{(t-1)\top}\mathbf{U}^{(t-1)} \right) \otimes \cdots \otimes \left( \mathbf{U}^{(1)\top}\mathbf{U}^{(1)} \right) \right) \mathbf{G}_{(t)}^\top, \qquad (22)$$

where the Kronecker products in Eq (22) involve only small $K \times K$ matrices.
In this way, via Eqs (21) and (22), we can compute Eq (10) directly and update $\mathbf{U}^{(t)}$ without forming $\mathbf{S}_{(t)}$; in other words, we no longer need to compute the large Kronecker products round by round. Algorithm 1 gives the pseudo-code of STFClus.
Algorithm 1: STFClus.
1. Input: relationship tensor $\mathcal{X}$, number of clusters $K$, initial guesses for $\{\mathbf{U}^{(t)}\}_{t=1}^{T}$ and $\mathcal{G}$, and convergence threshold $\varepsilon$
2. repeat
3.   for $t \leftarrow 1$ to $T$:
4.     Update $\mathbf{U}^{(t)}$ according to Eqs (10), (21) and (22), and normalize its rows
5.   Update $\mathcal{G}$ according to Eq (20)
6. until the change in the approximation error is below $\varepsilon$
7. Output: $\{\mathbf{U}^{(t)}\}_{t=1}^{T}$ and $\mathcal{G}$

Initialization. In the STFClus algorithm, the initial guess of the core tensor and factor matrices has a large impact on the final result. The best initialization method may vary between real-world datasets. In general, each mode of the input tensor has its own physical meaning, and each element of the input tensor represents a relationship among the different modes. STFClus aims to cluster all modes of the input tensor simultaneously by utilizing these relationships. Each factor matrix is the cluster indication matrix for one mode of the input tensor, and the core tensor is the mixture coefficient among the different factor matrices.
As with the cluster indication matrices, the factor matrices should meet the constraints stated in Eq (5). It is clear that a factor matrix satisfying all the constraints is not unique, and the strategies for initialization are diversified. Of course, we can use random initialization, as it is simple and rapid. However, random initialization may lead to an increase in the number of iterations or even result in an unacceptably slow convergence speed.
Therefore, we propose a feasible method for initializing the factor matrices, similar to the traditional K-means method, called STFClus_initial. We first cluster each mode of the input tensor independently to initialize the corresponding mode's factor matrix; the core tensor can then be determined uniquely from the factor matrices.
Since STFClus_initial works in the same way on every mode of the input tensor, we describe the process for a single mode. Without loss of generality, consider the $t$th mode, which represents the $t$th type of objects in the heterogeneous information network. STFClus_initial on the $t$th mode can then be formalized as: given the tensor $\mathcal{X}$ of the heterogeneous information network, partition the $t$th mode of $\mathcal{X}$ into $K$ clusters.
The key aspect of STFClus_initial is how it measures the similarity between different objects. In the sparse representation of the input tensor, $\mathbf{M} = [\mathbf{m}_{1:}^\top, \cdots, \mathbf{m}_{J:}^\top]^\top$, the rows containing a given object of the $t$th type record the relationships in which that object participates, and the similarity between two objects of the $t$th type (Eq (23)) is defined from these rows. Here $|\cdot|$ denotes the cardinality of a set, and the function $sam(\cdot)$ counts the total number of identical components in corresponding columns (except the $t$th column) of two matrices. For two matrices $\mathbf{A} \in \mathbb{R}^{r_1 \times l}$ and $\mathbf{B} \in \mathbb{R}^{r_2 \times l}$ with the same number of columns, and a natural number $t \le l$,

$$sam(\mathbf{A}, \mathbf{B}, t) = \sum_{i=1}^{r_1} \sum_{j=1}^{r_2} \left| \{ c \mid c \ne t,\ a_{ic} = b_{jc} \} \right|. \qquad (24)$$

From Eq (23), the similarity function is symmetric and bounded. We can also define the similarity between an object and a cluster as the weighted sum of the similarities between the object and each object in the cluster; the probability of an object belonging to the corresponding cluster can then be calculated by normalizing these similarities over the $K$ clusters. STFClus_initial on the $t$th mode of the input tensor can be summarized as follows:
• Step 1: Randomly choose $K$ objects of the $t$th type as the initial clusters $\{O^t_1, O^t_2, \cdots, O^t_K\}$, requiring that the similarity between any two of the $K$ objects is not equal to one.
• Step 2: Assign each object of the $t$th type to clusters according to its similarity to each cluster, update the cluster membership probabilities, and repeat for a fixed number of iterations (iterNum).
In practice, the algorithm converges in fewer than 3 iterations in most cases. Since $\mathbf{U}^{(t)}$ is only the initial guess for STFClus and will be updated by STFClus itself, we can set iterNum = 2.
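One plausible reading of the $sam(\cdot)$ count is the following Python sketch (ours, assuming 0-based column indexing; the exact form in the paper may differ):

```python
def sam(A, B, t):
    """Count matching components in corresponding columns of two
    coordinate matrices A and B, ignoring column t."""
    count = 0
    for row_a in A:
        for row_b in B:
            count += sum(1 for c in range(len(row_a))
                         if c != t and row_a[c] == row_b[c])
    return count
```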
After obtaining the initialization of $\mathbf{U}^{(t)}$, for $t = 1, 2, \cdots, T$, the core tensor $\mathcal{G}$ is determined uniquely by the factor matrices. From the objective function in Eq (5), the core tensor is

$$\mathcal{G} = \mathcal{X} \times_1 \mathbf{U}^{(1)\dagger} \times_2 \mathbf{U}^{(2)\dagger} \times_3 \cdots \times_T \mathbf{U}^{(T)\dagger}, \qquad (26)$$

where the superscript '†' denotes the Moore-Penrose pseudo-inverse. The last constraint in Eq (5) ensures that $\mathbf{U}^{(t)}$ has full column rank, i.e., its columns are linearly independent, so the Moore-Penrose pseudo-inverse can be calculated as

$$\mathbf{U}^{(t)\dagger} = \left( \mathbf{U}^{(t)\top} \mathbf{U}^{(t)} \right)^{-1} \mathbf{U}^{(t)\top}. \qquad (27)$$

The pseudo-code of STFClus_initial is given in Algorithm 2 (STFClus_initial: an initialization algorithm for STFClus).

Time complexity analysis. The time complexity of the proposed method comprises two parts: STFClus_initial and STFClus. First, in STFClus_initial, we need to calculate the initial guesses for the factor matrices and the core tensor. For the factor matrix initialization, we compute the similarity between each pair of non-zero elements of the tensor, i.e., the relationships in the heterogeneous information network, each containing $T$ objects. So the time complexity of factor matrix initialization is $O(TJ^2)$, where $J$ is the number of non-zero elements in the tensor. For the core tensor initialization, according to Eqs (26) and (27), we need to compute the Moore-Penrose pseudo-inverse of each factor matrix and the mode-$n$ matrix products of the tensor $\mathcal{X}$ with all factor matrices. The time complexity of computing the pseudo-inverses of all factor matrices is $O(2K^2N + TK^3)$, where $N = \sum_{t=1}^{T} N_t$ is the total number of objects in the network, so the time complexity of core tensor initialization is $O(TKJ + 2K^2N + TK^3)$. Therefore, the total time complexity of STFClus_initial is $O(TJ^2 + TKJ + 2K^2N + TK^3)$.
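The initialization of Eqs (26) and (27) can be sketched directly with NumPy's pseudo-inverse (our illustration, not the S2 File code):

```python
import numpy as np

def mode_n_product(X, U, n):
    """Mode-n product: matricize along mode n, multiply by U, fold back."""
    Xn = np.moveaxis(X, n, 0).reshape(X.shape[n], -1)
    Y = (U @ Xn).reshape((U.shape[0],) + tuple(np.delete(X.shape, n)))
    return np.moveaxis(Y, 0, n)

def init_core(X, Us):
    """G = X x_1 pinv(U1) x_2 ... x_T pinv(UT), as in Eq (26)."""
    G = X
    for n, U in enumerate(Us):
        G = mode_n_product(G, np.linalg.pinv(U), n)
    return G
```

For a full-column-rank factor matrix, `np.linalg.pinv(U)` coincides with the closed form $(\mathbf{U}^\top \mathbf{U})^{-1} \mathbf{U}^\top$ of Eq (27).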
For heterogeneous information networks, $T$ is the number of object types, $K$ the number of clusters, $J$ the number of relationships and $N$ the total number of objects, with $T \ll J$, $T \ll N$, $K \ll J$ and $K \ll N$. To show this more clearly, the time complexity of STFClus_initial can be summarized as $O(a_1 J^2 + a_2 J + a_3 N + a_4)$ and that of STFClus as $O(b_1 J + b_2 N + b_3)$, where $a_1, a_2, a_3, a_4, b_1, b_2, b_3$ are all constants. Thus the time complexity of STFClus_initial is proportional to the number of objects and to the square of the number of relationships in the network, while the time complexity of STFClus is almost a linear function of the number of objects and relationships in the network.

Experiments and results
In this section, we present several experiments on synthetic and real-world datasets for heterogeneous information networks, and compare the performance of our method, STFClus, with a number of state-of-the-art clustering methods.
All experiments are implemented in MATLAB R2015a (version 8.5.0), 64-bit. The synthetic datasets are generated by the synthetic dataset generation code shown in the S1 File. The real-world datasets are all publicly available online. The MATLAB code for the STFClus_initial and STFClus algorithms is shown in the S2 File and the S3 File respectively. All source code is available online at https://github.com/tianshuilideyu/STFClus. The MATLAB Tensor Toolbox (version 2.6, http://www.sandia.gov/~tgkolda/TensorToolbox/) is used in our experiments. All experimental results are averages over ten runs of the algorithms on the corresponding datasets, providing a reliable picture of the performance under different parameters and algorithms.

Dataset description
The synthetic datasets. The purpose of using synthetic datasets is to verify the level of performance STFClus can deliver: since the detailed cluster structures of the synthetic datasets are known, the performance of STFClus under different parameters can be evaluated quantitatively.
The synthetic datasets are generated with the following parameters: • T: the number of object types in the heterogeneous information network. It is also the number of modes in the tensor.
• K: the number of clusters.
• D: the density of the tensor, i.e., the percentage of nonzero elements in the tensor, $D = J/S$, where $J$ is the number of nonzero elements and $S$ is the total number of elements in the tensor (the network scale).
• O: whether the clusters are overlapping, denoted by 1 (yes) or 0 (no).
To make the synthetic datasets resemble realistic situations, we assume the distribution of the different types of objects appearing in a relationship follows Zipf's law (see https://en.wikipedia.org/wiki/Zipf%27s_law). Zipf's law is defined by

$$f_t(r; \rho_t, N_t) = \frac{1/r^{\rho_t}}{\sum_{n=1}^{N_t} (1/n^{\rho_t})},$$

where $N_t$ is the number of objects of the $t$th type, $r$ is the object index, and $\rho_t$ is the parameter characterizing the distribution. $f_t$ gives the frequency with which the $r$th object of the $t$th type appears in a relationship. With the parameters above, we can construct different synthetic datasets for different experiments. Experiment A on synthetic datasets: in order to evaluate the performance quantitatively under different $D$ and $S$, we fix $T = 4$, $K = 2$ and $O = 1$, and set $\rho_1 = 0.95$, $\rho_2 = 1.01$, $\rho_3 = 0.99$ and $\rho_4 = 1.05$. We then construct four datasets of different scales, with $S$ = 2.5K, 250K, 2.5M and 25M respectively. For each network, we set densities $D$ = 0.5%, 1%, 5% and 10% respectively. See details in Table 1.
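Sampling object indices from this Zipf distribution can be sketched as follows (our Python, not the S1 File generation code):

```python
import numpy as np

def zipf_probs(N, rho):
    """Probability of the r-th object (r = 1..N) under Zipf's law
    with exponent rho, normalized over the N objects."""
    r = np.arange(1, N + 1)
    w = 1.0 / r**rho
    return w / w.sum()

rng = np.random.default_rng(0)
p = zipf_probs(100, 0.95)                            # e.g. rho_1 = 0.95
idx = rng.choice(np.arange(1, 101), size=1000, p=p)  # sampled object indices
```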
The real-world datasets. In order to test the performance of STFClus in real-world scenarios, one medium-scale real-world dataset and two large-scale real-world datasets are used, and the details are summarized in Table 3.
The first real-world dataset is extracted from the DBLP database and is called the DBLP-four-areas dataset; it contains ground-truth cluster labels for some objects. It is the four-research-areas subset of DBLP used in [3-6, 8, 13, 36] and is publicly available online. Each non-zero element in the 4-mode tensor represents a relationship, or sub-network, in DBLP: one author wrote a paper published at a conference and containing a specific term. We compare the performance of STFClus with several other state-of-the-art methods on the labelled records in this dataset. The second real-world dataset is the DBLP database itself (downloaded from http://dblp.uni-trier.de/xml/ in August 2015), called the DBLP-full-areas dataset, which covers all research areas in computer science. It includes four types of objects: Author, Paper, Venue (conferences or journals) and Term, organized in a star network schema, as shown in Fig 1(a). In the DBLP database, papers may come from journals, conferences, books, web pages and so on. Compared with the DBLP-four-areas dataset, the increased size of the modes leads to an explosion of relationships (non-zero elements) in the generated tensor, although it is still very sparse.
For an additional case study, we use the Douban Movie network collected by Chuan Shi [37], which can be downloaded from https://github.com/zzqsmall/SemRec/tree/master/data. The Douban Movie network follows a general network schema, as shown in Fig 1(b), and includes 12,677 movies, 6,311 actors, 2,449 directors, 38 movie types, 13,367 users and 2,753 user groups. In addition to the attribute information of users and movies, it also includes social relations among users and recommendation actions between users and movies. The records of users, movies, directors and actors in this dataset are anonymous. To meet the restrictions on heterogeneous information networks in our work, we take a copy of users, denoted user_copy, and organize the seven types of objects according to the network schema shown in Fig 2. The density of the Douban Movie network is 1.20416 × 10⁻¹⁶, so we construct a very large-scale 7-mode tensor of size 12,677 × 6,311 × 2,449 × 38 × 13,367 × 2,753 × 13,367 with 441,008,031 non-zero elements. Each non-zero element in this 7-mode tensor represents a user (with social relation information) recommending a movie (with its attribute information).

Evaluation metrics
In order to compare the clustering results with other state-of-the-art clustering methods for heterogeneous information networks, we adopt the Normalized Mutual Information (NMI) [38] and Accuracy (AC) as our performance measurements.
NMI is used to measure the mutual dependence between the clustering result and the ground truth. Given N objects, K clusters, one clustering result and the ground-truth classes of the objects, let n(i, j), i, j = 1, 2, ..., K, be the number of objects labelled i in the clustering result that belong to the j-th ground-truth class. The joint distribution can be defined as p(i, j) = n(i, j)/N, the marginal distribution of rows can be calculated as p_1(j) = \sum_{i=1}^{K} p(i, j), and the marginal distribution of columns can be calculated as p_2(i) = \sum_{j=1}^{K} p(i, j). Then, the NMI is defined as:

NMI = \frac{\sum_{i=1}^{K} \sum_{j=1}^{K} p(i, j) \log \frac{p(i, j)}{p_1(j)\, p_2(i)}}{\sqrt{\left( \sum_{j=1}^{K} p_1(j) \log p_1(j) \right) \left( \sum_{i=1}^{K} p_2(i) \log p_2(i) \right)}}

The NMI ranges from 0 to 1; the larger the value of NMI, the better the clustering result. AC is used to compute the clustering accuracy, i.e., the percentage of correctly clustered objects. AC is defined as:

AC = \frac{1}{N} \sum_{n=1}^{N} \delta\left( map(u_n^t), label(u_n^t) \right)

where map(u_n^t) is the cluster label of the object u_n^t and label(u_n^t) is the ground-truth class of the object u_n^t. The \delta(\cdot, \cdot) is an indicator function:

\delta(x, y) = \begin{cases} 1, & x = y \\ 0, & x \neq y \end{cases}

Since both NMI and AC measure the performance of clustering one type of object, the NMI and AC weighted by the number of objects of each type are also used to measure the performance of STFClus and the other state-of-the-art methods.
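For concreteness, the two metrics can be implemented directly from the definitions above. This is a minimal Python sketch, not the authors' code; in particular, `mapping` stands in for the cluster-to-class assignment, which is usually obtained with the Hungarian algorithm:

```python
import math
from collections import Counter

def nmi(pred, truth):
    """Normalized Mutual Information between a clustering and the ground
    truth, following the joint/marginal distributions defined above."""
    n = len(pred)
    joint = Counter(zip(pred, truth))
    p = {k: v / n for k, v in joint.items()}
    p1 = Counter()  # marginal over ground-truth classes j
    p2 = Counter()  # marginal over cluster labels i
    for (i, j), pij in p.items():
        p2[i] += pij
        p1[j] += pij
    mi = sum(pij * math.log(pij / (p2[i] * p1[j])) for (i, j), pij in p.items())
    h1 = -sum(q * math.log(q) for q in p1.values())
    h2 = -sum(q * math.log(q) for q in p2.values())
    return mi / math.sqrt(h1 * h2) if h1 > 0 and h2 > 0 else 0.0

def accuracy(pred, truth, mapping):
    """AC: fraction of objects whose mapped cluster label equals the true
    class. `mapping` (cluster label -> class label) is assumed to be the
    best assignment, e.g. from the Hungarian algorithm."""
    return sum(mapping[c] == t for c, t in zip(pred, truth)) / len(pred)
```

A perfect clustering gives NMI = 1 and AC = 1 under the matching assignment.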

Experimental results
STFClus on synthetic datasets. Experiment A: In order to evaluate the performance quantitatively under different densities D and network scales S, STFClus is tested on the datasets in Table 1. It should be noted that, since the running time of the STFClus_initial algorithm on Syn_a4 with D = 5% and D = 10% is unacceptable, we use the random initialization method to initialize the factor matrices on Syn_a4 with D = 5% and D = 10%. We also find that STFClus sporadically fails to converge when started with random initialization; in fact, non-convergence occurs two or three times out of ten. In addition, the iteration number and running time of STFClus increase with network scale and density. Fig 5 shows the AC and NMI of STFClus on the synthetic datasets in Table 1. We find that with increasing density, both AC and NMI increase and approach 1. This means that as the network density increases, useful relationships in the network become richer, and the clustering results come closer to the ground truth. When D = 0.5%, both AC and NMI on the four synthetic datasets are low, since too few useful relationships exist in the network. Generally, larger network scales and densities result in greater iteration numbers and running times, but offer higher accuracy and quality of clustering results.
To conclude, the use of synthetic networks in experiment A demonstrates that STFClus can work well on large-scale and sparse heterogeneous information networks.
Experiment B: In order to evaluate the performance quantitatively with different numbers of object types T and various overlapping states O, we apply the STFClus method to the datasets in Table 2. In fact, there are 8 synthetic datasets grouped into 4 differently scaled networks, since each synthetic dataset has both an overlapping and a non-overlapping variant. The experimental results are shown in Figs 6 and 7. Fig 6 shows the iteration number and running time of STFClus on the synthetic datasets in Table 2. With an increasing number of object types at the same network scale, the iteration number and running time increase. When the clusters are non-overlapping, the iteration number is usually smaller than when the clusters are overlapping. In Fig 6, both the iteration number and running time increase abruptly when T = 8. There are two possible reasons for this. First, the more object types T in the network, the more dimensions the tensor possesses. This means that the number of factor matrices and the scale of the core tensor become larger as the number of object types T increases. Second, when the network scale and density are fixed, the number of objects of each type, i.e., N_t, decreases as the number of object types T increases. This phenomenon can be seen in Table 2: when T = 8, N_t falls below ten. Fixed network scale and density mean that the non-zero elements in the tensor are unchanged; in other words, the number of relationships in the network remains constant. With an increase in object types T, each relationship becomes more complex, i.e., each relationship contains more objects, and the frequency with which each object appears in relationships increases. Fig 7 presents the AC and NMI results of STFClus on the synthetic datasets in Table 2. Both AC and NMI increase, reaching 1, as the number of object types grows.
This means that when the network scale and density are fixed, the accuracy of the clustering results improves with an increasing number of object types T in the network. We can also see that the clustering results on non-overlapping clusters are better when T = 2 and T = 4; however, this advantage disappears when T = 6 and T = 8. That is to say, when the number of object types T is small, STFClus produces better results on non-overlapping clusters, but when T becomes sufficiently large, the clustering results of STFClus on both overlapping and non-overlapping networks are satisfactory. This is because with more object types in the network, the relationships expose more useful information about each object.
Overall, experiment B shows that STFClus works better on networks with more object types. When the number of object types is sufficiently large, STFClus can handle networks with overlapping or non-overlapping clusters equally well.
STFClus on DBLP-four-areas dataset. In this section, the clustering performance of STFClus on the DBLP-four-areas dataset is compared with a number of state-of-the-art clustering methods as follows: 1. NetClus [3]: This is an extended version of RankClus [1], which can deal with networks following the star network schema.
2. PathSelClus [6,8]: This is a clustering method based on a pre-defined symmetric meta-path, requiring user guidance. In PathSelClus, the distance between objects of the same type is measured by PathSim [5], and the method starts with seeds given by the user. 3. FctClus [4]: This is a recently proposed clustering method for heterogeneous information networks. As with NetClus, the FctClus method can deal with networks following the star network schema.
As the baseline methods can only deal with heterogeneous information networks of a specific schema, here we must construct different sub-networks for them. For NetClus and FctClus, we use all four modes, but they are organized as a star network schema [3,4], where the paper (P) is the centre type, and author (A), conference (C) and term (T) are the attribute types. For PathSelClus, we also use the four modes: author (A), paper (P), conference (C) and term (T). However, we select the symmetric meta-path of P-T-P, A-P-C-P-A and C-P-T-P-C to cluster the papers, authors and conferences respectively, and in PathSelClus, we give each cluster one seed to start.
Since STFClus does not need any network schema information, we model the DBLP-four-areas dataset as a 4-mode tensor, where each mode represents one object type. The 4 modes are author (A), paper (P), conference (C) and term (T), respectively; the order of the object types is immaterial. Each element of the tensor represents a relationship among the four types of objects, and we use the sparse representation of the tensor. The AC, NMI and running time of STFClus and the three baseline methods on the DBLP-four-areas dataset are summarized in Tables 4-6. From the experimental results, we can see that STFClus performs best on AC and NMI, while PathSelClus gives the best running time.
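The sparse representation amounts to storing only the index tuples of observed relationships. A minimal sketch of this construction (toy records, not the actual DBLP data; `build_coo` is our own illustrative helper):

```python
# Sparse COO representation of the 4-mode (Author, Paper, Conference, Term)
# tensor: each relationship "author a wrote paper p at venue c containing
# term t" becomes one non-zero entry. Integer id maps are built on the fly.

def build_coo(records):
    """records: iterable of (author, paper, conf, term) tuples.
    Returns the tensor shape and a dict of non-zero entries."""
    index = [{} for _ in range(4)]          # one id map per mode
    entries = {}                            # (i, j, k, l) -> count
    for rec in records:
        idx = tuple(m.setdefault(v, len(m)) for m, v in zip(index, rec))
        entries[idx] = entries.get(idx, 0) + 1
    shape = tuple(len(m) for m in index)
    return shape, entries

# Hypothetical toy records, not drawn from the DBLP dataset itself.
shape, entries = build_coo([
    ("Alice", "P1", "KDD", "clustering"),
    ("Alice", "P1", "KDD", "tensor"),
    ("Bob",   "P2", "ICDM", "clustering"),
])
```

Only the three observed index tuples are stored, regardless of how large the full tensor would be.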
Though STFClus has the longest running time in the experiment, it obtains the clusters of all types of objects simultaneously, while the baselines can only cluster one type of object at a time. This is why only the total time is shown for STFClus in Table 6. NetClus performs worst on AC and NMI, achieving only 71.86% on AC and 55.03% on NMI. However, an important advantage of NetClus is that the ranking of objects in each cluster is obtained while clustering the objects. PathSelClus performs better than NetClus on AC and NMI, and it has an advantage too: based on PathSim [5], PathSelClus can rapidly measure the similarity between any two objects of the same type using the pre-defined symmetric meta-path. PathSelClus also delivers the best running time. However, the results of PathSelClus strongly depend on the choice of meta-path and seeds given by users.
Case studies on DBLP-full-areas dataset and Douban Movie Network. Since there is no ground truth for the cluster labels of the DBLP-full-areas dataset and the Douban Movie Network, we cannot adopt AC and NMI to measure the performance of STFClus. From Table 3, we can see that the tensors constructed from both the DBLP-full-areas dataset and the Douban Movie Network are large-scale but of low density. Without the sparse representation, the storage of the full tensors would reach exabyte, or even zettabyte, levels. Such large-scale tensors are currently unrealistic for memory access and retrieval, and storing an entire tensor of this size is infeasible on most PCs. Although the storage space can be reduced to gigabyte (or even megabyte) levels by using the sparse representation, and the computation of Kronecker products is avoided in STFClus, the intermediate results during the tensor decomposition may be much larger than the final result, which can lead to memory overflows.
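The scale gap is easy to verify from the tensor dimensions. A back-of-the-envelope sketch, assuming 8-byte values and 8-byte indices and a hypothetical non-zero count of 10^8 (the actual count is given in Table 3):

```python
def dense_bytes(shape, elem=8):
    """Storage for a dense tensor of 8-byte floats."""
    total = 1
    for d in shape:
        total *= d
    return total * elem

def coo_bytes(nnz, modes, idx=8, val=8):
    """Storage for a sparse COO tensor: one index per mode plus the value."""
    return nnz * (modes * idx + val)

# DBLP-full-areas tensor dimensions from the text: dense storage lands in
# the zettabyte range, while a COO layout with a hypothetical 10^8
# non-zero elements needs only a few gigabytes.
dblp_shape = (952_214, 1_237_709, 1_534, 192_995)
print(dense_bytes(dblp_shape) / 1e21)               # zettabytes
print(coo_bytes(10**8, len(dblp_shape)) / 1e9)      # gigabytes
```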
To resolve the issues arising from such large-scale operations, we adopt the method introduced in [39]: the tensors constructed from both the DBLP-full-areas dataset and the Douban Movie Network are divided into a grid of multiple smaller-scale sub-tensors, STFClus is applied to all sub-tensors, and the results are re-assembled into the factors of the original tensors. In the experiment, the Matlab Distributed Computing Server toolbox and the Parallel Computing Toolbox are used, and all experiments run on a parallel system with 8 labs (workers).
For the DBLP-full-areas dataset, we divided the tensor of size 952,214 × 1,237,709 × 1,534 × 192,995 into a 1,000 × 1,000 × 100 × 1,000 grid consisting of 10^11 sub-tensors. We find that more than 99.98% of the sub-tensors are zero tensors, i.e., all of their elements are zero. In practice, we keep only the non-zero sub-tensors, together with their corresponding indices in the grid, and STFClus runs on all of them simultaneously. For the zero sub-tensors, we set the elements in the corresponding factor matrices and core tensors to zero. Finally, the strategy in [39] for re-constructing the factor matrices and core tensor of the original tensor is used. The same method is applied to the Douban Movie Network: we divided the tensor of size 12,677 × 6,311 × 2,449 × 38 × 13,367 × 2,753 × 13,367 into a 100 × 10 × 10 × 1 × 100 × 10 × 100 grid consisting of 10^9 sub-tensors, of which more than 97.54% are zero tensors.
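The grid partitioning itself is straightforward on a COO representation: each non-zero element is routed to the sub-tensor containing it, and empty cells are simply never materialized. A minimal sketch with our own illustrative `grid_partition` helper:

```python
from collections import defaultdict

def grid_partition(entries, block):
    """Scatter sparse COO entries into a grid of sub-tensors.
    entries: dict mapping global index tuples to values;
    block: block size per mode. Only non-empty grid cells are returned,
    mirroring the strategy of keeping just the non-zero sub-tensors."""
    cells = defaultdict(dict)
    for idx, val in entries.items():
        cell = tuple(i // b for i, b in zip(idx, block))    # grid coordinate
        local = tuple(i % b for i, b in zip(idx, block))    # index inside cell
        cells[cell][local] = val
    return cells

# Toy 2-mode example with 2x2 blocks: three non-zeros land in three cells.
cells = grid_partition({(0, 0): 1.0, (0, 3): 2.0, (3, 3): 3.0}, (2, 2))
```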
We set the number of clusters K = 15 for the DBLP-full-areas dataset and K = 20 for the Douban Movie Network. The details of the implementation and the results are summarized in Table 7, where a non-zero sub-tensor is a sub-tensor whose elements are not all zero.
From Table 7, we can see that the number of non-zero sub-tensors is very large, although most sub-tensors in such large-scale datasets are zero tensors. Moreover, all the non-zero sub-tensors are very sparse. The total running time has three constituents: grid generation, parallel computation of STFClus, and reconstruction of the factor matrices and core tensor. For both datasets, grid generation and reconstruction took up most of the running time, while the parallel computation of STFClus cost just 32.1% of the time on the DBLP-full-areas dataset and 21.68% on the Douban Movie Network. In total, the system spent about 2.5 days on the DBLP-full-areas dataset and almost 3 days on the Douban Movie Network.

Discussion
The experimental results on both synthetic and real-world datasets show that STFClus is an effective and efficient method for clustering heterogeneous information networks. It can handle all types of objects simultaneously and obtain good clustering results without any information on the network schema. In the experiments, we found that random initialization of STFClus may lead to non-convergence. That is to say, a good initialization can improve the performance of STFClus, and the STFClus_initial algorithm provides a good starting point.
Unfortunately, the current STFClus_initial algorithm is not perfect: it is highly efficient for sparse networks but not for dense ones. In other words, when the scale and density of the heterogeneous information network become large, the time cost of the STFClus_initial algorithm increases rapidly. Since the network scale is usually large in real-world applications, the STFClus_initial algorithm performs better when the network density is smaller. We must thus make a compromise between the time complexity and the effectiveness of the whole method; this trade-off is to be optimized by users on a case-specific basis.
However, the case studies on two very large-scale datasets show that STFClus can be used to analyze very large heterogeneous information networks off-line. The running time is acceptable, and STFClus delivers high-accuracy clustering results that can be used as prior knowledge for on-line analysis.

Conclusions
Many clustering methods for heterogeneous information networks have been proposed, such as FctClus [4], NetClus [3] and PathSelClus [6,8]. Each of them can deal with one type of heterogeneous information network with a specified network schema; however, for general network schemas, or in cases without any network schema information, these clustering methods are not usable. Tensor factorization is a powerful tool for clustering multi-dimensional data. It has been widely used in specific applications such as computer graphics [16] and vision [40]; however, many existing tensor factorization based clustering methods focus on 3-mode tensors and cluster only one mode of the tensor. In this paper, the STFClus method is presented as a way to cluster heterogeneous information networks based on sparse tensor factorization. STFClus models a heterogeneous information network as a sparse tensor: each type of object in the network is modeled as one mode of the tensor, and the relationships among different types of objects are modeled as the elements of the tensor.
In contrast to the existing clustering methods for heterogeneous information networks, STFClus has two distinct advantages. Firstly, STFClus can model different types of objects and the semantic relations in heterogeneous information networks without any information regarding the network schema. In addition, based on the tensor factorization, STFClus can cluster all types of objects simultaneously by running the algorithm only once; i.e., STFClus is a generally applicable, network-schema-agnostic, single-pass clustering method for heterogeneous networks. Furthermore, an initialization algorithm is specifically developed for STFClus; in general, this initialization algorithm is good at handling sparse networks. The experimental results showed that STFClus can deal with large-scale and sparse heterogeneous information networks and performs better on networks with more types of objects. Moreover, STFClus can handle overlapping and non-overlapping clusters equally well, and it outperforms the state-of-the-art baselines on real-world datasets.
Nevertheless, STFClus is sensitive to the initialization of the factor matrices and core tensor. A good initialization can improve the performance of STFClus, while a sub-optimal initialization may lead to unacceptably slow convergence and unsatisfactory local minima. Although the STFClus_initial algorithm can provide a good initialization, its time cost increases rapidly for large-scale, very dense networks.
In future work, this novel approach of clustering heterogeneous information networks based on tensor factorization can be combined with other rank-based clustering methods, e.g., RankClus and NetClus. Another challenge in future work is to deal with dynamically changing tensors as the heterogeneous information networks are changing. Possible solutions include increasing the number of tensor modes, or the number of dimensions of the tensor.
Supporting information S1 File. The Matlab code of the synthetic dataset generation algorithm. It uses Zipf's law to generate the synthetic datasets for Experiment A and Experiment B.
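While the Matlab code is provided in S1 File, the idea can be sketched as follows: object popularity within each type is drawn from a Zipf distribution, and each synthetic relationship samples one object from every type. This is a rough Python analogue; the function and parameter names below are our own, not those of the S1 generator:

```python
import random

def zipf_weights(n, s=1.0):
    """Zipf weights: rank r gets probability proportional to 1 / r^s."""
    w = [1.0 / (r ** s) for r in range(1, n + 1)]
    total = sum(w)
    return [x / total for x in w]

def sample_relationships(num_objects_per_type, num_relations, s=1.0, seed=0):
    """Draw index tuples (one object per type) with Zipf-skewed popularity,
    so a few objects of each type appear in many relationships."""
    rng = random.Random(seed)
    weights = [zipf_weights(n, s) for n in num_objects_per_type]
    return [
        tuple(rng.choices(range(n), weights=w)[0]
              for n, w in zip(num_objects_per_type, weights))
        for _ in range(num_relations)
    ]

# Toy 3-type network: 100/50/10 objects per type, 1,000 relationships.
rels = sample_relationships((100, 50, 10), 1000)
```

Each sampled tuple corresponds to one non-zero element of the synthetic tensor.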