FctClus: A Fast Clustering Algorithm for Heterogeneous Information Networks

Clustering heterogeneous information networks is an important task. This paper proposes a fast clustering algorithm, based on an approximate commute time embedding, for heterogeneous information networks with a star network schema; the algorithm exploits the sparsity of such networks. First, a heterogeneous information network is transformed into multiple compatible bipartite graphs from the compatible point of view. Second, the approximate commute time embedding of each bipartite graph is computed using random mapping and a linear time solver. All of the indicator subsets in each embedding simultaneously determine the target dataset. Finally, a general model is formulated from these indicator subsets, and a fast algorithm is derived by simultaneously clustering all of the indicator subsets using the sum of the weighted distances for all indicators for an identical target object. Theoretical analysis and experimental verification show that the proposed algorithm, FctClus, is efficient and generalizable and achieves high clustering accuracy and fast computation speed.


Introduction
Information networks are ubiquitous and include social information networks and DBLP bibliographic networks. Numerous studies have been performed on homogeneous information networks, which consist of a single type of data object; however, little research has been performed on the clustering of heterogeneous information networks, which consist of multiple types of data objects. Clustering a heterogeneous network may lead to a better understanding of the hidden structures and deeper meanings of the network [1].
The star network schema is popular and important in the field of heterogeneous information networks. The star network schema includes one target type of data objects and multiple attribute types of data objects, and each relation links a target data object to the attribute data objects connected to it.
Algorithms based on compatible bipartite graphs can effectively consider multiple types of relational data. Various classical clustering algorithms, such as algorithms based on semi-definite programming [2,3], algorithms based on information theory [4] and spectral clustering algorithms for multi-type relational data [5], have been proposed for heterogeneous data from the compatible point of view. These algorithms are generalizable, but their computational complexity is too great for use in clustering heterogeneous information networks. Sun et al. present an algorithm, NetClus [6], and a PathSim-based clustering algorithm [7] for clustering heterogeneous information networks. NetClus is effective for DBLP bibliographic networks, but it is not a general model for clustering other heterogeneous information networks, and it is not sufficiently stable. The concept behind NetClus is also used for clustering service webs [8,9]. The PathSim-based clustering algorithm requires user guidance, and the clustering quality reflects the requirements of users rather than the requirements of the network. ComClus [10] is a derivation algorithm of NetClus for use with hybrid networks that simultaneously include heterogeneous and homogeneous relations. NetClus and ComClus are not general and depend on the given application.
Dynamic link inference in heterogeneous networks [11] requires more accurate initial clustering. A high clustering quality is necessary for network analysis, but low computation speed is intolerable because of the large network scales involved. The LDCC algorithm [12] improves accuracy by exploring both the heterogeneous and homogeneous data relations. The CESC algorithm [13] is very effective for clustering homogeneous data using an approximate commute time embedding. A heterogeneous information network with a star network schema can be transformed into multiple compatible bipartite graphs from the compatible point of view. When the relation between any two nodes of a bipartite graph is represented by the commute time, the relations of both heterogeneous and homogeneous data objects can be explored, and the clustering accuracy can be improved. Heterogeneous information networks are large but very sparse; therefore, the approximate commute time embedding of each bipartite graph can be quickly computed using random mapping and a linear time solver [14]. All of the indicator subsets in each embedding indicate the target dataset, and a general model for clustering heterogeneous information networks is then formulated based on all indicator subsets. All weighted distances between the indicators and the cluster centers in the respective indicator subsets are computed. All indicator subsets can be simultaneously clustered according to the sum of the weighted distances for all indicators for an identical target object. Based on the above discussion, an effective clustering algorithm, FctClus, which is based on the approximate commute time embedding for heterogeneous information networks, is proposed in this paper. The computation speed and clustering accuracy of FctClus are high.

Commute Time Embedding of the Bipartite Graph
Given two types of datasets, $X_0 = \{x^{(0)}_1, x^{(0)}_2, \cdots, x^{(0)}_{n_0}\}$ and $X_1 = \{x^{(1)}_1, x^{(1)}_2, \cdots, x^{(1)}_{n_1}\}$, the graph $G_b = \langle V, E \rangle$ is called a bipartite graph if $V(G_b) = X_0 \cup X_1$ and $E(G_b) = \{\langle x^{(0)}_i, x^{(1)}_j \rangle\}$, where $1 \le i \le n_0$, $1 \le j \le n_1$. $W_{n_0 \times n_1}$ is the relation matrix between $X_0$ and $X_1$, where the element $w_{ij}$ is the edge weight between $x^{(0)}_i$ and $x^{(1)}_j$. Then, the adjacency matrix of the bipartite graph $G_b$ can be denoted as

$$\tilde{W} = \begin{pmatrix} 0 & W \\ W^T & 0 \end{pmatrix}.$$

With $D$ the diagonal degree matrix whose entries are the row sums of $\tilde{W}$, the Laplacian matrix of the bipartite graph $G_b$ is $L = D - \tilde{W}$. $L$ can be eigen-decomposed into $L = \Phi \Lambda \Phi^T$, where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \cdots, \lambda_n)$ is a diagonal matrix composed of the eigenvalues of $L$ with $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$, $\Phi = (\phi_1, \phi_2, \cdots, \phi_n)$ is an eigenmatrix and $\phi_i$ is an eigenvector corresponding to the eigenvalue $\lambda_i$. Let $L^+$ be the pseudo-inverse matrix of $L$:

$$L^+ = \sum_{i:\, \lambda_i \ne 0} \frac{1}{\lambda_i}\, \phi_i \phi_i^T.$$

The bipartite graph is also an undirected weighted graph. According to the literature [15], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ can be computed by the pseudo-inverse matrix $L^+$:

$$c_{ij} = g_v\,(l^+_{ii} + l^+_{jj} - 2\, l^+_{ij}) = g_v\,(e_i - e_j)^T L^+ (e_i - e_j),$$

where $l^+_{ij}$ is the $(i, j)$ element of $L^+$, $g_v = \sum w_{ij}$, and $e_i$ is a unit column vector in which the $i$-th element is 1; that is, $e_i = [0, \cdots, 1, \cdots, 0]^T$. According to the literature [15,16], the commute time $c_{ij}$ between nodes $i$ and $j$ of $G_b$ is

$$c_{ij} = g_v\,(e_i - e_j)^T \Phi \Lambda^{-1} \Phi^T (e_i - e_j) = \left\| \sqrt{g_v}\, \Lambda^{-1/2} \Phi^T (e_i - e_j) \right\|^2,$$

where $\Lambda^{-1}$ and $\Lambda^{-1/2}$ are taken over the nonzero eigenvalues. Thus, the commute time $c_{ij}$ is the squared pairwise Euclidean distance between the vectors that represent nodes $i$ and $j$ in the space $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$, which is called the commute time embedding of the bipartite graph $G_b$. $c_{ij}$ is the average path length between two nodes rather than the shortest path between two nodes. Using the commute time for clustering noisy data increases robustness and captures complex clusters. Therefore, clustering in the commute time embedding can also effectively capture complex clusters.
$\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ is used in this paper. If a normalized Laplacian matrix $L_n = D^{-1/2} L D^{-1/2}$ is used, the commute time embedding is $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T D^{-1/2}$ [13].
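The relation between the pseudo-inverse formula and the embedding can be checked numerically on a small graph. The following is a minimal sketch, not the authors' code: the function name `commute_time_embedding` is hypothetical, NumPy is assumed to be available, and $g_v$ is assumed to be the graph volume (the sum of all node degrees).

```python
import numpy as np

def commute_time_embedding(W, tol=1e-10):
    """Exact commute times and embedding for a bipartite relation matrix W (n0 x n1)."""
    W = np.asarray(W, dtype=float)
    n0, n1 = W.shape
    n = n0 + n1
    # Adjacency matrix of the bipartite graph: [[0, W], [W^T, 0]].
    A = np.zeros((n, n))
    A[:n0, n0:] = W
    A[n0:, :n0] = W.T
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A
    g_v = degrees.sum()  # assumption: g_v is the graph volume
    # Eigen-decomposition of the symmetric Laplacian; drop zero eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(L)
    keep = eigvals > tol
    # Embedding sqrt(g_v) * Lambda^{-1/2} * Phi^T; column i represents node i.
    embedding = np.sqrt(g_v) * (eigvecs[:, keep] / np.sqrt(eigvals[keep])).T
    # Pseudo-inverse built from the same nonzero eigenpairs.
    L_pinv = (eigvecs[:, keep] / eigvals[keep]) @ eigvecs[:, keep].T
    diag = np.diag(L_pinv)
    # c_ij = g_v * (l+_ii + l+_jj - 2 * l+_ij)
    C = g_v * (diag[:, None] + diag[None, :] - 2.0 * L_pinv)
    return C, embedding

# Path-shaped example: x1_1 - x0_1 - x1_2 - x0_2 (nodes 2 - 0 - 3 - 1).
C, emb = commute_time_embedding([[1, 1], [0, 1]])
diff = emb[:, 2] - emb[:, 1]
print(C[2, 1], diff @ diff)
```

On this path-shaped example the two printed values agree, which illustrates that the squared Euclidean distance in the embedding reproduces the commute time.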

Approximate Commute Time Embedding of the Bipartite Graph
Directly computing $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T$ or $\sqrt{g_v}\, \Lambda^{-1/2} \Phi^T D^{-1/2}$ requires $O(n^3)$ time for the eigen-decomposition of the Laplacian matrix $L$ or $L_n$. Let $n = n_0 + n_1$ be the number of nodes and $s$ the number of edges in the bipartite graph $G_b$. According to the literature [17], if the edges in $G_b$ are oriented and, for each edge $e = (i, j)$,

$$B(e, v) = \begin{cases} 1, & v = i \\ -1, & v = j \\ 0, & \text{otherwise,} \end{cases}$$

where $i$ and $j$ are nodes of $G_b$, then $B_{s \times n}$ is a directed edge-node incidence matrix. With $\hat{W}_{s \times s}$ a diagonal matrix whose entries are the edge weights, $L = B^T \hat{W} B$. Furthermore, let

$$\psi = \sqrt{g_v}\, \hat{W}^{1/2} B L^+;$$

thus, $\psi$ is the commute time embedding of the bipartite graph $G_b$, where the square root of the commute time is the Euclidean distance between $i$ and $j$ in $\psi$ because

$$c_{ij} = g_v\,(e_i - e_j)^T L^+ (e_i - e_j) = g_v\,(e_i - e_j)^T L^+ B^T \hat{W} B L^+ (e_i - e_j) = \left\| \psi (e_i - e_j) \right\|^2. \tag{2}$$

According to the literature [18], given vectors $v_1, \cdots, v_n \in \mathbb{R}^s$ and $\varepsilon > 0$, let $Q_{k_r \times s}$ be a random matrix in which $Q(i, j) = \pm 1/\sqrt{k_r}$ with equal probability and $k_r = O(\log n / \varepsilon^2)$. With probability of at least $1 - 1/n$,

$$(1 - \varepsilon)\, \|v_i - v_j\|^2 \le \|Q v_i - Q v_j\|^2 \le (1 + \varepsilon)\, \|v_i - v_j\|^2 \tag{3}$$

for all pairs. Therefore, given the bipartite graph $G_b$ with $n$ nodes and $s$ edges, $\varepsilon > 0$, and a matrix $Y_{k_r \times n} = \sqrt{g_v}\, Q \hat{W}^{1/2} B L^+$, with probability of at least $1 - 1/n$:

$$(1 - \varepsilon)\, c_{ij} \le \left\| Y (e_i - e_j) \right\|^2 \le (1 + \varepsilon)\, c_{ij} \tag{4}$$

for any nodes $i, j \in G_b$. The proof of Eq (4) comes directly from Eq (2) and Eq (3). $c_{ij} \approx \|Y(e_i - e_j)\|^2$ with an error $\varepsilon$ based on Eq (4).

If $Y_{k_r \times n} = \sqrt{g_v}\, Q \hat{W}^{1/2} B L^+$ is computed directly, $L^+$ must first be computed, but the computational complexity of directly computing $L^+$ is excessive. However, using the method in the literature [19,20] to compute $Y_{k_r \times n}$, the complexity is decreased. Let $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$ be computed first; then, $Y L = \theta$. Each row of $Y$, $y_i$, is computed by solving the system $y_i L = \theta_i$, where $\theta_i$ is the $i$-th row of $\theta$. The linear time solver of Spielman and Teng [19,20] requires only $\tilde{O}(s)$ time to solve the system. Because $\|y_i - \hat{y}_i\|_L \le \varepsilon \|y_i\|_L$ [17], where $\hat{y}_i$ is the solution of $y_i L = \theta_i$ obtained using the linear time solver, it follows that [17]

$$(1 - \varepsilon)^2 c_{ij} \le \left\| \hat{Y} (e_i - e_j) \right\|^2 \le (1 + \varepsilon)^2 c_{ij}.$$

Therefore, $c_{ij} \approx \|\hat{Y}(e_i - e_j)\|^2$ with an error bound of $\varepsilon^2$. The component of the algorithm for the approximate commute time embedding of the bipartite graph is as follows.

Algorithm 1 ApCte (Approximate Commute Time Embedding of the Bipartite Graph)
1. input the relation matrix $W_{n_0 \times n_1}$;
2. compute the matrices $B$, $\hat{W}$ and $L$ using $W_{n_0 \times n_1}$;
3. compute $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$;
4. compute each $\hat{y}_i$ using the system $y_i L = \theta_i$ by calling the Spielman-Teng solver $k_r$ times [14], $1 \le i \le k_r$;
5. output the approximate commute time embedding $\hat{Y}$.
All data objects of $X_0$ and $X_1$ are mapped into a common subspace $\hat{Y}$, where the first $n_0$ column vectors of $\hat{Y}$ indicate $X_0$ and the last $n_1$ column vectors of $\hat{Y}$ indicate $X_1$. The dataset composed of the $n = n_0 + n_1$ column vectors of $\hat{Y}$ is called an indicator dataset. The input matrix $W_{n_0 \times n_1}$ is a sparse matrix with $s$ nonzero elements. Therefore, the complexity of computing the matrices $B$, $\hat{W}$ and $L$ in step 2 is $O(2s) + O(s) + O(n)$. The sparse matrix $B$ has $2s$ nonzero elements, and the diagonal matrix $\hat{W}$ has $s$ nonzero elements. Computing $\theta = \sqrt{g_v}\, Q (\hat{W}^{1/2} B)$ takes $O(2 s k_r + s)$ time in step 3. Because the linear time solver of Spielman and Teng [19,20] requires only $\tilde{O}(s)$ time to solve for each $y_i$ of the system $y_i L = \theta_i$, constructing $\hat{Y}$ takes $\tilde{O}(s k_r)$ time in step 4. Therefore, the complexity of algorithm 1, ApCte, is only $O(2s) + O(s) + O(n) + O(2 s k_r + s) + \tilde{O}(s k_r) = \tilde{O}(4s + n + 3 s k_r)$. In practice, $k_r = O(\log n / \varepsilon^2)$ is small and does not vary between different datasets. The indicator dataset includes low-dimensional homogeneous data; therefore, traditional algorithms can be used for the indicator dataset.
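The steps of ApCte can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' MATLAB implementation: the function name `apcte_sketch` is hypothetical, NumPy is assumed, dense matrices replace the sparse ones, $g_v$ is taken as the graph volume, and a dense least-squares solve (`np.linalg.lstsq`) stands in for the Spielman-Teng or CMG solver, so it does not have the near-linear running time discussed above.

```python
import numpy as np

def apcte_sketch(W, k_r, seed=0):
    """Approximate commute time embedding (k_r x n) of a bipartite relation matrix W."""
    W = np.asarray(W, dtype=float)
    n0, n1 = W.shape
    n = n0 + n1
    rows, cols = np.nonzero(W)          # one oriented edge per nonzero of W
    s = len(rows)
    w = W[rows, cols]                   # diagonal of W_hat (edge weights)

    # Step 2: incidence matrix B (s x n) and Laplacian L = B^T W_hat B.
    B = np.zeros((s, n))
    B[np.arange(s), rows] = 1.0         # edge leaves the X0 node
    B[np.arange(s), n0 + cols] = -1.0   # edge enters the X1 node
    L = B.T @ (w[:, None] * B)
    g_v = 2.0 * w.sum()                 # assumption: g_v is the graph volume

    # Step 3: theta = sqrt(g_v) * Q * W_hat^{1/2} * B with Q(i, j) = +-1/sqrt(k_r).
    rng = np.random.default_rng(seed)
    Q = rng.choice([-1.0, 1.0], size=(k_r, s)) / np.sqrt(k_r)
    theta = np.sqrt(g_v) * (Q * np.sqrt(w)) @ B

    # Step 4: solve y_i L = theta_i for every row. L is symmetric, so this is
    # L y_i^T = theta_i^T; the minimum-norm least-squares solution equals theta L^+.
    Y = np.linalg.lstsq(L, theta.T, rcond=None)[0].T
    return Y

Y = apcte_sketch([[1, 1], [0, 1]], k_r=500)
diff = Y[:, 2] - Y[:, 1]
print(diff @ diff)  # approximates the exact commute time between nodes 2 and 1
```

Column $i$ of the returned matrix is the indicator of node $i$, so the first $n_0$ columns indicate $X_0$ and the remaining columns indicate $X_1$.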

A General Model Formulation
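Given a dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types of objects, a weighted graph $G = \langle V, E, W \rangle$ is called an information network on $\chi$ if $V(G) = \chi$, $E(G)$ is a binary relation on $V$ and $W: E \to \mathbb{R}^+$. Such an information network is called a heterogeneous information network when $T \ge 1$ and a homogeneous information network when $T = 0$ [6].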
To derive a general model for clustering the target dataset, a heterogeneous information network with a star network schema using the dataset $\chi = \{X_t\}_{t=0}^{T}$ with $T+1$ types is given, where $X_0$ is the target dataset and $\{X_t\}_{t=1}^{T}$ are the attribute datasets. $X_t = \{x^{(t)}_1, x^{(t)}_2, \cdots, x^{(t)}_{n_t}\}$, where $n_t$ is the object number of $X_t$. $W^{(0t)} \in \mathbb{R}^{n_0 \times n_t}$ denotes the relation matrix between the target dataset $X_0$ and the attribute dataset $X_t$, where the element $w^{(0t)}_{ij}$ denotes the relation between $x^{(0)}_i$ of $X_0$ and $x^{(t)}_j$ of $X_t$. If an edge between $x^{(0)}_i$ and $x^{(t)}_j$ exists, its edge weight is $w^{(0t)}_{ij}$. If no edge exists, $w^{(0t)}_{ij} = 0$. $T$ relation matrices $\{W^{(0t)}\}_{t=1}^{T}$ exist in the heterogeneous information network with a star network schema.
The target dataset $X_0$ and the attribute dataset $X_t$ constitute a bipartite graph, $G^{(0t)}$, which corresponds to the relation matrix $W^{(0t)}$. The indicator dataset $Y^{(0t)} = \{y^{(0t)}_1, y^{(0t)}_2, \cdots, y^{(0t)}_{n_0 + n_t}\}$, which is also the approximate commute time embedding of $G^{(0t)}$, can be quickly computed by ApCte, where the first $n_0$ data of $Y^{(0t)}$ indicate $X_0$ and the last $n_t$ data of $Y^{(0t)}$ indicate the attribute dataset $X_t$. $Y^{(0)}_t$ consists of the first $n_0$ data of $Y^{(0t)}$, and $Y^{(t)}$ consists of the last $n_t$ data of $Y^{(0t)}$. $Y^{(0)}_t$ and $Y^{(t)}$ are called the indicator subsets. $y^{(t)}_i \in Y^{(0)}_t$ indicates the $i$-th object of $X_0$ and is called an indicator for $1 \le i \le n_0$. There exists a one-to-one correspondence between the indicators of $Y^{(0)}_t$ and the objects of $X_0$. Because $T$ bipartite graphs correspond to $T$ indicator datasets, the target dataset $X_0$ is simultaneously indicated by the $T$ indicator subsets $\{Y^{(0)}_t\}_{t=1}^{T}$, and each object of $X_0$ is simultaneously indicated by $T$ indicators.
The $T$ indicators $\{y^{(t)}_i\}_{t=1}^{T}$, which indicate the identical object of $X_0$, belong to $T$ clusters. The $T$ clusters are in $T$ different indicator subsets and are denoted using the same label. Let $F$ denote the objective function of Eq (5), in which $o^{(t)}_j$ is the $j$-th cluster center of the indicator subset $Y^{(0)}_t$. There exists a one-to-one correspondence between the indicator function $\gamma = \{\gamma_{ij}\}_{i=1}^{n_0}$ and the objects of $X_0$. If all indicators $\{y^{(t)}_i\}_{t=1}^{T}$ that indicate the $i$-th object of $X_0$ belong to the $j$-th cluster, $\gamma_{ij} = 1$; otherwise, $\gamma_{ij} = 0$. If the objective function $F$ in Eq (5) is minimized, the clusters of $X_0$ are optimal from the compatible point of view because each indicator subset reflects the relation between the target dataset and the attribute dataset. Determining the global minimum of Eq (5) is NP-hard.
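A form of the objective that is consistent with the description above can be written down under two assumptions that are not confirmed by the text: distances are squared Euclidean distances, and each relation $W^{(0t)}$ carries a nonnegative weight, written here with the hypothetical symbol $\beta^{(t)}$. $K$ denotes the number of clusters. This is a sketch under those assumptions rather than the exact expression of Eq (5).

```latex
F = \sum_{i=1}^{n_0} \sum_{j=1}^{K} \gamma_{ij}
    \sum_{t=1}^{T} \beta^{(t)} \left\| y^{(t)}_i - o^{(t)}_j \right\|^2 ,
\qquad \gamma_{ij} \in \{0, 1\}, \quad \sum_{j=1}^{K} \gamma_{ij} = 1 .
```

In this form each target object contributes the weighted sum of the distances from its $T$ indicators to the $T$ centers that share its cluster label, which is why a single assignment $\gamma_{ij}$ couples all indicator subsets.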

Derivation of Fast Algorithm for Clustering Heterogeneous Information Networks
The following steps allow for the local minimum of F in Eq (5) to be quickly achieved by simultaneously clustering all of the indicator subsets.

Setting the Cluster Label
When the cluster labels of each indicator subset are given, the modeling process can be simplified. Suppose that the labels of the $K$ clusters of each $Y^{(0)}_t$ are set. Let $q_1, q_2 \in X_0$, $y^{(1)}_1$, $y^{(1)}$ … Each cluster of $Y^{(0)}_t$ has an initial center. $K$ random objects are selected from the target dataset $X_0$. The indicators that indicate the $K$ objects are taken as the initial cluster centers for each $Y^{(0)}_t$, and the clusters whose centers indicate an identical target object are given the same label. Then, either all of the other indicators for an identical target object belong to the $j$-th cluster in each $Y^{(0)}_t$, or none of them belongs to the $j$-th cluster, where $1 \le j \le K$. Therefore, the labels of the $K$ clusters of $\{Y^{(0)}_t\}_{t=1}^{T}$ are set.

The Sum of the Weighted Distances
An object of $X_0$ is indicated by $T$ indicators. All of the $T$ distances between the indicator and the center in each $Y^{(0)}_t$ affect the object allocation. The target object allocation is determined by the sum of the weighted distances for the $T$ indicators. Setting $q_i \in X_0$, $y^{(1)}$ …, Eq (6) gives the sum of the weighted distances between the $T$ indicators of $q_i$ and the centers of the clusters labeled $j$, where $j$ is the cluster label.
The Local Minimum of F
$F$ in Eq (5) can also be expressed as Eq (7), which is another representation of Eq (5). Given the initial centers $\{o^{(t)}_j\}_{j=1}^{K}$ and the cluster labels in the $T$ indicator subsets $\{Y^{(0)}_t\}_{t=1}^{T}$, $\{Y^{(0)}_t\}_{t=1}^{T}$ is first partitioned by computing Eq (6), and the resulting value of Eq (7) is denoted $F = F_0$. The cluster centers of $\{Y^{(0)}_t\}_{t=2}^{T}$ remain the same, and $\gamma_{ij}$ is unchanged. The new center $\{\hat{o}^{(1)}_j\}_{j=1}^{K}$ of each cluster in $Y^{(0)}_1$ is computed as the mean of all data of that cluster. The new centers $\{\hat{o}^{(1)}_j\}_{j=1}^{K}$ of $Y^{(0)}_1$ replace the old centers, and subsequently, Eq (7) is used to set $F = F_1$. Then, $F_1 \le F_0$ holds. Because only the new centers $\{\hat{o}^{(1)}_j\}_{j=1}^{K}$ of $Y^{(0)}_1$ replace the old centers, $\gamma_{ij}$ remains unchanged. Therefore, the terms of $F$ that involve $\{Y^{(0)}_t\}_{t=2}^{T}$ remain unchanged, and only the term that involves $Y^{(0)}_1$ is affected by the new centers. Re-clustering $\{Y^{(0)}_t\}_{t=1}^{T}$ using Eq (6), where the corresponding value is $F = F_2$ in Eq (7), gives $F_2 \le F_1$.
Partitioning $\{Y^{(0)}_t\}_{t=1}^{T}$ using Eq (6) allows the new cluster centers $\{\hat{o}^{(1)}_j\}_{j=1}^{K}$ of $Y^{(0)}_1$ to be computed; the new centers replace the old centers $\{o^{(1)}_j\}_{j=1}^{K}$. Then, the same procedure is repeated for each of $\{Y^{(0)}_t\}_{t=2}^{T}$. The value of $F$ decreases in this case. The above procedures are repeated until $F$ in Eq (7) converges; then, a local minimum of $F$ in Eq (7) is obtained. The resulting algorithm, which is based on the approximate commute time embedding for heterogeneous information networks, is algorithm 2, FctClus.
Computing the $T$ approximate commute time embeddings takes $\tilde{O}\left(\sum_{t=1}^{T} (4 s_t + n_t + 3 s_t k_r)\right)$ time in algorithm 2, where $T$ is the number of relation matrices in the heterogeneous information network and $k_r$ is the data dimension of $Y^{(0)}_t$. $n_t$ and $s_t$ are the node number and edge number of the $t$-th bipartite graph, respectively. Step 6 requires only $O(K)$ time, which is constant. The object number of $X_0$ is equal to the indicator number of each indicator subset; thus, the computational complexity of steps 7 to 13 is $O(u T K k_r n_0)$, where $K$ is the number of clusters of each $Y^{(0)}_t$, $n_0$ is the data number of each $Y^{(0)}_t$, and $u$ is the number of iterations until $F$ in Eq (7) converges. Therefore, the computational complexity of algorithm 2, FctClus, is $\tilde{O}\left(\sum_{t=1}^{T} (4 s_t + n_t + 3 s_t k_r)\right) + O(u T K k_r n_0)$, where $k_r$ and $u$ are small and $T$ and $K$ are constant.
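The alternating procedure of this section can be illustrated with a short sketch. It is not the authors' FctClus.m: the function name `fctclus_sketch`, the argument names and the `weights` vector are hypothetical, NumPy is assumed, squared Euclidean distances are assumed, and each indicator subset is passed as an array whose row $i$ indicates the $i$-th target object.

```python
import numpy as np

def fctclus_sketch(subsets, weights, init_idx, max_iter=100):
    """Cluster T indicator subsets jointly by summed weighted squared distances.

    subsets  : list of T arrays of shape (n0, k_r); row i indicates target object i
    weights  : list of T nonnegative relation weights (assumed, hypothetical)
    init_idx : indices of the K target objects whose indicators seed the centers
    """
    K = len(init_idx)
    # The same K target objects seed the centers in every subset, which is
    # what gives clusters in different subsets a shared label.
    centers = [Y[init_idx].copy() for Y in subsets]

    def assign():
        # total[i, j] = sum over t of weight_t * ||y_i^(t) - o_j^(t)||^2
        total = sum(
            b * ((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
            for b, Y, C in zip(weights, subsets, centers)
        )
        return total.argmin(axis=1), total.min(axis=1).sum()

    labels, F = assign()
    for _ in range(max_iter):
        F_old = F
        for t, Y in enumerate(subsets):
            # Update the centers of one subset, keep the others fixed ...
            for j in range(K):
                members = Y[labels == j]
                if len(members) > 0:
                    centers[t][j] = members.mean(axis=0)
            # ... then re-partition all subsets with the summed distances.
            labels, F = assign()
        if F_old - F < 1e-12:   # F stopped decreasing: local minimum reached
            break
    return labels, F

# Two toy indicator subsets for four target objects.
Y1 = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
Y2 = np.array([[5.0], [5.5], [-5.0], [-5.5]])
labels, F = fctclus_sketch([Y1, Y2], [0.5, 0.5], init_idx=[0, 2])
print(labels, F)
```

Each pass over the subsets costs on the order of $T K k_r n_0$ operations, matching the $O(u T K k_r n_0)$ term in the complexity analysis.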

Experiments
The Experimental Dataset
The experimental datasets are composed of real data selected from the DBLP data. The DBLP is a typical heterogeneous information network in the computer science domain and contains 4 types of objects: papers, authors, terms and venues. Two heterogeneous datasets of different scales, called $S_{small}$ and $S_{large}$, are used in the experiments. $S_{small}$ is the small test dataset and is called the "four-area dataset", as in the literature [6]. $S_{small}$, which is extracted from the DBLP dataset downloaded in 2011, contains four areas related to data mining: databases, data mining, information retrieval and machine learning. Five representative conferences for each area are chosen, and all papers and terms that appear in the titles are included. $S_{small}$ is shown in S1 File.
$S_{large}$ is the large test dataset and is extracted from the Chinese DBLP dataset, a shared resource released by the Institute of Automation, Chinese Academy of Sciences. $S_{large}$ includes 34 computer science journals, 16,567 papers, 47,701 authors and 52,262 terms (keywords). $S_{large}$ is shown in S2 File.
When papers are analyzed, papers form the target dataset, and the other object types form the attribute datasets. There is no direct link between papers because the DBLP provided very limited citation information. When authors are analyzed, authors form the target dataset, while papers and venues are the attribute datasets. However, there is a direct link between authors because of the co-author relation between various authors; therefore, authors are another attribute dataset related to the target dataset.
The experiments are performed in the MATLAB 7.0 programming environment. The MATLAB source codes for our algorithm are shown in S3 File and are available online at https://github.com/lsy917/chenlimin; they include a main program and three function programs. FctClus.m is the main program, which outputs the clusters of the object dataset, and ApCte.m, Prematrix.m and Net_Branches.m are function programs. The Koutis CMG solver [14] is used in all experiments as the nearly linear time solver to create the embedding. The solver, which handles symmetric, diagonally dominant matrices, is available online at http://www.cs.cmu.edu/~jkoutis/cmg.html.

The Relational Matrix
Papers are the target dataset, while authors, venues and terms are the attribute datasets. $X_0$ denotes papers, and $X_1$, $X_2$ and $X_3$ denote authors, venues and terms, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $1 \le t \le 3$. The element of $\{W^{(0t)}\}_{t=1}^{3}$ is … $p$ if $i \in X_0$, $j \in X_3$ and node $i$ appears $p$ times in node $j$; …
When authors are the target dataset, papers and venues are the attribute datasets. Authors are also an attribute dataset because of the co-author relation existing between authors. $X_0$ denotes authors, and $X_1$ and $X_2$ denote papers and venues, respectively. $W^{(0t)}$ is the relation matrix between $X_0$ and $X_t$, $0 \le t \le 2$. The element of $\{W^{(0t)}\}_{t=0}^{2}$ is … $p$ if $i \in X_0$, $j \in X_0$ and nodes $i$ and $j$ co-author $p$ papers; …
All the algorithms use the same relation matrices for all experiments.

Parameter Analysis
Analysis of Parameter $k_r$. The equation [13]

$$\mathrm{Accuracy} = \frac{\sum_{i=1}^{n} \delta\left(\mathrm{map}(c_i) = \mathrm{label}(i)\right)}{n}$$

is used to compute the clustering accuracy in the experiments, where $n$ is the number of objects in the dataset, $\mathrm{label}(i)$ is the cluster label of object $i$, and $c_i$ is the predicted label of object $i$. $\delta(\cdot)$ is an indicator function:

$$\delta(\cdot) = \begin{cases} 1, & \mathrm{map}(c_i) = \mathrm{label}(i) \\ 0, & \mathrm{map}(c_i) \ne \mathrm{label}(i). \end{cases}$$

$k_r$ is small in practice, and minimal differences exist among the various datasets [13]. The literature [13] has shown that the accuracy curve is flat for clustering different homogeneous datasets when $k_r \ge 50$.
Using the small dataset $S_{small}$, the clustering accuracy as a function of $k_r$ in a heterogeneous information network is studied by running experiments with different values of $k_r$. In the FctClus algorithm, the weight of $\{W^{(0t)}\}$ …
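The accuracy measure can be computed as in the sketch below. The text does not define $\mathrm{map}(\cdot)$, so the sketch assumes it is the one-to-one relabelling of predicted clusters that maximizes agreement with the given labels. The function name `clustering_accuracy` is hypothetical. The brute-force search over permutations is only practical for a small number of clusters, and it assumes that there are no more predicted clusters than distinct given labels.

```python
from itertools import permutations

def clustering_accuracy(predicted, truth):
    """Fraction of objects whose mapped predicted label equals the given label."""
    pred_ids = sorted(set(predicted))
    true_ids = sorted(set(truth))
    best_hits = 0
    # Assumption: map(.) is the best one-to-one assignment of predicted
    # cluster ids to given labels; try every such assignment.
    for perm in permutations(true_ids, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        hits = sum(mapping[c] == l for c, l in zip(predicted, truth))
        best_hits = max(best_hits, hits)
    return best_hits / len(truth)

print(clustering_accuracy([1, 1, 0, 0, 0], [0, 0, 1, 1, 0]))  # 0.8
```

For larger numbers of clusters, an assignment solver such as the Hungarian algorithm would replace the permutation search.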

Comparison of Clustering Accuracy and Computation Speed
The complexity of the algorithms based on semi-definite programming [2,3] and of the spectral clustering algorithms for multi-type relational data [5] is too high for large-scale networks. The low-complexity algorithms CIT [4], NetClus [6] and ComClus [10] are therefore selected for comparison with the FctClus algorithm in terms of clustering accuracy and computation speed; the datasets $S_{small}$ and $S_{large}$ are used for this experiment.
The initial cluster centers of FctClus or the initial cluster partitions of the other three algorithms are randomly selected 3 times. The best clustering accuracy of the 3 measurements is used as the clustering accuracy of the four algorithms, and the computation speed at this time is considered as the measured computation speed. The parameters in literature [6] are used as the parameters in NetClus, and the parameters in literature [10] are used as the parameters in ComClus in this experiment. The comparison results are shown in Table 1 and Table 2.
The clustering accuracy of FctClus is the highest of all four algorithms. The clustering accuracy of CIT is lower than that of FctClus because the bipartite graphs of the heterogeneous information networks are sparse. The computational complexity of CIT is $O(n^2)$, and the convergence speed of CIT is low when the heterogeneous information network is sparse. The clustering accuracy of NetClus is low because only heterogeneous relations are used. Homogeneous and heterogeneous relations are both used in ComClus; therefore, the accuracy of ComClus is higher than that of NetClus. FctClus is an algorithm based on commute time embedding. The data relations are explored using commute time, and the direct relations of the target dataset are considered. FctClus is not affected by the sparsity of networks; thus, FctClus is highly accurate.
FctClus is nearly as fast as NetClus. The experiment demonstrates that FctClus is effective. FctClus is more universal and can be adapted for clustering any heterogeneous information network with a star network schema. However, NetClus and ComClus can only be adapted for clustering bibliographic networks because they depend on a ranking function of a specific application field.

Comparison of Clustering Stability
To compare the stability of the FctClus, NetClus and CIT algorithms, the small dataset $S_{small}$ is used for clustering papers in this experiment. ComClus is a derivation algorithm of NetClus and has the same properties as NetClus; therefore, ComClus is not considered in this study.
The initial cluster centers of FctClus and the initial cluster partitions of NetClus and CIT are randomly selected 10 times, and each of the three algorithms is executed 10 times. The clustering accuracy of the three algorithms over the 10 runs is shown in Fig 5. Although the computation speeds of FctClus and NetClus are both high, Fig 5 shows that the stability of FctClus is higher than that of NetClus and that the initial centers do not greatly impact the clustering result of FctClus. However, NetClus is very unstable, and the initial clusters greatly impact the clustering accuracy and convergence speed of NetClus. CIT is more stable than NetClus, but its clustering accuracy is low.

Running Time Analysis of the FctClus Algorithm
The running time distributions of FctClus on the two datasets are shown in Table 3. The experimental data show that FctClus is effective. The running time for computing the three embeddings serially is less than 50% of the total running time. When the three embeddings are computed in parallel, the computation speed is higher. When the indicator subsets are clustered in parallel, the computation speed may also be increased.