Link prediction based on non-negative matrix factorization

With the rapid expansion of the internet, complex networks have become high-dimensional, sparse and redundant. The problem of link prediction in such networks has attracted increasing attention from domains such as information science, anthropology, sociology and computer science, and effective link prediction techniques are needed to extract the most essential and relevant information for online users. This paper therefore proposes a link prediction algorithm based on non-negative matrix factorization. The algorithm reconstructs the correlations between different types of matrices by projecting a high-dimensional vector space onto a low-dimensional one, and then uses the similarity between the column vectors of the weight matrix as the scoring matrix. Experimental results demonstrate that the algorithm not only reduces data storage space but also effectively improves prediction performance while maintaining low time complexity.


Introduction
With the unprecedentedly rapid development of technology, the world has become increasingly interconnected. In the real world, many information, biological, and social systems, ranging from interpersonal relationships to colony structures, from transportation to the online world, from ecosystems to the nervous system, can be modeled as networks, in which vertices denote entities and links stand for the interactions or relations between them. Inevitably, some potential links will remain undetected, and an observed network will contain redundant or erroneous links owing to limitations of space, time and experimental conditions. Based on the known network information, we are then required to forecast these potential and missing links. Because the links of a complex network evolve dynamically, this constitutes the link prediction problem [1,2].
In fact, link prediction can be used as an auxiliary means to investigate the structure of social networks, and forecasting potential or future links between people is of practical importance. The link prediction problem has a wide range of practical applications in many areas. For example, link prediction can reveal users' potential friends, who can then be introduced to them in online networks [3]. Potential links between people can also be discovered by analyzing their social relations [4][5][6]. Furthermore, link prediction can be applied in academic networks to forecast collaborators as well as the category of an article [7]. In biological networks, such as disease-gene networks, metabolic networks and protein-protein interaction networks, a link between nodes indicates an interactive relation [8]. In addition, research on link prediction has significant theoretical importance and extensive practical value [9][10][11][12]. For instance, it can offer a unified and convenient platform on which mechanisms of network growth can be compared fairly, helping to explain the development of complex networks theoretically and advancing the study of complex network evolution models.
In recent decades, link prediction has faced growing challenges from the expansion of network size, the sparsity of the scoring matrix and noise in the data. Owing to the high dimensionality of the user-rating matrix, the time and space complexity of personalized link prediction keeps increasing, which greatly affects the performance of prediction systems. To deal with this problem, numerous new methods have been proposed to improve the efficiency of link prediction. For example, Liu and Lv [13] used local random walks instead of global random walks to solve the link prediction problem and achieved good results. In [14], G. Rossetti et al. defined the multidimensional link prediction problem and proposed several predictors based on node correlation. H. H. Song et al. [15] assumed that new edges would be added in the near future and proposed an incremental update algorithm to deal with large-scale networks.
In addition, considering the importance of network organization principles, Pan [16] proposed a predefined structural Hamiltonian algorithm that can be used to calculate the probability of non-observed links. Liu et al. [17] proposed a local naive Bayes model based on the observation that different neighbors may play different roles and thus lead to different outcomes. Hu [18] put forward a new model that examines the performance of state-of-the-art recommendation algorithms in constantly evolving networks; it was also found that, over time, recommendation accuracy gradually decreases if the evolution of the online network depends entirely on the recommendations. Bliss et al. [19] applied the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to optimize the weights of a linear combination of sixteen neighborhood and node similarity indices for predicting future links. Barzel [20] used global silencing of indirect correlations to predict missing links. Zhu [2] presented a global optimization algorithm that infers the latent space effectively, together with two alternative optimization algorithms with incremental and local updates that let the model scale to larger networks without compromising forecast accuracy. Sarkar [21] proposed a nonparametric approach to link prediction in large-scale dynamic networks, in which graph-based features of node pairs and of their local neighborhoods are used to forecast whether two nodes will be linked at each time step. Richard [22] proposed a vector autoregressive (VAR) model based on node degree to enhance forecast accuracy.
Zhang et al. [23] proposed a link prediction algorithm based on an integrated diffusion on user-item-tag tripartite graphs and found that using tag information can significantly improve the novelty of link prediction. Recently, swarm intelligence methods have also been applied to link prediction; for example, Sherkat [24] proposed a new unsupervised structural ant colony approach that achieves the best results on some networks. Although these methods are designed especially for large-scale networks, the accuracy of their results cannot be guaranteed because of computation time limits [25].
Nowadays, matrix decomposition technology is widely applied because it is relatively simple and can achieve high prediction accuracy [26]. There are numerous well-known matrix decomposition methods, such as singular value decomposition (SVD) [27], principal component analysis (PCA) [28,29] and independent component analysis (ICA) [30][31][32]. SVD requires completing the scoring matrix to overcome the sparsity of the network, but this operation seriously increases the data storage space, so it is impractical in link prediction systems. Furthermore, owing to its high time complexity, SVD is not applicable to large-scale networks. To improve the performance of SVD-based methods, Simon Funk [33] proposed the latent factor model (LFM). However, in actual scoring systems user ratings for goods do not follow a uniform standard and are largely arbitrary because of personalized purchasing habits, and the LFM does not take users' purchase history into consideration. To overcome these shortcomings, Koren [34] proposed the SVD++ model, which combines the LFM with users' historical ratings.
However, the existing matrix factorization models do not consider that negative elements may appear in the matrix. In real-world applications, a negative score given by a user to an object does not necessarily mean that the object will never be selected by that user; as time goes on, the user may become interested in it. Therefore, Lee and Seung [35,36] proposed non-negative matrix factorization (NMF), a family of algorithms rooted in linear algebra and multivariate analysis in which the original matrix is approximated by the product of two matrices without any negative elements. It is this non-negativity that makes the resulting matrices easier to analyze. Consequently, NMF has been widely used in data mining [37,38], image analysis [39][40][41], medical imaging [42] and voice processing [43], among other fields. In link prediction, however, it has not yet attracted broad attention. In this paper, we apply NMF to link prediction and propose the NMF-LP model.

Problem formulation and evaluation methods
A network with node attributes can be represented as an undirected simple graph G(U, V, E), in which U is the set of nodes, V is the set of node attributes, and E is the set of links. G allows neither multiple links nor self-connections. Let M = |V| be the number of node attributes and N = |U| be the number of nodes in G. Following common practice, U is also used to denote the universal set containing all N(N − 1)/2 possible links. Link prediction aims to identify the links that will occur in the future, or the missing links, within the set of nonexistent links U − E.
Our method assigns to every pair of nodes (x, y) ∈ U a score S(x, y) that quantifies the similarity between the two nodes. For a pair of nodes (x, y) ∈ U − E, a larger S(x, y) indicates a higher likelihood that a link exists between them.
To test the accuracy of our algorithm, the observed links in E are randomly divided into two parts: the training set E_T, which is treated as known information, and the probe set (i.e., validation set) E_P, which is used for testing; no information in the probe set is used for prediction. Clearly, E_T ∪ E_P = E and E_T ∩ E_P = ∅. In principle, a link prediction algorithm provides an ordered list of all non-observed links (i.e., U − E_T), or equivalently assigns each non-observed link (x, y) ∈ U − E_T a score S_xy to quantify its existence likelihood. To evaluate the accuracy of prediction algorithms, two standard metrics are used: AUC and Precision.
(1) AUC. The AUC value is the area under the receiver operating characteristic (ROC) curve. In practice, we randomly pick a missing link (from E_P) and a nonexistent link and compare their scores. After n independent comparisons, if the missing link has the higher score n′ times and the two links have equal scores n″ times, the AUC value is

AUC = (n′ + 0.5 n″) / n.

Generally, a larger AUC value indicates better performance; a purely random predictor yields an AUC of about 0.5, while a perfect predictor yields 1.0.
(2) Precision. Given the ranking of the non-observed links, Precision is defined as the ratio of relevant items selected to the total number of items selected. That is, if we take the top-L links as the predicted ones, among which m links are correct (i.e., appear in the probe set), then

Precision = m / L.

Clearly, higher Precision means higher prediction accuracy.
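As a concrete illustration, both metrics can be computed from a score matrix as follows. This is a minimal sketch; the function names and the sampling count n are our own choices, not from the paper.

```python
import numpy as np

def auc_score(S, probe_links, nonexistent_links, n=10000, rng=None):
    """Estimate AUC by sampling n (missing, nonexistent) link pairs.

    S is the score matrix; probe_links / nonexistent_links are lists
    of (x, y) node pairs."""
    rng = np.random.default_rng(rng)
    hits = 0.0
    for _ in range(n):
        x1, y1 = probe_links[rng.integers(len(probe_links))]
        x2, y2 = nonexistent_links[rng.integers(len(nonexistent_links))]
        if S[x1, y1] > S[x2, y2]:
            hits += 1.0          # missing link scored higher: n' cases
        elif S[x1, y1] == S[x2, y2]:
            hits += 0.5          # equal scores: n'' cases, counted 0.5
    return hits / n              # AUC = (n' + 0.5 * n'') / n

def precision_at_L(ranked_links, probe_set, L):
    """Fraction of the top-L predicted links found in the probe set."""
    m = sum(1 for e in ranked_links[:L] if e in probe_set)
    return m / L                 # Precision = m / L
```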

Non-negative matrix factorization and link prediction
For the reader's reference, Table 1 summarizes frequently used notations.
Let B be the attribute matrix of the network, where m is the number of attributes and each column of B is the attribute vector of a node; the values of B can be confined to the range [0, 1] by normalizing its row vectors. We regard the adjacency matrix A as a non-negative characteristic matrix in which each column is the characteristic vector of a vertex. The goal of NMF is to obtain two non-negative matrices U ∈ ℝ₊^(n×k) and V ∈ ℝ₊^(k×n) whose product closely approximates A:

A ≈ UV.

Here k is the dimension of the latent space (k < n). U consists of the bases of the latent space and is called the base matrix; V contains the combination coefficients of the bases for reconstructing A and is called the coefficient (weight) matrix. This decomposition can be modeled as the following Frobenius-norm optimization problem:

min_{U,V} ‖A − UV‖²_F   s.t. U ≥ 0, V ≥ 0,

where ‖·‖_F is the Frobenius norm, and the constraints U ≥ 0 and V ≥ 0 require all elements of U and V to be non-negative. The similarity between nodes i and j in the latent space can then be represented by the similarity between the i-th and j-th column vectors of V. Similarly, the NMF form of the attribute matrix B is

B ≈ U^(B) V^(B),

and we need to solve the optimization problem

min ‖B − U^(B)V^(B)‖²_F   s.t. U^(B) ≥ 0, V^(B) ≥ 0,

where U^(B) ∈ ℝ₊^(m×k) and V^(B) ∈ ℝ₊^(k×n) are non-negative matrices, so that V and V^(B) have the same size.
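The basic factorization above can be sketched in a few lines of NumPy using the classical Lee-Seung multiplicative updates. This is a minimal illustration; the iteration count, initialization and the small constant eps (which guards against division by zero) are our own choices, not the paper's settings.

```python
import numpy as np

def nmf(A, k, iters=200, eps=1e-9, rng=0):
    """Factorize a non-negative matrix A (n x m) into U (n x k) and
    V (k x m) with A ~= U V, using the standard Lee-Seung
    multiplicative updates for the Frobenius-norm objective."""
    rng = np.random.default_rng(rng)
    n, m = A.shape
    U = rng.random((n, k))
    V = rng.random((k, m))
    for _ in range(iters):
        # V <- V * (U^T A) / (U^T U V)   (elementwise)
        V *= (U.T @ A) / (U.T @ U @ V + eps)
        # U <- U * (A V^T) / (U V V^T)   (elementwise)
        U *= (A @ V.T) / (U @ V @ V.T + eps)
    return U, V
```

Because the updates are multiplicative, non-negative starting matrices remain non-negative throughout.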
To integrate the topology matrix A and the attribute matrix B, we map them into the same latent space through a common projection, which is implemented by non-negative matrix factorizations of A and B. However, when we factorize A and B simultaneously, we cannot guarantee that the projection matrices V and V^(B) are identical. Therefore, we introduce a consensus matrix V* and require that both V and V^(B) stay close to it; namely, we minimize

‖V − V*‖²_F + ‖V^(B) − V*‖²_F.

Our goal is thus to solve the following optimization problem:

min ‖A − UV‖²_F + ‖B − U^(B)V^(B)‖²_F + μ(‖V − V*‖²_F + ‖V^(B) − V*‖²_F)
s.t. U, V, U^(B), V^(B), V* ≥ 0, ‖U‖₁ = 1, ‖U^(B)‖₁ = 1,   (8)

where μ > 0 balances the reconstruction errors against the consensus terms. To remove the constraints ‖U‖₁ = 1 and ‖U^(B)‖₁ = 1 in Eq (8), we define auxiliary diagonal matrices Q and Q^(B) whose diagonal entries are the column sums of U and U^(B), respectively. The matrices U and U^(B) can then be normalized into UQ⁻¹ and U^(B)Q^(B)⁻¹. Since UV = (UQ⁻¹)(QV) and U^(B)V^(B) = (U^(B)Q^(B)⁻¹)(Q^(B)V^(B)), the matrices V and V^(B) are correspondingly normalized into QV and Q^(B)V^(B). Therefore the optimization problem (8) is equivalent to minimizing

J = ‖A − UV‖²_F + ‖B − U^(B)V^(B)‖²_F + μ(‖V − V*‖²_F + ‖V^(B) − V*‖²_F),   (11)

and our goal is to find the optimal solution of formula (11).
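Under this reading of the combined objective, the value of J can be evaluated directly. The helper below is our own illustration (μ is written as `mu`); it is useful for monitoring convergence of the iterative method.

```python
import numpy as np

def objective_J(A, B, U, V, UB, VB, Vstar, mu):
    """Combined objective (11): topology and attribute reconstruction
    errors plus penalties pulling both coefficient matrices toward the
    shared consensus matrix V* (mu is the trade-off weight)."""
    fro2 = lambda M: np.linalg.norm(M, 'fro') ** 2
    return (fro2(A - U @ V) + fro2(B - UB @ VB)
            + mu * (fro2(V - Vstar) + fro2(VB - Vstar)))
```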

Iterative update rule of non-negative matrix
To solve the optimization problem in Eq (11), we present an iterative method. Since U, V, U^(B), V^(B) and V* are all variables in Eq (11), we fix four of them and update the remaining one in turn.

Updating U. Derivation 1. To optimize the part of the objective function that depends on U,

F(U) = ‖A − UV‖²_F,   (12)

the following updating rule can be used:

U_lm ← U_lm (AVᵀ)_lm / (UVVᵀ)_lm.   (13)

Proof: According to the definition of the Frobenius norm, we can expand

F(U) = tr((A − UV)(A − UV)ᵀ) = tr(AAᵀ) − 2tr(AVᵀUᵀ) + tr(UVVᵀUᵀ).

Taking the derivative of F(U) with respect to U_lm gives

∂F(U)/∂U_lm = (−2AVᵀ + 2UVVᵀ)_lm.

By the Karush-Kuhn-Tucker (KKT) condition U_lm ∂F(U)/∂U_lm = 0, we get

(UVVᵀ)_lm U_lm = (AVᵀ)_lm U_lm,

from which the update rule (13) of U follows.

Updating U^(B). Derivation 2. We construct the function

F(U^(B)) = ‖B − U^(B)V^(B)‖²_F,

and the update rule for F(U^(B)) can be written as

U^(B)_lm ← U^(B)_lm (B V^(B)ᵀ)_lm / (U^(B) V^(B) V^(B)ᵀ)_lm.   (15)

Derivation 2 can be proved in the same way as Derivation 1.
Updating V. Derivation 3. We construct the function

G(V) = ‖A − UV‖²_F + μ‖V − V*‖²_F,

and the update rule for G(V) can be written as

V_lm ← V_lm (UᵀA + μV*)_lm / (UᵀUV + μV)_lm.   (17)

Proof: According to the definition of the Frobenius norm, we can expand

G(V) = tr(AAᵀ) − 2tr(AᵀUV) + tr(VᵀUᵀUV) + μ tr((V − V*)(V − V*)ᵀ).

Taking the derivative of G(V) with respect to V_lm gives

∂G(V)/∂V_lm = (−2UᵀA + 2UᵀUV + 2μ(V − V*))_lm.

By the KKT condition V_lm ∂G(V)/∂V_lm = 0, we get

(UᵀUV + μV)_lm V_lm = (UᵀA + μV*)_lm V_lm,

namely the update rule (17) of V_lm.

Updating V^(B). Derivation 4. We construct the function

G(V^(B)) = ‖B − U^(B)V^(B)‖²_F + μ‖V^(B) − V*‖²_F,

and the update rule for G(V^(B)) can be written as

V^(B)_lm ← V^(B)_lm (U^(B)ᵀB + μV*)_lm / (U^(B)ᵀU^(B)V^(B) + μV^(B))_lm.   (20)

Derivation 4 can be proved in the same way as Derivation 3.

Updating V*. Derivation 5. We construct the function

H(V*) = μ(‖V − V*‖²_F + ‖V^(B) − V*‖²_F).

Proof: By the KKT condition ∂H(V*)/∂V*_lm = −2μ(V − V*)_lm − 2μ(V^(B) − V*)_lm = 0, we get the update rule of V*_lm:

V*_lm = (V_lm + V^(B)_lm) / 2.   (22)
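Collecting the five rules, one full iteration can be sketched as follows. This is our reconstruction of the updates; the small constant EPS guards against division by zero and is not part of the derivation. Here A is the n×n adjacency matrix and B the m×n attribute matrix, so that V, V^(B) and V* all have size k×n.

```python
import numpy as np

EPS = 1e-9  # numerical guard, not part of the derivation

def nmf_lp_step(A, B, U, V, UB, VB, Vstar, mu):
    """One round of the multiplicative updates (13), (15), (17),
    (20) and (22); all matrices are updated in place and stay
    non-negative."""
    U  *= (A @ V.T) / (U @ V @ V.T + EPS)                             # rule (13)
    UB *= (B @ VB.T) / (UB @ VB @ VB.T + EPS)                         # rule (15)
    V  *= (U.T @ A + mu * Vstar) / (U.T @ U @ V + mu * V + EPS)       # rule (17)
    VB *= (UB.T @ B + mu * Vstar) / (UB.T @ UB @ VB + mu * VB + EPS)  # rule (20)
    Vstar[:] = 0.5 * (V + VB)                                         # rule (22)
    return U, V, UB, VB, Vstar
```

Each rule is a coordinate-descent step on the objective J, so the objective is non-increasing over full rounds.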

Framework of the NMF-LP algorithm for link prediction
Based on the iterative method for computing the NMF, we present an algorithm named NMF-LP for link prediction based on non-negative matrix factorization. The framework of the algorithm is as follows. Line 12 of the algorithm calculates the similarity between the column vectors of the weight matrix V* and stores the similarities in the scoring matrix S, which is output on line 13 as the final result of link prediction.
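The scoring step on line 12 can be sketched as follows. The text does not state which vector similarity is used, so cosine similarity between the columns of V* is shown here as one plausible choice.

```python
import numpy as np

def score_matrix(Vstar, eps=1e-12):
    """Cosine similarity between the column vectors of the weight
    matrix V* (k x n); S[i, j] is the predicted score of link (i, j)."""
    X = Vstar / (np.linalg.norm(Vstar, axis=0) + eps)  # normalize columns
    S = X.T @ X
    np.fill_diagonal(S, 0.0)   # self-links are not predicted
    return S
```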

Time complexity analysis
In each iteration, steps 5 to 10 require O(n²k) time. Each column of the matrix V* is a k-dimensional vector, so computing the similarity between two such vectors takes O(k) time, and step 12 therefore requires O(n²k) time to compute the similarities for all pairs of column vectors of V*. Since k and the number of iterations t can be treated as constants, the complexity of the algorithm is O(n²). In similarity-based link prediction methods, n² similarity scores between node pairs must be computed, so O(n²) is a lower bound on the time complexity of such methods.

Convergence analysis of the algorithm NMF-LP
In this section, we prove the convergence and correctness of the iterative updates of U, V, U^(B), V^(B) and V*. To prove the convergence of the algorithm, we make use of an auxiliary function based on the following lemma.

Lemma 1. Let F ∈ ℝ₊^(n×n) and G ∈ ℝ₊^(k×k) be two symmetric matrices, and let S ∈ ℝ₊^(n×k) and S′ ∈ ℝ₊^(n×k) be two non-negative n × k matrices. Then

Σ_{ij} (F S′ G)_{ij} S_{ij}² / S′_{ij} ≥ tr(SᵀFSG).

Proof:
We set S_{ij} = S′_{ij} p_{ij}; then the difference Δ between the two sides of the inequality in Lemma 1 is

Δ = Σ_{ij} (F S′ G)_{ij} S′_{ij} p_{ij}² − tr(SᵀFSG) = Σ_{ijkl} F_{ik} S′_{kl} G_{lj} S′_{ij} (p_{ij}² − p_{ij} p_{kl}).

Noticing that F and G are symmetric matrices, we get

Δ = Σ_{ijkl} F_{ik} S′_{kl} G_{lj} S′_{ij} (p_{ij}² + p_{kl}²)/2 − Σ_{ijkl} F_{ik} S′_{kl} G_{lj} S′_{ij} p_{ij} p_{kl},

namely

Δ = (1/2) Σ_{ijkl} F_{ik} S′_{kl} G_{lj} S′_{ij} (p_{ij} − p_{kl})² ≥ 0.

Recall that F(h, h′) is an auxiliary function of J(h) if F(h, h′) ≥ J(h) and F(h, h) = J(h); by the standard argument (Lemma 2), J is non-increasing under the update h^(t+1) = argmin_h F(h, h^(t)). Based on this idea of constructing auxiliary functions, we give the following lemma to prove the convergence of the algorithm NMF-LP.

Lemma 3. Fixing any four of the matrices U, V, U^(B), V^(B) and V*, and using the updating rules (13), (15), (17), (20) and (22) in each iteration of algorithm NMF-LP, the value of the objective function J is non-increasing.
Proof: First, we fix the values of U, U^(B), V^(B) and V* to update V; formula (11) then reduces to the following optimization problem:

min_{V ≥ 0} J(V) = ‖A − UV‖²_F + μ‖V − V*‖²_F.

According to the definition of the Frobenius norm,

J(V) = tr(AAᵀ) − 2tr(AᵀUV) + tr(VᵀUᵀUV) + μ tr(VVᵀ) − 2μ tr(V*Vᵀ) + μ tr(V*V*ᵀ).

We construct the following function:

F(V′, V^(t)) = tr(AAᵀ) − 2tr(AᵀUV′) + Σ_{ij} (UᵀUV^(t) + μV^(t))_{ij} V′_{ij}² / V^(t)_{ij} − 2μ tr(V*V′ᵀ) + μ tr(V*V*ᵀ).

By Lemma 1, Σ_{ij} (UᵀUV^(t))_{ij} V′_{ij}² / V^(t)_{ij} ≥ tr(V′ᵀUᵀUV′), and Σ_{ij} (μV^(t))_{ij} V′_{ij}² / V^(t)_{ij} = μ tr(V′V′ᵀ). Therefore, we get the inequality

F(V′, V^(t)) ≥ J(V′), with F(V^(t), V^(t)) = J(V^(t)),

so F(V′, V^(t)) is an auxiliary function of J(V). Based on the definition of the auxiliary function and Lemma 2, if we find the value V′ that reaches the minimum of the auxiliary function F(V′, V^(t)), then the value of J(V) is non-increasing over t. Therefore, we find the V′ that minimizes F(V′, V^(t)) with V^(t) fixed.
To obtain the minimum of the function F(V′, V^(t)), the KKT conditions give

∂F/∂V′_lm = −2(UᵀA)_lm + 2(UᵀUV^(t) + μV^(t))_lm V′_lm / V^(t)_lm − 2μV*_lm = 0,

so we get the following update formula:

V′_lm = V^(t)_lm (UᵀA + μV*)_lm / (UᵀUV^(t) + μV^(t))_lm.

Therefore J(V^(t)) is non-increasing under update rule (17), and we obtain the update formula for V_lm by replacing V′_lm and V^(t)_lm with V^(t+1)_lm and V^(t)_lm, respectively. Hence, using Eq (17) at each iteration of algorithm NMF-LP, the value of the objective function J is non-increasing.
In a similar way, we can prove that using rules (13), (15), (20) and (22) at each iteration of algorithm NMF-LP, for updating U, U^(B), V^(B) and V* respectively, the value of the objective function J is also non-increasing.
Although the process in algorithm NMF-LP is not guaranteed to converge to a global minimum, the iteration stopping condition ‖A − UV‖²_F ≤ ε ensures that the resulting product UV is an acceptable factorization of A and meets our requirements for solving the link prediction problem.
Based on Eq (25), we know that the value of V given by Eq (30) is also non-increasing, and since V > 0, it converges. The correctness of the converged solution is assured by the fact that, at convergence, the solution of Eq (30) satisfies the fixed-point condition of Eq (18). In a similar way, the convergence and correctness of the updating formulas for U, U^(B), V^(B) and V* can be proved.

Test on networks without node attributes
In this section, we verify the reliability of our algorithm NMF-LP on six benchmark data sets that are networks without node attributes: the US airport network (USAir), the US political blogs network (PB), the coauthorship network between scientists (NS), a protein-protein interaction network (PPI), the electrical power grid of the western US (Grid) and the Internet (INT). For each non-connected network, we extracted the largest connected component. Table 1 lists the topological characteristics of the largest components of these networks, where N and M respectively denote the numbers of nodes and links. NUMC indicates the number of connected components in the network together with the size of the largest component; for instance, 1222/2 means that the network has 2 connected components, of which the largest consists of 1222 nodes. In this table, e represents the network efficiency, C and r are the clustering and assortativity coefficients, and K represents the average degree of the network.
To estimate the reliability of the results, randomly generated 10-fold cross-validation was applied: the observed links were randomly divided into 10 subsets, one subset serving as the probe data for testing the algorithm and the other 9 subsets serving as training data. The cross-validation process was repeated 10 times, and the average of the 10 outcomes over the folds was taken as a single estimate.
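The splitting procedure above can be sketched as follows (a minimal illustration; the fold-assignment scheme and seed are our own choices):

```python
import random

def ten_fold_splits(edges, seed=42):
    """Randomly partition the observed links into 10 folds; each fold
    serves once as the probe set E_P while the other nine folds
    together form the training set E_T."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)
    folds = [edges[i::10] for i in range(10)]
    for i in range(10):
        probe = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        yield train, probe
```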
In the first step of the non-negative matrix factorization, we need to set the number of columns of the base matrix. Assume the original adjacency matrix has N rows and M columns; then the number of base vectors, λ, is required to satisfy (N + M)λ < NM. Since M = N in a network without node attributes, we get λ < N/2. In our experiments, we set λ = N/2^i, where i = 1, 2, . . ., 6. Fig 1 shows how the AUC scores of the NMF-LP algorithm change with different values of i on the six data sets tested. It can be seen from Fig 1 that the AUC of NMF-LP shows a decreasing trend as i increases, but it always remains high on all six data sets, indicating that the NMF-LP algorithm not only obtains relatively higher AUC values on real networks than other algorithms but also reduces data storage space.
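The grid of latent dimensions used in the experiments can be generated as follows (a trivial helper of our own; integer division is an assumption for non-power-of-two N):

```python
def candidate_ranks(N, i_max=6):
    """Latent-space sizes lambda = N // 2**i for i = 1..i_max,
    matching the experimental grid lambda = N / 2**i."""
    return [max(1, N // 2 ** i) for i in range(1, i_max + 1)]
```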
We also compared the results produced by NMF-LP with those of other algorithms, namely CN, Jaccard, Sorensen, PA, Salton, HDI, LHN_I and HPI, using the AUC score as the criterion. Table 2 lists the average AUC scores of these tests via NMF-LP and the other algorithms; in the table, bold face marks the highest AUC score among the 9 algorithms for each data set.
As can be seen from Table 3, among the 9 algorithms applied in this study, NMF-LP attained the highest AUC scores on most of the data sets. Even on the data set INT, which is considered the most difficult one, NMF-LP also performed very well and gained the highest score, 0.961. This confirms the high reliability of the algorithm NMF-LP.


Test on networks with node attributes
We chose some authors from each database. For example, in the ACM database, the authors we chose are people who had been members of the ACM conference from 1986 to 1996. For every database we built a network in which each node is a specific author, and a co-authorship between two authors is recorded as a link between the corresponding nodes. For each author, we collected his or her overall publications, and the keywords in the paper titles represent the attributes of the author. Because the networks of some databases were not connected, our tests were conducted only on the largest connected components. Table 4 lists the topological characteristics of the largest components of these networks, where N and M respectively denote the numbers of nodes and links, NUMC indicates the number of connected components in the network together with the size of the largest component, e represents the network efficiency, C and r are the clustering and assortativity coefficients, and K represents the average degree of the network.

Similar to the test results on networks without node attributes, Fig 2 shows that the AUC of the NMF-LP algorithm presents a decreasing trend as i increases, but always remains high on the eight data sets. This also indicates that the NMF-LP algorithm not only reduces the scale and dimension of the original matrix, but also maintains a relatively high AUC value while the dimension of the base vectors is reduced.
In addition, the AUC scores of NMF-LP were evaluated and compared with those of other algorithms such as Salton, Sorensen, HPI, HDI, LHN_I, PA and CN. The average AUC values over 10-fold cross-validation tests using NMF-LP and the other algorithms are shown in Table 4. In the table, the highest AUC score among the 9 algorithms for each data set is highlighted in bold face.
From Table 5, we can see that NMF-LP achieves the highest AUC scores on all the data sets among the 9 algorithms. For instance, on the ACM data set the other algorithms obtain AUC scores below 0.8635, while NMF-LP obtains the highest AUC score, 0.9980, which shows that the algorithm NMF-LP can obtain high-quality results with strong robustness.

Test on the time requirement of the algorithm
In the process of the non-negative matrix factorization, we calculate the error between the original matrix A and the product of the base matrix U and the weight matrix V, defined as error = ‖A − UV‖²_F, and plot the relation between this error and the number of iterations on different data sets. The results of the tests on networks without node attributes are shown in Fig 3, and Fig 4 shows the results on networks with node attributes. From Figs 3 and 4 we can see that during the non-negative matrix factorization the error decreases sharply and quickly reaches the convergence condition. The experimental results demonstrate that the algorithm not only reduces data storage space but also effectively improves the prediction performance while keeping a low time complexity.
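The convergence curves in Figs 3 and 4 can be reproduced in spirit by recording the error after each iteration. The sketch below uses the plain NMF updates on synthetic data; the real experiments use the full NMF-LP updates and the benchmark networks.

```python
import numpy as np

def factorize_with_trace(A, k, iters=100, eps=1e-9, rng=0):
    """Basic NMF run that records the reconstruction error
    ||A - U V||_F^2 after every iteration, mirroring the
    error-vs-iterations curves discussed above."""
    rng = np.random.default_rng(rng)
    n, m = A.shape
    U, V = rng.random((n, k)), rng.random((k, m))
    errors = []
    for _ in range(iters):
        V *= (U.T @ A) / (U.T @ U @ V + eps)
        U *= (A @ V.T) / (U @ V @ V.T + eps)
        errors.append(np.linalg.norm(A - U @ V, 'fro') ** 2)
    return U, V, errors
```

Plotting `errors` against the iteration index gives the monotonically decreasing curve predicted by the convergence analysis.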

Conclusion
With the expansion of network size, the large amount of redundant information caused by the high dimensionality and sparsity of real networks reduces the performance of link prediction. In this paper, a new link prediction algorithm was proposed on the basis of non-negative matrix factorization. In the algorithm, the original adjacency and attribute matrices of the network are effectively decomposed into non-negative base and weight matrices, and the similarity between the column vectors of the weight matrix is then used as the scoring matrix between the nodes. The experimental results show that, compared with other algorithms, the proposed method obtains higher-quality results on real networks while reducing data storage space and maintaining low time complexity.