A Noise-Filtering Method for Link Prediction in Complex Networks

Link prediction plays an important role both in finding missing links in networked systems and in complementing our understanding of network evolution. Much attention from the network science community has been paid to how to efficiently predict missing or future links based on the observed topology. Real-world information always contains noise, and an observed network is no exception, yet this problem is rarely considered in existing methods. In this paper, we treat the existence of observed links as known information. By filtering out the noise in this information, the underlying regularity of the connection pattern is retrieved and then used to predict missing or future links. Experiments on various empirical networks show that our method performs noticeably better than baseline algorithms.


Introduction
About one and a half decades ago, Barabási and Albert pointed out that the scale invariance of many real networked systems originates from a specific growth process, named preferential attachment [1]. Since then, the study of complex networks has led to dramatic changes in many different fields [2][3][4][5][6][7], and many facets of node attractiveness in growing networks beyond preferential attachment have been revealed, e.g., similarity [8]. Since different growth processes often result in networks with strikingly different macroscopic properties, how real-world networks evolve is a fundamental question in understanding our complex world. Link prediction, one of whose capabilities is to rank the best candidates for future links, plays an important role in revealing the evolution processes of networks [9,10].
On the other hand, many applications have to predict missing links in networked systems [11][12][13]. Determining whether a link exists in such networks is usually very costly, yet the answer is crucial. For example, knowing the map of protein-protein interactions would reveal many aspects of cellular function [14], but the map is still far from complete. Link prediction is also widely used in these applications [15,16].
The problem of link prediction has received much attention from the network science community in the past few years [9,12,17,18]. In general, both topological features and node attributes can be used in the prediction. However, the latter are usually unavailable or unreliable. For example, in online social networks, the personal information of users is inaccessible due to privacy policies. Thus, many algorithms consider only topological features.
Basically, there are two classes of topological methods: similarity-based and likelihood-based algorithms. Similarity-based algorithms assume that two nodes are likely to be connected if they are similar. Such an algorithm assigns a score s_xy to each pair of nodes x and y, defined as the similarity between them. All non-observed links are ranked according to their scores, and the links connecting more similar nodes are assumed to have higher existence likelihoods. A wealth of methods of this type have been proposed. For example, CN (Common Neighbours) [19] uses the number of common neighbours to rank the similarity of nodes and the likelihood that they are or will be linked. Many variations of CN have also been proposed: AA (Adamic-Adar) [20] and Resource Allocation (RA) [19] give more importance to common neighbours with lower degree, and Jaccard's index is a normalised CN. Only local structural information is used in these methods. There are also methods utilizing quasi-local or global information. For example, the Local Path method defines the similarity as the number of paths connecting two nodes, whose length may be larger than 2.
Recently, the organization patterns existing in many real-world networks have been exploited to predict missing links. Likelihood-based methods make assumptions about the structure, with specific parameters obtained by maximising the likelihood of the observed structure. Predictions of the non-observed links are made based on the presumed pattern and the fitted parameters. For example, Ref. [21] utilizes the hierarchical structure existing in many networks to predict missing links, and Cannistraci et al. propose the local-community-paradigm to improve the performance of classical predictors [13].
We know that real-world information always contains noise, and an observed network is no exception. However, this problem is rarely considered in existing methods. In Ref. [18], the authors average the eigen-decompositions of perturbed adjacency matrices (obtained by removing some links) to suppress the noise. However, the underlying physical meaning is not clear: why should the eigenvectors of the adjacency matrix reflect the regularity of a network if they are in fact sensitive to perturbation [22]? Besides, that method has a high computational complexity. In this paper, by treating the existence of observed links as known "information" (as in [23,24]) and filtering out the noise in it, we obtain similarity scores for all non-observed links. We give a more theoretical analysis of the link prediction problem and a more meaningful demonstration of the noise-filtering (NF) method. Our method outperforms the typical predictors.

Metrics
In this paper, two metrics are used to compare the performance of the baseline algorithms and the proposed noise-filtering method.
Consider that we are given a simple network G(V, E), where V and E are the sets of nodes and links, respectively. By "simple", we mean that there are no self-loops or multi-links in the network. A similarity-based algorithm assigns a similarity score to each pair of nodes x, y ∈ V without a link between them. All unlinked pairs are then ranked in descending order of their scores, and the pairs at the top are considered to have the highest likelihoods of being connected.
To test the accuracy of a predictor, we randomly divide the observed links in the network into a training set E_T and a probe set E_P. Here, E_T is treated as known information while E_P is only used to test the accuracy. Clearly, we have E_T ∪ E_P = E and E_T ∩ E_P = ∅.
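The division described above can be sketched as follows (a minimal illustration, not the authors' code; the function name and the toy edge list are our own):

```python
# Randomly split the observed links E into a training set E_T (a fraction p
# of the links) and a probe set E_P, so that E_T ∪ E_P = E and E_T ∩ E_P = ∅.
import random

def split_links(edges, p=0.9, seed=0):
    """Randomly divide the observed links into a training and a probe set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    cut = int(round(p * len(shuffled)))
    return set(shuffled[:cut]), set(shuffled[cut:])

edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5),
         (2, 5), (1, 5), (3, 5), (2, 4), (1, 4)]
E_T, E_P = split_links(edges, p=0.9)  # 90% training, 10% probe
```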
In this study, we use two metrics, AUC (Area Under the Receiver operating characteristic curve) and precision to evaluate the performance of a predictor. They are defined as follows.
• AUC: AUC is a metric from receiver operating characteristic (ROC) analysis [25]. Taking the top-L links as predicted links, a ROC curve is obtained by plotting the true positive rate versus the false positive rate for varying L. AUC can thus be interpreted as the probability that a randomly chosen missing link (i.e., a link in E_P) has a higher score than a randomly chosen non-existent link (i.e., a link in U − E, where U is the set of all possible links) in the ranking of all non-observed links. In the algorithmic implementation, if among n independent comparisons there are n′ times in which the score of the missing link is higher than that of the non-existent link and n″ times in which the two scores are equal, then AUC can be expressed as AUC = (n′ + 0.5 n″) / n. If all the scores were generated from an independent and identical distribution, AUC would be approximately 0.5. Therefore, the extent to which AUC exceeds 0.5 indicates how much better the algorithm performs than pure chance.
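The sampling estimate of AUC described above can be sketched as follows (a hedged illustration with made-up score lists, not the authors' implementation):

```python
# Estimate AUC by repeatedly comparing the score of a random missing link
# with the score of a random non-existent link, counting wins (n') and
# ties (n'') over n independent comparisons.
import random

def auc_score(scores_missing, scores_nonexistent, n=10000, seed=0):
    rng = random.Random(seed)
    n_higher = n_equal = 0
    for _ in range(n):
        s_miss = rng.choice(scores_missing)
        s_non = rng.choice(scores_nonexistent)
        if s_miss > s_non:
            n_higher += 1
        elif s_miss == s_non:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n

# If missing links always score higher, AUC is 1; identical scores give 0.5.
auc_perfect = auc_score([2.0, 3.0], [0.0, 1.0])
auc_chance = auc_score([1.0], [1.0])
```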
• Precision: Given the ranking of the non-observed links, precision is defined as the ratio of relevant items selected to the total number of items selected. If we take the top-L links in the ranking and L_r of them are correctly predicted (i.e., appear in the probe set), then precision = L_r / L. Clearly, higher precision means higher accuracy. In this paper, L is always set to the size of the probe set.
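A minimal illustration of the precision metric (the ranked list and probe set below are toy values of our own):

```python
# Precision at L: the fraction of the top-L ranked non-observed links
# that actually appear in the probe set (L_r / L).
def precision_at_L(ranked_links, probe_set, L):
    top = ranked_links[:L]
    L_r = sum(1 for link in top if link in probe_set)
    return L_r / L

ranked = [(1, 4), (2, 5), (3, 6), (1, 6)]   # links sorted by descending score
probe = {(1, 4), (3, 6)}                     # the held-out links
p = precision_at_L(ranked, probe, L=2)       # one of the top-2 is in the probe set
```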

Data Description
Networks from different fields are considered in the experiments, including biological, social, and technological networks. The original networks are turned into undirected, simple networks (with multiple links and self-loops removed). These networks are described in the following.
i) Karate [26]: A social network of a university karate club.
ii) FoodWeb [27]: A food web in Florida Bay during the rainy season.
iii) Jazz [28]: A collaboration network of jazz musicians.
iv) Neural [29]: The neural network of C. elegans.
v) USAir [30]: The US air transportation network.
vi) Metabolic: The metabolic network of C. elegans.
vii) Email [31]: An email network compiled by Alex Arenas.
viii) PB [32]: A network of US political blogs.
ix) Yeast [33]: A protein-protein interaction network.
x) EPA [34]: A network of web pages linking to the website www.epa.gov.
xi) Router [35]: The router-level topology of the Internet.
xii) WikiVote [36,37]: All the Wikipedia voting data from its inception till January 2008.
Their basic topological parameters are summarized in Table 1.

Baseline Algorithms for Comparison
In this paper, six representative similarity indices are considered for performance comparison: Common Neighbours (CN), Adamic-Adar (AA) [20], Resource Allocation (RA) [19], Preferential Attachment (PA) [38], Local Path (LP) [39], and Katz [40]. The first four are local indices, the fifth is a quasi-local index, and the last is a global one. Some of them were briefly introduced earlier.
Here we present the details of these algorithms.
1. CN index. The CN index follows the intuition that two nodes x and y are more likely to be connected if their neighbourhoods overlap substantially. The similarity score is s_xy = |Γ(x) ∩ Γ(y)|, where Γ(x) is the set of neighbours of x and |·| denotes the cardinality of a set.
2. AA index. AA is a variation of CN that gives less importance to common neighbours with high degree: s_xy = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / log k_z, where k_z is the degree of node z.
3. RA index. Similar to AA, except that RA punishes high-degree common neighbours to a greater extent: s_xy = Σ_{z ∈ Γ(x) ∩ Γ(y)} 1 / k_z.
4. PA index. The PA index supposes that popular nodes are more likely to receive new connections. It is defined as s_xy = k_x k_y.
5. LP index. Unlike the previous indices, LP uses second-order information (information about neighbours of the neighbours) to improve performance. It is defined by S = A² + εA³, where A is the adjacency matrix and ε is a free parameter.
6. Katz index. This index sums over the numbers of paths (including loops) between two nodes, with each number exponentially damped by the path length: S = Σ_{l=1}^{∞} β^l A^l = (I − βA)^{−1} − I, where β is a damping parameter smaller than the reciprocal of the largest eigenvalue of A, so that the sum converges.
Note that the LP and Katz indices are both parameter-dependent.
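The local indices and LP above can be sketched in a few lines (a toy illustration under our own naming, not the paper's code; the 4-node graph and the value of ε are our own):

```python
# Local similarity indices on a toy graph stored as an adjacency dict:
# Γ(x) is the neighbour set adj[x], and k_z = len(adj[z]) is the degree.
import math
import numpy as np

adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}

def cn(x, y): return len(adj[x] & adj[y])                                # common neighbours
def aa(x, y): return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])
def ra(x, y): return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])
def pa(x, y): return len(adj[x]) * len(adj[y])                           # degree product

s_cn = cn(1, 4)   # nodes 1 and 4 share neighbours {2, 3}

# The quasi-local LP index, S = A^2 + eps * A^3, on the same graph:
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], float)
eps = 0.01
S_lp = A @ A + eps * (A @ A @ A)   # counts paths of length 2 and (damped) length 3
```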

Link Prediction via Noise Filtering
In many networks, the formation of links embodies both regularities and irregularities.
Only the former shows a uniform pattern, which we call the intrinsic pattern. For a specific link, if its existence does not correspond with this pattern, i.e., with the connection pattern of the whole network, then its existence should be treated as noise. A large body of link prediction methods (e.g., the common-neighbour method) assumes that nodes are linked if they are similar. Following this assumption, we treat links connecting dissimilar nodes as noise. By filtering out the noise, we can obtain the intrinsic connection pattern, which can then be used to predict missing or future links.
To this end, one has to define a measure to quantify the degree to which a link connects dissimilar nodes.
For every node in the network, assume that its topological features are captured by a vector in R^m. Define the feature matrix X as the n-by-m matrix whose rows are the feature vectors of the nodes. Thus X_ik is the k-th feature of node i, and X_•k, the k-th column of X, collects the k-th feature of all nodes. In real-world cases, features usually contain noise.
In some typical link prediction methods (e.g., the common-neighbour method), nodes are assumed to be linked because they are similar. Focusing on the k-th feature, we may measure to what degree dissimilar nodes are linked in the whole network by

D′_k = Σ_{i∼j} (X_ik − X_jk)² = X_•k^T L X_•k,

where i ∼ j indicates that i and j are neighbours and L is the Laplacian matrix [41]. However, this measure is biased. On the right-hand side of the first equality, the feature X_ik of node i appears in d_i different terms of the summation, where d_i is the degree of node i. Features of high-degree nodes therefore dominate the value of D′_k, while in many real-world networks most nodes have low degree [1]. Thus the value of D′_k does not properly account for the similarity of the features of the majority of nodes.
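The identity between the edge sum and the quadratic form can be verified numerically (a small check of our own, with an arbitrary feature vector):

```python
# Check that x^T L x equals the sum of (x_i - x_j)^2 over all links,
# i.e., the Laplacian quadratic form measures how strongly dissimilar
# nodes are connected.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], float)
D = np.diag(A.sum(axis=1))
L = D - A                      # the (combinatorial) Laplacian

x = np.array([0.2, 1.0, -0.5, 0.3])   # an arbitrary feature column X_.k
quad = x @ L @ x
edge_sum = sum((x[i] - x[j]) ** 2
               for i in range(len(x)) for j in range(i + 1, len(x)) if A[i, j])
```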
The rightmost term in the above equation is the quadratic form of the Laplacian. To treat features from different nodes equally, a natural alternative is the quadratic form of the normalised Laplacian matrix L̃ [41]:

D_k = X_•k^T L̃ X_•k = Σ_{i∼j} (X_ik/√d_i − X_jk/√d_j)².

The quadratic form of L̃ has a similar interpretation to that of L: a larger D_k indicates that, to a larger extent, dissimilar nodes are linked together. Thus D_k can be used as an unbiased dissimilarity measure of the k-th feature. In signal processing, to filter out noise, a signal is decomposed into a set of sine waves of different frequencies; the waves with higher frequencies oscillate more rapidly, and those whose frequencies fall within the band of noise are filtered out. In our case, the eigenvectors of the normalised Laplacian provide a similar notion of frequency. To see this, denote by λ_1 < λ_2 < ⋯ < λ_n the eigenvalues of the normalised Laplacian matrix L̃, and by v_1, v_2, …, v_n the corresponding eigenvectors. The Courant-Fischer theorem [42] tells us that

v_1 = arg min_{x: ‖x‖₂ = 1} x^T L̃ x  and  v_l = arg min_{x: ‖x‖₂ = 1, x ⊥ span{v_1, …, v_{l−1}}} x^T L̃ x.

So if X_•k = v_1, then D_k attains its smallest value, which indicates that v_1 oscillates slowly among connected nodes (since D_k is a dissimilarity measure). The eigenvectors associated with larger eigenvalues oscillate more rapidly.
Similar to noise filtering in signal processing, we can project X_•k onto {v_i} and filter out the components with high "frequency", i.e., the components on v_i with large index i, since we treat the existence of links connecting dissimilar nodes as noise. Denoting the cut-off threshold by t, the noise-filtered X_•k reads

X̂_•k = V_t V_t^T X_•k,

where V_t is the matrix whose columns are the first t eigenvectors of L̃, i.e., those with the smallest eigenvalues. Since no prerequisite is required of k, the above derivation for the k-th feature generalises directly to any other feature, and we obtain the noise-filtered features for the whole network:

X̂ = V_t V_t^T X.

For any node i, its connections with all other nodes in the network are completely characterised by the corresponding row of the adjacency matrix A. One may therefore use these rows as the feature vectors of the nodes, as in [43,44], and interpret the k-th feature of node i as whether it is a neighbour of node k. But there is a minor issue with this choice. Recall that the above derivation is based on minimising the dissimilarity measure over all linked nodes. Consider two linked nodes i and j that have exactly the same neighbourhoods; we expect their dissimilarity to be 0. However, their i-th features will not be the same, since the i-th feature of i is 0 while the i-th feature of j is 1, and the same holds for the j-th feature. This analysis suggests using the rows of A + I rather than A as the feature vectors, so that the k-th feature of node i can be interpreted as whether its distance to node k is no more than 1. This is further demonstrated in Fig 1.

Applying the above methodology with X = A + I, we have

Ŝ = V_t V_t^T (A + I).

Entries of Ŝ reflect the intrinsic connection pattern, so they can be used to predict missing links. However, since we focus on undirected networks, there is still one problem with Ŝ: according to the equation above, it might not be symmetric. We therefore make predictions based on the entries of (Ŝ + Ŝ^T)/2 instead of Ŝ.
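The whole NF pipeline can be sketched compactly (a minimal sketch of the derivation above, not the authors' code; the 5-node toy graph and t value are our own):

```python
# Noise-filtering (NF) scores: build the normalised Laplacian, keep the t
# eigenvectors with the smallest eigenvalues ("low frequencies"), project
# the feature matrix A + I onto them, and symmetrise the result.
import numpy as np

def noise_filter_scores(A, t):
    d = A.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_norm = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt  # normalised Laplacian
    eigvals, eigvecs = np.linalg.eigh(L_norm)              # ascending eigenvalues
    V_t = eigvecs[:, :t]                                   # low-frequency basis
    S_hat = V_t @ V_t.T @ (A + np.eye(len(A)))             # filtered features
    return 0.5 * (S_hat + S_hat.T)                         # symmetrise

A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 1],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 1, 1, 1, 0]], float)
S = noise_filter_scores(A, t=3)
# Non-observed pairs are then ranked by their entries in S.
```

With t = n the projector is the identity, so the scores reduce to A + I; smaller t discards the high-frequency (noisy) components.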

Experimental Results
To compare the performance of the Noise-Filtering (NF) method with some well-known algorithms, 12 real-world networks, including biological, social, and technological networks, are considered in the experiments. They are transformed into undirected, simple networks (with multiple links and self-loops removed). The resulting networks are summarized in Table 1. Table 2 shows the prediction accuracy measured by AUC, and results under another widely used metric, precision, are presented in Table 3. These metrics were introduced earlier. The highest AUC/precision for each network (in each column) is shown in boldface. Under the AUC metric, NF performs best in 7 out of 12 networks, while under the precision metric it performs best in 9 of them. Figs 2 and 3 compare the prediction accuracy of the algorithms under varied partitioning ratios. The proposed method is either the best or very close to the best, except for a single network, PB. Moreover, Figs 2 and 3 also verify the robustness of the proposed method: in most networks, its accuracy remains the best or very close to the best even as the size of the training set varies.
Intuitively, the more known information, the higher the prediction accuracy. But in Fig 3 we see that, most of the time, the precision does not increase with the size of the training set. This is due to the different sizes of the probe sets (following convention, we always set L in the precision metric to the size of the probe set); with different sizes of the training set, the precision values cannot be compared [45].

[Fig 1 caption fragment: ... the 4th row of A reads [0, 1, 1, 0, 1] and the 5th reads [0, 1, 1, 1, 0], which are different. By adding I, the 4th and 5th rows of A + I are both [0, 1, 1, 1, 1], which is exactly what we want; this is also the case for nodes 2 and 3. The k-th feature of a node can be interpreted as whether the distance between it and node k is no more than 1. For example, the distance between nodes 1 and 4 is greater than 1, while the distances between all other nodes and node 4 are within 1, so the 4th feature is [0, 1, 1, 1, 1]^T.]

For all the parameter-dependent methods considered in the experiment, i.e., LP, Katz, and NF, the results correspond to the optimal parameter, that is, the one yielding the highest prediction accuracy. The optimal parameter is found through a process similar to K-fold cross-validation. For example, in the proposed method NF, the training set is first partitioned into K units; a single unit is retained as validation data for testing the method with a specific t, and the remaining K − 1 units are used as known information. The cross-validation is then repeated K times (the folds), with each of the K units used exactly once as validation data, and the K results from the folds are averaged. This whole process is repeated several times to find the optimal value of t (the optimal t is manually bounded within the range [1, 125], so the computational cost is relatively small).
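The K-fold search for the cut-off t can be sketched as follows (a hypothetical sketch of the procedure described above; the function names, the placeholder accuracy callback, and the toy link list are our own):

```python
# Pick the cut-off t by K-fold cross-validation: split the training links
# into K units, score each candidate t by the mean validation accuracy over
# the K folds, and keep the best-scoring t.
import random

def pick_t(train_links, candidates, accuracy, K=5, seed=0):
    """accuracy(known_links, validation_links, t) -> float, supplied by caller."""
    rng = random.Random(seed)
    links = list(train_links)
    rng.shuffle(links)
    folds = [links[i::K] for i in range(K)]
    best_t, best_acc = None, float("-inf")
    for t in candidates:
        scores = []
        for i in range(K):
            validation = folds[i]                                 # held-out unit
            known = [e for j in range(K) if j != i for e in folds[j]]
            scores.append(accuracy(known, validation, t))
        mean = sum(scores) / K
        if mean > best_acc:
            best_t, best_acc = t, mean
    return best_t
```

In practice the accuracy callback would run the NF predictor with threshold t on the known links and evaluate it on the validation links.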
In Fig 4, we see that for the two metrics considered here, the optimal t is robust: the value of t at which the prediction accuracy peaks does not change with the size of the training set. So there is no need to search for the optimal t in every single run of the simulation. Once the optimal t is found, it is kept fixed in all subsequent simulations, even as the size of the training set varies.
The experiments are conducted on a workstation with 64 GB RAM and an Intel(R) Xeon(R) E5-2687W @ 3.10 GHz 8-core processor. The comparison of computational time is summarized in Table 4. We see that the proposed NF method has a run time similar to that of the global index Katz, especially on large networks, while achieving better performance.

Discussion
Real-world information always contains noise, and observations of a network structure are no exception. This problem is rarely considered in existing link prediction methods. To address it, we treat the connections of a given network as known information and filter out the noise in it, based on the assumption that connected nodes should have similar neighbourhoods. The underlying regularity of the connection information is then retrieved and used to predict missing or future links. Experimental results show that the method performs better than typical algorithms. Future work includes improving the performance of existing methods based on the same idea of noise filtering.

[Table 4 caption: Each value is the total time (in seconds) for 100 runs, with independent random divisions into the training set (90%) and the probe set (10%). The method proposed in this paper is in the last column, NF (Noise Filtering).]