
Link prediction of heterogeneous complex networks based on an improved embedding learning algorithm

  • Lang Chai ,

    Contributed equally to this work with: Lang Chai, Rui Huang

    Roles Conceptualization, Funding acquisition, Methodology, Software, Writing – original draft

    chailang@cqjtu.edu.cn

    Affiliation School of Mathematics and Statistics, Chongqing Jiaotong University, Chongqing, China

  • Rui Huang

    Contributed equally to this work with: Lang Chai, Rui Huang

    Roles Methodology, Supervision, Writing – review & editing

    Affiliation School of Foundation Courses, Chongqing Institute of Engineering, Chongqing, China

Abstract

Link prediction in heterogeneous networks is an active research topic in the field of complex network science. Recognizing the limitations of existing methods, which often overlook the varying contributions of different local structures within these networks, this study introduces a novel algorithm named SW-Metapath2vec. This algorithm enhances the embedding learning process by assigning weights to meta-path traces generated through random walks and translates the potential connections between nodes into the cosine similarity of embedded vectors. The study was conducted using multiple real-world and synthetic datasets to validate the proposed algorithm’s performance. The results indicate that SW-Metapath2vec significantly outperforms benchmark algorithms. Notably, the algorithm maintains high predictive performance even when a substantial proportion of network nodes are removed, demonstrating its resilience and potential for practical application in analyzing large-scale heterogeneous networks. These findings contribute to the advancement of link prediction techniques and offer valuable insights and tools for related research areas.

Introduction

Nowadays, the interactions between various entities are becoming increasingly complex and frequent, leading to the generation of massive amounts of unstructured data. In the field of complex network science, scholars often abstract this unstructured data into complex networks composed of nodes and edges [1]. By studying these complex networks, researchers can gain significant insights into solving real-world problems [2]. Among these studies, link prediction in complex networks is a crucial problem in the science of complexity [3, 4]. Link prediction aims to forecast unobserved or potential future edges in a complex network by analyzing the observable nodes and edges [5, 6]. Theoretically, link prediction has the potential to reveal the mechanisms underlying the generation of complex network structures [7, 8]. Practically, it has wide applications across various fields, such as extracting hidden information in military combat networks of weapon systems [9]; detecting fraud in the financial sector [10]; predicting connections between neurons in E.coli and chemical reactions among components in metabolic networks in biology [11]; providing more precise product recommendations in consumer networks based on user-product relationships [12]; inferring associations between small molecules with unclear chemical information and proteins in drug discovery [13]; and identifying potential suspects in criminal investigations through social network analysis to swiftly dismantle criminal organizations [14].

In the field of link prediction, numerous outstanding works have advanced its development. Clauset et al. proposed a hierarchical model-based approach to link prediction, highlighting the significance of hierarchical organization in networks [15]. Guimerà and Sales-Pardo further investigated the impact of missing and spurious interactions on network reconstruction [16]. Matrix factorization has been extensively studied in the domain of link prediction, providing robust and interpretable frameworks for uncovering hidden relationships in networks [17–19]. Moreover, several link prediction algorithms based on hybrid methods have been extensively studied [20–22]. Lu et al. explored the predictability of link prediction in complex networks and introduced a framework for assessing the predictability of network structures [23]. Furthermore, these studies [24–26] have explored link predictability from various perspectives, yielding numerous valuable insights and results.

However, most link prediction research for complex networks currently focuses on homogeneous networks [27–29]. From the perspective of node and edge types [30], complex networks can be classified into homogeneous and heterogeneous networks. Heterogeneous complex networks contain multiple node types, which makes them more advantageous for characterizing complex real-world systems [31]. Heterogeneous networks are increasingly prevalent, facilitating communication among different types of individuals. Examples of such networks include recommendation systems [32], where recommendation links exist between users and commodities; scientific citation networks [33], where citation links connect scholars and articles; and protein synthesis networks [34], in which different types of links exist between various amino acids. Unfortunately, research on link prediction in heterogeneous complex networks is still relatively weak, although its importance has become increasingly prominent in both theoretical and practical terms.

Thanks to the swift progress in deep learning methods, as seen in references [35–37], researchers have effectively combined deep learning with link prediction in heterogeneous networks, leading to a series of research successes. A significant milestone was the introduction of meta-paths to capture the nuances of heterogeneous networks [38]. These meta-paths are effective at representing the semantic content and structural connections within the networks. Since their inception, meta-paths have become a leading approach in link prediction for heterogeneous networks. For example, using meta-paths, Shakibian and colleagues [39] proposed a similarity index for predicting links in heterogeneous networks by analyzing co-occurrence matrices. Additionally, Shakibian introduced an unsupervised learning algorithm based on a multi-core single-class SVM, leveraging meta-paths [40]. However, while these meta-path-based algorithms characterize network information, they do not fully utilize deep learning’s strengths in capturing network structure.

The goal is to combine the benefits of meta-paths with deep learning to create advanced algorithms for predicting links in heterogeneous networks. In 2017, Dong and team [41] presented the Metapath2vec framework, which integrates meta-paths with the word2vec technique. This framework sorts the types of neighboring nodes and maps similar types into the same potential space. To better capture the structural information in heterogeneous networks, researchers have started to include node weights and meta-path weights in the embedding process. In 2019, Phu Pham and others [42] proposed the W-MetaPath2Vec algorithm, building on Metapath2vec, assigning weights to nodes based on topic relevance and training the model with these weighted nodes. In 2020, Zhang and colleagues [43] developed a method that assigns weights to each meta-path and trains the embedding model following the Metapath2vec approach.

These achievements have significantly pushed forward the study of link prediction in heterogeneous networks. However, current research has not yet fully considered how different local structures in these networks contribute to embedding learning. Just as the graph attention network GAT [44] acknowledges varying contributions from local structures to node feature learning, incorporating this local structural information into embedding learning for heterogeneous networks is an issue that needs further exploration.

Following the analysis presented, this paper introduces an innovative method for link prediction in heterogeneous networks. Our method integrates network embedding with local structural insights, allowing us to allocate weights to various local structures within heterogeneous networks. This facilitates the creation of a new objective function for embedding learning and the development of a unique embedding learning algorithm for heterogeneous networks, namely the Structural Weighted Metapath2vec (SW-Metapath2vec) approach. The SW-Metapath2vec algorithm initiates with random walks to produce meta-path sequences. Subsequently, it applies a structure-weighted strategy for embedding learning in heterogeneous networks. We then utilize the resulting node embedding vectors to forecast links within these complex networks. To substantiate the efficacy of our proposed algorithm, we perform experiments on both real-world and synthetic datasets, contrasting our outcomes with those of benchmark algorithms.

In summary, the proposed SW-Metapath2vec algorithm represents a promising approach for link prediction in heterogeneous networks. By fusing network embedding with local structural information and designing a novel embedding learning algorithm, we provide a powerful method for analyzing and predicting links in complex networks.

Preliminaries

To better introduce the main content of this paper, this section provides an overview of the essential preparatory knowledge on heterogeneous networks, meta-paths, and the cosine similarity index.

Heterogeneous network

Definition 1 (Heterogeneous network). Consider a network G = (V, E, TV, TE), where V and E are the node and edge sets, and TV and TE are the collections of node types and edge types, respectively. If there exist maps π: V → TV and Ψ: E → TE such that every node v ∈ V and every link e ∈ E of G satisfy π(v) ∈ TV and Ψ(e) ∈ TE, and |TV| ≥ 2, then the network G is called a heterogeneous network.

Definition 2 (Network scheme). The network GT(TV, TE) is the network scheme of the heterogeneous network G(V, E, TV, TE) if it is the directed network induced by the node map π: V → TV and the link map Ψ: E → TE.

Meta-path

Definition 3 (Meta-path). A meta-path P is a trace defined on the network scheme GT(TV, TE), of the form P: T1 -(R1)-> T2 -(R2)-> ... -(Rl-1)-> Tl, where Ti ∈ TV (i = 1, ..., l) and Ri ∈ TE (i = 1, ..., l − 1).

Remark 1: The meta-path P fixes the composite relation R = R1 ∘ R2 ∘ ... ∘ Rl−1 between node type T1 and node type Tl, where “∘” is the link composition operator.

To illustrate the meta-path, we take a simple “Author-Paper-Venue” network as an example. Fig 1 shows some meta-paths in the network. We call the path p1: a1 -> p1 -> v1 -> p3 -> a3 a meta-path trace that follows the meta-path “APVPA”, where “A” represents “Author”, “P” represents “Paper”, and “V” represents “Venue”. The semantic relationship in this case is that both author a1 and author a3 have published papers in the journal v1, so the possibility of citation or cooperation between the two authors is greater. In parallel, the path p2: a2 -> p4 -> v2 is a meta-path trace following the meta-path “APV”. In particular, (1) there is often more than one meta-path trace following a given meta-path; for example, p3: a1 -> p1 -> v1 is also a meta-path trace following “APV”. (2) The length of the required meta-path trace can be specified for each application.

Fig 1. Meta-paths of the “Author-Paper-Venue” network.

https://doi.org/10.1371/journal.pone.0315507.g001

There have been numerous advancements in the generation of meta-path traces [45–48]. One common method is using a random walker to generate these traces. The following provides a brief introduction to the process of obtaining a meta-path trace using this approach.

When using a random walker to generate a meta-path trace, it is necessary to randomize the starting nodes of the random walker. Different meta-path traces can be obtained by adjusting hyper-parameters such as the walk length L and the number of node traversals N. Given G(V, E, TV, TE) and a meta-path P, a meta-path trace can be generated as follows.

Firstly, randomly set a start node of type T1 for the meta-path trace. Then, generate the node v^{i+1} of type T_{t+1} according to the transition probability of the random walker from step i to step i + 1 under the meta-path P:

p(v^{i+1} | v_t^i, P) = 1 / |N_{t+1}(v_t^i)| if (v^{i+1}, v_t^i) ∈ E and π(v^{i+1}) = T_{t+1}, and 0 otherwise. (1)

The remaining nodes are deduced by looping through this process until a meta-path trace of the specified length is obtained. In Eq (1), v_t^i is the type-t node at step i, and |N_{t+1}(v_t^i)| represents the number of type-(t + 1) neighbor nodes of the type-t node at step i.

Remark 2. Generally, using a symmetric meta-path is more conducive to capturing effective information, especially the similarity between nodes; an asymmetric meta-path has a very limited ability to obtain useful information. For example, the meta-path APVPA in Fig 1 can reflect the cooperation between two authors and the possibility of article citation, whereas the meta-path APV can only show that an author publishes an article in a journal.
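The walk procedure above can be sketched in Python as follows. This is a minimal illustration of the uniform transition rule of Eq (1); the adjacency layout, function name, and toy “APVPA” usage are assumptions for exposition, not the authors’ implementation.

```python
import random

def metapath_walk(neighbors, start, metapath, length):
    """Generate one meta-path trace with a random walker.

    neighbors[(t, t_next)][v] lists the type-t_next neighbors of node v;
    metapath is a node-type sequence such as ["A", "P", "V", "P", "A"].
    Each step picks uniformly among admissible neighbors, as in Eq (1).
    """
    trace = [start]
    step = 0
    n_trans = len(metapath) - 1          # transitions cycle for symmetric meta-paths
    while len(trace) < length:
        t = metapath[step % n_trans]
        t_next = metapath[step % n_trans + 1]
        candidates = neighbors.get((t, t_next), {}).get(trace[-1], [])
        if not candidates:               # dead end: no neighbor of the required type
            break
        trace.append(random.choice(candidates))
        step += 1
    return trace
```

On a toy “Author-Paper-Venue” graph, starting from an author node and requesting a length-5 trace yields a walk that alternates node types in the order prescribed by the meta-path.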

Materials and methods

Embedding learning with structure weights for general networks

The embedding learning model of complex networks was inspired by the word embedding methods in natural language processing (NLP) [49–51]. It learns an embedding vector for each node in the network. For a general complex network G = (V, E), the traditional embedding learning objective function is defined as

min_g ∑_{v∈V} ∑_{c∈N(v)} −log p(c|v), (2)

where the embedding function g maps each node v ∈ V into a d-dimensional vector space [0, 1]^d, with d ≪ |V|. N(v) is the neighbor set of the node v in the network G. The conditional probability p(c|v) represents the probability that the node c is a neighbor of node v; it also describes the local structure of the network. p(c|v) is defined as

p(c|v) = exp(X_c · X_v) / ∑_{u∈V} exp(X_u · X_v), (3)

where X_node = g(node), node = c, v, u, denotes the potential embedding of each node. It should be noted that Eq (2) only considers the likelihood of the local structure of a node; it carries no deeper structural information among the nodes c, v, and u.

Definition 4. The matrix X is called the embedding matrix, in which the row vectors are the embedding vectors of the nodes. For example, Xc, Xv, and Xu are the c-th, v-th, and u-th rows of the matrix X, respectively.

In this paper, we consider assigning different weights to various local structures in heterogeneous networks to measure their contributions to network embedding. As shown in Fig 2, when embedding node “1” and node “2”, the link between them is stronger in Fig 2(a) than in Fig 2(b) in terms of network semantics. This is because in Fig 2(a) there are links between the neighbor nodes of node “1” and node “2”. These links enhance the strength of the link between the two nodes and may indicate that node “1” and node “2” are more likely to belong to the same level in the network. In contrast, in Fig 2(b) there are no links between the neighbors of node “1” and node “2”, suggesting that the link between them may be serendipitous and that the likelihood of their belonging to the same level is relatively low. Based on this idea, we define the weight Γ(v, c) for a local structure in the network as (4), where N(v) and N(c) are the neighbor sets of the nodes v and c, respectively, and #(N(v), N(c)) represents the number of links between the nodes of N(v) and N(c). In particular, we regard a common neighbor m of nodes v and c as contributing one such link, such as the nodes “7” and “8” in Fig 2(a). Therefore, the weight Γ(1, 2) is larger in Fig 2(a) than in Fig 2(b).

Fig 2. The local structure of the link between node “1” and node “2”.

https://doi.org/10.1371/journal.pone.0315507.g002
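The counting term #(N(v), N(c)) that drives the weight of Eq (4) can be sketched as follows. The function name and adjacency-dict representation are assumptions for illustration; common neighbors are counted as single links, as described above.

```python
def cross_links(adj, v, c):
    """#(N(v), N(c)): number of links between the neighborhoods of v and c.

    A common neighbor m of v and c counts as one link (e.g. nodes "7" and "8"
    in Fig 2(a)); an actual edge joining two distinct neighbors counts once.
    `adj` maps each node to its set of neighbors.
    """
    nv = adj[v] - {c}
    nc = adj[c] - {v}
    count = len(nv & nc)                      # common neighbors
    seen = set()
    for a in nv:
        for b in nc:
            if a != b and b in adj[a]:
                seen.add(frozenset((a, b)))   # count each cross edge once
    return count + len(seen)
```

For a Fig 2(a)-like neighborhood with two common neighbors and one edge between the remaining neighbors, the count is 3; for a Fig 2(b)-like neighborhood with no cross links, it is 0.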

Combining Eq (2) with the weight in Eq (4), in this paper we define embedding learning with structure weights as

min_g ∑_{v∈V} ∑_{c∈N(v)} −Γ(v, c) log p(c|v). (5)

In the traditional embedding objective function (2), every local structure of the network contributes equally to the embedding. In objective (5), the weights of the different local structures can highlight their structural impact on link prediction, which means that the features obtained by the embedding function g are more suitable for link prediction.

Embedding learning with structure weight for heterogeneous networks

For heterogeneous networks, we likewise design a new embedding method, Eq (6) below. Given a heterogeneous network G(V, E, TV, TE) with |TV| ≥ 2, for any node v_t ∈ V, heterogeneous network embedding learning with structure weights is defined as

min_g ∑_{v_t∈V} ∑_{t′∈TV} ∑_{v_{t′}∈N_{t′}(v_t)} −Γ(v_t, v_{t′}) log p(v_{t′} | v_t), (6)

where N_{t′}(v_t) is the set of neighbors of node v_t whose node type is t′, and Γ(v_t, v_{t′}) is the weight between the nodes v_t and v_{t′} according to Eq (4).

When the scale of the network is extremely large, optimizing the objective function (6) directly is difficult. To overcome this obstacle, a negative sampling strategy and binary logistic regression are used to train the embedding model [52]. Negative sampling replaces the global parameter update, which effectively improves the optimization efficiency.

For descriptive convenience, we denote a positive sample by (v_t, v_0) and suppose there are K negative samples (v_t, v_1), (v_t, v_2), ⋯, (v_t, v_K). Under binary logistic regression, the positive and negative samples must satisfy

p(v_0 | v_t) = σ(X_{v_0} · X_{v_t}) (7)

and

p(v_k | v_t) = 1 − σ(X_{v_k} · X_{v_t}), k = 1, ⋯, K, (8)

where σ(x) = 1/(1 + e^{−x}) is the sigmoid function.

According to Eqs (7) and (8), the objective to be maximized is

L = log σ(X_{v_0} · X_{v_t}) + ∑_{k=1}^{K} E_{v_k ∼ P(u)} [log σ(−X_{v_k} · X_{v_t})], (9)

where P(u) ∝ f(u)^{3/4} is the probability distribution of the negative samples, and f(u) is the frequency of node u in the network.

For Eq (9), the gradient is

∂L/∂X_{v_k} = (I(v_t, v_k) − σ(X_{v_k} · X_{v_t})) X_{v_t}, (10)

where I(v_t, v_k) is the indicator function, which takes the value 1 when (v_t, v_k) is a positive sample and 0 when (v_t, v_k) is a negative sample.

The potential vectors can be updated by Eq (10) until the proper embedding vectors are obtained.
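One stochastic update implied by Eqs (7)–(10) can be sketched as below. This is a minimal sketch, assuming (as an illustration, not the authors’ exact code) that the structure weight Γ scales the per-pair gradient of Eq (6) and that embeddings are rows of a NumPy array.

```python
import numpy as np

def sgd_step(X, v_t, samples, gamma, lr=0.025):
    """One negative-sampling update for center node v_t.

    X       : embedding matrix (rows = node vectors), updated in place
    samples : [(node, label)] with label 1 for the positive pair, 0 for the K negatives
    gamma   : structure weight Gamma(v_t, .) scaling the contribution (an assumption
              about how Eq (6) enters the per-pair update)
    """
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    grad_vt = np.zeros_like(X[v_t])
    for v_k, label in samples:
        g = gamma * (label - sigmoid(X[v_k] @ X[v_t]))   # (I - sigma) factor of Eq (10)
        grad_vt += g * X[v_k]
        X[v_k] += lr * g * X[v_t]                        # ascend the log-likelihood
    X[v_t] += lr * grad_vt
```

After one step, the inner product of a positive pair increases while negative pairs are pushed apart, which is the behavior the gradient of Eq (10) prescribes.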

Based on the above analysis, Fig 3 describes the framework of the SW-Metapath2vec algorithm proposed in this paper. As shown in Fig 3, the first block, “Input”, is the data preparation stage, in which the appropriate meta-path P is designed according to the link prediction objective. The second block is the main “SW-Metapath2vec” algorithm: Eq (4) assigns weights to the different local structures, the meta-path traces are then generated according to P, and finally the embedding vector of each node is obtained through the objective function (6). The final block, “Output”, covers link prediction with the embedding vectors and further analyses of SW-Metapath2vec, such as hyper-parameter sensitivity and robustness. In addition, we provide pseudo-code for the SW-Metapath2vec algorithm (Algorithm 1). The code is available at https://github.com/pinglanchu/SW-Metapath2vec.

Fig 3. Framework of SW-Metapath2vec algorithm.

https://doi.org/10.1371/journal.pone.0315507.g003

Algorithm 1: SW-Metapath2vec link prediction algorithm

input : heterogeneous graph G, link type etype, testing ratio r

output: node embedding X, AUC, Precision

 // Division of training and testing sets

1 testing links ET: randomly select links of the specified link type etype with ratio r from G;

2 training network Gtrain = G.remove_links(ET): remove the selected testing links ET from G;

3 positive testing network G+test: copy the nodes of G into an empty network GNull, then add ET to obtain the positive test network;

4 negative testing network G−test: randomly select links of the specified link type with ratio r from G′, the complement network of G;

 // Training SW-Metapath2vec

5 for node v in Gtrain(V) do

6  Calculate the local structure weight Γ(v, c) by Eq (4);

7  Train the node embedding X by Eq (10);

8 end

 // Testing SW-Metapath2vec

9 for link = (node_src, node_dst) in G+test do

10  sim = cosine(X[node_src], X[node_dst]);

11  real_label = 1;

12 end

13 for link = (node_src, node_dst) in G−test do

14  sim = cosine(X[node_src], X[node_dst]);

15  real_label = 0;

16 end

17 AUC = auc(real_label, sim), Precision = precision(real_label, sim)
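The testing phase of Algorithm 1 (steps 9–17) can be sketched in Python as below: each test link is scored by the cosine similarity of its endpoint embeddings, then AUC and Precision are computed with L equal to the number of positive test links. The function names and the dict-of-vectors layout are illustrative assumptions.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def evaluate(X, pos_links, neg_links):
    """Score positive/negative test links and compute AUC and Precision."""
    scores, labels = [], []
    for src, dst in pos_links:
        scores.append(cosine(X[src], X[dst]))
        labels.append(1)
    for src, dst in neg_links:
        scores.append(cosine(X[src], X[dst]))
        labels.append(0)
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # AUC via pairwise comparisons; ties count half
    n = len(pos) * len(neg)
    auc = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg) / n
    # Precision over the top-L ranked links, L = number of positive test links
    L = len(pos)
    top = sorted(zip(scores, labels), key=lambda t: -t[0])[:L]
    precision = sum(y for _, y in top) / L
    return auc, precision
```

With well-separated embeddings, a true link scores highest and both metrics reach 1.0.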

Results

Datasets

To verify the effectiveness and feasibility of the SW-Metapath2vec algorithm proposed in Section: Materials and methods, in this section we conduct experiments on real and synthetic heterogeneous networks.

  1. (1) ACM [53] contains various types of nodes, such as “papers”, “authors”, and “fields”, and multiple types of edges between them, all drawn from the ACM digital library. The ACM dataset we used contains 4025 papers (P), 7351 authors (A), and 72 fields (F). It also includes two types of edges, “paper-author” and “paper-field”.
  2. (2) DBLP [54] is a scientific literature dataset. We extract a subset of DBLP with 4057 authors (A), 6385 papers (P), 4108 terms (T), and 4128 venues (V). The DBLP heterogeneous network contains three types of edges: “paper-author”, “paper-term”, and “paper-venue”, constructed from the basic bibliographic information in the literature. The specific information is shown in Table 1.
  3. (3) Last.fm [55] is a music recommendation dataset. The Last.fm heterogeneous network contains 1892 users (U), 9524 artists (A), and 5612 tags (T), and includes three types of edges: “user-user”, “user-artist”, and “artist-tag”.
Table 1. General information for the three real heterogeneous networks.

https://doi.org/10.1371/journal.pone.0315507.t001

To ensure the fairness and effectiveness of the numerical experiments, each heterogeneous complex network is divided into training and testing sets based on the predicted edge type “etype”; the nodes of the training and testing sets are identical. Given the proportion r of the testing set, a proportion r of the “etype” links is removed from the original network, and the remaining heterogeneous network is the positive training network. In addition, the network composed of the non-links of the original network is taken as the negative network, and a proportion r of “etype” edges is removed from it; the remaining network is the negative training network. The removed “etype” links then constitute the positive and negative test networks, respectively.

Fig 4 displays the “paper-author-venue” network in the ACM dataset, the “author-paper-field” network in the DBLP dataset, and the “user-artist” network in the Last.fm dataset. Table 1 shows the basic information about the three datasets, where the meta-paths are set manually in this paper.

Fig 4. Three real heterogeneous complex networks.

(a) ACM dataset network; (b) DBLP dataset network; (c) Last.fm dataset network.

https://doi.org/10.1371/journal.pone.0315507.g004

In addition, we have generated two synthetic networks for the experiments, as seen in Table 2. |V| and |E| are the numbers of nodes and links, respectively; < k > and < d > are the average degree and the average shortest path length; ρ denotes the density of the network; and #cs is the number of connected components.

Table 2. Topological properties of the two synthetic networks.

https://doi.org/10.1371/journal.pone.0315507.t002
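For reference, a plain-Python sketch of how an ER synthetic network and the Table 2 statistics < k > and ρ might be produced; the WS model, the exact generator parameters, and the typing of the actual experimental networks are omitted, and all names here are assumptions.

```python
import random

def erdos_renyi(n, p, seed=42):
    """Generate an ER random graph as an adjacency dict: each pair of nodes
    is linked independently with probability p."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj

def avg_degree(adj):
    """< k >: mean number of neighbors per node."""
    return sum(len(nb) for nb in adj.values()) / len(adj)

def density(adj):
    """rho = 2|E| / (|V| (|V| - 1))."""
    n = len(adj)
    m = sum(len(nb) for nb in adj.values()) / 2
    return 2 * m / (n * (n - 1))
```

At p = 1 the graph is complete, so < k > = n − 1 and ρ = 1, which is a convenient sanity check on the two statistics.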

Baseline algorithms

To better compare the performance of the SW-Metapath2vec algorithm proposed in this paper with other classic link prediction algorithms for heterogeneous complex networks, we selected several representative link prediction algorithms as benchmark algorithms. The following is a brief introduction to these algorithms for comparison.

(1) CN: [56] The more common neighbors two nodes have, the higher the similarity between them:

s_ij = |N(v_i) ∩ N(v_j)|, (11)

where N(v_i) and N(v_j) represent the neighbor sets of nodes v_i and v_j, respectively.

(2) RA: [57] The basic idea of RA is that each node evenly distributes one unit of resource to its neighbor nodes:

s_ij = ∑_{z ∈ N(v_i) ∩ N(v_j)} 1/k_z, (12)

where k_z represents the degree of node z.

(3) Jaccard: [58] Building upon the CN index, the degrees of the nodes at either end of the link are corrected for:

s_ij = |N(v_i) ∩ N(v_j)| / |N(v_i) ∪ N(v_j)|. (13)

(4) Node2vec: [59] The optimization goal of Node2vec is to maximize the probability of observing the neighboring nodes given a particular node:

max_f ∑_{u∈V} log Pr(N_I(u) | f(u)), (14)

where f(u) is a learnable function that maps the node u to an embedding vector, and N_I(u) is the set of neighboring nodes of node u obtained through the sampling strategy I.

(5) DeepWalk: [60] learns the latent representations of nodes from node sequences, i.e.,

min_Φ −log Pr({v_{i−w}, …, v_{i+w}} \ v_i | Φ(v_i)), (15)

where Φ maps each node to its latent representation and w is the window size.

(6) Metapath2vec: [41] maximizes the likelihood of the local structures of different node types,

arg max_θ ∑_{v∈V} ∑_{t∈TV} ∑_{c_t∈N_t(v)} log p(c_t | v; θ), (16)

where θ is the set of parameters to be learned.

(7) HAN: [53] takes into account that different meta-paths have different weights in embedding learning:

Z = ∑_{i=1}^{P} β_{Φ_i} Z_{Φ_i}, (17)

where Z represents the embedding matrix, P is the number of meta-paths, β_{Φ_i} is the weight of the meta-path Φ_i, and Z_{Φ_i} is the node embedding in terms of meta-path Φ_i.

(8) GAT: [61] automatically learns the contribution levels of neighboring nodes when fusing features:

h′_i = σ(∑_{j∈N_i} α_{ij} W h_j), (18)

where α_{ij} is the learned attention coefficient between nodes v_i and v_j, W is a shared linear transformation, and σ is a nonlinear activation function.

(9) RGCN: [62] performs embedding learning separately for different types of links and then aggregates information from neighboring nodes:

h_i^{(l+1)} = σ(∑_{r∈R} ∑_{j∈N_i^r} (1/c_{i,r}) W_r^{(l)} h_j^{(l)} + W_0^{(l)} h_i^{(l)}), (19)

where N_i^r represents the neighbors of v_i connected by links of type r, R is the set of link types in the heterogeneous network, c_{i,r} is a regularization constant whose usual value is |N_i^r|, and W_r^{(l)} is a linear transformation.

(10) GraphSage: [63] aggregates feature information from both the neighboring nodes and the node itself:

h_v^k = σ(W^k · CONCAT(h_v^{k−1}, AGG({h_u^{k−1} : u ∈ N(v)}))), (20)

where AGG is an aggregation function such as the mean and CONCAT denotes vector concatenation.
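The three classical similarity indices above (Eqs (11)–(13)) admit compact set-based sketches; the adjacency-dict representation and function names are assumptions for illustration.

```python
def cn(adj, i, j):
    """Common Neighbors index, Eq (11)."""
    return len(adj[i] & adj[j])

def ra(adj, i, j):
    """Resource Allocation index, Eq (12): each common neighbor z contributes 1/k_z."""
    return sum(1.0 / len(adj[z]) for z in adj[i] & adj[j])

def jaccard(adj, i, j):
    """Jaccard index, Eq (13): overlap normalized by the neighborhood union."""
    union = adj[i] | adj[j]
    return len(adj[i] & adj[j]) / len(union) if union else 0.0
```

On a toy graph where nodes 1 and 2 share the two degree-2 neighbors 3 and 4, CN gives 2, RA gives 1/2 + 1/2 = 1, and Jaccard gives 2/4 = 0.5.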

Evaluation metrics

Recent studies have shown that traditional evaluation metrics for link prediction algorithms can sometimes misrepresent their true predictive performance [64, 65]. Inspired by the literature [66, 67], this paper selects the threshold-free evaluation metrics AUC and Precision, where L is set equal to the number of edges in the test set when calculating Precision. Among the evaluation indicators, AUC has the strongest discriminative ability. With this choice of L, Precision and Recall are equal, which makes the Precision calculation simpler and faster while providing an additional evaluation perspective. Moreover, Precision has been widely used in link prediction research and can still serve as a reference standard when comparing new and old algorithms or evaluation methods. It is therefore reasonable to choose AUC and Precision as the link prediction evaluation indicators in this paper.

In detail, the AUC is defined as

AUC = (n′ + 0.5 n″) / n, (21)

where n′ is the number of comparisons in which the score of a test-set edge is greater than the score of a non-existing edge, n″ is the number of comparisons in which the two scores are equal, and n is the total number of independent comparisons.

Precision only cares about whether some important links are accurately predicted, especially when the prediction can guide specific scientific experiments. Precision is defined as the proportion of the top-L predicted links that are accurately predicted. Specifically, if m out of the top-L predicted links are accurately predicted, the precision is given by

Precision = m / L. (22)

In this paper, we set L to the number of test links. In other words, Precision measures the performance of the link prediction algorithm over all missing links.

Results for SW-Metapath2vec and benchmark algorithms

The experiments of the SW-Metapath2vec algorithm consist of four main steps. Firstly, the meta-path traces are generated. Secondly, each trace is used as input for weighted embedding learning. Thirdly, link prediction is conducted using the embedding vectors. Lastly, the sensitivity analysis of the hyper-parameters is explored.

Step 1. Generate meta-path traces.

For each heterogeneous network, the meta-path traces are generated according to the meta-paths shown in Table 1. For example, when studying the possible links in the DBLP network, we use the meta-path “APTPVPA” (cf. the meta-paths of Fig 1). Taking the initial node “Travis A. Bennett” as an example, its meta-path trace is shown in Fig 5. To highlight the semantic relationship of the meta-path, Fig 5 omits the “paper” (P) and “term” (T) nodes, but the trace still follows “APTPVPA”. In Fig 5, the blue nodes represent authors, and the red nodes represent journals or conferences. The first node is the author “Travis” (A), the second node is the journal “Assen” (V), and the third node is the author “Natalia” (A). Thus, an “APTPVPA” meta-path trace is generated, and the rest may be deduced by analogy.

Fig 5. A meta-path trace for “APTPVPA” with author Travis as the initial node.

https://doi.org/10.1371/journal.pone.0315507.g005

Step 2. SW-Metapath2vec embedding learning.

After all the meta-path traces are generated, they are used as input for the SW-Metapath2vec embedding algorithm, which learns the potential representation of network nodes. As shown in the flowchart in Fig 3, the SW-Metapath2vec embedding algorithm is primarily implemented using the deep learning framework PyTorch 1.4.0.

Step 3. Link prediction.

In this experiment, we compare the SW-Metapath2vec algorithm with the benchmark algorithms introduced in Section: Baseline algorithms. Note that when we conduct experiments on the ER and WS networks, we only generate random walk sequences of the given length.

First, we conducted comparative link prediction experiments on the two synthetic networks. The AUC values of each algorithm are shown in Table 3. SW-Metapath2vec exhibits outstanding performance on the synthetic networks; meanwhile, the embedding algorithms outperform the traditional similarity indices.

Table 3. The AUC of the SW-Metapath2vec and the networks-based and embedding baselines (The test ratio is 0.3).

https://doi.org/10.1371/journal.pone.0315507.t003

Graph neural networks have been proven to have strong capabilities in the field of link prediction. To compare the link prediction performance of SW-Metapath2vec with that of graph neural networks, we conducted detailed link prediction experiments on the different link types of the heterogeneous complex networks, as shown in Tables 4–6, which report the Precision of the SW-Metapath2vec algorithm and four benchmark graph neural networks on the ACM, DBLP, and Last.fm datasets, respectively. As seen in Table 4, on the ACM heterogeneous complex network, the SW-Metapath2vec algorithm demonstrates satisfactory predictive performance, particularly for the “paper-author” link type, consistently outperforming the baseline methods.

Table 4. The precision of the SW-Metapath2vec and the benchmark algorithms on the ACM network.

https://doi.org/10.1371/journal.pone.0315507.t004

Table 5 presents the precision of various link prediction algorithms applied to the DBLP heterogeneous complex network. Overall, the SW-Metapath2vec algorithm demonstrates commendable predictive performance, particularly in predicting “author-paper” links, where it achieves the highest precision. However, for “paper-venue” and “paper-term” link predictions, its precision falls short compared to the HAN and RGCN graph neural network algorithms. This difference can be attributed to the relative sparsity of “paper-venue” and “paper-term” links within the DBLP network, which leaves many nodes disconnected through these link types. As a result, the SW-Metapath2vec algorithm did not effectively learn from and train the nodes associated with “paper-venue” and “paper-term” links.

Table 5. The precision of the SW-Metapath2vec and the benchmark algorithms on the DBLP network.

https://doi.org/10.1371/journal.pone.0315507.t005

Table 6 presents the link prediction precision of the proposed SW-Metapath2vec algorithm compared to benchmark algorithms on the Last.fm dataset. As shown, SW-Metapath2vec outperforms the benchmark graph neural network algorithms in predicting “user-artist” and “user-user” links. However, for “artist-tag” link predictions, SW-Metapath2vec performs worse than the HAN, GAT, and RGCN algorithms. This discrepancy is primarily due to the relative sparsity of “artist-tag” links, which hinders the SW-Metapath2vec algorithm’s ability to effectively train and learn from the nodes associated with these links.

Table 6. The precision of the SW-Metapath2vec and the benchmark algorithms on the Last.fm network.

https://doi.org/10.1371/journal.pone.0315507.t006

Overall, the proposed SW-Metapath2vec link prediction algorithm demonstrates notable advantages, especially for networks with denser links. However, when a particular type of link is relatively sparse in the network, the imbalance in node sampling can prevent some nodes from being fully trained and learned. This limitation leads to a decrease in link prediction performance for sparse link types.
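The under-sampling effect described above can be made concrete by counting how often each node appears in meta-path-constrained random walks: nodes reachable only through sparse link types end a walk early and accumulate few visits, hence few embedding-training updates. A rough sketch, where the adjacency encoding and function names are illustrative assumptions:

```python
import random
from collections import Counter

def metapath_walk(adj, start, type_seq, rng):
    """One walk that follows the node-type sequence of a meta-path.
    `adj` maps (node, target_type) -> list of neighbours of that type."""
    walk = [start]
    for target_type in type_seq[1:]:
        neighbours = adj.get((walk[-1], target_type), [])
        if not neighbours:
            break  # dead end: sparse link types often stop walks early
        walk.append(rng.choice(neighbours))
    return walk

def visit_counts(adj, starts, type_seq, n_walks, seed=0):
    """Count how often each node is sampled across all walks; rarely
    visited nodes receive few embedding-training updates."""
    rng = random.Random(seed)
    counts = Counter()
    for s in starts:
        for _ in range(n_walks):
            counts.update(metapath_walk(adj, s, type_seq, rng))
    return counts
```

On a toy “paper-venue” graph where one paper has no venue link, that paper's walks terminate immediately and its visit count stays at the number of walk starts, illustrating the imbalance discussed above.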

Step 4. Hyper-parameter sensitivity analysis.

To investigate the sensitivity of the hyper-parameters of the proposed SW-Metapath2vec link prediction algorithm, a sensitivity analysis is conducted on the DBLP and Last.fm datasets for the four main hyper-parameters used in the numerical experiments: the meta-path trace (path) length L, the number of node traversals N, the node frequency threshold F, and the sliding window size S. The window size S controls the scale of the “context” used in each embedding learning step; as S grows, the “context” information becomes richer. The benchmark hyper-parameter quadruple (L, N, F, S) is set to (11, 5, 3, 10), and in each experiment one parameter is varied while the others are held fixed. Figs 6 and 7 show the AUC of SW-Metapath2vec and the baseline algorithms as L, N, F, and S vary, respectively.
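This one-at-a-time protocol can be sketched as follows. Only the benchmark quadruple (11, 5, 3, 10) comes from the text; the search grids and function names are hypothetical:

```python
# Benchmark quadruple (L, N, F, S) from the experiment setup.
baseline = {"L": 11, "N": 5, "F": 3, "S": 10}

# Hypothetical grids; each experiment varies exactly one parameter.
grids = {
    "L": [3, 5, 7, 9, 11, 13],
    "N": [1, 3, 5, 7, 9],
    "F": [0, 3, 10, 30],
    "S": [2, 5, 10, 15, 20],
}

def sensitivity_sweep(evaluate, baseline, grids):
    """Vary one hyper-parameter at a time, holding the others at their
    benchmark values; `evaluate` maps a config dict to an AUC score."""
    results = {}
    for name, values in grids.items():
        for v in values:
            config = dict(baseline)
            config[name] = v
            results[(name, v)] = evaluate(config)
    return results
```

Here `evaluate` would retrain SW-Metapath2vec with the given configuration and return its test-set AUC.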

Fig 6. The AUC of SW-Metapath2vec and benchmark algorithms on DBLP with various hyper-parameters.

https://doi.org/10.1371/journal.pone.0315507.g006

Fig 7. The AUC of SW-Metapath2vec and benchmark algorithms on Last.fm with various hyper-parameters.

https://doi.org/10.1371/journal.pone.0315507.g007

As shown in Figs 6 and 7, the SW-Metapath2vec algorithm proposed in this paper outperforms the benchmark algorithms. As the meta-path length L increases, the AUC of SW-Metapath2vec decreases slowly but remains superior to that of the benchmarks; a meta-path length of L = 5 is identified as a relatively optimal value. Increasing the number of node traversals makes SW-Metapath2vec more stable than the other benchmark algorithms. In Fig 6(c), varying the sliding window size does not significantly affect the AUC of any algorithm; however, the AUC of SW-Metapath2vec declines when F = 30. Figs 6(c) and 7(c) also show that as the sliding window size increases, the AUC of SW-Metapath2vec remains more stable and higher than that of the benchmark algorithms. From Figs 6(d) and 7(d), it is evident that appropriately increasing the node frequency threshold helps improve the link prediction AUC of each algorithm by eliminating noise nodes, particularly on the large DBLP dataset.
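The AUC reported in Figs 6 and 7 is the usual threshold-free metric: the probability that a randomly chosen existing link is scored higher than a randomly chosen non-existent one, with ties counted as half. A minimal sketch:

```python
def auc_score(pos_scores, neg_scores):
    """Threshold-free AUC: probability that a random positive pair
    outscores a random negative pair, counting ties as 0.5."""
    wins = ties = 0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1
            elif p == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))
```

This exhaustive pairwise form is quadratic; on large test sets the same quantity is typically computed from score ranks or by sampling comparisons.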

Robustness of the SW-Metapath2vec algorithm

In this section, to illustrate the robustness of the SW-Metapath2vec algorithm, we delete a certain proportion of nodes from the DBLP and Last.fm datasets. Fig 8 displays the AUC of all algorithms versus the fraction of deleted nodes.

Fig 8. The AUC of SW-Metapath2vec and benchmark algorithms with a different fraction of nodes deleted.

https://doi.org/10.1371/journal.pone.0315507.g008

From Fig 8, it is evident that the SW-Metapath2vec algorithm is more robust than the baseline algorithms. In the “paper-author” network, SW-Metapath2vec demonstrates the best robustness: as the fraction of deleted nodes increases from 0.4 to 0.9, its AUC remains around 0.8, outperforming the other algorithms. In the “user-artist” network, SW-Metapath2vec experiences slight fluctuations but still maintains greater robustness than the benchmark algorithms.
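The deletion protocol behind Fig 8 can be sketched as follows: a fixed fraction of nodes is removed together with every incident edge, after which the model is retrained and evaluated on the reduced network. The function and variable names are illustrative, not the paper's code:

```python
import random

def delete_nodes(nodes, edges, fraction, seed=0):
    """Remove `fraction` of the nodes and all edges touching them,
    mimicking the node-deletion robustness experiment."""
    rng = random.Random(seed)
    nodes = sorted(nodes)  # deterministic sampling order
    removed = set(rng.sample(nodes, int(fraction * len(nodes))))
    kept_nodes = [n for n in nodes if n not in removed]
    kept_edges = [(u, v) for (u, v) in edges
                  if u not in removed and v not in removed]
    return kept_nodes, kept_edges
```

Sweeping `fraction` from 0 to 0.9 and recording the AUC at each level reproduces the shape of the curves in Fig 8.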

Discussion

In the 21st century, the natural environment, artificial environment, and human society can be described as large and diverse complex networks, specifically heterogeneous complex networks. Heterogeneous networks differ significantly from homogeneous networks due to their vast and diverse nodes and the complex links among them. As a result, many universal concepts applicable to homogeneous complex networks are no longer valid for heterogeneous networks. Traditional embedding learning methods have been crucial for link prediction in homogeneous complex networks, focusing on maximizing the likelihood of the local structure.

In heterogeneous networks, not all local links carry the same importance. To address this issue, this paper introduces a novel embedding learning objective function that assigns weight values to different local structures within heterogeneous networks and proposes a heterogeneous network link prediction algorithm called SW-Metapath2vec. The SW-Metapath2vec algorithm generates an embedding vector for each node in the heterogeneous network. It then calculates the similarity between any two nodes using the cosine similarity index, which indicates the potential for links between nodes. Experimental results on both real and synthetic datasets demonstrate that the proposed SW-Metapath2vec algorithm outperforms other benchmark algorithms in terms of performance and robustness. Specifically, the SW-Metapath2vec algorithm remains highly effective even when a large proportion of nodes are removed from the network.

Acknowledgments

The authors thank the anonymous referees for their insightful comments.
