
BERT4FCA: A method for bipartite link prediction using formal concept analysis and BERT

  • Siqi Peng ,

    Contributed equally to this work with: Siqi Peng, Hongyuan Yang

    Roles Conceptualization, Funding acquisition, Methodology, Validation, Writing – original draft

    Affiliation Graduate School of Informatics, Kyoto University, Sakyo Ward, Kyoto City, Kyoto, Japan

  • Hongyuan Yang ,

    Contributed equally to this work with: Siqi Peng, Hongyuan Yang

    Roles Conceptualization, Methodology, Validation, Writing – original draft

    yang.hongyuan.67t@st.kyoto-u.ac.jp

    Affiliation Graduate School of Informatics, Kyoto University, Sakyo Ward, Kyoto City, Kyoto, Japan

  • Akihiro Yamamoto

    Roles Funding acquisition, Supervision, Writing – review & editing

    Affiliation Graduate School of Informatics, Kyoto University, Sakyo Ward, Kyoto City, Kyoto, Japan

Abstract

Link prediction in bipartite networks finds practical applications in various domains, including friend recommendation in social networks and chemical reaction prediction in metabolic networks. Recent studies have highlighted the potential of link prediction based on maximal bi-cliques, structural features of bipartite networks that can be extracted using formal concept analysis (FCA). Although previous FCA-based methods for bipartite link prediction have achieved good performance, they still cannot fully capture the information of maximal bi-cliques. To solve this problem, we propose a novel method for link prediction in bipartite networks, utilizing a BERT-like transformer encoder network to enhance the contribution of FCA to link prediction. Our method facilitates bipartite link prediction by learning more information from the maximal bi-cliques and their order relations extracted by FCA. Experimental results on five real-world bipartite networks demonstrate that our method outperforms previous FCA-based methods, a state-of-the-art Graph Neural Network (GNN)-based method, and classic methods such as matrix factorization and node2vec.

Introduction

Bipartite link prediction is the task of predicting the absence or presence of unobserved links in a bipartite network [1–3]. A bipartite network is a structure consisting of two disjoint sets of nodes and a set of edges where every edge connects two nodes from different sets. Many kinds of real-world relational data can be naturally modeled as bipartite networks where the two sets of nodes represent two groups of entities and the edges represent their links or relations [4, 5]. In real-world bipartite networks, some links may be missing or have not yet been observed [1, 3]. This leads to the need to predict whether an unobserved link should be a potential new one and gives rise to the task of bipartite link prediction.

Currently, there are two types of bipartite link prediction. The first type focuses on predicting the relations between nodes from different node sets. For each pair of nodes from different node sets, it predicts whether there should be a link between them if they are not linked by an edge in the original network. For example, in a chemical-disease interaction database, for a chemical and a disease whose interactions are unobserved, it predicts whether they should interact with each other. The second type focuses on predicting the relations between nodes from the same node set. For two nodes from the same set that are not connected to the same node in the other set, it predicts whether there should be an unobserved node in the other set that is connected to both nodes. For example, in an author-paper network, for two authors who do not have a co-authorship, it predicts whether they will have a new co-authorship in the future [6, 7]. Both types of bipartite link prediction have attracted increasing attention due to their high practical value [3, 8, 9]. Various methods have been proposed for both types of bipartite link prediction [2, 3, 10]. While some bipartite link prediction methods utilize external information, such as the properties of nodes [11], most methods make predictions based only on the structural features of the network. Notable methods that use only structural features include pure rule-based methods like Common Neighbors [12] and node embedding methods such as Node2Vec [13] and autoencoder-based embedding methods [14, 15].

Recently, research has found that the bi-cliques of a bipartite network represent important structural features and can be useful for bipartite link prediction [16–19]. A bi-clique is a complete sub-network of a bipartite network in which every node in the first set is linked to every node in the second set. Bi-cliques represent clusters of strongly related entities. For example, in an author-paper network, a bi-clique represents a group of co-researchers and their publications. To capture the information of bi-cliques, the research in [7] proposed object2vec and attribute2vec, which aim to embed the nodes of a bipartite network into a vector space based on their co-occurrence relationship in the maximal bi-cliques of the network. To achieve this, they used the method of formal concept analysis (FCA) [7, 20, 21]. FCA is a method for learning rules from a binary relational knowledge base, which is strongly connected with bipartite networks and bi-cliques. It aims at extracting and organizing formal concepts from a formal context. A formal context represents a collection of binary data of two disjoint sets of entities, called objects and attributes, and their binary relation. A formal concept represents a maximal group of objects that share the same set of attributes. If we regard the sets of objects and attributes in a formal context as the two node sets in a bipartite network, the binary relation in the formal context is equivalent to the set of links in the bipartite network, and the formal concepts extracted from the formal context are equivalent to the maximal bi-cliques extracted from the bipartite network. With such an equivalence, the co-occurrences of nodes in a maximal bi-clique can be converted to the co-occurrences of objects in a formal concept, which resembles the co-occurrences of words in a sentence. Thus, one can use embedding models similar to the well-known word embedding model Word2Vec [22, 23] to embed the nodes into vectors [7]. Besides such a Word2Vec-like embedding method, the authors of [20] proposed another method for embedding the nodes based on bi-clique information with the help of Bidirectional Long Short-Term Memory (Bi-LSTM) [24]. Both methods have shown good performance [7, 20]. These two methods open up a novel strategy for bipartite link prediction, namely, to convert bipartite networks into formal contexts and use FCA to extract information on bi-cliques so that we can further process it with other methods. However, their methods have two limitations. First, they did not fully utilize all the information that FCA can extract. Second, their methods can only conduct one of the two types of link prediction, namely predicting the relations between nodes from the same node set. We consider it better to design a general method that can conduct both types of bipartite link prediction.

To address these limitations, we propose a novel method called BERT4FCA, which is designed to capture more of the information provided by FCA and use it to make both types of bipartite link prediction, with the help of a popular natural language processing method called BERT, or Bidirectional Encoder Representations from Transformers [30]. BERT is a method for training language models utilizing the Transformer architecture [25]. It first pre-trains a large model on unlabeled free text to learn the co-relations between words and sentences; then, it fine-tunes the model on a small labeled dataset to fit a target downstream task. We chose BERT because we found that the information provided by FCA, such as the formal concepts and their hierarchical relations, shares some similarities with the input data that BERT takes. Hence, after processing the information provided by FCA and modifying the basic network structure of BERT, we are able to capture all such information during pre-training. Then, by fine-tuning the pre-trained model to fit our target bipartite link prediction tasks, we expect to get better results than those embedding methods thanks to the extra information we have captured. Additionally, in the field of natural language processing, while Word2Vec requires task-specific networks for downstream tasks, BERT can be fine-tuned for multiple tasks using the same pre-trained model, which is more convenient. This convenience is preserved by our method since we do not need to design two different networks for the two link prediction tasks.

Our main contributions are summarized as follows:

  • We propose a novel FCA-based method called BERT4FCA for bipartite link prediction, which can capture and learn more information extracted by FCA than previous FCA-based methods. We conduct ablation experiments to show that learning more information given by FCA contributes to higher link prediction performance.
  • Experimental results show that our method outperforms all previous FCA-based methods as well as classic non-FCA-based methods on both types of bipartite link prediction.
  • We demonstrate that the information on the order relation between maximal bi-cliques is beneficial for bipartite link prediction. To the best of our knowledge, no previous research has reported this.
  • To the best of our knowledge, we are the first to provide a general method for using a BERT-like transformer architecture to learn information from the concept lattices provided by FCA. Although in this paper, we only discuss its application in bipartite link prediction, we plan to conduct further research on applying it to other FCA-related tasks.

The rest of this paper is organized as follows. In the “Preliminaries” section, we start with some preliminaries, including bipartite networks, bi-cliques, FCA, and BERT. In the “Problem Formulation and Related Work” section, we present the problem formulation of bipartite link prediction and provide an overview of related work. In “The Proposed Method” section, we introduce and analyze our method, BERT4FCA. In the “Experiments” section, we describe our experiments on five real-world datasets and discuss the results. Finally, in the “Conclusion and Future Work” section, we draw conclusions and discuss our plans for future work.

Preliminaries

Bipartite networks and bi-cliques

Bipartite networks.

A bipartite network C is a triple (U, V, E) where U, V are two disjoint sets of nodes and E is a set of edges. Each edge connects a node u ∈ U to a node v ∈ V and is denoted as (u, v). That is, we have U ∩ V = ∅ and E ⊆ U × V. Fig 1 gives an example of a bipartite network, where all nodes u in U are colored blue, and all nodes v in V are colored red. We can observe that edges only exist between pairs of nodes from different node sets.

Fig 1. An example of a bipartite network (U, V, E) and two of its bi-cliques.

The nodes in blue form the node set U, and the nodes in red form the node set V. The two sub-networks framed in green and purple are two maximal bi-cliques of the network.

https://doi.org/10.1371/journal.pone.0304858.g001

Sub-networks.

A bipartite network C1 = (U1, V1, E1) is a sub-network of C = (U, V, E) if both sets of nodes and the set of edges of C1 are subsets of those of C, i.e., U1 ⊆ U, V1 ⊆ V, and E1 ⊆ E.

Bi-cliques.

A bipartite network C1 = (U1, V1, E1) is a bi-clique of C = (U, V, E) if C1 is a sub-network of C and there is an edge between every node pair from different subsets of nodes in C1, i.e., U1 ⊆ U, V1 ⊆ V, E1 ⊆ E, and U1 × V1 = E1.

Maximal bi-cliques.

A bi-clique C1 = (U1, V1, E1) is a maximal bi-clique of a bipartite network C = (U, V, E) if it is not a sub-network of any other bi-clique of the network, i.e., ∀C2 = (U2, V2, E2) such that U1 ⊆ U2 ⊆ U and V1 ⊆ V2 ⊆ V, U2 × V2 = E2 is satisfied if and only if U1 = U2, V1 = V2, and E1 = E2. Fig 1 gives some examples of maximal bi-cliques.
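To make the definitions above concrete, the following minimal Python sketch (ours, not from the paper; the network is assumed to be given as node sets and an edge set of pairs) checks whether a candidate pair of node subsets forms a bi-clique and whether that bi-clique is maximal.

from itertools import product

def is_biclique(U1, V1, E):
    # (U1, V1) induces a bi-clique iff every cross pair is an edge
    return all((u, v) in E for u, v in product(U1, V1))

def is_maximal_biclique(U1, V1, U, V, E):
    # maximal iff no further node can be added without breaking completeness
    if not is_biclique(U1, V1, E):
        return False
    extend_u = any(is_biclique(U1 | {u}, V1, E) for u in U - U1)
    extend_v = any(is_biclique(U1, V1 | {v}, E) for v in V - V1)
    return not (extend_u or extend_v)

U, V = {"u1", "u2"}, {"v1", "v2", "v3"}
E = {("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"), ("u2", "v3")}
print(is_maximal_biclique({"u1", "u2"}, {"v1", "v2"}, U, V, E))  # True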

Formal concept analysis (FCA)

Formal concept analysis (FCA) is a method for learning rules from a binary relational knowledge base called a formal context. Given a formal context, it aims to extract formal concepts and their hierarchical structures, which constitute concept lattices. In this paper, we only briefly introduce the minimal necessary notions of FCA. For a detailed introduction, please refer to [26–29].

Formal contexts.

A formal context is a triple (G, M, I), where G is a set of objects, M is a set of attributes, and I ⊆ G × M is a binary relation called incidence that expresses which object has which attribute. We write gIm or (g, m) ∈ I to express that the object g ∈ G has the attribute m ∈ M.

Formal contexts are illustrated as binary tables, as exemplified in the left panel of Fig 2, where rows correspond to objects and columns to attributes, and a cell is marked with a cross if the object in its row has the attribute in its column.

Fig 2.

Left: A sample formal context. Right: The concept lattice corresponding to the formal context in the left panel. The nodes in yellow are neighbors and the nodes in red are not neighbors.

https://doi.org/10.1371/journal.pone.0304858.g002

Formal concepts.

In a context (G, M, I), for subsets of objects and attributes A ⊆ G and B ⊆ M, (A, B) is called a formal concept if ∀(A1, B1) such that A ⊆ A1 ⊆ G and B ⊆ B1 ⊆ M, A1 × B1 ⊆ I is satisfied if and only if A = A1 and B = B1. If (A, B) is a formal concept, A is also called an extent, and B is also called an intent.
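The maximality condition above is equivalent to the standard characterization via the derivation operators, which the following Python sketch (ours, not from the paper) illustrates: (A, B) is a formal concept exactly when the attributes shared by all objects in A are exactly B, and the objects having all attributes in B are exactly A.

def derive_objects(A, M, I):
    # attributes shared by every object in A
    return {m for m in M if all((g, m) in I for g in A)}

def derive_attributes(B, G, I):
    # objects that have every attribute in B
    return {g for g in G if all((g, m) in I for m in B)}

def is_formal_concept(A, B, G, M, I):
    return derive_objects(A, M, I) == B and derive_attributes(B, G, I) == A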

Concept lattices.

Given a context (G, M, I), its concept lattice is the structure that organizes the set of all concepts extracted from the context with the hierarchical order <. For two concepts (A1, B1) and (A2, B2), we write (A1, B1) < (A2, B2) if A1 ⊂ A2 (which mutually implies B2 ⊂ B1).

Concept lattices are usually visualized with line diagrams. For example, the line diagram shown in the right panel of Fig 2 represents the concept lattice of the context shown in the left panel of the same figure. In the diagram, nodes represent formal concepts and lines represent the hierarchical order.

Neighbor of concepts.

In a concept lattice, for two concepts (A1, B1) and (A2, B2) such that (A1, B1) < (A2, B2), (A1, B1) is defined to be the lower neighbor of (A2, B2) if there is no concept (A3, B3) such that (A1, B1) < (A3, B3) < (A2, B2) [26]. In this case, (A2, B2) is dually called the upper neighbor of (A1, B1), or we may also directly say that (A1, B1) and (A2, B2) are neighbors or that they have the neighboring relation.

In the line diagram representation of a concept lattice, if two concepts are neighbors, they will be shown as two nodes directly connected by a line. For example, in the concept lattice shown on the right of Fig 2, the nodes in yellow are neighbors, and the nodes in red are not neighbors.
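A direct (if inefficient) way to test the neighboring relation is to check the definition literally, as in the Python sketch below (ours, not from the paper; less_than(c1, c2) is an assumed helper implementing the strict order < between concepts).

def is_lower_neighbor(c1, c2, all_concepts, less_than):
    # c1 is a lower neighbor of c2 iff c1 < c2 and nothing lies strictly between them
    if not less_than(c1, c2):
        return False
    return not any(less_than(c1, c3) and less_than(c3, c2) for c3 in all_concepts)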

FCA and bipartite networks

By comparing the definitions above, we can easily see that the definition of a formal context is equivalent to that of a bipartite network, and the definition of a maximal bi-clique is equivalent to that of a formal concept. That is, for every bipartite network C = (U, V, E), if we consider the two node sets U and V as the object set and the attribute set, and the edge set E as the binary relation between objects and attributes, then (U, V, E) is also a formal context. Also, if C1 = (A1, B1, E1) is a maximal bi-clique of C, then (A1, B1) is also a formal concept of the formal context (U, V, E). Fig 3 gives an example of this equivalence between bipartite networks and formal contexts, as well as the equivalence between maximal bi-cliques and formal concepts.

Fig 3. A depiction of the equivalence between bipartite networks and formal contexts, as well as the equivalence between maximal bi-cliques and formal concepts.

The bipartite network on the left can be represented as the formal context on the right. The sub-networks circled in purple and green are maximal bi-cliques of the bipartite network on the left, which can be represented as the two formal concepts framed in the corresponding colors in the formal context on the right.

https://doi.org/10.1371/journal.pone.0304858.g003

Much previous research has noted and discussed these two types of equivalences. However, besides them, the order relation and the neighboring relation between formal concepts also correspond to the cover relations between maximal bi-cliques. The hierarchical order and neighboring relation between formal concepts are considered important in FCA and are represented by concept lattices, so in bipartite networks, the cover relations between maximal bi-cliques may also offer valuable insights into the networks' structures. However, to the best of our knowledge, no previous studies on bipartite network theory have discussed this. Hence, we believe that a model designed to learn the cover relations between bi-cliques will hopefully capture more information on a network's structure and thus achieve better performance in bipartite link prediction.

BERT

BERT, short for Bidirectional Encoder Representations from Transformers [30], is a method for training a language model via the Transformer architecture [25]. It works in a “pre-train first and fine-tune next” mode. First, it pre-trains a large language model with two general tasks on a large amount of unlabeled free text. Then, it fine-tunes the pre-trained model with a specific downstream task on a small labeled dataset.

The two general tasks used in the pre-training phase are masked language model (MLM) and next sentence prediction (NSP), which are defined as follows.

MLM: MLM is the task that predicts the full sentence from a sentence where some words are randomly masked with a special token “[MASK]”. For example, if we have a sentence like “the quick brown fox jumps over the lazy dog”, the model will take a masked version of the sentence like “the [MASK] brown fox jumps [MASK] the lazy dog” as input and the original sentence as target for output. The task helps the model learn the co-occurrence relationship between words in the same sentence [30].

NSP: NSP is the task of predicting whether one sentence directly follows another. The model takes a sentence pair (A, B) as input and is expected to output TRUE if B is the sentence directly following A in the corpus, or FALSE otherwise. The task helps the model understand the relationships between sentences.

After pre-training, the values of the weight matrices of the pre-trained model are used as initial values for those of the fine-tuned model and are updated in the fine-tuning phase [31]. This is possible because pre-training and fine-tuning use exactly the same network architecture except for the output layer. The downstream task in the fine-tuning phase should be the final target task. Hence, after fine-tuning, we get the final model, which can be directly used for our target task.

BERT and FCA

As mentioned above, BERT is a method for training language models, which takes sentences in natural languages as inputs. Although formal concepts are not sentences in natural languages, they do share some similarities—If we regard objects and attributes as words, then the extents and intents of a concept can be regarded as sentences in a language that has an order-free syntax. With such similarities, we may expect that the pre-training phase of BERT, which was originally designed for learning the features of words and sentences, can also be used for learning the features of objects, attributes, and formal concepts from a formal context and its corresponding concept lattice. We may also expect that bipartite link prediction can be a suitable downstream task after such a pre-training because the information of nodes (a.k.a. objects and attributes), and maximal bi-cliques (a.k.a. formal concepts) learned in the pre-training phase is helpful in increasing the accuracy of predictions.

Note that in most natural languages, the syntax is not order-free, indicating that different word orderings usually have completely different meanings. Hence, BERT has a special mechanism called position embedding for capturing the order of words in the input sentence. However, since the extents and intents of formal concepts are unordered, there is no need to keep track of the order. Hence, in our method, this mechanism is removed from BERT.

Problem formulation and related work

Problem formulation

This research deals with the problem of bipartite link prediction, which consists of two different tasks—the first task predicts the missing or unknown relation between two nodes from the same node set of a bipartite network, and the other task predicts the missing or unknown relation between two nodes from different node sets. As analyzed above, a bipartite network is equivalent to a formal context, and the two node sets are equivalent to the object set and the attribute set. Hence, for ease of understanding, we hereby name the two tasks the object-object task, or the O-O task, and the object-attribute task, or the O-A task. The two node sets of a bipartite network are also directly called the objects and attributes in the rest of the paper.

The two tasks are formally defined as follows.

O-O task: Given an original network C = (U, V, E) and a target network C′ = (U, V′, E′), the O-O task aims to predict whether a group of nodes G = {u1, u2, ⋯, u|G|} ⊆ U that does not have object-object links in C should have object-object links in C′. A group of nodes G ⊆ U is considered to have object-object links in a network (U, V, E) if ∃v ∈ V such that ∀u ∈ G, (u, v) ∈ E.

O-A task: Given an original network C = (U, V, E) and a target network C′ = (U, V, E′), the O-A task aims to predict whether two nodes u ∈ U and v ∈ V that do not have an object-attribute link in C should have an object-attribute link in C′. Two nodes u ∈ U and v ∈ V are considered to have an object-attribute link in a network (U, V, E) if (u, v) ∈ E.
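The two link notions can be stated compactly in code. The sketch below (ours, not from the paper) assumes the network is given as node sets U, V and an edge set E of (u, v) pairs.

def has_object_object_links(G, V, E):
    # an object group G has object-object links iff some attribute node is
    # connected to every object in the group
    return any(all((u, v) in E for u in G) for v in V)

def has_object_attribute_link(u, v, E):
    # an object-attribute link is simply an edge of the bipartite network
    return (u, v) in E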

Note that the O-O task can be regarded as a special case of another task, hyperedge prediction. For more details, please refer to related publications [32, 33].

Related work on FCA-based bipartite link prediction

Previous methods for FCA-based bipartite link prediction can be classified into two groups—rule-based methods and embedding-based methods. The rule-based methods directly use pre-determined rules to make predictions based on the information provided by FCA; the embedding-based methods embed the objects and attributes into vectors using the information provided by FCA and use the embedding vectors for making link predictions.

In [17], the authors proposed a rule-based method for the O-A task by analyzing overlapping formal concepts extracted from a formal context. If the ratio of the overlapping part exceeds a predefined threshold, the concepts are considered parts of the same large formal concept, so all the missing links in the non-overlapping part, a.k.a. the structure hole [34], are predicted as present links. It presents an interesting approach to bipartite link prediction but achieves only limited performance. In [19], the authors also used this approach, but only as a pre-processing step before the main link prediction step using matrix factorization. This hybrid method achieves much better results than the pure rule-based method.

In [7], the authors proposed an embedding-based method called object2vec to embed objects into vectors using the information from formal concepts. Dually, they also proposed attribute2vec for embedding attributes into vectors, which uses exactly the same mechanism as object2vec. It has two embedding models, object2vec-CBoW and object2vec-SG, both derived from Word2Vec [22, 23]. Object2vec-CBoW, based on the continuous-bag-of-words model from Word2Vec, predicts a target object using the objects around it within the same extent; object2vec-SG, based on the skip-gram model from Word2Vec, uses an object to predict the other objects in the same extent. They conducted experiments on the O-O task on an author-publication network and demonstrated good performance.

In [20], the authors proposed another embedding method called Bag of Attributes (BoA). Their method trains a more complicated embedding model using Bidirectional Long Short-Term Memory (Bi-LSTM) [24] and Variational Autoencoder (VAE) [35] on formal contexts. They conducted experiments on the O-O task on the same datasets as object2vec, and the results are similar to those obtained with object2Vec.

Both rule-based and embedding-based methods have limitations. The rule-based methods use pre-determined rules, which may only work on some specific types of networks; the embedding-based methods use the information provided by FCA to train embedding vectors, but their models do not make full use of all the information provided by FCA—for example, the order relation and neighboring relation of formal concepts are not captured in the embeddings.

The proposed method

To address the limitations of the previous FCA-based methods for bipartite link prediction, we propose a novel method named BERT4FCA, which is designed to better learn the information of concept lattices and can conduct both the O-O task and the O-A task. We name our method BERT4FCA because it provides a general framework for using BERT to learn and utilize the information of concept lattices, so it is expected to be generally applicable to all tasks related to formal contexts. Although in this research we only discuss its application to the two bipartite link prediction tasks, we plan to study the possibility of applying our method to other tasks in the future.

Similar to all previous FCA-based bipartite link prediction methods, the objective of our method is to learn information from concept lattices and use it to perform the two link prediction tasks. However, the information our method aims to learn includes not only the extent and intent of each formal concept, but also the neighboring relations between formal concepts. Note that we choose to learn the neighboring relations instead of the general order relations because the neighboring relations are sufficient to reconstruct the concept lattice [26].

Our method consists of 4 steps: data preparation with FCA, input tokenization, BERT pre-training, and BERT fine-tuning. An overview of the workflow of our method is shown in Fig 4.

Data preparation

In this step, we convert the bipartite network into a formal context, extract all formal concepts, and construct the concept lattice. Then, we extract the neighboring relations between concepts from the concept lattice. We use Z-TCA [36] for extracting all formal concepts from a formal context, and use a topological sorting algorithm to extract neighboring relations from the concept lattice. The details of the algorithm are presented in Algorithm 1. After this step, we obtain all extents, intents, and neighboring relations of the concept lattice.

Algorithm 1 Get all neighboring relations using topological sorting.

Input A formal context (G, M, I) and its concept lattice.

Output {N(C)}, a list of lower neighbors of concepts. Here N(C) represents a lower neighbor of the concept C.

1: for each concept C in the concept lattice do
2:   D(C) ← the number of concepts C1 in the lattice such that C1 < C
3: end for
4: Create Q as an empty queue.
5: for each concept C in the concept lattice do
6:   if D(C) = 0 then
7:     Push C into the back of Q.
8:   end if
9: end for
10: while Q is not empty do
11:   Fetch C from the front of Q and pop it.
12:   for each concept C1 such that C < C1 do
13:     D(C1) ← D(C1) − 1
14:     if D(C1) = 0 then
15:       N(C1) ← C
16:       Push C1 into the back of Q.
17:     end if
18:   end for
19: end while
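A direct Python translation of Algorithm 1 is sketched below (ours, not the authors' implementation; concepts are assumed to be hashable identifiers and less_than(c1, c2) to implement the strict lattice order c1 < c2). Because a concept is pushed into the queue only after every concept below it has been processed, the concept that brings D(C1) to zero is maximal among those below C1, i.e., a lower neighbor.

from collections import deque

def lower_neighbors(concepts, less_than):
    # D(C): number of not-yet-processed concepts strictly below C
    D = {c: sum(1 for c1 in concepts if less_than(c1, c)) for c in concepts}
    N = {}
    Q = deque(c for c in concepts if D[c] == 0)  # start from the minimal concepts
    while Q:
        c = Q.popleft()
        for c1 in concepts:
            if less_than(c, c1):
                D[c1] -= 1
                if D[c1] == 0:
                    N[c1] = c  # c is a lower neighbor of c1
                    Q.append(c1)
    return N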

Input tokenization

In this step, the objects and attributes are tokenized into one-hot vectors. These one-hot vectors are further converted into dense vectors through the input embedding so that they can be processed by BERT.

The input embedding is the sum of two parts: the tokenization embedding and the segment embedding. The tokenization embedding is the general-sense embedding that uses a fully-connected layer to embed the tokenized one-hot vectors into a dense vector space. The segment embedding embeds the tokens of the segment-info sequence into a vector space of the same dimension as the tokenization embedding. In BERT, a training sample may sometimes be a concatenation of two different sequences (details will be introduced later). In this case, the segment-info sequence is generated to distinguish the two different sequences within the same training sample. For example, suppose the input of a training sample is a sequence of seven tokens, where the first four tokens belong to the first sequence and the last three tokens belong to the second sequence. Then, the segment-info sequence of this training sample should be (0, 0, 0, 0, 1, 1, 1).

Note that, as mentioned above, in the original framework of BERT, there is another input embedding called position embedding used for learning the order of words in a sentence. In BERT4FCA, however, it is removed because the input sequences are all unordered.
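A minimal PyTorch sketch of this input embedding is given below (ours, not the authors' code; vocab_size, d_model, and the two segment ids are illustrative). Applying nn.Embedding to token ids is equivalent to multiplying one-hot vectors by a fully-connected layer, and no position embedding is added.

import torch
import torch.nn as nn

class FCAInputEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model=768, n_segments=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # tokenization embedding
        self.segment_emb = nn.Embedding(n_segments, d_model)  # segment embedding

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len) integer tensors
        return self.token_emb(token_ids) + self.segment_emb(segment_ids)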

BERT pre-training

In this step, we use the BERT framework to pre-train two models—the object model and the attribute model—on all extents and intents, respectively. Here, we only introduce the pre-training process of the object model as an example. The input of a training sample for pre-training the object model is a pair of sequences, each representing an extent. Since the lengths of extents may differ, we pad short extents with a special token “[PAD]” so that all extents have the same length. We also add a special token, “[SEP]”, between the two sequences in order to separate them.

The model is trained with two tasks: masked token prediction (MTP) and neighboring concepts prediction (NCP), derived from the MLM and NSP tasks in the original version of BERT, respectively.

MTP is the task that helps the model learn the co-occurrence relationships between objects within the same extent. In this task, we randomly select a certain percentage of objects in both extents to be masked. Each selected object is replaced with a special token “[MASK]” with 80% probability, replaced with a random object with 10% probability, or kept unchanged with 10% probability. Then, the model takes the masked pair of extents as input and the unmasked pair of extents as the target output. That is, it is trained to predict the masked objects in the extents.
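The masking rule can be sketched as follows (ours, not the authors' code; mask_ratio, MASK_ID, and vocab are illustrative placeholders, and max_masked reflects the cap of 4 masked tokens per sample reported in the Experiments section).

import random

def mask_objects(token_ids, vocab, MASK_ID, mask_ratio=0.15, max_masked=4):
    masked, targets = list(token_ids), list(token_ids)
    n_mask = min(max_masked, max(1, int(len(token_ids) * mask_ratio)))
    for pos in random.sample(range(len(token_ids)), n_mask):
        r = random.random()
        if r < 0.8:
            masked[pos] = MASK_ID               # replace with "[MASK]"
        elif r < 0.9:
            masked[pos] = random.choice(vocab)  # replace with a random object
        # otherwise keep the original object unchanged
    return masked, targets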

NCP is the task that helps the model learn the neighboring relations between formal concepts in the concept lattice. In this task, the model takes a pair of extents as input and is expected to output TRUE if the pair corresponds to a pair of formal concepts that have the neighboring relation, or FALSE otherwise. Note that in this task, the number of negative samples is clearly much larger than that of positive samples. To get a balanced training set, we randomly select a small portion of the negative samples so that the numbers of positive and negative samples are the same.

The two tasks are trained simultaneously—the training loss is the sum of the losses of the two tasks. After pre-training, the pre-trained models are expected to have captured and stored information on the concept lattice, including relationships between objects and formal concepts.

BERT fine-tuning

In this step, we fine-tune the pre-trained object model and attribute model to make them fit our target bipartite link prediction tasks.

For the O-O task, we fine-tune the pre-trained object model with training samples generated from the original bipartite network (U, V, E). The training samples are generated as follows. First, we determine lm, the maximal length of a group of objects we want to predict. Then, we enumerate all object subsets U1 ⊆ U such that |U1| ≤ lm. For each U1, we create a training sample that takes the tokenized and padded sequence of U1 as input, and the label of the sample is set to TRUE if the objects in U1 have object-object links and FALSE otherwise. After generation, the training samples are fed into a network that has the same structure as the network used in the pre-training step, except that in the last layer, the hidden states are not fed into the output layer used for the two pre-training tasks, but into an output layer specific to the O-O task. Suppose the final output of the fine-tuning network for the O-O task is PO−O and the last hidden state of the basic BERT network, i.e., the “[CLS]” representation [30], is h[CLS]; then the O-O-task-specific output layer can be described as follows:
PO−O = σ(W ⋅ ReLU(W[CLS] ⋅ h[CLS])), (1)
where σ(⋅) is the sigmoid function, ReLU(⋅) is the Rectified Linear Unit (ReLU) [37], and W[CLS] and W are weight matrices.

For the O-A task, we fine-tune both the pre-trained object model and the attribute model together, with training samples generated from the original bipartite network. The training samples are generated as follows. We first enumerate all objects u ∈ U and attributes v ∈ V. For each pair (u, v), we create a training sample that takes the tokens of u and v as input, and the label of the sample is set to TRUE if (u, v) ∈ E and FALSE otherwise. After generation, the tokenized object and tokenized attribute are fed into two separate BERT networks—except for the last output layer, the first network has the same structure as the network used in pre-training the object model, and the second network has the same structure as that used in pre-training the attribute model. The last hidden states of both networks are concatenated and fed through a single output layer specific to the O-A task. Suppose the final output of the fine-tuning network for the O-A task is PO−A and the last hidden states (the “[CLS]” representations) of the two BERT networks are hobj and hattr; then the O-A-task-specific output layer can be described as follows:
PO−A = σ(W ⋅ ReLU(W[CLS] ⋅ (hobj ⊕ hattr))), (2)
where a ⊕ b means the concatenation of vectors a and b.
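The two task-specific output heads in Eqs (1) and (2) can be sketched in PyTorch as follows (ours, not the authors' code; the 768- and 512-dimensional sizes follow the parameters reported in the Experiments section, and the exact layer shapes are illustrative).

import torch
import torch.nn as nn

class OOHead(nn.Module):
    """Eq (1): sigmoid(W * ReLU(W_[CLS] * h_[CLS])) applied to the [CLS] hidden state."""
    def __init__(self, d_model=768, d_hidden=512):
        super().__init__()
        self.w_cls = nn.Linear(d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, 1)

    def forward(self, h_cls):  # h_cls: (batch, d_model)
        return torch.sigmoid(self.w_out(torch.relu(self.w_cls(h_cls))))

class OAHead(nn.Module):
    """Eq (2): the same head applied to the concatenated [CLS] states of the two models."""
    def __init__(self, d_model=768, d_hidden=512):
        super().__init__()
        self.w_cls = nn.Linear(2 * d_model, d_hidden)
        self.w_out = nn.Linear(d_hidden, 1)

    def forward(self, h_obj, h_attr):
        h = torch.cat([h_obj, h_attr], dim=-1)  # concatenation ⊕
        return torch.sigmoid(self.w_out(torch.relu(self.w_cls(h))))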

Above is the whole workflow of our method. From it, we can see that our method has three advantages. First, in the pre-training step, our method can learn more information from concept lattices than previous FCA-based embedding methods like object2vec. As shown in Fig 5, when embedding an object, object2vec uses only the information of a small set of objects within the same extent. In the pre-training step of our method, however, we learn the features of an object using the information from all objects in the entire extent as well as the objects in the extents of its neighbor concepts in the concept lattice.

Fig 5. The comparison of how much information from a concept lattice is learned and used when predicting an object by two methods, with object2vec shown on the left and BERT4FCA shown on the right.

The target object to be predicted is circled in blue. The information used for predicting the object is shown in red.

https://doi.org/10.1371/journal.pone.0304858.g005

Second, our method can directly extract useful features for link prediction from the formal context, in addition to learning from the concept lattice. During the fine-tuning step, we train the model with samples generated from the formal context, which enables the model to learn directly from the formal context. Therefore, even on datasets where the concept lattice cannot provide much useful information for link prediction, our method can still learn from the formal context. In contrast, previous FCA-based methods do not have the capability of directly learning from the formal context, and previous non-FCA-based methods for bipartite link prediction cannot make use of bi-cliques.

Third, our method works in a pre-train first and fine-tune next mode that first pre-trains two large models and then fine-tunes them on various downstream tasks related to the formal context, meaning that if we wish to conduct new tasks on a formal context where we have pre-trained the object model and attribute model, we can skip the first three steps and directly conduct the final fine-tuning step. Previous methods, however, may need to re-train the models when coming to a new task.

Experiments

Datasets

We conduct experiments on five real-world datasets: ICFCA, BMS-POS, Keyword-Paper, Review and iJO1366. All datasets are used for both the O-O task and the O-A task. We depict the features of these datasets in Table 1. A detailed description of each follows.

ICFCA.

The ICFCA dataset is an author-paper network provided by [7]—the objects represent the authors, the attributes represent the publications, and each edge (a.k.a. relation) represents that the author is in the author list of the publication. This dataset is generated from the Digital Bibliography & Library Project (DBLP) dump of 1st Aug 2019, which is available at https://dblp.uni-trier.de/xml/.

For the O-O task, we are to simulate the practical case in which we wish to predict future co-authorships or seek potential co-authors from an author-paper network at a certain time point. Hence, we generate a history network from the full network as the input network for this task by removing the authors, publications, and author-paper edges after 1st Jan 2016; we generate the current network as the target network for this task from the full network by removing authors who had no publication before 31st Dec 2015 and their corresponding edges.

For the O-A task, we are to simulate the practical case that some parts of the network are missing, and we wish to use the known edges in the network to predict the potentially missing edges. We generate the input network for this task from the full network by randomly removing 10% of author-paper edges; we use the full network as the target network for this task.

BMS-POS.

The BMS-POS dataset is a product purchase transaction network provided by KDD Cup 2000—the objects represent the products, the attributes represent purchase transactions, and an edge represents that the product is bought in a certain transaction. The original data is very large, so in this research, we only select the first 1946 transactions. The dataset is available at https://kdd.org/kdd-cup/view/kdd-cup-2000.

For the O-O task, we are to simulate the practical case in which we wish to predict whether two products are likely to be bought by the same customer. Hence, we generate the input network from the full network by randomly removing 10% of product-transaction edges; we use the full network as the target network for this task.

For the O-A task, we are to simulate the same practical case as the ICFCA dataset. We use the same input network and target network as those used in the O-O task of this dataset.

Keyword-paper.

The Keyword-Paper dataset is an original dataset generated by us—the objects represent the keywords, the attributes represent the publications, and each edge represents that the paper has the keyword. It is generated from the DBLP dump of 31st Jan 2023. From the dump, we select the top 162 most frequent keywords and all 5640 publications after 1st Jan 2010 to create the keyword-paper network.

For the O-O task, we are to simulate the practical case in which we wish to predict potentially related keywords, which may give inspiration for new research. For example, while “BERT” and “FCA” were never used as keywords of the same paper before, if they are predicted to be related, researchers may get inspired and draft a new study similar to ours. The generation of the input network and target network is similar to that of the ICFCA dataset, and the cut-off date for the history network is also set to 31st Dec 2015.

For the O-A task, we are to simulate the same practical case as in the previous two datasets. We also generate the input and target networks in the same way as in the O-A tasks of the previous two datasets. Note that in this dataset, after removing 10% of the edges, some attributes have no edge connecting to them, so the number of attributes in the networks used for the O-A task is smaller than that used for the O-O task.

Review.

The Review dataset contains peer review data from a top computer science conference. It is collected from [38] and previously used as a test dataset in [39]. The objects represent reviewers; the attributes represent manuscripts; and each edge represents that the reviewer suggests accepting the manuscript.

For both the O-O and the O-A tasks, we are to simulate the practical cases similar to those of the BMS-POS dataset. Hence, the input networks and the target networks are also generated using the same procedures as for the BMS-POS dataset.

iJO1366.

The iJO1366 dataset is a metabolite-reaction network processed from a genome-scale metabolic network of Escherichia coli. It is collected from the BiGG dataset [40] and previously used as a test dataset in [41]. We extract the metabolites as objects and the reactions as attributes, and each edge represents that the metabolite participates in the reaction.

For both the O-O and the O-A tasks, we are to simulate the practical cases similar to those of the BMS-POS dataset. Hence, the input networks and the target networks are also generated using the same procedures as for the BMS-POS dataset.

Generation of labeled training and test samples

For each dataset, we have generated an input network C = (U, V, E) and the target network C′ = (U, V, E′). We are then to generate labeled training and test samples from the two networks with the following procedure.

The O-O task.

For the training samples, we first enumerate all object-object links in C which contain no more than εp objects, with εp being a pre-determined threshold. These links are labeled as positive training samples. That is, we enumerate every object group G ⊆ U satisfying ‖G‖ ≤ εp and ∃v ∈ V such that ∀u ∈ G, (u, v) ∈ E. Then, we randomly choose the same number of negative training samples as positive training samples, with each negative sample being an object group with no more than εp objects that does not have object-object links in C. Formally, each negative sample is an object group G ⊆ U such that ‖G‖ ≤ εp and ∄v ∈ V such that ∀u ∈ G, (u, v) ∈ E. The proportion of object-object links of different sizes in the set of negative training samples is kept the same as that in the set of positive training samples. That is, if the set of all positive training samples is T1 and the set of all negative training samples is T2, we should have ‖T1‖ = ‖T2‖ and, for each i = 2, 3, ⋯, εp, ‖{G ∈ T1 ∣ ‖G‖ = i}‖ = ‖{G′ ∈ T2 ∣ ‖G′‖ = i}‖.

For the test samples, we enumerate all object-object links in C′ which contain no more than εp objects and do not appear in C. These links are labeled as positive test samples. That is, we extract every object group G ⊆ U satisfying ‖G‖ ≤ εp and ∃v ∈ V such that ∀u ∈ G, (u, v) ∈ E′ − E. Then, we randomly choose the same number of negative test samples as positive test samples, with each negative test sample being an object group with no more than εp objects that does not have object-object links in C′. Formally, each negative test sample is a group G ⊆ U such that ‖G‖ ≤ εp and ∄v ∈ V such that ∀u ∈ G, (u, v) ∈ E′. The proportion of object-object links of different sizes in the set of negative test samples is also kept the same as that in the set of positive test samples.

In this research, the value of εp is set to 5 for the ICFCA and Review datasets; it is set to 3 for the other three datasets. Although theoretically, our method as well as all previous methods can predict object-object links containing any number of objects, we set this limit to prevent the number of training/test samples from exponentially growing.
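The O-O sample generation described above can be sketched as follows (ours, not the authors' code; eps_p is the threshold εp, and negatives are sampled so that each group size is represented as often as in the positives).

from itertools import combinations
import random

def generate_oo_samples(U, V, E, eps_p):
    positives, negative_pool = [], {}
    for size in range(2, eps_p + 1):
        negative_pool[size] = []
        for G in combinations(sorted(U), size):
            has_links = any(all((u, v) in E for u in G) for v in V)
            (positives if has_links else negative_pool[size]).append(G)
    negatives = []
    for size in range(2, eps_p + 1):
        k = sum(1 for G in positives if len(G) == size)   # match the size distribution
        negatives += random.sample(negative_pool[size], min(k, len(negative_pool[size])))
    return [(G, 1) for G in positives] + [(G, 0) for G in negatives]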

The O-A task.

For the training samples, we first enumerate all object-attribute links in C and label them as positive training samples. That is, we enumerate every object-attribute pair (u, v) such that u ∈ U, v ∈ V, and (u, v) ∈ E. Then, we randomly generate the same number of negative training samples as positive training samples, with each negative training sample being an object-attribute link that does not exist in C. Formally, each negative training sample (u, v) should satisfy u ∈ U, v ∈ V, and (u, v) ∉ E.

For the test samples, we first enumerate all object-attribute links in C′ that do not exist in C. These links are all labeled as positive test samples. That is, we enumerate every pair (u, v) such that u ∈ U, v ∈ V, and (u, v) ∈ E′ − E. Then, we randomly generate the same number of negative test samples as positive test samples, with each negative test sample being an object-attribute link that does not exist in C′. Formally, each negative test sample (u, v) should satisfy u ∈ U, v ∈ V, and (u, v) ∉ E′.
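The O-A sample generation is simpler and can be sketched as follows (ours, not the authors' code); applying the same procedure to the edges in E′ − E and to the non-edges of C′ yields the test samples.

import random

def generate_oa_training_samples(U, V, E):
    positives = list(E)                                        # every edge of C
    non_edges = [(u, v) for u in U for v in V if (u, v) not in E]
    negatives = random.sample(non_edges, min(len(positives), len(non_edges)))
    return [(pair, 1) for pair in positives] + [(pair, 0) for pair in negatives]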

Evaluation metrics

To give a fair and comprehensive evaluation, we use the following three measures: the F1 score, the AUC score, and the AUPR score. All three scores are estimated from the four basic counts TP, FP, TN, and FN. TP is the number of samples that are positive and are predicted positive. FP is the number of samples that are negative but are falsely predicted positive. TN is the number of samples that are negative and are predicted negative. FN is the number of samples that are positive but are falsely predicted negative.

The F1 score is the harmonic mean of precision and recall. Precision, recall, and F1 are estimated as follows:
Precision = TP / (TP + FP),  Recall = TP / (TP + FN),  F1 = 2 × Precision × Recall / (Precision + Recall). (3)

The F1 value may vary as we change the threshold for the prediction score. Hence, in this research, for each test case, we try 20 different thresholds with the following procedure—the initial threshold is set to 0, and at each trial, we increase the threshold by 0.05. After all trials, we report the highest F1 value obtained.

The AUC (Area Under the Curve) score is estimated by computing the area under the ROC (Receiver Operating Characteristic) curve. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. TPR and FPR are estimated as follows:
TPR = TP / (TP + FN),  FPR = FP / (FP + TN). (4)

The AUPR (Area Under the Precision-Recall Curve) score is estimated by computing the area under the Precision-Recall curve, which is created by plotting the precision rate against the recall rate at various threshold settings.
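The three metrics can be computed as in the sketch below (ours, not the authors' code, using scikit-learn for the AUC and AUPR scores; the F1 search follows the 20-threshold procedure described above).

import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

def evaluate(y_true, y_score):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    best_f1 = 0.0
    for t in np.arange(0.0, 1.0, 0.05):            # thresholds 0.00, 0.05, ..., 0.95
        y_pred = (y_score >= t).astype(int)
        tp = int(((y_pred == 1) & (y_true == 1)).sum())
        fp = int(((y_pred == 1) & (y_true == 0)).sum())
        fn = int(((y_pred == 0) & (y_true == 1)).sum())
        if tp > 0:
            prec, rec = tp / (tp + fp), tp / (tp + fn)
            best_f1 = max(best_f1, 2 * prec * rec / (prec + rec))
    auc_roc = roc_auc_score(y_true, y_score)
    precision, recall, _ = precision_recall_curve(y_true, y_score)
    aupr = auc(recall, precision)
    return best_f1, auc_roc, aupr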

Experimental environment and parameters

Experimental environment.

The experiments are conducted on a Windows 11 server with 64GB RAM, an AMD Ryzen 9 7900X CPU, and an NVIDIA GeForce RTX 4090 GPU.

The code for the deep-learning part of our method is implemented with Python 3.8.18 and PyTorch 2.1.0. The code for the algorithm for extracting formal concepts is written in C++ and compiled with MinGW 13.2.0.

Parameters.

The dimension of the input embeddings of the Transformer encoder is 768, and the dimension of the hidden layer in the Transformer encoder is 3072.

For the BERT pre-training step, the maximum number of masked tokens used in MTP is set to 4, the dimension of embeddings to 768, the number of attention heads to 12, and the number of Transformer encoder layers to 6.

For the BERT fine-tuning step, the dimension of the hidden layer is set to 512. The batch size is set to 24.

The experiment of the O-O task

We evaluate the performance of BERT4FCA in the O-O task on the five datasets. The competitor methods used in this experiment include an FCA-based method object2vec [7], a widely-used classic method Node2Vec [13], two widely-used classic hyperedge prediction methods, common neighbors [42] and Katz index [43], and a relatively new hyperedge prediction method HPRA [32]. Note that for the object2vec method, both CBoW and SG models are tested. For the Node2Vec method, after obtaining embedding vectors for each object and attribute from the input network, we train a downstream Logistic Regression classifier using the labeled training samples generated with the procedure introduced above. For each sample G, its feature is set to the concatenation of the embedding vectors of every object gG. The objects are sorted in the lexicographical order before concatenation. If a sample has fewer than εp objects, its feature vector is padded with zeros to make sure that all samples’ feature vectors are of the same dimension. For the three hyperedge prediction methods, as introduced previously, they can be directly applied to the O-O task simply because the O-O task is a special case of hyperedge prediction.

The results are shown in Table 2.

The results show that BERT4FCA outperforms the other models across all datasets. We also observe that although Node2Vec achieves scores close to our method on the BMS-POS dataset, its performance on the other datasets is much lower. This shows the stability of our method, i.e., it achieves stable, high performance across different datasets. Also, as can be seen from the results, the three hyperedge prediction methods perform far better than object2vec, while our BERT4FCA still outperforms all of them. This implies that the idea of utilizing the information in the concept lattice for link prediction is valuable, but object2vec does not utilize it very well. Our BERT4FCA, on the other hand, fully learns and utilizes the information from the concept lattice and thus achieves good results. We conduct extra ablation experiments later in this section to verify this inference.

The experiment of the O-A task

We evaluate the performance of BERT4FCA in the O-A task on all datasets. We compare BERT4FCA with an FCA-based method, Structure Hole [17], two widely-used non-FCA-based classic methods—Node2Vec and matrix factorization with singular value decomposition (MF-SVD)—and a relatively new deep-learning method, SBGNN [39]. For the Node2Vec method, after obtaining embedding vectors for each object and attribute from the input network, we train a downstream Logistic Regression classifier using the labeled training samples generated with the procedure introduced above. For each sample (u, v), its feature is set to the concatenation of the embedding vectors of object u and attribute v.

The results are shown in Table 3.

From the results, we can see that among all five datasets, BERT4FCA has the best overall performance. Its performance is notably better than Structure Hole and generally better than SBGNN. Although MF-SVD and Node2Vec achieve higher scores than BERT4FCA on some metrics in some datasets, their overall performance is mixed, while BERT4FCA performs well overall across all five datasets. The reasons why these two methods show mixed performance are analyzed below.

In the BMS-POS dataset, MF-SVD exhibits higher AUC scores than the other methods but very low AUPR scores. A low AUPR score indicates that the model's high prediction scores do not correlate well with the positive class, suggesting that the model has difficulty achieving high precision. Since link prediction focuses on predicting the generation of new links, i.e., the positive samples rather than the negative ones, a model with a low AUPR is considered to perform poorly at link prediction. The low performance of MF-SVD may suggest that the underlying latent features of the network and the relationships between nodes are intricate and challenging for MF to capture effectively. Additionally, since the BMS-POS dataset is extracted from a larger dataset, the relationships between nodes in this network may not be consistent and stable, resulting in the low prediction performance of MF.

In the iJO1366 dataset, Node2Vec exhibits higher AUPR and AUC scores than our method. However, its scores on the Review dataset are far lower than our method's, indicating that the performance of Node2Vec varies greatly across datasets. According to previous research [44], the sparser the network is, the better the random-walk-based embedding process can capture its features, so Node2Vec is expected to work better on sparse networks. Since iJO1366 is the sparsest dataset, Node2Vec achieves better performance on it, but on relatively dense networks its performance becomes notably lower. Hence, we consider our method more practical than Node2Vec thanks to its overall good performance on all metrics across all datasets.

Finally, we also find that the performance of the other FCA-based method, Structure Hole, is notably lower on these five datasets than on the datasets used in [17]. As analyzed before, Structure Hole is a rule-based method that uses the same simple rule to extract information from concept lattices and make predictions on all datasets, so it is likely to have lower performance on datasets that do not fit the rule well, and our experimental results confirm this. This again shows the importance of developing a method like BERT4FCA that can automatically capture the information in concept lattices using statistical machine-learning techniques.

Ablation experiments

To verify that our original mechanisms function well, we conduct two ablation experiments—the first checks whether our method can indeed learn the neighboring relations of formal concepts; the second checks whether the information from concept lattices learned by our method is indeed helpful for making link predictions.

In the first ablation experiment, we only use 80% of the training samples to pre-train both the object model and the attribute model on all datasets. The remaining 20% of the original training samples are kept as test samples for the NCP task—these extent/intent pairs are not present in the training samples, so if our method can correctly predict whether they are neighbors, it indicates that our method has learned the structure of the concept lattice well. The results are shown in Table 4.

Table 4. The results of the first ablation experiment.

https://doi.org/10.1371/journal.pone.0304858.t004

The results suggest that BERT4FCA indeed learns the neighboring relations from the concept lattices of all datasets well. We also notice that the results of the object models are generally better than those of the attribute models across all datasets. We attribute this to the fact that, in these datasets, the average length of intents is greater than that of extents, making it potentially more challenging to effectively capture the neighboring relations through intents.

In the second ablation experiment, we check whether the information learned from concept lattices indeed contributes to better link prediction results. Since the information of the concept lattices is learned through the NCP and MTP tasks in the pre-training step, in this experiment, we skip the pre-training step and directly fine-tune the models with randomly initialized weights. By comparing the results with those of our full method, we can tell whether our specially designed tasks for learning the information of concept lattices function well. The results for the O-O task are shown in Table 5, and the results for the O-A task are shown in Table 6.

Table 5. The results for the O-O tasks of the second ablation experiment.

https://doi.org/10.1371/journal.pone.0304858.t005

Table 6. The results for the O-A tasks of the second ablation experiment.

https://doi.org/10.1371/journal.pone.0304858.t006

The results show that learning more information from concept lattices indeed improves performance on both bipartite link prediction tasks, while the degree of improvement varies across datasets and tasks. The overall degree of improvement in the O-O task is higher than that in the O-A task. We attribute this to the object models capturing the information of concept lattices, such as the neighboring relations, better than the attribute models, as demonstrated in the first ablation experiment. This difference in improvement shows that better learning of the neighboring relations contributes to prediction performance.

Also, we find that in the O-O task on the BMS-POS dataset, learning the information from concept lattices contributes the least improvement. This may be because the performance of our method without this information is already sufficiently high, as the AUC and AUPR scores reach 96%. In such a case, it is hard to improve the performance further, even with an effective mechanism.

Furthermore, we observe that even without the pre-training step, our method still achieves relatively good performance. This indicates that the BERT framework can extract useful information for prediction directly from the raw bipartite network. This finding may provide hints for future research on training a lightweight model on a small part of the network using the effective BERT-like transformer network structure. Nevertheless, when stability and quality are strongly required, we recommend applying the full BERT4FCA model, since the information about maximal bi-cliques extracted through our proposed NCP and MTM tasks can further improve performance.

Conclusion and future work

In this paper, we proposed BERT4FCA, a novel FCA-based method for bipartite link prediction. It learns richer information from concept lattices and uses it to make bipartite link predictions. The experimental results demonstrated that our method outperforms previous FCA-based methods such as object2Vec and non-FCA-based classic methods such as MF and Node2Vec in both the O-O and O-A tasks on five different datasets, and that it is more stable than previous methods. Ablation experiments further demonstrated that the model learns the neighboring relations between maximal bi-cliques as expected, and that this information contributes to better link prediction results. Furthermore, BERT4FCA provides a general framework for employing BERT to learn the information extracted by FCA; hence, in addition to bipartite link prediction, we believe it can be applied to a broader range of real-world tasks, which we plan to explore in the future.

However, we have also identified aspects of our method that can be further refined. First, limited by computational resources, we were unable to pre-train a single all-in-one model for both objects and attributes in this research. If such a model could be trained well, it would be more convenient, as all downstream tasks would need only one pre-trained model. Second, although BERT4FCA outperforms previous methods in prediction quality, the pre-training step is time-consuming, especially on datasets with a large number of formal concepts. With our current computational resources, finishing pre-training in a reasonable time requires limiting the number of concepts to around 15,000. Although there are methods for reducing the number of formal concepts in concept lattices, it is unclear whether training on a reduced lattice lowers performance; if it does, we may need a technique for trading off training time against performance. We plan to address these two points in our future work.

References

  1. Getoor L, Diehl CP. Link mining: a survey. ACM SIGKDD Explorations Newsletter. 2005;7(2):3–12.
  2. Martínez V, Berzal F, Cubero J. A survey of link prediction in complex networks. ACM Computing Surveys (CSUR). 2016;49(4):1–33.
  3. Wang P, Xu B, Wu Y, Zhou X. Link prediction in social networks: the state-of-the-art. arXiv preprint arXiv:1411.5118. 2014.
  4. Shang M, Lü L, Zhang Y, Zhou T. Empirical analysis of web-based user-object bipartite networks. Europhysics Letters. 2010;90(4):48006.
  5. Asratian AS, Denley TM, Häggkvist R. Bipartite graphs and their applications. vol. 131; 1998.
  6. Chen B, Li F, Chen S, Hu R, Chen L. Link prediction based on non-negative matrix factorization. PLoS ONE. 2017;12(8):e0182968. pmid:28854195
  7. Dürrschnabel D, Hanika T, Stubbemann M. Fca2vec: Embedding techniques for formal concept analysis. In: Complex Data Analytics with Formal Concept Analysis. Cham: Springer; 2021. p. 47–74.
  8. Menon AK, Elkan C. Link prediction via matrix factorization. In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5–9, 2011, Proceedings, Part II 22. Springer; 2011. p. 437–452.
  9. Fokoue A, Sadoghi M, Hassanzadeh O, Zhang P. Predicting drug-drug interactions through large-scale similarity-based link prediction. In: The Semantic Web. Latest Advances and New Domains: 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29–June 2, 2016, Proceedings 13. Springer; 2016. p. 774–789.
  10. Kunegis J, De Luca EW, Albayrak S. The link prediction problem in bipartite networks. In: International Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems. Springer; 2010. p. 380–389.
  11. Nasiri E, Berahmand K, Rostami M, Dabiri M. A novel link prediction algorithm for protein-protein interaction networks by attributed graph embedding. Computers in Biology and Medicine. 2021;137:104772. pmid:34450380
  12. Lü L, Zhou T. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications. 2011;390(6):1150–1170.
  13. Grover A, Leskovec J. node2vec: Scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016. p. 855–864.
  14. Berahmand K, Daneshfar F, Salehi ES, Li Y, Xu Y. Autoencoders and their applications in machine learning: a survey. Artificial Intelligence Review. 2024;57(2):28.
  15. Zhang G, Liu Y, Jin X. A survey of autoencoder-based recommender systems. Frontiers of Computer Science. 2020;14:430–450.
  16. Ma C, Zhou T, Zhang H. Playing the role of weak clique property in link prediction: A friend recommendation model. Scientific Reports. 2016;6(1):30098. pmid:27439697
  17. Zhao J, Sun M, Chen F, Chiu P. Missbin: Visual analysis of missing links in bipartite networks. In: 2019 IEEE Visualization Conference (VIS). IEEE; 2019. p. 71–75.
  18. Xia S, Dai B, Lim E, Zhang Y, Xing C. Link prediction for bipartite social networks: The role of structural holes. In: 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. IEEE; 2012. p. 153–157.
  19. Peng S, Yamamoto A, Ito K. Link prediction on bipartite networks using matrix factorization with negative sample selection. PLoS ONE. 2023;18(8):e0289568. pmid:37585433
  20. Marquer E, Kulkarni A, Couceiro M. Embedding formal contexts using unordered composition. In: FCA4AI-8th International Workshop "What can FCA do for Artificial Intelligence?" (co-located with ECAI2020); 2020.
  21. Gaume B, Navarro E, Prade H. Clustering bipartite graphs in terms of approximate formal concepts and sub-contexts. International Journal of Computational Intelligence Systems. 2013;6(6):1125–1142.
  22. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.
  23. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems. 2013;26.
  24. Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks. 2005;18(5–6):602–610. pmid:16112549
  25. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Advances in Neural Information Processing Systems. 2017;30.
  26. Ganter B, Wille R. Formal concept analysis: mathematical foundations. 1st ed. Berlin Heidelberg: Springer-Verlag; 1999.
  27. Ignatov DI. Introduction to formal concept analysis and its applications in information retrieval and related fields. Information Retrieval: 8th Russian Summer School, RuSSIR 2014, Nizhniy Novgorod, Russia, August 18–22, 2014, Revised Selected Papers 8. 2015; p. 42–141.
  28. Poelmans J, Kuznetsov SO, Ignatov DI, Dedene G. Formal concept analysis in knowledge processing: A survey on models and techniques. Expert Systems with Applications. 2013;40(16):6601–6623.
  29. Poelmans J, Ignatov DI, Kuznetsov SO, Dedene G. Formal concept analysis in knowledge processing: A survey on applications. Expert Systems with Applications. 2013;40(16):6538–6560.
  30. Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 2018.
  31. Estmark A. Text Block Prediction and Article Reconstruction Using BERT; 2021.
  32. Kumar T, Darwin K, Parthasarathy S, Ravindran B. HPRA: Hyperedge prediction using resource allocation. In: Proceedings of the 12th ACM Conference on Web Science; 2020. p. 135–143.
  33. Maurya D, Ravindran B. Hyperedge prediction using tensor eigenvalue decomposition. Journal of the Indian Institute of Science. 2021;101:443–453.
  34. Lazega E. Structural holes: the social structure of competition; 1995.
  35. Kingma DP, Welling M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114. 2013.
  36. Peng S, Yamamoto A. Z-TCA: Fast algorithm for triadic concept analysis using zero-suppressed decision diagrams. Journal of Information Processing. 2023;31:722–733.
  37. Shang W, Sohn K, Almeida D, Lee H. Understanding and improving convolutional neural networks via concatenated rectified linear units. In: International Conference on Machine Learning. PMLR; 2016. p. 2217–2225.
  38. Derr T, Johnson C, Chang Y, Tang J. Balance in signed bipartite networks. In: Proceedings of the 28th ACM International Conference on Information and Knowledge Management; 2019. p. 1221–1230.
  39. Huang J, Shen H, Cao Q, Tao S, Cheng X. Signed bipartite graph neural networks. In: Proceedings of the 30th ACM International Conference on Information & Knowledge Management; 2021. p. 740–749.
  40. King ZA, Lu J, Dräger A, Miller P, Federowicz S, Lerman JA, et al. BiGG Models: A platform for integrating, standardizing and sharing genome-scale models. Nucleic Acids Research. 2015;44(D1):D515–D522. pmid:26476456
  41. Yadati N, Nitin V, Nimishakavi M, Yadav P, Louis A, Talukdar P. NHP: Neural hypergraph link prediction. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management; 2020. p. 1705–1714.
  42. Newman ME. Clustering and preferential attachment in growing networks. Physical Review E. 2001;64(2):025102. pmid:11497639
  43. Katz L. A new status index derived from sociometric analysis. Psychometrika. 1953;18(1):39–43.
  44. Dalmia A, Gupta M. Towards interpretation of node embeddings. In: Companion Proceedings of The Web Conference 2018; 2018. p. 945–952.