
SCGG: A deep structure-conditioned graph generative model

Abstract

Deep learning-based graph generation approaches have remarkable capacities for graph data modeling, allowing them to solve a wide range of real-world problems. Enabling these methods to consider different conditions during the generation procedure further increases their effectiveness by empowering them to generate new graph samples that meet the desired criteria. This paper presents a conditional deep graph generation method called SCGG that considers a particular type of structural condition. Specifically, the proposed SCGG model takes an initial subgraph and autoregressively generates new nodes and their corresponding edges on top of the given conditioning substructure. The architecture of SCGG consists of a graph representation learning network and an autoregressive generative model, which is trained end-to-end. More precisely, the graph representation learning network is designed to compute continuous representations for each node in a graph, which are affected not only by the features of adjacent nodes but also by those of farther nodes. This network is primarily responsible for providing the generation procedure with the structural condition, while the autoregressive generative model mainly maintains the generation history. Using this model, we can address graph completion, a pervasive and inherently difficult problem of recovering missing nodes and their associated edges in partially observed graphs. The computational complexity of the SCGG method is shown to be linear in the number of graph nodes. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our method compared with state-of-the-art baselines.

1 Introduction

With the ever-increasing growth of data collection and production technologies, large amounts of data are readily accessible. In many cases, some relationship exists between data entities, which, if taken into consideration, can lead to more precise data analyses. Such relationships are mostly represented by graph data structures, which is why graph-related research has become a widely discussed topic in many areas, including chemistry [1], medical applications [2], social network studies [3], and knowledge graph-related research [4]. Many recent studies are dedicated to graph representation learning [5–9], aiming to obtain suitable representations of nodes, edges, or the entire graph in continuous space to be further utilized by downstream tasks.

Graph generation is another important branch of graph-related research, which often benefits from the results of graph representation learning studies. This research field has a history of several decades. It has recently been revived by receiving renewed attention from scholars, mainly due to the advances in machine learning, and in particular deep learning techniques. Graph generation aims to provide models that can generate new graph samples from the desired data distributions. Thus, similar to generative methods in other data domains such as image [10], text [11], and speech [12], graph generative approaches can bring substantial capacity for graph data modeling to address various real-world problems such as drug design [13], understanding and modeling the interactions in social networks [14], and human diseases diagnosis [15].

One of the desired and essential properties of generative methods is their ability to carry out the generation procedure in a controlled manner so that the produced samples comply with predetermined conditions by having the required characteristics. In this regard, numerous studies have been conducted to develop conditional generative models in different data domains, such as image [16] and text [17]. Initial steps [18–22] have also been taken to make graph generators conditional. However, compared to the work performed in other data domains and the needs and capacities of this field, much remains to be done.

In addition to what we have discussed so far, there is a common problem manifesting itself when working with different types of data. Specifically, in many cases, the data is not completely available, which can be caused by various factors such as limitations of data collection tools, issues related to privacy, or inadequacy of storage space. This can significantly degrade the performance of data analysis methods. Therefore, it is often crucial to recover the missing part of the data before processing it; hence, various methods have been proposed in different data domains to address this challenge. Regarding graph data, many methods have been developed over the years [23, 24] to predict missing links between graph nodes, and researchers are still actively pursuing a solution to this problem [25]. However, an intrinsically more complicated challenge arises when graph nodes themselves are missing. We will refer to this problem as graph completion, which, unlike the widely investigated problem of link prediction, has been much less addressed despite its importance and pervasiveness.

To address the issues mentioned above, we propose the Structure-Conditioned Graph Generator (SCGG), an end-to-end deep learning-based conditional graph generative approach. The SCGG model takes an initial subgraph as the structural condition. It then autoregressively performs the graph generation procedure by adding new nodes and predicting the inter-links between the new nodes and those in the conditioning subgraph, as well as the intra-links between the new nodes. In this way, our generative model ensures the existence of desired subgraphs in the final generated graphs, which can have several applications in both molecular and non-molecular domains. Specifically, for designing molecular graphs, the existence of desired chemical substructures can bring certain chemical properties to the final molecules. Moreover, regarding non-molecular graphs, the SCGG model can be best utilized to solve the graph completion problem, in which some graph nodes and their corresponding edges are entirely missing. Our study focuses on the latter application, but the proposed SCGG model can easily be extended to molecular applications as well. In this setting, a partially observed graph is given to the model as the structural condition. The nodes generated by the model and their associated edges are then treated as the recovered missing nodes and the edges connecting them to each other, as well as to the nodes of the partially observed graph.

In summary, we present the following contributions in this work:

  • We introduce SCGG, a conditional graph generation approach, which autoregressively generates graphs based on a given structural condition.
  • The architecture of our SCGG model consists of a graph representation learning network and a recurrent neural network (RNN), where the former is mainly used to take into account the structural condition, and the latter captures the generation history.
  • We use our proposed SCGG model to address the graph completion problem, benefiting from the power and potential of a deep generative model for solving an inherently difficult and complex problem, which, as a result, has been relatively little investigated so far. To the best of our knowledge, this is the first time that a completely deep learning-based model has been designed specifically to tackle this problem.
  • We conduct extensive experiments on both synthetic and real-world datasets to compare the performance of our proposed model against the baselines. The experimental results indicate that the SCGG model outperforms the state-of-the-art baselines in terms of the distance between the generated graphs and the ground-truth ones.

The rest of the paper is organized as follows. In Section 2, we review the previous work related to our research. In Section 3, we introduce the notations used in the paper and define the problem. In Section 4, we explain our proposed SCGG model in detail. Experimental details and results are discussed in Section 5. Finally, in Section 6 we conclude the paper.

2 Related work

In line with what we discussed earlier, our proposed SCGG model is a structure-based conditional graph generation approach, and one of its main applications is graph completion. Therefore, we review the literature in two related areas in the following.

2.1 Graph generation

Graph generation is a field of research seeking to generate new graph structures with certain characteristics; it dates back several decades and is still an active research topic. In contrast to the early methods [26–29], which relied on manually designed procedures to construct graphs with predetermined statistical properties, the more recent ones are data-driven, utilizing the available graph samples in datasets to train models that can more effectively generate new graphs. The latter approaches typically employ different deep learning techniques and generation strategies, and accordingly, they can be classified into several categories [30]. The autoregressive approaches, which adopt step-by-step strategies for generating graphs, are the most relevant methods to our research. DeepGMG [31] is one such example, proposing a repetitive decision-making process to generate graphs gradually. GraphRNN [32] is among the well-known and influential approaches; it first maps each graph into a sequence of nodes and then processes one node per time step using RNNs to model the distribution of the resulting sequences. The method has inspired several subsequent approaches such as MolecularRNN [33], which extends GraphRNN to generate molecular graphs with specific chemical features. Bacciu et al. [34], GraphGen [35], and GHRNN [36], on the other hand, convert graphs to sequences of edges instead of nodes and then model the distributions of these sequences with RNNs. Besides, some other autoregressive methods utilize the attention mechanism to empower their generative models. In this regard, GRAN [37] adds a block of new nodes in each step and employs an attentive message passing mechanism to compute the representations of the graph nodes.

In addition to the methods mentioned above that are more related to our proposed approach, there are other categories of modern graph generation approaches, the most noteworthy of which are autoencoder-based methods [18, 38–42], RL-based approaches [43–45], GAN-based generating strategies [15, 19, 46], and flow-based models [47, 48].

A key point to notice is that regardless of what category these methods fall into and what techniques they employ, an important capability is to consider desired conditions during generation so that the resulting graphs meet the expected characteristics. Hence, the problem of conditional graph generation arises. In this regard, GraphVAE [18] conditions both the encoder and the decoder of its VAE on a label vector for molecular graph generation. CONDGEN [19] adopts a similar approach (i.e., concatenating a condition vector to the VAE latent variable) to incline the model towards generating graphs with desired characteristics. Lim et al. [20] and HierVAE [21] guarantee the existence of intended chemical substructures in the output molecular graphs. CCGG [22] makes the GRAN [37] model class-conditional, allowing it to generate graphs of desired classes. However, despite the efforts that have gone into conditional graph generation, there is still a vital need for approaches that can capture various types of conditions. In this regard, the SCGG model is a generative method designed to handle a particular kind of condition, namely a structural one.

2.2 Graph completion

In many cases, a part of a graph structure is unavailable for various reasons. Hence, it is necessary to reconstruct the missing information before further processing. Most of the methods developed for this purpose try to perform link prediction [49–51], although a more complicated problem arises when the graph nodes are missing. Due to the complexity of addressing this problem, which we refer to as graph completion, few methods have been presented to solve it. Among them, KronEM [52] utilizes a combination of the Expectation-Maximization framework and the Kronecker graphs model to infer the missing nodes and their corresponding edges. SAMI [53] adopts a clustering approach for solving the missing node problem by heavily relying on the existence of missing node indicators, which are often unattainable in real scenarios. Masrour et al. [54] and JCSL [55] utilize side information about the graph nodes to perform network completion; however, this information may not be accessible in all cases. More recently, DeepNC [56] was introduced, which first learns the likelihood of the data by training the GraphRNN [32] model. It then uncovers the missing parts of a graph using a greedy optimization algorithm that aims to maximize the obtained likelihood. Although DeepNC is an innovative approach that has obtained satisfactory results, it is not learning-based for the specific task of graph completion, so it cannot directly learn from the data for this task. In contrast, our proposed method trains an end-to-end model to address this problem. Furthermore, unlike some of the graph completion methods mentioned above, SCGG does not depend on the existence of side information, which may not be available in many situations.

3 Notations and problem definition

In this section, we define the notations used in the paper and present the problem definition. For convenience, we summarize the notations in Table 1.

We denote an initial graph as G0 = (V0, E0), where V0 and E0 are the node and edge sets, respectively, and |V0| = n. Under an ordering πn of these n nodes, we represent the i-th node's links by the following sequence:

ai = (x1, x2, ⋯, xn),   (1)

where xk takes the value of 1 if the i-th node is connected to the k-th node and 0 otherwise.

Considering G0 as the structural condition, the objective of our research is to learn to sample from the conditional probability distribution p(G | G0) in order to generate a graph G = (V, E) that includes G0 as a subgraph, i.e., V0 ⊆ V and E0 ⊆ E. This can be done by first adding a set Vnew of new nodes, with |Vnew| = m and V = V0 ∪ Vnew. Then, to connect the new nodes, the edge set Enew will be generated, where E = E0 ∪ Enew. More specifically, Enew consists of: 1. the inter-connections between the new nodes and those in G0, and 2. the intra-connections between the new nodes themselves. To represent the inter-connections between the new nodes and the i-th node of G0 under the ordering πn, we use the sequence below:

si = (x1, x2, ⋯, xm),   (2)

where we consider a node ordering πm of the m new nodes, and xl is 1 if the i-th node of G0 has a link to the l-th new node and 0 otherwise. Moreover, regarding the intra-connections, we denote the j-th new node's connections to the nodes in Vnew by the following sequence:

sn+j = (x1, x2, ⋯, xm),   (3)

where, similarly to the previous formulas, xp takes the value of 1 if there is a link connecting the j-th and the p-th new nodes (under the ordering πm) and 0 otherwise.
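To make the sequence definitions above concrete, the following minimal Python sketch (an assumed helper, not part of the paper) derives the inter- and intra-connection label vectors of Eqs 2 and 3 from a full binary adjacency matrix, given the ordering of the initial-graph nodes (πn) and of the new nodes (πm).

import numpy as np

def build_link_sequences(adj, initial_order, new_order):
    # Hypothetical helper: adj is the (N, N) symmetric 0/1 adjacency matrix of G,
    # initial_order lists the n nodes of G0 ordered by pi_n, and new_order lists
    # the m new nodes ordered by pi_m.
    inter = [adj[i, new_order] for i in initial_order]  # s_1, ..., s_n          (Eq 2)
    intra = [adj[j, new_order] for j in new_order]      # s_(n+1), ..., s_(n+m)  (Eq 3)
    return np.stack(inter), np.stack(intra)

# Toy example: a path graph 0-1-2-3 in which nodes 2 and 3 are treated as "new".
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
inter, intra = build_link_sequences(A, initial_order=[0, 1], new_order=[2, 3])
print(inter)  # [[0 0] [1 0]]: the second G0 node is linked to the first new node
print(intra)  # [[0 1] [1 0]]: the two new nodes are linked to each other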

4 SCGG: Structure-conditioned graph generator

We approach a specific type of structure-conditioned graph generation that takes an initial substructure and starts to generate new nodes and their associated edges on top of the given conditioning substructure. To this end, we propose the SCGG model, whose architecture is composed of a graph representation learning network and an autoregressive generative model, which is trained end-to-end. In this section, we present the details of the SCGG model. We first elucidate the problem formulation and the model architecture in this regard. Next, we describe the procedure employed to prepare the data for model training. Then, we discuss the training and inference phases and elaborate on the implementation details.

4.1 Formulation

As mentioned in Section 3, in this work we intend to learn to sample from the distribution p(G | G0) to conditionally generate the graph G given an arbitrary initial graph G0. To do so, our SCGG model first estimates this conditional probability distribution and then samples from the resulting estimated distribution. As it is not easy to work directly in the graph space, we reformulate the problem to deal with the following distribution:

p(s1, s2, ⋯, sn+m | G0),   (4)

where s1, ⋯, sn and sn+1, ⋯, sn+m are the notational abbreviations for the inter-connection sequences of Eq 2 and the intra-connection sequences of Eq 3, respectively, and the new problem formulation relates to the original one through the equation below:

p(G | G0) = Σ p(s1, s2, ⋯, sn+m | G0),   (5)

where the summation runs over all sequences (i.e., over all node orderings πn and πm) whose addition on top of G0 results in the graph G.

To further decompose the probability in Eq 4, we follow the chain rule, and therefore this conditional probability can be rewritten as follows:

p(s1, s2, ⋯, sn+m | G0) = ∏i=1..n+m p(si | s1, ⋯, si−1, G0).   (6)

Our proposed SCGG method trains a novel network architecture in an end-to-end manner to model the complex distribution in Eq 6.

4.2 Model architecture

The model architecture of SCGG consists of two main components, namely, a graph representation learning network and an autoregressive generative model (i.e., an RNN). In the following, we explain these components in detail and discuss the role each plays in the task of structure-conditioned graph generation.

4.2.1 Graph feature learning network.

The SCGG method needs appropriate representations of graph nodes beforehand to perform distribution modeling. Therefore, it utilizes a graph representation learning network that employs both a graph convolutional network (GCN) and a Transformer network to learn meaningful node features. Below, we give a brief background of GCNs and Transformers. Furthermore, we elaborate on how each of them contributes to obtaining the final nodes’ features in our model.

  • Graph Convolutional Network (GCN)
    It is often difficult to work directly in the complex and discrete graph space. Therefore, in many cases, obtaining continuous representations of nodes, edges, or the whole graph is necessary before any upcoming tasks. Employing Graph Convolutional Networks addresses this problem. The main idea of GCNs originates from the fact that a node’s representation can be obtained by taking into account the features of its own and its neighbors. This is because the neighbors in a graph (i.e., directly or indirectly connected nodes) usually share common characteristics and information.
    Formally, the layer-wise propagation rule of GCNs can generally be formulated as below:

    Xl+1 = ϕ(Â Xl Wl),   (7)

    where Xl ∈ ℝ^(N×Dl) is the nodes' feature matrix at the l-th GCN layer, N is the number of graph nodes, Dl is the number of features obtained for a node by the previous GCN layer, and X0 is set to be the initial feature matrix given as input to the GCN; Â ∈ ℝ^(N×N) is the adjacency matrix [57, 58] or a variant of it [59, 60]; Wl ∈ ℝ^(Dl×Dl+1) is the learnable parameter matrix of the l-th GCN layer, which maps Dl feature channels to Dl+1 channels; ϕ is a non-linear activation function; and Xl+1 ∈ ℝ^(N×Dl+1) is the output feature matrix produced by the l-th GCN layer.
    Considering this background, our proposed Graph Feature Learning Network first applies L layers of GCN to the input graph. This way, a continuous representation is computed for each graph node based on its neighbors’ information.
  • Transformer network
    In this work, we intend to autoregressively model the distribution in Eq 4, which is conditioned on G0. We do so by feeding the representations of graph nodes one at a time into the RNN. Thus, to perform conditional distribution modeling in this way, it is necessary to learn rich node representations so that all the graph nodes can make their own contribution to computing each node's embedding. In other words, we need the representation of a node not only to contain the information of its close neighbors, but also to include the information of relatively distant nodes that share similar characteristics. However, an L-layer GCN only considers information in L-hop neighborhoods to obtain node representations, even if there are dependencies between farther nodes. Therefore, our proposed Graph Feature Learning Network utilizes a Transformer encoder, which has shown promising results in contextualized representation learning. The following gives a quick overview of its architecture and workflow.
    According to [61], the Transformer encoder layer consists of a multi-head attention block and a feedforward network, each followed by a residual addition and a layer normalization. A multi-head attention block consists of multiple attention heads, each working in a separate subspace to compute new contextualized representations corresponding to different aspects of dependencies between data entities. To be more precise, each attention head takes a feature matrix X as input (in our case, the matrix of node features computed by applying L layers of GCN) and projects it into three matrices Q = XWq, K = XWk, and V = XWv (i.e., query, key, and value, respectively), where Wq, Wk, and Wv are learnable matrices. Then, the attention scores for each query (i.e., each row of Q) are computed by taking the inner product of that query with all the rows of the key matrix K. Using these scores, a new contextualized representation is calculated for each query as a weighted summation of the rows of the value matrix V.

Considering these remarks regarding the GCN and the Transformer, the final nodes’ representations are obtained via concatenating the features computed by each of the two networks. Fig 1 shows an overview of the proposed Graph Feature Learning Network.
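As a rough illustration of this two-part feature extractor, the following PyTorch sketch stacks two GCN layers implementing Eq 7, passes their output through a standard Transformer encoder layer, and concatenates the two sets of features. The class names, layer sizes, and the row-normalized adjacency used here are assumptions for illustration only; they are not the authors' implementation.

import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    # One propagation step of Eq 7: X_{l+1} = phi(A_hat X_l W_l).
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, x):                    # a_hat: (N, N), x: (N, in_dim)
        return torch.relu(self.lin(a_hat @ x))

class GraphFeatureLearningNetwork(nn.Module):
    # Sketch of the GCN + Transformer extractor; the final node feature is the
    # concatenation of the local (GCN) and contextualized (Transformer) features.
    def __init__(self, in_dim=16, hid_dim=16, n_heads=8, dropout=0.1):
        super().__init__()
        self.gcn1 = GCNLayer(in_dim, hid_dim)
        self.gcn2 = GCNLayer(hid_dim, hid_dim)
        self.encoder = nn.TransformerEncoderLayer(
            d_model=hid_dim, nhead=n_heads, dropout=dropout, batch_first=True)

    def forward(self, a_hat, x):
        h = self.gcn2(a_hat, self.gcn1(a_hat, x))    # L-hop neighborhood features
        z = self.encoder(h.unsqueeze(0)).squeeze(0)  # features attending to all nodes
        return torch.cat([h, z], dim=-1)             # (N, 2 * hid_dim)

# Toy usage on a random 5-node graph with self-loops and row-normalized adjacency.
A = torch.eye(5) + (torch.rand(5, 5) < 0.4).float()
A = ((A + A.t()) > 0).float()
a_hat = A / A.sum(dim=1, keepdim=True)
feats = GraphFeatureLearningNetwork()(a_hat, torch.rand(5, 16))
print(feats.shape)  # torch.Size([5, 32])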

Fig 1. An illustration of the graph feature learning network and its workflow.

(a) An input graph. (b) The Graph Convolutional Network. (c) Continuous representations learned for graph nodes by the GCN. (d) The Transformer network takes the node embeddings computed by the GCN as input and outputs new contextualized features of graph nodes. (e) The node features learned by the Transformer network (shown using small squares colored with radial gradients). (f) The final representations of graph nodes acquired by concatenating the embeddings computed by the GCN and the Transformer network. Here, dashed arrows are drawn to easily track what sub-features a final node feature consists of.

https://doi.org/10.1371/journal.pone.0277887.g001

4.2.2 Autoregressive generative model.

As mentioned earlier, we want to model the conditional distribution in Eq 4. To do so, we decompose it into the product of n + m conditional distributions in Eq 6 and then model each of them. Each condition in Eq 6 can be divided into two parts: (a) the initial structural condition, i.e., G0, and (b) the remaining part of the condition, derived from applying the chain rule, which relates to the generation history (i.e., s1, ⋯, si−1). The former is primarily captured by our Graph Feature Learning Network, and the latter is handled using an autoregressive generative model, namely an RNN. More specifically, the embeddings obtained by the Graph Feature Learning Network are fed into the RNN one at a time, and the RNN proceeds step by step. This way, the RNN keeps the generation history such that, at each step, the corresponding hidden state maintains the information of the graph generated up to that point.

4.3 Data preparation

Making the data suitable as input to our SCGG model is a prerequisite for training. Therefore, we perform a data preparation procedure before feeding the data to the model. This procedure includes determining the set of new nodes, identifying the resulting initial graph G0, and applying orderings to these two sets of nodes. An example of the data preparation procedure is illustrated in Fig 2. First, m nodes are randomly selected from the main graph G to form the set of new nodes. The n unselected nodes and the edges connecting them are then treated as the initial graph G0. The reason behind this random node selection is that each subset of n nodes (i.e., the unselected ones) from the original graph has the chance to contribute to the model training as an initial graph. Thus, the model gains the ability to perform structure-conditioned graph generation given an arbitrary graph G0 at test time. Afterwards, orderings are applied to the nodes such that the initial graph nodes are ordered by πn, and the new nodes follow the order specified by πm.
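A minimal sketch of this preparation step, written with networkx and assuming undirected graphs, is given below; the function name and the seeding are illustrative choices, not part of the paper.

import random
import networkx as nx

def prepare_training_instance(G, m, seed=None):
    # Hypothetical preparation step: pick m random nodes as the "new" nodes,
    # treat the remaining induced subgraph as the initial graph G0, and draw
    # uniformly random orderings pi_n and pi_m.
    rng = random.Random(seed)
    nodes = list(G.nodes())
    new_nodes = rng.sample(nodes, m)                       # nodes to be re-generated
    initial_nodes = [v for v in nodes if v not in new_nodes]
    G0 = G.subgraph(initial_nodes).copy()                  # unselected nodes and their edges
    pi_n = rng.sample(initial_nodes, len(initial_nodes))   # random ordering of G0's nodes
    pi_m = rng.sample(new_nodes, len(new_nodes))           # random ordering of the new nodes
    return G0, pi_n, pi_m

G0, pi_n, pi_m = prepare_training_instance(nx.grid_2d_graph(3, 3), m=2, seed=0)
print(G0.number_of_nodes(), pi_m)  # 7 remaining nodes and the ordering of the 2 new ones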

Fig 2. An illustration of the procedure of preparing the training data.

(a) An input graph. (b) A number of m nodes are selected randomly to be further treated as the new nodes. In this picture, m = 2, and the selected nodes (i.e., the green and the purple ones) are shown with thick borders. Furthermore, the inter-connections between new nodes and those in G0 are depicted by blue lines, and the only intra-connection between the new nodes is shown using a red line. (c) An ordering πn is applied to the nodes in G0. Moreover, another node ordering, denoted by πm, is applied to the new nodes.

https://doi.org/10.1371/journal.pone.0277887.g002

4.4 Training

To train the SCGG model, we first give it two versions of each graph G. The first version corresponds to the initial graph G0. The second version, which we denote by G′, is obtained by removing the intra-connections between pairs of new nodes. The Graph Feature Learning Network takes these two graphs as inputs and separately calculates nodes' representations for each of them, as formulated below:

R = [r1, r2, ⋯, rn] = femb(G0),   (8)
R′ = [r′1, r′2, ⋯, r′n+m] = femb(G′).   (9)

Next, a subset of the computed representations is fed into the RNN one by one in the order specified by πn and πm. More precisely, the RNN first takes the representations of G0's nodes computed from the first version of the graph. Then, it receives as input the representations of the new nodes obtained by feeding G′ into the Graph Feature Learning Network. To put it another way, the final representations to be fed into the RNN are as follows:

[r1, ⋯, rn, r′n+1, ⋯, r′n+m].   (10)

The reason for this is that, at test time, we only have access to an initial graph G0 and know nothing about how the new nodes are connected to each other or to the rest of the graph. However, as the RNN proceeds, it predicts the inter-connections between the new nodes and the nodes of G0. Thus, when the RNN finishes processing the last node of G0, all inter-connections have been predicted, and G′ can be constructed on top of G0. At this point, it is time to complete the graph structure by predicting the intra-links between the new nodes. This requires a proper representation for each new node, which can be obtained from the most complete available version of the graph structure, i.e., G′.

Moreover, each cell of the RNN takes as its second input the ground-truth labels of the previous cell. Therefore, the input for the i-th RNN cell is obtained as follows:

xi = Concat(ri, si−1),   (11)

where ri is the representation of the i-th node and si−1 is the vector of ground-truth labels determining whether the (i − 1)-th node has a link to each of the new nodes or not. Next, by considering both the current input xi and the previous hidden state hi−1, the RNN outputs probabilities regarding the link existence between the current node and each new node. This is done using two functions fRNN and fout according to the following formulations:

hi = fRNN(xi, hi−1),   (12)
ϕi = fout(hi),   (13)

where ϕi is the i-th step probabilistic output. Furthermore, the step loss Li is a binary cross entropy (BCE) between the predicted outputs and the ground-truth labels, which is formulated in the equation below:

Li = BCE(ϕi, si) = −Σl=1..m [si,l log ϕi,l + (1 − si,l) log(1 − ϕi,l)].   (14)
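A hedged PyTorch sketch of a single teacher-forced training step corresponding to Eqs 11–14 is shown below. The feature size, the use of a GRU, and the two-layer MLP mirror the implementation details of Section 4.6, but all module names and tensor shapes are assumptions made for illustration.

import torch
import torch.nn as nn

feat_dim, m, hidden = 32, 2, 128                    # assumed sizes
f_rnn = nn.GRU(input_size=feat_dim + m, hidden_size=hidden,
               num_layers=4, batch_first=True)
f_out = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                      nn.Linear(64, m), nn.Sigmoid())
bce = nn.BCELoss()

def training_step(r_i, s_prev, s_true, h_prev):
    # One step of Eqs 11-14 with teacher forcing: r_i is the current node's
    # representation, s_prev the previous node's ground-truth link labels.
    x_i = torch.cat([r_i, s_prev], dim=-1).view(1, 1, -1)  # Eq 11
    out, h_i = f_rnn(x_i, h_prev)                           # Eq 12
    phi_i = f_out(out.view(-1))                             # Eq 13
    return bce(phi_i, s_true), phi_i, h_i                   # Eq 14

h0 = torch.zeros(4, 1, hidden)                      # (num_layers, batch, hidden)
loss, phi, h1 = training_step(torch.rand(feat_dim),
                              torch.zeros(m),       # s_0 = sos
                              torch.tensor([1.0, 0.0]), h0)
print(loss.item(), phi.shape)                       # scalar step loss, torch.Size([2])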

The whole network, including the Graph Feature Learning Network and the RNN, is trained in an end-to-end manner. Algorithm 1 summarizes the training procedure of our SCGG model.

Algorithm 1 Training Algorithm of SCGG Model

Input: Dataset of training graphs D, number of new nodes m

Output: Learned functions femb, fRNN, and fout

1: for each graph G ∈ D do

2:  Build G0 and G′ from the graph G

3: end for

4: for number of training iterations do

5:  for each graph G ∈ D do

6:   R = [r1, r2, ⋯, rn] = femb(G0)

7:   R′ = [r′1, r′2, ⋯, r′n+m] = femb(G′)

8:   Obtain the ground-truth link labels [s1, s2, ⋯, sn+m] from G (Eqs 2 and 3)

9:   s0 = sos; Initialize h0; L = 0

10:   for i from 1 to n + m do

11:    xi = Concat(ri, si−1) ⊳ ri is taken from R for i ≤ n and from R′ for i > n

12:    hi = fRNN(xi, hi−1)

13:    ϕi = fout(hi)

14:    L = L + BCE(ϕi, si)

15:   end for

16:   L = L / (n + m)

17:   Update model parameters by performing backpropagation to minimize the loss function L

18:  end for

19: end for

An example showing the SCGG model at training time is presented in Figs 3 and 4, where the graph of Fig 2 is used as training data. First, the representations of the nodes in both G0 and G′ are computed by the Graph Feature Learning Network, as illustrated in Fig 3. Then, the obtained representations of G0's nodes (see the left half of Fig 3(d)) are given to the RNN in the order specified by πn. Accordingly, as depicted in Fig 4, in the first RNN step, it is the turn of node 1 (indicated by a yellow circle) to be processed, and thus its features are passed on to the first recurrent unit. The network then estimates the conditional probability distribution p(s1 | G0), i.e., the probability of connecting the yellow node to each of the new nodes (the green and the purple ones). Afterwards, the step loss is calculated by taking the network output and the true labels (the first label is 1 because the yellow and the purple nodes are connected, and the second label is 0 because there is no edge between the yellow and the green nodes). In the second step, the second node's features (indicated by orange color), along with the true labels of the previous (yellow) node and the previous hidden state, are given to the recurrent cell. The network then outputs an estimation of p(s2 | s1, G0). The same procedure continues until all nodes of G, including the ones in G0 and the set of new nodes, are fed into the network. Thus, in the third step, the network outputs the probability p(s3 | s1, s2, G0) by taking into account the features of the third (pink) node in graph G0. In the subsequent step, when all the initial graph's nodes have been processed, it is time to go through the new nodes in the order specified by πm. Thus, the features computed for the first new node (displayed in purple in the right half of Fig 3(d)) are given to the RNN to generate the probability p(s4 | s1, ⋯, s3, G0). Next, in the fifth step, the second new node's features (indicated in green) are fed into the recurrent network to produce the probability distribution p(s5 | s1, ⋯, s4, G0).

Fig 3. An overview of the workflow employed to obtain the required nodes’ features in the training phase.

(a) An input training graph after applying the preparation procedure shown in Fig 2. (b) Two versions are made from the main graph. The one on the left will be treated as the initial graph (i.e., G0), and the graph on the right, which we denote in the paper by G′, is obtained from the original graph by removing the intra-connection between the new nodes, i.e., the red link. (c) The Graph Feature Learning Network, whose architecture is illustrated in detail in Fig 1. (d) The features computed for each node of the graphs. The ones around which blue dashed ovals are drawn will be further used by the RNN.

https://doi.org/10.1371/journal.pone.0277887.g003

Fig 4. An example of the SCGG model at training time.

For each graph node, including those in the initial graph G0 and those in the set of new nodes, the model outputs a probability distribution of link existence between that node and each new node (grey squares depict the probabilistic outputs, and the darker the colors, the higher the probabilities). To do this, at each step, a recurrent unit takes the features computed for one of the graph nodes (see Fig 3), as well as the previous node's true connections and the hidden state of the previous recurrent unit. In this regard, the nodes of G0 (ordered by πn) are first fed into the model, followed by the new nodes (ordered by πm). Thus, the model learns to first generate the inter-links between the new nodes and those of G0, and then predict the intra-links between the new nodes. The parameters of both the Graph Feature Learning Network and the RNN are updated by minimizing the total loss L, which is obtained by aggregating the step losses Li.

https://doi.org/10.1371/journal.pone.0277887.g004

To elaborate a bit more on Fig 4, it is worth mentioning that each step’s hidden state contains the information of a subgraph of the main graph (i.e., G). This subgraph includes the already processed graph nodes, the links connecting them, and their connections to each new node. It also includes links between the current node and the previous ones. For example, in Fig 4, in the third training step, two nodes (i.e., the yellow and the orange ones) have been processed and the pink node’s features are fed into the recurrent unit as part of its input. Hence, the hidden state h3 maintains a subgraph containing the link between the yellow and the orange nodes as well as the links between these nodes and the new nodes (shown by blue lines). It also retains the links between the current (pink) node and both the yellow and orange ones that have been fed into the network in the first two steps.

4.5 Inference

In the inference stage, an initial graph G0 is given as the structural condition. Then, using the learned functions femb, fRNN, and fout, the model starts generating graph G by adding new nodes to G0 and predicting the inter-links between the new nodes and those of G0, as well as the intra-links between the new nodes themselves. Algorithm 2 describes the steps of the SCGG model at inference time. Moreover, Fig 5 illustrates the inference workflow of the SCGG by a toy example.

Algorithm 2 Inference Algorithm of SCGG Model

Input: femb, fRNN, fout, m, G0

Output: G

1: R = [r1, r2, ⋯, rn] = femb(G0)

2: s0 = sos; Initialize h0

3: for i from 1 to n do

4:  xi = Concat(ri, si−1)

5:  hi = fRNN(xi, hi−1)

6:  ϕi = fout(hi)

7:  si ∼ ϕi ⊳ Sample the inter-connections between the i-th node of G0 and the set of new nodes

8: end for

9: Construct graph G′ on top of G0 using the sampled links [s1, s2, ⋯, sn]

10: R′ = [r′1, r′2, ⋯, r′n+m] = femb(G′)

11: for j from n + 1 to n + m do

12:  xj = Concat(r′j, sj−1)

13:  hj = fRNN(xj, hj−1)

14:  ϕj = fout(hj)

15:  sj ∼ ϕj ⊳ Sample the intra-connections between the (j − n)-th new node and each of the new nodes

16: end for

17: Construct graph G on top of G′ using the sampled links [sn+1, sn+2, ⋯, sn+m]
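The following self-contained PyTorch sketch condenses the two phases of Algorithm 2. Random tensors stand in for femb(G0) and femb(G′), and torch.bernoulli is used to sample the 0/1 link labels from the predicted probabilities; the sizes and module definitions are assumptions for illustration.

import torch
import torch.nn as nn

torch.manual_seed(0)
feat_dim, m, n, hidden = 32, 2, 4, 128                    # assumed sizes
f_rnn = nn.GRU(feat_dim + m, hidden, num_layers=4, batch_first=True)
f_out = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, m), nn.Sigmoid())

def rnn_step(r, s_prev, h):
    x = torch.cat([r, s_prev]).view(1, 1, -1)             # Concat(r, s_prev)
    out, h = f_rnn(x, h)
    return f_out(out.view(-1)), h

# Phase 1 (cf. Algorithm 2, lines 3-8): sample the inter-links for each node of G0.
g0_feats = torch.rand(n, feat_dim)                        # stand-in for femb(G0)
h, s_prev, inter = None, torch.zeros(m), []
for i in range(n):
    phi, h = rnn_step(g0_feats[i], s_prev, h)
    s_prev = torch.bernoulli(phi)                         # s_i ~ phi_i
    inter.append(s_prev)

# Phase 2 (cf. lines 11-16): after constructing G' and re-embedding its nodes,
# sample the intra-links between the m new nodes.
new_feats = torch.rand(m, feat_dim)                       # stand-in for femb(G')
intra = []
for j in range(m):
    phi, h = rnn_step(new_feats[j], s_prev, h)
    s_prev = torch.bernoulli(phi)
    intra.append(s_prev)

print(torch.stack(inter), torch.stack(intra))             # sampled 0/1 link labels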

Fig 5. An example illustrating the SCGG model at inference time.

In this example, m = 3 and a graph G0 consisting of two nodes is given to the model as the structural condition. At first, the Graph Feature Learning Network computes representations for the G0’s nodes, which are then used as part of the RNN input. Next, the RNN proceeds for two steps and outputs the probabilities of the inter-connections between these two nodes and each new node. Therefore, all the inter-links are generated by sampling from the produced probabilities. At this point, it is time to construct graph G′ based on the G0 and the generated links. Next, G′ is passed into the Graph Feature Learning Network to calculate the representations of its nodes. In this step, the representations of the new nodes are given to the RNN one by one to generate the intra-connections. Finally, the graph G is constructed on top of the G′ by considering the generated intra-links.

https://doi.org/10.1371/journal.pone.0277887.g005

4.6 Implementation details

The proposed model is implemented using the PyTorch library [62]. As previously discussed, the function femb consists of a graph convolutional network (GCN) and a Transformer network. In this regard, we use a two-layer GCN with an embedding size of 16 per layer. A ReLU activation followed by a batch normalization layer is applied between the two GCN layers. Besides, our Transformer has one encoder layer with 8 attention heads and a dropout of 0.1. We use 4 layers of GRU cells with a 128-dimensional hidden state to implement the function fRNN. For the function fout, a two-layer multilayer perceptron (MLP) is employed with 64 hidden units in the middle and a ReLU nonlinearity between the layers. Further, the Adam optimizer is used with a learning rate of 0.003, and the model is trained for 100 epochs with a minibatch size of 32. Moreover, for the choice of πn and πm, we use uniform random orderings to maximize an approximation of the marginal likelihood in Eq 5, which becomes intractable to compute exactly as the size of graphs increases. For details regarding hyperparameter tuning, please refer to S1 File.
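For reference, the hyperparameters listed above could be assembled roughly as in the following PyTorch snippet; the input feature size, the value of m, and the way the GCN weights are stored are assumptions, and the snippet only shows module construction, not the forward pass.

import torch.nn as nn
import torch.optim as optim

in_dim, emb, m = 16, 16, 10               # in_dim and m are assumed for illustration

gcn_weights = nn.ModuleList([nn.Linear(in_dim, emb, bias=False),
                             nn.Linear(emb, emb, bias=False)])    # two GCN layers, size 16
inter_layer = nn.Sequential(nn.ReLU(), nn.BatchNorm1d(emb))       # between the GCN layers
transformer = nn.TransformerEncoderLayer(d_model=emb, nhead=8,
                                          dropout=0.1, batch_first=True)
f_rnn = nn.GRU(input_size=2 * emb + m, hidden_size=128,
               num_layers=4, batch_first=True)                    # 4 GRU layers, 128-d state
f_out = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                      nn.Linear(64, m), nn.Sigmoid())             # two-layer MLP head

params = (list(gcn_weights.parameters()) + list(inter_layer.parameters()) +
          list(transformer.parameters()) + list(f_rnn.parameters()) +
          list(f_out.parameters()))
optimizer = optim.Adam(params, lr=0.003)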

5 Experiments

In this section, we first elaborate on both the synthetic and the real-world datasets we used for evaluation purposes. Then, we outline the state-of-the-art baselines with which we compare our SCGG model. Next, the evaluation metric is explained, followed by describing the experimental setup. Finally, we discuss the results of our proposed approach, as well as the ones of the competitor methods.

5.1 Datasets

We evaluate the performance of our proposed method on a variety of synthetic and real-world datasets. In the following, we provide a brief description of each dataset. Moreover, Table 2 summarizes their key statistics.

  • Grid: It is a synthetic dataset consisting of standard 2D grid graphs.
  • IMDB-BINARY: This dataset consists of ego-networks derived from actor/actress collaborations based on the information of movies belonging to the Action and Romance genres on IMDB. For each graph, nodes represent actors/actresses, and if a pair of them appears in the same movie, a link connects their corresponding nodes in the graph.
  • IMDB-MULTI: The same explanation given for the IMDB-BINARY dataset is valid for this dataset as well, except that the movies belong to the Comedy, Romance, and Sci-Fi genres.
  • Enzymes: This dataset consists of graphs each representing a protein tertiary structure from the BRENDA enzyme database [63]. More precisely, a graph’s nodes represent secondary structure elements (SSEs). An edge connects two nodes if their corresponding SSEs are neighbors along the amino acid sequence or one of the three nearest neighbors in space.
  • NCI1: It is a biological graph dataset published by the National Cancer Institute (NCI). Each graph in the dataset represents a chemical compound screened for its activity against the growth of human tumors.
  • Protein: This dataset contains protein graphs [64]. Each graph represents a protein with nodes corresponding to amino acids. If the distance between two amino acids of a protein is less than 6 Angstroms, their corresponding nodes are connected in the graph.

5.2 State-of-the-art approaches

We compare our approach with several well-known state-of-the-art methods, explanations of which are provided in the following.

  • KronEM [52]. This is an old and well-known network completion method that combines the Expectation-Maximization (EM) framework with the Kronecker graphs model [65] to infer missing nodes and their corresponding edges in partially observed graphs. To do this, in each EM iteration, the method first utilizes the observed part of a graph to estimate model parameters (the M-step). Then it infers the missing part of that graph using the estimated model (the E-step).
  • GraphRNN-S [32]. This is a well-known autoregressive deep graph generator that first transforms graphs into sequences and then models the corresponding data distribution using RNNs. At each step, the method adds a new node to the currently generated graph and predicts the links connecting it to the previous nodes. Aside from that, GraphRNN-S makes the simplifying assumption that a node's links are independent of each other and therefore models them with a multi-layer perceptron.
  • GraphRNN [32]. This is the full GraphRNN model, which is relatively similar to GraphRNN-S, with the difference that it does not take into account the edge independence as a simplifying assumption. Therefore, to capture the interdependencies between a node’s edges, it employs another recurrent neural network called the edge-level RNN.
  • DeepNC [56]. This is the most recent graph completion baseline; it utilizes a deep generative model of graphs, namely GraphRNN-S, to infer the missing parts of a partially observable network. To this end, the method first learns a likelihood over the data by training the GraphRNN-S model. Then, it proposes a sequence of algorithmic steps to recover the network in a greedy fashion, trying to maximize the learned likelihood. It should be noted that although this method uses the probabilities generated by a deep generative model of graphs to make algorithmic decisions, it is not a fully deep learning-based approach; a model specifically trained to address the problem of graph completion can be expected to achieve higher performance.
  • EvoGraph [66]. This is a graph upscaling method, which expands an initial input graph G0 = (V0, E0) in K stages by adding |E0| new edges at each stage. The method considers a set of candidate new nodes in every expansion phase and adds each new edge by choosing one of its endpoints from the current nodes and the other from the candidate ones. To provide a fair comparison between EvoGraph and other methods, we make a slight change to its upscaling process by terminating it right after the insertion of the m-th new node.

5.3 Comparison of time complexities

In this subsection, having introduced the baselines against which we compare our model, we analyze the computational complexity of the competing approaches. To begin, we examine the complexity of our own model. As discussed earlier, SCGG predicts m probabilities at each inference step to determine how the current node should connect to each new node. As there are n + m inference steps, the total complexity of SCGG is O(m(n + m)). In light of the fact that m ≪ n, which DeepNC also assumes, the complexity of SCGG can be rewritten as O(n). For the DeepNC approach, it is proved in [56] that the lower and upper bounds of its computational complexity are Ω(n) and O(n²), respectively, which shows that DeepNC can be computationally more expensive than SCGG.

The complexity of both GraphRNN-S and GraphRNN, as stated by their authors, is O((n + m)M), where M is the maximum number of previously added nodes to which a new node can connect under the BFS ordering these methods use; this can be simplified to O(nM) when considering m ≪ n. Therefore, depending on the choice of M, which is typically estimated by an empirical upper bound, we can say that the complexity of SCGG is less than or equal to that of GraphRNN-S and GraphRNN.

According to what was mentioned in the previous subsection, the slightly modified version of EvoGraph can generate at most n(m − 1) edges before the insertion of the m-th new node, which connects nodes in the initial graph to the new ones. Considering that each edge insertion operation is of O(1) complexity, the total computational complexity of EvoGraph is O(n(m − 1)), which, as discussed earlier, can be simplified to O(n). Therefore, this method is of comparable computational complexity to our SCGG model.

Finally, the runtime of KronEM, as stated by its authors, scales linearly with the number of edges, i.e., as O(|E|), with |E| representing the number of edges in the final graph G. Assuming that |E| is at least of order n log n, which is the case in many graphs, including the ones in our datasets (please see S2 File for more details), the complexity of KronEM is at least O(n log n). As a result, KronEM is more computationally expensive than our approach.

5.4 Evaluation metric

Similar to [56], we use Graph Edit Distance (GED) [67] as the evaluation metric to assess the performance of our SCGG method and the baselines. In this regard, if we denote a generated or completed graph by Ĝ and its corresponding ground-truth graph by G, the GED between these two graphs, which shows how dissimilar they are, can be formulated as follows:

GED(Ĝ, G) = min_{(e1, ⋯, ek) ∈ Υ(Ĝ, G)} Σi=1..k c(ei),   (15)

where Υ(Ĝ, G) is the set of all edit paths converting Ĝ to a graph that is isomorphic to G. Moreover, c(ei) is the cost of an edit operation ei, which, in the same way as [56], we set to 1 for all operations. Additionally, as in [56], we normalize the GED computed for each pair of graphs by the average of their sizes.

Along with our brief overview of GED, one important point to note is that enumerating all the discussed edit paths requires employing a combinatorial search procedure with exponential time complexity, and therefore the exact solution to this problem is NP-complete [68]. Hence, we utilize an approximation approach [69] for computing GED scores.
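For readers who wish to reproduce a comparable score, networkx ships an anytime GED approximation that can serve as a stand-in for the approximation method of [69]; the sketch below normalizes the approximate GED by the average graph size as described above. The function name and the iteration cap are illustrative assumptions, and the numbers it produces may differ from those reported in the paper.

import networkx as nx

def normalized_approx_ged(g_pred, g_true, max_iterations=3):
    # Approximate GED normalized by the average size of the two graphs.
    approx = None
    for i, cost in enumerate(nx.optimize_graph_edit_distance(g_pred, g_true)):
        approx = cost                              # each iterate is a tighter upper bound
        if i + 1 >= max_iterations:
            break
    avg_size = (g_pred.number_of_nodes() + g_true.number_of_nodes()) / 2
    return approx / avg_size

g_true = nx.cycle_graph(6)
g_pred = nx.path_graph(6)                          # one edge missing from the cycle
print(normalized_approx_ged(g_pred, g_true))       # about 0.17 once the single edit is found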

5.5 Experimental setup

In addition to what we have explained in Section 4.6 concerning the details of implementing our SCGG model, in this subsection, we elaborate on the remainder of the details regarding the experimental setup. In this respect, we select a random subset of 80% of the graphs in each dataset to train our model. A similar approach is also followed to train other learning-based baselines (i.e., GraphRNN-S and GraphRNN). We then make use of the remaining 20% of graphs for model testing. More specifically, for each graph G in the test set, we perform the following two steps for 10 iterations:

  • We randomly choose m nodes from the original test graph G and remove these nodes and their associated edges to acquire a subgraph G0.
  • We then feed the obtained subgraph to all the competing methods and compare their results to the ground truth graph G.

An illustrative example depicting our SCGG model at the evaluation time is provided in S3 File.

Afterwards, for each graph in the test data, we average the GED scores calculated in 10 iterations and compute their standard deviation. Finally, for each parameter value m, we report the average GED scores, as well as the average standard deviations computed over the whole test set.
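The evaluation protocol above can be summarized by the short Python sketch below. Here, complete_fn (the model under test) and ged_fn (the normalized GED of Section 5.4) are placeholders that must be supplied by the reader; everything else follows the described 10-iteration node-removal procedure.

import random
import statistics

def evaluate(test_graphs, complete_fn, ged_fn, m=10, iterations=10, seed=0):
    # For every test graph, repeatedly remove m random nodes, let the model
    # complete the remaining subgraph, and record the normalized GED scores.
    rng = random.Random(seed)
    per_graph_mean, per_graph_std = [], []
    for G in test_graphs:
        scores = []
        for _ in range(iterations):
            removed = rng.sample(list(G.nodes()), m)
            G0 = G.copy()
            G0.remove_nodes_from(removed)          # the partially observed graph
            scores.append(ged_fn(complete_fn(G0, m), G))
        per_graph_mean.append(statistics.mean(scores))
        per_graph_std.append(statistics.stdev(scores))
    # Report the averages over the whole test set, as in the paper's tables.
    return statistics.mean(per_graph_mean), statistics.mean(per_graph_std)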

5.6 Results and discussion

In this subsection, the experiments conducted to evaluate the performance of our proposed method against the baselines are presented in three parts. In the first part, we set the maximum possible value for the parameter m such that the competing methods can be evaluated on all datasets. Then, we compare the obtained results and report the gain of SCGG over the baselines. In the second part, we discretely change the value of m from the lowest to the highest possible amount so that all datasets can be utilized for model testing. Then we study how the performances of various methods are affected by increasing the value of m. Finally, in the third part, we raise m to much higher values and evaluate the efficacy of all approaches on the dataset that offers this possibility.

We first analyze the performance of different methods for the case where m = 10. The reason for choosing this value for m is that, as outlined in Table 2, the minimum number of nodes among graphs of all datasets is 11. Hence, to construct initial graphs G0, a maximum of 10 nodes can be removed from the original graphs. We report the obtained results in Table 3, from which it is evident that for all datasets, SCGG is the best-performing method in terms of the lowest average GED score. More precisely, SCGG obtains an average gain of 51.74% over other approaches based on the experiments conducted on all datasets, with the lowest gain value of 2.65% and the highest gain of 88.15%. Furthermore, in most cases, the standard deviations of our results are less than those of the baselines.

Table 3. Comparison of SCGG with its competitors for m = 10 in terms of GED (Avg. ± Std.).

https://doi.org/10.1371/journal.pone.0277887.t003

Besides, the results in Table 3 reveal that KronEM does not perform well in general: unlike the other methods, its average GED never falls below 0.52. There can be several reasons for this. First, unlike SCGG, GraphRNN-S, GraphRNN, and to some extent DeepNC, this method is not trained on a dataset of graphs; rather, it processes each graph in the test set separately, i.e., it completes the structure of each partially observed graph based solely on the available part of it. Another reason for the underperformance of KronEM might be the fact that the Kronecker graphs model generates graphs with 2^x nodes for some integer x. Therefore, when an initial graph G0 is given to KronEM, it increases the number of nodes to the nearest power of 2. This can lead to a significant difference between the ground truth and the completed graph regarding the number of nodes, thereby raising the GED score.

In addition to what we have discussed so far regarding the results in Table 3, they also indicate that EvoGraph considerably underperforms on the IMDB-BINARY and IMDB-MULTI datasets. This is because the upscaling process of EvoGraph tends to establish connections with new nodes that have not yet been linked to the graph. In other words, adding new edges is performed with a high priority on connecting new nodes to the already generated graph, meaning that setting up more connections between the previously added nodes and the nodes of the initial graph G0 is carried out with a relatively low priority. Thus, it is not surprising that the graphs produced by EvoGraph generally contain fewer edges than the ones belonging to the IMDB-BINARY or IMDB-MULTI datasets, which, according to the statistics listed in Table 2, are relatively dense. In light of this, we can expect a decrease in the performance of EvoGraph on these two datasets.

In the second part of the experiments, we vary the value of m discretely from 1 to 10 and study the performance of the different methods as a function of the parameter m. In this regard, Figs 6–11 demonstrate the obtained results on the Grid, IMDB-BINARY, IMDB-MULTI, Enzymes, NCI1, and Protein datasets, respectively. Moreover, since some of the results visually overlap, which may affect their readability, we provide the readers with another view of them: a pairwise comparison between our SCGG approach and each of the baselines is depicted in a separate subplot for all datasets. Accordingly, this second presentation of the results in Figs 6–11 can be found in S1–S6 Figs, respectively. In the following, we discuss the results obtained on each dataset.

Fig 6. Performance comparison on the Grid dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., m).

https://doi.org/10.1371/journal.pone.0277887.g006

Fig 7. Performance comparison on the IMDB-BINARY dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., m).

https://doi.org/10.1371/journal.pone.0277887.g007

Fig 8. Performance comparison on the IMDB-MULTI dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., m).

https://doi.org/10.1371/journal.pone.0277887.g008

Fig 9. Performance comparison on the Enzymes dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., m).

https://doi.org/10.1371/journal.pone.0277887.g009

Fig 10. Performance comparison on the NCI1 dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., m).

https://doi.org/10.1371/journal.pone.0277887.g010

Fig 11. Performance comparison on the Protein dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., m).

https://doi.org/10.1371/journal.pone.0277887.g011

Fig 6 shows the effect of increasing the value of m on the performance of various methods on the Grid dataset. According to these results, the GED values of most methods (i.e., SCGG, GraphRNN-S, GraphRNN, and EvoGraph) increase almost uniformly with the growth of m, which makes sense since as m increases, the task becomes more difficult. A noteworthy point here is that our proposed SCGG approach performs the best (lowest GED score). In addition, as m gets higher values, the GED of our approach increases with a lower slope. This figure also demonstrates the poor performance of KronEM (both in terms of the relatively high average GED score and the high standard deviations), which is in accordance with what we discussed before. The results also indicate that DeepNC underperforms on the Grid dataset. This may be because DeepNC, unlike other competitors, does not conduct its processing steps by taking into account the whole initial graph G0 at once. To put it another way, other methods receive an initial graph G0 and start adding new nodes on top of it. Meanwhile, DeepNC starts constructing the graph from scratch, and at each stage, it randomly decides whether to choose the next node from the set of initial graph nodes or add a new one. Therefore, since the graphs of the Grid dataset follow a highly regular structural pattern, not considering complete information of initial graphs at once before processing can lead to the performance drop of DeepNC by constructing graphs that are substantially different from the expected ones.

Figs 7 and 8 show the results obtained on the IMDB-BINARY and IMDB-MULTI datasets, respectively. They reveal that for all values of m, SCGG outperforms the baselines. It is also evident from these results that EvoGraph achieves the worst performance among the competitors. This, as explained earlier, can be attributed to the tendency of EvoGraph to complete graph structures by adding only a small number of edges to G0, which is at odds with the density of the graphs belonging to these two datasets.

The results on the Enzymes and NCI1 datasets are depicted in Figs 9 and 10, respectively. Since these two datasets share relatively similar statistical properties, as listed in Table 2, somewhat similar results are observed on them. In this regard, our SCGG approach achieves the best performance compared to the other methods; specifically, it offers the lowest average GED score in almost all cases. Moreover, in the vast majority of circumstances, the standard deviations of the results obtained by our method are lower than those of the other approaches. These results also demonstrate that GraphRNN-S and GraphRNN perform the worst as the value of m increases. This is because these two are general graph generation approaches, which are not specifically designed to solve problems such as structure-conditioned graph generation or graph completion. Therefore, although they achieve acceptable performance in some cases, it is not surprising that in others they perform poorly compared to the remaining methods.

Finally, Fig 11 depicts the results on the Protein dataset, in which the value of m varies discretely from 1 to 10. The results indicate that the SCGG method obtains a lower GED than the baseline methods in almost all cases, and as the value of m goes up, this performance superiority more clearly manifests itself. In addition, the weak performance of KronEM can be seen in these results, the reasons for which have been discussed in detail previously.

In the third part of the experiments, we study the performance of all competing approaches in the case where a much larger number of nodes is to be added to the initial graphs G0. Accordingly, we conduct the experiments on the Protein dataset, which offers this opportunity due to the large size of its graphs. More precisely, we increase the value of the parameter m from 10 to 90 (i.e., the maximum possible value that does not exceed the minimum number of nodes in this dataset) in steps of 10. The results of these experiments are illustrated in Fig 12. As we can see, our method achieves the best results in terms of the lowest GED score for all values of m. Furthermore, in most cases, and especially as m takes higher values, our results show smaller standard deviations than those of the other approaches. We can also observe that for higher values of m, for which both graph completion and structure-conditioned graph generation become much more challenging, the performances of GraphRNN-S, GraphRNN, and EvoGraph deteriorate rapidly. This can be explained by the fact that these approaches are not particularly designed to address such tasks. Conversely, as the parameter m rises to its highest values, SCGG, DeepNC, and KronEM offer the best results, in that order.

Fig 12. Performance comparison on the Protein dataset in terms of GED (lower is better) as a function of the parameter m, which varies discretely from 10 to 90 in steps of 10.

https://doi.org/10.1371/journal.pone.0277887.g012

Another perspective of the results in Fig 12 can be found in S7 Fig, providing the readers with a pairwise comparison of our SCGG model and each of the baselines.

As shown by the experimental results, our SCGG method outperforms its competitors in almost all cases. At the end of this subsection, we summarize the reasons for its better performance. In our opinion, this is due to the fact that our SCGG model is the first deep learning-based generative approach to exclusively address structure-conditioned graph generation, one of whose main applications is solving the graph completion problem. We have compared the performance of our proposed method with the best and most well-known baselines closely related to our research. These baselines can broadly be classified into three categories. The first category is modern graph generation approaches. GraphRNN-S and GraphRNN fall into this category. Despite the impressive results achieved by these two methods in terms of distribution-related metrics for generating graphs in general, our method outperforms them for the task of structure-conditioned graph generation. This is primarily because the SCGG imposes structural conditions on the generation procedure mainly through its Graph Feature Learning Network, whereas GraphRNN-S and GraphRNN do not include such mechanisms. The second category of baselines relates to graph completion methods. The KronEM and DeepNC approaches fall under this category, where KronEM processes each graph separately. In more precise terms, KronEM predicts the missing nodes of a graph solely based on the observed parts of that graph. Contrary to this, our approach is based on learning from graph datasets, enabling it to solve the graph completion problem more efficiently. Another limitation significantly affecting the performance of KronEM is the number of nodes in its resulting graphs, which needs to be a power of two. The second graph completion baseline with which we have compared our model is the recently proposed DeepNC approach. This method, as mentioned earlier, first trains a well-known graph generative model (i.e., the GraphRNN-S) and then proposes a set of algorithms to complete partially observed graphs in a manner that maximizes the likelihood of the trained model. Despite the innovative algorithms and promising results of DeepNC against its competitors, it is not a learning-based approach specifically trained for graph completion, which could be the key reason for its inferior performance compared with SCGG. Lastly, we have compared our method to EvoGraph, which falls under the category of graph upscaling methods. As with the DeepNC approach, this method adopts an algorithmic strategy without being trained on graph datasets, which results in its poorer performance than the SCGG method.

6 Conclusions

In this work, we have presented SCGG, a novel structure-conditioned graph generation approach that autoregressively generates a graph by adding new nodes and their corresponding edges on top of a given initial substructure G0. Specifically, the architecture of our model consists of a dedicated graph representation learning network, which is chiefly responsible for incorporating the conditioning substructure, and an autoregressive generative model (i.e., a recurrent neural network) that mainly maintains the generation history. We have then employed this model to address the intrinsically hard problem of network completion, in which the goal is to complete the structure of a partially observed graph, some of whose nodes are unknown. To demonstrate the superiority of our proposed SCGG model, we have conducted extensive experiments on both synthetic and real-world datasets and compared the performance of our method against state-of-the-art baselines on the task of graph completion. The experimental results show that SCGG outperforms the baselines in terms of the GED score, indicating that the graphs generated by our model are, on average, the closest to the ground-truth graphs. To the best of our knowledge, this is the first time the graph completion problem has been addressed by a fully deep learning-based approach.
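To make the interplay between the two components concrete, the following is a minimal conceptual sketch of the conditional autoregressive loop summarized above. The repr_net, rnn objects and their methods (init_state, step), as well as the edge-thresholding rule, are assumed interfaces introduced here for illustration only; they are not the authors' implementation.

```python
# Conceptual sketch of structure-conditioned autoregressive generation.
import networkx as nx

def generate_conditioned(G0, m, repr_net, rnn):
    """Grow m new nodes (and their edges) on top of the conditioning subgraph G0."""
    G = G0.copy()
    state = rnn.init_state()                        # generation history (hypothetical API)
    for _ in range(m):
        node_reprs = repr_net(G)                    # condition on the current structure
        state, edge_probs = rnn.step(state, node_reprs)  # one probability per existing node
        new_node = G.number_of_nodes()
        G.add_node(new_node)
        for v, p in zip(list(G.nodes())[:-1], edge_probs):
            if p > 0.5:                              # hypothetical thresholding/sampling rule
                G.add_edge(new_node, v)
    return G
```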

The present study does not address all the problems related to structure-conditioned graph generation. First, our SCGG model handles a particular type of condition by generating new nodes and their corresponding edges on top of a given conditioning substructure; other structural conditions can, however, be envisaged. For example, in some situations it may be desirable that structure-related statistical properties of the generated graphs (e.g., average node degree, average clustering coefficient) fall within a specific value range. Secondly, as with many other graph generation approaches, our method relies only on the structural information of graphs (i.e., the way nodes are connected) and does not take into account different graph types or additional information associated with them. There are, however, many such types: heterogeneous graphs with multi-typed nodes and edges are widespread in real life, and molecular graphs are another example with significant applications in the pharmaceutical industry. Lastly, all current graph generation techniques, including SCGG, suffer from scalability issues, making them inapplicable to graphs with tens of thousands, let alone millions, of nodes. A question thus arises as to whether these limitations can be transcended. As future research, we intend to investigate this question by further extending the framework established by SCGG.

Supporting information

S1 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the Grid dataset.

The results are reported in terms of GED (the lower the better) as a function of the number of new nodes (denoted by m) that are added to initial graphs (each represented by the notation G0 in the paper).

https://doi.org/10.1371/journal.pone.0277887.s001

(TIF)

S2 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the IMDBBINARY dataset.

The results are reported in terms of GED (the lower the better) as a function of the number of new nodes (denoted by m) that are added to initial graphs (each represented by the notation G0 in the paper).

https://doi.org/10.1371/journal.pone.0277887.s002

(TIF)

S3 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the IMDBMULTI dataset.

The results are reported in terms of GED (the lower the better) as a function of the number of new nodes (denoted by m) that are added to initial graphs (each represented by the notation G0 in the paper).

https://doi.org/10.1371/journal.pone.0277887.s003

(TIF)

S4 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the Enzymes dataset.

The results are reported in terms of GED (the lower the better) as a function of the number of new nodes (denoted by m) that are added to initial graphs (each represented by the notation G0 in the paper).

https://doi.org/10.1371/journal.pone.0277887.s004

(TIF)

S5 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the NCI1 dataset.

The results are reported in terms of GED (the lower the better) as a function of the number of new nodes (denoted by m) that are added to initial graphs (each represented by the notation G0 in the paper).

https://doi.org/10.1371/journal.pone.0277887.s005

(TIF)

S6 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the Protein dataset.

The results are reported in terms of GED (the lower the better) as a function of the number of new nodes (denoted by m) that are added to initial graphs (each represented by the notation G0 in the paper).

https://doi.org/10.1371/journal.pone.0277887.s006

(TIF)

S7 Fig. Pairwise performance comparison between our proposed SCGG method and its competitors on the Protein dataset.

The results are reported in terms of GED (the lower the better) as a function of the parameter m that increases discretely from 10 to 90 in steps of 10.

https://doi.org/10.1371/journal.pone.0277887.s007

(TIF)

S1 File. Details regarding hyperparameter tuning.

https://doi.org/10.1371/journal.pone.0277887.s008

(PDF)

S2 File. Remarks on the order of the number of graph edges.

https://doi.org/10.1371/journal.pone.0277887.s009

(PDF)

S3 File. An illustrative example of the SCGG model at the evaluation time.

https://doi.org/10.1371/journal.pone.0277887.s010

(PDF)

S4 File. Some potential extensions of the SCGG model.

https://doi.org/10.1371/journal.pone.0277887.s011

(PDF)

References

  1. Mahmood Omar, Mansimov Elman, Bonneau Richard, and Cho Kyunghyun. Masked graph modeling for molecule generation. Nature Communications, 12(1):1–12, 2021. pmid:34039973
  2. Mahsa Ghorbani, Mojtaba Bahrami, Anees Kazi, Mahdieh Soleymani Baghshah, Hamid R Rabiee, and Nassir Navab. Gkd: Semi-supervised graph knowledge distillation for graph-independent inference. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 709–718. Springer, 2021.
  3. Min Shengjie, Gao Zhan, Peng Jing, Wang Liang, Qin Ke, and Fang Bo. Stgsn—a spatial–temporal graph neural network framework for time-evolving social networks. Knowledge-Based Systems, 214:106746, 2021.
  4. Chen Ling, Tang Xing, Chen Weiqi, Qian Yuntao, Li Yansheng, and Zhang Yongjun. Dacha: A dual graph convolution based temporal knowledge graph representation learning method using historical relation. ACM Transactions on Knowledge Discovery from Data (TKDD), 16(3):1–18, 2021.
  5. Chuxu Zhang, Dongjin Song, Chao Huang, Ananthram Swami, and Nitesh V Chawla. Heterogeneous graph neural network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 793–803, 2019.
  6. Huiting Hong, Hantao Guo, Yucheng Lin, Xiaoqing Yang, Zang Li, and Jieping Ye. An attention-based graph neural network for heterogeneous structural learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 4132–4139, 2020.
  7. Li Jianxin, Peng Hao, Cao Yuwei, Dou Yingtong, Zhang Hekai, Yu Philip, et al. Higher-order attribute-enhancing heterogeneous graph neural networks. IEEE Transactions on Knowledge and Data Engineering, 2021.
  8. Dexiong Chen, Leslie O'Bray, and Karsten Borgwardt. Structure-aware transformer for graph representation learning. In International Conference on Machine Learning, pages 3469–3489. PMLR, 2022.
  9. Xiaotian Han, Zhimeng Jiang, Ninghao Liu, Qingquan Song, Jundong Li, and Xia Hu. Geometric graph representation learning via maximizing rate reduction. In Proceedings of the ACM Web Conference 2022, pages 1226–1237, 2022.
  10. Yu Deng, Jiaolong Yang, Jianfeng Xiang, and Xin Tong. Gram: Generative radiance manifolds for 3d-aware image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10673–10683, 2022.
  11. Alisa Liu, Maarten Sap, Ximing Lu, Swabha Swayamdipta, Chandra Bhagavatula, Noah A Smith, et al. Dexperts: Decoding-time controlled text generation with experts and anti-experts. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, 2021.
  12. Chen Li-Chin, Chen Po-Hsun, Tsai Richard Tzong-Han, and Tsao Yu. Epg2s: Speech generation and speech enhancement based on electropalatography and audio signals using multimodal learning. IEEE Signal Processing Letters, 2022.
  13. Li Yibo, Zhang Liangren, and Liu Zhenming. Multi-objective de novo drug design with conditional graph generative model. Journal of Cheminformatics, 10(1):33, 2018. pmid:30043127
  14. Aditya Grover, Aaron Zweig, and Stefano Ermon. Graphite: Iterative generative modeling of graphs. In International Conference on Machine Learning, pages 2434–2444, 2019.
  15. Yang Wenju, Wen Guangqi, Cao Peng, Yang Jinzhu, and Zaiane Osmar R. Collaborative learning of graph generation, clustering and classification for brain networks diagnosis. Computer Methods and Programs in Biomedicine, 219:106772, 2022. pmid:35395591
  16. Kang Minguk and Park Jaesik. Contragan: Contrastive learning for conditional image generation. Advances in Neural Information Processing Systems, 33:21357–21369, 2020.
  17. Guo Bin, Wang Hao, Ding Yasan, Wu Wei, Hao Shaoyang, Sun Yueqi, et al. Conditional text generation for harmonious human-machine interaction. ACM Transactions on Intelligent Systems and Technology (TIST), 12(2):1–50, 2021.
  18. Martin Simonovsky and Nikos Komodakis. Graphvae: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks, pages 412–422. Springer, 2018.
  19. Yang Carl, Zhuang Peiye, Shi Wenhan, Luu Alan, and Li Pan. Conditional structure generation through graph variational generative adversarial nets. In Advances in Neural Information Processing Systems, pages 1340–1351, 2019.
  20. Lim Jaechang, Hwang Sang-Yeon, Moon Seokhyun, Kim Seungsu, and Kim Woo Youn. Scaffold-based molecular design with a graph generative model. Chemical Science, 11(4):1153–1164, 2020.
  21. Wengong Jin, Regina Barzilay, and Tommi Jaakkola. Hierarchical generation of molecular graphs using structural motifs. In International Conference on Machine Learning, 2020.
  22. Yassaman Ommi, Matin Yousefabadi, Faezeh Faez, Amirmojtaba Sabour, Mahdieh Soleymani Baghshah, and Hamid R Rabiee. Ccgg: A deep autoregressive model for class-conditional graph generation. In Companion Proceedings of the Web Conference 2022, pages 1092–1098, 2022.
  23. Zhou Tao, Lü Linyuan, and Zhang Yi-Cheng. Predicting missing links via local information. The European Physical Journal B, 71(4):623–630, 2009.
  24. Liu Weiping and Lü Linyuan. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.
  25. Teji Binon, Das Jayanta K, Roy Swarup, and Bhandari Dinabandhu. Predicting missing links in gene regulatory networks using network embeddings: A qualitative assessment of selective embedding techniques. In Intelligent Systems, pages 143–154. Springer, 2022. https://doi.org/10.1007/978-981-19-0901-6_14
  26. Erdős Paul and Rényi Alfréd. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci, 5(1):17–60, 1960.
  27. Watts Duncan J and Strogatz Steven H. Collective dynamics of 'small-world' networks. Nature, 393(6684):440–442, 1998. pmid:9623998
  28. Holland Paul W, Laskey Kathryn Blackmond, and Leinhardt Samuel. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
  29. Albert Réka and Barabási Albert-László. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47, 2002.
  30. Faez Faezeh, Ommi Yassaman, Baghshah Mahdieh Soleymani, and Rabiee Hamid R. Deep graph generators: A survey. IEEE Access, 9:106675–106702, 2021.
  31. Yujia Li, Oriol Vinyals, Chris Dyer, Razvan Pascanu, and Peter Battaglia. Learning deep generative models of graphs. arXiv preprint arXiv:1803.03324, 2018.
  32. Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International Conference on Machine Learning, pages 5708–5717, 2018.
  33. Mariya Popova, Mykhailo Shvets, Junier Oliva, and Olexandr Isayev. Molecularrnn: Generating realistic molecular graphs with optimized properties. arXiv preprint arXiv:1905.13372, 2019.
  34. Bacciu Davide, Micheli Alessio, and Podda Marco. Edge-based sequential graph generation with recurrent neural networks. Neurocomputing, 2020.
  35. Nikhil Goyal, Harsh Vardhan Jain, and Sayan Ranu. Graphgen: A scalable approach to domain-agnostic labeled graph generation. In Proceedings of The Web Conference 2020, pages 1253–1263, 2020.
  36. Xianduo Song, Xin Wang, Yuyuan Song, Xianglin Zuo, and Ying Wang. Hierarchical recurrent neural networks for graph generation. Information Sciences, 589:250–264, 2022.
  37. Liao Renjie, Li Yujia, Song Yang, Wang Shenlong, Hamilton Will, Duvenaud David K, et al. Efficient graph generation with graph recurrent attention networks. In Advances in Neural Information Processing Systems, pages 4257–4267, 2019.
  38. Xiaojie Guo, Liang Zhao, Zhao Qin, Lingfei Wu, Amarda Shehu, and Yanfang Ye. Node-edge co-disentangled representation learning for attributed graph generation. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
  39. Li Jia, Yu Jianwei, Li Jiajin, Zhang Honglei, Zhao Kangfei, Rong Yu, et al. Dirichlet graph variational autoencoder. Advances in Neural Information Processing Systems, 33, 2020.
  40. Yuanqi Du, Yinkai Wang, Fardina Alam, Yuanjie Lu, Xiaojie Guo, Liang Zhao, et al. Deep latent-variable models for controllable molecule generation. In 2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 372–375. IEEE, 2021.
  41. Yuanqi Du, Xiaojie Guo, Hengning Cao, Yanfang Ye, and Liang Zhao. Disentangled spatiotemporal graph generative models. In AAAI, 2022.
  42. Yuanqi Du, Xiaojie Guo, Amarda Shehu, and Liang Zhao. Interpretable molecular graph generation via monotonic constraints. In Proceedings of the 2022 SIAM International Conference on Data Mining (SDM), pages 73–81. SIAM, 2022.
  43. You Jiaxuan, Liu Bowen, Ying Zhitao, Pande Vijay, and Leskovec Jure. Graph convolutional policy network for goal-directed molecular graph generation. In Advances in Neural Information Processing Systems, pages 6410–6421, 2018.
  44. Ahn Sungsoo, Kim Junsu, Lee Hankook, and Shin Jinwoo. Guiding deep molecular optimization with genetic exploration. In Advances in Neural Information Processing Systems, 2020.
  45. Darvariu Victor-Alexandru, Hailes Stephen, and Musolesi Mirco. Goal-directed graph construction using reinforcement learning. Proceedings of the Royal Society A, 477(2254):20210168, 2021.
  46. Nicola De Cao and Thomas Kipf. Molgan: An implicit generative model for small molecular graphs. In ICML Workshop on Theoretical Foundations and Applications of Deep Generative Models, 2018.
  47. Chence Shi, Minkai Xu, Zhaocheng Zhu, Weinan Zhang, Ming Zhang, and Jian Tang. Graphaf: a flow-based autoregressive model for molecular graph generation. In International Conference on Learning Representations, 2020.
  48. Youzhi Luo, Keqiang Yan, and Shuiwang Ji. Graphdf: A discrete flow model for molecular graph generation. In International Conference on Machine Learning, pages 7192–7203. PMLR, 2021.
  49. Jiao Pengfei, Guo Xuan, Jing Xin, He Dongxiao, Wu Huaming, Pan Shirui, et al. Temporal network embedding for link prediction via vae joint attention mechanism. IEEE Transactions on Neural Networks and Learning Systems, 2021. pmid:34106869
  50. Ping Wang, Khushbu Agarwal, Colby Ham, Sutanay Choudhury, and Chandan K Reddy. Self-supervised learning of contextual embeddings for link prediction in heterogeneous networks. In Proceedings of the Web Conference 2021, pages 2946–2957, 2021.
  51. Nayyeri Mojtaba, Cil Gokce Muge, Vahdati Sahar, Osborne Francesco, Rahman Mahfuzur, Angioni Simone, et al. Trans4e: Link prediction on scholarly knowledge graphs. Neurocomputing, 461:530–542, 2021.
  52. Myunghwan Kim and Jure Leskovec. The network completion problem: Inferring missing nodes and edges in networks. In Proceedings of the 2011 SIAM International Conference on Data Mining, pages 47–58. SIAM, 2011.
  53. Sina Sigal, Rosenfeld Avi, and Kraus Sarit. Sami: an algorithm for solving the missing node problem using structure and attribute information. Social Network Analysis and Mining, 5(1):54, 2015.
  54. Farzan Masrour, Iman Barjesteh, Rana Forsati, Abdol-Hossein Esfahanian, and Hayder Radha. Network completion with node similarity: A matrix completion approach with provable guarantees. In Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015, pages 302–307. ACM, 2015.
  55. Dimitrios Rafailidis and Fabio Crestani. Network completion via joint node clustering and similarity learning. In Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, pages 63–68. IEEE Press, 2016.
  56. Tran Cong, Shin Won-Yong, Spitz Andreas, and Gertz Michael. Deepnc: Deep generative network completion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(4):1837–1852, 2022. pmid:33074806
  57. Dudziak Lukasz, Chau Thomas, Abdelfattah Mohamed, Lee Royson, Kim Hyeji, and Lane Nicholas. Brp-nas: Prediction-based nas using gcns. Advances in Neural Information Processing Systems, 33:10480–10490, 2020.
  58. Niu Xuesong, Han Hu, Shan Shiguang, and Chen Xilin. Multi-label co-regularization for semi-supervised facial action unit recognition. Advances in Neural Information Processing Systems, 32, 2019.
  59. Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, 2017.
  60. Muhan Zhang, Zhicheng Cui, Marion Neumann, and Yixin Chen. An end-to-end deep learning architecture for graph classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  61. Vaswani Ashish, Shazeer Noam, Parmar Niki, Uszkoreit Jakob, Jones Llion, Gomez Aidan N, et al. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
  62. Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, et al. Automatic differentiation in pytorch. In NIPS-W, 2017.
  63. Schomburg Ida, Chang Antje, Ebeling Christian, Gremse Marion, Heldt Christian, Huhn Gregor, et al. Brenda, the enzyme database: updates and major new developments. Nucleic Acids Research, 32(suppl_1):D431–D433, 2004. pmid:14681450
  64. Dobson Paul D and Doig Andrew J. Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4):771–783, 2003. pmid:12850146
  65. Leskovec Jure, Chakrabarti Deepayan, Kleinberg Jon, Faloutsos Christos, and Ghahramani Zoubin. Kronecker graphs: an approach to modeling networks. Journal of Machine Learning Research, 11(2), 2010.
  66. Himchan Park and Min-Soo Kim. Evograph: An effective and efficient graph upscaling method for preserving graph properties. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2051–2059, 2018.
  67. Sanfeliu Alberto and Fu King-Sun. A distance measure between attributed relational graphs for pattern recognition. IEEE Transactions on Systems, Man, and Cybernetics, (3):353–362, 1983.
  68. Zeng Zhiping, Tung Anthony KH, Wang Jianyong, Feng Jianhua, and Zhou Lizhu. Comparing stars: On approximating graph edit distance. Proceedings of the VLDB Endowment, 2(1):25–36, 2009.
  69. Fischer Andreas, Riesen Kaspar, and Bunke Horst. Improved quadratic time approximation of graph edit distance by combining hausdorff matching and greedy assignment. Pattern Recognition Letters, 87:55–62, 2017.