SCGG: A deep structure-conditioned graph generative model

Deep learning-based graph generation approaches have remarkable capacities for graph data modeling, allowing them to solve a wide range of real-world problems. Enabling these methods to consider different conditions during the generation procedure further increases their effectiveness by empowering them to generate new graph samples that meet the desired criteria. This paper presents a conditional deep graph generation method called SCGG that considers a particular type of structural condition. Specifically, our proposed SCGG model takes an initial subgraph and autoregressively generates new nodes and their corresponding edges on top of the given conditioning substructure. The architecture of SCGG consists of a graph representation learning network and an autoregressive generative model, which is trained end-to-end. More precisely, the graph representation learning network is designed to compute continuous representations for each node in a graph, which are affected not only by the features of adjacent nodes, but also by those of farther nodes. This network is primarily responsible for providing the generation procedure with the structural condition, while the autoregressive generative model mainly maintains the generation history. Using this model, we can address graph completion, a pervasive and inherently difficult problem of recovering missing nodes and their associated edges in partially observed graphs. The computational complexity of the SCGG method is shown to be linear in the number of graph nodes. Experimental results on both synthetic and real-world datasets demonstrate the superiority of our method compared with state-of-the-art baselines.


Introduction
With the ever-increasing growth of data collection and production technologies, large amounts of data are readily accessible. In many cases, some kind of relationship exists between data entities, which, if taken into consideration, can lead to more precise data analyses. Such relationships are mostly represented by graph data structures, and that is why graph-related research has become a widely discussed topic in many areas including chemistry [1], medical applications [2], social network studies [3], and knowledge graph-related research [4]. Most recent studies are dedicated to graph representation learning [5,6], aiming to obtain suitable representations of nodes, edges, or the entire graph in continuous space to be further utilized by downstream tasks.
Graph generation is another important branch of graph-related research, which often benefits from the results of graph representation learning studies. This research field has a history of several decades. It has recently been revived by receiving renewed attention from scholars, mainly due to the advances in machine learning, and in particular deep learning techniques. The goal of graph generation is to provide models that can generate new graph samples from the desired data distributions. Thus, similar to generative methods in other data domains such as image [7], text [8], and speech [9], graph generative approaches can bring substantial capacity for graph data modeling to address various real-world problems such as drug design [10], understanding and modeling the interactions in social networks [11], and human disease diagnosis [12].
One of the desired and essential properties of generative methods is their ability to carry out the generation procedure in a controlled manner, so that the produced samples comply with predetermined conditions by having the required characteristics. In this regard, numerous studies have been conducted to develop conditional generative models in different data domains, such as image [13] and text [14]. Initial steps [15][16][17][18][19] have also been taken to make graph generators conditional; however, compared to the work performed in other data domains and also compared to the needs and capacities of this field, much remains to be done.
In addition to what we have discussed so far, there is a common problem manifesting itself when working with different types of data. Specifically, in many cases, the data is not completely available, which can be due to various reasons such as limitations of data collection tools, issues related to privacy, or inadequacy of storage space. This can significantly degrade the performance of data analysis methods. Therefore, it is often crucial to recover the missing part of the data before processing it; hence, various methods have been proposed in different data domains to address this challenge. Regarding graph data, many methods have been developed over the years [20,21] to predict missing links between graph nodes, and researchers are still actively pursuing solutions for this problem [22]. However, an intrinsically more complicated challenge arises when graph nodes are missing. We will refer to this problem as graph completion, which, unlike the widely investigated problem of link prediction, has been much less addressed despite its importance and pervasiveness.
To address the issues mentioned above, we propose the Structure-Conditioned Graph Generator (SCGG), an end-to-end deep learning-based conditional graph generative approach. The SCGG model takes an initial subgraph as the structural condition. It then autoregressively performs the graph generation procedure by adding new nodes and predicting the inter-links between the new nodes and those in the conditioning subgraph, as well as the intra-links between the new nodes themselves. In this way, our generative model ensures the existence of desired subgraphs in the final generated graphs, which can have several applications in both molecular and non-molecular domains. Specifically, when designing molecular graphs, the existence of desired chemical substructures can bring certain chemical properties to the final molecules. Moreover, regarding non-molecular graphs, the SCGG model can be best utilized to solve the graph completion problem, in which some graph nodes and their corresponding edges are totally missing. Our study focuses on the latter application, but the proposed SCGG model can be easily extended to molecular applications as well. In this regard, a partially observed graph is given to the model as a structural condition. Then the nodes generated by the model and their associated edges are treated as the recovered missing nodes and the edges connecting them to each other, as well as to the partially observed graph nodes.
In summary, we present the following contributions in this work: • We introduce SCGG, a conditional graph generation approach, which autoregressively generates graphs based on a given structural condition.
• The architecture of our SCGG model consists of a graph representation learning network and a recurrent neural network (RNN), where the former is mainly used to take into account the structural condition, and the latter captures the generation history.
• We use our proposed SCGG model to address the graph completion problem, benefiting from the power and potential of a deep generative model for solving an inherently difficult and complex problem, which as a result has been relatively little investigated so far. To the best of our knowledge, this is the first time that a completely deep learning-based model has been designed in such a way that it can specifically tackle this problem.
The rest of this paper is organized as follows. Section 2 reviews the related work, and Section 3 presents the notations and the problem definition. In Section 4, we explain our proposed SCGG model in detail. Experimental details and results are discussed in Section 5. Finally, in Section 6 we conclude the paper.

Related work
In line with what we discussed earlier, our proposed SCGG model is a structure-based conditional graph generation approach, one of whose main applications is graph completion. Therefore, in the following, we review the literature in these two related areas.

Graph generation
Graph generation is a field of research seeking to generate new graph structures with certain characteristics; it dates back several decades and is still a hot topic for research. In contrast to the early methods [23][24][25][26], which relied on manually-designed procedures to construct graphs with predetermined statistical properties, the more recent ones are data-driven, utilizing the available graph samples in datasets to train models that can more effectively generate new graphs. The latter approaches typically employ different deep learning techniques and generation strategies, and accordingly, they can be classified into several categories [27].
The autoregressive approaches, which adopt step-by-step strategies for generating graphs, are the most relevant methods to our research. DeepGMG [28] is an example of these, proposing a repetitive decision-making process to generate graphs gradually. GraphRNN [29] is among the well-known and influential approaches; it first maps each graph into a sequence of nodes and then processes one node per time step using RNNs to model the distribution of the resulting sequences. The method has inspired a number of subsequent approaches like MolecularRNN [30], which extends GraphRNN to generate molecular graphs with specific chemical features. Bacciu et al. [31], GraphGen [32], and GHRNN [33], on the other hand, convert graphs to sequences of edges instead of nodes, and then model the resulting distributions with RNNs. Besides, there are other autoregressive methods that utilize the attention mechanism to empower their generative models. In this regard, GRAN [34] proposes to add a block of new nodes in each step and employs an attentive message passing mechanism to compute the representations of the graph nodes.
A key point to notice is that regardless of what category these methods fall into and what techniques they employ to solve the problem, an important capability for them is to consider desired conditions during generation so that the resulting graphs meet the expected characteristics. Hence, the problem of conditional graph generation arises. In this regard, GraphVAE [15] conditions both the encoder and the decoder of its VAE on a label vector for molecular graph generation. CONDGEN [16] adopts a similar approach (i.e., concatenating a condition vector to the VAE latent variable) to incline the model towards generating graphs with desired characteristics. Lim et al. [17] and HierVAE [18] guarantee the existence of intended chemical substructures in the output molecular graphs. CCGG [19] makes the GRAN [34] model class-conditional, allowing it to generate graphs of desired classes. However, despite the efforts that have gone into conditional graph generation, there is still a vital need to develop further approaches that can capture various types of conditions. In this regard, the SCGG model is a generative method designed to handle a special kind of condition, which is of structural type.

Graph completion
In many cases, a part of a graph structure is unavailable for various reasons. Hence, it is necessary to reconstruct the missing information prior to further processing. Most of the methods developed for this purpose try to perform link prediction [46][47][48], although a more complicated problem arises when the graph nodes are missing. Due to the complexity of addressing this problem, which we refer to as graph completion, few methods have been presented to solve it so far. Regarding this, KronEM [49] utilizes a combination of the Expectation-Maximization framework and the Kronecker graphs model to infer the missing nodes and their corresponding edges. SAMI [50] adopts a clustering approach for solving the missing node problem by heavily relying on the existence of missing node indicators, which are often unattainable in real scenarios. Masrour et al. [51] and JCSL [52] utilize side information about the graph nodes to perform network completion; however, this information may not be accessible in all cases. More recently, DeepNC [53] was introduced, which first learns the likelihood of the data by training the GraphRNN [29] model. It then uncovers the missing parts of a graph by proposing a greedy optimization algorithm, aiming to maximize the obtained likelihood. Although DeepNC is an innovative approach that has obtained satisfactory results, it is not learning-based, so it cannot directly learn from the data for the specific task of graph completion. In contrast, our proposed method trains an end-to-end model to address this problem. Furthermore, unlike some of the graph completion methods mentioned above, SCGG does not depend on the existence of side information, which may not be available in many situations.

Notations and problem definition
In this section, we define the notations used in the paper and present the problem definition. For convenience, we summarize the notations in Table 1.

We denote an initial graph as $G_0 = (V_0, E_0)$, where $V_0$ and $E_0$ are the node and edge sets, respectively, and $|V_0| = n$. Under an ordering $\pi$ of these nodes, we represent the $i$-th node's links by the following sequence:

$S_i^{\pi} = \left(A_{i,1}^{\pi}, \ldots, A_{i,n}^{\pi}\right)$,  (1)

where $A_{i,j}^{\pi}$ takes the value of 1 if the $i$-th node is connected to the $j$-th node and 0 otherwise. Considering $G_0$ as the structural condition, the objective of our research is to learn to sample from the conditional probability distribution $p(\mathcal{G} \mid G_0)$ in order to generate a graph $G = (V, E)$ that includes $G_0$ as a subgraph, i.e., $V_0 \subset V$ and $E_0 \subset E$. This can be done by first adding the node set $\tilde{V}$, with $|\tilde{V}| = m$ and $\tilde{V} = V - V_0$. Then, to connect the new nodes, the edge set $\tilde{E}$ will be generated, where $\tilde{E} = E - E_0$. More specifically, $\tilde{E}$ consists of: 1. the inter-connections between the new nodes and those in $G_0$; 2. the intra-connections between the new nodes themselves. To represent the inter-connections between the new nodes and the $i$-th node of $G_0$ under the ordering $\pi$, we use the sequence below:

$\tilde{S}_i^{\pi,\sigma} = \left(\tilde{A}_{i,1}^{\pi,\sigma}, \ldots, \tilde{A}_{i,m}^{\pi,\sigma}\right)$,  (2)

where $\sigma$ is a node ordering of the new nodes, and $\tilde{A}_{i,j}^{\pi,\sigma}$ is 1 if the $i$-th node of $G_0$ has a link to the $j$-th new node and 0 otherwise. Moreover, regarding the intra-connections, we denote the $i$-th new node's connections to the nodes in $\tilde{V}$ by the following sequence:

$\tilde{T}_i^{\sigma} = \left(\tilde{B}_{i,1}^{\sigma}, \ldots, \tilde{B}_{i,m}^{\sigma}\right)$,  (3)

where, similarly to the previous formulas, $\tilde{B}_{i,j}^{\sigma}$ takes the value of 1 if there is a link connecting the $i$-th and the $j$-th new nodes (under the ordering $\sigma$) and 0 otherwise.

Table 1. Notations in this paper.
$G_0$: The initial graph given as the structural condition.
$V_0$: The node set of $G_0$.
$E_0$: The edge set of $G_0$.
$n$: The number of nodes in $G_0$, $n = |V_0|$.
$\pi$: An ordering of $G_0$'s nodes.
$S_i^{\pi}$: The sequence representing how the $i$-th node of $G_0$ under the ordering $\pi$ connects to $G_0$'s nodes.
$G$: The graph that contains the initial graph $G_0$ as a subgraph.
$V$: The node set of $G$.
$E$: The edge set of $G$.
$\mathcal{G}$: The random variable associated with graph structures.
$\tilde{V}$: The set of new nodes added to $G_0$ to form the graph $G$.
$\tilde{E}$: The set of edges connecting the new nodes to each other, as well as to the nodes in $G_0$.
$m$: The number of new nodes, $m = |\tilde{V}|$.
$\sigma$: An ordering of the new nodes $\tilde{V}$.
$\tilde{S}_i^{\pi,\sigma}$: The sequence representing the links connecting the $i$-th node of $G_0$ under the ordering $\pi$ to the new nodes ordered by $\sigma$.
$\tilde{T}_i^{\sigma}$: The sequence representing the links between the $i$-th new node and each of the new nodes under the ordering $\sigma$.
$\tilde{S}_{1:i}^{\pi,\sigma}$: Notational abbreviation for $\{\tilde{S}_1^{\pi,\sigma}, \ldots, \tilde{S}_i^{\pi,\sigma}\}$.
$G'$: The graph induced from $G$ by removing the intra-connections between the set of new nodes.
SCGG: Structure-Conditioned Graph Generator
We approach a specific type of structure-conditioned graph generation that takes an initial substructure and starts to generate new nodes and their associated edges on top of the given conditioning substructure. To this end, we propose the SCGG model, whose architecture is composed of a graph representation learning network and an autoregressive generative model, which is trained in an end-to-end manner. In this section, we present the details of the SCGG model. In this regard, we first elucidate the problem formulation and the model architecture.
Next, we describe the procedure employed to prepare the data for model training. Then, we discuss the training and inference phases and elaborate on the implementation details.

Formulation
As mentioned in Section 3, in this work we intend to learn to sample from the distribution $p(\mathcal{G} \mid G_0)$ to conditionally generate the graph $G$ given an arbitrary initial graph $G_0$. To do so, our SCGG model first estimates this conditional probability distribution and then samples from the resulting estimated distribution. As it is not easy to work directly in the graph space, we reformulate the problem to deal with the following distribution:

$p\left(\tilde{S}_{1:n}^{\pi,\sigma}, \tilde{T}_{1:m}^{\sigma} \mid S_{1:n}^{\pi}\right)$,  (4)

where $S_{1:n}^{\pi}$ and $\tilde{S}_{1:i}^{\pi,\sigma}$ are the notational abbreviations for $\{S_1^{\pi}, \ldots, S_n^{\pi}\}$ and $\{\tilde{S}_1^{\pi,\sigma}, \ldots, \tilde{S}_i^{\pi,\sigma}\}$, respectively (and likewise $\tilde{T}_{1:m}^{\sigma}$ for $\{\tilde{T}_1^{\sigma}, \ldots, \tilde{T}_m^{\sigma}\}$), and the new problem formulation relates to the original one through the below equation:

$p(\mathcal{G} = G \mid G_0) = \sum_{\pi, \sigma} p\left(\tilde{S}_{1:n}^{\pi,\sigma}, \tilde{T}_{1:m}^{\sigma} \mid S_{1:n}^{\pi}\right)$.  (5)

To further decompose the probability in Eq. 4, we follow the chain rule, and therefore this conditional probability can be rewritten as follows:

$p\left(\tilde{S}_{1:n}^{\pi,\sigma}, \tilde{T}_{1:m}^{\sigma} \mid S_{1:n}^{\pi}\right) = \prod_{i=1}^{n} p\left(\tilde{S}_i^{\pi,\sigma} \mid S_{1:n}^{\pi}, \tilde{S}_{1:i-1}^{\pi,\sigma}\right) \cdot \prod_{j=1}^{m} p\left(\tilde{T}_j^{\sigma} \mid S_{1:n}^{\pi}, \tilde{S}_{1:n}^{\pi,\sigma}, \tilde{T}_{1:j-1}^{\sigma}\right)$.  (6)

Our proposed SCGG method trains a novel network architecture in an end-to-end manner to model the complex distribution in Eq. 6.

Model architecture
The model architecture of SCGG consists of two main components, namely, a graph representation learning network and an autoregressive generative model (i.e., an RNN). In the following, we explain these components in detail and discuss the role each plays in the task of structure-conditioned graph generation.

Graph Feature Learning Network
The SCGG method needs appropriate representations of graph nodes beforehand to perform distribution modeling. Therefore, it utilizes a graph representation learning network that employs both a graph convolutional network (GCN) and a Transformer network to learn meaningful node features. Below, we give a brief background on GCNs and Transformers. Furthermore, we elaborate on how each of them contributes to obtaining the final node features in our model.

• Graph Convolutional Network (GCN)
It is often difficult to work directly in the complex and discrete graph space. Therefore, in many cases, obtaining continuous representations of nodes, edges, or the whole graph is necessary prior to any upcoming tasks. Employing Graph Convolutional Networks addresses this problem. The main idea of GCNs originates from the fact that a node's representation can be obtained by taking into account its own features and those of its neighbors. This is because the neighbors in a graph (i.e., directly or indirectly connected nodes) usually share some common characteristics and information. Formally, the layer-wise propagation rule of GCNs can be generally formulated as below:

$H^{l+1} = f\left(A H^{l} W^{l}\right)$,

where $H^{l} \in \mathbb{R}^{N \times d_l}$ is the nodes' feature matrix at the $l$-th GCN layer, $N$ is the number of graph nodes, $d_l$ is the number of features obtained for a node by the previous GCN layer, and $H^{0}$ is set to be the initial feature matrix given as input to the GCN; $A \in \mathbb{R}^{N \times N}$ is the adjacency matrix [54,55] or a variant of it [56,57]; $W^{l} \in \mathbb{R}^{d_l \times d_{l+1}}$ is the learnable parameter matrix of the $l$-th GCN layer, which maps $d_l$ feature channels to $d_{l+1}$ channels; $f$ is a non-linear activation function; and $H^{l+1} \in \mathbb{R}^{N \times d_{l+1}}$ is the output feature matrix produced by the $l$-th GCN layer. Considering this background, our proposed Graph Feature Learning Network first applies $L$ layers of GCN to the input graph. This way, a continuous representation is computed for each graph node based on its neighbors' information.
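As an illustration, the following is a minimal PyTorch sketch of one such propagation step; the module name, the choice of ReLU as the activation $f$, and the use of a pre-normalized adjacency matrix are our own assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One GCN propagation step: H^{l+1} = f(A_hat @ H^l @ W^l)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # plays the role of W^l

    def forward(self, a_hat: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # a_hat: (N, N) adjacency matrix (or a normalized variant with self-loops)
        # h:     (N, in_dim) node features from the previous layer
        return torch.relu(self.linear(a_hat @ h))
```

Stacking two such layers with 16-dimensional outputs would match the two-layer, 16-dimensional GCN reported in the implementation details (Section 4.6).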

• Transformer network
In this work, we intend to autoregressively model the distribution in Eq. 4, which is conditioned on $\{S_1^{\pi}, \ldots, S_n^{\pi}\}$. We do so by feeding the representations of graph nodes one at a time into the RNN. Thus, in order to perform conditional distribution modeling in this way, it is necessary to learn rich node representations so that all the graph nodes can make their own contribution to computing each node's embedding. In other words, we need the representation of a node not only to contain the information of its close neighbors, but also to include the information of relatively distant nodes that share some similar characteristics with it. However, an $L$-layer GCN only considers information in $L$-hop neighborhoods to obtain node representations, even if there are dependencies between farther nodes. Therefore, our proposed Graph Feature Learning Network utilizes a Transformer encoder, which has shown promising results in contextualized representation learning. The following gives a quick overview of its architecture and workflow.
According to [58], the Transformer encoder layer consists of a multi-head attention block and a feedforward network, each followed by a residual addition and a layer normalization. A multi-head attention block consists of multiple attention heads, each working in a separate subspace to compute new contextualized representations corresponding to different aspects of dependencies between data entities. To be more precise, each attention head takes as input $X \in \mathbb{R}^{N \times d}$ (in our case, the feature matrix computed for the graph's nodes by applying $L$ layers of GCN) and projects it into three matrices: the query, key, and value matrices. Then, the attention score of each query over every row of the value matrix is computed by performing an inner product of that query with the corresponding row of the key matrix. Using these scores, a new contextualized representation is calculated for each query as a weighted summation of the value matrix rows.
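For reference, this is the standard scaled dot-product attention of [58], which each head computes over its own projections; the projection matrices $W_Q$, $W_K$, $W_V$ and the key dimension $d_k$ follow the usual Transformer notation rather than symbols defined in this paper.

```latex
Q = X W_Q, \quad K = X W_K, \quad V = X W_V, \qquad
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V .
```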
Considering these remarks regarding the GCN and the Transformer, the final node representations are obtained by concatenating the features computed by each of the two networks. Fig. 1 shows an overview of the proposed Graph Feature Learning Network.
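To make the wiring concrete, the following is a rough PyTorch sketch of the Graph Feature Learning Network as described above; it reuses the GCNLayer sketch from the previous subsection, and the layer sizes, the single-layer nn.TransformerEncoder, and the omission of batch normalization are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class GraphFeatureLearningNetwork(nn.Module):
    """Sketch: L GCN layers followed by a Transformer encoder; the final node
    features concatenate the local (GCN) and contextualized (Transformer) ones."""
    def __init__(self, in_dim: int = 16, gcn_dim: int = 16,
                 n_gcn_layers: int = 2, n_heads: int = 8):
        super().__init__()
        dims = [in_dim] + [gcn_dim] * n_gcn_layers
        self.gcn_layers = nn.ModuleList(
            [GCNLayer(dims[i], dims[i + 1]) for i in range(n_gcn_layers)])
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=gcn_dim, nhead=n_heads, dropout=0.1, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)

    def forward(self, a_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        h = x
        for layer in self.gcn_layers:
            h = layer(a_hat, h)                              # (N, gcn_dim) neighborhood features
        z = self.transformer(h.unsqueeze(0)).squeeze(0)      # (N, gcn_dim) contextualized features
        return torch.cat([h, z], dim=-1)                     # (N, 2 * gcn_dim) final representations
```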

Autoregressive generative model
As mentioned earlier, we want to model the conditional distribution in Eq. 4. To do so, we decompose it as the product of $n + m$ conditional distributions in Eq. 6, and then go through modeling them. Each condition in Eq. 6 can be divided into two parts: (a) $\{S_1^{\pi}, \ldots, S_n^{\pi}\}$, the initial structural condition relating to $G_0$, and (b) the remaining part of the condition derived by applying the chain rule, which relates to the generation history. The former is primarily captured by our Graph Feature Learning Network, and the latter is handled using an autoregressive generative model, namely an RNN. More specifically, the embeddings obtained by the Graph Feature Learning Network are fed into the RNN one at a time, and the RNN proceeds. This way, the RNN keeps the generation history such that at each step, the corresponding hidden state maintains the information of the graph generated until that time.

Data preparation
Making the data suitable as an input to our SCGG model is a prerequisite for training. Therefore, we perform a data preparation procedure before feeding the data to the model. This procedure includes determining the set of new nodes $\tilde{V}$, identifying the resulting initial graph $G_0$, and applying orderings to these two sets of nodes. An example of the data preparation procedure is illustrated in Fig. 2. First, $m$ nodes are randomly selected from the main graph to form the set of new nodes. The unselected nodes and the edges connecting them to each other are then treated as the initial graph $G_0$. The reason behind this random node selection is that every subset of nodes (i.e., the unselected ones) from the original graph should have the chance to contribute to the model training as an initial graph. Thus, the model gains the ability to perform structure-conditioned graph generation given an arbitrary graph $G_0$ at test time. Afterwards, orderings are applied to the nodes such that the initial graph nodes are ordered by $\pi$, and the new nodes follow the order specified by $\sigma$.
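A small illustrative sketch of this preparation step, written with NetworkX (function and variable names are ours, not the paper's), could look as follows:

```python
import random
import networkx as nx

def prepare_training_instance(graph: nx.Graph, m: int, seed=None):
    """Randomly pick m nodes as the 'new' nodes; the remaining nodes and the edges
    among them form the initial graph G_0. Orderings pi (over G_0's nodes) and
    sigma (over the new nodes) are drawn uniformly at random."""
    rng = random.Random(seed)
    new_nodes = rng.sample(list(graph.nodes()), m)            # nodes treated as missing
    kept = [v for v in graph.nodes() if v not in set(new_nodes)]
    g0 = graph.subgraph(kept).copy()                          # initial graph G_0
    pi = rng.sample(kept, len(kept))                          # ordering of G_0's nodes
    sigma = rng.sample(new_nodes, m)                          # ordering of the new nodes
    return g0, new_nodes, pi, sigma
```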

Training
To train the SCGG model, we first give it two versions of each graph $G$. The first version corresponds to the initial graph $G_0$. The second version, which we denote by $G'$, is obtained by removing the intra-connections between pairs of nodes belonging to $\tilde{V}$. The Graph Feature Learning Network, denoted here by $f_{feat}$, takes these two graphs as inputs and separately calculates node representations for each of them, as formulated below:

$X = f_{feat}(G_0), \qquad X' = f_{feat}(G')$.

Next, a subset of the computed representations is fed into the RNN one by one in the order specified by $\pi$ and $\sigma$. More precisely, the RNN first takes the representations of $G_0$'s nodes computed based on the first version of the graph. Then, it receives as input the representations of the new nodes obtained by feeding $G'$ into the Graph Feature Learning Network. To put it another way, the final representations to be fed into the RNN are as follows:

$X'' = \left[X_1, \ldots, X_n, X'_{n+1}, \ldots, X'_{n+m}\right]$.

The reason for this is that at test time, we only have access to an initial graph $G_0$, knowing nothing about how the set of new nodes are connected to each other or to the rest of the graph; but as the RNN proceeds, it predicts the inter-connections between the new nodes and the nodes of $G_0$. Thus, when the RNN finishes processing the last node of $G_0$, all inter-connections have been predicted and $G'$ can be constructed on top of $G_0$. At this point, it is time to complete the graph structure by predicting the intra-links between the new nodes. This requires a proper representation for each new node, which can be obtained based on the most complete available version of the graph structure, i.e., $G'$.
Moreover, each RNN cell takes as its second input the ground-truth labels of the previous cell. Therefore, the input for the $i$-th RNN cell is obtained as follows:

$x_i = \left[X''_i ; y^{gt}_{i-1}\right]$,

where $X''_i$ is the representation of the $i$-th node and $y^{gt}_{i-1} \in \mathbb{R}^{m}$ is the vector of ground-truth labels determining whether the $(i-1)$-th node has links to each of the new nodes or not. Next, by considering both the current input $x_i$ and the previous hidden state $h_{i-1}$, the RNN outputs probabilities regarding link existence between the current node and each new node. This is done using two functions, $f_{trans}$ and $f_{out}$, according to the following formulations:

$h_i = f_{trans}(x_i, h_{i-1}), \qquad y_i = f_{out}(h_i)$,

where $y_i \in \mathbb{R}^{m}$ is the $i$-th step probabilistic output. Furthermore, the step loss $L_i$ is the binary cross entropy (BCE) between the predicted outputs and the ground-truth labels:

$L_i = -\sum_{j=1}^{m} \left[ y^{gt}_{i,j} \log y_{i,j} + \left(1 - y^{gt}_{i,j}\right) \log \left(1 - y_{i,j}\right) \right]$.

The whole network, including the Graph Feature Learning Network and the RNN, is trained in an end-to-end manner. Algorithm 1 summarizes the training procedure of our SCGG model.

An example showing the SCGG model at training time is presented in Figures 3 and 4, where the graph of Fig. 2 is used as training data. First, the representations of the nodes in both $G_0$ and $G'$ are computed by the Graph Feature Learning Network, which is illustrated in Fig. 3. Then, the obtained representations for $G_0$'s nodes (see the left half of Fig. 3(d)) are given to the RNN in the order specified by $\pi$. Accordingly, as depicted in Fig. 4, in the first RNN step, it is the turn of node 1 (indicated by a yellow circle) to be processed, and thus its features are passed on to the first recurrent unit. The network then estimates the conditional probability distribution $p(\tilde{S}_1 \mid S_1, S_2, S_3)$, i.e., the probability of connecting the yellow node to each of the new nodes (the green and the purple ones). Afterwards, the step loss is calculated by taking the network output and the true labels (the first label is 1 because the yellow and the purple nodes are connected, and the second label is 0 as there is no edge between the yellow and the green nodes). In the second step, the second node's features (indicated by orange color), along with the true labels of the previous (yellow) node and the previous hidden state, are given to the recurrent cell. Then the network outputs an estimation of $p(\tilde{S}_2 \mid S_1, S_2, S_3, \tilde{S}_1)$. The same procedure continues until all nodes of $G$, including the ones in $G_0$ and the set of new nodes (i.e., $\tilde{V}$), are fed into the network. Thus, in the third step, the network outputs the probability $p(\tilde{S}_3 \mid S_1, S_2, S_3, \tilde{S}_1, \tilde{S}_2)$ by taking into account the features of the third (pink) node in graph $G_0$. In the subsequent step, when all the initial graph's nodes have been processed, it is time to go through the new nodes in the order specified by $\sigma$. Thus, the features computed for the first new node (displayed in purple in the right half of Fig. 3(d)) are given to the RNN to generate the probability $p(\tilde{T}_1 \mid S_{1:3}, \tilde{S}_{1:3})$. Next, in the fifth step, the second new node's features (indicated in green) are fed into the recurrent network to produce the probability distribution $p(\tilde{T}_2 \mid S_{1:3}, \tilde{S}_{1:3}, \tilde{T}_1)$.

To elaborate a bit more on Fig. 4, it is worth mentioning that each step's hidden state contains the information of a subgraph of the main graph (i.e., $G$). This subgraph includes the already processed graph nodes and the links connecting them to each other, as well as their connections to each of the new nodes. It also includes links between the current node and the previous ones. For example, in Fig. 4, in the third training step, two nodes (i.e., the yellow and the orange ones) have been processed and the pink node's features are fed into the recurrent unit as part of its input. Hence, the hidden state $h_3$ maintains a subgraph containing the link between the yellow and the orange nodes as well as the links between these nodes and the new nodes (shown by blue lines). It also retains the links between the current (pink) node and both the yellow and the orange ones that have been fed into the network in the first two steps.
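For concreteness, a simplified, single-graph version of this teacher-forced training step could look like the sketch below; the module and argument names are our own, the all-zeros start-of-sequence labels are an assumption, and batching as well as the outer training loop of Algorithm 1 are omitted.

```python
import torch
import torch.nn.functional as F

def training_step(rnn, out_mlp, g0_feats, gprime_feats, gt_links):
    """One teacher-forced pass over a single training graph.
    g0_feats:     (n, d) features of G_0's nodes, computed from G_0 itself.
    gprime_feats: (m, d) features of the new nodes, computed from G'.
    gt_links:     (n + m, m) ground-truth link indicators for every step.
    The GRU's input size is assumed to be d + m (features plus previous labels)."""
    node_feats = torch.cat([g0_feats, gprime_feats], dim=0)   # order: pi first, then sigma
    m = gt_links.size(1)
    prev_labels = torch.zeros(m)        # assumed start-of-sequence labels for the first cell
    hidden, loss = None, 0.0
    for i in range(node_feats.size(0)):
        x = torch.cat([node_feats[i], prev_labels]).view(1, 1, -1)  # current node + previous labels
        out, hidden = rnn(x, hidden)                                # f_trans keeps the history
        probs = torch.sigmoid(out_mlp(out.view(-1)))                # f_out: link probabilities y_i
        loss = loss + F.binary_cross_entropy(probs, gt_links[i])    # step loss L_i
        prev_labels = gt_links[i]                                   # teacher forcing with true labels
    return loss
```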

Inference
In the inference stage, an initial graph $G_0$ is given as the structural condition. Then, using the learned functions $f_{feat}$, $f_{trans}$, and $f_{out}$, the model starts generating the graph $G$ by adding new nodes to $G_0$ and predicting the inter-links between the new nodes and those of $G_0$, as well as the intra-links between the new nodes themselves. Algorithm 2 describes the steps of the SCGG model at inference time. Moreover, Fig. 5 illustrates the inference workflow of the SCGG with a toy example.
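A rough sketch of this two-phase inference procedure is given below; the helper build_g_prime is hypothetical, and the sampling with torch.bernoulli as well as the start-of-sequence labels are assumptions on our part.

```python
import torch

@torch.no_grad()
def complete_graph(feat_net, rnn, out_mlp, a0_hat, x0, m):
    """Phase 1: feed G_0's node features and sample inter-links to the m new nodes.
    Phase 2: build G' from G_0 plus the sampled inter-links, recompute features,
    then feed the new nodes' features and sample the intra-links."""
    feats0 = feat_net(a0_hat, x0)                  # (n, d) features of G_0's nodes
    hidden, prev = None, torch.zeros(m)            # assumed start-of-sequence labels
    inter_links = []
    for i in range(feats0.size(0)):                # phase 1: inter-connections
        x = torch.cat([feats0[i], prev]).view(1, 1, -1)
        out, hidden = rnn(x, hidden)
        prev = torch.bernoulli(torch.sigmoid(out_mlp(out.view(-1))))
        inter_links.append(prev)
    a1_hat, x1 = build_g_prime(a0_hat, x0, inter_links)   # hypothetical helper: G_0 + inter-links
    feats_new = feat_net(a1_hat, x1)[-m:]          # features of the new nodes, computed from G'
    intra_links = []
    for j in range(m):                             # phase 2: intra-connections
        x = torch.cat([feats_new[j], prev]).view(1, 1, -1)
        out, hidden = rnn(x, hidden)
        prev = torch.bernoulli(torch.sigmoid(out_mlp(out.view(-1))))
        intra_links.append(prev)
    return inter_links, intra_links                # G is assembled from G' and the intra-links
```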

Implementation details
The proposed model is implemented using the PyTorch library [59]. As previously discussed, the function $f_{feat}$ consists of a graph convolutional network (GCN) and a Transformer network. In this regard, we use a two-layer GCN with an embedding size of 16 in each layer. A ReLU activation followed by a batch normalization layer is used between the two GCN layers. Besides, our Transformer has one encoder layer with 8 attention heads and a dropout of 0.1. We use 4 layers of GRU cells with a 128-dimensional hidden state to implement the function $f_{trans}$. For the function $f_{out}$, a two-layer multilayer perceptron (MLP) is employed, with 64 hidden units in the middle and a ReLU nonlinearity between the layers. Further, the Adam optimizer is used with a learning rate of 0.003, and the model is trained for 100 epochs with a minibatch size of 32. Moreover, for the choice of $\pi$ and $\sigma$, we use uniform random orderings to maximize an approximation of the marginal likelihood in Eq. 5, which becomes intractable to compute exactly as the size of graphs increases.
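Putting these reported hyperparameters together, a minimal construction sketch could look like the following; the exact module wiring, the input feature size, and the way the per-step label vector enters the GRU input are assumptions.

```python
import torch
import torch.nn as nn

m = 10                                  # number of new nodes (example value, not fixed by the paper)
feat_dim = 2 * 16                       # concatenated GCN + Transformer features (assumption)

feat_net = GraphFeatureLearningNetwork(in_dim=16, gcn_dim=16, n_gcn_layers=2, n_heads=8)
rnn = nn.GRU(input_size=feat_dim + m,   # node representation concatenated with previous labels
             hidden_size=128, num_layers=4)
out_mlp = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, m))

optimizer = torch.optim.Adam(
    list(feat_net.parameters()) + list(rnn.parameters()) + list(out_mlp.parameters()),
    lr=0.003)
# Training then runs for 100 epochs with minibatches of 32 graphs, as reported above.
```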

Experiments
In this section, we first elaborate on both the synthetic and the real-world datasets used for evaluation purposes. Then, we outline the state-of-the-art baselines with which we compare our SCGG model. Next, the evaluation metric is explained, followed by a description of the experimental setup. Finally, we discuss the results of our proposed approach, as well as those of the competing methods.

Datasets
We evaluate the performance of our proposed method on a variety of synthetic and real-world datasets. In the following, we provide a brief description of each dataset. Moreover, Table 2 summarizes their key statistics.
• Grid: It is a synthetic dataset consisting of standard 2D grid graphs.
• IMDBBINARY: This dataset consists of ego-networks derived from actor/actress collaborations based on the information of movies belonging to the Action and Romance genres on IMDB. For each graph, nodes represent actors/actresses, and if a pair of them appears in the same movie, a link connects their corresponding nodes in the graph.
• IMDBMULTI: The same explanation given for the IMDBBINARY dataset is valid for this dataset as well, except that the movies belong to the Comedy, Romance, and Sci-Fi genres.
• Enzymes: This dataset consists of graphs, each representing a protein tertiary structure from the BRENDA enzyme database [60]. More precisely, a graph's nodes represent secondary structure elements (SSEs), and an edge connects two nodes if their corresponding SSEs are neighbors along the amino acid sequence or one of the three nearest neighbors in space.
• NCI1: It is a biological graph dataset published by the National Cancer Institute (NCI). Each graph in the dataset represents a chemical compound screened for its activity against the growth of human tumors.
• Protein: This dataset contains protein graphs [61]. Each graph represents a protein with nodes corresponding to amino acids. If the distance between two amino acids of a protein is less than 6 Angstroms, their corresponding nodes are connected in the graph.

State-of-the-art approaches
We compare our approach with several well-known state-of-the-art methods, explanations of which are provided in the following.
• KronEM [49]. This is an old and well-known network completion method that combines the Expectation-Maximization (EM) framework with the Kronecker graphs model [62] to infer missing nodes and their corresponding edges in partially observed graphs. To do this, in each EM iteration, the method first utilizes the observed part of a graph to estimate model parameters (the M-step), and then it infers the missing part of that graph using the estimated model (the E-step).
• GraphRNN-S [29]. This is a well-known autoregressive deep graph generator that first transforms graphs into sequences and then models the corresponding data distribution using RNNs. At each step, the method adds a new node to the currently generated graph and predicts the links connecting it to the previous nodes. Aside from that, GraphRNN-S makes the simplifying assumption that a node's links are independent of each other, and therefore models them with a multi-layer perceptron.
• GraphRNN [29]. This is the full GraphRNN model, which is relatively similar to GraphRNN-S, with the difference that it does not make the simplifying assumption of edge independence. Therefore, to capture the interdependencies between a node's edges, it employs another recurrent neural network called the edge-level RNN.
• DeepNC [53]. This is the most recent graph completion baseline, which utilizes a deep generative model of graphs, namely GraphRNN-S, to infer the missing parts of a partially observed network. To this end, the method first learns a likelihood over the data by training the GraphRNN-S model. Then, it proposes a sequence of algorithmic steps to recover the network in a greedy fashion, trying to maximize the learned likelihood. It should be noted that although this method uses the probabilities generated by a deep generative model of graphs to make algorithmic decisions, it is not considered a fully deep learning-based approach. A model specifically trained to address the problem of graph completion can be expected to achieve higher performance.
• EvoGraph [63]. This is a graph upscaling method, which expands an initial input graph $G_0 = (V_0, E_0)$ in stages by adding $|E_0|$ new edges at each stage. The method considers a set of candidate new nodes in every expansion phase, and adds each new edge by choosing one of its endpoints from the current nodes and the other from the candidate ones. In order to provide a fair comparison between EvoGraph and the other methods, we make a slight change to its upscaling process by terminating it right after the insertion of the $m$-th new node.

Evaluation metric
Similar to [53], we use Graph Edit Distance (GED) [64] as the evaluation metric to assess the performance of our SCGG method and the baselines. In this regard, if we denote a generated or completed graph by $\hat{G}$ and its corresponding ground truth graph by $G$, the GED between these two graphs, which shows how dissimilar they are, can be formalized as follows:

$GED(\hat{G}, G) = \min_{(e_1, \ldots, e_k) \in \Upsilon(\hat{G}, G)} \sum_{i=1}^{k} c(e_i)$,

where $\Upsilon(\hat{G}, G)$ is the set of all edit paths converting $\hat{G}$ to a graph that is isomorphic to $G$. Moreover, $c(e_i)$ is the cost of an edit operation $e_i$, which, in the same way as [53], we set to 1 for all operations. Additionally, as in [53], we normalize the GED computed for each pair of graphs by the average of their sizes. Along with this brief overview of GED, one important point to note is that enumerating all the discussed edit paths requires a combinatorial search procedure with exponential time complexity, and therefore finding the exact solution to this problem is NP-complete [65]. Hence, we utilize an approximation approach [66] for computing GED scores.
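As an illustration, an approximate normalized GED in the spirit of this protocol can be computed with NetworkX as shown below; the early-stopping threshold and the use of the number of nodes as the graph "size" are our assumptions, and NetworkX's approximation differs from the algorithm of [66] used in the paper.

```python
import networkx as nx

def approximate_normalized_ged(g_hat: nx.Graph, g: nx.Graph, max_iters: int = 5) -> float:
    """Approximate GED normalized by the average number of nodes of the two graphs."""
    # optimize_graph_edit_distance yields a sequence of decreasing upper bounds on the GED.
    approximations = nx.optimize_graph_edit_distance(g_hat, g)
    ged = None
    for i, value in enumerate(approximations):
        ged = value
        if i + 1 >= max_iters:        # stop early; exact GED is intractable in general
            break
    avg_size = (g_hat.number_of_nodes() + g.number_of_nodes()) / 2
    return ged / avg_size
```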

Experimental setup
In addition to what we have explained in Section 4.6 concerning the implementation details of our SCGG model, in this subsection we elaborate on the remaining details of the experimental setup. In this respect, to train our model, we select a random subset of 80% of the graphs in each dataset. A similar approach is also followed to train the other learning-based baselines (i.e., GraphRNN-S and GraphRNN). We then use the remaining 20% of the graphs for model testing. More specifically, for each graph in the test set, we perform the following two steps for 10 iterations:
• We randomly choose $m$ nodes from the original test graph and remove these nodes and their associated edges to acquire a subgraph $G_0$.
• We then feed the obtained subgraph to all the competing methods and compare their results to the ground truth graph $G$.
Afterwards, for each graph in the test data, we average the GED scores calculated in the 10 iterations and compute their standard deviation. Finally, for each value of the parameter $m$, we report the average of the GED scores, as well as the average of the standard deviations computed over the whole test set.
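A sketch of this per-graph evaluation loop is given below; the model.complete call is a hypothetical stand-in for whichever competing method is being evaluated, and the sketch reuses the helper functions introduced earlier.

```python
import random
import networkx as nx

def evaluate_on_test_graph(model, graph: nx.Graph, m: int, n_iters: int = 10):
    """Remove m random nodes to obtain G_0, let the model complete the graph, and
    score the result against the ground truth with the normalized approximate GED."""
    scores = []
    for _ in range(n_iters):
        removed = random.sample(list(graph.nodes()), m)
        kept = [v for v in graph.nodes() if v not in set(removed)]
        g0 = graph.subgraph(kept).copy()
        g_hat = model.complete(g0, m)                      # hypothetical completion call
        scores.append(approximate_normalized_ged(g_hat, graph))
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return mean, std
```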

Results and discussion
In this subsection, the experiments conducted to evaluate the performance of our proposed method against the baselines are presented in three parts. In the first part, we set the maximum possible value for the parameter $m$ such that the competing methods can be evaluated on all datasets. Then, we compare the obtained results and report the gain of SCGG over the baselines.
In the second part, we discretely change the value of $m$ from the lowest to the highest possible amount such that all datasets can be utilized for model testing. Then we study how the performances of the various methods are affected by increasing the value of $m$. Finally, in the third part, we raise $m$ to much higher values and evaluate the efficacy of all approaches on the dataset that offers this possibility.
We first analyze the performance of the different methods for the case where $m = 10$. The reason for choosing this value of $m$ is that, as outlined in Table 2, the minimum number of nodes among the graphs of all datasets is 11. Hence, to construct initial graphs $G_0$, a maximum of 10 nodes can be removed from the original graphs. We report the obtained results in Table 3, from which it is evident that for all datasets, SCGG is the best performing method in terms of the lowest average GED score. More precisely, SCGG obtains an average gain of 51.74% over the other approaches based on the experiments conducted on all datasets, with the lowest gain value of 2.65% and the highest gain of 88.15%. Furthermore, in most cases, the standard deviations of our results are lower than those of the baselines.
Besides, the results of Table 3 reveal that KronEM does not perform well in general; unlike the other methods, its average GED is never lower than 0.52. There can be several reasons for this. First, unlike SCGG, GraphRNN-S, GraphRNN, and to some extent DeepNC, this method is not trained on a dataset of graphs; rather, it processes each graph in the test set separately, i.e., it completes the structure of each partially observed graph based solely on the available part of it. Another reason for the underperformance of KronEM might be the fact that the Kronecker graphs model generates graphs with $2^k$ nodes. Therefore, when an initial graph $G_0$ is given to KronEM, it increases the number of its nodes to the nearest power of 2. This can lead to a significant difference between the ground truth and the completed graph regarding the number of nodes, thereby causing the GED score to rise.
In addition to what we have discussed so far regarding the results in Table 3, they also indicate that EvoGraph considerably underperforms on the IMDBBINARY and IMDBMULTI datasets. This is because the upscaling process of EvoGraph tends to establish connections with new nodes that have not yet been linked to the graph. In other words, adding new edges is performed with a high priority on connecting new nodes to the already generated graph, meaning that setting up more connections between the previously added nodes and the nodes of the initial graph $G_0$ is carried out with a relatively low priority. Thus, it is not surprising that the graphs produced by EvoGraph generally contain fewer edges than the ones belonging to the IMDBBINARY or IMDBMULTI datasets, which, according to the statistics listed in Table 2, have low edge sparsity. In light of this, we can expect a decrease in the performance of EvoGraph on these two datasets.
In the second part of the experiments, we vary the value of $m$ discretely from 1 to 10 and study the performance of the different methods as a function of the parameter $m$. In this regard, Figures 6, 7, 8, 9, 10, and 11 demonstrate the obtained results on the Grid, IMDBBINARY, IMDBMULTI, Enzymes, NCI1, and Protein datasets, respectively. Moreover, since parts of the results visually overlap, which may affect their readability, we provide the readers with another view of them: a pairwise comparison between our SCGG approach and each of the baselines is depicted in a separate subplot for all datasets, and this second appearance of the results in Figures 6, 7, 8, 9, 10, and 11 can be found in the corresponding supplementary figures.
Fig. 6 shows the effect of increasing the value of $m$ on the performance of the various methods on the Grid dataset. According to these results, the GED values of most methods (i.e., SCGG, GraphRNN-S, GraphRNN, and EvoGraph) increase almost uniformly with the growth of $m$, which makes sense since as $m$ increases, the task becomes more difficult. A noteworthy point here is that our proposed SCGG approach performs the best (lowest GED score). In addition, as $m$ gets higher values, the GED of our approach increases with a lower slope. This figure also demonstrates the poor performance of KronEM (both in terms of the relatively high average GED score and the high standard deviations), which is in accordance with what we discussed before. The results also indicate that DeepNC underperforms on the Grid dataset. This may be due to the fact that DeepNC, unlike the other competitors, does not conduct its processing steps by taking into account the whole initial graph $G_0$ at once. To put it another way, the other methods receive an initial graph $G_0$ and start adding new nodes on top of it. Meanwhile, DeepNC constructs the graph from scratch, and at each stage it randomly decides whether to choose the next node from the set of initial graph nodes or to add a new one. Therefore, since the graphs of the Grid dataset follow a highly regular structural pattern, not considering the whole information of the initial graphs at once prior to processing can lead to the performance drop of DeepNC by constructing graphs that are substantially different from the expected ones.
Figures 7 and 8 show the results obtained on the IMDBBINARY and IMDBMULTI datasets, respectively. They reveal that for all values of $m$, SCGG outperforms the baselines. It is also evident from these results that EvoGraph achieves the worst performance among the competitors. This, as explained earlier, can be due to the tendency of EvoGraph to complete the graph structures by adding a small number of edges to $G_0$, which is in contrast to the non-sparsity of the graphs belonging to these two datasets.
The results on the Enzymes and NCI1 datasets are depicted in Figures 9 and 10, respectively. Since these two datasets share relatively similar statistical properties, as listed in Table 2, somewhat similar results are observed on them. In this regard, our SCGG approach achieves the best performance compared to the other methods. Specifically, in almost all cases it offers the lowest average GED score. Moreover, in the vast majority of circumstances, the standard deviations of the results obtained by our method are lower compared to the other approaches. These results also demonstrate that GraphRNN-S and GraphRNN perform the worst as the value of $m$ increases. This is because these two are general graph generation approaches, which are not specifically designed to solve problems such as structure-conditioned graph generation or graph completion. Therefore, although they have achieved acceptable performance in some cases, it is not surprising that in some other cases they perform poorly compared to the baselines. Finally, Fig. 11 depicts the results on the Protein dataset, in which the value of $m$ varies discretely from 1 to 10. The results indicate that the SCGG method obtains a lower GED than the baseline methods in almost all cases, and as the value of $m$ goes up, this performance superiority more clearly manifests itself. In addition, the weak performance of KronEM can be evidently seen in these results, the reasons for which have been discussed in detail previously.
In the third part of the experiments, we study the performance of all competing approaches in the case where a much larger number of nodes are to be added to the initial graphs $G_0$. Accordingly, we conduct the experiments on the Protein dataset, which, due to the large size of its graphs, gives us this opportunity. More precisely, we increase the value of the parameter $m$ from 10 to 90 (i.e., the maximum possible value that does not exceed the minimum number of nodes in this dataset) in steps of 10. The results of these experiments are illustrated in Fig. 12.
As we can see, our method achieves the best results in terms of the lowest GED score for all values of $m$. Furthermore, in the majority of cases, and especially as $m$ gets higher values, our results show smaller standard deviations than those of the other approaches. We can also observe that for the higher values of $m$, for which both the tasks of graph completion and structure-conditioned graph generation become much more challenging, the performance of GraphRNN-S, GraphRNN, and EvoGraph deteriorates rapidly. This can be interpreted according to the fact that these approaches are not particularly designed to address such tasks. Conversely, as the parameter $m$ rises to its highest values, SCGG, DeepNC, and KronEM offer the best results, respectively. Another perspective of the results in Fig. 12 can be found in S7 Fig, providing the readers with a pairwise comparison of our SCGG model and each of the baselines.

Conclusions
In this work, we have presented SCGG, a novel structure-conditioned graph generation approach that autoregressively generates a graph by adding new nodes and their corresponding edges on top of a given initial substructure $G_0$. Specifically, the architecture of our model consists of a dedicated graph representation learning network, which is mainly responsible for considering the conditioning substructure, and an autoregressive generative model (i.e., a recurrent neural network) that mostly maintains the generation history. We have then employed this model to address the intrinsically hard-to-solve problem of network completion, in which the goal is to complete the structure of a partially observed graph, some of whose nodes are totally unknown.
To demonstrate the superiority of our proposed SCGG model, we have conducted extensive experiments on both synthetic and real-world datasets and compared the performance of our method against state-of-the-art baselines for the task of graph completion. The experimental results illustrate that SCGG outperforms the baselines in terms of the GED score, which indicates that the graphs generated by our model are, on average, the closest to the ground truth graphs. To the best of our knowledge, this is the first time a completely deep learning-based approach has addressed the graph completion problem. Potential research pathways to be explored in the future include extending the SCGG model so that it can be used for molecular graph generation, in which the existence of predetermined chemical substructures in the final designed molecules confers specific chemical properties to them. Furthermore, another future research direction is to enhance model scalability, so that SCGG can generate even larger graphs.

Fig 1 .
Fig 1. An illustration of the Graph Feature Learning Network and its workflow. (a) An input graph. (b) The Graph Convolutional Network. (c) Continuous representations learned for graph nodes by the GCN. (d) The Transformer network that takes the node embeddings computed by the GCN as input and outputs new contextualized features of the graph nodes. (e) The node features learned by the Transformer network (shown using small squares colored with radial gradients). (f) The final representations of graph nodes, acquired by concatenating the embeddings computed by the GCN and the Transformer network. Here, dashed arrows are drawn to easily track which sub-features a final node feature consists of.

Fig 2 .
Fig 2. An illustration of the procedure of preparing the training data. (a) An input graph. (b) A number of nodes are selected at random to be further treated as the new nodes. In this picture, $m = 2$ and the selected nodes (i.e., the green and the purple ones) are shown with thick borders. Furthermore, the inter-connections between new nodes and those in $G_0$ are depicted by blue lines, and the only intra-connection between the new nodes is shown using a red line. (c) An ordering $\pi$ is applied to the nodes in $G_0$. Moreover, another node ordering, denoted by $\sigma$, is applied to the new nodes.

Algorithm 1 .
Algorithm 1. Training algorithm of the SCGG model. Input: a dataset of training graphs and the number of new nodes $m$. Output: the learned functions $f_{feat}$, $f_{trans}$, and $f_{out}$. For each training graph, $G_0$ and $G'$ are first built; then, for a number of training iterations, the teacher-forced training step described in Section 4.4 is performed on every graph.

Fig 3 .
Fig 3. An overview of the workflow employed to obtain the required node features in the training phase. (a) An input training graph after applying the preparation procedure shown in Fig. 2. (b) Two versions are made from the main graph. The one on the left will be treated as the initial graph (i.e., $G_0$), and the graph on the right, which we denote in the paper by $G'$, is obtained from the original graph by removing the intra-connection between the new nodes, i.e., the red link. (c) The Graph Feature Learning Network, whose architecture is illustrated in detail in Fig. 1. (d) The features computed for each node of the graphs. The ones around which blue dashed ovals are drawn will be further used by the RNN.

Fig 4 .
Fig 4. An example of the SCGG model at training time. For each graph node, including those in the initial graph (i.e., $G_0$) and the ones in the set of new nodes (i.e., $\tilde{V}$), the model outputs a probability distribution of link existence between that node and each new node (the probabilistic outputs are depicted by grey squares; the darker the colors, the higher the probabilities). To do this, at each step, a recurrent unit takes the features computed for one of the graph nodes (see Fig. 3), as well as the previous node's true connections and the hidden state of the previous recurrent unit. In this regard, the nodes of $G_0$ (ordered by $\pi$) are first fed into the model, followed by the new nodes (ordered by $\sigma$). Thus, the model learns to first generate the inter-links between the new nodes and those of $G_0$, and then predict the intra-links between the new nodes. The parameters of both the Graph Feature Learning Network and the RNN are updated by minimizing the total loss, which is obtained by aggregating the step losses $L_i$.

Fig 5 .
Fig 5. An example illustrating the SCGG model at inference time. In this example, $m = 3$ and a graph $G_0$ consisting of two nodes is given to the model as the structural condition. At first, the Graph Feature Learning Network computes representations for $G_0$'s nodes, which are then used as part of the RNN input. Next, the RNN proceeds for two steps and outputs the probabilities of the inter-connections between these two nodes and each of the new nodes. Then, all the inter-links are generated by sampling from the produced probabilities. At this point, it is time to construct the graph $G'$ based on $G_0$ and the generated links. Next, $G'$ is passed into the Graph Feature Learning Network to calculate the representations of its nodes. In this step, the representations of the new nodes are given to the RNN one by one in order to generate the intra-connections. Finally, the graph $G$ is constructed on top of $G'$ by considering the generated intra-links.

Fig 9 .
Fig 9. Performance comparison on the Enzymes dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., $m$).

Fig 10 .
Fig 10. Performance comparison on the NCI1 dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., $m$).

Fig 11 .
Fig 11. Performance comparison on the Protein dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., $m$).

Fig 12 .
Fig 12. Performance comparison on the Protein dataset in terms of GED (lower is better) as a function of the parameter $m$, which varies discretely from 10 to 90 in steps of 10.

Table 2 . Statistics of datasets used in the experiments.

Table 3 . Comparison of SCGG with its competitors for $m = 10$.

Fig 6. Performance comparison on the Grid dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., $m$).
Fig 7. Performance comparison on the IMDBBINARY dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., $m$).
Fig 8. Performance comparison on the IMDBMULTI dataset in terms of GED (lower is better) as a function of the number of new nodes to be added (i.e., $m$).