Learning Temporal Attention in Dynamic Graphs with Bilinear Interactions

Graphs evolving over time are a natural way to represent data in many domains, such as social networks, bioinformatics, physics and finance. Machine learning methods for graphs, which leverage such data for various prediction tasks, have seen a recent surge of interest and capability. In practice, ground truth edges between nodes in these graphs can be unknown or suboptimal, which hurts the quality of features propagated through the network. Building on recent progress in modeling temporal graphs and learning latent graphs, we extend two methods, Dynamic Representation (DyRep) and Neural Relational Inference (NRI), for the task of dynamic link prediction. We explore the effect of learning temporal attention edges using NRI without requiring the ground truth graph. In experiments on the Social Evolution dataset, we show semantic interpretability of learned attention, often outperforming the baseline DyRep model that uses a ground truth graph to compute attention. In addition, we consider functions acting on pairs of nodes, which are used to predict link or edge representations. We demonstrate that in all cases, our bilinear transformation is superior to feature concatenation, typically employed in prior work. Source code is available at https://github.com/uoguelph-mlrg/LDG.


Introduction
Graph structured data arise from fields as diverse as social network analysis, epidemiology, finance, and physics, among others [1,2,3,4]. A graph $G = (V, E)$ comprises a set of $N$ nodes, $V$, and the edges, $E$, between them. For example, a social network graph may consist of a set of people (nodes), and the edges may indicate whether two people are friends. Recently, graph neural networks (GNNs) (e.g., [5,6,7,8,9]) have emerged as a key modeling technique for learning representations of such data. These models use recursive neighborhood aggregation to learn latent features, $Z^{(l+1)} \in \mathbb{R}^{N \times c}$, of nodes for some layer, $l + 1$, given node features, $Z^{(l)} \in \mathbb{R}^{N \times d}$, of the previous layer, $l$ [7]:

$$Z^{(l+1)} = \sigma\left(A Z^{(l)} W\right), \tag{1}$$

where $A \in \mathbb{R}^{N \times N}$ is an adjacency matrix of graph $G$, normalized in a particular way, depending on the specific GNN variant [7,10]. $W \in \mathbb{R}^{d \times c}$ are trainable parameters; $d$, $c$ are input and output dimensionalities, and $\sigma$ is an element-wise nonlinearity, such as ReLU. Different extensions of this method have demonstrated considerable success at tasks like graph/node classification and link prediction [11,12,13]. The local aggregation operator in (1) was derived from spectral graph convolution [6] and motivated by the success of convolutional neural networks (CNNs) in dominating vision and audio tasks, such as image and speech recognition. However, CNNs are limited to Euclidean space, where translation is well-defined [1], while GNNs are more flexible and can also be applied to non-Euclidean data, such as graphs and sets.
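The aggregation step in (1) can be illustrated with a few lines of plain Python. This is only a minimal sketch: the toy graph, the row normalization of $A$ with self-loops, and the helper names `matmul`, `relu` and `gnn_layer` are assumptions made for illustration, not part of the method described here.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def relu(X):
    """Element-wise ReLU nonlinearity."""
    return [[max(0.0, x) for x in row] for row in X]


def gnn_layer(A, Z, W):
    """One aggregation step: Z_next = ReLU(A Z W)."""
    return relu(matmul(matmul(A, Z), W))


# Path graph 0 - 1 - 2 with self-loops, row-normalized (assumed normalization).
A = [[1 / 2, 1 / 2, 0.0],
     [1 / 3, 1 / 3, 1 / 3],
     [0.0, 1 / 2, 1 / 2]]
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 nodes, d = 2 features
W = [[1.0, -1.0], [0.5, 0.5]]             # d = 2 inputs, c = 2 outputs

Z_next = gnn_layer(A, Z, W)  # N x c matrix of updated node features
```

Each row of `Z_next` mixes the features of a node with those of its neighbors before the shared linear map `W` is applied, which is the neighborhood aggregation referred to in the text.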
*Equal contribution. Corresponding author: Boris Knyazev (e-mail: bknyazev@uoguelph.ca).

The focus of GNNs thus far has been on static graphs, that is, graphs with a fixed structure. However, a key component of network analysis is often to predict the state of an evolving graph $G_\tau$ at time $\tau$. For example, knowledge of the evolution of person-to-person interactions during an epidemic facilitates analysis of how a disease spreads [14], and can be expressed in terms of the links between people in a dynamic graph. Other applications include predicting whether two people will become friends at time $\tau$, predicting locations of players (nodes) or some interaction between them in team sports, such as basketball or soccer [15,16], and others.
Previous approaches for representation learning over dynamic graphs, such as DyRep [17], have assumed the entire dynamic graph structure is known (i.e., no edges are missing and there are no redundant connections). Methods for learning latent graphs, such as Neural Relational Inference (NRI) [18], have focused primarily on the fixed graph setting, which does not support addition or deletion of nodes or the complex multimodal interactions between them.
Our approach simultaneously infers graph structure (as in NRI) while learning the dynamics of the graph via a GNN (as in DyRep). We explore the use of a learned representation of the underlying graph in lieu of the ground truth, inspired by NRI. We use a learned temporal attention matrix, which can also be interpreted as a graph, to drive graph dynamics. In this temporal attention matrix, we use a bilinear relationship instead of DyRep's concatenation to permit more complex relationships between node representations. We then apply this method to the task of dynamic link prediction on the Social Evolution dataset [19].
We begin by discussing related work in § 2, then proceed to relevant background information in § 3 before a description of our model in § 4. Our experiments, with results and insights, are given in § 5, and we conclude with a brief discussion of the implications of our findings in § 6.

Related work
Prior work [20,21,22,23,24,16,18] addressing the problem of learning from dynamic graphs has tended to develop methods that are very specific to the task at hand, with only a few shared ideas. This is primarily due to the difficulty of learning from temporal data in general and temporal graph-structured data in particular, which remains an open problem [3] that we address in this work.
Given an evolving graph $[\ldots, G_{t-1}, G_t, G_{t+1}, \ldots]$, where $t \in [0, T - 1]$ is a discrete index at a continuous time point $\tau$, most prior work uses some variant of a recurrent neural network (RNN) to update node embeddings over time [25,22,24,18,16]. The weights of RNNs are typically shared across nodes [25,22,24,18]; however, in the case of smaller graphs, a separate RNN per node can be learned to better tune the model [16]. RNNs are then combined with some function that aggregates features of nodes. While this can be done without explicitly using GNNs (e.g., using the sum or average of all nodes instead), GNNs impose an inductive relational bias specific to the domain, which typically improves performance [22,18,24,16].
Closely related to our work, there are a few applications where the graph $G_t$ is considered to be either unknown or suboptimal, so it is inferred simultaneously with learning the model [22,26,18]. Among them, [22,26] focus on visual data, while NRI [18] proposes a more general framework and, therefore, is adopted in this work.
DyRep [17] is a method that has been recently proposed for learning from dynamic graphs based on temporal point processes [27]. This method:
• supports two time scales of graph evolution (i.e., long-term connections to allow addition and removal of nodes and edges, and short-term connections to inform future changes to the graph);
• operates in continuous time;
• scales well to large graphs; and
• is data-driven due to employing a GNN similar to (1).
These key advantages make DyRep favorable, compared to other methods discussed previously. In this work, we improve on DyRep by adding bilinear interactions, and we study the ability to learn a temporal attention matrix to inform connections using NRI [18] simultaneously with the DyRep model.

Background
Here we describe relevant details of the DyRep model. A complete description of DyRep can be found in [17]. DyRep is a representation framework for dynamic graphs, which assumes that graphs evolve according to two elementary processes:
• Long-term association, in which nodes and edges are added or removed from the graph affecting the evolving adjacency matrix $A^t \in \mathbb{R}^{N \times N}$.
• Communication, in which nodes communicate over a short time period, whether or not there is an edge between them.
For example, in a social network, association may be represented as one person adding another as a friend. A communication event may be an SMS message between two friends (an association edge exists) or an SMS message between people who are not friends yet (an association edge does not exist). These two processes interact to fully describe information transfer between nodes.
More formally, an event is a tuple $o_t = (u, v, \tau, k)$ of type $k \in \{0, 1\}$, with $k = 0$ corresponding to association and $k = 1$ to communication, between nodes $u$ and $v$ at continuous time point $\tau$ with time index $t$. This interaction can be expressed as three recursively executed functions, (2), (3a)-(3b) and (4), where $z^t_u, z^t_v \in \mathbb{R}^d$ are the current $d$-dimensional embeddings of nodes $u$ and $v$; $Z^{t-1} = [z^{t-1}_1, z^{t-1}_2, \ldots, z^{t-1}_N] \in \mathbb{R}^{N \times d}$ are node embeddings at previous time $t - 1$; $f_\lambda$ and $f_z$ are based on neural networks; $f_S$ is a function updating temporal attention over edges $S^t \in \mathbb{R}^{N \times N}$ in the graph, which is implemented as a hard-coded algorithm in DyRep [17]; $N_i$ are the one-hop neighbors of the other node participating in the event at time $t$ (see details below following (5)); $A^{t-1}$ is the adjacency matrix at time $t - 1$; $\lambda^t$ is the conditional intensity of events between nodes $u$ and $v$; and $[\cdot, \cdot]$ is the concatenation operator.
This formulation is similar to recurrent networks with relational inductive bias, e.g., [28,24,16], but here, association and communication events are modelled in continuous time through a two time-scale deep temporal point process. The conditional intensity function of the point process, $\lambda^t$ (2), which models the frequency at which events occur between nodes, acts in conjunction with a deep network $f_z$ that computes node embeddings $Z$ (3a), (3b). Together with temporal attention $S^t$ (4), these components create a learned representation of a dynamic graph that can be used for tasks like dynamic link prediction. The relationship between learned node representations in turn drives graph dynamics through the conditional intensity function.
To better understand DyRep, it is important to expand (3a) and (3b), which define the evolution of node embeddings. Embeddings $z^t_v$ and $z^t_u$ are updated in the same way, so below we only expand the update for $z^t_v$ (3a). In particular, for an event $o_t$ between nodes $u$ and $v$, the embeddings of both nodes are updated based on the summation of the following three terms, followed by a nonlinearity $\sigma$:

$$z^t_v = \sigma\left(W^{struct} h^{struct,t-1}_u + W^{rec} z^{t_v - 1}_v + W^{temp} \left(\tau - \tau^{t_v - 1}\right)\right). \tag{5}$$

The first term of Equation (5) is the "Localized Embedding Propagation", which comprises learned parameters $W^{struct} \in \mathbb{R}^{d \times d}$ multiplied by a learned hidden representation. This representation, $h^{struct,t-1}_u$, is a function of temporal attention $S^{t-1}_u \in \mathbb{R}^N$ between node $u$ and all its one-hop neighbors $N_u(t)$ [17]. Using features of node $u$'s neighbors to update node $v$'s embeddings is important for creating a temporal edge by which node features propagate between the two nodes. The idea is that the more frequently the nodes communicate, the more similar their embeddings will be, with a weak dependency on whether or not there is a long-term edge between them, induced by attention.

The second term is the "Self-Propagation", which comprises learned parameters $W^{rec} \in \mathbb{R}^{d \times d}$ multiplied by the previously computed embedding of node $v$ at time index $t_v - 1$, when node $v$ last participated in an association or communication event. This term performs a recurrent update of the features of node $v$.

The third term is the "Exogenous Drive", which comprises the learned parameters $W^{temp} \in \mathbb{R}^d$ multiplied by the waiting time between the current event, $\tau$, and the previous event involving node $v$, $\tau^{t_v - 1}$. This term captures other forces acting on the graph that may influence the embedding of node $v$, such as global events involving many nodes.

Figure 1. Overview of our approach relative to DyRep [17], in the context of dynamic link prediction. During training, events $o_t$ are observed, affecting node embeddings $Z$. In contrast to DyRep, which updates attention weights $S^t$ in a predefined hard-coded way based on associative connections $A^t$, such as CloseFriend, we assume that graph $A^t$ is unknown and our latent dynamic graph (LDG) model based on NRI [18] infers $S^t$ by observing how nodes communicate. We show that learned $S^t$ has a close relationship to certain associative connections. Best viewed in colour.
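The three-term update described above can be sketched in plain Python as follows. This is a hedged illustration only: the sigmoid is an assumed choice for $\sigma$, the structural representation `h_struct_u` is taken as a given input rather than computed from attention, and all function and variable names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def update_embedding(W_struct, h_struct_u, W_rec, z_v_prev, w_temp, tau, tau_v_prev):
    """New embedding of node v after an event with node u at time tau."""
    propagation = matvec(W_struct, h_struct_u)   # Localized Embedding Propagation
    self_prop = matvec(W_rec, z_v_prev)          # Self-Propagation
    wait = tau - tau_v_prev                      # time since v's previous event
    exogenous = [w * wait for w in w_temp]       # Exogenous Drive
    return [sigmoid(a + b + c) for a, b, c in zip(propagation, self_prop, exogenous)]


# Toy values with d = 2 (all numbers are arbitrary).
W_struct = [[1.0, 0.0], [0.0, 1.0]]
W_rec = [[0.5, 0.0], [0.0, 0.5]]
w_temp = [0.1, -0.1]
z_new = update_embedding(W_struct, [0.2, -0.1], W_rec, [1.0, 2.0], w_temp,
                         tau=3.0, tau_v_prev=1.0)
```

The same function would be called a second time with the roles of $u$ and $v$ swapped to update the other node involved in the event.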
Additionally, the component $f_S$ from Equation (4) merits further discussion, as it is a hard-coded attention layer that is computed pairwise. In particular, the temporal attention matrix $S^t$ is only updated if a communication event occurs ($k = 1$) and an association exists between the two nodes under consideration ($A^{t-1}_{u,v} = 1$); otherwise it is left unchanged. This function is computed as a softmax over the attention given by node $u$ to its one-hop neighbors at time $t$. In this method, attention is only given to the node from the neighborhood of $u$ that contributes the most information. No learned parameters directly contribute to the computation of the temporal attention matrix, which limits information propagation.
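The condition under which DyRep's hard-coded attention is updated can be written down directly from the description above. The sketch below covers only this gating condition, not the attention update itself; the `Event` container and the function name are hypothetical.

```python
from collections import namedtuple

# An event o_t = (u, v, tau, k); k = 0 is association, k = 1 is communication.
Event = namedtuple("Event", ["u", "v", "tau", "k"])
ASSOCIATION, COMMUNICATION = 0, 1


def attention_is_updated(event, A_prev):
    """True only for a communication event between already associated nodes."""
    return event.k == COMMUNICATION and A_prev[event.u][event.v] == 1


# Adjacency at time t - 1: nodes 0 and 1 are associated, node 2 is isolated.
A_prev = [[0, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]

sms_between_friends = Event(u=0, v=1, tau=12.5, k=COMMUNICATION)
sms_between_strangers = Event(u=0, v=2, tau=13.0, k=COMMUNICATION)
new_friendship = Event(u=0, v=2, tau=14.0, k=ASSOCIATION)
```

Only the first of the three example events would trigger an attention update, which shows how strongly the hard-coded rule depends on a known adjacency matrix.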
In this paper, we extend the DyRep algorithm in two ways. First, we examine the benefits of a learned representation of the underlying graph in (4) by using a variational autoencoder, in the style of [18]. This permits learning of a sparse representation of the interactions between nodes instead of using a hard-coded function $f_S$ and known adjacency matrix $A$ (§ 4.1). Second, both the original DyRep work [17] and [18] use concatenation to make predictions for a pair of nodes (see (2) above and (6) in [18]), which only captures relatively simple relationships. We are interested in the effect of allowing a more expressive relationship to drive graph dynamics, and specifically to drive temporal attention (§§ 4.1 and 4.2). We demonstrate the utility of our model by applying it to the task of dynamic link prediction (§ 5).

Model
Recently, [18] proposed Neural Relational Inference (NRI), showing that in some settings, models that use a learned representation, in which the original graph structure is discarded, can outperform models that use the ground truth graph. A learned sparse graph representation would keep the most salient features, i.e., only those connections that are necessary for the downstream task, whereas the underlying graph may have redundant connections. For example, in a human motion capture dataset, such as that explored by [18], the human body has connections from the hip to the knee, knee to ankle, etc. To predict how a person walks, the connection between, say, the foot and a specific toe, may be unnecessary. In other applications, the underlying graph might be unknown, so by learning it, we can reveal structural interactions between nodes, which can improve on a fully-connected graph or other heuristics.
While NRI learns latent graph structure from observing node movement, we learn the graph by observing how nodes communicate. In this spirit, we repurpose the encoder of NRI, combining it with DyRep, which gives rise to our latent dynamic graph (LDG) model, described below in more detail (Fig. 1). We also summarize our notation in Table 1 at the end of this section.

Bilinear encoder
DyRep's encoder (4) requires a graph structure represented as an adjacency matrix $A$. We propose to replace this with a sequence of learnable functions $f^{enc}_S$, borrowed from [18], that only require node embeddings as input:

$$S^t = f^{enc}_S\left(Z^{t-1}\right). \tag{6}$$

Given an event between nodes $u$ and $v$, our encoder takes the embedding of each node $j \in V$ at the previous time step, $z^{t-1}_j$, as input, and returns an edge embedding $h^2_{(u,v)}$ between nodes $u$ and $v$ using two passes of node and edge mappings, denoted by superscripts 1 and 2 (Fig. 2):

$$\forall j: \quad h^1_j = f^1_{node}\left(z^{t-1}_j\right), \tag{7}$$

$$\forall i, j: \quad h^1_{(i,j)} = f^1_{edge}\left((h^1_i)^\top W^1 h^1_j\right), \tag{8}$$

$$\forall j: \quad h^2_j = f^2_{node}\Big(\sum_{i \neq j} h^1_{(i,j)}\Big), \tag{9}$$

$$h^2_{(u,v)} = f^2_{edge}\left((h^2_u)^\top W^2 h^2_v\right), \tag{10}$$

where $f^1_{node}$, $f^1_{edge}$, $f^2_{node}$, $f^2_{edge}$ are two-layer, fully-connected neural networks, as in [18]; $W^1$ and $W^2$ are trainable parameters implementing bilinear layers. In detail:
• (7): $f^1_{node}$ transforms embeddings of all nodes in a graph;
• (8): $f^1_{edge}$ is a "node to edge" mapping that returns an edge embedding $h^1_{(i,j)}$ for all pairs of nodes $(i, j)$;
• (9): $f^2_{node}$ is an "edge to node" mapping that updates the embedding of node $j$ based on all edges connected to it;
• (10): $f^2_{edge}$ is similar to the "node to edge" mapping in the first pass, $f^1_{edge}$, but only the edge embedding between nodes $u$ and $v$ involved in the event is used.

Figure 2. Inferring an edge $S^t_{u,v}$ of our latent dynamic graph (LDG) using two passes, according to (7)-(10), assuming an event between nodes $u = 1$ and $v = 3$ has occurred. Even though only nodes $u$ and $v$ have been involved in the event, to infer the edge $S^t_{u,v}$ between them, interactions with all nodes in a graph are considered.
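The two passes of node and edge mappings can be sketched in plain Python. Several simplifications are assumed here and are not taken from the paper: each two-layer network is replaced by a plain `tanh`, the final edge mapping is left as the raw bilinear output, the "edge to node" step sums incoming edge embeddings, and each bilinear layer is stored as one matrix per output channel. All names are hypothetical.

```python
import math
import random


def bilinear(x, W, y):
    """Bilinear map: output channel o equals x^T W[o] y."""
    return [sum(x[a] * W_o[a][b] * y[b]
                for a in range(len(x)) for b in range(len(y)))
            for W_o in W]


def mlp(h):
    """Stand-in for a two-layer network (assumption: element-wise tanh)."""
    return [math.tanh(x) for x in h]


def encode_edge(Z_prev, W1, W2, u, v):
    """Two-pass encoder returning unnormalized edge-type scores for (u, v)."""
    n = len(Z_prev)
    h1 = [mlp(z) for z in Z_prev]                          # node mapping
    h1_edge = {(i, j): mlp(bilinear(h1[i], W1, h1[j]))     # node -> edge, all pairs
               for i in range(n) for j in range(n) if i != j}
    h2 = [mlp([sum(h1_edge[(i, j)][c] for i in range(n) if i != j)
               for c in range(len(W1))])                   # edge -> node
          for j in range(n)]
    return bilinear(h2[u], W2, h2[v])                      # edge (u, v) only


def random_tensor(rng, out, rows, cols):
    """Random weights with one (rows x cols) matrix per output channel."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(cols)]
             for _ in range(rows)] for _ in range(out)]


rng = random.Random(0)
N, d, c, r = 4, 3, 3, 2          # nodes, embedding size, hidden size, edge types
Z_prev = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(N)]
W1 = random_tensor(rng, c, d, d)
W2 = random_tensor(rng, r, c, c)

scores = encode_edge(Z_prev, W1, W2, u=0, v=2)

# Changing a node that is not part of the event still changes the (u, v) scores.
Z_changed = [list(z) for z in Z_prev]
Z_changed[3] = [0.0] * d
scores_changed = encode_edge(Z_changed, W1, W2, u=0, v=2)
```

Comparing `scores` with `scores_changed` shows the role of the second pass: the scores for the pair (0, 2) depend on node 3 even though node 3 is not involved in the event.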
The softmax function is applied to the edge embedding $h^2_{(u,v)}$, which yields the edge type posterior as in NRI [18]:

$$q_\phi\left(S^t_{u,v} \mid Z^{t-1}\right) = \mathrm{softmax}\left(h^2_{(u,v)}\right), \tag{11}$$

where $S^t_{u,v} \in \mathbb{R}^r$ are temporal one-hot attention weights sampled from the multirelational conditional multinomial distribution $q_\phi(S^t_{u,v} \mid Z^{t-1})$, hereafter denoted as $q_\phi(S \mid Z)$ for brevity; $r$ is the number of edge types (note that in DyRep $r = 1$); and $\phi$ are parameters of the neural networks in (7)-(10). $S^t_{u,v}$ is then used to update node embeddings at the next time step, according to (3a) and (3b).
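Turning edge scores into a posterior and drawing a near one-hot sample from it can be sketched as below. The sketch uses the Gumbel-softmax relaxation, which is one common way to sample discrete edge types differentiably; the temperature value, the random seed and the helper names are assumptions made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed sample: softmax of (logits + Gumbel noise) / temperature."""
    noisy = []
    for x in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((x + gumbel) / temperature)
    return softmax(noisy)


def to_one_hot(probs):
    """Hard one-hot vector selecting the largest entry."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1.0 if i == k else 0.0 for i in range(len(probs))]


rng = random.Random(0)
edge_scores = [1.0, 0.0]                  # scores for r = 2 edge types
posterior = softmax(edge_scores)          # edge type posterior
relaxed = gumbel_softmax_sample(edge_scores, temperature=0.5, rng=rng)
attention = to_one_hot(relaxed)           # one-hot attention weights
```

Because the index of the largest noisy score follows the posterior distribution, repeated sampling selects each edge type with a frequency close to its posterior probability, while the relaxed vector remains differentiable with respect to the scores.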
Replacing (4) with (6) means that it is not necessary to maintain an explicit representation of the ground-truth graph in the form of an adjacency matrix. The evolving graph structure is implicitly captured by $S^t$. While $S^t$ represents temporal attention weights between nodes, it can be thought of as a graph evolving over time, which we call a Latent Dynamic Graph (LDG). This graph, as we show in our experiments, can have a particular semantic interpretation.
Bilinear layers have proven to be advantageous in settings like Visual Question Answering (e.g., [29]), where multi-modal embeddings interact. In our case, they permit a richer interaction between embeddings of different nodes, so in (8) and (10), we replace [18]'s linear layers by computing a bilinear interaction, rather than concatenating features.
The two passes in (7)-(10) are important to ensure that attention values $S^t_{u,v}$ depend not only on the embeddings of nodes $u$ and $v$, $z^{t-1}_u$ and $z^{t-1}_v$, but also on how they interact with other nodes in the entire graph. With one pass, the values of $S^t_{u,v}$ would be predicted based only on local information, as only the previous node embeddings influence the new edge embeddings in the first pass (8).
Unlike DyRep, our temporal attention module has multiple edges between nodes, i.e., $S^t_{u,v}$ are one-hot vectors of length $r$. We therefore modify the "Localized Embedding Propagation" term in (5), such that features $h^{struct,t-1}_u$ are computed for each edge type and parameters $W^{struct}$ act on concatenated features from all edge types, i.e., $W^{struct} \in \mathbb{R}^{rd \times d}$.

Bilinear intensity function
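The conditional intensity function $\lambda^t_k$ represents the instantaneous rate at which an event of type $k$ (i.e., association or communication) occurs between nodes $u$ and $v$ in the infinitesimally small interval $(\tau, \tau + \delta\tau]$ [30]. DyRep formulates the conditional intensity as a softplus function of the concatenated learned node representations $z^{t-1}_u, z^{t-1}_v \in \mathbb{R}^d$:

$$\lambda^t_k = \psi_k \log\left(1 + \exp\left(\frac{\omega_k^\top \left[z^{t-1}_u, z^{t-1}_v\right]}{\psi_k}\right)\right), \tag{12}$$

where $\psi_k$ is the scalar trainable rate at which events of type $k$ occur, and $\omega_k \in \mathbb{R}^{2d}$ is a trainable vector that represents the compatibility between nodes $u$ and $v$ at time $t$. We replace concatenation $[\cdot, \cdot]$ in (12) with bilinear interaction:

$$\lambda^t_k = \psi_k \log\left(1 + \exp\left(\frac{(z^{t-1}_u)^\top \Omega_k z^{t-1}_v}{\psi_k}\right)\right), \tag{13}$$

where $\Omega_k \in \mathbb{R}^{d \times d}$ are trainable parameters, to allow more complex interactions between evolving node embeddings. We use this interaction to inform the sampling step illustrated in Fig. 2, and in the likelihood during training.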
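The difference between the concatenation-based and bilinear intensities can be made concrete with a small sketch. The scaled softplus is written as $\psi \log(1 + \exp(x / \psi))$; the toy embeddings and parameter values are arbitrary assumptions, and the function names are hypothetical.

```python
import math


def softplus(x, psi):
    """Scaled softplus psi * log(1 + exp(x / psi)), computed stably."""
    y = x / psi
    return psi * (max(y, 0.0) + math.log1p(math.exp(-abs(y))))


def intensity_concat(z_u, z_v, omega, psi):
    """Intensity from a linear score on the concatenation [z_u, z_v]."""
    score = sum(w * z for w, z in zip(omega, z_u + z_v))
    return softplus(score, psi)


def intensity_bilinear(z_u, z_v, Omega, psi):
    """Intensity from the bilinear score z_u^T Omega z_v."""
    score = sum(z_u[a] * Omega[a][b] * z_v[b]
                for a in range(len(z_u)) for b in range(len(z_v)))
    return softplus(score, psi)


identity = [[1.0, 0.0], [0.0, 1.0]]
aligned = intensity_bilinear([1.0, 0.0], [1.0, 0.0], identity, psi=1.0)
orthogonal = intensity_bilinear([1.0, 0.0], [0.0, 1.0], identity, psi=1.0)

omega = [0.3, -0.2, 0.5, 0.1]
concat = intensity_concat([1.0, 0.0], [0.0, 1.0], omega, psi=1.0)
```

With concatenation the score is a sum of one term that depends only on $z_u$ and one that depends only on $z_v$, so it cannot directly express how well the two embeddings match. The bilinear score multiplies entries of the two embeddings, so with an identity matrix it already reduces to their dot product, which is why `aligned` exceeds `orthogonal`.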

Training
Given a minibatch with a sequence of $P$ events, we optimize the model by minimizing the following cost function:

$$L = L_{events} + L_{nonevents} + \mathrm{KL}\left[q_\phi(S \mid Z) \,\|\, p_\theta(S)\right], \tag{14}$$

where $L_{events} = -\sum_{p=1}^{P} \log \lambda^{t_p}_{k_p}$ is the total negative log of the intensity rate for all events between nodes $u_p$ and $v_p$ (i.e., all nodes that experience events in the minibatch); and $L_{nonevents} = \sum_{m=1}^{M} \lambda^{t_m}_{k_m}$ is the total intensity rate of all nonevents between nodes $u_m$ and $v_m$ in the minibatch. Since the sum in the second term is combinatorially intractable in many applications, we sample a subset of nonevents according to the Monte Carlo method as in [17], where we follow their approach and set $M = 5P$.
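The event and nonevent terms of the cost function, together with a simple way of drawing $M = 5P$ nonevents, can be sketched as follows. The uniform sampling of node pairs is an assumption made for illustration and may differ from the sampling scheme of [17]; the function names are hypothetical.

```python
import math
import random


def events_loss(event_intensities):
    """Negative log intensity summed over observed events."""
    return -sum(math.log(lam) for lam in event_intensities)


def nonevents_loss(nonevent_intensities):
    """Intensity summed over sampled nonevents."""
    return sum(nonevent_intensities)


def sample_nonevents(events, num_nodes, ratio, rng):
    """Draw ratio * len(events) node pairs that are not observed events."""
    observed = set(events)
    samples = []
    while len(samples) < ratio * len(events):
        u, v = rng.randrange(num_nodes), rng.randrange(num_nodes)
        if u != v and (u, v) not in observed:
            samples.append((u, v))
    return samples


rng = random.Random(0)
events = [(0, 1), (2, 3)]                      # P = 2 observed (u, v) pairs
nonevents = sample_nonevents(events, num_nodes=10, ratio=5, rng=rng)

# Toy intensities; in the model they come from the intensity function.
total = events_loss([math.e, 1.0]) + nonevents_loss([0.25, 0.75])
```

Minimizing the first term raises the intensity of pairs that did interact, while the second term lowers the intensity of sampled pairs that did not; the KL term is added on top of this sum for the models with learned attention.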
The first two terms, $L_{events}$ and $L_{nonevents}$, were proposed in DyRep [17] and we use them to train our baseline models. The KL divergence term, adopted from NRI [18] to train our LDG models, regularizes the model to align predicted $q_\phi(S \mid Z)$ and prior $p_\theta(S)$ distributions of attention over edges. Here, $p_\theta(S)$ can, for example, be defined as $[\theta_1, \theta_2, \ldots, \theta_r]$ in case of $r$ edge types.
Following [18], we consider uniform and sparse priors. In the uniform case, $\theta_i = 1/r$, $i = 1, \ldots, r$, so the KL term reduces, up to an additive constant, to the negative sum of entropies $H$ over events $p = [1, \ldots, P]$ and over generated edges excluding self-loops ($u = v$):

$$\mathrm{KL}\left[q_\phi(S \mid Z) \,\|\, p_\theta(S)\right] = -\sum_{p=1}^{P} \sum_{u \neq v} H^p_\phi + \mathrm{const}, \tag{15}$$

where entropy is defined as a sum over edge types $r$: $H^p_\phi = -\sum q^p_\phi \log q^p_\phi$, and $q^p_\phi$ denotes the distribution $q_\phi(S^{t_p}_{u,v} \mid Z^{t_p - 1})$. In case of sparse $p_\theta(S)$ and, for instance, $r = 2$ edge types, we set $p_\theta(S) = [0.90, 0.05, 0.05]$, meaning that we generate $r + 1$ edges, but do not use the non-edge type corresponding to high probability, and leave only $r$ sparse edges. In this case, the KL term becomes:

$$\mathrm{KL}\left[q_\phi(S \mid Z) \,\|\, p_\theta(S)\right] = \sum_{p=1}^{P} \sum_{u \neq v} \left(H^p_{\phi,q} - H^p_\phi\right), \tag{16}$$

where $H^p_{\phi,q} = -\sum q^p_\phi \log p_\theta(S)$. During training, we update $S^{t_p}_{u,v}$ after each $p$-th event and backpropagate through the entire sequence in a minibatch. To backpropagate through the process of sampling discrete edge values, we use the Gumbel reparametrization [31], as in [18]. Training behaviour is illustrated in Fig. 3.

Table 1. Summary of notation.
Notation | Dimensionality | Definition
$v$ | 1 | Index of a node involved in the event
$z^t_v$ | $d$ | Embedding of node $v$ at time $t$
$\lambda^t_k$ | 1 | Conditional intensity of edges of type $k$ at time $t$ between nodes $u$ and $v$
$\psi_k$ | 1 | Trainable rate at which edges of type $k$ occur
$\omega_k$ | $2d$ | Trainable compatibility of nodes $u$ and $v$ at time $t$
$\Omega_k$ | $d \times d$ | Trainable interaction matrix between nodes $u$ and $v$ at time $t$
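The relation between the KL term and the entropy and cross-entropy quantities defined above follows from the definition of KL divergence for discrete distributions, and it can be checked numerically for a single edge. The posterior values below are arbitrary and the function names are hypothetical.

```python
import math


def entropy(q):
    """H = -sum q log q over edge types."""
    return -sum(p * math.log(p) for p in q if p > 0.0)


def cross_entropy(q, prior):
    """H_{q,prior} = -sum q log prior over edge types."""
    return -sum(p * math.log(t) for p, t in zip(q, prior) if p > 0.0)


def kl_divergence(q, prior):
    """KL[q || prior] = sum q log(q / prior)."""
    return sum(p * math.log(p / t) for p, t in zip(q, prior) if p > 0.0)


q = [0.7, 0.2, 0.1]                       # toy posterior over r + 1 = 3 edge types
sparse_prior = [0.90, 0.05, 0.05]         # sparse prior from the text
uniform_prior = [1.0 / 3.0] * 3

kl_sparse = kl_divergence(q, sparse_prior)
kl_uniform = kl_divergence(q, uniform_prior)
```

For any prior the divergence equals the cross-entropy minus the entropy, and for a uniform prior over three types it equals $\log 3$ minus the entropy, so minimizing it with a uniform prior amounts to maximizing the entropy of the predicted edge distribution.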

Dataset
We use the Social Evolution Dataset released by the MIT Human Dynamics Lab [19] for our experiments, and preprocess it in a similar way to [17]. The dataset consists of over 1 million events o t = (u, v, τ, k).
A communication event (k = 1) is represented by the sending of an SMS message, or a Proximity or Call event from node u to node v; an association event (k = 0) is a CloseFriend record between the nodes. We also experiment with other associative connections (Fig. 4). As Proximity events are noisy, we filter them by the probability that the event occurred, which is available in the dataset annotations. The filtered dataset on which we report results includes 83 nodes with 43,517 training and 10,462 test communication events. We evaluate models only on communication events, since the number of association events is small, but we use both for training. As in [17], we use events from September 2008 to April 2009 for training, and from May to June 2009 for testing. Associative connections corresponding to the beginning and end of training events are illustrated in Fig. 4.
At test time, given tuple $(u, \tau, k)$, we compute the conditional density of $u$ with all other nodes and rank them [17]. We report Mean Average Ranking (MAR) and HITS@10: the proportion of times that a test tuple appears in the top 10.
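The two evaluation metrics can be sketched as follows. The ranking convention is an assumption: ranks start at 1 and a tie does not count against the true node, which may differ from the convention used to produce the reported numbers. Function names are hypothetical.

```python
def rank_of_true_node(scores, u, true_v):
    """Rank (1 = best) of true_v among all nodes except u, by descending score."""
    candidates = [v for v in range(len(scores)) if v != u]
    return 1 + sum(1 for v in candidates if scores[v] > scores[true_v])


def mar_and_hits(ranks, k=10):
    """Mean average ranking and the fraction of ranks within the top k."""
    mar = sum(ranks) / len(ranks)
    hits = sum(1 for rank in ranks if rank <= k) / len(ranks)
    return mar, hits


# Scores of node u = 0 with every node; the true partner is node 2.
scores = [0.1, 0.9, 0.5, 0.3]
rank = rank_of_true_node(scores, u=0, true_v=2)

# Aggregating the ranks of three hypothetical test events.
mar, hits_at_10 = mar_and_hits([1, 2, 12], k=10)
```

Lower MAR and higher HITS@10 both indicate that the true partner of node $u$ tends to be placed near the top of the ranked list.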

Implementation Details
We train models with the Adam optimizer [32], with a learning rate of $2 \times 10^{-4}$, minibatch size $P = 200$ events, and $d = 32$ hidden units per layer, including those in the encoder (7)-(10). We consider two priors, $p_\theta(S)$, to train the encoder: uniform and sparse with $r = 2$ edge types in each case. For the sparse case, we generate $r + 1$ edges, but do not use the non-edge type corresponding to high probability and leave only $r$ sparse edges.
We run each experiment 10 times and report the average and standard deviation of MAR and HITS@10 in Table 2. We train for 5 epochs with early stopping. To run experiments with random graphs, we generate $S$ once in the beginning and keep it fixed during training. For the models with learned temporal attention, we use random attention values for initialization, which are then updated during training (Fig. 4).
We chose PyTorch [33] as a deep learning framework, along with the publicly available implementation of NRI.

Results
We report results of the baseline DyRep with CloseFriend and FacebookAllTaggedPhotos as an underlying graph (Table 2) and compare them to the models with learned temporal attention (latent dynamic graph, LDG). Models with learned attention perform better than DyRep's hard-coded attention based on FacebookAllTaggedPhotos and some other ground truth graphs (see Fig. 5 for more comparisons), further confirming the finding from [18] that the underlying graph can be suboptimal. However, these models still perform worse than DyRep with CloseFriend. We also show that bilinear models consistently improve results and exhibit better training behaviour, including when compared to a larger linear model with an equivalent number of parameters (Fig. 3).
While the models with a uniform prior have better test performance than those with a sparse prior in some cases, sparse attention is typically more interpretable. This is because the model is forced to infer only a few edges, which must be strong since that subset defines how node features propagate (Table 3). In addition, relationships between people in the dataset tend to be sparse. To estimate agreement of our learned temporal attention matrix with the underlying association connections, we take the matrix $S^t$ generated after the last event in the training set and compute the area under the ROC curve (AUC) between $S^t$ and each of the associative connections present in the dataset. These associations evolve over time, so we consider associations corresponding to the beginning (September 2008) and end (April 2009) of training events.
AUC is used as opposed to other metrics, such as accuracy, to take into account sparsity of true positive edges, as accuracy would give overoptimistic estimates. We observe that LDG learns a graph that is most similar to CloseFriend, with AUC of 84%. This is an interesting phenomenon, given that we only observe how nodes communicate through many events between non-friends. Thus, the learned temporal attention matrix is capturing information related to the associative connections.

Figure 5. Results of using graphs based on training data statistics ("no learn"), compared to bilinear DyRep with different graphs used for associations.
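Comparing a learned attention matrix with a binary association matrix through AUC can be sketched with the pairwise (Mann-Whitney) formulation of AUC. The two small matrices are made-up toy values, and the helper names are hypothetical.

```python
def roc_auc(scores, labels):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def off_diagonal(M):
    """Flatten a square matrix, skipping self-loops."""
    n = len(M)
    return [M[i][j] for i in range(n) for j in range(n) if i != j]


# Toy attention weights S and toy associations A for three nodes.
S = [[0.0, 0.9, 0.2],
     [0.8, 0.0, 0.1],
     [0.3, 0.85, 0.0]]
A = [[0, 1, 0],
     [1, 0, 0],
     [0, 0, 0]]

labels = off_diagonal(A)
auc = roc_auc(off_diagonal(S), labels)

# A trivial predictor that assigns the same score to every pair.
constant_auc = roc_auc([0.0] * len(labels), labels)
no_edge_accuracy = labels.count(0) / len(labels)
```

The trivial predictor reaches an accuracy of about 67% here simply because most pairs are not associated, yet its AUC is 0.5, which illustrates why AUC is the more informative measure when true edges are sparse.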

BLOG: BlogLivejournalTwitter.
Node embeddings of bilinear models tend to form more distinct clusters, with frequently communicating nodes generally residing closer to each other after training (Fig. 6). The random edges approach also clusters nodes well and embeds frequently communicating nodes close together, because the embeddings are mainly affected by the dynamics of communication events.

Figure 6. tSNE node embeddings after training (coordinates are scaled for visualization). Panels: CloseFriend, Edge 1, Edge 2. Lines denote associative or sampled edges. Darker points denote overlapping nodes. Red, green, and cyan nodes correspond to the three most frequently communicating pairs of nodes, respectively.

Dataset Insights
The application of GNNs to datasets with dynamic events and evolving graphs, as we have done here, is relatively new and demanding. To understand the nature of the Social Evolution Dataset, we report results on the test set, which were obtained simply by computing statistics from the training set (Fig. 5), which led to quite strong performance, e.g., MAR=23 by exploiting FacebookAllTaggedPhotos. Interestingly, based on the MAR results (Fig. 5, left), FacebookAllTaggedPhotos connections are more correlated with communication events than CloseFriend in the case of "no learn". This can mean that friendships are strong, longer-term relationships that do not necessarily involve frequent communication events. Based on the HITS@10 results (Fig. 5, right), our model performs better than or comparably to all associations, except for CloseFriend.
In the experiments based on dataset statistics discussed above, to predict a link for node u at time τ , we randomly sample node v from those associated with u. In the Random case, we sample node v from all nodes, except for u. Moreover, we measured the frequency of events in the training set between each pair of nodes and used those values to rank nodes in the test set (Fig. 7). Surprisingly, this achieves almost perfect results, where MAR=0.30 and HITS@10=0.99 (or MAR=5.57 and HITS@10=0.83 for the full unfiltered dataset). These results imply that current models are unable to capture relatively simple regularities in dynamic data, although admittedly, we do not directly feed these statistics to the model. It also means we need datasets with more complicated dynamics inherent to many applications, where such simple regularities would have less predictive power.
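The frequency-based baseline described above can be sketched in a few lines. Treating events as undirected and breaking ties by node index are assumptions made for illustration, and the function names are hypothetical.

```python
from collections import Counter


def pair_frequencies(train_events):
    """Count training events per unordered pair of nodes."""
    counts = Counter()
    for u, v in train_events:
        counts[frozenset((u, v))] += 1
    return counts


def rank_candidates(u, num_nodes, counts):
    """Rank all other nodes by how often they interacted with u in training."""
    candidates = [v for v in range(num_nodes) if v != u]
    return sorted(candidates, key=lambda v: -counts[frozenset((u, v))])


train_events = [(0, 1), (1, 0), (0, 2), (3, 2)]
counts = pair_frequencies(train_events)
ranking = rank_candidates(0, num_nodes=4, counts=counts)
```

For a test event involving node 0, the position of the true partner in `ranking` gives the rank that enters MAR and HITS@10, without any learned parameters.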

Conclusion
We propose an extension of DyRep and NRI for dynamic link prediction. In addition to the advantage, in some cases, of learned temporal attention over use of the ground truth graph, we showed that bilinear layers can capture essential graph dynamics that concatenation alone cannot. Finally, we showed that a simple statistical analysis can outperform more complex models based on node embeddings. More diverse data exhibiting richer dynamics would allow for more meaningful dynamic graph analysis.