Dynamic graph convolutional networks with attention mechanism for rumor detection on social media

Social media has become an ideal platform for the propagation of rumors, fake news, and misinformation. Rumors on social media not only mislead online users but also affect the real world immensely. Thus, detecting rumors and preventing their spread has become an essential task. Some recent deep learning-based rumor detection methods, such as Bi-Directional Graph Convolutional Networks (Bi-GCN), represent a rumor using the completed stage of its diffusion and try to learn structural information from it. However, these methods are limited to representing rumor propagation as a static graph, which is not optimal for capturing the dynamic information of rumors. In this study, we propose novel graph convolutional networks with attention mechanisms, named Dynamic GCN, for rumor detection. We first represent rumor posts together with their responsive posts as dynamic graphs. The temporal information is used to generate a sequence of graph snapshots. Representation learning on these graph snapshots with an attention mechanism captures both the structural and the temporal information of rumor spreads. Experiments conducted on three real-world datasets demonstrate the superiority of Dynamic GCN over state-of-the-art methods on the rumor detection task.


Introduction
Social media has been a great disseminator of new information and thoughts. Due to the ease of sharing information, however, social media has also become an ideal platform for the propagation of rumors, fake news, and misinformation [1]. Although the definition of a rumor may vary across the literature, we use the term rumor to indicate messages whose veracity labels are unknown at the time of diffusion [2,3]. Rumors on social media not only mislead online users but also affect the real world immensely [4]. Thus, detecting rumors and preventing their spread has become an essential task.
Early studies in rumor detection focused on understanding the characteristics of rumors [5,6] and extracting prominent rumor features from textual contents or users' profiles [7-11]. Temporal features and propagation patterns were also elaborated in [12-17]. These elaborated features achieved solid results on rumor detection tasks. The manually extracted content-based, user-based, or propagation-based handcrafted features were used to train classical machine learning classifiers such as decision trees, random forests, or SVMs. However, the limitation of using manually extracted features is that they fail to capture the high-dimensional patterns of rumors.
To solve the problem of using handcrafted features and avoid feature engineering efforts, [18-21] adopted neural networks such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The proposed rumor detection models were able to capture high-dimensional representations from the textual contents, user profiles, and propagation structures. The models that use the propagation structure [20,21] try to represent the skeptical or conflicting opinions of the responsive posts, such as retweets, replies, or comments, toward the original message.
The recent advent of Graph Neural Networks (GNNs) and their variants such as Graph Convolutional Networks (GCN), GraphSAGE, and Graph Attention Networks (GAT) [22-25] has gained a lot of attention. GNNs have shown promising results in graph inference tasks such as node classification, graph classification, and link prediction. [26,27] successfully adopted GCN and GAT in the rumor detection domain, respectively. However, both models only consider the static graph structure of the final state of rumor propagation and do not account for its temporal dynamics.
In this study, motivated by the dynamic nature of rumor propagation, we present a novel graph convolutional network-based model, named Dynamic GCN, to better understand the evolving pattern of rumor propagation. The model includes two distinct ways of representing rumor propagation with graph snapshots: sequential and temporal snapshots. Fig 1 depicts how rumor propagation can be represented with a sequence of snapshots. In the example scenario, the initial trust (Fig 1a) in the root post begins to attract doubts (Fig 1b), and the posts that express doubts are supported by others (Fig 1c). From this whole process, the veracity of the root post can be inferred. The details of the representation are discussed in Section 4. The extended GCNs capture the spatial representation of rumor posts within a snapshot. Finally, the series of graph snapshot representations is combined with an attention mechanism. We evaluate the proposed model on three real-world datasets and show that it outperforms other state-of-the-art methods.
We summarize the main contributions as follows:
• We propose two distinct ways of depicting a dynamic graph by generating two variants of graph snapshots: sequential and temporal snapshots.
• We propose a novel GCN-based rumor detection model that can capture the evolving pattern of rumor propagation by aggregating the structural representations of snapshot sequences.
• The conducted experiments on three real-world datasets demonstrate that our model accomplishes superior results on the rumor detection task compared to other state-of-the-art methods.
We organize this paper as follows. In Section 2, we briefly review rumor detection methods and the fundamental components of our model: GCNs and attention mechanisms. In Section 3, we formulate the rumor detection problem with the propagation structure of rumors. In Section 4, we introduce our model component by component: snapshot generation, graph convolutional networks, the readout layer, attention mechanisms, and prediction. In Section 5, the details of the experiments and the performance evaluation are described. Finally, we conclude this work in Section 6.

Rumor detection
A rumor is commonly defined as a message whose veracity label is unknown [2,3]. Rumor detection on social media is the task of classifying messages or posts by their veracity labels. Traditional approaches to rumor detection and other misinformation detection extract handcrafted features with prior knowledge about rumors. The content-based and user-based methods were the two main approaches [7-9,11]. To elaborate different and additional features, temporal or linguistic features were considered in [12-14]. Another characteristic feature of rumors is their propagation structure. [15-17] utilize the propagation patterns of rumors and show solid results on rumor detection. The manually extracted content-based, user-based, temporal, or propagation-based handcrafted features were used to train classical machine learning classifiers such as decision trees, random forests, or SVMs. However, the limitation of models with handcrafted features is that they fail to capture the high-dimensional patterns of rumors. To solve this problem, [18,19] adopted deep learning models such as RNN or CNN variants to extract textual, image, or user-profile features from rumor posts. Notably, models that utilize the propagation structure as additional features try to represent the skeptical or conflicting opinions expressed in the responsive posts. Recently, sophisticated models like GCN [26] and GAT [27] have been successfully adopted in the rumor detection domain.

Representation learning on graphs
Promising results of neural networks in various fields have encouraged studies to bring deep learning to topological graph structures. Early studies of node embedding [28,29] leverage sampling methods like random walks for shallow node embeddings. The recent advent of graph neural networks (GNNs) and their variants [22-25] has made it possible to apply representation learning directly to a variety of graph structures such as social networks (friendship, citation, and transaction networks), knowledge graphs, computer networks, biochemical graphs, and so on. One of the earliest and most influential studies of GNNs is graph convolutional networks (GCNs) [23], which approximates spectral filters with Chebyshev polynomials to extend convolutional operations to graphs. Another important GNN variant is GraphSAGE [24], which proposes different trainable aggregation functions over neighbor node embeddings with sampling methods. The proposed aggregation functions, like mean, LSTM (with random ordering), and max-pooling, are symmetric, so the result is invariant to the ordering of neighbor nodes. GAT [25] utilizes the attention mechanism over neighbor node embeddings. GNNs have firmly established state-of-the-art performance in various graph inference tasks such as node classification, graph classification, link prediction, and community detection (clustering of the network structure). The fundamental component of GNNs is the message passing architecture, where the representation of a node is aggregated with those of its neighbors. The key differences among GNN variants are their diverse neighborhood aggregation methods and pooling approaches [30,31].

Attention mechanism
The attention mechanism captures the importance of elements of an input sequence by calculating attention scores and weights. Compared to RNN variants such as Long Short-Term Memory (LSTM) [32], Gated Recurrent Units (GRU) [33], or the Seq2Seq model [34], attention mechanisms have demonstrated outstanding results in both efficiency and performance in a variety of fields [35,36]. Various attention mechanisms have been proposed, differing in how they calculate the attention weights. [36] proposed additive attention, which adopts a feed-forward neural network to calculate the importance of an input in the context of the input sequence. [35,37] suggested dot-product attention and self-attention, which utilize dot-product similarity to capture the significance of certain input words within the set of words in the task of neural machine translation. The attention mechanism has also been introduced to graph representation learning [25], with promising results, where node embeddings are calculated by attending over the features of neighbor nodes.

Representation learning on dynamic graph
Graph structures like social networks are dynamic by nature [38]. Different approaches have been proposed to capture the dynamics of graphs. Early studies [39,40] focused on changes in graph properties such as clusters, centralities, and similarities at certain temporal points of the graph, called graph snapshots. Advancing beyond feature-based dynamic graph representations, architectures with triadic closure and RNNs [41,42] were adopted to embed sequences of graph structures. [43] suggested DynGEM, which utilizes the snapshot method with an autoencoder to embed evolving graphs. As GNN-based methods have shown promising results on graph embedding tasks, [44,45] proposed GCN architectures combined with LSTM and GRU for dynamic graph embedding. [46] applied a self-attention mechanism to represent dynamic graphs.

Problem definition
In this section, the rumor detection task on graph structure is described. Rumor detection aims to predict the veracity label of a message. We formulate the task as below.
Let C = {c_1, c_2, ..., c_m} be the set of m claims, where each claim (or conversational thread) c_i consists of n_i microblog posts. Propagated from the root post, the responsive posts form a propagation tree G_i = <V_i, E_i>, where each edge represents a direct response [15,16]. The vertex set V_i is represented by the posts' features X_i = {x_i0, x_i1, ..., x_i(n_i−1)}, and the edge set E_i is the set of directed edges from source posts (the root or responsive posts) to their direct responsive posts. A_i is the adjacency matrix of the directed graph G_i. Beyond representing the propagation tree as a static graph, to elaborate its evolving pattern we define the diffusion graph as a T-step series of snapshots S_i = {S_i^(1), S_i^(2), ..., S_i^(T)}. The details of the snapshot formulation are discussed in Section 4.1.
Each claim c_i is associated with a veracity label y_i, where y_i belongs to one of four classes {T, F, U, N} (True rumor, False rumor, Unverified rumor, or Non-rumor) or one of two classes {R, N} (Rumor, Non-rumor), depending on the dataset [16,18]. The definition of rumor labels that we borrow is that rumors are messages whose veracity is unknown at the stage of propagation and is later classified by human annotators as true, false, or unverified (non-rumor messages are ordinary thoughts or simple expressions of admiration) [2,3]. In this study, we define the task of rumor detection as a supervised graph classification problem, where the goal is to learn a mapping function f: C -> Y that classifies the veracity label of c_i using S_i and X_i.

Dynamic GCN
In this section, we propose a dynamic graph representation learning model for rumor detection, named Dynamic GCN (DYNGCN). The main components of the model are snapshot generation, graph convolutional networks, readout layer, and attention mechanisms. The components are respectively responsible for the following functionalities: rumor propagation representation, representation learning on a graph snapshot, node embedding aggregation for global graph representation, and sequential learning from the series of graph snapshots.

Snapshot generation
To capture the evolving pattern of rumor diffusion, we adopt a series of graph snapshots. We introduce two different ways of depicting the dynamic graph as T-step graph snapshots S = {S^(1), S^(2), ..., S^(T)}: one with sequential snapshots and the other with temporal snapshots. Fig 3 illustrates the two snapshot generation methods. From here on, the index i for the claim c_i is omitted. S^(t) is the graph snapshot at time step t, and each graph snapshot in S has its own adjacency matrix in A = {A^(1), A^(2), ..., A^(T)}.

Sequential snapshots.
Consider the ordering of the nodes and links added to the propagation tree. Starting from S^(1), each following graph snapshot contains ⌈(n−1)/T⌉ additional links (and their nodes), where n−1 is the total number of responsive links. Eventually, each graph snapshot S^(t) contains ⌈t(n−1)/T⌉ links. With the responsive links e_1, e_2, ..., e_(n−1) indexed in chronological order, the edge set of the sequential snapshot S^(t) is:

E^(t) = {e_j : j ≤ ⌈t(n−1)/T⌉}
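The cumulative edge partitioning above can be sketched in a few lines of pure Python; the function name and the pair representation of edges are illustrative, not from the paper:

```python
import math

def sequential_snapshots(edges, T):
    """Split a chronologically ordered edge list into T cumulative snapshots.

    Snapshot t (1-indexed) keeps the first ceil(t * (n-1) / T) edges,
    so each step adds roughly (n-1)/T new links (and their nodes).
    `edges` is assumed sorted by response time; len(edges) == n - 1.
    """
    n_links = len(edges)
    return [edges[:math.ceil(t * n_links / T)] for t in range(1, T + 1)]
```

For example, with 5 responsive links and T = 3, the snapshots contain ⌈5/3⌉ = 2, ⌈10/3⌉ = 4, and 5 edges, and the last snapshot is the full propagation tree.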

Temporal snapshots.
Consider the temporal information of the propagation tree. Compared to sequential snapshots, which contain equal counts of additional edges, temporal snapshots separate the T-step diffusion by a fixed time interval r. The time interval r is retrieved by dividing the time difference between the first and the last responsive posts by the time step count T. The edge set of the temporal snapshot S^(t) can then be defined as:

E^(t) = {e : τ_e ≤ τ_0 + t·r}

where τ_e is the timestamp of link e, τ_0 is the timestamp of the first responsive post, and r is the time interval of the snapshots.
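A minimal sketch of the fixed-interval variant, again with an illustrative function name and edge representation; unlike the sequential scheme, the number of edges added per step can vary:

```python
def temporal_snapshots(edges, timestamps, T):
    """Split edges into T cumulative snapshots by a fixed time interval.

    r = (last - first responsive timestamp) / T; snapshot t keeps every
    edge whose timestamp tau_e <= first + t * r.
    """
    first, last = min(timestamps), max(timestamps)
    r = (last - first) / T  # fixed time interval between snapshots
    snaps = []
    for t in range(1, T + 1):
        cutoff = first + t * r
        snaps.append([e for e, tau in zip(edges, timestamps) if tau <= cutoff])
    return snaps
```

With timestamps clustered early and one late response, the early snapshots absorb most edges at once, which is exactly the burst pattern the temporal variant is meant to expose.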

Graph convolutional networks
For snapshot representation learning, we adopt the graph convolutional architecture. Upon generating the graph snapshots S = {S^(1), S^(2), ..., S^(T)} and their adjacency matrices A = {A^(1), A^(2), ..., A^(T)}, we conduct representation learning on the graph snapshots with graph convolutional networks (GCNs) [23]. As introduced in [23], the approximated normalized graph Laplacian [47] is used for high-dimensional node representation learning. Given an adjacency matrix A^(t) ∈ R^(N^(t)×N^(t)), where N^(t) is the number of nodes in the snapshot, and a feature matrix X ∈ R^(N^(t)×F), the learnable parameters W_k ∈ R^(d_(k−1)×d_k) are trained, where the k-th layer produces node embeddings H_k ∈ R^(N^(t)×d_k). The GCN layer that we adopt is:

H_k = σ(D̃^(−1/2) Ã D̃^(−1/2) H_(k−1) W_k)

where Ã = A^(t) + I is the adjacency matrix with self-loops, D̃ is its degree matrix, and H_0 = X. The trainable parameters W are shared between GCNs at the same level across different snapshot steps. We use 2-layer GCNs with ReLU as the activation function σ. We also adopt a skip-connection-like method [48] called root feature enhancement [26] to enhance the information from a certain node, in this case the root node: the root representation from the previous GCN layer bypasses the layer and is concatenated to each node's hidden representation. Finally, inspired by [26,49], instead of perceiving the diffusion pattern as an undirected graph, we adopt bi-directional GCNs, which process the top-down and bottom-up directions of the graph separately and concatenate their representations. The outputs H_K^(t), produced by the last GCN layer K, are the node embeddings of each graph snapshot S^(t).
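One GCN layer with the symmetric normalization above can be sketched in pure Python on small dense matrices (root feature enhancement and the bi-directional pass are omitted for brevity; a real implementation would use PyTorch Geometric, as the paper does):

```python
import math

def gcn_layer(A, H, W):
    """One graph convolution: ReLU(D^-1/2 (A + I) D^-1/2 @ H @ W).

    A: n x n adjacency matrix (lists of 0/1), H: n x f node features,
    W: f x d trainable weights.  Pure-Python sketch of the per-snapshot
    layer; not an efficient or batched implementation.
    """
    n = len(A)
    # add self-loops, then symmetrically normalize by sqrt of degrees
    A_hat = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in A_hat]
    norm = [[A_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # message passing: aggregate neighbor features, then linear map + ReLU
    AH = [[sum(norm[i][k] * H[k][j] for k in range(n)) for j in range(len(H[0]))]
          for i in range(n)]
    return [[max(0.0, sum(AH[i][k] * W[k][j] for k in range(len(W))))
             for j in range(len(W[0]))] for i in range(n)]
```

Stacking two such layers with shared W_k across the T snapshots mirrors the parameter sharing described above.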

Readout layer
After the GCN layers embed the node representations H_K^(t) ∈ R^(N^(t)×d_K) of each graph snapshot S^(t), a global graph pooling method is used to convert the node representations into a graph representation. Any permutation-invariant (symmetric) down-sampling method, such as max-, mean-, or sum-pooling, or even a sophisticated pooling method like [30,31], can be used as the aggregation function in the readout layer. In this work, we empirically selected mean-pooling for global graph pooling. The global graph snapshot embedding h_S^(t) at step t ∈ {1, 2, ..., T} is the element-wise mean of the node embeddings from the last GCN layer K:

h_S^(t) = mean({H_K^(t)[v] : v ∈ V^(t)})
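The mean readout is a one-liner per dimension; a sketch with an illustrative function name, operating on the node-embedding matrix produced by the last GCN layer:

```python
def mean_readout(H):
    """Element-wise mean over node embeddings: one d-dim vector per snapshot.

    H: n x d node embeddings (list of lists); returns a length-d vector.
    Permutation-invariant, so node ordering within a snapshot is irrelevant.
    """
    n, d = len(H), len(H[0])
    return [sum(H[v][j] for v in range(n)) / n for j in range(d)]
```

Swapping in max- or sum-pooling only changes the inner reduction, which is why the readout is treated as a replaceable component here.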

Attention mechanism
To apprehend the dynamic (temporal) information of the graph snapshots, we use attention mechanisms. We adopt two well-known mechanisms: additive attention [36] and scaled dot-product attention [35]. From the graph snapshot embeddings h_S = {h_S^(1), h_S^(2), ..., h_S^(T)}, the goal is to learn attention weights and use them to aggregate the weighted inputs. Following [20,36], for additive attention we retrieve a context vector m_S by taking the element-wise mean of the embeddings in h_S. The context vector m_S is used as the query (Q) of the attention mechanism, and h_S is used for the keys (K) and values (V). The query and each key are concatenated and fed to a feed-forward neural network to produce the attention scores z, and the attention weights are calculated as α = softmax(z). Scaled dot-product attention instead considers the dot-product similarity of the embeddings when calculating the attention scores. We adopt self-attention, in which the query (Q), keys (K), and values (V) are all h_S = {h_S^(1), h_S^(2), ..., h_S^(T)}:

Attention(Q, K, V) = softmax(QK^T / √d_k)V

where d_k is the dimension of h_S^(t). The softmax of the scaled similarity scores between snapshots yields the attention weights applied to h_S.
The outputs of the two attention layers are both weighted sequences of the snapshot embeddings. The global graph embedding h_G is retrieved as the element-wise average of the T snapshots with the attention weights applied:

h_G = (1/T) Σ_{t=1}^{T} α_t h_S^(t)
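The scaled dot-product self-attention variant can be sketched in pure Python over the T snapshot embeddings (projection matrices for Q, K, V are omitted in this sketch, so it shows only the similarity-weighted mixing):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(h):
    """Scaled dot-product self-attention where Q = K = V = h.

    h: T snapshot embeddings of dimension d_k; each output row is a
    softmax(q . k / sqrt(d_k))-weighted sum of all snapshot embeddings.
    """
    d_k = len(h[0])
    out = []
    for q in h:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in h]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, h)) for j in range(d_k)])
    return out
```

When all snapshots are identical, every weight is 1/T and the output equals the input, which is a quick sanity check on the weighting.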

Training & prediction
For the graph classification task, the graph embedding h_G is fed to a multi-layer perceptron:

ŷ = softmax(MLP(h_G))

where ŷ ∈ R^|class| holds the probabilities of the veracity labels, with class = {T, F, U, N} or class = {R, N}.
Our supervised graph classification model is trained with the cross-entropy loss between the predictions and the ground-truth labels. The loss function of our model is defined as:

L = −Σ_{i=1}^{m} Σ_{c ∈ class} y_{i,c} log ŷ_{i,c}

where y_i is the one-hot ground-truth label for the claim c_i.
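The loss reduces to summing the negative log-probability assigned to each claim's true class; a minimal sketch (in practice the paper's PyTorch implementation would use a built-in cross-entropy):

```python
import math

def cross_entropy(y_pred, y_true):
    """Summed cross-entropy: -sum_i log(p_i[y_i]).

    y_pred: list of probability vectors over the veracity classes,
    y_true: list of ground-truth class indices (one per claim).
    """
    return -sum(math.log(p[y]) for p, y in zip(y_pred, y_true))
```

A uniform two-class prediction contributes log 2 per claim, and a confident correct prediction contributes nearly zero.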

Experiments
In this section, we perform experiments on three real-world datasets and compare the performance of the proposed model, Dynamic GCN, with other rumor detection baselines. Furthermore, we conduct ablation studies and analyze the results on different snapshot counts and variants of the sequential learning methods.

Datasets
We evaluate the proposed model on three publicly available rumor detection datasets: Twitter15 [13], Twitter16 [16], and Weibo [18]. These datasets contain rumor propagation trees, where nodes are posts and links are responsive relations such as replies or retweets, with one of four ground-truth veracity labels (True rumor, False rumor, Unverified rumor, Non-rumor) for Twitter15 and Twitter16 and two classes (Rumor, Non-rumor) for the Weibo dataset. The detailed statistics of the datasets are provided in Table 1. We used bag-of-words (BoW) features, selecting the top 5,000 vocabulary terms of the corpus by TF-IDF; thus, each post is initially represented by 5,000 features.
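A sketch of the BoW feature construction described above. The paper only states "top 5,000 vocabularies by TF-IDF", so the exact ranking criterion here (maximum TF-IDF score across the corpus) is an assumption, as are the function name and tokenized-post input format:

```python
import math
from collections import Counter

def bow_features(posts, vocab_size=5000):
    """Bag-of-words vectors over the top-`vocab_size` terms by TF-IDF.

    posts: list of token lists.  Terms are ranked by their maximum
    TF-IDF score over all posts (assumed criterion), then each post is
    represented by raw counts over the selected vocabulary.
    """
    n = len(posts)
    df = Counter(w for post in posts for w in set(post))   # document frequency
    idf = {w: math.log(n / df[w]) for w in df}
    score = {}
    for post in posts:
        tf = Counter(post)
        for w, c in tf.items():
            s = (c / len(post)) * idf[w]
            score[w] = max(score.get(w, 0.0), s)
    vocab = [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])][:vocab_size]
    return [[post.count(w) for w in vocab] for post in posts], vocab
```

Terms appearing in every post get an IDF of zero and drop out of the vocabulary, which matches the usual intent of TF-IDF selection.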

Baselines
We compare our Dynamic GCN model with the following rumor detection baseline models:
• DTC [7]: A decision tree-based classifier with handcrafted features to identify the credibility of microblog posts related to trending topics.
• RFC [11]: A random forest-based ranking method that elaborates the inquiry phrases of posts.
• SVM-TS [12]: An SVM model that captures the temporal characteristics of social context features of posts.
• SVM-TK [16]: An SVM model with a tree kernel that captures higher-order patterns of propagation structures of rumors.
• GRU [18]: An RNN-based model that learns contextual information from continuous representations of relevant posts over time.
• RvNN [21]: A recursive neural network-based model that captures the structural patterns of top-down and bottom-up rumor propagation trees.
• Bi-GCN [26]: A graph convolutional network-based model, which captures propagation patterns with message passing architecture.
• DYNGCN (Proposed): A graph convolutional network-based model with attention mechanisms to capture temporal dynamics of graph snapshots.
We did not include the Propagation Path Classification (PPC) model [20] and the Global-Local Attention Network (GLAN) model [27] as baselines, since both methods include crawled user profiles as additional input features (such as whether the user is suspended or verified), which could bias the comparison. Several years have passed since the initial collection of the datasets, so the results could be distorted and might depend heavily on when the user profiles were crawled. Instead, we compare our model with the state-of-the-art model [26], which considers the relations between posts without additional crawled user profiles.

Experimental setup
We conducted 10 runs of 5-fold cross-validation and report the average accuracies and F1 scores per label. For a fair comparison, for the models with an early stopping method [50], such as Bi-GCN and ours, we randomly split the 4 training folds into an 80% training set and a 20% validation set, eventually making 16:4:5 splits for the train, validation, and test sets. The validation set was used for early stopping with a patience of 10 epochs. The model has 256 hidden dimensions for a single graph snapshot, including root feature enhancement and the bi-directional representation. We set 2-layer GCNs and used rectified linear units for the non-linearity. We adopt a dropout [51] rate of 0.5 for the GCN layers and a DropEdge [52] (a graph data augmentation method) rate of 0.2. We train our model with the Adam optimization algorithm [53] with an initial learning rate of 5E-4 for a maximum of 200 epochs if not stopped early.
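The patience-based early stopping used above can be sketched as a small helper (the function name and the offline, loss-list interface are illustrative; in training this check runs once per epoch):

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-indexed epoch at which training would stop.

    Stops once the validation loss has not improved for `patience`
    consecutive epochs; returns the last epoch if the patience budget
    is never exhausted (e.g. the 200-epoch maximum is reached first).
    """
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0   # improvement: reset the counter
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return len(val_losses) - 1
```

In the reported setup the model checkpoint from the best validation epoch, not the stopping epoch, would typically be evaluated.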
Our model is implemented in PyTorch [54] with PyTorch Geometric [55] for the message passing framework. For the baseline models, we conducted experiments with the authors' provided code and the same hyperparameters that were reported, respectively. For a fair comparison, we directly cite (*) some of the metrics already reported in the original papers [16,18] with equivalent experimental settings, because some handcrafted features were unavailable at the time of reproduction. Tables 2 and 3 report the results of the compared models (including RvNN, Bi-GCN, and DYNGCN). Among the traditional handcrafted methods, SVM-TS and SVM-TK show superior results since these models are able to utilize temporal features; it is therefore constructive to consider the temporal information of rumors for rumor detection.

Performance evaluations
Finally, among the propagation-based baselines, the graph-based models, DYNGCN and Bi-GCN, outperform other baselines such as RvNN or GRU, since graph convolutional networks can better capture the structural representation of rumor diffusion.

Ablation study
In order to examine the performance of our model under different settings, we report the following ablation studies: performance with different snapshot counts for sequential and temporal snapshots, performance with different learning algorithms for combining the snapshot sequences, and the attention weights of additive attention and dot-product attention.

Different snapshot counts.
Fig 4 shows the results of DYNGCN with snapshot counts of 1, 2, 3, 4, and 5 using dot-product attention. Although accuracy shows no significant correlation with the snapshot count, adopting multiple snapshots performs better than a single static snapshot for both sequential and temporal snapshots. However, we observed that simply applying larger snapshot counts does not produce further performance improvement, and we regard the snapshot count as a dataset-dependent hyperparameter.

Different learning methods for the sequence.
The attention layer of our model can be replaced with other sequence models such as Seq2Seq [34], since the inputs to the attention layer are a sequence of snapshot representations. Fig 5 shows the results of different sequence learning methods (Bi-LSTM, Bi-GRU, additive attention, and dot-product attention (self-attention)) with a snapshot count of 3. Among the attention mechanisms, which compute a weighted sum of the sequential representations, dot-product attention significantly weights the snapshots of the end stage. This can be interpreted as follows: additive attention relies on the context query to understand the global or overall propagation, while dot-product attention relies on the input sequence to jointly understand the overall pattern. Although the weights themselves depend on the dataset, each attention mechanism represents the propagation structure in its own way.

Conclusion
In this research, we propose Dynamic GCN, an end-to-end GCN-based model with attention mechanisms for rumor detection. The model is able to capture the dynamics of rumor propagation using sequential snapshots and temporal snapshots. We empirically evaluate our model on three real-world datasets and compare its performance on the rumor detection (veracity classification) task with other rumor detection baselines. The results show that our model outperforms other state-of-the-art methods. The ablation studies report the performance differences across snapshot counts, sequence learning variants, and the weights of the different attention mechanisms. We believe there is still room for improvement in the context of GCN variants, global graph pooling, and additional features from different contexts.