TweepFake: About detecting deepfake tweets

Recent advances in language modeling have significantly improved the generative capabilities of deep neural models: in 2019 OpenAI released GPT-2, a pre-trained language model that can autonomously generate coherent, non-trivial and human-like text samples. Since then, ever more powerful text generative models have been developed. Adversaries can exploit these tremendous generative capabilities to enhance social bots, giving them the ability to write plausible deepfake messages and contaminate public debate. To prevent this, it is crucial to develop detection systems for deepfake social media messages. However, to the best of our knowledge, no one has yet addressed the detection of machine-generated texts on social networks like Twitter or Facebook. To help research in this detection field, we collected the first dataset of real deepfake tweets, TweepFake. It is real in the sense that each deepfake tweet was actually posted on Twitter. We collected tweets from a total of 23 bots, imitating 17 human accounts. The bots are based on various generation techniques, i.e., Markov Chains, RNN, RNN+Markov, LSTM, GPT-2. We also randomly selected tweets from the humans imitated by the bots to obtain an overall balanced dataset of 25,572 tweets (half human-written and half bot-generated). The dataset is publicly available on Kaggle. Lastly, we evaluated 13 deepfake text detection methods (based on various state-of-the-art approaches) to both demonstrate the challenges that TweepFake poses and establish a solid baseline of detection techniques. We hope that TweepFake offers the opportunity to tackle deepfake detection on social media messages as well.


Introduction
During the last decade, social media platforms, developed to connect people and let them share their ideas and opinions through multimedia content (such as images, video, audio, and text), have also been used to manipulate and alter public opinion through bots, i.e., computer programs that control a fake social media account as a legitimate human user would: by "liking", sharing, and posting old or new media, which may be real, forged through simple techniques (e.g., video editing, gap-filling texts, and search-and-replace methods), or deepfake.

To demonstrate the challenges that TweepFake poses and to provide a solid baseline of detection techniques, we also evaluated 13 different deepfake text detection methods: some of them exploit text representations as inputs to machine-learning classifiers, others are based on deep-learning networks, and others rely on the fine-tuning of transformer-based classifiers. The code used in the experiments is publicly available on GitHub [18].

Related work
Deepfake technologies first arose in the computer vision field [19][20][21][22], followed by effective attempts at audio manipulation [23,24] and text generation [3]. Deepfakes in computer vision usually deal with face manipulation, such as entire face synthesis, identity swap, attribute manipulation, and expression swap [6], and with body re-enacting [22]. Recently, audio deepfakes have involved generating speech audio from text in the voice of different speakers after just five seconds of listening time [24]. In 2017, the development of the self-attention mechanism and of the transformer architecture [25] led to major improvements in language models. Language modeling refers to the use of statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. The subsequent transformer-based language models (GPT [26], BERT [27], GPT-2 [3], etc.) enhanced not only natural language understanding tasks but language generation as well. In 2019, [3] developed GPT-2, a pre-trained language model that can autonomously generate coherent, human-like paragraphs of text from just a short input sentence; in the same year, [28] contributed to text generation with GROVER, a new approach for efficient and effective learning and generation of multi-field documents such as news articles. Soon after, [29] released CTRL, a conditional language model that uses control codes to generate text with a specific style, content, and task-specific behavior. Finally, [30] presented OPTIMUS, which brought variational autoencoders into text generation.
Currently, the approaches to automatic deepfake text detection roughly fall into three categories, listed here in order of complexity.
Simple classifier: a machine-learning or deep-learning binary classifier trained from scratch.
Zero-shot detection: using the output of a pre-trained language model as features for a subsequent classifier, which may be a machine-learning model or a simple feed-forward neural network.
Fine-tuning based detection: jointly fine-tuning a pre-trained language model with a final simple neural network (consisting of one or two layers).
OpenAI's GPT-2 research group conducted an in-house detection study [31] on GPT-2-generated text samples: first, they evaluated a standard machine-learning approach that trains a logistic regression discriminator on TF-IDF unigram and bigram features. Then, they tested a simple zero-shot baseline using a threshold on the total probability: a text excerpt is predicted to be machine-generated if its likelihood according to GPT-2 is closer to the mean likelihood over all machine-generated texts than to the mean over human-written ones.
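As an illustration only (not the authors' exact implementation), this zero-shot rule can be sketched with a Hugging Face GPT-2 checkpoint; mu_bot and mu_human are hypothetical mean likelihoods estimated beforehand on held-out samples:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def mean_log_likelihood(text):
    # Average per-token log-probability of the text under GPT-2;
    # the model loss is the mean negative log-likelihood of the shifted tokens.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return -loss.item()

def predict_is_bot(text, mu_bot, mu_human):
    # mu_bot / mu_human: mean likelihoods over machine- and human-written
    # reference texts (placeholders, not values from the paper).
    ll = mean_log_likelihood(text)
    return abs(ll - mu_bot) < abs(ll - mu_human)
```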
[4] provided GLTR (Giant Language model Test Room), a visual tool that helps humans detect deepfake texts. Generated text is sampled word by word from a next-token distribution (several sampling techniques can be used [32]; the simplest is to take the most probable token), and this distribution usually differs from the one that humans subconsciously use when they write or speak. GLTR surfaces these statistical language differences to help people discriminate human-written text samples from machine-generated ones.
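GLTR itself is a visual tool, but the core statistic it visualizes can be approximated as follows; this is a minimal sketch assuming a Hugging Face GPT-2 checkpoint, not the GLTR codebase:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def token_ranks(text):
    # For each position, the rank of the actually observed next token in the
    # model's predicted distribution; human text tends to contain more
    # high-rank (unlikely) tokens than machine-sampled text.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    ranks = []
    for pos in range(ids.size(1) - 1):
        probs = logits[0, pos].softmax(-1)
        observed = ids[0, pos + 1]
        ranks.append(int((probs > probs[observed]).sum()) + 1)
    return ranks
```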
GROVER's authors [28] followed the fine-tuning based detection approach, using BERT, GPT-2, and GROVER itself as the pre-trained language model. GROVER performed best, suggesting that perhaps the best defense against transformer-based text generators is a detector based on the same kind of architecture. However, OpenAI [31] proved this wrong on GPT-2-generated texts: they showed that fine-tuning a RoBERTa-based detector achieved consistently higher accuracy than fine-tuning a GPT-2-based detector of equivalent capacity. [11] developed an energy-based deepfake text detector: unlike auto-regressive language models (e.g., GPT-2 [3], XLNet [33]), which are defined in terms of a sequence of conditional distributions, an energy-based model is defined in terms of a single scalar energy function representing the joint compatibility between all input variables. Thus, the deepfake discriminator is an energy function that scores the joint compatibility of an input sequence of tokens given some context (e.g., a text sample, some keywords, a bag of words, a title) and a set of network parameters. The authors also tried to generalize the experimental setting, making the generator architectures and text corpora differ between training and test time.
The only research on the detection of deepfake social media texts was conducted by [5] on Amazon reviews generated by GPT-2. They evaluated several human-machine discriminators: the GROVER-based detector, GLTR, the RoBERTa-based detector from OpenAI, and a simple ensemble that fused these detectors using logistic regression at the score level.
The above deepfake text detection methods share two flaws: except for [5]'s research, they deal with generated news articles, which are much longer than social media messages; moreover, a single known adversarial generative model (usually GPT-2 or GROVER) is typically used to generate the deepfake text samples. In a real-world scenario, we do not know how many or which generative architectures are in use. Our TweepFake dataset provides a set of tweets produced by several generative models, hoping to help the research community detect shorter deepfake texts written by heterogeneous generative techniques.

DeepFake tweets generation
There exist several methods to generate text. What follows is a short description of the generative methods used to produce the machine-generated tweets contained in our dataset.

Generation techniques
First and foremost, the training set of text corpora is tokenized (punctuation included); then one of the following methods can be applied. Notice that the following techniques write a text token by token (a token could be a word, a character, a byte pair, or a Unicode code point) until a stop token is encountered or a pre-defined maximum length is reached. RNN, LSTM, and GPT-2 are language models; therefore, at each generation step they produce a multinomial distribution, in which each category is a token of the vocabulary derived from a set of human-written texts, from which the next token is sampled with a specific sampling technique (e.g., max probability, top-k, nucleus sampling [34]). A special start token is given as input to the generative model to prime the text generation; with language models, a short sentence can also work as priming text: each token of the start sentence is processed without sampling from the resulting distribution, just to condition the generative model.
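As an illustrative sketch only (not the bots' actual code), the token-by-token loop with top-k sampling can be written against any Hugging Face causal language model and matching tokenizer; the function name and parameters are hypothetical:

```python
import torch

def sample_text(model, tokenizer, prompt, max_new_tokens=40, k=40):
    # Token-by-token generation with top-k sampling, as described above.
    ids = tokenizer(prompt, return_tensors="pt").input_ids  # priming text
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[0, -1]       # next-token distribution
        top = torch.topk(logits, k)
        probs = torch.softmax(top.values, dim=-1)   # renormalize over top k
        next_id = top.indices[torch.multinomial(probs, 1)]
        if next_id.item() == tokenizer.eos_token_id:  # stop token reached
            break
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(ids[0])
```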

Markov Chains
A Markov Chain is a stochastic model that describes a sequence of states, moving from one state to another with a probability that depends on the current state only. For text generation, a state is identified with a token: the next token/state is randomly selected from the list of tokens that follow the current one. The probability of a token t being chosen is proportional to how frequently t appears after the current token.
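A minimal sketch of such a frequency-proportional Markov generator (illustrative, not any specific bot's implementation):

```python
import random
from collections import defaultdict

def build_chain(tokens):
    # Map each token to the list of tokens observed right after it;
    # keeping duplicates makes frequent successors proportionally more likely.
    chain = defaultdict(list)
    for cur, nxt in zip(tokens, tokens[1:]):
        chain[cur].append(nxt)
    return chain

def markov_generate(chain, start, max_len=30):
    out = [start]
    for _ in range(max_len):
        successors = chain.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))  # frequency-proportional choice
    return " ".join(out)
```

Because every observed successor is stored, duplicates included, random.choice samples proportionally to frequency, which is exactly the selection rule described above.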
RNN, helped by its loop structure, stores in its accumulated memory information about the previously encountered tokens and computes the multinomial distribution from which the next token is chosen. The selected token is fed back as input so that the RNN can produce the following one.
The RNN+Markov method may employ the Markov Chain's next-token selection as a sampling technique: in practice, the next token would be randomly sampled from the multinomial distribution produced by the RNN, with the highest-probability tokens being the most likely to be chosen. However, we found no reference confirming this hypothesis about the RNN+Markov mechanism.
LSTM generates text as an RNN does. However, it is smarter than the latter thanks to its more complex structure: it can learn to selectively keep track of only the relevant information in the already-seen piece of text, while also mitigating the vanishing-gradient problem that affects RNNs. An LSTM's memory is "longer" than an RNN's.
GPT-2 is a generative pre-trained transformer language model relying on the attention mechanism of [25]: through attention, a language model pre-trained on millions of sentences learns how each token relates to every other token in every possible context. This is what enables the generation of more coherent and non-trivial paragraphs of text. Still, being a language model, GPT-2 follows the same generation steps as RNN and LSTM: at each step it produces a multinomial distribution and selects the next token from it with a specific sampling technique.
CharRNN employs an RNN at the character level to generate text character by character.

The TweepFake dataset
In this section we describe the process of building the novel TweepFake - A Twitter Deep Fake Dataset, together with the results of the experiments on the deepfake detection task. Twitter accounts were searched heuristically on the web, GitHub, and Twitter, looking for keywords related to automatic or AI text generation, deepfake texts/tweets, or specific technologies such as GPT-2, RNN, etc., in order to collect as large a sample of Twitter profiles as possible. We selected only accounts referring to automated text generation technologies in their Twitter descriptions, profile URLs, or related GitHub repositories. From this sample, we selected a subset of accounts mimicking (often fine-tuned on) human Twitter profiles. We thus obtained 23 bots and 17 human accounts, because some fake accounts imitate the same human profile (see Table 1). We then downloaded the timelines of both the deepfake accounts and their corresponding humans via the Twitter REST API. In order to get a dataset balanced across both categories (human and bot), we randomly sampled tweets within each account pair (human and bot/s) based on the less productive side, as sketched below. For example, after the download we had 3,193 tweets by human#11 and 1,245 by the corresponding bot#12, so we randomly sampled 1,245 tweets from the human account's timeline to get the same amount of data. In total, we had 25,572 tweets, half human-written and half bot-generated. In Table 1, we report, for each fake account considered, the human account imitated, the technology used for generating the tweets, and the number of tweets collected from both the fake and the human account. In Table 2, we group the fake accounts by technology, reporting, together with the number of collected tweets, the citation of the information we found about the bot (i.e., technical details, code, news, etc.). Please note that in our detection experiments we grouped the technologies into three main groups: GPT-2, RNN, and others (see Sections Results and Discussion).
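A minimal sketch of this per-pair balancing step (the function name and seed are illustrative, not the exact script we used):

```python
import random

def balance_pair(human_tweets, bot_tweets, seed=42):
    # Downsample the larger side of a (human, bot) pair to the size of the
    # smaller one, as in the human#11 (3,193) vs. bot#12 (1,245) example above.
    random.seed(seed)
    n = min(len(human_tweets), len(bot_tweets))
    return random.sample(human_tweets, n), random.sample(bot_tweets, n)
```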

Detection techniques
To assess the difficulty of detecting automatically generated natural language content, we used the built dataset to measure the effectiveness of a set of ML and DL methods of increasing complexity. The results allow us to fix some baseline configurations in terms of performance and give an idea of which approaches are most promising for this specific problem.
In Table 3, we report all the methods tested in this work. We explored four main approaches to this specific task. The first uses a text representation based on bag-of-words (BoW) [53], with encoded features weighted according to the TF-IDF function [53]. The tweets encoded in this way are then processed by a statistical ML algorithm that produces a classifier for the specific problem. In this work, we chose to implement three popular classifiers: logistic regression, random forest, and SVM; a minimal sketch of this first approach follows.
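As a hedged sketch of the BoW+TF-IDF approach (illustrative toy data, not our actual splits), a scikit-learn pipeline combining TF-IDF features with one of the classifiers could look like this:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real tweet splits (1 = bot, 0 = human).
train_texts = ["a human written tweet", "a machine generated tweet"]
train_labels = [0, 1]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), max_features=25000),  # unigrams + bigrams
    LogisticRegression(max_iter=1000),
)
clf.fit(train_texts, train_labels)
print(clf.predict(["another tweet to classify"]))
```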
The BoW+TF-IDF approach, although very popular and used for many years as the primary way to vectorize texts, suffers from two main drawbacks. The first is the curse of dimensionality [59]: the feature space is very sparse, and the amount of data required to produce statistically significant models is very high. The second is that BoW ignores word order, and thus completely misses any information about the semantic context in which words occur.

To overcome these limitations, in the second approach we encoded texts using BERT [27], a recent pre-trained language model that contributed to improving state-of-the-art results on many NLP problems. BERT provides contextual embeddings, fixed-size vector representations of words that depend not only on the words themselves but also on the context in which they occur: for example, the word bank will assume a different meaning, and consequently a different contextual vector, depending on whether it appears near economy or river. These contextual representations can be merged to obtain a contextualized fixed-size vector for a specific text (e.g., by averaging the vectors of all the words composing it); a minimal pooling sketch follows. As in the previous scenario, the tweets encoded through BERT were processed using the same set of classifiers.
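A minimal mean-pooling sketch, assuming the Hugging Face transformers library and the bert-base-cased checkpoint mentioned later in the experimental setup (the function name is illustrative):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained("bert-base-cased").eval()

def embed(texts):
    # Mean-pool the contextual token vectors of each tweet into a single
    # fixed-size vector, ignoring padding positions.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state        # (batch, tokens, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)         # (batch, 768)
```

The resulting fixed-size vectors can then be fed to the same logistic regression, random forest, and SVM classifiers used with BoW.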
In the third approach, we leverage another effective way to encode textual content, working at the character level instead of the word or token level [60]. This methodology has the advantage of not requiring access to any external resource: it exploits only the dataset used to learn the model. The encoding process is summarized in Fig 1. Each tweet is encoded as a sequence of contiguous character IDs obtained from a fixed character vocabulary. This mapping allows us to use the internal embedding matrix (learned during the training phase) to select, at each time step in the text, the row vector corresponding to the current character, thus building a matrix representation of the current text. The resulting embedding matrix is then passed as input to the subsequent layers of the tested deep learning networks.
As the final and most effective approach, we fine-tuned several pre-trained language models directly on the built dataset. This process consists of taking a trained language model, extending its original architecture with a final dense classification layer, and training on a specific (typically small) dataset for very few epochs [27]. Fine-tuning adapts the native language model's network weights to the considered use case, maximizing the encoding power of the model for that specific classification problem. As reported in Table 3, we tested four different language models, all based on the transformer architecture [25], which have provided state-of-the-art results on many text processing benchmarks. BERT [27] was presented in 2018 and, thanks to its innovative transformer-based architecture with dual pre-training tasks (Masked Language Model and Next Sentence Prediction) and large amounts of data, basically outperformed all other methods on many text processing benchmarks. XLNet [33] and RoBERTa [61] tried to increase BERT's effectiveness by slightly varying and optimizing its original architecture and by using far more training data, improving prediction performance on the same benchmarks by up to 15%. DistilBERT [61], on the other hand, aimed to keep the performance of the original BERT model (97% of the original) while greatly simplifying the network architecture and halving the number of parameters to be learned.

Experimental setup
The main parameters of each algorithm (except for those based on deep learning models where, for computational reasons, we used default parameters) have been optimized using the validation set.
Baselines built on standard machine learning methods (with both BoW and BERT representations) were implemented using the scikit-learn Python library [62]. In the BoW experiments, we tokenized tweets by splitting texts into words, removing all hashtags, replacing all user mentions with the token __user_mention__, replacing all URLs with the token __url__, and leaving all emoticons as separate tokens; a rough sketch of these rules follows. During the encoding phase, to minimize computational cost, we kept only the 25,000 most frequent tokens and weighted each token inside tweets according to the TF-IDF method [53]. In the BERT experiments we encoded tweets using the bert-base-cased pre-trained model from the transformers Python library [63]. In the SVC configurations we tried different kernels (linear and RBF) and a range of values for the C and gamma parameters. The C misclassification cost was also optimized in the logistic regression configurations. For the random forest baselines we chose the best setting by varying these parameters: max_depth, min_samples_leaf, min_samples_split, and n_estimators.
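A rough approximation of these tokenization rules (the regular expressions are our assumptions, not the exact ones used in the experiments; emoticon handling is omitted):

```python
import re

def preprocess(tweet):
    # Preprocessing rules described above: drop hashtags, normalize mentions
    # and URLs to placeholder tokens.
    tweet = re.sub(r"#\w+", "", tweet)                     # remove hashtags
    tweet = re.sub(r"@\w+", "__user_mention__", tweet)     # normalize mentions
    tweet = re.sub(r"https?://\S+", "__url__", tweet)      # normalize URLs
    return tweet.strip()
```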
Solutions based on character-level deep learning networks were implemented using the Keras Python library [64]. We used a fixed window of length 280 (the maximum length of a tweet on Twitter, in characters) to represent input tweets and the tanh activation function at every hidden layer. In all three configurations of the char neural networks, the first hidden layer is an embedding layer of size 32. At the second level, CHAR_CNN is characterized by three independent CNN subnetworks (each a CNN layer composed of 128 filters followed by a global max pooling layer) with different kernel sizes (3, 4, and 5), which are then concatenated and filtered by a dropout layer before the final classification; a sketch of this configuration follows. The CHAR_GRU configuration is simpler, composed at the second level of a bidirectional GRU layer followed by dropout and a final classification layer. The CHAR_CNNGRU configuration attaches to the first hidden layer two different subnetworks (one CNN-based and one GRU-based, with the same architectures defined before), concatenates them, applies dropout, and performs the final classification.
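A sketch of the CHAR_CNN configuration in Keras, following the description above; the vocabulary size and dropout rate are assumptions, as the text does not specify them:

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB = 128  # assumed character-vocabulary size; in practice derived from the data

inp = layers.Input(shape=(280,))                     # tweets padded to 280 char IDs
x = layers.Embedding(VOCAB, 32)(inp)                 # first hidden layer: embeddings
branches = []
for k in (3, 4, 5):                                  # three parallel CNN subnetworks
    b = layers.Conv1D(128, k, activation="tanh")(x)  # 128 filters, tanh activation
    branches.append(layers.GlobalMaxPooling1D()(b))
x = layers.Concatenate()(branches)
x = layers.Dropout(0.5)(x)                           # assumed dropout rate
out = layers.Dense(1, activation="sigmoid")(x)       # human vs. bot

model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```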
We used the simpletransformers Python library [65] to implement all models in the fine-tuned configurations. In agreement with other works in the literature, and for computational reasons, we limited the number of epochs to just three complete iterations over the training data; a minimal sketch follows.
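A minimal fine-tuning sketch with simpletransformers (toy data; the argument set is illustrative, not our full configuration):

```python
import pandas as pd
from simpletransformers.classification import ClassificationModel

# Toy training frame; real training uses the TweepFake split (1 = bot, 0 = human).
train_df = pd.DataFrame(
    [["a human-written tweet", 0], ["a machine-generated tweet", 1]],
    columns=["text", "labels"],
)

model = ClassificationModel(
    "roberta", "roberta-base",
    args={"num_train_epochs": 3},  # three epochs, as in the setup above
    use_cuda=False,                # set True when a GPU is available
)
model.train_model(train_df)
predictions, _ = model.predict(["tweet to classify"])
```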
A summary of the customized parameter values used in the final configurations is reported in S1 Table. All the other unspecified parameters used by the tested algorithms are left at their default values, as defined in the software libraries providing their implementations. Our experiments are available in a GitHub repository [18].

Results
As evaluation measures, we used the metrics canonically adopted in text classification: precision, recall, F1, and accuracy [53]. Given that the analyzed dataset is balanced in terms of examples, accuracy seems the most reasonable measure of a method's effectiveness. In Table 4, we report the results obtained on the test set using the proposed detection baselines.
To have a better understanding of how the tested baselines behave at detection time, we split all accounts in the dataset into four categories:
human: the set of Twitter accounts whose contents are produced only by a human.
gpt2: the set of Twitter accounts whose contents are produced only by GPT-2-based generative algorithms.
rnn: the set of Twitter accounts whose contents are produced only by RNN-based generative algorithms.
others: the set of Twitter accounts whose contents are produced only by generative algorithms using mixed (e.g., RNN + Hidden Markov models) or unknown approaches.
Each account was assigned to one of these categories according to the information found in the corresponding Twitter account's description or in a linked web page describing its purpose. For some accounts we could not find any information from the author about the technology used to implement the bot; in those cases we assigned the account to the others category. In Fig 2 we show a qualitative evaluation of the accuracy of the proposed baselines in relation to the account category, together with the "global" performance over all categories.
To obtain a fair comparison between human and the other categories, given that the human class has more examples than each of the other categories alone, we randomly undersampled humans to match the maximum size, in terms of examples, of the other three categories. The resulting distribution was: humans (484), GPT-2 (384), RNN (412), and "others" (484).

Discussion
Globally, in terms of accuracy, we can observe (Table 4) that the methods based on BoW representations have the worst performance (around 0.80), followed by those using BERT (around 0.83) and character encodings (up to 0.85), and all are remarkably outperformed by methods using native language-model encodings (0.90 for ROBERTA_FT). A high-level view of the results thus indicates that the most complex text representations, especially those based on large amounts of external data, provide evident advantages in terms of effectiveness. An interesting exception is character encoding in deep learning methods, which is simple, generally provides good performance, and can be useful where no pre-trained models are available (e.g., for non-English languages).
Going into more detail, the baselines based on fine-tuning (except XLNET) show very well balanced precision and recall, unlike the other configurations, where one of the two measures is much higher than the other. The qualitative analysis of accuracy in relation to account categories highlights some interesting facts (Fig 2): a) all methods (except RAND_FOREST_BOW) perform extremely well in identifying tweets as bot-generated on both RNN and others accounts; b) tweets from human accounts are easily identified by methods based on fine-tuning but not by the others; and c) all methods have difficulty correctly identifying tweets produced by GPT-2 accounts. On this last point, it is interesting to note that all the complex fine-tuned LM methods perform remarkably worse than some character-based methods such as CHAR_GRU. This could indicate that RNN networks maintain a slight advantage over newer transformer networks in temporal representations of short contexts, an important aspect to investigate in the future.
To sum up, these findings suggest that a wide variety of detectors (text representation-based, using machine-learning or deep-learning methods, and transformer-based, using transfer learning) have greater difficulty correctly detecting a deepfake tweet than a human-written one; this is especially true for GPT-2-generated tweets, suggesting that the newest and most sophisticated generative methods based on the transformer architecture (here, GPT-2) can produce more human-like short texts than older generative methods such as RNNs. We manually inspected several GPT-2 and RNN tweets: the former were harder to label as bot-generated. In any case, future work could investigate in depth the humanness of tweets produced by different generative methods through studies with human judges.

Conclusion
Deepfake text detection is increasingly critical due to the development of highly sophisticated generative methods such as GPT-2. However, to the best of our knowledge, no deepfake detection has yet been conducted on social media texts. The aim of this paper was therefore to present the first dataset of real deepfake tweets, TweepFake, to help the research community develop techniques to fight the deepfake threat on social media. The collected deepfake tweets are publicly available on the well-known Kaggle platform. The dataset is composed of 25,572 tweets, half human-written and half bot-generated, posted on Twitter in the last few months. We collected them from 23 bots and from the 17 human accounts they imitate. The bot accounts are based on various generative techniques, including GPT-2, RNN, LSTM, and Markov Chains. We tested the difficulty of discriminating human-written tweets from machine-generated ones by evaluating 13 detectors: some exploiting text representations as inputs to machine-learning classifiers, others based on deep learning networks, and others relying on the fine-tuning of transformer-based classifiers.
Our detection results suggest that the newest and most sophisticated generative methods based on the transformer architecture (e.g., GPT-2) can produce high-quality short texts that are difficult to unmask even for expert human annotators. This finding is in line with those in [3] covering long texts (news articles, etc.). Additionally, transformer-based language models provide very good word representations for both text representation-based and fine-tuning based detection techniques; the latter achieve better accuracy (nearly 90% for the RoBERTa-based detector) than the former.
We recommend further investigation of RNN-based detectors, as the CHAR_GRU-based detector was the best at correctly labelling GPT-2 tweets as bot-generated. Moreover, a study of the human ability to discriminate human-written tweets from machine-generated ones is needed; the humanness of tweets produced by different generative methods could also be assessed. Of course, contributions of different detection techniques are welcome.
Supporting information
S1 Table.