The geography of corporate fake news

Although a rich academic literature examines the use of fake news by foreign actors for political manipulation, there is limited research on potential foreign intervention in capital markets. To address this gap, we construct a comprehensive database of (negative) fake news regarding U.S. firms by scraping prominent fact-checking sites. We identify the accounts that spread the news on Twitter (now X) and use machine-learning techniques to infer the geographic locations of these fake news spreaders. Our analysis reveals that corporate fake news is more likely than corporate non-fake news to be spread by foreign accounts. At the country level, corporate fake news is more likely to originate from African and Middle Eastern countries and tends to increase during periods of high geopolitical tension. At the firm level, firms operating in uncertain information environments and strategic industries are more likely to be targeted by foreign accounts. Overall, our findings provide initial evidence of foreign-originating misinformation in capital markets and thus have important policy implications.

Below, we outline the data collection, data pre-processing, and model training step by step.

Step 1: Collecting training data
We collect a global sample of geo-tagged tweets because the geo-tagged locations (based on the GPS coordinates of mobile devices) are reliable and difficult to manipulate. We use the locations in geo-tagged tweets as ground-truth data to train our model. We obtain the training data from two sources: (i) the authors of [1], who collected a global sample of 5,053,103 geo-tagged tweets in real time in 2014 and 2015, and (ii) Brandwatch, from which we randomly collect tweets over the sample period 2014-2019. The authors of [1] collected the geo-tagged tweets for a location-prediction task and share only the tweet IDs, as Twitter does not allow the sharing of individual tweets (https://figshare.com/articles/dataset/Tweet_geolocation_5m/3168529). We reconstruct the dataset from these tweet IDs using the Twitter Academic API, from which we were able to retrieve 2,970,736 tweets (58.8% of the global sample). The remaining tweets are no longer available, either because users deleted these messages or because the user accounts were suspended. To extend the sample period of the training data, we randomly collect a global sample of 1,275,110 geo-tagged tweets from Brandwatch, a Twitter data partner that offers access to historical tweets.
In doing so, we extend our sample through 2019. We keep only native Twitter posts (where the geographical data are based on the GPS location of the device) and remove tweets cross-posted from Instagram (where the user can choose a place to attach to the post, which may lead to incorrect locations [2]). After we remove the Instagram cross-posts, the Brandwatch sample comprises a total of 956,827 tweets. In the end, our training data consist of a global sample of 3,927,563 geo-tagged tweets over the period 2014-2019. We cannot extend the training data beyond 2019 because Twitter removed the ability to tag precise locations in 2019 (https://twitter.com/TwitterSupport/status/1141039841993355264).
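For illustration, a minimal sketch of how tweet IDs can be re-hydrated through the Twitter API using the tweepy client. The bearer token and requested fields are placeholders rather than our exact collection code, and endpoint availability has changed since our data collection.

```python
import tweepy

# Placeholder credential; Academic API access is assumed, not shown here.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"
client = tweepy.Client(bearer_token=BEARER_TOKEN, wait_on_rate_limit=True)

def hydrate(tweet_ids, batch_size=100):
    """Re-hydrate tweet IDs in batches of 100 (the per-request maximum).
    Deleted tweets and tweets from suspended accounts are simply absent
    from the response, which is why only a fraction can be recovered."""
    for start in range(0, len(tweet_ids), batch_size):
        batch = tweet_ids[start:start + batch_size]
        response = client.get_tweets(
            ids=batch,
            tweet_fields=["geo", "lang", "created_at", "text"],
            expansions=["geo.place_id", "author_id"],
        )
        for tweet in response.data or []:
            yield tweet
```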
We pose the prediction task as a multi-class classification of geolocations. Our training data cover 3,315 city locations, with geo-coordinates obtained from the code repository of [3] (https://github.com/Erechtheus/geolocation). The geo-tagged tweets are mapped to these 3,315 cities using the Haversine distance. We keep the 2,187 cities that are geo-tagged by a minimum of 100 tweets to minimize generalization error for rarely mentioned cities. The final training data cover 149 countries and 2,187 cities.
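A minimal sketch of this city-assignment step: each geo-tagged tweet is mapped to the closest candidate city by Haversine distance. The array and function names are illustrative; the candidate city list comes from the repository of [3].

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle (Haversine) distance in kilometres; inputs in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

def nearest_city(tweet_lat, tweet_lon, city_lats, city_lons, city_names):
    """Assign a geo-tagged tweet to the closest candidate city."""
    distances = haversine_km(tweet_lat, tweet_lon, city_lats, city_lons)
    return city_names[int(np.argmin(distances))]

# Illustrative candidate cities (the actual list has 3,315 entries).
city_names = ["New York", "London", "Istanbul"]
city_lats = np.array([40.71, 51.51, 41.01])
city_lons = np.array([-74.01, -0.13, 28.98])
print(nearest_city(40.0, -75.0, city_lats, city_lons, city_names))  # -> "New York"
```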
Step 2: Pre-processing data
We use the following Twitter features to predict users' geolocations: tweet text, tweet language, user-declared location, user description, and user name. These are the features commonly used in the prior literature [3,4]. Some other features used in the literature (e.g., time zone, UTC offset, user language) were not accessible via the Twitter API at the time of our study. We also do not use URL links or the tweet source, as they do not meaningfully contribute to location prediction [3].
Tokenization is the process of breaking text up into individual tokens. Tweets, however, are more difficult to tokenize than formal text, and tokenization can be challenging for some languages. While some tokenizers may be suitable for most languages (e.g., English and French), they cannot readily be applied to languages without clear word boundaries (e.g., Japanese). To overcome this challenge, we use BertTokenizerFast (https://huggingface.co/bert-base-multilingual-uncased), which is commonly used in the training of multilingual BERT models. As our Twitter corpus is multilingual and very different from the corpora underlying pre-trained language models, we train the tokenizer from scratch on our data using BertTokenizerFast.
BertTokenizerFast implements a subword tokenization algorithm (WordPiece) that splits words into meaningful subwords. The algorithm can thus capture semantic meaning in agglutinative languages such as Turkish, where complex words are formed by combining several subwords. Once trained, BertTokenizerFast converts text input into a machine-readable numerical format based on a dictionary of the 100,000 most frequently occurring subwords.
We use this self-trained BertTokenizerFast tokenizer to tokenize the Twitter features.
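A sketch of this tokenizer-training step, assuming the Hugging Face transformers library. The tiny example corpus is only for illustration; in practice the iterator would run over the roughly 3.9 million tweets in our training data.

```python
from transformers import BertTokenizerFast

# Start from the multilingual BERT tokenizer configuration, then learn a new
# subword vocabulary (up to 100,000 entries) from our own Twitter corpus.
base_tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-uncased")

# Placeholder corpus standing in for the full multilingual tweet sample.
corpus_texts = [
    "Stuck in traffic on the Brooklyn Bridge again",
    "Kahvaltıda simit ve çay, İstanbul sabahı",
    "渋谷のカフェで休憩中",
]

def batch_iterator(texts, batch_size=1000):
    for i in range(0, len(texts), batch_size):
        yield texts[i:i + batch_size]

tweet_tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(corpus_texts), vocab_size=100_000
)

# The trained tokenizer maps raw text to numerical token IDs.
print(tweet_tokenizer("İstanbul'da güzel bir gün")["input_ids"])
```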

Step 3: Partitioning the sample
To avoid overfitting, we partition our sample into three subsets: a training set, a validation set, and a test set. As is standard, we set aside 20% of the sample for the validation and test sets and the remaining 80% for the training set. We use the training set to fit our model, the validation set to estimate the hyperparameters, and the test set to evaluate the predictive ability of the model. To choose the most suitable model with the highest "out-of-sample" accuracy, we compare performance across the validation and test samples. We rely on stratified sampling, as the number of users in different countries can be imbalanced and a random sampling strategy could bias the sample toward bigger countries. We start with a relatively large training dataset (around 3.9 million tweets), which the prior literature deems sufficient for a high-quality model [5].
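A minimal sketch of the stratified 80/20 split, assuming scikit-learn. The records and labels are placeholders, and the equal split of the 20% hold-out between validation and test sets is an illustrative assumption rather than a stated choice.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: feature records and their country labels.
records = [f"tokenized tweet {i}" for i in range(100)]
countries = ["US", "GB", "TR", "JP", "NG"] * 20

# 80% training, 20% held out, stratified by country so that smaller
# countries are not crowded out by a purely random split.
train_x, hold_x, train_y, hold_y = train_test_split(
    records, countries, test_size=0.20, stratify=countries, random_state=42
)

# Split the hold-out into validation and test sets (50/50 is an assumption).
val_x, test_x, val_y, test_y = train_test_split(
    hold_x, hold_y, test_size=0.50, stratify=hold_y, random_state=42
)
print(len(train_x), len(val_x), len(test_x))  # 80 10 10
```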
Step 4: Model architecture
Next, we train an LSTM model based on the network architecture in [3]. In natural language processing, recurrent neural networks (RNNs) represent temporal sequences better than vanilla neural networks or fully connected layers do. LSTM networks are a specific type of RNN with memory cells that enable them to retain longer-range dependencies than conventional RNNs. We rely on the model in [3] to build a set of prediction models and estimate the model parameters in the training set. Supplementary Appendix Fig 1 illustrates the model architecture.
[Insert Supplementary Appendix Fig 1 about here]
The model takes the tokenized text as input and represents it as word embeddings (i.e., vector representations with a lower number of dimensions). This provides advantages over one-hot encoding, which creates a sparse vector lacking context. Learned embeddings group words with similar locational semantics and thus improve efficiency. To avoid overfitting (and to penalize large coefficients), we use dropout layers [6] and randomly disable a proportion of neuron connections (i.e., ignore some layer outputs). Batch normalization is used to reduce internal covariate shift and improve convergence speed during training [7]. The tanh activation function is used after the LSTM layer, and the softmax activation function is used after the final fully connected layer to produce probability scores for each class. We tune the hyperparameters of the model (i.e., learning rate, embedding dimension, and the number of LSTM layers) based on evaluation metrics over the validation set. The performance of each model is assessed on the validation set to identify the parameters that yield the highest prediction accuracy.
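A simplified, single-input sketch of such an architecture in Keras. The layer sizes, dropout rate, and sequence length are illustrative assumptions, and the actual model combines several Twitter features rather than a single text input; see [3] and Supplementary Appendix Fig 1 for the full architecture.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 100_000   # subword vocabulary from the tokenizer
EMBED_DIM = 100        # word-embedding dimension (tuned in Step 5)
NUM_CITIES = 2_187     # output classes (candidate cities)
MAX_LEN = 50           # assumed maximum number of tokens per input

def build_model(embed_dim=EMBED_DIM, num_lstm_layers=1, learning_rate=1e-3):
    model = models.Sequential()
    model.add(layers.Input(shape=(MAX_LEN,)))
    model.add(layers.Embedding(VOCAB_SIZE, embed_dim))      # dense word embeddings
    for i in range(num_lstm_layers):
        # tanh-activated LSTM layers; all but the last return full sequences
        model.add(layers.LSTM(128, activation="tanh",
                              return_sequences=(i < num_lstm_layers - 1)))
    model.add(layers.BatchNormalization())                    # reduce internal covariate shift
    model.add(layers.Dropout(0.3))                            # randomly drop connections against overfitting
    model.add(layers.Dense(NUM_CITIES, activation="softmax"))  # class probabilities
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
model.summary()
```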
Step 5: Tuning model parameters
The model is trained on a single NVIDIA GeForce RTX 3090 GPU with a batch size of 64. We choose the Adam optimization algorithm, an extension of stochastic gradient descent that results in stable and faster convergence [8]. The model is trained for five epochs, where each epoch corresponds to one pass through the whole training dataset. Since we have a large training set with about 3.9 million entries, the training losses and accuracies saturate within five epochs. The learning rate is an important parameter, as it determines the step size of the iterative improvement in stochastic gradient descent. We perform a hyperparameter search over learning rates of 1e-2, 1e-3, and 1e-4, and we report the results in Panel A of Supplementary Appendix Table 1.
[Insert Supplementary Appendix Table 1 about here]
We also vary the other parameters (as shown in Panel B) to choose the set of optimal hyperparameters maximizing the predictive ability in the validation set. The optimized model in [3] uses a word-embedding dimension of 100 and one LSTM layer. As we use a different training set, we experiment with two modifications: (i) increasing the word-embedding dimension from the original 100 to 250, and (ii) increasing the number of LSTM layers from one to three. We perform these experiments to determine the optimal set of parameters for our dataset.
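A schematic of the hyperparameter search over the learning rate (Panel A) and the embedding dimension and number of LSTM layers (Panel B). The build_model sketch above is reused conceptually; train_and_validate stands in for fitting on the training set and scoring on the validation set, and here it only returns a dummy score so the selection logic is visible.

```python
import itertools
import random

learning_rates = [1e-2, 1e-3, 1e-4]   # Panel A
embedding_dims = [100, 250]           # Panel B: word-embedding dimension
lstm_layer_counts = [1, 2, 3]         # Panel B: number of LSTM layers

def train_and_validate(learning_rate, embed_dim, num_lstm_layers):
    """Placeholder: in the actual pipeline this fits the LSTM with the given
    hyperparameters and returns the validation Accuracy (and Mean-ED).
    A dummy score is returned here so the example runs end to end."""
    return random.random()

best_config, best_score = None, float("-inf")
for lr, dim, n_layers in itertools.product(learning_rates, embedding_dims, lstm_layer_counts):
    score = train_and_validate(lr, dim, n_layers)
    if score > best_score:
        best_config, best_score = (lr, dim, n_layers), score

print("Selected hyperparameters:", best_config)
```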
When evaluating the models, we consider two performance metrics. We first use Accuracy, defined as the ratio of correct location predictions:

$$\text{Accuracy} = \frac{1}{|T|} \sum_{t \in T} \mathbf{1}\left[c(t) = c^{*}(t)\right],$$

where $c(t)$ and $c^{*}(t)$ represent the predicted and ground-truth locations for tweet $t$, and $T$ denotes the set of evaluated tweets. Second, we calculate the mean error distance (Mean-ED) as the Haversine distance between the predicted and ground-truth locations. The model with the lower Mean-ED and higher Accuracy is better at predicting the geolocation of tweets. We present the results in Supplementary Appendix Table 1 and choose the parameters with the lowest Mean-ED and highest Accuracy.
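A sketch of the two evaluation metrics; the coordinate arrays are assumed to hold (latitude, longitude) pairs in degrees for the predicted and ground-truth cities.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0

def accuracy(pred_cities, true_cities):
    """Share of tweets whose predicted city matches the ground-truth city."""
    pred_cities, true_cities = np.asarray(pred_cities), np.asarray(true_cities)
    return float(np.mean(pred_cities == true_cities))

def mean_error_distance(pred_coords, true_coords):
    """Mean Haversine distance (km) between predicted and ground-truth
    coordinates; both inputs are arrays of shape (n, 2) in degrees."""
    p = np.radians(np.asarray(pred_coords, dtype=float))
    t = np.radians(np.asarray(true_coords, dtype=float))
    dlat, dlon = t[:, 0] - p[:, 0], t[:, 1] - p[:, 1]
    a = np.sin(dlat / 2) ** 2 + np.cos(p[:, 0]) * np.cos(t[:, 0]) * np.sin(dlon / 2) ** 2
    return float(np.mean(2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))))

# Toy example with two tweets.
print(accuracy(["London", "Paris"], ["London", "Istanbul"]))             # 0.5
print(mean_error_distance([[51.5, -0.1], [48.9, 2.4]],
                          [[51.5, -0.1], [41.0, 29.0]]))                 # km
```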
Accordingly, we use an initial learning rate of 1e-3, a word-embedding dimension of 100, and two LSTM layers in the final model. As learning progresses, the parameters in the LSTM are adjusted via backpropagation so that the network continually improves its ability to predict a tweet's geolocation. We connect the LSTM layers to a dense classification layer with a softmax activation function. Our final model has 15,090,237 parameters.

Supplementary Appendix 2 Conversion of fact-checking categories to fake and non-fake news
Each fact-checking organization has its own news classification scheme.We map these categories to fake and non-fake news as follows.
d) Factcheck: The site has no classification scheme, so we manually read the articles and determine each claim's veracity. The results are robust to the exclusion of articles obtained from this site.

Fact-checking sites
Non-fake News
Negative-sentiment news about a firm verified by a fact-checking organization.

News Content
Data Privacy
An indicator variable equal to one if the news is related to data privacy (e.g., data breach, hacking, etc.), and zero otherwise.

Founder/Executive Management
An indicator variable equal to one if the news is related to the firm's founders or executive management, and zero otherwise.

Product
An indicator variable equal to one if the news is related to the firm's product (e.g., ingredients, spoilage, branding, etc.), and zero otherwise.

Politics
An indicator variable equal to one if the news is related to politics (e.g., gun control, refugees, privatization, etc.), and zero otherwise.

Religion
An indicator variable equal to one if the news is related to religion (e.g., religious belief, symbols, etc.), and zero otherwise.

Operations
An indicator variable equal to one if the news is related to the firm's operations (e.g., investment, advertising, distribution, etc.), and zero otherwise.

Other
An indicator variable equal to one if the news is related to corporate social responsibility, legal, or financial issues, and zero otherwise.

Twitter Data
Foreign Fake News (%)
Percentage of original tweets spreading corporate fake news initiated by a foreign (non-U.S.) Twitter account.

Industry Leader
An indicator variable equal to one if the firm is the largest member of its industry in terms of revenue, and zero otherwise.

Market Structure
TNIC HHI
Sum of the squared sales-based market shares (SALE) of firms in the same industry, using the time-varying Text-based Network Industry Classification (TNIC) developed by Hoberg and Phillips [12].
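For clarity, a standard sales-based formulation of this measure, under the assumption that sales are expressed as within-industry market shares (for industry $j$ in year $t$):

$$\text{TNIC HHI}_{j,t} = \sum_{i \in j} s_{i,t}^{2}, \qquad s_{i,t} = \frac{\text{SALE}_{i,t}}{\sum_{k \in j} \text{SALE}_{k,t}}.$$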

Product Similarity
A firm-year-level measure of product similarity based on the product descriptions from 10-K filings, calculated as the sum of pairwise product similarities between a given firm and all other firms in a given year [12].

Interstate Conflict Risk
Index measuring interstate conflict risk (based on the Goldstein scale) using daily reported events in the global news media, from the Global Database of Events, Language, and Tone (GDELT).

Geopolitical Risk Index
Index based on the share of articles mentioning adverse geopolitical events in leading newspapers in the U.S.