Public discourse and sentiment during the COVID-19 pandemic: Using Latent Dirichlet Allocation for topic modeling on Twitter

The study aims to understand Twitter users' discourse and psychological reactions to COVID-19. We use machine learning techniques to analyze about 1.9 million English-language Tweets related to coronavirus collected from January 23 to March 7, 2020. A total of 11 salient topics are identified and then categorized into ten themes: "updates about confirmed cases," "COVID-19 related death," "cases outside China (worldwide)," "COVID-19 outbreak in South Korea," "early signs of the outbreak in New York," "Diamond Princess cruise," "economic impact," "preventive measures," "authorities," and "supply chain." Results do not reveal treatment- or symptom-related messages as prevalent topics on Twitter. Sentiment analysis shows that fear of the unknown nature of the coronavirus is dominant across all topics. Implications and limitations of the study are also discussed.


Introduction
The WHO has declared COVID-19 a global pandemic. Social media played a crucial role before the virus outbreak and continues to do so as it spreads globally. After China took strict quarantine measures as an intervention (e.g., city lockdowns, school closures, and enforced self-isolation), Chinese social media platforms (e.g., Weibo, WeChat, Toutiao) became the lifeline for almost all isolated people, who had been housebound for 30+ days and relied on these channels to obtain information, exchange opinions, socialize, and order food [1]. Existing studies [2][3][4][5] show that Twitter data can provide useful information for epidemic diseases (e.g., H1N1, Ebola), including tracking rapidly evolving public sentiments, measuring public interests and concerns, estimating real-time disease activity and trends, and tracking reported disease levels. However, these studies have limitations, qualitatively hand-coding only a very small number of Tweets; more advanced techniques are required to improve the accuracy and precision of examining public opinions and sentiments. In addition, public reactions to COVID-19 online remain largely unexamined. The vast majority of the articles we searched about COVID-19 and 2019-nCoV focus on epidemic control, such as the transmissibility of the virus [6], clinical characteristics of infected cases [7], and patient screening [8].
The present study uses this large collection of Twitter data to add to our understanding of the pandemic. Aiming to explore public discourse and psychological reactions during the early stage of COVID-19, we use a machine learning approach to examine: (1) What latent topics related to COVID-19 can we identify from these Tweets? (2) What are the themes of these identified topics? (3) How do Twitter users emotionally react to the COVID-19 pandemic? And (4) How do these sentiments change over time?

Research design
We used an observational study design and a purposive sampling approach to select all Tweets containing defined hashtags (e.g., #2019nCoV) related to COVID-19. We used natural language processing methods to find salient topics and terms related to COVID-19. Our Twitter data mining approach included data preparation and data analysis. Data preparation consisted of three steps: (1) sampling; (2) data collection; and (3) pre-processing the raw data. After pre-processing the raw dataset, we proceeded to the data analysis stage, which included (1) unsupervised machine learning; (2) qualitative analysis; and (3) sentiment analysis. The unit of analysis was the individual Tweet message.

Sampling and data collection
We purposely selected a list of 19 trending hashtags related to COVID-19 as key search terms to collect Tweets (S1 Table). We used Twitter's open application programming interface (API) to collect Tweets published between January 23, 2020, and March 7, 2020, accessing the API with the Python code provided by Twitter Developer [9]. As shown in Fig 1, a total of 20 million (n = 20,370,854) Tweets were collected. After we removed non-English Tweets (n = 9,694,320) and duplicates and retweets (n = 7,731,035), 1.9 million (n = 1,963,285) Tweets remained as the dataset for this study. The following features were collected for each Tweet: (1) the full text of the message; and (2) the function features of (a) hashtags; (b) the number of favorites; (c) the number of followers; (d) the number of friends; (e) the number of retweets; (f) user location; and (g) user description. Our data collection method complied with Twitter's Terms of Service and Developer Agreement and Policy.
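The filtering steps above (keeping English Tweets, dropping retweets and duplicates) can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the `full_text` and `lang` fields follow the Twitter API's standard Tweet object, and the sample records are hypothetical.

```python
# Filter a raw Tweet collection: keep English Tweets, drop retweets
# and duplicate texts, mirroring the cleaning steps described above.
def filter_tweets(tweets):
    seen = set()
    kept = []
    for t in tweets:
        text = t["full_text"]
        if t.get("lang") != "en":     # remove non-English Tweets
            continue
        if text.startswith("RT @"):   # remove retweets
            continue
        if text in seen:              # remove duplicate texts
            continue
        seen.add(text)
        kept.append(t)
    return kept

raw = [
    {"full_text": "Confirmed cases rising #2019nCoV", "lang": "en"},
    {"full_text": "RT @user: Confirmed cases rising", "lang": "en"},
    {"full_text": "Confirmed cases rising #2019nCoV", "lang": "en"},  # duplicate
    {"full_text": "Casos confirmados #2019nCoV", "lang": "es"},
]
print(len(filter_tweets(raw)))  # 1
```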

Pre-processing the raw dataset
We pre-processed the raw data to ensure quality, using Python to conduct the data analysis. The pre-processing plan was as follows:
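The paper's enumerated pre-processing steps are not reproduced in this excerpt. A minimal sketch of a typical Tweet-cleaning pipeline, assuming standard steps such as lowercasing, URL and @mention removal, tokenization, and stop-word filtering (the stop-word list here is an abbreviated stand-in):

```python
import re

# A minimal Tweet-cleaning sketch: lowercase, strip URLs, @mentions,
# and non-letter characters, then tokenize and drop common stop words.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}

def preprocess(text):
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove @mentions
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters only
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 2]

print(preprocess("The #coronavirus is spreading: 5 new cases https://t.co/xyz @WHO"))
# ['coronavirus', 'spreading', 'new', 'cases']
```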

Data analysis
Unsupervised machine learning. We used unsupervised machine learning to examine the data for patterns because this approach is commonly used when researchers have few prior observations or insights into unstructured text data, and a purely qualitative approach faces challenges in analyzing Twitter data at this scale. Unsupervised learning derives a probabilistic clustering from the data itself, allowing us to conduct exploratory analyses of large unstructured texts in social science research. We configured topic modeling, an unsupervised machine learning method, to generate the top latent topic distributions. Latent Dirichlet allocation (LDA) [10] is a probabilistic model of word counts that analyzes a set of documents. We used LDA to identify patterns, themes, and structures in the Tweet texts and to examine how these themes were connected. It enabled us to efficiently categorize the large body of data based on patterns and features. LDA has been used for sentiment analysis of health-related Tweets [11], and topic modeling has been widely used to gain a descriptive understanding of unstructured Twitter big data in social science research [12].
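As a brief sketch of the model in [10], in its standard formulation: for each document $d$, LDA draws a topic mixture $\theta_d$, and each word $w_{d,n}$ is generated by first drawing a topic assignment $z_{d,n}$ and then drawing the word from that topic's word distribution $\phi_k$:

```latex
% LDA generative process (standard formulation, following [10])
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
\phi_k \sim \mathrm{Dirichlet}(\beta),
\\
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
```

Here each Tweet is treated as a document, and inference recovers the latent topic-word distributions $\phi_k$ and per-Tweet topic mixtures $\theta_d$.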
Qualitative analysis. We triangulated and contextualized the findings from unsupervised learning. We employed the qualitative approach to support deeper dives into the dataset, such as labeling popular words and Tweet topics, assigning meanings and themes to the topics, interpreting the themes and patterns identified from the Tweets [13], and inductively developing themes for the latent topics generated by the machine algorithms. The qualitative approach relies on diverse, in-depth interpretations from humans, which allows for inductive, exploratory analysis and the application of theoretical approaches [14].

Sentiment analysis. Sentiment analysis is a computational, natural language processing-based method that analyzes people's sentiments, emotions, and attitudes in given texts [15], and it is an essential method in social media research. The sentiment analysis in the present study was based on a machine learning model for predicting emotions from English Tweets [16]. This model classified each Tweet into one of the eight pairwise emotions in Plutchik's wheel of emotions [17]: joy-sadness, trust-disgust, fear-anger, and surprise-anticipation. The method returned one emotion from the eight categories for each given Tweet.

COVID-19 related topics
The automated LDA approach generated commonly co-occurring words and organized them into different topics. We determined the most appropriate number of topics using the gensim coherence model [18]. We set the number of topics returned by LDA for this dataset to 11 because it yielded the highest coherence score. Fig 3 shows the coherence score for each candidate number of topics returned by the LDA model.
We analyzed the document-term matrix with the chosen 11 topics and obtained the distributions of the 11 topics. Table 1 presents the 11 identified salient topics, the most popular pairs of words within each topic, and the number of Tweets under each topic.

COVID-19 related themes
We generated representative Tweets for each topic to explain the themes of these topics. Two authors discussed the bigrams and representative Tweets in each of the 11 topics and then categorized them into ten themes (Table 2). In addition, we computed the topic distance [10] and presented a 2D plane of the intertopic distances [19] in Fig 4. Each circle represents one of the topics, from Topic 1 to Topic 11; the centers are determined by computing the distances between topics. In the visualization, these circles do not overlap, which cross-validates the classification of the ten themes. Table 2 presents the identified topics and themes; each row of bigrams represents one topic under its theme. We identified ten themes, such as "updates about the number of COVID-19 cases (confirmed cases, total confirmed, cases reported)," "COVID-19 related death [(new deaths, total deaths) and (people die, death rates)]," and "preventive measures [(toilet paper, self-isolate), (face masks, panic buying), travel bans, and (washing hands, test kits, 20 seconds, soap water, hands soap)]." Table 3 highlights the representative Tweets within each topic under each theme. To protect the privacy and anonymity of the Twitter users behind these sample Tweets, we used either excerpts of Tweets or paraphrased several terms in each message.
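Intertopic distance maps of the kind cited above [19] are typically built from a pairwise divergence between topic-word distributions (Jensen-Shannon divergence, projected to 2D via multidimensional scaling). A minimal sketch of the divergence computation, with hypothetical topic-word distributions over a three-word vocabulary:

```python
from math import log

def kl(p, q):
    # Kullback-Leibler divergence; assumes strictly positive q entries.
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Jensen-Shannon divergence: symmetric and bounded by log(2),
    # so it behaves as a distance-like score between topics.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical word distributions for two topics.
topic_a = [0.7, 0.2, 0.1]
topic_b = [0.1, 0.2, 0.7]
print(round(js_divergence(topic_a, topic_b), 3))  # 0.253
```

Non-overlapping circles in the resulting 2D plane indicate topics whose word distributions are well separated by this measure.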

Sentiment analysis
Tweets contain information about people's thoughts and emotions [20]. We present individuals' emotional reactions to the COVID-19 pandemic in Fig 5, which shows the proportion of emotional Tweets among daily Tweets by date. Fear (yellow line) was consistently the dominant emotion over time, accounting for about 50% of daily Tweets from the Wuhan outbreak to early March. While proportionally lower than fear, Tweets expressing trust (brown line) slightly increased over time. Table 4 shows the percentage of each emotion within each of the 11 topics. Across all topics, the feeling of fear was prominent; for example, fear of the unknown nature of COVID-19 accounted for almost 50% of the Tweets in all eleven topics. Approximately 24% of the emotions within Tweets under Topic 1 related to the public's trust in the health authorities.
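The daily emotion proportions plotted in Fig 5 amount to a simple aggregation over per-Tweet labels. In this sketch, the date-emotion records are hypothetical stand-ins for the classifier's output:

```python
from collections import Counter, defaultdict

# Each record: (date, emotion label predicted for the Tweet).
labeled = [
    ("2020-01-23", "fear"), ("2020-01-23", "fear"),
    ("2020-01-23", "trust"), ("2020-01-24", "fear"),
    ("2020-01-24", "joy"),
]

def daily_proportions(records):
    # Count emotions per date, then normalize by that date's Tweet total.
    by_date = defaultdict(Counter)
    for date, emotion in records:
        by_date[date][emotion] += 1
    return {date: {e: n / sum(c.values()) for e, n in c.items()}
            for date, c in by_date.items()}

props = daily_proportions(labeled)
print(props["2020-01-23"]["fear"])  # 0.6666666666666666
```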

Sentiments within 11 topics
Since fear was prominent in all eleven topics, we further ran a one-tailed z test to assess whether each of the eight emotions differed statistically significantly across topics. We used a p-value smaller than .001 as the threshold and present the results in Table 4. For example, fear of the uncertainty about COVID-19 had a higher probability of being prevalent in Topics 1, 4, 9, and 11. Trust expressed in Tweets was significantly more prevalent in Topics 1, 2, and 10. Surprise at the pandemic was significantly more frequent in Topics 1 and 11. Joy was significantly more widespread in Topics 5, 7, 8, and 11.
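A one-tailed test of this kind can be sketched as a pooled two-proportion z test, comparing an emotion's proportion within one topic against its proportion in the remaining Tweets. The paper does not give its exact test statistic, so this is a standard-formula sketch with hypothetical counts:

```python
from math import sqrt, erfc

def one_tailed_z_test(x1, n1, x2, n2):
    # H1: p1 > p2. Pooled two-proportion z test; returns (z, p-value).
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)                  # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 0.5 * erfc(z / sqrt(2))          # upper-tail probability
    return z, p_value

# Hypothetical counts: fear Tweets within one topic vs. all other topics.
z, p = one_tailed_z_test(5500, 10000, 48000, 100000)
print(p < .001)  # True
```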

Discussion and conclusion
This study shows Twitter users' discussions of and sentiments toward COVID-19 from January 23 to March 7, 2020. Our findings facilitate a rapid, real-time understanding of public discussions and sentiments about the outbreak of COVID-19, contributing to surveillance systems for understanding the evolving situation. The study overcomes the limitations of the traditional social science approach, which relies on time-consuming, retrospective, time-lagged, small-scale surveys and interviews. The identified patterns and emotions of public Tweets could be used to guide targeted intervention programs. First, early recognition of COVID-19 cases and a potential outbreak in New York City was identified among a massive number of Tweets, suggesting that the Twitter community acknowledged the disease's severity as early as February. A small peak in Tweet volume is identified between February 10 and 14, after which the volume gradually increases again. This finding also coincides with the CDC's very first warning on Twitter (@CDCgov) on February 10, 2020: "If you've recently [traveled] from China, know the symptoms of #2019nCoV. These include mild to severe respiratory illness with fever, cough, shortness of breath. See bit.ly/38zjnYo." The increasing number of Tweets may have followed the CDC's post, suggesting a good opportunity to guide the public to take preventive measures in February. Rapidly identifying and utilizing social media messages may help the public and authorities respond to the spread of the disease at the early stages.
Second, discussions of COVID-19 symptoms (e.g., cough, fever, difficulty breathing) and treatments (e.g., vaccine, rest and sleep, drinking liquids) were notably missing from the Tweets we collected from January 23 to March 7, 2020. One study selected Tweets (n = 35,786) associated with COVID-19 symptoms (e.g., diagnosed, pneumonia, fever, cough) from March 3 to 20, 2020, and found that the volume of signal Tweets for symptoms increased over time [21]. The inconsistent findings suggest that, during our study period, Twitter was not yet widely used as a platform for posting symptoms or seeking medical help. These findings indicate that health authorities and public health communities could post more treatment-related messages on social media as an educational tool for the public.

Authorities
• ". . . Trump lied about #coronavirus, vote him out #voteblue #JoeBiden2020. . ."
• "coronavirus 'likely' to hit UK-professors say public health officials must do more #coronavirus. . ."
• "Mike pence will stop #coronavirus with gender segregated workplaces and don't tell him otherwise. . ."
• ". . .Chinese doctor #LiWenLiang, one of the eight HERO whistleblowers who tried to warn other . . ."
• ". . .is the figure #WHO told us the coronavirus is under control? Let there be no panic. . ."
• ". . .the PRESIDENT OF THE UNITED STATES said the coronavirus was not a concern anymore #CDC. . ."

Supply chain
• "with #wuhancoronavirus, the supply chain in China will soon collapse, better prepare for the global shortage of supply of everything. . .?"
• ". . .@Catalysis3D can help with low cost and fast additive manufactured bridge tooling and part. . .#supplychain. . ."

Third, the sentiment analysis captured online users' concerns and feelings during the epidemic. Our findings have implications for health authorities: mental health and psychosocial well-being support is needed during this time [20]. There are several limitations to the study. First, we sampled only a trending list of 19 hashtags as search terms to collect Twitter data, and new hashtags have since become trending terms that Twitter users use to group topics over time; for example, #COVID19 has been widely used since it became the official name for the virus. Second, Twitter users are not representative of the whole population and only indicate online users' opinions and reactions to COVID-19. Nevertheless, the Twitter dataset is a valuable source for understanding real-time user-generated content related to COVID-19 disease activity. Third, non-English Tweets were removed from the analysis, so the results are limited to a particular population. Future studies are recommended to include Italian, German, and Spanish Tweets in COVID-19 analyses.