Public discourse and sentiment during the COVID 19 pandemic: Using Latent Dirichlet Allocation for topic modeling on Twitter

  • Jia Xue,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Methodology, Writing – original draft, Writing – review & editing

    Affiliations Factor-Inwentash Faculty of Social Work, University of Toronto, Toronto, Canada, Faculty of Information, University of Toronto, Toronto, Canada

  • Junxiang Chen,

    Roles Formal analysis

    Affiliation School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America

  • Chen Chen,

    Roles Data curation, Methodology

    Affiliation Middleware system research group, University of Toronto, Toronto, Canada

  • Chengda Zheng,

    Roles Methodology, Validation

    Affiliation Faculty of Information, University of Toronto, Toronto, Canada

  • Sijia Li,

    Roles Writing – review & editing

    Affiliations Chinese Academy of Sciences, Institute of Psychology, Beijing, China, Department of Psychology, University of Chinese Academy of Sciences, Beijing, China

  • Tingshao Zhu

    Roles Conceptualization, Supervision

    Affiliation Chinese Academy of Sciences, Institute of Psychology, Beijing, China


The study aims to understand Twitter users’ discourse and psychological reactions to COVID-19. We use machine learning techniques to analyze about 1.9 million Tweets (written in English) related to coronavirus collected from January 23 to March 7, 2020. A total of 11 salient topics are identified and then categorized into ten themes, including “updates about confirmed cases,” “COVID-19 related death,” “cases outside China (worldwide),” “COVID-19 outbreak in South Korea,” “early signs of the outbreak in New York,” “Diamond Princess cruise,” “economic impact,” “preventive measures,” “authorities,” and “supply chain.” Results do not reveal treatment- and symptom-related messages as prevalent topics on Twitter. Sentiment analysis shows that fear of the unknown nature of the coronavirus is dominant in all topics. Implications and limitations of the study are also discussed.


The WHO has declared COVID-19 a global pandemic. Social media played a crucial role before the virus outbreak and continues to do so as the virus spreads globally. After China took strict quarantine measures as an intervention (e.g., city lockdowns, school closures, and self-isolation), Chinese social media platforms (e.g., Weibo, WeChat, Toutiao) became the lifeline for almost all isolated people, who had been housebound for 30+ days and relied on these channels to obtain information, exchange opinions, socialize, and order food [1]. Existing studies [2–5] show that Twitter data can provide useful information about epidemic diseases (e.g., H1N1, Ebola), including tracking rapidly evolving public sentiments, measuring public interests and concerns, estimating real-time disease activity and trends, and tracking reported disease levels. However, these studies are limited in that they rely on manual qualitative coding of a very small number of Tweets. More advanced techniques are needed to improve accuracy and precision when examining public opinions and sentiments. In addition, public reactions to COVID-19 online remain unknown. The vast majority of retrieved articles about COVID-19 and 2019-nCoV focus on epidemic control, such as the transmissibility of the virus [6], clinical characteristics of the infected cases [7], and patient screening [8].

The present study uses a large volume of collected Twitter data to add to our understanding of the pandemic. Aiming to explore public discourse and psychological reactions during the early stage of COVID-19, we use a machine learning approach to examine four questions: (1) What latent topics related to COVID-19 can we identify from these Tweets? (2) What are the themes of these identified topics? (3) How do Twitter users react emotionally to the COVID-19 pandemic? (4) How do these sentiments change over time?


Research design

We used an observational study design and a purposive sampling approach to select all Tweets on Twitter that contained defined hashtags (e.g., #2019nCoV) related to COVID-19. We used natural language processing methods to find salient topics and terms related to COVID-19. Our Twitter data mining approach included data preparation and data analysis. Data preparation consisted of three steps: (1) sampling; (2) data collection; and (3) pre-processing the raw data. After pre-processing the raw dataset, we proceeded to the data analysis stage, which included (1) unsupervised machine learning; (2) qualitative analysis; and (3) sentiment analysis. The unit of analysis was each message-level Tweet posted on Twitter.

Sampling and data collection

We purposely selected a list of 19 trending hashtags related to COVID-19 as key search terms to collect Tweets on Twitter (S1 Table). We used Twitter’s open application programming interface (API) to collect Tweets published between January 23, 2020, and March 7, 2020. We used the Python code provided by Twitter Developer [9] to access the Twitter API. As shown in Fig 1, a total of 20 million (n = 20,370,854) Tweets were collected. After we removed the non-English Tweets (n = 9,694,320) and the duplicates and retweets (n = 7,731,035), 1.9 million (n = 1,963,285) Tweets remained as the dataset for this study. The following features were collected for each Tweet: (1) the full text of the message-level Tweet; and (2) function features, namely (a) hashtags; (b) the number of favorites; (c) the number of followers; (d) the number of friends; (e) the number of retweets; (f) user location; and (g) user description. Our data collection method complied with Twitter’s Terms of Service and Developer’s Agreement and Policy.
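
The filtering described above can be sketched as follows. This is a minimal illustration, not the authors’ code: it assumes each Tweet is a dictionary with `lang` and `text` keys and, for retweets, a `retweeted_status` key, as in Twitter’s v1.1 Tweet objects. The duplicate rule (same text ignoring case and surrounding whitespace) is also an assumption, since the paper does not state how duplicates were defined.

```python
def filter_tweets(tweets):
    """Keep English Tweets that are neither retweets nor duplicates.

    The field names and the duplicate rule are assumptions made for
    illustration; the paper does not specify them.
    """
    seen_texts = set()
    kept = []
    for tweet in tweets:
        text = tweet.get("text", "")
        # Drop Tweets not tagged as English.
        if tweet.get("lang") != "en":
            continue
        # Drop retweets, flagged either by metadata or by the "RT @" prefix.
        if "retweeted_status" in tweet or text.startswith("RT @"):
            continue
        # Drop duplicates, compared case-insensitively on trimmed text.
        key = text.strip().lower()
        if key in seen_texts:
            continue
        seen_texts.add(key)
        kept.append(tweet)
    return kept
```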

Pre-processing the raw dataset

We pre-processed the raw data to ensure quality. We used Python to conduct the data analysis. The pre-processing plan was as follows:

  1. We removed the hashtag symbol and its content (e.g., #COVID19), @users, and URLs from the messages because the hashtag symbols or the URLs did not contribute to the message analysis.
  2. We removed all non-English characters (non-ASCII characters) because the study focused on the analysis of messages in English.
  3. We collapsed repeated characters within words. For example, “sooooo terrified” was converted to “so terrified.”
  4. We removed special characters, punctuation, and numbers from the dataset as they did not help with detecting the profanity comments.
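
The four steps above can be sketched with Python’s standard `re` module. The regular expressions are assumptions chosen for illustration (the paper does not give the exact patterns), and `preprocess` is a hypothetical helper name.

```python
import re

def preprocess(text):
    """Apply the four cleaning steps to one Tweet (illustrative patterns)."""
    # Step 1: remove URLs first, then hashtags (symbol and content) and @users.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[#@]\w+", " ", text)
    # Step 2: drop non-ASCII characters.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Step 3: collapse runs of three or more identical characters to one,
    # so "sooooo" becomes "so". Assumption: a run of 3+ is never intended.
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # Step 4: remove special characters, punctuation, and numbers.
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    # Normalize the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()
```

Collapsing every long run to a single character is a heuristic: it handles “sooooo” but would also turn “goooood” into “god”, so a production pipeline would likely need a dictionary check.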

Data analysis

Unsupervised machine learning.

We used unsupervised machine learning to examine the data for patterns because this approach is commonly used when studies have few prior observations of, or insights into, unstructured text data. A purely qualitative approach would have difficulty analyzing Twitter data at this scale. Unsupervised learning derives a probabilistic clustering from the data itself, allowing us to conduct exploratory analyses of large unstructured texts in social science research. We configured topic modeling, an unsupervised machine learning method, to generate the top latent topic distributions. Latent Dirichlet allocation (LDA) [10] is a probabilistic model of word counts that analyzes a set of documents. We used LDA to identify patterns, themes, and structures in the Tweet texts and to examine how these themes were connected. It enabled us to efficiently categorize large bodies of data based on patterns and features. LDA has been used for sentiment analysis of health-related Tweets [11]. Topic modeling has been widely used to gain a descriptive understanding of unstructured Twitter big data in social science research [12].
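
To make the idea of LDA concrete, the sketch below fits a tiny topic model with collapsed Gibbs sampling using only the Python standard library. It is an illustration under stated assumptions, not the authors’ implementation: the paper does not describe its inference procedure, and the function name `lda_gibbs` and the hyperparameters `alpha` and `beta` are hypothetical choices.

```python
import random

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA to tokenized documents; return one word distribution per topic."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]        # tokens per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                      # tokens per topic
    assignments = []
    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove the token, then resample its topic given all others.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    # Smoothed estimate of each topic's distribution over the vocabulary.
    return [
        {vocab[v]: (topic_word[k][v] + beta) / (topic_total[k] + V * beta)
         for v in range(V)}
        for k in range(num_topics)
    ]
```

Sorting each returned dictionary by probability gives the top words of a topic, which is the kind of output that is later labeled and grouped into themes by human coders.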

Qualitative analysis.

We triangulated and contextualized the findings from unsupervised learning. We employed a qualitative approach to support deeper dives into the dataset, such as labeling popular words and Tweet topics, assigning meanings and themes to the topics, interpreting the themes and patterns identified from the Tweets [13], and inductively developing themes for the latent topics generated by the machine learning algorithm. The qualitative approach relies on diverse, in-depth interpretations by humans, which allows for inductive, exploratory analysis and the application of theoretical approaches [14].

Sentiment analysis.

Sentiment analysis is a computational, natural language processing-based method that analyzes people’s sentiments, emotions, and attitudes in given texts [15], and it is an essential method in social media research. The sentiment analysis in the present study was based on a machine learning model for predicting emotions from English Tweets [16]. This model classified each Tweet into one of the eight emotions, arranged as four pairs, in Plutchik’s wheel of emotions [17]: joy-sadness, trust-disgust, fear-anger, and surprise-anticipation. The method returned one of the eight emotions for each given Tweet.


Descriptive results

After pre-processing and removing duplicates, our final dataset consisted of 1,963,285 Tweets, each mentioning at least one of the nineteen hashtags, posted from January 23 to March 7, 2020. Fig 2 presented the number of Tweets under the top 9 hashtags by date (“#Coronavirus”, n = 1,405,254, “#Wuhan”, n = 144,240, “#Wuhancoronavirus”, n = 73,393, “#Coronaoutbreak”, n = 73,147, “#2019ncov”, n = 60,278, “#ChinaCoronavirus”, n = 19,188, “#Chinavirus”, n = 17,865, “#CoronavirusChina”, n = 16,371, “#Wuhanoutbreak”, n = 10,548). The number of Tweets using the hashtag #coronavirus gradually increased from February 14 and dropped on March 1, when use of the hashtag #wuhanoutbreak suddenly increased for four days.

COVID-19 related topics

The automated LDA approach identified commonly co-occurring words and organized them into different topics. We calculated the most appropriate number of topics using the coherence model in gensim [18]. We set the number of topics to 11 because this value had the highest coherence score. Fig 3 showed the coherence score for each number of topics returned by the LDA model.

We analyzed the document-term matrix with the chosen 11 topics and obtained the distributions of the 11 topics. Table 1 presented the 11 identified salient topics, the most popular pairs of words within each topic, and the number of Tweets under each topic.
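
The selection rule described above (fit models with different numbers of topics, score each, keep the best) can be sketched in plain Python. The example uses the UMass coherence measure because it is easy to compute from document co-occurrence counts; this is a swapped-in illustration, since the paper does not state which coherence measure was configured in gensim. The function names are hypothetical.

```python
import math
from itertools import combinations

def umass_coherence(top_words, docs):
    """UMass coherence of one topic.

    top_words must be ordered from most to least probable and must all
    occur in docs (a list of token lists). Higher means more coherent.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i, j in combinations(range(len(top_words)), 2):
        # i < j, so top_words[i] is the higher-ranked word of the pair.
        joint = doc_freq(top_words[i], top_words[j])
        score += math.log((joint + 1) / doc_freq(top_words[i]))
    return score

def best_num_topics(coherence_by_k):
    """Return the number of topics whose model scored the highest coherence."""
    return max(coherence_by_k, key=coherence_by_k.get)
```

In use, one would average `umass_coherence` over the topics of each candidate model, store the result in a dictionary keyed by the number of topics, and pass that dictionary to `best_num_topics`.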

COVID-19 related themes

We selected representative Tweets for each topic to explain the themes of these topics. Two authors discussed the bigrams and representative Tweets in each of the 11 topics and then categorized them into ten themes (Table 2). In addition, we computed the topic distance [10] and presented a 2D plane of the intertopic distance [19] in Fig 4. Each circle represented one topic, from Topic 1 to Topic 11. The centers of the circles were determined by computing the distance between topics. In the visualization, the circles did not overlap, which cross-validated the classification into the ten themes.
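
As an illustration of what a distance between topics can look like, the sketch below computes the Jensen-Shannon divergence between two topic-word distributions. This is an assumption: the paper does not name the distance it used, although Jensen-Shannon divergence is the default in the pyLDAvis visualization tool. The function name is hypothetical, and both inputs are assumed to be probability vectors over the same vocabulary in the same order.

```python
import math

def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    # Mixture distribution halfway between p and q.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

The value is 0 for identical topics and reaches log 2 for topics that share no words. Computing it for every pair of topics gives a distance matrix that a scaling method can project onto a 2D plane.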

Table 2 presented the identified topics and themes, and each row of bigrams represented one topic under the theme. We identified ten themes, such as “updates about the number of COVID-19 cases (confirmed cases, total confirmed, cases reported),” “COVID-19 related death [(new deaths, total deaths) and (people die, death rates)],” and “preventive measures [(toilet paper, self-isolate), (face masks, panic buying), travel bans, and (washing hands, test kits, 20 seconds, soap water, hands soap)]”.

Table 3 highlighted the representative Tweets within each topic under each theme. To protect the privacy and anonymity of the Twitter users who posted these sample Tweets, we either used excerpts of the Tweets or paraphrased several terms in the message.

Sentiment analysis

Tweets contained information about people’s thoughts and emotions [20]. We presented individuals’ emotional reactions to the COVID-19 pandemic in Fig 5, which showed, by date, the proportion of daily Tweets expressing each emotion. Fear (yellow line) was consistently the dominant emotion over time, accounting for about 50% of daily Tweets from the Wuhan outbreak to early March. Although proportionally lower than fear, trust (brown line) increased slightly over time.
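
The daily proportions plotted in Fig 5 can be derived from per-Tweet emotion labels with a simple count. The sketch below assumes the input is a list of (date, emotion) pairs, one per Tweet; that layout and the function name are illustrative, not taken from the paper.

```python
from collections import Counter, defaultdict

def daily_emotion_share(labeled_tweets):
    """Map each date to the share of that day's Tweets per emotion."""
    counts = defaultdict(Counter)
    for date, emotion in labeled_tweets:
        counts[date][emotion] += 1
    return {
        date: {emotion: n / sum(c.values()) for emotion, n in c.items()}
        for date, c in counts.items()
    }
```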

Sentiments within 11 topics

Table 4 showed the percentage of each emotion within each of the 11 topics. Across all topics, we observed that fear was prominent. For example, fear of the unknown nature of COVID-19 accounted for almost 50% of the Tweets in all eleven topics. Approximately 24% of the emotions within Tweets under Topic 1 related to the public’s trust in the health authorities.

Table 4. Percentage of each emotion within 11 topics and p-value from Z-test.

Since fear was prominent in all eleven topics, we further ran a one-tailed z-test to assess whether the prevalence of each of the eight emotions differed significantly across topics. We used a p-value smaller than .001 as the threshold and presented the results in Table 4. For example, fear of the uncertainty about COVID-19 was significantly more prevalent in Topics 1, 4, 9, and 11. Trust expressed in Tweets was significantly more prevalent in Topics 1, 2, and 10. Surprise at the pandemic was significantly more frequent in Topics 1 and 11. Joy was significantly more widespread in Topics 5, 7, 8, and 11.
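
A one-tailed z-test of this kind can be sketched as a two-proportion test that asks whether an emotion’s share within one topic exceeds its share in a comparison group. The paper does not spell out the exact test statistic or the comparison group, so the pooled-variance form and the function name below are assumptions.

```python
import math

def one_tailed_z_test(hits_a, n_a, hits_b, n_b):
    """Test whether proportion A exceeds proportion B.

    hits_a of n_a Tweets in group A carry the emotion, and likewise for
    group B. Returns (z, p_value) with p_value = P(Z >= z) under the null.
    """
    p_a, p_b = hits_a / n_a, hits_b / n_b
    # Pooled proportion under the null hypothesis of equal shares.
    pooled = (hits_a + hits_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

With the paper’s threshold, an emotion would be flagged as significantly more prevalent in a topic when the returned p-value is below .001.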

Discussion and conclusion

This study shows Twitter users’ discussions of and sentiments about COVID-19 from January 23 to March 7, 2020. Our findings facilitate a rapid, real-time understanding of public discussions and sentiments about the COVID-19 outbreak, contributing to surveillance systems that track the evolving situation. The study overcomes the limitations of the traditional social science approach, which relies on time-consuming, retrospective, time-lagged, small-scale surveys and interviews. The identified patterns and emotions in public Tweets could be used to guide targeted intervention programs.

First, early recognition of COVID-19 cases and a potential outbreak in New York City was identified among a massive number of Tweets, suggesting that the Twitter community had acknowledged the severity of the disease as early as February. A small peak in Tweet volume appears between February 10 and 14, and the volume then gradually increases again after February 14. This finding coincides with the CDC’s very first warning on Twitter (@CDCgov) on February 10, 2020: “If you’ve recently from China, know the symptoms of #2019nCoV. These include mild to severe respiratory illness with fever, cough, shortness of breath. See” The increase in Tweets may have followed the CDC’s post, suggesting that February offered a good opportunity to guide the public toward preventive measures. Rapidly identifying and utilizing social media messages may help the public and authorities respond to the spread of the disease at the early stages.

Second, discussions of COVID-19 symptoms (e.g., cough, fever, difficulty breathing) and treatments (e.g., vaccine, rest and sleep, drink liquids) were notably missing from our collected Tweets from January 23 to March 7, 2020. One study selects Tweets (n = 35,786) associated with COVID-19 symptoms (e.g., diagnosed, pneumonia, fever, cough) from March 3 to 20, 2020, and finds that the volume of signal Tweets for symptoms increases over time [21]. The inconsistent findings suggest that Twitter is not widely used as a platform for posting symptoms or seeking medical help. Our findings suggest that health authorities or public health communities could post more treatment-related messages on social media as an educational tool for the public.

Third, fear is a dominant emotion in all topics during the early stages of the COVID-19 pandemic. The results are consistent with other studies [22–25], which show that COVID-19 significantly impacts individuals’ psychological conditions. Sentiment analysis of COVID-19 related content contributes to our understanding of the dynamics of online users’ concerns and feelings during the epidemic. Our findings imply that health authorities need to provide mental health and psychosocial well-being support during this time [20].

There are several limitations to the study. First, we sample only 19 trending hashtags as search terms to collect Twitter data. New hashtags have since become trending terms that Twitter users employ to group topics. For example, #COVID19 has been widely used since COVID-19 became the official name of the disease. Second, Twitter users are not representative of the whole population, so the results indicate only online users’ opinions and reactions about COVID-19. However, the Twitter dataset is a valuable source for understanding real-time, user-generated content related to COVID-19 disease activity. Third, non-English Tweets are removed from the analysis, so the results are limited to a particular population. Future studies are recommended to include Italian, German, and Spanish Tweets in COVID-19 analyses.


  1. Wu H. The coronavirus and Chinese social media: finger-pointing in the post-truth era; 2020 [cited 2020 July 7] [Internet]. Available from:
  2. Chew C, Eysenbach G. Pandemics in the age of Twitter: content analysis of Tweets during the 2009 H1N1 outbreak. PLoS ONE. 2010; 5(11): e14118. pmid:21124761
  3. Jones JH, Salathe M. Early assessment of anxiety and behavioral response to novel swine-origin influenza A (H1N1). PLoS ONE. 2009; 4(12): e8032. pmid:19997505
  4. Kim Y, Kim JH. Using photos for public health communication: a computational analysis of the Centers for Disease Control and Prevention Instagram photos and public responses. Health Informatics Journal. 2020 Jan 23. pmid:31969051
  5. Signorini A, Polgreen PM, Segre AM. Using Twitter to estimate H1N1 influenza activity. Proceedings of the 9th Annual Conference of the International Society for Disease Surveillance; 2010 Dec. Emerging Health Threats Journal, 2011.
  6. Chen J. Pathogenicity and transmissibility of 2019-nCoV: a quick overview and comparison with other emerging viruses. 2020; 22(2): 69–71. pmid:32032682
  7. Chen N, Zhou M, Dong X, Qu J, Gong F, Han Y, et al. Epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in Wuhan, China: a descriptive study. Lancet. 2020; 395(10223): 507–513. pmid:32007143
  8. Quilty BJ, Clifford S, Flasche S, Eggo RM. Effectiveness of airport screening at detecting travellers infected with novel Coronavirus (2019-nCoV). Euro Surveillance. 2020; 25 (5): 1560–7917.
  9. Get Tweet timelines; 2020 [cited 2020 June 17] [Internet]. Available from
  10. Blei DM, Ng AY, Jordan MI. Latent dirichlet allocation. Journal of Machine Learning Research. 2003; 3(4–5): 993–1022.
  11. Paul MJ, Dredze M. Discovering health topics in social media using topic models. PLoS ONE. 2014; 9(8): e103408. pmid:25084530
  12. Schwartz HA, Eichstaedt JC, Kern ML, Dziurzynski L, Ramones SM, Agrawal M, et al. Personality, gender, and age in the language of social media: The open-vocabulary approach. PLoS ONE. 2013; 8(9): e73791. pmid:24086296
  13. Braun V, Clarke V. Using thematic analysis in psychology. Qualitative Research in Psychology. 2006; 3(2): 77–101. pmid:32100154
  14. Murthy D. The ontology of tweets: mixed methods approaches to the study of Twitter. In: Pertti A, Leonard B, Julia B, editors. The SAGE handbook of social media research methods. London: SAGE publication; 2017. pp. 559–572.
  15. Beigi G, Hu X, Maciejewski R, Liu H. An overview of sentiment analysis in social media and its applications in disaster relief. In: Pedrycz W, Chen SM, editors. Sentiment analysis and ontology engineering. Berlin: Springer Cham; 2016. pp. 313–340.
  16. Colnerič N, Demšar J. Emotion Recognition on Twitter: Comparative Study and Training a Unison Model. IEEE Transactions on Affective Computing. 2018; 99: 1.
  17. Plutchik R. A General Psychoevolutionary Theory of Emotion. In: Robert P, Henry K, editors. Theories of Emotion. USA: Academic Press; 1980. pp. 3–33.
  18. Röder M, Both A, Hinneburg A. Exploring the space of topic coherence measures. Proceedings of the 8th ACM international conference on Web search and data mining; 2015. pp. 399–408.
  19. Chuang J, Ramage D, Manning C, Heer J. Interpretation and trust: designing model-driven visualizations for text analysis. Paper presented at: SIGCHI Conference on Human Factors in Computing Systems; 2012; Austin, Texas.
  20. Griffis H, Asch DA, Schwartz HA, Ungar L, Buttenheim AM, Barg FK, et al. Using Social Media to Track Geographic Variability in Language About Diabetes: Infodemiology Analysis. JMIR Diabetes. 2020; 5(1): e14431. pmid:32044757
  21. Mackey T, Purushothaman V, Li J, Shah N, Nali M, Bardier C, et al. Machine learning to detect self-reporting of symptoms, testing access, and recovery associated with COVID-19 on Twitter: retrospective big data infoveillance Study. JMIR Public Health and Surveillance. 2020; 6(2): e19509. pmid:32490846
  22. Li SJ, Wang YL, Xue J, Zhao N, Zhu TS. The impact of COVID-19 epidemic declaration on psychological consequences: a study on active Weibo users. Int. J. Environ. Res. Public Health. 2020; 17: 2032. pmid:32204411
  23. Lwin MO, Lu J, Sheldenkar A, Schulz PJ, Shin W, Gupta R, et al. Global sentiments surrounding the COVID-19 pandemic on Twitter: analysis of Twitter trends. JMIR Public Health and Surveillance. 2020; 6(2): e19447. pmid:32412418
  24. Su Y, Xue J, Liu X, Wu P, Chen J, Chen C, et al. Examining the impact of COVID-19 lockdown in Wuhan and Lombardy: a psycholinguistic analysis on Weibo and Twitter. Int. J. Environ. Res. Public Health. 2020; 17(12): 4552. pmid:32599811
  25. Xue J, Chen J, Hu R, Chen C, Zheng C, Zhu T. Twitter discussions and concerns about COVID-19 pandemic: Twitter data analysis using a machine learning approach. arXiv: 2005.12830 [Preprint]. 2020 [cited 2020 July 7]. Available from: