
Towards using Tweet sentiment for infectious disease detection

  • James Stassinos,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Geography and Geoinformation Science, George Mason University, Fairfax, Virginia, United States of America

  • Taylor Anderson,

    Roles Conceptualization, Funding acquisition, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    tander6@gmu.edu

    Affiliation Department of Geography and Geoinformation Science, George Mason University, Fairfax, Virginia, United States of America

  • Andreas Züfle

    Roles Conceptualization, Funding acquisition, Investigation, Methodology, Project administration, Supervision, Writing – original draft, Writing – review & editing

    Affiliation Department of Computer Science, Emory University, Atlanta, Georgia, United States of America

Abstract

Social media data has shown potential for identifying infectious disease outbreaks faster than official records of disease incidence. We examine spatial, temporal, and spatiotemporal relationships between COVID-19-related microblog sentiment and COVID-19 cases to investigate whether microblog-derived sentiment can be used for local infectious disease outbreak early warning. To this end, we measure the sentiment of 56,755,894 COVID-19-related microblogs (tweets) from the microblogging platform X. We group these tweets by county and by calendar week to investigate spatial and temporal correlation between sentiment and observed cases (in the corresponding county and week). Our temporal analysis shows a significant negative correlation between sentiment and cases between June and September 2020. During this time, tweet sentiment could have served as an early warning for new COVID-19 outbreaks. Our spatial analysis shows that the East of the United States exhibits a significant negative correlation between sentiment and cases, while the West exhibits a significant positive correlation. In these regions, tweet sentiment could have been used as an early warning signal for new outbreaks. Our spatiotemporal analysis discovers even stronger correlations in certain regions during certain time periods. If we could understand when, where, and why this correlation is strong, then we may be able to leverage social media as a successful early warning system.

Introduction

Accurate and timely data capturing the ecology and evolution of infectious diseases such as COVID-19 has been crucial for informing policy interventions [1–3]. To measure the cases and deaths due to COVID-19 across the US, the Centers for Disease Control and Prevention (CDC) gathered data from jurisdictional and state partners that independently report the new daily number of confirmed COVID-19 cases and deaths for each county of residence, based on testing data captured by pharmacies, hospitals, and other testing facilities [4]. However, due to a decentralized and fragmented public health infrastructure, this data is subject to inherent biases due to budget and resource availability or group characteristics, under-ascertainment of mild cases, the changing definition of a “confirmed case,” reporting lags, and difficulty disentangling the cause of death [5–7]. One such example is testing bias, where some locations may have better testing infrastructure, well-funded access to testing, and less stigma around getting tested [5]. In another example, a lack of standardization resulted in an inconsistent set of reporting metrics, where some states defined a “confirmed case” as the total count of positive tests and others as the total number of unique individuals who tested positive [8]. Furthermore, delays and backlog in the reporting pipeline resulted in a 3–21 day lag between the time a patient tested positive and the time they were reported as having tested positive [9].

Publicly available data from news outlets, chat rooms, web searches, or social media has proven to be a valuable tool for disaster response and management, providing insights into both real and perceived threats [10,11]. For example, Kryvasheyeu et al. [12] find strong relationships between the proximity to tornadoes, hurricanes, earthquakes, and other natural disasters and Twitter activity. Such data has also been used as a supplement to official data sources to identify disease outbreaks faster than they are reported by the CDC and can even detect outbreaks not detected by official sources [13–16]. Collectively, these sources, referred to as digital health data, provide a lens into public health that is fundamentally different from that yielded by official sources [17]. Among such sources, microblogs such as those posted by users on X (the platform formerly known as Twitter) are used to detect the prevalence of diseases such as influenza and dengue fever. In some cases, the volume of such microblogs (called Tweets) is used as an indicator of local disease outbreaks. More specifically, the number of tweets that contain keywords relating to a disease or that have been classified as a “self report” has been found to correlate with CDC-reported cases and other official sources and thus can be used as a tool for early detection and monitoring [18–23].

In other cases, additional analyses of the Tweets, using natural language processing approaches such as sentiment analysis, have been used to differentiate between Tweets that are positively associated with disease, e.g. “my flu shot worked, no flu for me!”, and negatively associated with disease, e.g. “my whole family has the flu”, and to predict to what extent influenza is present in the population over time [24–27]. Whereas Tweet sentiment and official influenza cases tend to be inversely associated, the relationship is less clear for COVID-19. For example, Valdez et al. [28] were surprised to observe a positive correlation between US-wide COVID-19-related Tweet sentiment and cases and deaths, meaning that as cases and deaths increase in the US, sentiment towards COVID-19 trends positive. This contradicts what would intuitively be expected. In another example, Feng and Kirkley [29] find a weak negative or absent correlation between state-level Tweet sentiment and COVID-19 cases and deaths.

Therefore, the objective of this study is to examine the relationship between COVID-19-related Tweet sentiment and official case data over space and time and to assess the extent to which Tweet-derived sentiment can be used for local COVID-19 surveillance in the United States. We leverage the TBCOV dataset [30], containing a total of two billion geolocated COVID-19-related Tweets between January 2020 and December 2021. After pre-processing, we use approximately 57 million geo-located tweets in the United States. Details of the pre-processing are described in the Data section. We examine the county-level correlation between Tweet sentiment and official data from February 1, 2020 to March 31, 2021 to determine whether, when, and where Tweet sentiment from a county can be used as a predictor of the number of cases in the same county. To extract sentiment from Tweets, we use existing sentiment analysis algorithms which leverage natural language processing, text analysis, and computational linguistics to systematically identify, extract, quantify, and study people’s moods, opinions, attitudes, and emotions in written text [31]. For each Tweet, we measure the Sentiment Polarity, which is a score associated with a direct opinion about an object on a scale from –1 (negative) to +1 (positive) [31]. We leverage multiple sentiment analysis tools, including TextBlob [32], VADER [33], AFINN [34], and TBCOV [30], as our measures for Sentiment Polarity.

We intuitively assume that there should be an inverse association between COVID-19 Tweet sentiment and COVID-19 cases at the local level. Thus, places experiencing a high number of COVID-19 cases are expected to have a low COVID-19 sentiment during that time. We can observe this inverse association in Fig 1, which shows both the COVID-19-related sentiment in Tweets and the number of COVID-19 cases for the entire U.S. and for some locations in the U.S. For example, in the case of New York City near the end of April 2020, when the number of local COVID-19 cases dramatically decreased, we observe a large increase in mean sentiment. Yet, in general, it is difficult to discern any clear relationship, thus warranting further investigation. If there is an inverse association between COVID-19 Tweet sentiment and COVID-19 cases at a local level, we anticipate that COVID-19-related Tweet sentiment could be used to supplement official disease surveillance data streams and provide important insights for local outbreak detection of diseases [35]. To our knowledge, there is no existing study that examines the relationship between Tweet sentiment and infectious disease cases at a spatially local level.

Fig 1. Seven-day rolling average of (a) COVID-19 tweet sentiment and (b) COVID-19 incident cases.

https://doi.org/10.1371/journal.pone.0325166.g001

Data

Tweet preprocessing

The primary source of geo-located Tweets used in this work is the TBCOV dataset [30]. This dataset contains over two billion Tweets collected world-wide with keywords related to COVID-19. All Tweets are enriched with geo-location information, using either the location directly provided by the user or location information from publicly available user profiles. The dataset covers a 424-day period from February 1, 2020 to March 31, 2021. Since the X platform (formerly Twitter) does not allow the re-distribution of data from Tweets, the TBCOV dataset is provided as a “dehydrated” dataset which includes only the Tweet identifiers. To obtain the full dataset, we rehydrated the Tweets using the Twitter API (now with reduced capabilities).

TBCOV is a global dataset. Therefore, we first removed all of the Tweets from outside of the US, reducing the number of tweets that we consider to 384,073,303. To avoid redundant information, we next removed all re-Tweets from the dataset. Re-Tweets echo the tweet of another user and would contribute noise to our analysis. This step reduced the number of Tweets to 120,678,732. Furthermore, some Tweets were not successfully rehydrated because the original Tweets may already have been removed, either through deletion by the user or because the account was removed or banned from the platform. The Tweets in this research were rehydrated between April 3 and April 10, 2022; the number of Tweets recovered depends on when the dataset is rehydrated. Out of the 120,678,732 unhydrated Tweets, we successfully rehydrated 77% for a total of 93,362,576. Globally, very few Tweets (0.1%) in the TBCOV dataset have explicit coordinates (latitude, longitude). Thus, for the majority of the Tweets, the Tweet location is inferred either from the user location or from the content of the Tweet text. The authors of the TBCOV dataset report F1-scores of 76.1% and higher for county-level localization using the location of the user [30]. However, for locations inferred from tweet text, the F1-score was only 0.100%. For this reason, we discarded all tweets whose location is estimated using tweet text, reducing our dataset to a total of 56,755,894 Tweets. The TBCOV dataset also infers the user type of each tweet from associated account information. We kept only the tweets from accounts labelled as persons, to improve our ability to detect individual emotion rather than that of organizational accounts. After this filtering, the dataset was reduced to 30,328,442 tweets. Using our set of Tweets that are reliably geo-located, we then group all Tweets by day and by county.
By doing so, we observe that many counties have too few observed Tweets to be a representative sample. Therefore, we used a 7-day rolling sum of Tweets for each county and discarded (county, day) pairs with fewer than 30 Tweets total across the seven days. The original TBCOV dataset has a gap in tweet collection from September 15, 2020 to September 23, 2020. This period, plus the seven days following, was omitted from the final aggregation.
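The thresholding step described above can be sketched as follows; the data layout (a list of daily tweet counts per county) and the function name are illustrative, not the authors' actual pipeline:

```python
# Sketch: keep only (county, day) pairs whose trailing 7-day tweet
# count reaches 30, per the filtering rule described above.
from collections import defaultdict

MIN_WEEKLY_TWEETS = 30

def filter_sparse_days(daily_counts):
    """daily_counts: {county: [tweets_day0, tweets_day1, ...]}.
    Returns {county: set of day indices kept}."""
    kept = defaultdict(set)
    for county, counts in daily_counts.items():
        for day in range(len(counts)):
            window = counts[max(0, day - 6): day + 1]  # trailing 7 days
            if len(window) == 7 and sum(window) >= MIN_WEEKLY_TWEETS:
                kept[county].add(day)
    return dict(kept)
```

Days without a full trailing week of data are dropped, matching the omission of the week following the September 2020 collection gap.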

Sentiment measures

Sentiment is generally measured on a scale from –1 to +1, where the sign describes the polarity (+, –) and the absolute value describes the magnitude of the sentiment. To compute the sentiment of Tweets for a (county, day) pair, we apply four common methods of calculating sentiment from text: TextBlob [32], VADER [33], Afinn [34], and an included calculation denoted as TBCOV Sentiment [36].
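To illustrate the idea shared by the lexicon-based tools below, a toy polarity scorer might look like this; the word list is hypothetical and far smaller than the lexicons the real tools ship with:

```python
# Toy lexicon-based sentiment scorer: average the scores of matched
# words and clip to [-1, +1]. Illustrative only, not any real tool.
TOY_LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "sick": -0.5}

def polarity(text):
    """Mean lexicon score of matched words; 0.0 when nothing matches."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scores:
        return 0.0
    return max(-1.0, min(1.0, sum(scores) / len(scores)))
```

The real tools add negation handling, intensifiers, and much larger vocabularies on top of this basic scheme.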

TextBlob [32] is a commonly used natural language processing module for Python. To estimate Sentiment Polarity, TextBlob utilizes a rule-based system for sentiment analysis, which involves a predefined list of words with associated sentiment polarities. For this list, we use the commonly used lexicon sourced from the Python Natural Language Toolkit (NLTK) database. TextBlob returns a score between –1 and +1 indicating both the polarity (positive or negative) and the strength of the sentiment [32].

Valence Aware Dictionary and sEntiment Reasoner (VADER) is the second Python module we use to calculate Tweet sentiment. It is a lexicon- and rule-based sentiment analyzer, also available through NLTK. It calculates sentiment polarity by matching words to sentiment intensity scores and then aggregating these scores into an overall score; this process is repeated for each Tweet in the database. VADER returns a number between –1 and +1 for polarity [33].

Afinn, the third sentiment-calculation Python module, is named after its developer, Finn Årup Nielsen, who created a dictionary of 2,477 words for calculating polarity scores. Afinn is a lexicon-based sentiment analyzer. It returns a number between –5 and +5 for polarity [34].

TBCOV Sentiment uses a multilingual transformer-based deep learning model called XLM-T, trained on millions of general-purpose tweets in eight different languages [36]. It returns a polarity score between –1 and +1, along with a confidence score. Fig 1a shows the sentiment polarity from TBCOV from February 1, 2020 to March 31, 2021. We present the results of our analysis using the TBCOV sentiment in this paper, with the results using the other sentiment measures in the Supplemental Materials.
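Note that Afinn reports polarity on a [–5, +5] scale while the other three tools use [–1, +1]; to compare the four measures on a common axis, a linear rescaling such as the following sketch can be applied (this normalization is our illustration, not a step performed by the tools themselves):

```python
# Map an AFINN-style score from [-5, +5] onto the [-1, +1] range used
# by the other sentiment measures (illustrative helper).
def rescale_afinn(score, old_max=5.0, new_max=1.0):
    return score * (new_max / old_max)
```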

COVID-19 case data

The number of new confirmed COVID-19 cases per county over time, also known as incident cases, is collected by the Centers for Disease Control and Prevention (CDC) and is published by USA Facts [37]. To smooth the case data, a 7-day rolling average is applied. To account for population differences, the daily case counts are normalized by population. Thus, when we refer to COVID-19 cases from this point on, we mean a 7-day rolling average of incident cases per 100,000 population. Fig 1b shows the daily COVID-19 cases from February 1, 2020 to March 31, 2021.
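The smoothing and normalization just described can be sketched as follows (variable names illustrative):

```python
# Normalize daily incident cases to a rate per 100,000 residents,
# then take a trailing 7-day average to smooth reporting artifacts.
def cases_per_100k(daily_cases, population):
    return [c / population * 100_000 for c in daily_cases]

def rolling_mean_7(series):
    """Trailing 7-day mean; the first six days use shorter windows."""
    out = []
    for i in range(len(series)):
        window = series[max(0, i - 6): i + 1]
        out.append(sum(window) / len(window))
    return out
```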

Sentiment-disease case correlation analysis

To examine the relationship between COVID-19 Tweet sentiment and official case data over space and time and to assess the extent to which Tweet-derived sentiment can be used for local COVID-19 surveillance in the United States, we first group the observed COVID-19-related tweets by calendar week and county and compute the average sentiment of tweets in each group. This grouping and aggregation provides us with one set of Tweets for each county r (among all counties R) and for each week w (among all weeks W). Deriving the sentiment of each such set provides us with a function S(w,r) that returns the sentiment of all tweets observed in region r during week w. In the following, we define measures to quantify the correlation, globally, spatially, and temporally, between this sentiment S(w,r) and the number of COVID-19 cases C(w,r) observed in the same region during the same week.
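The grouping and aggregation step, producing S(w, r) as a lookup keyed by (week, county), can be sketched as follows; the record format (week index, county identifier, polarity) is hypothetical:

```python
# Average tweet polarity per (week, county) pair, yielding the
# S(w, r) lookup used by the correlation measures below.
from collections import defaultdict

def sentiment_by_week_county(records):
    """records: iterable of (week, county, polarity) triples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for week, county, pol in records:
        sums[(week, county)] += pol
        counts[(week, county)] += 1
    return {key: sums[key] / counts[key] for key in sums}
```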

Global analysis. To measure the correlation between COVID-19 Tweet sentiment and COVID-19 cases observed over all weeks and all regions, we simply measure Pearson’s correlation across space and time, as follows:

ρ = [ Σ_{w∈W} Σ_{r∈R} (S(w,r) − S̄)(C(w,r) − C̄) ] / [ √(Σ_{w∈W} Σ_{r∈R} (S(w,r) − S̄)²) · √(Σ_{w∈W} Σ_{r∈R} (C(w,r) − C̄)²) ]   (1)

where S̄ denotes the average sentiment (across all weeks and counties) and C̄ denotes the average number of cases (across all weeks and counties). Equation 1 provides us with the sample correlation coefficient ρ (a single scalar) that measures the correlation between COVID-19 Tweet sentiment and COVID-19 cases. Intuitively, we would expect a negative correlation, as a high number of cases should (on average across all weeks and counties) yield a low (negative) sentiment towards COVID-19. However, our experimental evaluation shows that this intuition is not confirmed by the data. That is because an aggregation across all weeks and all counties overgeneralizes any interesting spatial and temporal patterns. To find such patterns, we next propose to measure the correlation between COVID-19 Tweet sentiment and COVID-19 disease cases both temporally locally (for a specific “frozen” week in time) and spatially locally (for a specific “frozen” county).
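Equation 1 amounts to a plain Pearson correlation over all (week, county) pairs present in both S and C; a self-contained sketch:

```python
# Pearson correlation over all (week, county) pairs shared by the
# sentiment lookup S and the case lookup C (Equation 1).
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def global_correlation(S, C):
    keys = sorted(set(S) & set(C))  # shared (week, county) pairs
    return pearson([S[k] for k in keys], [C[k] for k in keys])
```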

Temporal analysis. To understand whether and how the correlation between COVID-19 sentiment and COVID-19 cases changes over time, we define a by-week correlation ρ(w) over all spatial regions R, as follows:

ρ(w) = [ Σ_{r∈R} (S(w,r) − S̄(w))(C(w,r) − C̄(w)) ] / [ √(Σ_{r∈R} (S(w,r) − S̄(w))²) · √(Σ_{r∈R} (C(w,r) − C̄(w))²) ]   (2)

where S̄(w) is the average sentiment across all counties during week w and C̄(w) is the average number of cases across all counties during week w.

Equation 2 provides us with the sample correlation coefficient ρ(w) between sentiment and cases across all counties during week w. Intuitively, we would expect this correlation to be stationary over time, such that at any time a region having a higher number of cases should have a lower sentiment towards COVID-19. We will evaluate this hypothesis by computing ρ(w) for each week and plotting the resulting time series (with corresponding p-values and z-scores of the significance of the correlations) in our results section.

Spatial analysis. To understand whether the correlation between COVID-19 sentiment and COVID-19 cases is stationary over space, we compute the correlation across all weeks of our dataset for each spatial region r. This correlation is defined as:

ρ(r) = [ Σ_{w∈W} (S(w,r) − S̄(r))(C(w,r) − C̄(r)) ] / [ √(Σ_{w∈W} (S(w,r) − S̄(r))²) · √(Σ_{w∈W} (C(w,r) − C̄(r))²) ]   (3)

where S̄(r) is the average sentiment in region r over all weeks and C̄(r) is the average number of cases in region r over all weeks.

Equation 3 provides us with the sample correlation coefficient ρ(r) between sentiment and cases observed in region r during the entire COVID-19 pandemic. Intuitively, we would expect this correlation to be stationary over space, such that, for all regions, times having a higher number of cases should have a lower sentiment towards COVID-19. We will evaluate this hypothesis by computing ρ(r) for each region r and mapping the results across the United States in our results section.

Spatiotemporal analysis. In order to understand whether the correlation between COVID-19 sentiment and COVID-19 cases is stationary over space and time, we divide the entire temporal extent of the data (March 02, 2020 - March 31, 2021) into three temporal intervals: wave 1 (March 02-June 07, 2020), wave 2 (June 9–August 24, 2020), and wave 3 (August 26–December 30, 2020) and perform spatial analysis for each period.

Results

This section presents the results of our correlation analysis between COVID-19-related Tweet sentiment and COVID-19 cases globally (Equation 1), across time (Equation 2), and across space (Equation 3). All results shown in the following use the TBCOV Sentiment Polarity measure. The results for TextBlob, VADER, and AFINN Sentiment Polarity are similar and are presented in the Supplementary material.

Global correlation

Table 1 shows the global correlation between COVID-19 Tweet sentiment and COVID-19 cases using the four different sentiment measures TBCOV [30], TextBlob [32], VADER [33], and AFINN [34].

Surprisingly, we observe that the negative correlation we would intuitively expect holds for only three of the four sentiment measures. For VADER, we observe a positive correlation; for all others, a negative correlation. The magnitude of the correlation is weak in all cases. Despite the weak correlations, all are highly significantly different from zero (using a Student’s t-test) due to the very large number of (week, county) pairs.
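Why weak correlations can still be highly significant follows from the t-statistic for a sample correlation, t = r·√(n − 2)/√(1 − r²): with tens of thousands of (week, county) pairs, even |r| around 0.05 lies many standard errors from zero. A sketch (sample values are illustrative, not the paper's exact counts):

```python
# t-statistic for testing whether a sample correlation r over n pairs
# differs from zero: t = r * sqrt(n - 2) / sqrt(1 - r^2).
import math

def t_statistic(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
```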

To summarize our global correlation analysis, we observe some significant correlation, but the magnitude is weak and even the direction of the correlation differs between sentiment measures. This rather inconclusive result confirms observations made in prior work [28,29].

To gain a better understanding of the link between tweet sentiment and COVID-19 cases, we next evaluate the hypothesis that correlations are homogeneous in space and time. The following sections show that this hypothesis can be rejected at a high level of confidence, as correlations are significantly positive and strong in some regions during some time periods, but negative and strong in other regions during other periods.

Correlation over time

Based on Equation 2, Fig 2a presents the correlation coefficient between a 7-day moving average of TBCOV sentiment and a 7-day moving average of COVID-19 cases for all counties from January 20, 2020 to April 14, 2021. First, we observe that the hypothesis of temporal homogeneity is not supported by the data: different time intervals exhibit different magnitudes and directions of correlation between COVID-19 Tweet sentiment and COVID-19 cases. Fig 2b provides the p-value of each weekly correlation, with p-values close to zero on days where the correlation (over the last seven days) was significantly different from zero. We also provide the corresponding z-scores in Fig 2c, indicating by how many standard deviations the observed correlation coefficients deviate from zero.

Fig 2. (a) Daily correlation coefficient between sentiment measures and daily new cases per 100,000 in each county and (b) the corresponding p-values and (c) z-scores.

https://doi.org/10.1371/journal.pone.0325166.g002

Looking at Fig 2a more closely, we can see that four time intervals exhibit significantly different correlations. The first is a period of weakly significant (p-values <0.05 on most days) positive correlation between tweet sentiment and COVID-19 cases from March 2, 2020 to June 10, 2020: as COVID-19 cases increase, tweet sentiment increases. This unexpected positive correlation may be explained by the many counties that had zero cases at this time but already exhibited a low sentiment towards COVID-19. The strongest and most significant result can be observed in the interval from June 12, 2020 to August 15, 2020. During this period, the correlation between sentiment and COVID-19 cases abruptly shifts to a highly significant (p-values approaching zero), relatively strong (≈0.15 magnitude) negative correlation: as COVID-19 cases increase, tweet sentiment decreases. This short period aligns with our intuition that places with a high number of cases should have a negative sentiment towards COVID-19. We note that the p-values and z-scores shown in Fig 2 are computed independently for each day, and the probability of observing such low p-values over sixty days in a row is infinitesimal, indicating a highly significant result. Third, we observe another shift to a period of positive correlation from November 2, 2020 to December 6, 2020, similar to the first period. Finally, from December 6 onward, the correlation coefficients fluctuate around zero with p-values seemingly uniform in the interval [0,1], indicating no significant correlation after this time.

As the main takeaway from this analysis, we observed a time interval in which sentiment and cases were strongly and significantly negatively correlated, indicating that during this time, tweet sentiment could have been successfully used as an early warning indicator.

Correlation over space

Using Equation 3 to obtain a correlation measure for each county, Fig 3 presents the map of correlation coefficients, aggregated over all 424 days. For this experiment, we omitted any county with fewer than 200 days of data and any day with fewer than 30 Tweets across a 7-day period. After omitting counties with insufficient data, we retained 1,442 counties out of the 3,025 counties that have at least one tweet in the dataset. While correlations aggregated to the entire United States were rather weak, as seen in our global correlation study, we first observe that some counties indeed exhibit strong correlations, ranging from –0.76 to 0.51. We also observe that these correlations exhibit spatial clustering, with many counties in the Northeastern US having either a strong negative correlation (red colour in Fig 3a) or a medium negative correlation (orange colour). Counties having positive correlations (light and dark blue in Fig 3a) are less frequent, which is expected as the average correlation for TBCOV is –0.055 as shown in Table 1, but they appear more frequently in the West and Midwest. As these spatial trends may not be immediately evident from Fig 3a, we applied Anselin’s test for local spatial autocorrelation to the correlation values, obtaining the map of local clusters shown in Fig 3b. As expected, we observe a large cluster of low correlation values in the Northeast (although with many outliers of relatively high values within this low cluster). We see a large cluster of high values (indicating positive correlation) in the West and Midwest, with parts of California excluded from this cluster as not significant.
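The cluster/outlier logic of Anselin's local Moran's I can be sketched in simplified form; the paper's analysis uses actual county adjacency and significance testing (typically via permutation), which this toy version omits, and the neighbour lists here are hypothetical:

```python
# Simplified local Moran's I: I_i is positive when county i and its
# neighbours deviate from the mean in the same direction (a cluster),
# negative when i is an outlier among its neighbours.
def local_morans_i(values, neighbours):
    """values: {county: x}; neighbours: {county: [adjacent counties]}."""
    n = len(values)
    mean = sum(values.values()) / n
    z = {c: v - mean for c, v in values.items()}
    m2 = sum(d * d for d in z.values()) / n  # second moment of deviations
    result = {}
    for c in values:
        nbrs = neighbours.get(c, [])
        # row-standardized spatial lag: mean deviation of the neighbours
        lag = sum(z[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        result[c] = (z[c] / m2) * lag
    return result
```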

Fig 3. (a) Correlation coefficients between TBCOV sentiment and COVID-19 cases for all counties. (b) Anselin local Moran’s I cluster and outlier analysis of TBCOV correlation coefficients. An interactive version of this map can be found at

https://mygmu.maps.arcgis.com/apps/mapviewer/index.html?webmap=e73b409615824f9f9e019623b42ae664.

https://doi.org/10.1371/journal.pone.0325166.g003

The main result of the spatial analysis is that there are regions within the United States for which there exists a strong correlation (either negative or positive) between tweet sentiment and COVID-19 cases. This shows that in these regions, tweet sentiment could have been successfully used as an early warning indicator.

Spatiotemporal analysis

The previous sections have shown that, despite the weak global correlation between tweet sentiment towards COVID-19 and actual COVID-19 cases across all weeks and regions, there are:

  • some periods of time when the correlation between tweet sentiment and cases is relatively strong and highly significant, and
  • some regions within the United States in which the correlation between tweet sentiment towards COVID-19 and actual COVID-19 cases is relatively strong (either positive or negative).

These observations raise the following question: Are there any combinations of time and region when and where these correlations are particularly strong? And could we use this knowledge for the prevention of future epidemics? This section answers these questions by investigating the spatiotemporal correlation between sentiment and COVID-19 cases during distinct periods of the pandemic. Therefore, we select three specific time intervals corresponding to the first three waves of COVID-19 cases as observed in Fig 1: (1) the first wave from March 02, 2020 to June 07, 2020; (2) the second wave from June 09, 2020 to August 24, 2020; and (3) the third wave from August 26, 2020 to December 30, 2020. For each of these waves, we repeat the spatial analysis presented in the previous section. We report our results in Fig 4, where each map corresponds to a distinct period. For comparison, Fig 4a shows the results of the spatial analysis over the entire time period, which is identical to Fig 3a. The first wave from March 02, 2020 to June 07, 2020 shows an overall more positive correlation than the other periods; the spatial distribution reflects a predominantly positive relationship across regions. In contrast, the second period from June 09, 2020 to August 24, 2020 exhibits a noticeable shift towards a negative correlation between sentiment and cases. In the third period from August 26, 2020 to December 30, 2020, the overall correlation reverts to a positive trend, resembling the first period. Notably, the areas around New York and Philadelphia exhibit a distinct negative correlation. We can visually observe that the correlations in specific time periods (Fig 4b–4d) are much stronger than the correlations over the entire study period (Fig 4a).

Fig 4. TBCOV correlation coefficients visualized spatially for three different time periods. An interactive version of this map can be found at https://mygmu.maps.arcgis.com/apps/mapviewer/index.html?webmap=bc1qs8x6caxpkns9szm37d8csd6ej4nr67qx3hk732.

https://doi.org/10.1371/journal.pone.0325166.g004

The main results of the spatiotemporal analysis and of this work is that there are certain times when certain regions exhibit a very strong correlation (either negative or positive) between tweet sentiment and COVID-19 cases. If we could understand and predict when and where these strong correlations occur, we could leverage this understanding to use tweet sentiment as an early indicator for cases.

Explanation of correlations

Towards understanding when and where these correlations occur, in this section, we investigate the links between the observed correlations as seen in Fig 4 with population variables related to population density, socioeconomic differences, education, and medical resources. Table 2 shows the result of a regression analysis using the following population variables that we obtained from the County Health Rankings and Roadmaps Dataset [38]:

  • Percentage of Population without a High School Degree,
  • The number of primary care physicians per population,
  • The percentage of population unemployed,
  • The percentage of population of age 65 or older,
  • The percentage of population living in a census-defined rural area.
  • The population density, defined as population per square kilometre.

We perform this regression analysis separately for each of the four time-periods used in Fig 4:

  • The full duration including the first three COVID-19 waves from March 02, 2020 to March 31, 2021.
  • The duration of the first wave from March 02 to June 07, 2020.
  • The duration of the second wave from June 09 to August 24, 2020, and
  • The duration of the third wave from August 26 to December 30, 2020.
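As a simplified sketch of the regression idea, using hypothetical data and only one of the six explanatory variables (percent rural), an ordinary-least-squares fit of the per-county case-sentiment correlation could look like this; the paper's analysis is a multiple regression over all six variables:

```python
# One-variable OLS: fit y = slope * x + intercept, where x is, e.g.,
# percent rural population and y the county's case-sentiment
# correlation. Illustrative single-covariate version only.
def ols_slope_intercept(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx
```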

First, we observe that, over the entire duration, these population variables do not explain the correlations between COVID-19 cases and Twitter sentiment well, as shown by the low adjusted coefficient of determination of 0.002 for the entire duration. For the full duration, none of the above population variables are significantly correlated with the case-sentiment correlations. This result is consistent with the temporal analysis in Fig 2b: the sign of the case-sentiment correlations changes over time from positive (Wave 1), to negative (Wave 2), back to positive (Wave 3), leaving no significant correlation over the entire period.

The regression results, however, are significant during the individual waves. During the first wave (March 02 to June 07, 2020), the adjusted R² increases and we observe a highly significant negative association (p-value <0.0005) between the share of rural population and the case-sentiment correlation. Therefore, as the variable “percent of the population living in rural areas” of a county increases, the correlation between COVID-19 cases and sentiment becomes more negative, which is expected. We see a similar association during the third wave from August 26 to December 30, 2020, albeit not quite as significant, at a p-value of 0.005. Interestingly, during the second wave, the explanatory variable “percent of the population living in rural areas” remains highly significant, at a p-value <0.0005, but the sign of this association is reversed: instead of a highly significant negative association, we observe a highly significant positive one. Thus, during this time, rural areas are more likely to have a high tweet sentiment when having a higher number of COVID-19 cases, which is not the expected behaviour. Looking at the other population variables, we observe no significant link for any of them.

Towards understanding where a strong correlation between COVID-19 cases and tweet sentiment can be observed, this regression analysis shows a significant link to rural areas. However, why the direction of this significant correlation changed in rural (and, conversely, non-rural) areas remains an open question. If we could explain and predict this correlation for future pandemics, we could possibly use sentiment as an early-warning system for cases.

Discussion and conclusions

Globally, across all counties of the United States and across the entire 16-month study period, we are not able to find a significant link between tweet sentiment and COVID-19 cases. Thus, we extended our analysis along the spatial and temporal dimensions to investigate the stationarity of this relationship and whether variations in the magnitude and direction of the correlation could be found at certain times or in certain locations. Our temporal analysis showed significant temporal patterns, including an interval of strong and significant negative correlation in the summer of 2020. We also observe periods of strong and significant positive correlation (contrary to the expectation that high case numbers would align with lower sentiment). In both cases, the strong positive and negative correlations indicate that sentiment could be used as an indicator of COVID-19 cases.
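The temporal analysis above amounts to computing a Pearson correlation between weekly case counts and weekly mean sentiment over sliding time windows. The sketch below illustrates the idea on synthetic weekly series for a single hypothetical county; `windowed_correlation` is an illustrative helper, not the study's code, and the assumed negative case-sentiment link is baked into the fake data.

```python
import numpy as np

def windowed_correlation(cases, sentiment, window):
    """Pearson r between weekly cases and sentiment over each sliding window."""
    rs = []
    for start in range(len(cases) - window + 1):
        c = cases[start:start + window]
        s = sentiment[start:start + window]
        rs.append(np.corrcoef(c, s)[0, 1])
    return np.array(rs)

# Hypothetical weekly series for one county (values are illustrative only).
weeks = 30
rng = np.random.default_rng(1)
cases = rng.poisson(50, weeks).astype(float)
sentiment = -0.01 * cases + rng.normal(0, 0.05, weeks)  # assumed negative link

rs = windowed_correlation(cases, sentiment, window=8)
print(rs)  # one correlation per 8-week window
```

Tracking how these per-window correlations drift over the study period is what reveals the sign changes between waves; a real analysis would repeat this per county to expose the spatial non-stationarity as well.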

In addition to the temporal analysis, we also performed a spatial analysis to understand whether different counties across the United States may also exhibit non-stationarity. We observed that areas in the Northeast of the U.S. exhibit a stronger negative correlation, while the West exhibits a stronger positive correlation. This result is also interesting, as it shows that some regions of the U.S. may allow more accurate forecasting of infectious disease cases based on Twitter sentiment than others.

Finally, we performed a spatiotemporal analysis, examining three distinct portions of our study period. In particular, the northeastern United States oscillates between positive and negative correlation throughout the study period. These results are quite interesting: they indicate that if the nature of the correlation—whether positive, negative, strong, weak, spatial, or temporal—can be understood during an infectious disease outbreak, tweet sentiment might be effectively used as a localized indicator.

For example, we find that in the first wave of the COVID-19 pandemic, rural areas have a stronger negative correlation between COVID-19 cases and sentiment (as expected). Conversely, we might assume that urban areas will thus have a stronger positive correlation between COVID-19 cases and sentiment. This may mean that during the first wave of the COVID-19 pandemic, positive sentiment in urban areas and negative tweet sentiment in rural areas could be signals for higher COVID-19 cases. While we were able to gain some insight into underlying demographic factors that may explain the spatial patterns in our analysis, the models have a low R-squared, meaning that further investigation using other variables that could explain these patterns is needed. The observed spatial and temporal variability (and likely differences across spatial scales) may explain why a scientific consensus about the relationship between sentiment and COVID-19 disease prevalence has not been reached.

We note that social media data from platform X is not without its limitations. One issue is the inherent bias and lack of representativeness in social media data like platform X. While this bias has not been quantified in the TBCOV dataset, other researchers have attempted to understand such bias in platform X data as a whole. For example, Mislove et al. [39] find that platform X users with geolocated Tweets represent a small fraction of the population (1.15%) and that populous counties are overrepresented. They also find a bias towards male users and that over/undersampling of different races of users is a function of location. This could affect our interpretation of the results, which may not reflect the relationship between sentiment and COVID-19 cases in a way that is representative of the broader population. In addition to the bias in the user base, high-profile events throughout the COVID-19 pandemic (e.g. major changes in policy, new public health interventions) as well as media-driven narratives may disproportionately shape sentiment in the TBCOV dataset, potentially diverging from broader public sentiment and contributing to the observed spatial and temporal variation. Finally, misinformation, fake news, and partisanship on platform X may also explain why social media, specifically sentiment, may fall short as a consistent indicator for COVID-19 outbreaks [40,41]. For example, platform X sentiment has been found to correlate with public opinion about how the pandemic is being managed [42,43] and may not fully reflect case numbers or disease severity. The two are difficult to disentangle. Given the limitations of both case data from official sources and social media data, a more comprehensive disease detection surveillance system would leverage both.

Factors such as regional cultural norms, socio-political climates, and local pandemic responses could further shape the variability in the correlation between sentiment and COVID-19 cases. Further research is needed to understand both temporal and spatial processes that cause the correlations between Tweet sentiment and infectious diseases to shift across space and time. If we could understand these processes, we might be able to better identify when and where Tweet sentiment could be used as a signal for early disease outbreak warning.

References

  1. Bertsimas D, Boussioux L, Cory-Wright R, Delarue A, Digalakis V, Jacquillat A, et al. From predictions to prescriptions: a data-driven response to COVID-19. Health Care Manag Sci. 2021;24(2):253–72. pmid:33590417
  2. Elarde J, Kim J-S, Kavak H, Züfle A, Anderson T. Change of human mobility during COVID-19: a United States case study. PLoS One. 2021;16(11):e0259031. pmid:34727103
  3. Maleki M, Bahrami M, Menendez M, Balsa-Barreiro J. Social behavior and COVID-19: analysis of the social factors behind compliance with interventions across the United States. Int J Environ Res Public Health. 2022;19(23):15716. pmid:36497805
  4. CDC. Cases, data, and surveillance. Available from: https://www.cdc.gov/coronavirus/2019-ncov/covid-data/about-us-cases-deaths.html. Accessed January 17, 2025.
  5. Griffith GJ, Morris TT, Tudball MJ, Herbert A, Mancano G, Pike L, et al. Collider bias undermines our understanding of COVID-19 disease risk and severity. Nat Commun. 2020;11(1):5749. pmid:33184277
  6. Angelopoulos AN, Pathak R, Varma R, Jordan MI. On identifying and mitigating bias in the estimation of the COVID-19 case fatality rate. Harv Data Sci Rev. 2020.
  7. Carballada AM, Balsa-Barreiro J. Geospatial analysis and mapping strategies for fine-grained and detailed COVID-19 data with GIS. ISPRS Int J Geo-Inf. 2021;10(9):602.
  8. Rader B, Astley CM, Sy KTL, Sewalk K, Hswen Y, Brownstein JS, et al. Geographic access to United States SARS-CoV-2 testing sites highlights healthcare disparities and may bias transmission estimates. J Travel Med. 2020;27(7):taaa076. pmid:32412064
  9. Jin R. The lag between daily reported COVID-19 cases and deaths and its relationship to age. J Public Health Res. 2021;10(3):2049. pmid:33709641
  10. Preis T, Moat HS, Bishop SR, Treleaven P, Stanley HE. Quantifying the digital traces of hurricane Sandy on Flickr. Sci Rep. 2013;3:3141. pmid:24189490
  11. Balsa-Barreiro J, Menendez M, Morales AJ. Scale, context, and heterogeneity: the complexity of the social space. Sci Rep. 2022;12(1):9037. pmid:35641578
  12. Kryvasheyeu Y, Chen H, Obradovich N, Moro E, Van Hentenryck P, Fowler J, et al. Rapid assessment of disaster damage using social media activity. Sci Adv. 2016;2(3):e1500779. pmid:27034978
  13. Nagel AC, Tsou M-H, Spitzberg BH, An L, Gawron JM, Gupta DK, et al. The complex relationship of realspace events and messages in cyberspace: case study of influenza and pertussis using tweets. J Med Internet Res. 2013;15(10):e237. pmid:24158773
  14. Aiello AE, Renson A, Zivich PN. Social media- and internet-based disease surveillance for public health. Annu Rev Public Health. 2020;41:101–18. pmid:31905322
  15. Kpozehouen EB, Chen X, Zhu M, Macintyre CR. Using open-source intelligence to detect early signals of COVID-19 in China: descriptive study. JMIR Public Health Surveill. 2020;6(3):e18939. pmid:32598290
  16. Nair S, Moa A, Macintyre R. Investigation of early epidemiological signals of COVID-19 in India using outbreak surveillance data. Glob Biosec. 2020;2(1).
  17. Brownstein JS, Freifeld CC, Madoff LC. Digital disease detection--harnessing the Web for public health surveillance. N Engl J Med. 2009;360(21):2153–5, 2157. pmid:19423867
  18. Nagar R, Yuan Q, Freifeld CC, Santillana M, Nojima A, Chunara R, et al. A case study of the New York City 2012-2013 influenza season with daily geocoded Twitter data from temporal and spatiotemporal perspectives. J Med Internet Res. 2014;16(10):e236. pmid:25331122
  19. Broniatowski DA, Paul MJ, Dredze M. National and local influenza surveillance through Twitter: an analysis of the 2012–2013 influenza epidemic. PLoS One. 2013;8(12):e83672. pmid:24349542
  20. Lee K, Agrawal A, Choudhary A. Real-time disease surveillance using twitter data: demonstration on flu and cancer. In: KDD '13: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM Press; 2013, pp. 1474–7.
  21. Achrekar H, Gandhe A, Lazarus R, Yu S, Liu B. Predicting flu trends using twitter data. In: 2011 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS). IEEE; 2011, pp. 702–7.
  22. Szomszor M, Kostkova P, Quincey E. Swineflu: Twitter predicts swine flu outbreak in 2009. In: International Conference on Electronic Healthcare. Springer; 2010, pp. 18–26.
  23. Gomide J, Veloso A, Meira W Jr, Almeida V, Benevenuto F, Ferraz F, et al. Dengue surveillance based on a computational model of spatio-temporal locality of Twitter. In: Proceedings of the 3rd International Web Science Conference. ACM Press; 2011, pp. 1–8. https://doi.org/10.1145/2527031.2527049
  24. Carchiolo V, Longheu A, Malgeri M. Using twitter data and sentiment analysis to study diseases dynamics. In: International Conference on Information Technology in Bio- and Medical Informatics. Cham: Springer; 2015, pp. 16–24.
  25. Byrd K, Mansurov A, Baysal O. Mining Twitter data for influenza detection and surveillance. In: Proceedings of the International Workshop on Software Engineering in Healthcare Systems. 2016, pp. 43–9.
  26. Aramaki E, Maskawa S, Morita M. Twitter catches the flu: detecting influenza epidemics using Twitter. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics; 2011, pp. 1568–76.
  27. Doan S, Ohno-Machado L, Collier N. Enhancing Twitter data analysis with simple semantic filtering: example in tracking influenza-like illnesses. In: 2012 IEEE 2nd International Conference on Healthcare Informatics, Imaging and Systems Biology. IEEE; 2012, pp. 62–71.
  28. Valdez D, Ten Thij M, Bathina K, Rutter LA, Bollen J. Social media insights into US mental health during the COVID-19 pandemic: longitudinal analysis of Twitter data. J Med Internet Res. 2020;22(12):e21418. pmid:33284783
  29. Feng S, Kirkley A. Integrating online and offline data for crisis management: online geolocalized emotion, policy response, and local mobility during the COVID crisis. Sci Rep. 2021;11(1):8514. pmid:33875749
  30. Imran M, Qazi U, Ofli F. TBCOV: two billion multilingual COVID-19 Tweets with sentiment, entity, geo, and gender labels. Data. 2022;7(1):8.
  31. Liu B. Sentiment analysis and subjectivity. In: Indurkhya N, Damerau FJ, editors. Handbook of natural language processing. New York: CRC Press; 2010, pp. 627–66.
  32. Loria S. Textblob: simplified text processing — textblob 0.16.0 documentation. Available from: https://textblob.readthedocs.io/en/dev/
  33. Hutto C, Gilbert E. VADER: a parsimonious rule-based model for sentiment analysis of social media text. ICWSM. 2014;8(1):216–25.
  34. Nielsen FÅ. AFINN sentiment analysis in Python. Available from: https://github.com/fnielsen/afinn. Accessed January 17, 2025.
  35. Teutsch SM, Churchill RE. Principles and practice of public health surveillance. USA: Oxford University Press; 2000.
  36. Qazi U, Imran M, Ofli F. GeoCoV19: a dataset of hundreds of millions of multilingual COVID-19 tweets with location information. SIGSPATIAL Special. 2020;12(1):6–15.
  37. USAFacts. US COVID-19 cases and deaths by state. Available from: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/. Accessed January 17, 2025.
  38. University of Wisconsin Population Health Institute. County health rankings and roadmap. Available from: https://www.countyhealthrankings.org/.
  39. Mislove A, Lehmann S, Ahn Y-Y, Onnela J-P, Rosenquist J. Understanding the demographics of Twitter users. ICWSM. 2021;5(1):554–7.
  40. Spiteri J. Media bias exposure and the incidence of COVID-19 in the USA. BMJ Glob Health. 2021;6(9):e006798. pmid:34518207
  41. Castioni P, Andrighetto G, Gallotti R, Polizzi E, De Domenico M. The voice of few, the opinions of many: evidence of social biases in Twitter COVID-19 fake news sharing. R Soc Open Sci. 2022;9(10):220716. pmid:36303937
  42. Naseem U, Razzak I, Khushi M, Eklund PW, Kim J. COVIDSenti: a large-scale benchmark Twitter data set for COVID-19 sentiment analysis. IEEE Trans Comput Soc Syst. 2021;8(4):1003–15. pmid:35783149
  43. Tsai MH, Wang Y. Analyzing Twitter data to evaluate people’s attitudes towards public health policies and events in the era of COVID-19. Int J Environ Res Public Health. 2021;18(12):6272. pmid:34200576