Predicting zip code-level vaccine hesitancy in US Metropolitan Areas using machine learning models on public tweets

Although the recent rise and uptake of COVID-19 vaccines in the United States has been encouraging, there continues to be significant vaccine hesitancy in various geographic and demographic clusters of the adult population. Surveys, such as the one conducted by Gallup over the past year, can be useful in determining vaccine hesitancy, but can be expensive to conduct and do not provide real-time data. At the same time, the advent of social media suggests that it may be possible to get vaccine hesitancy signals at an aggregate level, such as at the level of zip codes. Theoretically, machine learning models can be trained using socioeconomic (and other) features from publicly available sources. Experimentally, it remains an open question whether such an endeavor is feasible, and how it would compare to non-adaptive baselines. In this article, we present a proper methodology and experimental study for addressing this question. We use publicly available Twitter data collected over the previous year. Our goal is not to devise novel machine learning algorithms, but to rigorously evaluate and compare established models. Here we show that the best models significantly outperform non-learning baselines. They can also be set up using open-source tools and software.

Early word embedding algorithms were already capable of analogical reasoning e.g., the vector obtained from the operation King − Man + Woman was found to be closest to the vector for Queen. Impressively, modern variants, proposed in the last 5 years, are now capable of embedding sentences, including tweets, enabling robust machine learning algorithms to be built without manually intensive feature engineering [12,15].
This article considers the problem of predicting vaccine hesitancy in the United States, using public social media data from Twitter. We focus our study on major metropolitan areas which are known for high tweeting activity, and where users tend to enable the location facility on their phone compared to more rural milieus [16]. We are not looking to predict vaccine hesitancy at an individual level, both due to privacy concerns and also due to problems with accurately evaluating such predictions without polling the individual. Instead, we seek to develop systems that predict vaccine hesitancy at the zip code-level.
Specifically, our proposed methods rely on extracting vaccine hesitancy signals from the text of public, geolocated tweets. They do not identify or isolate user data of any kind. An advantage of making predictions at the zip code-level is that predictions can be validated using independent survey data, such as the Gallup poll mentioned earlier. As detailed subsequently, by averaging responses of individuals within a given zip code-demarcated region, we are able to obtain a real-valued vaccine hesitancy estimate for that zip code.
We define and discuss the evaluation of such estimates in the Materials and Methods section. Equally important when evaluating such systems is the choice of baselines used for comparisons. In the absence of models that rely on machine learning and social media, a feasible choice might be a system that just predicts a constant-valued vaccine hesitancy estimate, sometimes derived from theoretical models. For instance, the baseline may declare a vaccine hesitancy of 0.5 or 1.0 in a given region. A more sophisticated option is to report the constant representing the average observed in the survey data. Considering such methods in our feasible set of baselines, we show that our proposed machine learning-based models outperform them. The best machine learning model is found to achieve a 10 percent relative improvement over the best constant-valued baseline, which itself relies on privileged information, i.e., the mean vaccine hesitancy observed in the survey.
Our models are practical and guided by real-world intuitions. We not only consider the text and hashtags directly observed in geolocated tweets, but also consider the use of NLP software for extracting sentiment signals from the text. Additionally, we explore the use of features from external data sources not grounded in social media, such as the number of hospitals or scientific establishments in a zip code. We experimentally investigate the extent to which the use of these independent sets of features helps in improving the model. In other words, rather than propose a single winning model, we compare a range of models and features-sets to better understand performance differences and tradeoffs.
The rest of this article is structured as follows. We proceed with a comprehensive description of the Materials and Methods used in our study. We detail the Twitter dataset and its collection, and subsequent steps, such as data preprocessing and feature extraction. We also discuss the vaccine hesitancy ground-truth that we obtained from independent Gallup survey data. We summarize the evaluation methodology and metrics, and enumerate the models and baselines being evaluated. To enable maximal replicability and minimize cost, we implement our methods using open-source packages and public data. Next, our experimental findings are detailed in Results, including statistical significance analyses. A qualitative Discussion and Error Analysis section follows. We conclude the work with a summary and brief primer on promising future avenues for research.

Twitter dataset
We sample tweets related to the COVID-19 pandemic from the nine most populous metropolitan areas in the United States [17]. In decreasing order of population size, these are: New York, Los Angeles, Chicago, Houston, Phoenix, Philadelphia, San Antonio, San Diego, and Dallas. Our sampled tweets are a subset of the GeoCOV19Tweets dataset [18]. The GeoCOV19Tweets project collected geo-tagged tweets related to the COVID-19 pandemic on a daily basis, using a set of manually determined COVID-specific keywords and hashtags. The project also published a sentiment score for each tweet.
In keeping with Twitter's terms and conditions, only the tweet ID and sentiment scores were published online. In previous work [19], we hydrated, or directly retrieved from Twitter, tweets from the GeoCOV19Tweets dataset dated from March 20, 2020 through December 1, 2020. This period spans a total of 255 days. We skipped the period from October 27, 2020 through October 28, 2020 because sentiment scores were not available in GeoCOV19Tweets during that span.
Next, as discussed in [19], we processed each hydrated tweet object, which is a data structure described extensively in Twitter's developer documentation [20]. Specifically, we extracted a coordinates object from this data structure to derive a precise location for the tweet. These coordinates were then used to filter the tweets by metropolitan area, by checking if the coordinates fell within a manually-drawn bounding box demarcating each of the metropolitan areas listed earlier.
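The bounding-box check itself is simple point-in-rectangle logic. A minimal sketch follows, with illustrative box coordinates that are placeholders, not the manually-drawn boxes used in the study:

```python
# Illustrative bounding boxes only; the study's manually-drawn boxes differ.
# Each box is (min_lon, min_lat, max_lon, max_lat).
BOUNDING_BOXES = {
    "New York": (-74.3, 40.5, -73.7, 41.0),
    "Chicago": (-88.0, 41.6, -87.5, 42.1),
}

def metro_for_coordinates(lon, lat):
    """Return the metropolitan area whose bounding box contains the point,
    or None if the point falls outside every box."""
    for metro, (min_lon, min_lat, max_lon, max_lat) in BOUNDING_BOXES.items():
        if min_lon <= lon <= max_lon and min_lat <= lat <= max_lat:
            return metro
    return None
```

Tweets whose coordinates fall outside every box are simply discarded from the sample.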
In this study, we re-hydrate this collection of tweets using the twarc library to save the tweet's full text and tweet ID [21]. After removing any archived tweets, as well as tweets for which the coordinates object is no longer available, we retained a total of 45,899 tweets. We also collect each tweet's zip code of origin by using an Application Programming Interface (API), provided by Geocodio [22]. Founded in 2014, Geocodio provides human-readable location information, such as state, city and country, given a pair of latitude-longitude coordinates as input.
We also eliminate all zip codes with fewer than 10 tweets, resulting in 4,799 tweets and 1,321 zip codes being removed. We then merge the data with the zip code-level attributes described subsequently in Features from External Sources. We remove rows with null values, leaving a total of 29,458 tweets, each of which belongs to one of 493 unique zip codes across the nine metropolitan areas listed above. We note that none of the 29,458 tweets is a retweet, allowing for each sample to be treated independently. In Table 1 below, we summarize key statistics of the data, including the number of hashtags in the data both before and after the text preprocessing steps detailed in the next section.
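The minimum-tweet filter described above can be sketched in a few lines of pandas; the `zip_code` column name is our own placeholder for whatever schema the underlying dataset uses:

```python
import pandas as pd

def drop_sparse_zip_codes(tweets: pd.DataFrame, min_tweets: int = 10) -> pd.DataFrame:
    """Remove all rows belonging to zip codes with fewer than min_tweets tweets."""
    counts = tweets["zip_code"].value_counts()
    keep = counts[counts >= min_tweets].index
    return tweets[tweets["zip_code"].isin(keep)].copy()
```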

Text preprocessing
Using a hydrated tweet's full text, we tokenize, lowercase, and remove mentions using TweetTokenizer [23] from the Natural Language Toolkit (NLTK) package [24]. NLTK is a leading Python package in the NLP community for working with human language data. We also remove URLs, stop words, tokens less than or equal to 1 character in length, and any characters other than letters, including the # symbol and emojis. We use NLTK's standard set of English stop words [25] e.g., the, a, and so on. However, we retain the words not, no, nor, very, and most from this pre-determined set, as these are hypothesized to be relevant for making more accurate vaccine hesitancy predictions.
We then lemmatize all tokens using WordNetLemmatizer [26]. A consequence of our text preprocessing steps is that hashtags, such as "covid19", "covid", "Covid19", and "covid-19", for example, all result in the same token. Furthermore, hashtags consisting of numbers, or single characters only, such as "#2020" or "#K", are eliminated. In Table 1, the count of hashtags in the tweets, before text preprocessing, is computed by summing the occurrences of # in the full text. After text preprocessing, when the hashtags and text are well-separated and more easily analyzed, we count the number of times a token begins with the # symbol.
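The preprocessing pipeline can be sketched with NLTK's TweetTokenizer. To keep the example self-contained, a short hardcoded stop word list stands in for NLTK's full English list (which, as noted, retains not, no, nor, very, and most), and lemmatization is omitted:

```python
import re
from nltk.tokenize import TweetTokenizer

# Short stand-in list; the study uses NLTK's full English stop word set,
# retaining "not", "no", "nor", "very", and "most".
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for"}

# preserve_case=False lowercases; strip_handles=True drops @mentions.
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)

def preprocess(full_text: str) -> list:
    tokens = tokenizer.tokenize(full_text)
    cleaned = []
    for tok in tokens:
        if tok.startswith("http"):            # drop URLs
            continue
        tok = re.sub(r"[^a-z]", "", tok)      # letters only: drops '#', digits, emojis
        if len(tok) <= 1 or tok in STOP_WORDS:
            continue
        cleaned.append(tok)
    return cleaned
```

Note how "#covid19", "#Covid19", and similar hashtags all collapse to the same token, "covid", once non-letter characters are stripped.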
Note that we avoid directly using the hashtags object embedded within the tweet object for several reasons [27]. First, the object appears to already have applied certain filters e.g., numbers-only strings, such as #2020, are eliminated. Although our text preprocessing steps do so as well, as mentioned above, the hashtags object does not accurately represent the number of hashtags in the original tweet. One reason is that it fails to accurately count hashtags in a continuous string. For example, in a tweet from New York containing ". . .#corona#coronavirus#quarantine#quarantinelife#washyourhands. . .", the hashtags object was found to be an empty array. Therefore, to carefully control text preprocessing and feature extraction in a replicable and reliable manner, we exclusively use and count hashtags by processing the full text field.
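The two hashtag counts used in Table 1 can be sketched directly from the full text, as described above:

```python
from nltk.tokenize import TweetTokenizer

def hashtag_counts(full_text: str):
    """Return (count before preprocessing, count after tokenization).
    Before: raw '#' occurrences in the full text.
    After: tokens beginning with '#'; the tokenizer separates run-together
    hashtags that the tweet object's hashtags field can miss."""
    before = full_text.count("#")
    tokens = TweetTokenizer().tokenize(full_text)
    after = sum(1 for tok in tokens if tok.startswith("#"))
    return before, after
```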
Next, the processed tweets are embedded using the fastText word embedding model, which was released by Facebook AI research and contains word vectors trained on English Wikipedia data [28]. The training methodology and specific parameterization is detailed in Predictive Models and Features. Herein, we note that a word embedding model is typically a neural network that learns a representation, or embedding, of each word in an input text corpus. A classic example of such a word embedding model is word2vec, published almost a decade ago [13]. The embedding is a dense, real-valued vector with a few hundred (or fewer, in some cases) dimensions. The number of dimensions is much lower than the vocabulary of the corpus, which can be in the tens, if not hundreds, of thousands of unique words. The neural networks underlying these models automatically learn the embeddings by statistically parsing large quantities of text. The idea is that words that are semantically similar will be placed closer together in the vector space.
The fastText model, used in this article, extends and improves the word2vec model by embedding misspelled, or unusually spelled, words, even if it never encountered the specific misspelling during training. This is an obvious benefit when embedding social media text. The model accomplishes this by learning the fine-grained statistical associations between characters in the words, rather than directly learning an embedding for each word. As the name suggests, the model is also optimized to run quickly. It can be used to embed a full sentence or tweet in the vector space, rather than just a word [28]. While an imperfect representation of the tweet's meaning, we show subsequently that the embedding still contains enough signal that our regression-based models are able to use it to predict vaccine hesitancy within a reasonable margin of error.

Features from external sources
In the Twitter Dataset section, we noted a total of 493 unique zip codes that resulted from including only tweets for which we were able to determine the originating zip code. For each unique zip code, we also collected additional zip code-level information from external, publicly available data sources. These zip code-level attributes, which we add as features in our predictive models, comprise the Zillow Home Value Index (ZHVI) [29], as well as the numbers of establishments in the educational, healthcare, and professional, scientific, or technical sectors. We incorporate these features as expected, albeit approximate, proxies for measuring affluence and resource availability within a zip code. As noted earlier, the sentiment features are obtained at the finer granularity of tweets, and were made directly available by the GeoCOV19Tweets project underlying our data [18]. It is important to emphasize that, while each tweet has its own sentiment score, tweets sharing a zip code also share the zip code-level attributes noted above i.e., the zip code-level attributes are repeated for all tweets belonging to the same zip code. Table 2 summarizes each zip code-level attribute, along with the sentiment score. Detailed descriptions of these features are provided in the next section.
Sentiment score. We retain the original sentiment scores included in the GeoCOV19Tweets dataset [18], generated using the TextBlob sentiment analysis tool [30]. In this dataset, every tweet is given a continuous score in the range [-1, 1], where positive values signify positive sentiment, negative values signify negative sentiment, and 0 signifies neutral sentiment. The more positive or negative the value, the stronger the sentiment. Prior to computing these sentiment scores, hashtag symbols (#), mention symbols (@), URLs, extra spaces, and paragraph breaks were eliminated. Punctuation, emojis, and numbers were retained.
Zillow Home Value Index (ZHVI). The Zillow Home Value Index (ZHVI) is a measure of the typical home value for a region; in this case, a zip code. It captures monthly changes in Zestimates [31], which are Zillow's estimated home market values. These estimates incorporate house characteristics; market data, such as listing prices of comparable homes and their time on the market; and off-market data, including tax assessments and public records. The index also incorporates market appreciation. In this study, we take the average of the smoothed, seasonally adjusted value in the 35th to 65th percentile range (mid-tier) from January through December 2020.

Table 2. External features collected from publicly available data sources. With the exception of the sentiment score, all features are computed as zip code-level attributes, meaning that tweets sharing a zip code will have the same values for these features.

Sentiment Score: Score between [-1, 1], where positive values signify positive sentiment and 0 signifies a neutral sentiment value. Additionally, the greater the absolute value, the stronger the implied sentiment.
Zillow Home Value Index (ZHVI): Measure of the typical home value (capturing market value, as well as price) in US Dollars within a zip code.
Healthcare and Social Assistance: Number of establishments that provide health care and social assistance to individuals.
Educational Services: Number of establishments, either for-profit or not-for-profit, that provide instruction or training. Such establishments may be privately or publicly owned.
Professional, Scientific, and Technical Services: Number of establishments that provide professional, scientific, and technical services requiring a high level of expertise or training.
Establishments. Data about the number of establishments per zip code is taken from the 2018 Annual Economic Surveys from the US Census (Table ID CB1800ZBP) [32]. We take the Health care and social assistance, Educational services, and Professional, scientific, and technical services data, which have the following meanings:
1. Healthcare and social assistance (sector 62) comprises establishments providing health care and social assistance for individuals [33] e.g., physician offices, dentists, mental health practitioners, outpatient care centers, ambulance services, etc. [34].
2. Educational services (sector 61) consist of establishments that provide instruction or training in a wide variety of subjects. The sector includes both privately and publicly owned institutions and both for profit and not for profit establishments [35] e.g., elementary and secondary schools, colleges, universities, computer training, professional schools, driving schools, etc. [36].
3. Professional, scientific, and technical services (sector 54) include establishments that specialize in providing professional, scientific, and technical services that require a high level of expertise or training [37] e.g., legal services, notaries, accounting, architectural services, building inspection, engineering services, scientific consulting, research and development, advertising, etc. [38].
All features are normalized using the StandardScaler function in Python's scikit-learn package [39]. Normalization is performed separately within the train and test data splits to prevent any test data leakage into the training phase. The next section provides further details on how the dataset was split into train and test partitions.
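The per-split scaling described above can be sketched as follows, with synthetic feature values standing in for the real zip code-level attributes. Each split gets its own scaler, so no test-set statistics reach the training phase:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for features such as ZHVI and establishment counts.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=500_000, scale=120_000, size=(80, 4))
X_test = rng.normal(loc=500_000, scale=120_000, size=(20, 4))

# Scaling performed separately within each split, as described above.
X_train_scaled = StandardScaler().fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)
```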

Train/test split and vaccine hesitancy ground truth
We use stratified splitting, implemented in the StratifiedShuffleSplit function in the scikit-learn package [40], to partition our tweets into train (80%) and test (20%) sets. This stratification is applied per zip code, ensuring that both the train and test splits include tweets from all 493 zip codes in approximately equal proportions. For example, 42.8% of both the train and test sets are tweets from the New York City metropolitan area, since 42.8% of the overall tweets in our corpus are from New York, and so on. Due to this stratified construction, both the train and test sets include tweets from all 9 metropolitan areas. Overall, there are 23,566 tweets in the train set and 5,892 tweets in the test set. As the name suggests, the train set is used to train the models described in the next section, while the test set is used for evaluations.
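The stratified split can be sketched with scikit-learn on synthetic data, with 10 zip codes of 20 tweets each standing in for the real corpus:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in: 200 tweets spread evenly over 10 zip codes.
rng = np.random.default_rng(42)
zip_codes = np.repeat(np.arange(10), 20)   # stratification labels
X = rng.normal(size=(200, 5))              # placeholder feature matrix

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, zip_codes))
```

Because the split stratifies on the zip code label, every zip code contributes tweets to both partitions in roughly its overall proportion.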
In order to evaluate any model, we need to obtain a ground truth, defined as the vaccine hesitancy score per zip code that the model is aiming to predict. Therefore, each of the 493 unique zip codes has a single corresponding vaccine hesitancy within the ground truth. The vaccine hesitancy values range from 0.0 to 1.0 on a continuous scale. Each such per-zip code value represents how hesitant, on average, people in that zip code are about the vaccine. It is also an estimate of the percentage of residents, within the zip code, who are vaccine hesitant.
We obtain such a ground truth by leveraging vaccine hesitancy data collected through the COVID-19 Gallup survey [8]. Specifically, Gallup launched a survey on March 13, 2020 that polled people's responses during the COVID-19 pandemic, using daily random samples of the Gallup Panel. This panel is a probability-based, nationally representative panel of U.S. adults. Vaccine hesitancy was polled by asking a vaccine hesitancy question, starting from July 20, 2020 (about four months after the initial survey was launched). The question is worded as follows: If an FDA-approved vaccine to prevent coronavirus/COVID-19 was available right now at no cost, would you agree to be vaccinated? A binary response of Yes or No polls a person's willingness to be vaccinated. We use the proportion of No responses among individuals polled within a specific zip code as our measure of the vaccine hesitancy score for this study. We calculate the proportion of the No answers to this question between July 20 and August 30, 2020 at the zip code-level to get a vaccine hesitancy score per zip code.
The mean vaccine hesitancy across all 493 unique zip codes corresponding to our tweets was calculated to be 0.240. The standard deviation is 0.334, showing that there is significant variance across zip codes, even when limited to the largest metropolitan areas in the US. The minimum and maximum values are 0.00 and 1.00, indicating complete vaccine acceptance and hesitancy, respectively.
Note that these ground truth values exist at the zip code-level, and are aggregate measures. A vaccine hesitancy of 0.5 in a zip code intuitively means that, on average, half the people in that zip code are vaccine hesitant. While we cannot say anything about an individual tweeter, for predictive modeling purposes, we label a tweet originating from zip code z with the ground truth vaccine hesitancy score corresponding to zip code z. This implies that, if there are k tweets from zip code z, then all k tweets are assigned the same pseudo vaccine hesitancy label. In the next section, we detail this further as an instance of weakly labeling the tweets with vaccine hesitancy signals. For completeness, when reporting the findings, we also report metrics at the tweet-level. However, the zip code-level metrics should always be interpreted as the true measure of our system's performance.
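Both steps, deriving the per-zip code ground truth from survey responses and weakly labeling tweets, can be sketched with pandas. The column names and toy values below are ours, since the Gallup microdata is not reproduced here:

```python
import pandas as pd

# Toy survey responses (column names are placeholders, not Gallup's schema).
survey = pd.DataFrame({
    "zip_code": ["10001", "10001", "10001", "60601", "60601"],
    "would_vaccinate": ["No", "Yes", "Yes", "No", "No"],
})

# Ground truth: proportion of "No" responses per zip code.
hesitancy = (
    survey.assign(hesitant=survey["would_vaccinate"].eq("No"))
    .groupby("zip_code")["hesitant"].mean()
)

# Weak labeling: every tweet inherits its zip code's hesitancy score.
tweets = pd.DataFrame({"tweet_id": [1, 2, 3], "zip_code": ["10001", "60601", "10001"]})
tweets["weak_label"] = tweets["zip_code"].map(hesitancy)
```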

Evaluation methodology and metrics
All predictive models and baselines used in this study, and described in the next two sections, are evaluated in two different ways: at the tweet-level and at the zip code-level. The tweet-level evaluation is based on a vaccine hesitancy prediction for every tweet in the test set (a total of 5,892 predictions), while the zip code-level evaluation relies on a single vaccine hesitancy prediction per zip code (a total of 493 predictions). Our predictive models, however, only make vaccine hesitancy predictions at the tweet-level. To derive a zip code-level prediction from these tweet-level predictions, we average all the tweet-level predictions within that zip code. Formally, for k tweets (in the test set) belonging to zip code z with predicted tweet-level vaccine hesitancies [\hat{y}_1, \ldots, \hat{y}_k], the predicted vaccine hesitancy for zip code z is given by the formula:

\hat{y}_z = \frac{1}{k} \sum_{i=1}^{k} \hat{y}_i

We use the Root Mean Square Error (RMSE) metric for measuring performance for both tweet-level and zip code-level predictions. Given m data points with real-valued ground truth vaccine hesitancy labels [y_1, \ldots, y_m], and predicted labels [\hat{y}_1, \ldots, \hat{y}_m], the RMSE is given by the formula below:

\mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2}

For the tweet-level evaluation, each of the m data points represents a tweet, while for the zip code-level evaluations, each data point represents a zip code. Thus, in the tweet-level RMSE score calculation, the pseudo tweet-level vaccine hesitancy labels, the assignment of which was described in the previous section, are compared with the tweet-level predictions obtained from the model. Similarly, in the zip code-level RMSE calculation, the ground truth vaccine hesitancies, obtained from Gallup, are compared with the zip code-level predictions made by the models. The lower the RMSE score, the lower the predictive error, and the better the model.
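The zip code-level aggregation and the RMSE metric described above amount to a few lines each (values below are synthetic):

```python
import numpy as np

def zip_level_prediction(tweet_preds):
    """Zip code-level prediction: mean of the k tweet-level predictions."""
    return float(np.mean(tweet_preds))

def rmse(y_true, y_pred):
    """Root Mean Square Error between ground truth and predicted labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

y_hat_z = zip_level_prediction([0.2, 0.4, 0.3])
error = rmse([0.24, 0.50], [0.30, 0.40])
```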
We emphasize that, because each zip code is annotated with a real-valued vaccine hesitancy, regression-based predictive modeling applies, rather than classification-based predictive modeling. Hence, we do not consider models that are primarily designed to be used as classifiers, such as Random Forests or Decision Trees. However, future research can potentially consider a different formulation of this problem that enables direct use of classification-based predictors.
Although we train our predictive models at the tweet-level, the tweet-level predictions are auxiliary to obtaining zip code-level vaccine hesitancy predictions. This is because model performance cannot be evaluated at the tweet-level, when our ground truth vaccine hesitancy values are at the zip code-level. In other words, the tweet-level vaccine hesitancy labels should be thought of as pseudo, or weak, labels. By a weak label, we mean that the tweet does not necessarily indicate vaccine hesitancy. Namely, the user publishing the tweet is not necessarily vaccine hesitant. Indeed, the tweet may not even be discussing vaccines directly. However, the tweet is published in a zip code for which vaccine hesitancy is known as a real-valued aggregate variable. The intuition is that, for the purposes of modeling, we can assign a tweet t published in zip code z, the vaccine hesitancy of zip code z. The tweet is then said to be weakly labeled with that vaccine hesitancy, since the true vaccine hesitancy of the user publishing the tweet is unknown.
Because weak labeling is, by definition, relatively inaccurate compared to the zip code-level vaccine hesitancy, which is directly derived from survey data, predictive performance at the level of tweets is only reported as an auxiliary result for the sake of completeness. The primary goal of this study, as discussed in Background, is to predict zip code-level vaccine hesitancies, using publicly available individual tweets.
In addition to computing a RMSE score on the test set for each predictive model, we also report the average of the 5-fold cross-validated RMSE score at the tweet-level. The methodology is as follows. First, we split the train set into five folds. The first fold contains 4,714 tweets, while the other four folds each contain 4,713 tweets, adding up to a total of 23,566 tweets, which is the entirety of the train set. For the purposes of cross-validation experiments, each fold is used as a test set once, while the remaining four folds act as the train set.
Because each fold is used as a test set only once, there are five training iterations, corresponding to the number of folds. At each iteration, we obtain one RMSE score representing the performance of the model trained on four folds and evaluated on the fifth. Over all iterations, therefore, we have five RMSE scores of which we report the average in Results as a measure of model robustness i.e., to further verify that the reported tweet-level RMSE values are not the result of luck on the actual test set, containing 5,892 tweets. We also use these scores to do a statistical significance analysis on the best model. Note that we do not report the average of the 5-fold cross-validated RMSE scores at the zip code-level. The reason is that cross-validation is computed during training, and as mentioned in the previous section, model training is done exclusively at the tweet-level. The sole purpose behind training and cross-validating the predictive models at the tweet-level is to obtain a measure of model robustness, and to enable significance analyses.
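The cross-validation loop can be sketched as follows, on synthetic data, with the SVR with RBF kernel standing in as the model being validated:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVR

# Synthetic stand-in data; each fold serves as the held-out set exactly once.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = rng.uniform(0.0, 1.0, size=100)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    model = SVR(kernel="rbf").fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    scores.append(np.sqrt(np.mean((y[test_idx] - preds) ** 2)))

mean_rmse = float(np.mean(scores))  # the robustness measure reported in Results
```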

Predictive models and features
As described in Text Preprocessing, we use fastText's word vectors trained on English Wikipedia data to embed tweet text. The resulting vectors are 300-dimensional, and all dimensions are retained throughout the study. We embed the processed full text in three different ways, corresponding to three representations. The first representation includes the text only i.e., no hashtags. The second representation includes both text and hashtags. Finally, the third representation only considers hashtags if any are available, but reverts back to using the text if no hashtags are present in the tweet. We refer to this last representation as the hybrid representation.
For example, ignoring any text transformations discussed in Text Preprocessing, the text only representation of the tweet "Be back soon my friends #corona #cov19 #notMyVirus #quarantinefitness" would embed only the "Be back soon my friends" part. The text and hashtags representation would incorporate the entire tweet, and the hybrid representation would embed only "#corona #cov19 #notMyVirus #quarantinefitness", since this specific tweet contains hashtags. Alternatively, the hybrid representation of the tweet "In the hospital not for Corona virus" would embed the tweet's text because no hashtags are provided. Compared to a representation that uses only hashtags, the hybrid representation is expected to be more robust because it still has the ability to use the text if no hashtags are present.
For each of the three representations described above, we build four predictive models, for a total of 12 models, incorporating all zip code-level features: two support vector regression (SVR) models, a linear regression model, and a stochastic gradient descent (SGD) regressor. One of the SVR models uses a radial basis function (RBF) kernel, while the other is based on a linear kernel. All of these models are established regression-based models in the machine learning community. Technical details can be found in any standard text [41].
Using the SVR with RBF kernel model, we build three additional predictive models (one per representation) that do not incorporate any zip code-level features. The reason for choosing the SVR with RBF kernel model is that, out of all twelve predictive models mentioned above, it was found to perform the best across all representations (subsequently demonstrated in Results). Additionally, we evaluate all predictive models, both including and excluding zip code-level features, with and without the sentiment score as a feature to understand the impact of sentiment on the RMSE score. Note that the sentiment score is an external, tweet-level feature not computed, or verified, by us, as it is provided directly within the underlying GeoCOV19Tweets dataset.
We set the maximum number of iterations in the SVR with linear kernel to 4,000, and specified a random state value of 42 for both the SVR with linear kernel and SGD models. Otherwise, we use the default parameters within the sklearn library for all predictive models described above. Recall from the previous section that for each of these models, the RMSE score is computed at both the tweet-level and the zip code-level. We also report the mean of the 5-fold cross-validated RMSE scores, applicable only for the tweet-level evaluations.
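A sketch of the four regressors with the stated settings follows, on synthetic data. We assume scikit-learn's LinearSVR class for the "SVR with linear kernel," since it exposes the max_iter and random_state parameters mentioned above (the libsvm-based SVR does not accept a random state); all other parameters are left at their defaults, as in the study:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.svm import SVR, LinearSVR

models = {
    "SVR (RBF kernel)": SVR(kernel="rbf"),                       # defaults
    "SVR (linear kernel)": LinearSVR(max_iter=4000, random_state=42),
    "Linear regression": LinearRegression(),
    "SGD regressor": SGDRegressor(random_state=42),
}

# Synthetic stand-in for the embedding plus zip code-level feature matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = rng.uniform(0.0, 1.0, size=200)

predictions = {name: m.fit(X, y).predict(X) for name, m in models.items()}
```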

Baselines
To evaluate the predictive power of the 15 models described in the previous section, we consider six constant-value baselines that predict a single value for each individual tweet (tweet-level) and each unique zip code (zip code-level). The RMSE scores for the tweet-level baseline predictions are measured with respect to the weakly labeled test set (5,892 tweets), while the errors for the zip code-level baselines are computed with respect to the original zip code-level ground truth (493 zip codes). Our baselines do not rely on sentiment, or on any text or zip code-level features. Table 3 summarizes the models.
Concerning the last three baselines in the table, both the mean vaccine hesitancy in the train set, and the mean vaccine hesitancy for all tweets, are weighted by the frequencies of the zip codes in our dataset. Additionally, the baseline relying on information from the ground truth is highly optimistic because it assumes that this information is known. Even the previous two baselines (at the tweet-level) rely on this information, since the pseudo-label relies on the zip code-level label, which is obtained from the ground truth.
In practice, the mean vaccine hesitancy within the ground truth, train set, or the entire dataset, will not be available, since that is what we are aiming to predict at the zip code-level. In subsequent sections, we refer to these baselines as the optimistic baselines in contrast with the first three (more realistic) baselines, which assume a manually specified constant value.
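The constant-valued baselines reduce to a few lines; the ground truth values below are synthetic, not the Gallup scores. Note that the mean-based (optimistic) baseline can never do worse than any other constant, since the mean minimizes squared error:

```python
import numpy as np

def constant_rmse(y_true, constant):
    """RMSE of predicting the same constant for every zip code."""
    y_true = np.asarray(y_true)
    return float(np.sqrt(np.mean((y_true - constant) ** 2)))

y = np.array([0.1, 0.0, 0.4, 0.25, 0.2])   # synthetic per-zip ground truth

# Realistic baselines: manually specified constants (no/partial/complete hesitancy).
realistic = {c: constant_rmse(y, c) for c in (0.0, 0.5, 1.0)}

# Optimistic baseline: the ground-truth mean, which is privileged information.
optimistic = constant_rmse(y, y.mean())
```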

Results
The RMSE metrics tabulated in Table 4 show that all predictive models outperform the best-performing, realistic baseline (no vaccine hesitancy with an RMSE of 0.411) at the zip code-level, although not all perform better than the most optimistic baselines (RMSE of 0.334). Specifically, no model has an RMSE score below 0.334, our optimistic baseline value, except for the SVR with RBF kernel models (without sentiment). Many of these models (such as the Text Only Linear Regression model) can themselves be thought of as baselines, given their prevalence and usage in Twitter-based predictive analytics [42]. We refer to them only as predictive models, however, because they have not been presented in the literature as baselines for the specific purpose of predicting vaccine hesitancy from Twitter data (let alone COVID-19 vaccine hesitancy).
For most predictive models, there is no difference in performance when sentiment is included versus when it is excluded. However, SVR with RBF kernel shows several noteworthy improvements when sentiment is omitted as a feature. Without sentiment, SVR with RBF kernel is the only model that outperforms even the most optimistic baselines across all three representations, regardless of whether zip code-level features are included. The only instance in which the SVR with RBF kernel model performs worse than the optimistic baselines is when the text and hashtags representation is used, sentiment is included, and all zip code-level features are excluded. Even in this case, the RMSE of 0.335 represents only a 0.1% increase in RMSE compared with the optimistic baselines.
We observe the lowest zip code-level RMSE score (0.308) for the SVR with RBF kernel model (text only representation), when no features other than the text are included, i.e., no zip code-level features and no sentiment score. This score represents a 7.78% improvement over the optimistic baselines, and a 25.06% improvement over the best-performing realistic baseline, i.e., the no vaccine hesitancy baseline. Compared with the complete vaccine hesitancy baseline, the model demonstrates a 62.89% improvement, and compared with the partial vaccine hesitancy baseline, a 27.19% improvement. When zip code-level features are added, the text only representation achieves an RMSE of 0.312, yielding a 6.59% improvement over the optimistic baselines and a 24.09% improvement over the no vaccine hesitancy baseline. Table 4 shows that adding sentiment to these two models, using the text only representation, actually reduces performance at the zip code-level.
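The improvement percentages above follow from a simple relative-error calculation; the sketch below, using the RMSE values quoted in the text, is purely illustrative:

```python
def rel_improvement(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to a baseline."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse

# RMSE values quoted in the text (best model vs. the two key baselines).
print(rel_improvement(0.334, 0.308))  # vs. optimistic baselines, ~7.78
print(rel_improvement(0.411, 0.308))  # vs. no vaccine hesitancy baseline, ~25.06
```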
Interestingly, adding sentiment as the only external feature to the text embeddings greatly reduces performance for the text only and text and hashtags representations when using the SVR with RBF kernel model. For the former representation, performance is reduced by 6.38%, while for the latter, performance decreases by 6.27%. For the hybrid representation, the RMSE score is unaffected.
In addition to the results presented in Table 4, we also built an SVR with RBF kernel model that predicts zip code-level vaccine hesitancy based on zip code-level data only (no text), both with and without sentiment. When predicting vaccine hesitancy based on zip code-level data alone, this model achieves an RMSE of 0.332; when sentiment is added, it achieves an RMSE of 0.333. Both outperform the most realistic no vaccine hesitancy baseline model, with 19.22% and 18.98% performance improvements, respectively. They also marginally outperform the optimistic baselines, with 0.60% and 0.30% respective decreases in RMSE scores. This finding suggests that, even when relying only on the number of establishments in the healthcare, educational, and professional, scientific, or technology sectors, as well as the Zillow Home Value Index (ZHVI), it is possible to predict zip code-level vaccine hesitancy with a marginally lower error than the constant-value baselines.

Table 4. The Root Mean Square Error (RMSE) scores at both the tweet-level and zip code-level for all models, along with the average of the 5-fold cross-validated RMSE scores for the predictive models. Models marked with (*) do not include any zip code-level features; the other models include all zip code-level features, in addition to text. The RMSE scores for the predictive models, with and without sentiment as a feature, are reported as without sentiment / with sentiment. Cross-validation is not applicable for the baseline models.

Importantly, the findings show that adding tweet text leads to consistently lower error for the zip code-level predictions. The best-performing model shows a 25.06% improvement over the no vaccine hesitancy constant-value baseline, compared with only 19.22% when no text is used. It also shows a 62.89% improvement over the complete vaccine hesitancy constant-value baseline. The use of tweet text, particularly in models implementing the text only representation and without any external features, is therefore a powerful indicator of zip code-level vaccine hesitancy. In other words, the tweet text contains more signal than noise, on average, when predicting vaccine hesitancy.
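As a rough sketch of how such a model can be assembled with open-source tools, the snippet below trains an SVR with an RBF kernel using scikit-learn. The synthetic arrays are stand-ins for the tweet-text embeddings and survey-derived hesitancy labels used in the study; they are not the actual data, dimensions, or hyperparameters:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Placeholders for tweet-embedding features and zip code-level hesitancy labels.
X_train, y_train = rng.normal(size=(200, 32)), rng.uniform(0, 1, 200)
X_test, y_test = rng.normal(size=(50, 32)), rng.uniform(0, 1, 50)

# SVR with an RBF kernel: the best-performing model family in Table 4.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model.fit(X_train, y_train)

rmse = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))
print(f"zip code-level RMSE: {rmse:.3f}")
```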

Significance analysis
We also conducted a significance analysis of the results in Table 4. Specifically, we took the distribution of the five RMSE scores achieved on the folds of the cross-validation experiment recorded in Column 4 of Table 4, and computed the sample mean and standard error. Since we are comparing the effectiveness of the best model, the Text Only SVR (RBF kernel) without sentiment, against the other models, we use a one-sided Student's t-test. We compute the test statistic, and by extension the p-value, by comparing this sample mean to each model's mean (both with and without sentiment), as recorded in Column 4 of Table 4.
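This procedure amounts to a one-sample, one-sided t-test of the best model's per-fold RMSE scores against each competing model's mean. A minimal sketch, with hypothetical fold scores and a hypothetical competing mean:

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold RMSE scores for the best model (5-fold CV), not our data.
best_fold_rmses = np.array([0.305, 0.310, 0.307, 0.312, 0.306])

# Hypothetical mean cross-validated RMSE of a competing model.
other_model_mean = 0.335

# One-sided, one-sample t-test: the null hypothesis is that the best model's
# mean RMSE is greater than (i.e., worse than) the competing model's mean.
result = stats.ttest_1samp(best_fold_rmses, other_model_mean, alternative="less")
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
```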
The corresponding p-values are tabulated in Table 5. The results show that the best model's performance cannot be explained by chance alone. The results are also fairly robust, i.e., the model obtained by adding hashtag-based features is not significantly different from the text-only best model. Similarly, the hybrid version without sentiment, as well as the text-only version with sentiment, are only moderately different from the best model. Once sentiment is added to the hybrid model, it again becomes significantly worse than the best model. Automatic sentiment analysis, therefore, consistently proves to be a noisy feature for this problem domain. Whether this is due to noise in the sentiment analysis itself, or due to sentiment not being strongly associated with vaccine hesitancy, is an important question for future research to pursue. At minimum, these results indicate that sentiment-based features should be used with caution by computational modelers when building automatic social media-based vaccine hesitancy detection systems.

Table 5. One-sided p-values of each model's Root Mean Square Error (RMSE) score, using the Student's t-test, against the distribution of per-fold cross-validation RMSE scores achieved by the best model in Table 4. The best model is the Text Only SVR (RBF kernel) without sentiment, indicated as Ref below. The null hypothesis is that the mean of this best reference model is greater than (in other words, worse than) the means shown in Column 4 of Table 4. Models marked with (*) do not include any zip code-level features. The p-values for the predictive models, with and without sentiment as a feature, are reported as without sentiment / with sentiment. Since cross-validation is not applicable for the constant-value baseline models, they are excluded from this table.

Discussion and error analysis
Although the machine learning-based models presented in the previous sections clearly outperform the constant-value baselines, it is useful to consider the issue through a qualitative lens. Such a lens helps us understand the kinds of tweets that contain either signal or noise, and to hypothesize about sources of prediction error. In this section, we conduct such an analysis by sampling some tweets from the dataset, prior to any text preprocessing. For brevity, we include only the text and hashtags, i.e., we exclude any mentions, hyperlinks, or locations. Tweets sampled from zip codes labeled with a vaccine hesitancy of 1.0 (complete vaccine hesitancy) are enumerated below:

3. "This is how you know there's a problem? When is the last time you saw gas prices this low?! #cov?d19 #corona" (Philadelphia)

These examples illustrate the importance of measuring vaccine hesitancy at an aggregate level, such as zip code, rather than at the individual tweet-level. First, there are obvious ethical and privacy-related concerns with setting up such experiments without individual consent. Second, it is unlikely that extreme cases of vaccine hesitancy or non-hesitancy will manifest on Twitter for the vast majority of the population, although a full study of this hypothesis merits future research. At the zip code-level, however, the predictions are more promising.
As presented earlier in Table 4, the text only representation of the SVR with RBF kernel model, excluding sentiment and zip code-level features, achieved an RMSE of 0.308 when predicting vaccine hesitancy at the zip code-level. This model, however, tends to slightly overestimate vaccine hesitancy. In 291 zip codes (out of 493), we find that the predicted vaccine hesitancy is greater than the true vaccine hesitancy; in the remaining 202 zip codes, the predicted vaccine hesitancy is lower than the ground truth value. For 50 zip codes, the predicted vaccine hesitancy is overestimated by 0.20 or more, and when looking at the absolute-value difference between the predicted and true vaccine hesitancies, we find that the gap is 0.20 or more in 179 zip codes.
For instance, in Philadelphia, 83.33% of the zip code-level predictions are over-estimates, the highest among all metropolitan areas. In New York, the metropolitan area with the highest proportion of tweets and the highest average number of tweets per zip code in our study, the predicted vaccine hesitancies for 60.53% of zip codes are over-estimated. In fact, in all metropolitan areas except for Phoenix, San Diego, and Dallas, vaccine hesitancy in more than 50% of the zip codes is over-estimated. In Phoenix, exactly half of all zip code-level vaccine hesitancies are over-estimates. For the remaining metropolitan areas, the percentage of zip codes having over-estimated vaccine hesitancies falls between 58.11% and 62.79%.
Despite the large proportion of over-estimated vaccine hesitancies presented above, we observe that only 16.67% and 9.21% of the zip codes in Philadelphia and New York, respectively, have vaccine hesitancies over-estimated by a margin greater than 0.20. The metropolitan area with the greatest proportion of zip codes with vaccine hesitancies over-estimated by 0.20 or more is San Antonio (30%), which is also the metropolitan area with the fewest tweets. San Diego, on the other hand, has only one zip code (3.70%) where vaccine hesitancy is overestimated by 0.20 points or more.
Despite the large proportion of over-estimates, we observe that Philadelphia has the smallest proportion of zip codes (25%) wherein the absolute-value difference between the predicted and true vaccine hesitancies is higher than 0.20. Thus, using this metric, we note that our SVR with RBF kernel model performed best in Philadelphia. In contrast, the model performs worst in Dallas, with 57.58% of zip codes showing an absolute-value difference higher than 0.20 between the predicted and true vaccine hesitancies. For all other metropolitan areas, the proportion of zip codes wherein the predicted vaccine hesitancy differs by 0.20 or more points from the true vaccine hesitancy falls between 29.73% and 50%.
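The over-estimation and absolute-gap statistics reported in this section can be computed with a short helper; the zip code values below are illustrative, not our actual predictions:

```python
import numpy as np

def error_breakdown(y_true, y_pred, margin=0.20):
    """Fractions of zip codes that are over-estimated, and whose absolute
    prediction error meets or exceeds the given margin."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    over = float(np.mean(y_pred > y_true))
    large_gap = float(np.mean(np.abs(y_pred - y_true) >= margin))
    return over, large_gap

# Illustrative hesitancy values for a handful of zip codes.
y_true = [0.30, 0.45, 0.60, 0.20]
y_pred = [0.55, 0.50, 0.55, 0.35]
print(error_breakdown(y_true, y_pred))  # -> (0.75, 0.25)
```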
Overall, the results suggest that the methods presented herein should not be used for highly sensitive predictions. However, the low gap between the true and predicted hesitancy in many zip codes, especially compared to the constant-value baselines (all of which performed worse than the SVR method), suggests that the method can feasibly be used by social scientists and digital health experts as an early warning system. We close this section with a note on empirically checking the robustness of these findings. Vaccination has been well underway since the survey data used for this study was compiled [43]. One issue that can arise when surveying respondents about their future decisions is that the responses may carry some unknown bias; for instance, due to events that occur in the aftermath of the survey, people may change their opinions.
Given that vaccination data is now available at the zip code-level, at least for large states such as California, one way to do a robustness check of the results is to compare the association between survey-based vaccine hesitancy responses with actual vaccination statistics. We conducted such an analysis by downloading actual full-vaccination rates, as a percentage of the population, from a California state government data portal, the link to which is provided in the Data Availability section at the end of the article.
Our specific methodology was to first locate the California zip codes that are common between the Gallup survey data and the actual vaccination data. In total, there were 853 such zip codes. Next, we subtracted the survey-based vaccine hesitancy score from 1.0 to obtain a vaccine proclivity score. We computed both the Pearson correlation and the Spearman's rank correlation between vaccine proclivity and the actual vaccination rate. The Pearson correlation was found to be 0.161 (with p-value 2.461e-06) and the Spearman's rank correlation was found to be 0.1664 (with p-value 1.023e-06).
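A sketch of this robustness check, using synthetic stand-ins for the hesitancy scores and vaccination rates (the real analysis used 853 California zip codes):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic stand-ins: survey-based hesitancy and actual vaccination rates
# for a set of zip codes, with a weak negative relationship built in.
hesitancy = rng.uniform(0, 1, 100)
vaccination_rate = 0.7 - 0.1 * hesitancy + rng.normal(0, 0.05, 100)

# Vaccine proclivity is defined as 1.0 minus the hesitancy score.
proclivity = 1.0 - hesitancy

pearson_r, pearson_p = stats.pearsonr(proclivity, vaccination_rate)
spearman_r, spearman_p = stats.spearmanr(proclivity, vaccination_rate)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3g})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.3g})")
```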
Even a year later, therefore, there is a significant positive correlation between vaccine proclivity and actual vaccination. Although this provides a preliminary robustness check on our ground truth (and by extension, the results that it was used to validate) we leave for future work to conduct a confirmatory assessment of our models, using the actual vaccination rates as the ground truth.

Conclusion
Although it is declining, significant vaccine hesitancy continues to persist in many geographic and demographic segments of the adult population. Due to the expense of conducting detailed and representative surveys, there is a need for an inexpensive and more real-time predictive model for detecting vaccine hesitancy. In this article, we explored public social media data from Twitter as a potential source of such information. Without identifying individual users, our models use the text and hashtags in tweets to detect vaccine hesitancy at the zip code-level. Using independently collected survey data from Gallup as the ground truth, we conducted an experimental study demonstrating the feasibility of these models.
Specifically, we found that the models performed well compared to the constant-value baselines, despite there being irrelevant, non-vaccine-related, or even non-COVID-related tweets present in the corpus. Using a set of California-based zip codes, we also conducted a robustness check by computing the association between vaccine proclivity, derived from our survey-based ground truth, and the latest full vaccination rates. We found both association measures to be positive and significant, suggesting that the survey data is a reliable ground truth and a reasonable predictor of vaccination rates in the immediate future.
We end with the caveat that such a system should not be intended as a replacement for comprehensive and representative surveys. Rather, by serving as a supplementary source of information, such a system may help public health officials detect clusters of vaccine hesitancy, and proactively seek to mitigate it with communication and outreach.
Another caveat is that such a system is also expected to be more reliable for urban areas, toward which social media platforms like Twitter tend to be heavily biased in their user base. Also, our results currently hold for the United States. An important avenue for future research is to replicate and extend these results to other countries, including those where English is not the primary language. If correctly implemented and used, such a system may serve as a valuable and inexpensive asset in a nation's digital health infrastructure, especially as more people start engaging with social media. We also hope to explore an extension of this work to image- and video-based social media platforms, such as Instagram and TikTok.