The authors have declared that no competing interests exist.
Conceived and designed the experiments: LM MRF KDH PSD CMD. Performed the experiments: LM MRF KDH. Analyzed the data: LM MRF KDH PSD CMD. Wrote the paper: LM. Edited the manuscript: LM PSD CMD.
We conduct a detailed investigation of correlations between real-time expressions of individuals made across the United States and a wide range of emotional, geographic, demographic, and health characteristics. We do so by combining (1) a massive, geo-tagged data set comprising over 80 million words generated in 2011 on the social network service Twitter and (2) annually-surveyed characteristics of all 50 states and close to 400 urban populations. Among many results, we generate taxonomies of states and cities based on their similarities in word use; estimate the happiness levels of states and cities; correlate highly-resolved demographic characteristics with happiness levels; and connect word choice and message length with urban characteristics such as education levels and obesity rates. Our results show how social media may be used to estimate real-time levels and changes in population-scale measures such as obesity rates.
With vast quantities of real-time, fine-grained data, describing everything from transportation dynamics and resource usage to social interactions, the science of cities has entered the realm of the data-rich fields. While much work and development lies ahead, opportunities for quantitative study of urban phenomena are now far more broadly available to researchers.
Numerous studies on well-being are published every year. The UN’s 2012 World Happiness Report attempts to quantify happiness on a global scale with a ‘Gross National Happiness’ index which uses data on rural-urban residence and other factors
While these and other approaches to quantifying the sentiment of a city as a whole rely almost exclusively on survey data, there is now a range of complementary, remote-sensing methods available to researchers. The explosion in the amount and availability of data relating to social media in the past 10 years has driven a rapid increase in the application of data-driven techniques to the social sciences and sentiment analysis of large-scale populations.
Our overall aim in this paper is to investigate how geographic place correlates with and potentially influences societal levels of happiness. In particular, after first examining happiness dynamics at the level of states, we will explore urban areas in the United States in depth, and ask if it is possible to (a) measure the overall average happiness of people located in cities, and (b) explain the variation in happiness across different cities. Our methodology for answering the first question uses word frequency distributions collected from a large corpus of geolocated messages or ‘tweets’ posted on Twitter, with individual words scored for their happiness independently by users of Amazon’s Mechanical Turk service
We structure our paper as follows. In the Methods section, we describe the data sets and our methodology for measuring happiness. In part 1 of the Results section we measure the happiness of different states and cities and determine the happiest and saddest states and cities in the US, with some analysis of why places vary with respect to this measure. In part 2 of the Results section we compare our results for cities with census data, correlating happiness and word usage with common social and economic measures. We also use the word frequency distributions to group cities by their similarities in observed word use. We conclude with a discussion of the results and outlook for further research.
We examine a corpus of over 10 million geotagged tweets gathered from 373 urban areas in the contiguous United States during the calendar year 2011. This corpus is a subset of Twitter’s ‘garden hose’ feed, which in 2011 represented roughly 10% of all messages. For the present study, we focus on the approximately 1% of tweets that are geotagged. Urban areas are defined by the 2010 United States Census Bureau’s MAF/TIGER (Master Address File/Topologically Integrated Geographic Encoding and Referencing) database
To measure sentiment (hereafter happiness) in these areas from the corpus of words collected, we use the Language Assessment by Mechanical Turk (LabMT) word list (available online in the supplementary material of
For a given text
Importantly, with this method we make no attempt to take the context of words or the meaning of a text into account. While this may lead to difficulties in accurately determining the emotional content of small texts, we find that for sufficiently large texts this approach nonetheless gives reliable (if eventually improvable) results. An analogy is that of temperature: while the motion of a small number of particles cannot be expected to accurately characterize the temperature of a room, an average over a sufficiently large collection of such particles nonetheless defines a durable quantity. Furthermore, by ignoring the context of words we gain both a computational advantage and a degree of impartiality; we do not need to decide
Following Dodds et al. (2011), for the remainder of this paper, we remove all words
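The word-counting approach described above can be illustrated with a minimal Python sketch. It assumes that the average happiness of a text is the frequency-weighted mean of the per-word scores, taken over only those words that have a score. The function name `average_happiness`, the toy score dictionary, and the optional `exclude_band` argument are illustrative assumptions, not the authors' code, and the band limits used in the example are placeholders rather than the exclusion criterion used in this paper.

```python
from collections import Counter


def average_happiness(words, scores, exclude_band=None):
    """Frequency-weighted mean happiness of the scored words in a text.

    words        -- iterable of tokens
    scores       -- dict mapping word -> happiness score (stand-in for a word list)
    exclude_band -- optional (low, high) pair; words with low < score < high
                    are ignored. Hypothetical parameter for illustration.
    """
    counts = Counter(w.lower() for w in words)
    total_freq = 0
    weighted_sum = 0.0
    for word, freq in counts.items():
        score = scores.get(word)
        if score is None:
            continue  # words without a score do not contribute
        if exclude_band is not None and exclude_band[0] < score < exclude_band[1]:
            continue  # skip words inside the excluded band
        total_freq += freq
        weighted_sum += score * freq
    if total_freq == 0:
        raise ValueError("text contains no scored words")
    return weighted_sum / total_freq


# Toy scores: a three-word stand-in for a full scored word list.
toy_scores = {"love": 8.42, "hate": 2.34, "me": 6.58}
text = "love love hate me".split()

print(average_happiness(text, toy_scores))
# Placeholder band (6, 7) drops "me" from the average.
print(average_happiness(text, toy_scores, exclude_band=(6.0, 7.0)))
```

Because the score is a simple weighted mean, it can be computed from word counts alone, which is what makes the method cheap to apply to very large corpora.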
We will correlate our happiness results with census data which was taken from the 2011 American Community Survey 1-year estimates, accessible online at
We first examine how happiness varies on a somewhat coarser scale than we will focus on for the majority of this paper, by plotting the average happiness of all states in the US in
The happiest 5 states, in order, are: Hawaii, Maine, Nevada, Utah and Vermont. The saddest 5 states, in order, are: Louisiana, Mississippi, Maryland, Delaware and Georgia. Word shift plots describing how differences in word usage contribute to variation in happiness between states are presented in Appendix B in
At such a coarse resolution there is little variation between states, which all lie within 0.15 of the mean value for the entire United States of
In
Points are colored by
the behavioral risk factor survey score (BRFSS) used by Oswald and Wu
the 2011 Gallup well-being index
the 2011 United States peace index
the 2011 United Health Foundation’s America’s health ranking (AHR)
the number of shootings per 100,000 people in 2011.
We can further use this data on word frequencies to characterize similarities between states based on word usage. For simplicity, we focus on the 50,000 most frequently occurring words on Twitter
Red signifies states with similar or highly-correlating word frequency distributions, while blue signifies states with relatively dissimilar word frequency distributions.
We now change our resolution to a finer scale by focusing on cities rather than states. As an illustration of the resolution of the data set as well as our technique, we plot a tweet-generated map of a city, showing how average word happiness varies with location. In
Each point represents an individual tweet and is colored by the average word happiness
Several features can immediately be discerned in this purely tweet-generated map. Firstly, the spatial resolution reveals the outline of Manhattan, as well as Central Park, individual streets and bridges, and even airport terminals such as those at JFK and Newark airports at the lower right and center left of the figure respectively. Secondly, we can discern regions of higher and lower happiness: the Harlem and Washington Heights areas to the north appear relatively sad compared to the Downtown/Midtown area, as does the Waterfront, New Jersey area west of the southern tip of Manhattan. Similar tweet-generated maps for all 373 cities measured are presented in Appendix B in
In
Points are colored as in
Next we calculate the happiness
A vertical dashed line denotes the average for all cities. Note the greater weight towards the right of the distribution, with more cities having happiness scores higher than the average.
It is well known that city population sizes follow a power law distribution (see
Areas with a higher density of tweets per capita tend to be less happy.
The bar charts in
Scores were calculated using (1) and the LabMT word list. The full list of cities can be found in Appendix C in
As was the case with our state happiness rankings, several cities that rank either high or low by our measure rank similarly in more traditional survey-based efforts. For example, the 2011 Gallup-Healthways well-being survey
To investigate why the average word happiness varies across urban areas, we study the word shift graphs
These show how
We observe some features of the graphs that are consistent with geography: for example, the word ‘beach’ appears high on the list of words for coastal cities such as Santa Cruz, California or Miami, Florida. Overall, the main factor driving the relative happiness scores for each city appears to be the presence or absence of key words such as ‘lol’, ‘haha’ and its variants, ‘hell’, ‘love’, ‘like’ and the negative words ‘no’, ‘don’t’, ‘never’ and ‘wrong’, as well as profanity.
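To make the idea of per-word contributions concrete, the following Python sketch decomposes the difference in average happiness between a comparison text and a reference text into one term per word. It assumes that average happiness is a frequency-weighted mean, in which case the difference equals the sum over words of (word score minus reference average) times (change in normalized frequency). The function name `word_shift` and the toy counts are illustrative assumptions, and any normalization or ranking convention used in the published word shift graphs is not reproduced here.

```python
def word_shift(ref_freqs, comp_freqs, scores):
    """Per-word contributions to (comparison average - reference average).

    ref_freqs, comp_freqs -- dicts mapping word -> raw count
    scores                -- dict mapping word -> happiness score
    Returns (reference average, list of (word, contribution)) sorted by
    absolute contribution, largest first.
    """
    words = [w for w in scores if w in ref_freqs or w in comp_freqs]
    ref_total = sum(ref_freqs.get(w, 0) for w in words)
    comp_total = sum(comp_freqs.get(w, 0) for w in words)
    if ref_total == 0 or comp_total == 0:
        raise ValueError("each text needs at least one scored word")

    h_ref = sum(scores[w] * ref_freqs.get(w, 0) for w in words) / ref_total

    contributions = {}
    for w in words:
        # Change in normalized frequency between the two texts.
        dp = comp_freqs.get(w, 0) / comp_total - ref_freqs.get(w, 0) / ref_total
        # A word raises the comparison score if it is happier than the
        # reference average and more frequent, or sadder and less frequent.
        contributions[w] = (scores[w] - h_ref) * dp

    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return h_ref, ranked


# Toy example: hypothetical counts for a reference and a comparison text.
toy_scores = {"love": 8.42, "hate": 2.34}
reference = {"love": 2, "hate": 2}
comparison = {"love": 3, "hate": 1}

h_ref, ranked = word_shift(reference, comparison, toy_scores)
for word, delta in ranked:
    print(f"{word}: {delta:+.3f}")
```

The contributions sum to the total difference in average happiness, so a ranked list of this kind shows which words account for most of the gap between a city and the reference.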
The word shifts of
We first focus on how the average happiness
In
The 8 groupings along the horizontal axis are for covarying attributes identified by agglomerative hierarchical clustering, independently of happiness. Crosses lie on the median of each cluster, and the dashed lines represent the 1% significance level. The two clusters which have medians that correlate significantly with happiness are colored blue. A complete list of the correlation of all attributes with happiness can be found in Appendix D in
To further understand what drives this correlation of certain demographics with happiness, we now investigate how each word from the LabMT list correlates with each census attribute. To do this we first normalize the word counts in each urban area by the total number of tweets collected in each city, and then for each word calculate the Spearman correlation
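The normalization and rank correlation described above might be sketched as follows, using only the Python standard library. The helper names (`word_rates`, `average_ranks`, `spearman`) and the five-city toy data are assumptions for illustration; the sketch computes the Spearman coefficient as the Pearson correlation of tie-averaged ranks and does not compute the p-values reported in the tables.

```python
import math


def word_rates(word_counts, tweet_counts):
    """Occurrences of a word per tweet in each city."""
    return [c / t for c, t in zip(word_counts, tweet_counts)]


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data for five cities (not taken from the paper).
cafe_counts = [10, 40, 15, 90, 5]            # occurrences of one word
tweet_counts = [1000, 2000, 1000, 3000, 1000]
degree_pct = [20, 30, 28, 45, 18]            # a census attribute

rho = spearman(word_rates(cafe_counts, tweet_counts), degree_pct)
print(rho)
```

Because the coefficient depends only on ranks, it is insensitive to the heavy-tailed distribution of word rates across cities, which is one reason a rank correlation is a natural choice for this kind of comparison.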
The scatter plot shows the correlation between rate of occurrence of the word ‘café’ and percentage of population with a bachelor’s degree or higher in US cities during the calendar year 2011. The red line shows linear correlation while the reported
We present lists showing the correlation of each LabMT word with every demographic attribute in Appendix D in
Word | Spearman correlation | p-value | Happiness score |
cafe | 0.481 | 4.9×10^−23 | 6.78 |
pub | 0.463 | 3.14×10^−21 | 6.02 |
software | 0.458 | 9.07×10^−21 | 6.30 |
yoga | 0.455 | 1.85×10^−20 | 7.04 |
grill | 0.433 | 1.78×10^−18 | 6.24 |
development | 0.424 | 1.14×10^−17 | 6.38 |
emails | 0.419 | 2.87×10^−17 | 6.54 |
wine | 0.417 | 3.83×10^−17 | 6.42 |
library | 0.414 | 6.47×10^−17 | 6.48 |
art | 0.414 | 6.8×10^−17 | 6.60 |
sciences | 0.410 | 1.54×10^−16 | 6.30 |
pasta | 0.410 | 1.57×10^−16 | 6.86 |
lounge | 0.409 | 1.68×10^−16 | 6.50 |
market | 0.408 | 2.2×10^−16 | 6.28 |
india | 0.407 | 2.5×10^−16 | 6.42 |
drinking | 0.405 | 3.74×10^−16 | 6.14 |
technology | 0.405 | 3.76×10^−16 | 6.74 |
forest | 0.405 | 3.83×10^−16 | 6.68 |
brunch | 0.405 | 3.89×10^−16 | 6.32 |
dining | 0.403 | 4.92×10^−16 | 6.48 |
supporting | 0.399 | 1.1×10^−15 | 6.48 |
professor | 0.398 | 1.23×10^−15 | 6.04 |
university | 0.392 | 3.62×10^−15 | 6.74 |
film | 0.391 | 4.27×10^−15 | 6.56 |
global | 0.391 | 4.72×10^−15 | 6.00 |
Top 25 words with strongest positive Spearman correlation
Word | Spearman correlation | p-value | Happiness score |
me | −0.393 | 3.26×10^−15 | 6.58 |
love | −0.389 | 6.51×10^−15 | 8.42 |
my | −0.354 | 1.97×10^−12 | 6.16 |
like | −0.346 | 6.04×10^−12 | 7.22 |
hate | −0.344 | 8.76×10^−12 | 2.34 |
tired | −0.343 | 1×10^−11 | 3.34 |
sleep | −0.341 | 1.27×10^−11 | 7.16 |
stupid | −0.328 | 8.55×10^−11 | 2.68 |
bored | −0.315 | 5.11×10^−10 | 3.04 |
you | −0.315 | 5.23×10^−10 | 6.24 |
goodnight | −0.305 | 1.77×10^−9 | 6.58 |
bitch | −0.295 | 6.51×10^−9 | 3.14 |
all | −0.289 | 1.33×10^−8 | 6.22 |
lie | −0.285 | 2.24×10^−8 | 2.60 |
mom | −0.284 | 2.42×10^−8 | 7.64 |
wish | −0.271 | 1.05×10^−7 | 6.92 |
talk | −0.267 | 1.74×10^−7 | 6.06 |
she | −0.265 | 2.01×10^−7 | 6.18 |
know | −0.262 | 2.78×10^−7 | 6.10 |
ill | −0.259 | 4.11×10^−7 | 2.42 |
dont | −0.258 | 4.54×10^−7 | 3.70 |
well | −0.256 | 5.3×10^−7 | 6.68 |
don’t | −0.255 | 5.8×10^−7 | 3.70 |
give | −0.255 | 5.84×10^−7 | 6.54 |
friend | −0.255 | 6.27×10^−7 | 7.66 |
Top 25 words with strongest negative Spearman correlation
The technique applied here is not limited only to census data. As an example of a different use of the corpus, we now correlate word use to obesity at the metropolitan level. For this study we take obesity levels from the Gallup and Healthways 2011 survey
Performing the same analysis as for the attributes in
The scatter plot shows the correlation between
As we did for the census data, we also correlate the abundance of each individual word in the LabMT list to obesity levels in the 190 cities surveyed. From this list we extract words that are clearly food-related, and in
Word | Correlation | p-value | Happiness score |
cafe | −0.509 | 6.07×10^−14 | 6.78 |
sushi | −0.487 | 9.93×10^−13 | 5.40 |
brewery | −0.469 | 8.67×10^−12 | N/A |
restaurant | −0.448 | 8.93×10^−11 | 7.06 |
bar | −0.435 | 3.59×10^−10 | 5.82 |
banana | −0.434 | 3.77×10^−10 | 6.86 |
apple | −0.408 | 5.22×10^−9 | 7.44 |
fondue | −0.403 | 8.34×10^−9 | N/A |
wine | −0.400 | 1.08×10^−8 | 6.42 |
delicious | −0.392 | 2.17×10^−8 | 7.92 |
dinner | −0.386 | 3.85×10^−8 | 7.40 |
coffee | −0.384 | 4.51×10^−8 | 7.18 |
bakery | −0.383 | 5.12×10^−8 | N/A |
bean | −0.378 | 7.88×10^−8 | 5.80 |
espresso | −0.377 | 8.47×10^−8 | N/A |
cuisine | −0.376 | 8.82×10^−8 | N/A |
foods | −0.374 | 1.07×10^−7 | 7.26 |
tofu | −0.372 | 1.27×10^−7 | N/A |
brunch | −0.368 | 1.79×10^−7 | 6.32 |
veggie | −0.364 | 2.46×10^−7 | N/A |
organic | −0.361 | 3.13×10^−7 | 6.32 |
booze | −0.360 | 3.34×10^−7 | N/A |
grill | −0.354 | 5.4×10^−7 | 6.24 |
chocolate | −0.351 | 6.77×10^−7 | 7.86 |
#vegan | −0.350 | 7.47×10^−7 | N/A |
mcdonalds | 0.246 | 6.18×10^−4 | 5.98 |
eat | 0.241 | 8.22×10^−4 | 7.04 |
wings | 0.222 | 2.13×10^−3 | 6.52 |
hungry | 0.210 | 3.65×10^−3 | 3.38 |
heartburn | 0.194 | 7.37×10^−3 | N/A |
ham | 0.177 | 1.45×10^−2 | 5.66 |
The 25 food-related words with the strongest negative correlation to obesity level (top), and the 6 food-related words with positive correlation to obesity level and
Conversely, only 6 food-related words significantly positively correlate with obesity with
The above analysis demonstrates that different cities have unique characteristics. We now ask whether cities can be sorted into groups based solely upon similarities in their word distributions. Bettencourt
We group the top 40 cities with highest total word counts in 2011 by calculating the linear correlation between word frequency vectors
The clustergram shows cross-correlations between word frequency distributions for the 40 cities with highest word counts in 2011. Red signifies cities with similar word frequency distributions, while blue signifies cities with dissimilar word frequency distributions.
We cluster cities using an agglomerative hierarchical method with average linkage clustering
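As a rough illustration of grouping cities by word use, the Python sketch below computes linear (Pearson) correlations between word frequency vectors and then merges clusters by average linkage. It assumes a distance of one minus the correlation, which is a common choice but is not confirmed by the text here, and it stops at a requested number of clusters rather than building a full dendrogram. The function names and the four toy cities are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def average_linkage(names, vectors, n_clusters):
    """Agglomerative clustering with average linkage.

    Distance between two cities is assumed to be 1 - Pearson correlation of
    their word frequency vectors. Clusters are merged until n_clusters remain.
    """
    n = len(names)
    dist = [[0.0 if i == j else 1.0 - pearson(vectors[i], vectors[j])
             for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average linkage: mean pairwise distance between members.
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical word frequency vectors over the same four words.
cities = ["city_a", "city_b", "city_c", "city_d"]
freqs = [
    [5, 1, 1, 3],
    [10, 2, 2, 7],
    [1, 5, 5, 2],
    [2, 9, 10, 3],
]
print(average_linkage(cities, freqs, n_clusters=2))
```

With average linkage, a new cluster's distance to every other cluster is the mean of all pairwise distances between their members, which tends to be less sensitive to a single unusual city than single or complete linkage.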
In this paper we have examined word use in urban areas in the United States, using a simple mathematical method which has been shown to have great flexibility, sensitivity, and robustness. We have used this tool to map areas of high and low happiness and score individual states and cities for average word happiness. In order to understand in greater detail how word usage influences happiness, we used word shift graphs to find the words which produced the greatest difference between the happiness scores of each individual city and the average for the entire US, and socioeconomic census data to attempt to explain the usage of certain words. A significant driver of the happiness score for individual cities was found to be frequency of profanity; we believe that future studies of regional variation in swear word use or ‘geoprofanity’ could help explain geographical differences in happiness. Indeed, swearing has previously been found to be a predictor of large-scale protests and social uprisings in Iran
Happiness within the US was found to correlate strongly with wealth, showing large positive correlation with increasing household income and strong negative correlation with increasing poverty. This is consistent with the first part of the ‘Easterlin paradox’
We also observed that happiness anticorrelates significantly with obesity. A similar link between obesity and happiness has previously been reported
There are a number of legitimate concerns to be raised about how well the Twitter data set can be said to represent the happiness of the greater population. Roughly 15% of online adults regularly use Twitter, and 18–29 year-olds and minorities tend to be more highly represented on Twitter than in the general population
In this work we have only scratched the surface of what is possible using this particular dataset. In particular, we have not examined whether these methods have any predictive power. Future research could look at how observed changes in the Twitter data set, as measured using the hedonometer algorithm, predict changes in the underlying social and economic characteristics measured using traditional census methods. We plan to revisit this study when census data for 2012 becomes available, to investigate how changes in demographics across urban areas are reflected in happiness as measured by word use.