The Lexicocalorimeter: Gauging public health through caloric input and output on social media

We propose and develop a Lexicocalorimeter: an online, interactive instrument for measuring the “caloric content” of social media and other large-scale texts. We do so by constructing extensive yet improvable tables of food and activity related phrases, and respectively assigning them with sourced estimates of caloric intake and expenditure. We show that for Twitter, our naive measures of “caloric input”, “caloric output”, and the ratio of these measures are all strong correlates with health and well-being measures for the contiguous United States. Our caloric balance measure in many cases outperforms both its constituent quantities; is tunable to specific health and well-being measures such as diabetes rates; has the capability of providing a real-time signal reflecting a population’s health; and has the potential to be used alongside traditional survey data in the development of public policy and collective self-awareness. Because our Lexicocalorimeter is a linear superposition of principled phrase scores, we also show we can move beyond correlations to explore what people talk about in collective detail, and assist in the understanding and explanation of how population-scale conditions vary, a capacity unavailable to black-box type methods.


I. INTRODUCTION
Online instruments designed to measure social, psychological, and physical well-being at a population level are becoming essential for public policy purposes and public health monitoring [1,2]. These data-centric gauges both empower the general public with information to allow comparisons of communities at all scales, and naturally complement the broad, established set of more readily measurable socioeconomic indicators such as wage growth, crime rates, and housing prices.
Overall well-being, or quality of life, depends on many factors and is complex to measure [3]. Existing techniques for estimating population well-being range from traditional surveys [1,4] to estimates of smile-to-frown ratios captured automatically on camera in public spaces [5], and vary widely in the types of data they amass, collection methods, cost, time scales involved, and degree of intrusion. Partly in response to policy makers' desire for simple "one number" quantification of complex systems (arguably a general human proclivity), many measures are composite in nature. Two examples are (1) While such measures will always have their place, we venture that we must resist oversimplification. The dashboard of society should be just that: a rich set of incompatible instruments whose informational content may be observed individually and in total, not unlike the required input needed for flying a plane, where knowledge of just a single number representing "things are going well" would be untenable. The construction of data-centric instruments for social systems that deliver more direct, interpretable measures is therefore of great importance as we move forward into the age of ubiquitous (but not complete) measurement.
With the explosive growth of online activity and social media around the world, the massive amount of realtime data created directly by populations of interest has become an increasingly attractive and fruitful source for analysis. Despite the limitation that social media users in the United States are not a random sample of the US population [7], there is a wealth of information in these data sets and uneven sampling can often be accommodated.
arXiv:1507.05098v4 [physics.soc-ph] 10 Jan 2017
Indeed, online activity is now considered by many to be a promising data source for detecting health conditions [8,9] and gathering public-health information [10,11], and within the last decade, researchers have constructed a range of online public-health instruments with varying degrees of success. The maturing of these and related instruments along with theoretical models will ultimately fundamentally inform the limits of characterization and predictability of social systems.
In the next two subsections, we cover related research and then describe our approach to measuring the "caloric content" of text.

A. Previous work
For a general overview of work relevant to our present effort, we briefly summarize related research concerning public health and well-being in connection with a range of social media and online data sets.
In the difficult realm of predicting pandemics [12], Google Flu Trends [13] enjoyed early success and acclaim. Initially based very simply on search terms, the instrument proved unsurprisingly to be imperfect and in need of a more sophisticated approach [14].
In work by several of the current authors and colleagues, Mitchell et al. measured the happiness of tweets across the US and found strong correlations with other indices of well-being at city and state level, such as the Gallup Well-being Index; the Peace Index; the America's Health Ranking composite index of Behavior, Community and Environment, Policy and Clinical Care metrics; and gun violence (negative correlation) [15]. Using the same instrument in 10 languages, the Hedonometer, we have also shown that the emotional content of tweets tracks major world events [2,16].
Paul and Dredze found that states with higher obesity rates have more tweets about obesity, and states with higher smoking rates have more tweets about cancer [11]. They also found a negative correlation between exercise and frequency of tweeting about ailments, suggesting "Twitter users are less likely to become sick in states where people exercise." They further found health care coverage rates to be negatively correlated with likelihood of posting tweets about diseases.
Chunara et al. recently found that activity-related interests on Facebook are negatively correlated with being overweight and obese, while interest in television is positively correlated with the same [17].
In an analysis of online recipe queries, West et al. found that the number of patients admitted to the emergency room of a major urban hospital in Washington, DC for congestive heart failure (CHF) each month was significantly correlated with average sodium per recipe searched for on the Web in the same month [18].
Eichstaedt and colleagues [19] have demonstrated that psychological language on Twitter outperforms certain composite socioeconomic indices in predicting heart disease at the county level. They were able to show in particular that the expression of negative emotions such as anger on Twitter could be taken as a kind of risk factor at the population scale.
On a US county level, Culotta [20] found that Twitter activity provided a more "fine-grained representation" of community health than demographics alone, through the prevalence of particular words that indicate, for example, television habits or negative engagement.
Finally, in work directly related to our present study, Abbar et al. [21] have recently performed a similar analysis of translating food terms used on Twitter into calories. They found a correlation between Twitter calories and obesity and diabetes rates for the US, and explored how food-themed interactions over social networks vary with connectedness, finding suggestions of social contagion. While our approach and results are largely sympathetic, our work incorporates estimates of physical activity which we will show provides essential extra information regarding health; introduces a phrase extraction method we call serial partitioning; and leads to an online implementation, paving the way for a real-time instrument as part of our proposed 'panometer.' We also note that we carried out our work concurrently and independently.

B. Lexicocalometrics
From the preceding list of studies, it has become clear that we can estimate population-scale levels of health and well-being through social media. Here, we examine the words and phrases people post publicly about food and physical activity on Twitter on a statewide level for the contiguous United States (48 states along with the District of Columbia). As we explain fully below in Sec. II A and Methods and Materials, Sec. IV, we group categorically similar words and phrases into lemmas, and we then assign caloric values to these lemmas using the terms and notation "caloric input" for food, $C_{\text{in}}$, and "caloric output" for activity, $C_{\text{out}}$. We define the ratio of caloric output to caloric input to be a third quantity, "caloric ratio":
$$C_{\text{rat}} = \frac{C_{\text{out}}}{C_{\text{in}}}. \tag{1}$$
While we will focus largely on the three quantities $C_{\text{in}}$, $C_{\text{out}}$, and $C_{\text{rat}}$, we will also explore "caloric difference", an alternate combination of $C_{\text{in}}$ and $C_{\text{out}}$ involving a single parameter:
$$C_{\text{diff}}(\alpha) = \alpha C_{\text{out}} - (1 - \alpha) C_{\text{in}}, \tag{2}$$
where $0 \le \alpha \le 1$. We use "phrase shifts" [2] to show how specific lemmas (e.g., "apples", "cake with frosting", "white water rafting", "knitting", and "watching tv or movie") contribute to the caloric texture of states across the contiguous US. We then correlate all three values with 37 measures relating to health and well-being, and we find statistically strong correlations with quantities such as high blood pressure, inactivity, diabetes levels, and obesity rates. For ease of language, we will generally speak of phrases rather than lemmas. We have also generated an accompanying online, interactive instrument for exploring health patterns through the lens of "Twitter calories": the Lexicocalorimeter. An initial, fixed version of the instrument may be accessed at this paper's Online Appendices, http://compstorylab.org/share/papers/alajajian2015a/, with an evolvable, production version housed within our larger measurement platform http://panometer.org at http://panometer.org/instruments/lexicocalorimeter (all code for these sites can be found at https://github.com/andyreagan/lexicocalorimeter-appendix). We note that while our online instrument is based on Twitter, it may in principle be used on any sufficiently large text source, social media or otherwise, such as Facebook.
From this point, we structure the core of our paper as follows. In Sec. II, we establish and discuss our findings in depth. Specifically, we: (1) Outline our text analysis of a Twitter corpus from 2011-2012 (Sec. II A), reserving full details for Methods and Materials in Sec. IV; (2) Present caloric maps of the contiguous US contrasting the 48 states and DC through histograms and phrase shifts (Sec. II B); and (3) Examine how $C_{\text{in}}$, $C_{\text{out}}$, $C_{\text{rat}}$, and $C_{\text{diff}}(\alpha)$ correlate with a suite of measures relating to health and well-being (Sec. II E). In the Supporting Information, we provide a sample of confirmatory figures as well as all shareable data sets (e.g., IDs for all tweets). We offer concluding thoughts in Sec. III.

A. Estimating calories from phrases
We used all available geotagged tweets from 2011 and 2012 (around 50 million) from a bounding box of the contiguous US, using Twitter's garden hose sample (which is a sample of approximately 10% of all tweets, including those that are not geotagged) and the geotag feature to determine from which of the 48 contiguous states and the District of Columbia each tweet came. From this sample, we counted the total number of times each food and physical activity phrase in our database was tweeted about in each of the 48 contiguous states and the District of Columbia (see Sec. IV and Dataset S1 at https://dx.doi.org/10.6084/m9.figshare.4530965.v1 for all tweet IDs). We then used these counts to determine the average caloric input $C_{\text{in}}$ from food phrase tweets and the average caloric output $C_{\text{out}}$ from physical activity phrase tweets as follows.
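First, we equate each food phrase $s$ with the calories per 100 grams of that food, using the notation $C_{\text{in}}(s)$. (We also explored serving sizes but the databases available proved far from complete.) We then compute the caloric input for a given text $T$ as:
$$C_{\text{in}}(T) = \frac{\sum_{s \in S_{\text{in}}} C_{\text{in}}(s)\, f(s|T)}{\sum_{s \in S_{\text{in}}} f(s|T)} = \sum_{s \in S_{\text{in}}} C_{\text{in}}(s)\, p(s|T), \tag{3}$$
where $f(s|T)$ is the frequency of phrase $s$ in text $T$, $p(s|T)$ is the normalized version, and $S_{\text{in}}$ is the set of all food phrases in our database.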
Second, for each tweeted physical activity phrase, we use an estimate of the Metabolic Equivalent of Tasks, or METs, which we then convert to calories expended per hour, assuming a weight of 80.7 kilograms, the average weight of a North American adult [22]. Analogous to $C_{\text{in}}(T)$ above, we then have
$$C_{\text{out}}(T) = \frac{\sum_{s \in S_{\text{out}}} C_{\text{out}}(s)\, f(s|T)}{\sum_{s \in S_{\text{out}}} f(s|T)} = \sum_{s \in S_{\text{out}}} C_{\text{out}}(s)\, p(s|T), \tag{4}$$
where now $S_{\text{out}}$ is the set of all phrases in our activity database. We emphasize that both our food and exercise phrase data sets and Twitter databases are necessarily incomplete in nature. The values of $C_{\text{in}}$ and $C_{\text{out}}$ are thus not meaningful as absolute numbers but rather have power for comparisons. We also acknowledge that our equivalences are crude (e.g., each mention of a specific food is naively turned into the calories associated with 100 grams of that food) and later on we address our choices in more depth. Nevertheless, our method is pragmatic yet, as we will show, effective, and offers clear directions for future improvement.
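As a minimal sketch of the two averages just described, the following Python computes a frequency-weighted mean of phrase calories for food and for activity. The phrase tables, calorie values, MET values, and counts are hypothetical placeholders rather than entries from our databases, the function names are ours for illustration only, and the MET conversion assumes the common convention that 1 MET corresponds to roughly 1 kcal per kilogram per hour.

```python
from collections import Counter

# Hypothetical illustrative values, NOT taken from the paper's databases.
FOOD_KCAL_PER_100G = {"pizza": 270.0, "apples": 52.0, "cookies": 480.0}
ACTIVITY_METS = {"running": 8.0, "watching tv or movie": 1.0}
BODY_MASS_KG = 80.7  # average adult weight assumed in the text


def mets_to_kcal_per_hour(mets, mass_kg=BODY_MASS_KG):
    """Convert METs to kcal/hour, assuming 1 MET ~ 1 kcal per kg per hour."""
    return mets * mass_kg


def caloric_average(counts, calories):
    """Frequency-weighted mean of phrase calories for one text.

    Only phrases present in `calories` contribute, so the normalized
    frequencies sum to one over the phrase database.
    """
    total = sum(n for s, n in counts.items() if s in calories)
    if total == 0:
        raise ValueError("text contains no phrases from the database")
    weighted = sum(calories[s] * n for s, n in counts.items() if s in calories)
    return weighted / total


# Hypothetical phrase counts for one state's tweets.
food_counts = Counter({"pizza": 6, "apples": 3, "cookies": 1})
activity_counts = Counter({"running": 1, "watching tv or movie": 3})

activity_kcal = {s: mets_to_kcal_per_hour(m) for s, m in ACTIVITY_METS.items()}
c_in = caloric_average(food_counts, FOOD_KCAL_PER_100G)
c_out = caloric_average(activity_counts, activity_kcal)
print(f"C_in = {c_in:.1f} kcal per 100 g, C_out = {c_out:.1f} kcal per hour")
```

Because each measure is a weighted mean, it depends only on the relative frequencies of phrases, not on how many tweets a state produces, which is what makes comparisons between states of very different sizes possible.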
For simplicity and ultimately because the results are sufficiently strong, we did not filter tweets beyond their geographic location. Tweets may thus come from individuals, restaurants, sports stores, resorts, news outlets, marketers, fitness apps, tourists, and so on, and further improvements and refinements may be achieved by appropriately constraining the Twitter corpus.
Finally, we take the ratio of $C_{\text{out}}(T)$ to $C_{\text{in}}(T)$ to obtain the text's caloric ratio $C_{\text{rat}}(T)$. In general, we observe that a higher value of $C_{\text{rat}}(T)$ at the population scale would appear to be intuitively better, up to some limit indicating negative energy balance. We note that $C_{\text{rat}} = 1$ is not salient and should not be taken to mean a population is 'balanced calorically'. As we discuss later, using the difference, what we call Caloric Difference, a generalization of $C_{\text{out}} - C_{\text{in}}$, generates similar results but, from a framing perspective, we have reservations in creating a scale with a 0 point given the approximate nature of our measures.

B. Caloric maps of the contiguous US
We now move to our central analysis and exploration of how our lexicocalorimetric measure varies geographically. We start with visual representations and then continue on to more detailed comparisons.
FIG. 1. Choropleth maps of $C_{\text{in}}$ and $C_{\text{out}}$ (recovered figure label: "Deviation from national activity-caloric avg."). [...] and S3 show the specific rankings according to these two variables and also $C_{\text{rat}}$ (see Fig. 3). The overlaid phrase lemmas are the most dominant contributors to $C_{\text{in}}$ and $C_{\text{out}}$, almost universally "pizza" and "watching tv or movie".
In Fig. 1, we show two choropleth maps of our overall 2011-2012 measures of Twitter's caloric input $C_{\text{in}}$ and caloric output $C_{\text{out}}$. For both maps and those that follow, quantities increase as colors move from light to dark green.
These maps immediately allow for some basic observations which we will delve into and harden up as our analysis proceeds. For the food calories map, we see $C_{\text{in}}$ is generally largest in the Midwest and the south while Colorado and Maine stand out as states with the lowest calories. We see a different texture in the activity calories map with the highest caloric output according to our measure appearing in the three-state block of Wyoming, Colorado, and Utah, as well as Vermont. Tweet-based caloric output drops to a low in Mississippi and the surrounding states, while Michigan also appears to have a low value of $C_{\text{out}}$. For the food and activities maps in Fig. 1, we also show the most dominant phrase for each population's $C_{\text{in}}$ and $C_{\text{out}}$ scores. Almost uniformly, "pizza" (high calorie food) and "watching tv or movie" (low calorie activity) are the lemmas with the largest contributions, a function of both volume and caloric scores. Only Mississippi ("ice cream") and Wyoming ("cookies") are exceptions, though "pizza" is still near the top for both.
FIG. 2. The same choropleth maps for $C_{\text{in}}$ and $C_{\text{out}}$ presented in Fig. 1 but now with phrases whose increased usage contributes the most to a population's $C_{\text{in}}$ and $C_{\text{out}}$ differing from the overall averages of these measures. See Sec. II D. For example, tweets from Vermont, which was above average for both $C_{\text{in}}$ and $C_{\text{out}}$ for 2011-2012, disproportionately contain "bacon" and "skiing". Michigan was above average for $C_{\text{in}}$ and below for $C_{\text{out}}$ in 2011-2012, and the most distinguishing phrases are "chocolate candy" and "laying down". See Figs. 5, S2, and S3 for ordered rankings. (Recovered figure label: "Deviation from national food-caloric avg.")
In Fig. 2, we present the same choropleth maps from Fig. 1, but now with the phrase most distinguishing a population. Specifically, we show phrases whose increased prevalence most contributes to moving a population's Twitter calorie scores away from the overall average for the contiguous US. For example, if a population's $C_{\text{in}}$ is above average, we find the food phrase whose frequency coupled with its caloric content most strongly moves the population's $C_{\text{in}}$ up from the average. (We explain in full how we determine these phrases later with phrase shifts in Sec. II D.) We now see a diverse spread of terms. We find a number of phrases make for reasonable representations:
• "lobster" in Maine and Massachusetts;
• "grits" in Georgia;
• "skiing" in Vermont, New Hampshire, and Utah;
• and "running" in Colorado and a number of other locations.
Prototypical unhealthy foods rise to the top in various states:
• "donuts" in Texas;
• "cake" in Mississippi;
• "chocolate candy" in Louisiana;
• and "cookies" in Indiana.
By contrast, a few "virtuous" foodstuffs appear such as "green beans" in Oregon and "tomato" in California.
Our activity list also includes some rather low intensity ones and we see:
• "eating" rising to the top in Texas, the south, and a number of other states;
• "watching tv or movie" in Pennsylvania and elsewhere;
• "sitting" in Tennessee;
• "talking on the phone" in Delaware;
• "getting my nails done" in New Jersey;
• and simply "lying down" in Michigan.
Now, we do not pretend that these phrases all come from individuals diligently recording their present meals or activities. Apart from tweets from individuals, our database contains tweets from companies, advertisers, resorts, and so on. And some phrases are problematic in their generality of meaning, most especially "running" (the word "run" currently has the most meanings in the Oxford English Dictionary). Nevertheless, as we dig deeper into all the phrases found for a particular state, we will continue to find commonsensical lexical patterns.
In Fig. 3, we show a choropleth map for caloric ratio, $C_{\text{rat}}$. We see that the highest values of $C_{\text{rat}}$ are found in Colorado, Wyoming, and Vermont, and secondarily for Maine, Minnesota, Oregon, and Utah. Low values of $C_{\text{rat}}$ appear in the region comprising Mississippi, Louisiana, Alabama, and Arkansas, as well as West Virginia.
An initial visual comparison of Figs. 1 and 3 suggests that $C_{\text{out}}$ is better aligned with $C_{\text{rat}}$ than $C_{\text{in}}$ is. The reason is that for the present version of the Lexicocalorimeter, $C_{\text{out}}$ has a larger dynamic range than $C_{\text{in}}$, roughly 160 to 210 versus 250 to 285, giving ratios of $210/160 \simeq 1.31$ and $285/250 \simeq 1.14$. We could assert that $C_{\text{in}}$ is fundamentally less informative but:
1. In Sec. II E, we will find that some measures relating to health and well-being correlate more strongly with $C_{\text{in}}$ and some with $C_{\text{out}}$;
2. We may adjust the dynamic range of either measure by rescaling, introducing a kind of tunability [2] to the instrument (a feature we will reserve for future iterations); and
3. Because our food phrase database is a factor of 10 smaller than our activity phrase one, revisions of our instrument may elevate the power of $C_{\text{in}}$.
To provide some support for point 1, we compare $C_{\text{out}}$ and $C_{\text{in}}$ in Fig. 4 (see also Fig. S1). Importantly, we see that the two measures are indeed not well correlated, indicating they contain different kinds of information (Pearson correlation coefficient $\rho_p \simeq 0.13$, p-value = 0.39). This demonstrates why we might expect $C_{\text{in}}$ or $C_{\text{out}}$ to separately correlate more strongly with other population-level measures, and justifies forming a dashboard using both $C_{\text{in}}$ and $C_{\text{out}}$ as well as the composite measure of $C_{\text{rat}}$.
Regarding point 2 above, we have evidently made a number of choices in computing $C_{\text{in}}$ and $C_{\text{out}}$ that mean we have already introduced an arbitrary tuning of the ratio $C_{\text{rat}}$ (e.g., assuming 100 grams of a food and an hour's worth of activity). Having no principled way of rescaling (i.e., one that is not a function of the data set being studied), we have chosen to leave the measures as computed. As we discuss later, in future iterations we envisage for the Caloric Difference version that introducing tunability of the dynamic ranges of $C_{\text{in}}$ and $C_{\text{out}}$ (altering the bias of the measure toward food or activity) will allow the Lexicocalorimeter to be refined for a range of purposes such as estimating correlates of diabetes levels versus cancer rates (see Sec. II E).

C. Rankings for the contiguous US
Having taken in the maps of our three measures $C_{\text{in}}$, $C_{\text{out}}$, and $C_{\text{rat}}$, we now explore the rankings quantitatively, first through the histograms shown in Fig. 5. We order the 48 states and DC by $C_{\text{rat}}$ (rightmost plot) and all bars are relative to the overall average of the specific measure. Numeric rankings for each measure are given next to each bar. In Figs. S2 and S3, we present the same histograms re-sorted respectively by $C_{\text{in}}$ and $C_{\text{out}}$.
As was indicated by our inspection of the choropleth maps, we do indeed see that $C_{\text{rat}}$ is more strongly driven by $C_{\text{out}}$ than $C_{\text{in}}$ due to the former's larger dynamic range. The states with the highest values of $C_{\text{rat}}$ achieve their scores through high levels of $C_{\text{out}}$ but more variable levels of $C_{\text{in}}$. Wyoming (23), Vermont (21), and Utah (25) are all middling in $C_{\text{in}}$ while Colorado (48) and Maine (49) have the lowest ranks for caloric intake. At the trailing end, we see by contrast that low activity ranks are coupled with high ranks for caloric intake.
A few of the more anomalous states are evident both in the $C_{\text{in}}$ and $C_{\text{out}}$ histograms and as the points lying farthest from the line of best fit in the scatter plot of Fig. 4. South Dakota has high values of both $C_{\text{in}}$ and $C_{\text{out}}$ (ranks of 1 and 7), which combine to give it a ranking of 25 for $C_{\text{rat}}$. Maryland, ranking 42nd and 45th in $C_{\text{in}}$ and $C_{\text{out}}$, is the only state in the 'bottom' 10 of both measures.

D. Phrase shifts
In our work on measuring happiness, we have developed and extensively used "word shifts" to show which words make a given text appear more positive than another text in aggregate (see [2] and [16]). Such visualizations not only provide our necessary test, but also allow us to draw insight from the lexical tapestry of texts. Here, we will explain and use analogously constructed phrase shifts for both $C_{\text{in}}$ and $C_{\text{out}}$ to examine the states at the extremes of our $C_{\text{rat}}$ rankings, Colorado and Mississippi. Interactive food and activity phrase shifts for the 49 regions of the contiguous US form a central part of our online Lexicocalorimeter: http://panometer.org/instruments/lexicocalorimeter.
We start with two texts: a base "reference text" T ref , and a "comparison text" T comp which we wish to compare to T ref . In this paper, we will use the Contiguous US as the reference text (weighting the phrase distributions of each state equally), but in principle any text can be used (e.g., in comparing two states, one would be selected as a reference). Our interest is in determining which words or phrases most contribute to or go against the difference in estimated calories.
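We write $C_{i/o}$ for either measure, where the subscript i/o stands for in or out. Following [2] and using Eq. (3), we can express the difference as
$$C_{i/o}(T_{\text{comp}}) - C_{i/o}(T_{\text{ref}}) = \sum_{s \in S_{i/o}} \left[ C_{i/o}(s) - C_{i/o}(T_{\text{ref}}) \right] \left[ p(s|T_{\text{comp}}) - p(s|T_{\text{ref}}) \right],$$
which holds because the normalized frequencies of each text sum to one. We now have a sum of contributions due to all phrases. We normalize these contributions as percentages and annotate their structure as follows:
$$\sum_{s \in S_{i/o}} \delta C_{i/o}(s) = \pm 100,$$
where
$$\delta C_{i/o}(s) = \frac{100}{\left| C_{i/o}(T_{\text{comp}}) - C_{i/o}(T_{\text{ref}}) \right|} \underbrace{\left[ C_{i/o}(s) - C_{i/o}(T_{\text{ref}}) \right]}_{+/-} \underbrace{\left[ p(s|T_{\text{comp}}) - p(s|T_{\text{ref}}) \right]}_{\uparrow/\downarrow}.$$
We use the symbols +/− and ↑ / ↓ to respectively encode whether the calories of a phrase exceed the average of the reference text, and whether a phrase is being used more or less in the comparison text. We call $\delta C_{i/o}(s)$ the "per food/activity phrase caloric expenditure shift". Finally, we sort phrases by the absolute value of $\delta C_{i/o}(s)$ to create each phrase shift.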
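The construction of a phrase shift can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code behind our online instrument: each phrase's contribution is taken to be the product of its calorie offset from the reference average and its change in normalized frequency, rescaled so that all contributions sum to plus or minus 100. The function names, calorie values, and counts are hypothetical.

```python
def normalize(counts, phrases):
    """Normalized frequencies over a fixed phrase list."""
    total = sum(counts.get(s, 0) for s in phrases)
    if total == 0:
        raise ValueError("text contains no phrases from the database")
    return {s: counts.get(s, 0) / total for s in phrases}


def phrase_shift(ref_counts, comp_counts, calories):
    """Return the calorie difference and sorted per-phrase contributions.

    Each entry is (phrase, percent contribution, label), where the label
    combines +/- (calories above or below the reference average) with
    an up or down arrow (phrase used more or less in the comparison text).
    """
    phrases = list(calories)
    p_ref = normalize(ref_counts, phrases)
    p_comp = normalize(comp_counts, phrases)
    c_ref = sum(calories[s] * p_ref[s] for s in phrases)
    c_comp = sum(calories[s] * p_comp[s] for s in phrases)
    diff = c_comp - c_ref
    if diff == 0:
        raise ValueError("texts have identical scores; nothing to explain")
    shifts = []
    for s in phrases:
        rel_cal = calories[s] - c_ref       # the +/- factor
        rel_freq = p_comp[s] - p_ref[s]     # the up/down factor
        percent = 100.0 * rel_cal * rel_freq / abs(diff)
        label = ("+" if rel_cal > 0 else "-") + ("↑" if rel_freq > 0 else "↓")
        shifts.append((s, percent, label))
    shifts.sort(key=lambda item: abs(item[1]), reverse=True)
    return diff, shifts


# Hypothetical calories (kcal per 100 g) and phrase counts.
calories = {"cake": 400.0, "pasta": 150.0, "shrimp": 100.0}
reference = {"cake": 2, "pasta": 5, "shrimp": 3}
comparison = {"cake": 4, "pasta": 3, "shrimp": 3}

diff, shifts = phrase_shift(reference, comparison, calories)
for phrase, percent, label in shifts:
    print(f"{label} {phrase}: {percent:+.1f}%")
```

In this toy example the comparison text scores higher mainly because a high calorie phrase ("cake") is used more often, with a smaller supporting contribution from a low calorie phrase ("pasta") being used less often.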
These shifts display phrases that fall into four categories:
+↑, yellow: Phrases representing above average quantities (here calories) being used more often. Examples: "cookies" for Mississippi in Fig. 6B and "rock climbing" for Colorado in Fig. 6C.
−↓, pale blue: Phrases representing below average quantities being used less often. Examples: "watching tv or movie" for Mississippi in Fig. 6D and "laying down" for Colorado in Fig. 6C.
+↓, pale yellow: Phrases representing above average quantities being used less often. Examples: "chocolate candy" for Colorado in Fig. 6A and "running" for Mississippi in Fig. 6D.
−↑, blue: Phrases representing below average quantities being used more often. Examples: "reading" for Colorado in Fig. 6C and "catfish" for Mississippi in Fig. 6B.
Note that depending on the quantity, higher or lower may be "better" and the four categories flip signs in their support. For example, $C_{\text{in}}$ and $C_{\text{out}}$ increase with +↑ phrases; after we examine correlations with health and well-being measures in Sec. II E, we will be able to interpret this as "bad" for $C_{\text{in}}$ and "good" for $C_{\text{out}}$. At the top of each phrase shift, the bars indicate the total contribution of each of the four types of phrases, and the black bar the net change. We see that the four net changes arise in different ways.
• Fig. 6A: Colorado is lower than average for $C_{\text{in}}$ largely due to tweeting more about relatively low calorie (per 100 grams) foods: "noodles", "egg", "pasta", and "turkey". We also find fewer tweets about high calorie foods such as "candy", "cake", and "cookies." Going against these phrases, we see Colorado does tweet relatively more about "bacon" and "olive oil", and less about some relatively lower calorie foods: "chicken", "ice cream", "shrimp", and "corn". We note that this does not mean these foods are low calorie in absolute terms ("ice cream" is a good example), just that 100 grams of them are low calorie in comparison to the US baseline.
• Fig. 6B: Mississippi almost equally tweets less about a variety of low calorie foods, e.g., "pasta", "banana", and "crab" (pale blue bar) while also tweeting more about the complementary range of such foods including "shrimp", "peaches", and "pineapple" (dark blue bar). The modest net gain is mostly due to a small increase in tweeting about high calorie foods such as "cake", "cookies", and "sausage".
• Fig. 6C: For physical activity, tweets from Colorado show a preponderance of relatively high caloric expenditure phrases (+↑, yellow) including "running", "skiing", "hiking", "snowboarding" and so on. Tweeting less about low effort activities is the only other contribution of any substance: Colorado tweets less about "eating", "laying down", and "watching tv or movie".
• Fig. 6D: Mississippi's low ranking in activity is largely due to tweeting less about high output activities (+↓, pale yellow): less "running", "dancing", "walking", and "biking". The second most important category is an increase in low output activity phrases such as "eating", "attending church", and "talking on the phone."
In Figs. S4, S5, S6, and S7, we complement the four phrase shifts of Fig. 6 by showing the top 23 phrases for each of the four ways phrases may contribute. Interactive phrase shifts for all of the contiguous US are housed at http://panometer.org/instruments/lexicocalorimeter.
Overall, we find the lexical texture afforded by our phrase shifts is generally convincing, but we expect future improvements in our food and activity data sets will iron out some oddities (we again use the example of ice cream). We also note that phrase shifts are very sensitive, that terms that appear to be evaluated incorrectly may easily be removed from the phrase set, and that doing so will minimally change the overall score for sufficiently large texts.

E. Correlations with other health and well-being measures
We now turn to a suite of statistical comparisons between our three measures (caloric input, caloric output, and caloric ratio) and a collection of demographic, behavioral, health, and psychological quantities.
We use Spearman's correlation coefficient $\rho_s$ to examine relationships between $C_{\text{in}}$, $C_{\text{out}}$, and $C_{\text{rat}}$ and 37 variables variously relating to food and physical activity, "Big Five" personality traits, and health and well-being rankings (a total of 111 comparisons) [4, 6, 24-33]. To correct for multiple comparisons, we calculate the q-value for each correlation coefficient using the Benjamini-Hochberg step-up procedure [34] (the q-value is to be interpreted in the same way as a p-value). We then consider correlations in reference to the standard significance levels of 0.01 and 0.05.
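For readers who wish to reproduce this kind of analysis, the two statistical ingredients can be sketched in pure Python as below. This is a minimal illustration, not the software used for our results: the function names are hypothetical, the p-value for each correlation is assumed to come from a separate significance test that is not implemented here, and the q-values are computed as the usual Benjamini-Hochberg adjusted p-values.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Step up from the largest p-value, enforcing monotonic q-values.
    for position in range(m - 1, -1, -1):
        i = order[position]
        running_min = min(running_min, p_values[i] * m / (position + 1))
        q_values[i] = running_min
    return q_values


# Hypothetical inputs for illustration only.
rho = spearman([1, 2, 3, 4, 5], [5, 6, 7, 8, 7])
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(round(rho, 3), [round(v, 3) for v in q])
```

With 111 comparisons, the same adjustment would be applied once to the full list of 111 p-values rather than separately for each measure.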
We must first acknowledge that many of the variables we test against our measures are highly correlated with each other. The food and physical activity-related variables are in the areas of physical activity levels, produce intake and availability rates (including trends in public schools), chronic disease rates, and rates of unhealthy habits. Many of these variables are well known to be influenced by diet and physical activity (e.g., obesity rates [25]), and others may be less directly related (e.g., percent of cropland in each state harvested for fruits and vegetables [28]).
To give some grounding for the full set of comparisons, we show in Fig. 7 how six demographic quantities vary with caloric ratio $C_{\text{rat}}$. We see strong correlations with $|\rho_s| \ge 0.68$, and the highest Benjamini-Hochberg q-value is $5.8 \times 10^{-7}$.
We present a summary of all results in Tab. I where we have ordered and numbered demographic quantities in terms of ascending Benjamini-Hochberg q-values for $C_{\text{rat}}$. For comparison and to further demonstrate the robustness of our approach, in Tabs. S1, S2, and S3, we reproduce the same analysis with the inclusion of liquids and for a differential measure $C_{\text{diff}}(\alpha) = \alpha C_{\text{out}} - (1 - \alpha) C_{\text{in}}$, both with and without liquids. Here, we choose to set the effective means of $C_{\text{out}}$ and $C_{\text{in}}$ equal across the statewide averages (i.e., $\alpha \langle C_{\text{out}} \rangle = (1 - \alpha) \langle C_{\text{in}} \rangle$), resulting in $\alpha = 0.598$. Overall, we find little variation in our results whether we use $C_{\text{rat}}$ or $C_{\text{diff}}(0.598)$.
Caloric input $C_{\text{in}}$ results were more mixed. Chronic disease-related rates were also significantly correlated with $C_{\text{in}}$, with the exception of adult diabetes, childhood overweight and obesity, and high cholesterol, after correcting for multiple comparisons. The variables relating to unhealthy habits (smoking (#16) and binge drinking rates (#26)) both correlated significantly with all three of our measures with the one exception of binge drinking and caloric input. The directions of the correlations for these two habits are opposite each other (e.g., negative for smoking and $C_{\text{rat}}$, positive for binge drinking and $C_{\text{rat}}$), consistent with recent work on alcohol consumption [35].
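The choice of α can be made concrete with a short sketch. Requiring the two weighted means to be equal gives α = ⟨C_in⟩ / (⟨C_in⟩ + ⟨C_out⟩). The statewide values below are hypothetical placeholders chosen only to fall within plausible ranges, and the function names are ours for illustration.

```python
def balance_alpha(mean_c_in, mean_c_out):
    """alpha solving alpha * <C_out> = (1 - alpha) * <C_in>."""
    return mean_c_in / (mean_c_in + mean_c_out)


def caloric_difference(c_in, c_out, alpha):
    """C_diff(alpha) = alpha * C_out - (1 - alpha) * C_in."""
    return alpha * c_out - (1 - alpha) * c_in


# Hypothetical statewide (C_in, C_out) pairs, not values from the paper.
states = {
    "A": (255.0, 205.0),
    "B": (270.0, 180.0),
    "C": (282.0, 165.0),
}

mean_in = sum(c_in for c_in, _ in states.values()) / len(states)
mean_out = sum(c_out for _, c_out in states.values()) / len(states)
alpha = balance_alpha(mean_in, mean_out)

c_diff = {
    name: caloric_difference(c_in, c_out, alpha)
    for name, (c_in, c_out) in states.items()
}
print(f"alpha = {alpha:.3f}")
for name, value in c_diff.items():
    print(name, f"{value:+.2f}")
```

With α fixed this way, the caloric differences average to zero across regions, so positive and negative values indicate regions above and below the overall balance of the two measures.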
The two variables relating to physical activity rates (percent of population that has had no physical activity in past 30 days (#1), and percent of population that has been physically active in past 30 days (#2)) correlated significantly with all three of our measures. The two measures relating to rates of physical and mental health (average number of poor mental health days in past 30 days (#24), and average number of poor physical health days in past 30 days (#27)) correlated significantly with both $C_{\text{out}}$ and $C_{\text{rat}}$, but did not correlate significantly with $C_{\text{in}}$.
The four variables relating to fruit and vegetable consumption rates all correlated significantly with all three of our measures. The variables relating to presence of produce in the state (percent of cropland in each state harvested for fruits and vegetables (#33), percent of census tracts with a healthy food retailer within one-half mile (#35), and percent of schools offering fruits and vegetables at celebrations (#31)) were significantly correlated with $C_{\text{in}}$ but were not correlated with $C_{\text{out}}$ or $C_{\text{rat}}$. Variables relating to local food (number of farmers markets per 100,000 people (#28) and Strolling of the Heifers locavore score (#29)) were not significantly correlated with $C_{\text{in}}$, but were significantly correlated with $C_{\text{out}}$.
TAB. I (fragment; only partial rows survive extraction): [...] [31] 0.23, $1.31 \times 10^{-1}$; -0.5, $6.11 \times 10^{-4}$; 0.04, $8.10 \times 10^{-1}$. 33. % cropland harvested for fruits/veg [28] 0.19 [...]
Our health and well-being ranking variables included the CNBC quality of life ranking (#5), Gallup Wellbeing ranking (#9), America's Health Ranking overall state rank (#10), life expectancy ranking (#11), Brain Health ranking (#20), Gini index score (#23), and George Mason's overall freedom ranking (#36). Caloric ratio correlated with all of these variables except for George Mason's freedom ranking (which did not correlate with any of our three measures). $C_{\text{out}}$ correlated significantly with all of these measures except for the Brain Health ranking and the freedom ranking. Caloric input $C_{\text{in}}$ did not correlate significantly with the CNBC quality of life ranking, Gini index score, or freedom ranking.

Regarding correlations with the Big Five personality traits, Pesta et al. noted that "Neuroticism...emerged as the only consistent Big Five predictor of epidemiologic outcomes (e.g., rates of heart disease or high blood pressure) and health-related behaviors (e.g., rates of smoking or exercise)" [36]. Additionally, "neuroticism correlates with many health-related variables, including depression and anxiety disorders, mortality, coping skill, death from cardiovascular disease, and whether one smokes tobacco" [36]. Here, in keeping with these observations, we found that neuroticism (#25) was indeed the only Big Five personality trait that correlated significantly and negatively with caloric ratio.
We also tested our three measures against two measures of socioeconomic status, median income (#17) and percent of state with a bachelor's degree or higher level of education (#21), and found these correlations were significant for all three of our measures.
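The results above are reported after correcting for multiple comparisons, and q-values accompany the correlations. This section does not name the correction procedure, so the following is only a minimal sketch that assumes a Benjamini-Hochberg false discovery rate adjustment; the function name and the example p-values are hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted q-values in the input order.

    Assumption: the correction used in the study is not stated in this
    section; BH is used here purely as an illustration.
    """
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


# Hypothetical p-values for four health measures against one caloric measure.
example_q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

A correlation would then be called significant when its q-value falls below the chosen threshold.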

III. CONCLUDING REMARKS
When applied to Twitter, our Lexicocalorimeter has thus revealed a range of strong, commonsensical patterns and correlations for the contiguous US. We invite the reader to explore our online instrument, a screenshot of which is shown in Fig. 8.
Given the complex relationships between health, well-being, happiness, and various measures of socioeconomic status, it is rather difficult to say that we are only measuring health or only measuring well-being. We are also measuring socioeconomic status to some extent. However, the correlations between caloric ratio and measures of socioeconomic status are not as strong as the correlations of caloric ratio with many of the other measures. Given the above, we believe that the caloric content of tweets can be used successfully, along with other well-being and quality of life measures, to help gauge overall well-being in a population.
There are many potential forward directions. A promising avenue is to add tunability to the Lexicocalorimeter by manipulating its dynamic range. While we chose the caloric ratio $C_{\text{rat}}$ for its generality in the main body of this work, there is more flexibility in the measurement of caloric difference: $C_{\text{diff}}(\alpha) = \alpha C_{\text{out}} - (1 - \alpha) C_{\text{in}}$. Though a universal approach is unclear ($\alpha$ should be independent of the particular data set being studied), we may profit from the versatility of $C_{\text{diff}}(\alpha)$ when focusing on a single demographic. For example, if we are interested in diabetes rates, we could tune the instrument to obtain the best correlation with known levels, and thereby create a real-time estimator. To do so, we would tune $\alpha$ and find the value that gives the highest correlation between $C_{\text{diff}}(\alpha)$ and diabetes rates for a given set of populations. Of course, we could use a "black box" method to generate a better fit, but in basing our instrument on food and activity words, we have a far more principled approach that grants us the opportunity not just to mimic but to understand and explain patterns that we find. In particular, our word shifts will be of great use in showing why our hypothetical estimate of diabetes is varying across populations.
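The tuning step described above can be sketched as a one-dimensional grid search over α. This is a minimal illustration, not the instrument's implementation: the function names, the grid resolution, the use of Spearman rank correlation, and the choice to maximize the absolute correlation (since a health outcome such as diabetes may correlate negatively with caloric difference) are all assumptions.

```python
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined for constant input; treat as 0
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def c_diff(alpha: float, c_out: Sequence[float], c_in: Sequence[float]) -> List[float]:
    """C_diff(alpha) = alpha * C_out - (1 - alpha) * C_in, per population."""
    return [alpha * o - (1 - alpha) * i for o, i in zip(c_out, c_in)]


def tune_alpha(c_out, c_in, target, steps: int = 100) -> Tuple[float, float]:
    """Grid search on [0, 1] for the alpha with the largest |correlation|."""
    best_alpha, best_rho = 0.0, 0.0
    for k in range(steps + 1):
        alpha = k / steps
        rho = spearman(c_diff(alpha, c_out, c_in), target)
        if abs(rho) > abs(best_rho):
            best_alpha, best_rho = alpha, rho
    return best_alpha, best_rho


# Toy, made-up values for five populations (not data from the study).
toy_out = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_in = [5.0, 3.0, 4.0, 1.0, 2.0]
toy_diabetes = [9.0, 8.0, 7.0, 6.0, 5.0]
toy_alpha, toy_rho = tune_alpha(toy_out, toy_in, toy_diabetes)
```

Because the fit has a single interpretable parameter, the tuned α can be read directly as the relative weight placed on caloric output versus caloric input.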
We fully recognize that the Twitter population is not the same as the general population; Twitter users differ from the general population in terms of race, age, and urbanity [7]. However, we currently have no reliable way to know, for example, the true age, race, gender, and education level of individual users and as such, are not able to adjust for these factors. While we were able to vet our food and physical activity lists to some extent (as described in Methods and Materials), we could not realistically go through every tweet to be certain that the phrase was being used in the way that we thought. We realize that even if the phrases are being used as we imagine, it does not necessarily mean that the person who tweeted actually performed the physical activity or ate the tweeted-about food (West et al. address a similar issue in inferring food consumption from accessing recipes online [18]).
We also currently do not know at what point our metric breaks down at smaller time scales (e.g., months or weeks) or for smaller spatial regions (e.g., the city or county level). Our preliminary research shows that the physical activity metric on its own may be quite effective at the city level, but the food measure may not be accurate on a smaller scale. We have also found the physical activity list to be robust to random partitioning [37], whereas the food list was not. We believe that these preliminary findings may be due to several factors: (a) the size of the food list (just over 1400 phrases) is much smaller than the physical activity phrase list (just over 13,400 phrases); (b) there are generally more tweets about physical activities in our list than the foods in our food list; and (c) the amount of data within a city may not be a large enough sample for any food-based Twitter metric. We note that we have not tried using the metric on counties or Census block or tract groups, and it may be that these are more conducive to the metric.
We propose to use crowdsourcing as a way to build a more comprehensive food phrase list that includes commonly eaten foods with brand names as well as food slang that we did not capture here. Ideally, we would arrive at a food phrase database similar in scale to that of our existing physical activity phrase list. However we move forward, we believe it is clear that the Lexicocalorimeter we have designed and implemented is already of some potency and may be improved substantively in the future.

IV. METHODS AND MATERIALS
To estimate the "caloric content" of text-extracted phrases [37] relating to food (caloric input) and physical activity (caloric output), we needed comprehensive lists of foods and physical activities and their respective caloric content and expenditure information. Here, we explain in detail how we constructed these phrase lists and assigned calories to each phrase.
In dataset S1 (https://dx.doi.org/10.6084/m9.figshare.4530965.v1), we provide message IDs for all tweets that are part of our study, and we make both this dataset and other material and visualizations available at the paper's Online Appendices (http://compstorylab.org/share/papers/alajajian2015a/) and as part of our Panometer project at http://panometer.org/instruments/lexicocalorimeter. We have drawn on Twitter's Gardenhose API, which has been provided to the Computational Story Lab by Twitter.

A. Calorie estimates for phrases
We used the USDA National Nutrient Database [38] to approximate the caloric content of foods, and the Compendium of Physical Activities from Arizona State University and the National Cancer Institute [39] to approximate average Metabolic Equivalent of Tasks (METs) for physical activities, which we converted to calories expended per hour of activity [39]. Because the foods listed in the USDA National Nutrient Database are not described in the way that people talk about food, we created a list of food phrases used on Twitter by starting with a kernel of basic food terms from the USDA's MyPlate website's food group pages [40]. If the food phrase was not specific, such as "cereal", we chose the most popular version of that food in the United States via an informal Google search at the time of the study (in this instance, Cheerios). If a brand name food was not in the USDA National Nutrient Database, we chose the closest match we could find. (Please note that this means that data in the appendix may be inaccurate when searching brand name items.) This approach yielded examples of foods in the food groups of fruits, vegetables, grains, proteins, dairy, oils, solid fats, and "empty calories" (e.g., junk food), and built up a list of nearly 1400 food phrases used on Twitter. For the main results we present in this study, we did not include drinks or soups (liquids) in our list. We found there is very little change in our findings when liquids are included, as we discuss below, and we have omitted them at present both for simplicity and because we were not satisfied with a straightforward way of balancing liquid and solid nutrition estimates. Note that we have included ice creams, oils, and some other items that may act as liquids, and these could be separated out for future versions of our instrument.
For physical activity, we used the physical activities listed in the Compendium to build up a list of nearly 14,000 physical activity phrases used on Twitter. The order-of-magnitude difference between the lengths of the two lists exists because of the difference in the number of terms that went into creating each list and the rates at which people tweet about foods vs. physical activities.
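The conversion from METs to calories expended per hour mentioned above can be illustrated with the standard convention that 1 MET corresponds to roughly 1 kcal per kilogram of body mass per hour. The body mass used below (80 kg) is a hypothetical placeholder, since the reference mass is not given in this section, and the MET values in the example are illustrative rather than quoted from the Compendium.

```python
def mets_to_kcal_per_hour(mets: float, body_mass_kg: float = 80.0) -> float:
    """Convert a MET value to kcal expended per hour of activity.

    Uses the convention 1 MET ~ 1 kcal / kg / hour.
    Assumption: the 80 kg default is a placeholder, not the study's value.
    """
    return mets * body_mass_kg


# Illustrative MET values only (hypothetical, not from the Compendium).
example_mets = {"watching tv": 1.0, "running": 8.0}
example_kcal = {name: mets_to_kcal_per_hour(m) for name, m in example_mets.items()}
```

Whatever reference mass is chosen rescales every activity by the same factor, so rankings of activities by caloric expenditure are unaffected by that choice.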

B. Phrase extraction
A major obstacle to the development of the food and physical activity lists is the determination of those phrases used by individuals that most accurately represent a food or physical activity. Various methods exist which may help one ascertain information about the frequency of usage of higher-order lexical units [37]. However, we require one that not only determines reasonable estimates of frequency of usage, but further, does so with nuance regarding context. For example, one should not count the phrase "apple" as having occurred if it appeared within a larger phrase that was recognized as meaningful, such as "you're the apple of my eye." To accomplish these goals, we define a low-assumption text segmentation algorithm, which we refer to as serial partitioning.
Serial text partitioning is a greedy algorithm (see Alg. 1) for finding distinct, coherent subsequences (phrases) within a sequence (clause). It relies on the directionality of a sequence, and so is particularly well suited to processing text into multi-word expressions for many modern languages. The algorithm relies on an objective function, which we will generally refer to as L. At a high level, the algorithm seeks to find the largest subsequences possible, following a chain of optimizing, growing subsequences.
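As a rough illustration of the greedy idea, the sketch below grows the current phrase one word at a time for as long as the objective increases, and otherwise closes the phrase and starts a new one. This is one plausible reading of the description, not a transcription of Alg. 1; the function name, the tie-breaking rule, and the toy objective are assumptions.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Phrase = Tuple[str, ...]


def serial_partition(
    clause: Sequence[str], objective: Callable[[Phrase], float]
) -> List[Phrase]:
    """Greedy left-to-right partition of a clause into phrases.

    Assumption: a phrase is extended only when the objective strictly
    increases; the objective is taken to be zero on the empty phrase.
    """
    partition: List[Phrase] = []
    current: Phrase = ()
    for word in clause:
        candidate = current + (word,)
        if not current or objective(candidate) > objective(current):
            current = candidate          # keep growing the current phrase
        else:
            partition.append(current)    # close the phrase
            current = (word,)            # start a new one at this word
    if current:
        partition.append(current)
    return partition


# Toy objective: a hypothetical score table, zero for unknown phrases.
toy_scores: Dict[Phrase, float] = {
    ("i",): 1.0,
    ("love",): 1.0,
    ("new",): 1.0,
    ("new", "york"): 2.0,
    ("new", "york", "city"): 3.0,
    ("pizza",): 1.0,
}
toy_partition = serial_partition(
    "i love new york city pizza".split(),
    lambda phrase: toy_scores.get(phrase, 0.0),
)
```

With this toy score table, "new york city" is kept together as a single phrase while the surrounding words are emitted individually.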
In the context of this article, we define $L$ relative to a text $T$ as follows, providing pseudocode below. First, let $f : S \to \mathbb{R}_{\geq 0}$ be the random partition frequency function [37] under the pure random partition probability ($q = \frac{1}{2}$) for the text $T$. We then apply the model of context developed in [41] under the parameterization $q = 1$, so that a given phrase $s$ is a member of $\ell(s)$ contexts $C_s$ (e.g., the phrase $s = (\text{New}, \text{York}, \text{City})$ is a member of three contexts, labeled $C_s = \{(*, \text{York}, \text{City}), (\text{New}, *, \text{City}), (\text{New}, \text{York}, *)\}$). Then for $C \in C_s$, we consider the context-local likelihood probabilities and prescribe to $s$ the likelihood-minimizing context, which chooses the context-pattern that is most prevalent in $T$. The objective function for this instantiation of serial partitioning is then defined in terms of this context and referred to as the local likelihood of a phrase $s$.
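One way these definitions could fit together is sketched below. The symbols $P(s \mid C)$ and $C^{*}_{s}$, and the exact form of the normalization, are assumptions introduced for illustration and are not taken from the text.

```latex
% Sketch under assumptions: P(s | C) and C^*_s are hypothetical notation.
\begin{align}
  P(s \mid C) &= \frac{f(s)}{\sum_{t \in S \,:\, C \in C_t} f(t)}, \\
  C^{*}_{s}   &= \operatorname*{arg\,min}_{C \in C_s} P(s \mid C), \\
  L(s)        &= P\left(s \mid C^{*}_{s}\right).
\end{align}
```

Under this reading, $f(s)$ is fixed for a given phrase, so minimizing $P(s \mid C)$ over $C \in C_s$ selects the context whose matching phrases have the largest total frequency, which agrees with the description of the chosen context as the most prevalent one in $T$.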
Algorithm 1. Serial text partitioning of a (left-to-right) directional clause, given an objective function $L : S \to \mathbb{R}_{\geq 0}$ (whose maximization is desired, in this case) that is zero on the empty phrase $(\cdot)$, and a clause $t = (t_1, \cdots, t_{\ell(t)})$, consisting of $\ell(t)$ words. Note that for any $a, b \in S$, $a\,b = (a_1, \cdots, a_{\ell(a)}, b_1, \cdots, b_{\ell(b)})$ denotes the concatenation of phrases, and that for convenience, a single sequence element, $a_i$, may be treated as a sequence of one term, $(a_i)$.

    P ← (·)    (init. the partition)
    ...
    for i ∈ (1, ···, ℓ(t)) do
        ...
        s ← t_i
    return P

We manually applied the following criteria for constructing both food and exercise phrase lists. For a phrase to be included, it had to be a phrase that used the food or physical activity word(s) in a way that pertained to eating or physical activity; we excluded phrases that were part of hashtags, Twitter user names, song lyrics, or names of organizations or businesses, and phrases that appeared four or fewer times were not included. Misspellings and alternate spellings were included if we happened upon them (for example, "mash potatoes" instead of "mashed potatoes"), but we did not go out of our way to search for them. We queried questionable phrases to be sure that the majority of their uses were referring to the item of interest. Because we were building up from a small list, some specific versions of foods were included while more general forms were not. For example, because we built phrases up from "strawberry," "strawberry jam" was included while we did not conduct a larger search for "jam". In another example, in building phrases up from "bacon," "bacon wrapped dates" turned up so we included those dates but did not conduct a larger search for all possible "dates". (Note: We removed the physical activities category 'sexual activity' from the study because the task of determining meaning and context was too difficult.)

We searched for phrases containing the physical activities in multiple tenses in order to capture as much information as possible. For example, for the activity type shoveling snow, we searched for the forms of shovel, shoveling, and shoveled. Tweets were initially converted to all lowercase text, so we were assured that we were not missing data due to capitalization.
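The mechanical parts of the criteria above (lowercasing, skipping hashtags and user names, and dropping phrases seen four or fewer times) can be sketched as follows. The judgement-based checks, such as excluding song lyrics or business names, were manual and are not captured here. The function names are hypothetical, tokenization is naive whitespace splitting, and phrases are counted as plain n-grams rather than through serial partitioning.

```python
from collections import Counter
from typing import Dict, Iterable

MIN_COUNT = 5  # phrases that appear four or fewer times are excluded


def count_phrases(tweets: Iterable[str], candidates: Iterable[str]) -> Counter:
    """Count candidate phrases in lowercased tweets, skipping #tags and @names."""
    wanted = {tuple(p.lower().split()) for p in candidates}
    lengths = {len(p) for p in wanted}
    counts: Counter = Counter()
    for tweet in tweets:
        # Hashtags and user names become None so they can never match a phrase.
        tokens = [
            None if tok.startswith(("#", "@")) else tok
            for tok in tweet.lower().split()
        ]
        for n in lengths:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i : i + n])
                if gram in wanted:
                    counts[" ".join(gram)] += 1
    return counts


def keep_frequent(counts: Counter, min_count: int = MIN_COUNT) -> Dict[str, int]:
    """Keep only phrases that occur at least min_count times."""
    return {p: c for p, c in counts.items() if c >= min_count}


# Made-up tweets for illustration.
toy_tweets = (
    ["I ate Mashed Potatoes"] * 5
    + ["#mashed potatoes rule"]
    + ["bacon wrapped dates"] * 4
)
toy_counts = count_phrases(toy_tweets, ["mashed potatoes", "bacon wrapped dates"])
toy_kept = keep_frequent(toy_counts)
```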
To match each food phrase with its closest caloric data, we found the most closely corresponding food from the USDA National Nutrient Database, counting all vegetables and fruits in their raw form unless the phrase indicated otherwise. Similarly, we entered meats as roasted or cooked with dry heat, not fried, unless the phrase indicated otherwise or there was no homemade option. We used the nutrition content of homemade versions of foods (for example, baked goods) rather than store-bought foods unless the phrase indicated otherwise. Our approach, while systematic, was not exhaustive, nor is it the only way of taking on this challenge; there are certainly other methods that we expect to yield similar results.
Finally, we lemmatized the food phrases by their code in the USDA National Nutrient Database. Within each set of phrases that held the same code, if a more general food phrase was present, we used that more general phrase as the lemma.
We lemmatized the activity phrases by their METs and activity category. Activity categories were largely the same as those listed in the Compendium, with slight changes where, for example, items in the Compendium were listed in a Miscellaneous category. As a result, some very different physical activity phrases that shared an activity category and MET value were initially included in the same lemma. From this level of lemmatization, we then used our best judgement to break these lemmas down further until proper phrases were included in each lemma.

[Figure axes: "Per activity phrase caloric expenditure shift" versus "Activity rank".]

Health and/or well-being quantity | $\rho_s$ for $C_{\text{diff}}$ | q-val | $\rho_s$ for $C_{\text{in}}$ | q-val | $\rho_s$ for $C_{\text{out}}$ | q-val