Social Media Fingerprints of Unemployment

Recent widespread adoption of electronic and pervasive technologies has enabled the study of human behavior at an unprecedented level, uncovering universal patterns underlying human activity, mobility, and interpersonal communication. In the present work, we investigate whether deviations from these universal patterns may reveal information about the socio-economic status of geographical regions. We quantify the extent to which deviations in diurnal rhythm, mobility patterns, and communication styles across regions relate to their unemployment incidence. For this we examine a country-scale publicly articulated social media dataset, where we quantify individual behavioral features from over 19 million geo-located messages distributed among more than 340 different Spanish economic regions, inferred by computing communities of cohesive mobility fluxes. We find that regions exhibiting more diverse mobility fluxes, earlier diurnal rhythms, and more correct grammatical styles display lower unemployment rates. As a result, we provide a simple model able to produce an accurate, easily interpretable reconstruction of regional unemployment incidence from their social-media digital fingerprints alone. Our results show that cost-effective economic indicators can be built based on publicly available social media datasets.


A. The datasets
Twitter provides an extremely rich and publicly available data set of user interactions, information flows and, thanks to the geo-location of tweets, user movements. Nevertheless, the representativeness of geo-located Twitter data as a source of mobility data has so far received little attention. While [13] presents a promising and extensive study of global country-to-country movements (mostly driven by tourism), within-country human flows (which comprise not only internal tourism but also, to a greater extent than country-to-country travel, visiting and commuting) still need further investigation. Therefore, throughout this work we compare our findings from geo-located Twitter with a similar study based on commuting surveys.
For the Twitter analysis, we consider 19.6 million geo-located Twitter messages (tweets), collected through the public Twitter API for the continental part of Spain from 29 November 2012 to 10 April 2013. In this dataset we consider that there has been a trip from place l to place k if a user has tweeted in place l and then in place k consecutively. We keep only those transitions in which the first and the second tweet are dated on the same day. We filter the trips database to avoid unrealistic transitions and keep only trips with a geographical displacement larger than 1 km. With this method, 1.38 million trips from 167,376 different users are considered in our work.
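As an illustration of the trip definition above, the following minimal sketch extracts trips from a list of geo-located tweets. The tuple layout `(user_id, datetime, lat, lon)`, the function names, and the use of the haversine formula for the displacement are assumptions made for this example; the text does not specify how the distance was computed.

```python
from collections import defaultdict
from datetime import datetime
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def extract_trips(tweets, min_km=1.0):
    """Return (user, origin, destination) for consecutive same-day tweets.

    tweets: iterable of (user_id, datetime, lat, lon) tuples (assumed layout).
    A pair of consecutive tweets counts as a trip only if both are dated
    on the same day and the displacement exceeds min_km.
    """
    by_user = defaultdict(list)
    for user, when, lat, lon in tweets:
        by_user[user].append((when, lat, lon))

    trips = []
    for user, rows in by_user.items():
        rows.sort(key=lambda r: r[0])  # chronological order per user
        for (t1, la1, lo1), (t2, la2, lo2) in zip(rows, rows[1:]):
            if t1.date() != t2.date():
                continue  # tweets dated on different days
            if haversine_km(la1, lo1, la2, lo2) > min_km:
                trips.append((user, (la1, lo1), (la2, lo2)))
    return trips


example = [
    ("u1", datetime(2013, 1, 5, 9, 0), 40.4168, -3.7038),
    ("u1", datetime(2013, 1, 5, 18, 0), 40.3083, -3.7327),
    ("u1", datetime(2013, 1, 6, 9, 0), 40.4168, -3.7038),
]
print(len(extract_trips(example)))
```

Counting the resulting trips by the municipalities containing origin and destination would then give the flow between each pair of municipalities.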
From those trips we construct the mobility flow $T_{ij}$ between municipalities, which counts the trips in our database whose origin lies within the boundaries of city i and whose destination lies within those of city j.
We also consider population and economic information about the municipalities from the Spanish Census (2011) [8] and unemployment figures from the Public Service of Employment (Servicio Público de Empleo Estatal, SEPE) [7]. In the latter case, registered unemployment (in number of persons) is given for each Spanish municipality by gender, age, and month. To obtain unemployment rates we divide registered unemployment by the total workforce in the municipality, estimated as the number of people aged between 16 and 65 years.
All the collected data complies with the terms of service for the websites where they were downloaded.

B. Twitter as mobility proxy
Considering all of the available transitions in our database, one can compute the distance between origin and destination, the elapsed time of the transition, and the number of trips per user, among many other statistics. Using the fitting method described in [26], we find that the trip distance and the number of trips exhibit a clear power-law distribution (KS statistics 0.05 and 0.06, with exponents -1.62 and -2.12, respectively), whereas the elapsed times are best fitted by an exponentially truncated power-law distribution (KS statistic 0.046, with exponent -0.67). For all these quantities, focusing on the log-linear part of the distributions, self-similar behavior arises when Twitter-based mobility is analyzed (see Fig. A).
Twitter-based inter-city flows can be well modelled by means of the Gravity Law, which is one of the most widely used methods to represent human mobility [1,19], with applications in many fields such as urban planning [23], traffic engineering [4], and transportation problems [9]. The Gravity Law is also the solution to the problem of maximizing the entropy of the particle distribution among all the possible trips using statistical mechanics techniques [22,2]. Recently, it has also been used as a model for human mobility based on cell phone traces [20,10,21] and on social media data at a global scale [13] and at the inter-city level [14]. The Gravity Model for human mobility assumes that the flows between cities can be explained by a gravity-type expression in which $T^{\mathrm{grav}}_{ij}$ is the flow, in number of people, between cities i and j, $d_{ij}$ is the geographical distance between them, and $P_i$ and $P_j$ are the populations of the two cities.
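The expression itself is not reproduced in this text. For orientation only, gravity models are commonly written in the following form, which uses the symbols defined above; the prefactor $K$ and the exponent names $\alpha_i$, $\alpha_j$ and $\beta$ are assumptions introduced here, not the authors' stated notation.

```latex
T^{\mathrm{grav}}_{ij} = K \, \frac{P_i^{\alpha_i} \, P_j^{\alpha_j}}{d_{ij}^{\beta}}
```

In this form the flow grows with the populations of both cities and decays with the distance between them, and taking logarithms turns the model into a linear relation between $\log T^{\mathrm{grav}}_{ij}$, $\log P_i$, $\log P_j$ and $\log d_{ij}$.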
Given the data, we can obtain the parameters of the model by weighted least-squares minimization, in which N is the total number of connections in the mobility graph and $w_{ij}$ is a weight that grows with the number of observed transitions between i and j. In particular, we find that taking $w_{ij} = T_{ij}^{1.3}$ gives the best performance of the model. In our case, this model fits the inter-city mobility based on Twitter GPS check-ins quite accurately (see Table A). Even though $T_{ij}$ is not necessarily symmetric, the exponents of the two populations are similar, indicating that we observe similar flows in both directions between i and j.
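A weighted least-squares fit of this kind can be sketched as follows. Since the objective function is not reproduced in this text, the sketch makes several assumptions: the model is taken to be $T = K P_i^{\alpha_i} P_j^{\alpha_j} / d_{ij}^{\beta}$, the fit is done on logarithms (which makes it linear), and the weights are $T_{ij}^{1.3}$ as stated above. The function names and the tuple layout are hypothetical, and the small linear solver is included only to keep the example free of external dependencies.

```python
from math import exp, log


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_gravity(flows, weight_exponent=1.3):
    """Weighted least-squares fit of a log-linear gravity model.

    flows: list of (T_ij, P_i, P_j, d_ij) with all values positive
    (assumed layout). Returns (K, alpha_i, alpha_j, beta).
    """
    k = 4
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for t, p_i, p_j, d in flows:
        x = [1.0, log(p_i), log(p_j), -log(d)]  # regressors in log space
        y = log(t)
        w = t ** weight_exponent  # weight w_ij = T_ij^1.3
        for r in range(k):
            xtwy[r] += w * x[r] * y
            for c in range(k):
                xtwx[r][c] += w * x[r] * x[c]
    log_k, alpha_i, alpha_j, beta = solve_linear(xtwx, xtwy)
    return exp(log_k), alpha_i, alpha_j, beta
```

On synthetic flows generated exactly from a gravity law, `fit_gravity` should recover the generating exponents up to numerical error, which is a convenient sanity check before applying it to observed flows.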

C. Community structures in inter-city mobility graph
Typically, complex networks exhibit community structure, that is, there are subsets of nodes that are more densely connected among themselves than to the rest of the nodes. In mobility networks, whose nodes correspond to geographical areas, these communities are interpreted as zones with high common activity and tend to be constrained by geographical and political barriers. We check whether this is also observed in our dataset by running six state-of-the-art community detection algorithms: FastGreedy [5], Walktrap [16], Infomap [18], MultiLevel [3], Label Propagation [17] and Leading Eigenvector [15]. These six algorithms yield different community structures in terms of number of communities, average community size, and modularity (see Table C). Members (municipalities) of the resulting communities are spatially connected, with only a few exceptions, as Fig. C shows.
We test the statistical robustness of the obtained communities by randomly removing a proportion p of the original links and running the algorithms on the resulting graph $G_p$. We consider the communities robust when those obtained for the original network G and for $G_p$ are highly similar. To compare two arbitrary community memberships, we use the Normalized Mutual Information (NMI) method described in [6], which returns 0 when two memberships are totally different and 1 when they are equal. We compute the NMI for each chosen algorithm run on G and $G_p$, for p between 1% and 10%, and conclude that the obtained community structures are robust, because they are not broken when some randomly chosen links are removed (see Table B, which reports the NMI between G and $G_p$ for different p).
As other works have shown, mobility graph communities are usually interpreted in terms of geographical and political barriers, and a natural question is whether the mobility-based communities are related to any of these barriers. In Spain, there are different territorial divisions for administrative purposes.
In this work, we consider two of them: provinces, defined in the 1978 Constitution, which are 50 heterogeneous aggregations of municipalities; and counties (comarcas in Spanish), which are traditional aggregations of municipalities based mainly on Spanish orography (rivers, valleys, ridges, etc.), some of which comprise municipalities from different provinces. We again use the NMI method to compare the community structure given by the algorithms with the administrative limits. Except for the Leading Eigenvector algorithm, all methods return communities that are closely related to provinces (NMI ≈ 0.7), whereas for the county limits higher variability is observed. In this last case, the algorithm showing the strongest relationship with county limits is Infomap (NMI ≈ 0.83). Therefore, Twitter-based mobility summarizes the inter-city flows and shows that these flows are influenced by geographical and political barriers.
As we can see, different algorithms also give different spatial resolutions. While FastGreedy, MultiLevel and Leading Eigenvector yield a small number of large partitions, we obtain a higher spatial resolution in the partitions produced by Walktrap, Infomap and Label Propagation. Since we want to study unemployment at a finer spatial scale than provinces, we consider only the latter methods in our study. Note also that the counties partition has small modularity with respect to the observed mobility graph, and thus we have discarded it. Finally, in the main text we have used the partition obtained by Infomap since, as explained before, it has the largest overlap with counties. However, as shown in Section I., our main results are similar for other partitions at different resolution levels. Specifically, the Label Propagation partition yields results very similar to those of the Infomap communities.
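The comparison of two community memberships, and the random link removal used in the robustness test, can be sketched as follows. This is a minimal illustration under stated assumptions: the NMI is computed in its common form $2 I(A;B) / (H(A) + H(B))$, memberships are given as lists of labels (one per municipality), and the function names are hypothetical. The community detection step itself is left out, since it depends on the graph library used.

```python
from collections import Counter
from math import log
import random


def nmi(labels_a, labels_b):
    """Normalized mutual information, 2*I(A;B) / (H(A) + H(B)).

    labels_a, labels_b: community label of each node, in the same node order.
    """
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    h_a = -sum(c / n * log(c / n) for c in count_a.values())
    h_b = -sum(c / n * log(c / n) for c in count_b.values())
    if h_a + h_b == 0:
        return 1.0  # both partitions place every node in a single community
    mutual = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    return 2 * mutual / (h_a + h_b)


def remove_random_links(edges, p, seed=0):
    """Return a copy of the edge list with a fraction p of links removed."""
    rng = random.Random(seed)
    keep = len(edges) - round(p * len(edges))
    return rng.sample(edges, keep)


# Identical partitions (up to relabelling) versus statistically independent ones.
print(nmi([0, 0, 1, 1], [5, 5, 7, 7]))
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))
```

In a robustness test along these lines, one would detect communities on the full edge list and on the output of `remove_random_links` for each p, and then call `nmi` on the two memberships.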

D. Twitter demographics and unemployment rates
Different age groups are not equally represented on Twitter. Recent surveys (2012) in Spain suggest that most (86%) Twitter users are 16 to 44 years old. Comparing the percentage of users per age group with the total population within the same groups (see Fig. D) reveals that age groups above 35 years old are under-represented on Twitter. Thus, our Twitter data will be more revealing when describing unemployment in age groups below 44 years old. This is indeed what we find when we build a linear model for the unemployment rate in different age groups with the same Twitter variables: while unemployment rates for ages below 24 can be fitted by a linear model with $R^2 = 0.62$, regression models for unemployment rates for ages between 25 and 44 have $R^2 = 0.52$, and for ages above 44 we obtain only $R^2 = 0.26$. Table D summarizes the results of the regression models of unemployment rates in each age group, showing that our Twitter variables have more explanatory power for ages below 44. Finally, in

E. Properties of Twitter variables
E.1. Normalization and distributions
The values of the variables constructed from Twitter are heterogeneous across areas, but only moderately so, as the histograms in Fig. E show. We did not find any geographical area with anomalous values in any of the variables considered. Variables are normalized in different ways: both the penetration $\tau_i$ and the misspellers rate $\varepsilon_i$ are defined as the number of users or misspellers per 100,000 persons (population); the activity variables $\nu_i$ are normalized as the percentage of tweets per time interval; and the numbers of tweets that mention a specific term, $\mu_i$, are given per 100,000 tweets published in the geographical area.
We have also considered the potential bias in the entropy estimations due to finite-size effects of the sample, which could create spuriously high information values. To this end we have used the simple Miller-Madow correction to entropy estimation [24]. However, the original and corrected estimations are highly correlated: for example, for the mobility entropies Pearson's correlation coefficient is 0.99 and the MSE between both estimations is 0.09. Thus, there is no significant bias in our estimation of the entropies.
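For reference, the Miller-Madow correction adds $(m - 1) / (2N)$ to the plug-in entropy estimate, where $m$ is the number of categories with at least one observation and $N$ is the sample size. The sketch below assumes entropies in nats and an input given as a list of counts (for example, the number of trips from one area to each destination); both the input layout and the function names are assumptions made for illustration.

```python
from math import log


def entropy_plugin(counts):
    """Plug-in (maximum likelihood) entropy estimate in nats."""
    n = sum(counts)
    return -sum(c / n * log(c / n) for c in counts if c > 0)


def entropy_miller_madow(counts):
    """Plug-in entropy plus the Miller-Madow term (m - 1) / (2 N)."""
    n = sum(counts)
    m = sum(1 for c in counts if c > 0)  # categories actually observed
    return entropy_plugin(counts) + (m - 1) / (2 * n)


trips_per_destination = [120, 45, 30, 5]  # hypothetical counts for one area
print(entropy_plugin(trips_per_destination))
print(entropy_miller_madow(trips_per_destination))
```

The correction term shrinks as the sample size grows, so the two estimates differ appreciably only for areas with few observations.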

E.2. Correlation between variables
Variables are constructed to reflect the behavior of areas along the different dimensions of Twitter penetration, social and geographical diversity, activity through the day, and content. The correlations between variables do indeed show that variables within each dimension are strongly correlated with each other. As we can see in Fig. F, social and geographical diversities are highly correlated, an expected fact given that the gravity law accurately describes not only the flows of people between geographical areas but also the amount of communication between them. The same behavior is found for the variables in the activity group, while content variables are less correlated. Finally, we find that both the penetration rate $\tau_i$ and the fraction of misspellers $\varepsilon_i$ are strongly correlated with most of the variables.
High correlation between variables might lead to collinearity effects [25] in the linear regression models; that is, some variables with predictive value might have non-significant weights because they explain the same part of the variance. For instance, in Table E the misspellers rate has a very strong predictive value, but its p-value is too high to consider it significant. To test this hypothesis, we perform a principal component analysis (PCA) on the independent variables of the regression. On the one hand, the variables of social and geographical diversity seem to explain a large part of the variance; on the other hand, we find a perpendicular group of variables formed by temporal activity; finally, penetration rate and misspellers fraction seem to represent a different, independent direction of the data, with high collinearity between them. This might explain the low statistical significance in the models of section I. In any case, the structure of the correlation matrix and the PCA results show that there is indeed information in all groups of variables, and thus we have taken one variable from each of them for our regression models.
Fig. F. Left: Each entry in the correlation matrix is depicted as a circle whose size is proportional to the correlation between variables and whose color is blue/red for positive/negative correlations. Blank entries correspond to correlations that are not statistically significant at the 95% confidence level. Right: Projection of the variables on the first two principal components given by PCA. We observe different groups of variables and collinearity between some of them.

F. Misspellers detection
In this work we consider only tweets in Spanish. Since several languages coexist in Spain, depending on the part of the country, the first step is to reduce our Twitter dataset to those tweets that are written in Spanish. This task is carried out using the n-gram based text categorization R library textcat [11]. Then, in order to decide whether a tweet contains a misspelling or not, we need to establish which patterns to look for in our set of tweets. Since we want to be sure that a detected mistake corresponds to a real misspeller, we do not consider the following cases:
• Lack of written accents. People tend to avoid writing accents when writing colloquially.
• Mistakes derived from removing unnecessary letters. The most common cases are removing an h at the beginning of a word (in Spanish the letter h is not pronounced) or replacing the letters qu by k. We understand that these mistakes can be motivated by the length limit of tweets rather than by a real misspelling.
• Along the same lines, mistakes produced by removing letters in the middle of a word whose pronunciation can be deduced without them.
• Mistakes related to features of specific areas of Spain. For example, in the south the pronunciation of ce and se is the same, which produces a large number of mistakes when writing. However, since we want to draw objective and equitable conclusions over the whole Spanish geography, we neglect those misspellings that appear only in a specific area.
Likewise, we consider the following mistakes to be real misspellings:
• Adding letters. For example, writing an h at the beginning of a word that starts with a vowel.
• Replacing the special cases mp, mb with the incorrect spellings np, nb.
• Mixing up b with v, g with j, ll with y, and ex with es. These are typical mistakes in Spanish, because the paired spellings have the same, or a very close, pronunciation.
• Confusing the verb haber with the periphrasis a ver.
• Splitting a word in two, for instance, writing the word conmigo as con migo.
In this way, our list of misspellings is composed of 617 common mistakes in Spanish that cannot be attributed to the special features of Twitter or to a specific region of Spain. Thus, one can expect this selection to provide an accurate and equitable method of detecting misspellers. Under these conditions, the number of users who wrote at least one misspelled word is 27,055 (5.6% of the whole population). We analyze whether misspellers use Twitter differently from people who do not make serious mistakes when publishing a tweet. Comparing the average number of tweets, we observe that misspellers tend to publish more tweets than those who did not make mistakes (144.71 against 23.72). This also emerges when the mean number of misspellings as a function of the total number of tweets is considered. For users with fewer than approximately 30 published tweets in the observation period, the number of misspellings is almost zero, whereas for users who publish more often, the mean number of misspellings scales sub-linearly with the number of tweets (exponent ≈ 0.33).
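A list-based detector of this kind can be sketched as follows. The 617-entry list used in the study is not reproduced in this text, so the handful of entries below are hypothetical examples chosen to match the categories above (np/nb for mp/mb, b/v and ll/y confusion, and a word split in two); the function names and the `(user_id, text)` layout are likewise assumptions.

```python
import re

# Hypothetical sample entries; the actual list of 617 mistakes is not given here.
MISSPELLED_WORDS = {"tanbien", "tanbién", "haver", "llendo"}
MISSPELLED_PHRASES = [re.compile(r"\bcon\s+migo\b")]  # "conmigo" split in two

# Lower-case Spanish word characters, including accented vowels and ñ.
TOKEN = re.compile(r"[a-záéíóúüñ]+")


def has_misspelling(text):
    """True if the text contains any listed misspelled word or phrase."""
    lowered = text.lower()
    if any(tok in MISSPELLED_WORDS for tok in TOKEN.findall(lowered)):
        return True
    return any(p.search(lowered) for p in MISSPELLED_PHRASES)


def misspellers(tweets):
    """Users with at least one tweet containing a listed misspelling.

    tweets: iterable of (user_id, text) pairs (assumed layout).
    """
    return {user for user, text in tweets if has_misspelling(text)}


sample = [
    ("a", "Yo tanbien voy"),
    ("b", "Ven con migo"),
    ("c", "Vamos a ver la tele"),
]
print(sorted(misspellers(sample)))
```

Matching whole tokens rather than substrings keeps correct words that merely contain a listed string from being flagged, and a missing accent alone (for example "tambien") is not treated as a mistake, in line with the exclusions above.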
Supporting Information for Social media fingerprints of unemployment
Since we have observed a segmentation of the Twitter population based on how accurately users write, we consider the misspeller rate as a proxy for the educational level of the cities. A large number of previous works in the literature have revealed the relationship between the economic status and the educational level of geographical areas, and therefore it is natural to ask whether the observed misspellers rate is related to the economy, as measured by the unemployment rate. To test this hypothesis, we consider cities with more than 5000 inhabitants to avoid under-sampled cases. We find a strong positive correlation between the probability of finding a misspeller in a city and the unemployment rate (0.372, 0.491).
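The population filter and the correlation test can be sketched as below. The text does not say how the reported pair of values was obtained, so the interval function is included only as one common way of attaching uncertainty to a correlation (Fisher's z-transform, an assumption on our part); the data layout and function names are also hypothetical.

```python
from math import atanh, sqrt, tanh


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def fisher_interval(r, n, z_crit=1.96):
    """Approximate 95% interval for a correlation via Fisher's z-transform.

    Requires |r| < 1 and n > 3.
    """
    z = atanh(r)
    half_width = z_crit / sqrt(n - 3)
    return tanh(z - half_width), tanh(z + half_width)


def misspeller_unemployment_correlation(cities, min_population=5000):
    """cities: list of (population, misspeller_rate, unemployment_rate).

    Cities at or below min_population are dropped before correlating.
    """
    kept = [(m, u) for pop, m, u in cities if pop > min_population]
    xs, ys = zip(*kept)
    r = pearson(xs, ys)
    return r, fisher_interval(r, len(kept))
```

The interval narrows as the number of cities grows, so with several hundred cities even a moderate correlation can be clearly separated from zero.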

G. Time window and unemployment
In the definition of the variables we have aggregated the Twitter activity within a 7-month time window (from December 2012 to June 2013). Since unemployment varies significantly over time, we investigate here the correlation and explanatory power of the Twitter variables for the values of unemployment at different months within the time window in which the Twitter data was collected, and whether the variables collected in that time window are more correlated with past or future values of unemployment. Fig. H shows the explanatory value of the model when the linear regression is done for values of unemployment in different months before, during and after the Twitter data time window. Although there is a small seasonal effect along the year, the explanatory power remains around $R^2 = 0.6$, which suggests that our Twitter linear model retains its explanatory power even though unemployment changes considerably throughout the year. It is interesting to note that $R^2$ decays slightly during the summer, which means that our variables are less correlated with summer unemployment. Finally, the unemployment used in the main article is from June 2013, i.e. the last month in the time window used to collect the data.

H. Demographics do not explain unemployment
Since unemployment rates are very high among young people, a natural question is whether demographic variables alone could explain the heterogeneity of youth unemployment rates found across the geographical areas. To this end we have built four linear models: the first (named Youth model in Table E) has the rate of young population as its only explanatory variable; the second and third are built only on the Twitter variables considered in the main text, using either all of them (Twitter model (I)) or just those whose regression coefficients are statistically significant (Twitter model (II)); the fourth is fitted with all the variables (named All variables model in Table E). Table E shows the summary of the regression for each model. Focusing on the variance explained by each model in terms of $R^2$, considering all Twitter variables is three times more explanatory than considering only the proportion of young people. On the other hand, comparing the $R^2$ of the Twitter model with those of the All variables and Youth models shows that the rate of young population does not provide significant explanatory power.
This semi-partial analysis shows that our Twitter variables retain a high explanatory power when the effect of young population rate is controlled.

I. Unemployment models for other geographical areas
While municipalities are very heterogeneous demographically, other administrative areas exist in Spain at larger scales that could be used for our model of unemployment. As mentioned in section C., the smallest administrative division of Spain we have considered is that of the 8200 municipalities. At larger scales we have the 326 counties (comarcas in Spanish), which are aggregations of municipalities. Finally, the largest geographical scale we consider is defined by the 50 provinces (provincias in Spanish). In this section we compare the performance of our Twitter model of unemployment for the variables defined in those administrative areas and relate it to that of the geographical communities detected and used in the main paper (see section C.). Not all the areas of the different administrative divisions are considered in the model. To minimize the effect of areas in which the number of geo-tagged tweets is very small, we only consider the 1738 municipalities that have a Twitter population π > 10. Similarly, we only consider the 198 counties with π > 100. As we can see in Table F, the model has a large explanatory power for areas as large as or larger than counties. As expected, $R^2$ increases as the number of areas in the model decreases, but the level of geographical detail of the model is very low for provinces, for example. The best performance (high $R^2$ and high level of geographical detail) is attained at the level of the detected communities. Other partitions obtained with a different community-finding algorithm yield similar results, as shown in Table F for Label Propagation.

J. Relative importance of the variables
To assess the relative importance of the variables in the unemployment model we have used several methods. They all give qualitatively the same results, with some variations for the statistically insignificant variables. Specifically, we have use