The Twitter of Babel: Mapping World Languages through Microblogging Platforms

Large-scale analyses and statistics of socio-technical systems that just a few short years ago would have required considerable economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data "proxies" of social life are still open. Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allow us to investigate different indicators such as the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and the geographical distribution of different languages in multilingual regions. This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.


Introduction
Modern life, with its increasing reliance on digital technologies, is opening unanticipated opportunities for the study of human behavior and large-scale societal trends. Cell phones have been playing a pivotal role in this revolution, serving as ubiquitous sensors and as the default point of contact for online activities [1,2]. As a whole, mobile clients for microblogging platforms, social networking tools, and other "proxy" data of human activity collected on the web allow for the quantitative analysis of social systems at a scale that would have been unimaginable just a few years ago [3][4][5][6]. In particular, the possibility of using mobile-enabled microblogging platforms, such as Twitter, as monitors of public opinion and social movements and as tools for the mapping of social communities has generated much interest in the literature [7][8][9][10][11][12][13][14]. At the same time, it is crucial to understand to what extent the picture of socio-technical systems emerging from digital data proxies is statistically sound and how well it scales to a planetary dimension [15].
In this paper, we perform a comprehensive survey of the worldwide linguistic landscape as emerging from mining the Twitter microblogging platform. Our large-scale dataset, gathered over approximately two years, at an average rate of $6.5 \times 10^{5}$ GPS-tagged tweets per day, contains information about almost 6 million users and provides a uniquely fine-grained survey of worldwide linguistic trends. By coupling the geographical layer to the identification of the language of single tweets we are able to determine the detailed language geography of more than 100 countries worldwide [16].
Although previous studies have investigated the language dynamics of Twitter [17], those analyses have focused on specific, yet interesting, aspects of the combined study of language and geography, and a global picture is still lacking. For instance, the most represented languages have been identified for the Top-10 most active countries [18], language-dependent differences have been pointed out in the user activity related to posting and conversation patterns [19], and language has been shown to be a strong predictor for the formation of follower/followee relations [20]. For this reason, and for the sake of assessing the generality and planetary scalability of our analysis, we have first focused on the reliability of geospatial trends extracted from our dataset. Interestingly, we find a universal pattern describing users' activity across countries, and a clear correlation between Twitter adoption and the Gross Domestic Product (GDP) of a country, further characterized by well-defined continent-dependent trends.
The high quality of the dataset permits the study of the spatial distribution of different languages at different scales, from aggregated country-level analysis to the neighborhood scale. In particular, we can drill down into the data of linguistic macro areas and single out heterogeneities at the country and regional level, scrutinizing the cases offered by Belgium and Catalonia (Spain) as examples. Furthermore, we explore the resolution offered by the data at a very fine level of granularity and inspect the city and neighborhood levels, taking as case studies the spatial distribution of the French and English languages in Montreal (Canada) and inspecting linguistic majorities in New York City (USA). We find that Twitter is able to reproduce the geospatial adoption of languages for a wide range of resolution scales. We contrast our results against census data, and discuss the possible sources of discrepancies between the two. Finally, we broaden our perspective by addressing the seasonality patterns in the language composition of the Twitter signal. We use touristic countries such as Italy, Spain, and France to single out clear seasonal trends such as the increase of English and other languages during the summer holiday season. Overall, our analysis highlights the potential of Twitter data in defining open source indicators for geospatial trends at the planetary scale.
The paper is structured as follows. In section 2 we go over data selection criteria as well as statistical measures regarding the universality of user behavior. Within this framework, we investigate several relevant examples in language geography (section 2.1) and explore the temporal dimension for seasonal patterns (section 2.2). A discussion (section 3) of the results is followed by a thorough description of the data sets and methodology used (section 4).

Results
Our analysis is based upon Twitter data gathered in approximately 20 months between October 18, 2010 and May 17, 2012, at an average rate of $6.5 \times 10^{5}$ GPS-tagged tweets per day (see Table 1 for exact numbers). The dataset includes $3.8 \times 10^{8}$ tweets produced by $6.0 \times 10^{6}$ users located in 191 countries, 110 of which generated the amount of data necessary for a significant statistical analysis of language detection. Our language detection methods allowed us to identify 78 languages. Our analysis is restricted to GPS-tagged tweets in order to preserve the maximum level of geographical detail, taking into account both live GPS updates and device-stored locations. The amount of geolocalized signal could in fact be increased by considering different kinds of metadata, for example self-reported locations [13], but these procedures would not allow us to reach the level of granularity and detail we aim for. Further details about the data collection and analysis procedures, as well as on the (live) GPS metadata, can be found in the Methods section. Overall, considering the recent literature, and to the best of our knowledge, the amount of GPS-tagged data we have gathered is remarkable not only in terms of volume, but also for its geographical and temporal coverage.
Fig. 1 illustrates the potential of inspection at different resolutions, from continent to city level, highlighting the detailed structure that is visible at each scale. Countries are easily identified along with their major metropolitan areas, and even within specific cities it is possible to observe a high degree of detail. Coupling this geographical resolution with language detection tools (see Methods) provides us with a remarkable view of how languages are used in different areas. However, Twitter adoption is not homogeneous across different countries. Fig. 2 ranks countries in descending order in terms of Twitter adoption, defined as the ratio between Twitter users and total population (i.e., Twitter users per 1,000 inhabitants). The emerging picture is highly heterogeneous, as expected, since our data come exclusively from smartphone devices, whose use is consequently tied to the availability of local infrastructure. In order to support the hypothesis that economic diversity is a primary source of heterogeneity in Twitter adoption (on mobile devices), we investigated whether the Gross Domestic Product (GDP) of a country could serve as a predictor of microblogging adoption. Fig. 3 shows that this is the case, the GDP and the Twitter users per capita being clearly correlated. Moreover, different continents (identified by different color codes in Fig. 3) cluster together, indicating that cultural as well as socio-economic factors jointly determine the observed pattern.
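The adoption measure and its relation to GDP can be sketched as follows. This is only an illustration: the use of a Pearson correlation on log-transformed values is an assumption made here (the text states only that the two quantities are clearly correlated), and the function names and toy numbers are hypothetical.

```python
import math

def users_per_thousand(users, population):
    """Twitter adoption: users per 1,000 inhabitants."""
    return 1000.0 * users / population

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def log_log_correlation(adoption, gdp_per_capita):
    """Correlation of log10(adoption) with log10(GDP per capita).
    The log transform is an assumption of this sketch."""
    return pearson([math.log10(a) for a in adoption],
                   [math.log10(g) for g in gdp_per_capita])

# Toy values (not taken from the dataset), chosen to lie on a perfect power law.
toy_adoption = [0.1, 1.0, 10.0]
toy_gdp = [1e3, 1e4, 1e5]
toy_r = log_log_correlation(toy_adoption, toy_gdp)
```

With real inputs, one value per country would be passed to each list, and the coefficient could be computed per continent to probe the clustering visible in Fig. 3.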
Geographical analyses at any scale require the aggregation of the signal produced by different users, and it is crucial to have a clear understanding of the patterns of single-user activity. One might suspect that usage patterns at the individual level show large heterogeneities across countries and thus cultures. In order to test statistically for the presence of different usage patterns, we gather the number of tweets per unit time sent by each identified user. From these data we construct the probability density function $p(N)$ that any given user emits $N$ tweets per unit time, taking one day as the reference unit of time. Furthermore, the $p(N)$ distribution can be analyzed by restricting the statistical analysis to users belonging to a specific country, a specific language or both. Interestingly, Fig. 4 shows that the distributions exhibit a universal shape irrespective of country (panel A), language (panel B), or the weight of each country on specific languages (panel C). As we will see, this finding is pivotal for an unbiased comparison of different geographical and linguistic scenarios. Any dependence of the activity distribution upon the language or location of the users would have reduced the array of possible analyses. It is also worth stressing that the curves overlap each other naturally, i.e., with no need for any rescaling or transformation. Although this feature indicates a very strong statistical homogeneity at the population level, the observed distribution turns out to span almost 4 orders of magnitude. The broad nature of this universal distribution is clear evidence of strong individual-level heterogeneity.
For this reason, in order to avoid distortions due to extremely active users, we consider only the proportion of tweets emitted by each user in a given language. Thus, a user $i$ who tweets in a set $L$ of different languages, $L = \{A, B, C, \ldots, Z\}$, will contribute to each language $X$ a fraction $N_i^X / \sum_{Y \in L} N_i^Y$, where $N_i^X$ is the total number of tweets written by the user in language $X$. We adopt the same normalization for the position of the user. The reasons for this normalization are multiple. First, the number of tweets collected for each user ranges over several orders of magnitude. Very active users, as well as automatic bots, might therefore distort or mask the signal coming from "common" individuals. Second, tourism might be a strong source of noise when trying to understand the demographics of a country or of a city. Touristic locations in the South of France or Italy might, for example, exhibit a high proportion of tweets in English or German.
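The per-user normalization described above can be sketched in a few lines. The function name, the input layout (one `(user_id, language)` pair per tweet) and the toy data are assumptions for illustration only, not the authors' actual pipeline.

```python
from collections import Counter, defaultdict

def language_user_weights(tweets):
    """Aggregate tweets into an effective number of users per language.

    tweets: iterable of (user_id, language) pairs, one pair per tweet.
    Each user contributes N_i^X / sum_Y N_i^Y to language X, so every
    user carries a total weight of exactly 1 regardless of activity.
    """
    per_user = defaultdict(Counter)
    for user_id, language in tweets:
        per_user[user_id][language] += 1

    totals = defaultdict(float)
    for counts in per_user.values():
        n_user = sum(counts.values())
        for language, n in counts.items():
            totals[language] += n / n_user
    return dict(totals)

# Toy example: a very active user (100 tweets) and an occasional one (1 tweet).
toy_tweets = [("u1", "en")] * 98 + [("u1", "fr")] * 2 + [("u2", "fr")]
toy_weights = language_user_weights(toy_tweets)
```

In the toy example the heavy user adds 0.98 to English and 0.02 to French, so French ends up with slightly more effective users than English even though it accounts for only 3 of the 101 tweets.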

Language analysis at different geographic scales
The ranking of languages in our signal is presented in Fig. 5, where the ordering is determined by the number of users we observe for each one of them. As expected, English is largely dominant. Spanish occupies the second position despite being almost 6 times less popular. Interestingly, these languages are followed by Malay and Indonesian, reflecting the fact that Indonesia is a very active country in absolute terms, even though in terms of users per capita the country is only ranked in the 30th position (see Fig. 2). Here the effect of each country's population size becomes clear. A large country such as Indonesia does not need a large per capita Twitter penetration to make its language very visible in Twitter, while the much smaller Netherlands does. In fact, the Netherlands is the second country in terms of users per capita (see Fig. 2), making Dutch the 8th most common language.
It is worth stressing that our statistics do not reflect the overall estimates of language speakers in the world. According to Ethnologue: Languages of the World [21], when native and secondary speakers are considered together Standard Chinese leads the ranking ($1.0 \times 10^{9}$ speakers), followed by English ($5.0 \times 10^{8}$ speakers), Spanish ($3.9 \times 10^{8}$ speakers), Hindi ($3.0 \times 10^{8}$ speakers) and Russian ($2.5 \times 10^{8}$ speakers), with Malay/Indonesian ranked 8th ($1.6 \times 10^{8}$ speakers). These discrepancies do not prevent us from extracting meaningful information in countries where Twitter penetration is sufficiently high to serve as an accurate mirror of the population, but they serve as a reminder that we are observing the worldwide linguistic landscape through the lens of a (specific) microblogging platform which, for example, is not available in China. The age composition of Twitter users must also be taken into account if one is to compensate for differences with respect to the official census data [22].
Country level. When we color each tweet according to its language and display them on a map, we see immediately that most content produced within each country is written in its own dominant language (see Fig. 6-A). This is further confirmed in Fig. 6-B, which shows the extent to which the dominant language prevails over other idioms in each country. In Fig. 7 we plot, for each of the Top 20 countries (by number of tweets), the fraction of users tweeting in each language. Interestingly, countries like France and Italy, which are characterized by a well-defined and substantially homogeneous linguistic identity, emit more than 20% of their tweets in English and other languages. Since the most common language in Twitter is English, this is perhaps not surprising. It is in fact reasonable that even users in non-English-speaking countries choose to tweet in English as a way of reaching out to a broader audience.
Regional level. To understand the geospatial heterogeneity of different linguistic backgrounds, we drill down into the data at smaller, within-country scales. It is interesting, for instance, to look at the spatial distribution of the different languages in multilingual regions. Fig. 8-A illustrates the geographical distribution of languages used in Belgium, where the North of the country uses predominantly Flemish, while in the South of the country the dominant language is (Walloon) French. Overall, Flemish accounts for 36.3% of the users, while French is the language of 14.7% of the users within the country borders, i.e., Dutch is 2.5 times more popular than French. Census data set the Dutch to French ratio (as first languages) at 1.5 [23]. The result emerging from the Twitter analysis is qualitatively correct, the quantitative mismatch being explained by the different Twitter penetration in neighboring France and the Netherlands, whose dominant languages are of course French and Dutch, respectively. In the first case, the number of users per 1,000 inhabitants is 0.85, while in the second it is 6.34, more than 7 times higher (see also Fig. 2). The Dutch-speaking population of Belgium finds itself embedded in a much richer Twitter environment, and consequently is more involved in the microblogging activity.
Moving to a within-country scale, Fig. 8-B shows the linguistic distribution in Catalonia, an autonomous region of Spain. Here Catalan and Spanish are clearly intermixed (particularly in Barcelona), even though Spanish is the most popular language, with a share of 49.0% of the users, whereas Catalan represents 28.2% of the signal, making Spanish 1.7 times more popular than Catalan. Interestingly, the Spanish to Catalan ratio is 1.25 when the habitual language of adults living in Catalonia is considered, according to a survey performed in 2008 by the Institute of Statistics of Catalonia [24]. In this case the Twitter data is close to the census data, although some considerations are in order. First, census data do not take into account the presence of tourists, whose Twitter activity is on the other hand recorded. Second, Twitter users may be biased towards the most common languages, in order to reach a wider audience. This interpretation is corroborated by the fact that, while in our dataset Catalan and Spanish account for 77.2% of the users, they represent the habitual language of 93.5% of the population according to the above-mentioned survey. In the same way, English, which according to census data is customarily spoken by less than 0.01% of the resident population, is adopted by 15.2% of the users. Going to a deeper level of inspection, we see that the Catalan language is more widely used in the central and Northern part of the region than in the area of Barcelona and the coast connecting this city to Tarragona. Remarkably, this pattern agrees with the overall picture provided by census data [24], thus confirming once again the validity of online data in providing meaningful information, even at the within-country scale.
City level. The high quality of the GPS geolocalized signal allows the inspection of the language demographics of single cities. Fig. 9 shows the city of Montreal, where English and French are the most used languages. While English is significantly more popular (65.5% of users vs. 26.9% for French), there appears to be spatial segregation, with French being more popular in the northern neighborhoods. Overall, English is 2.4 times more popular than French in our signal, while the situation is the opposite according to census data surveying languages spoken at home, where French is 3.1 times more frequent than English [25]. This reversal is not easy to interpret, but we speculate that the geographical location of Montreal, the fact that we do not consider the entire metropolitan population, and the fact that English is in general the privileged communication language in North America are factors that might play an important role.
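A cell-based language comparison of the kind shown in Fig. 9 could be sketched as below. Everything here is an assumption for illustration: the equirectangular approximation used to turn meters into degrees, the reference latitude (roughly that of Montreal), the input layout, and the choice to report the share a/(a+b) instead of the raw ratio a/b so that cells with no tweets in the second language do not cause a division by zero.

```python
import math
from collections import defaultdict

METERS_PER_DEGREE_LAT = 111_320.0  # approximate length of one degree of latitude

def cell_index(lat, lon, cell_m, ref_lat):
    """Map a coordinate to an integer grid cell of roughly cell_m x cell_m meters."""
    dlat = cell_m / METERS_PER_DEGREE_LAT
    dlon = cell_m / (METERS_PER_DEGREE_LAT * math.cos(math.radians(ref_lat)))
    return (math.floor(lat / dlat), math.floor(lon / dlon))

def language_share_by_cell(points, lang_a, lang_b, cell_m=200.0, ref_lat=45.5):
    """points: iterable of (lat, lon, language, user_weight).

    user_weight is the user-normalized contribution of a user to that
    language at that position. Languages other than lang_a and lang_b
    are ignored. Returns {cell: share of lang_a among lang_a + lang_b}.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for lat, lon, language, weight in points:
        if language == lang_a:
            sums[cell_index(lat, lon, cell_m, ref_lat)][0] += weight
        elif language == lang_b:
            sums[cell_index(lat, lon, cell_m, ref_lat)][1] += weight
    return {cell: a / (a + b) for cell, (a, b) in sums.items() if a + b > 0}

# Toy points (hypothetical): two users in one cell, one user in another,
# and one Spanish-language point that is excluded from the comparison.
toy_points = [
    (45.5000, -73.6000, "en", 0.5),
    (45.5000, -73.6000, "fr", 1.5),
    (45.6000, -73.5000, "fr", 1.0),
    (45.5000, -73.6000, "es", 1.0),
]
toy_shares = language_share_by_cell(toy_points, "en", "fr")
```

A share of 0.5 marks a balanced cell, while values near 1 or 0 mark cells dominated by the first or second language, respectively.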
The same analysis can be performed at the level of city neighborhoods. In the case of New York City, a city known for its cultural diversity, several non-English-speaking communities are already well-defined and documented [26][27][28][29][30]. For this case study, we partition NYC, Long Island, and New Jersey state into districts, towns, and municipalities, respectively. We do not consider the signal in English (since it is the official language, and homogeneously predominant in the area) and we focus instead on the language exhibiting the second largest number of users inside each district/town. Some of the most popular communities are those of Spanish speakers in Harlem, the Bronx, and parts of Queens [26]. However, Spanish is shared by people from many different cultural backgrounds and it is also widely used across the United States. It is thus difficult to estimate the exact location and dimensions of these communities based solely on the Twitter signal. In fact, it is clear that Spanish dominates as a second language in a number of districts of Fig. 10. Remarkable, on the other hand, is the clear delimitation of other communities. The Korean communities in Palisades Park, NJ and Flushing, NY are of considerable size and also very socially active [27,28]. Marine Park, NY, on the other hand, has a long history of Dutch immigration that dates back to the first European settlers in the area [29]. Another notable example is the case of Coney Island, NY, which is home to the largest Russian community in the United States [30]. The high resolution of our dataset allows us to visualize these communities without any a priori assumptions.
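The selection of the second language in each district can be sketched as follows, assuming that user-normalized weights have already been aggregated per district and language. The function name, record layout and toy district labels are hypothetical.

```python
from collections import defaultdict

def second_language_by_district(records, excluded="en"):
    """records: iterable of (district, language, user_weight).

    The excluded language (English here, as in the text) is dropped, and
    the remaining language with the largest total user weight is returned
    for each district.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for district, language, weight in records:
        if language == excluded:
            continue
        totals[district][language] += weight
    return {district: max(langs, key=langs.get)
            for district, langs in totals.items()}

# Toy input with made-up weights for two unnamed districts.
toy_records = [
    ("district_a", "en", 50.0), ("district_a", "es", 10.0), ("district_a", "ko", 3.0),
    ("district_b", "en", 40.0), ("district_b", "ko", 7.5), ("district_b", "es", 2.0),
]
toy_second = second_language_by_district(toy_records)
```

Districts in which only English is observed do not appear in the result, which corresponds to leaving them uncolored on a map.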

Seasonal variations
Now that we have a good characterization of the relative linguistic composition of each country, we can assess the use of our data to study and analyze seasonal variations of language composition, as this would give us valuable insights into population movements occurring over the course of a year. In particular, we might expect that during more touristic seasons one could observe a relative decrease in traffic occurring in the local dominant language and a corresponding increase in content being generated in foreign languages. In Fig. 11 we show the relative contributions of minority languages from users within a given country as a function of the month of the year. In particular, we single out traditional touristic destinations, such as France, Italy, and Spain, where clear variations are indeed visible during the summer.
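The monthly share of minority languages within one country can be computed along these lines. This is a minimal sketch: the function name, the record layout and the toy weights are assumptions, and the input is taken to be user-normalized weights already aggregated by month and language.

```python
from collections import defaultdict

def monthly_minority_share(records, dominant):
    """records: iterable of (month, language, user_weight) for one country.

    Returns {month: {language: share}} for every non-dominant language,
    where share is that language's weight divided by the total weight of
    all languages (dominant included) in the same month.
    """
    totals = defaultdict(float)
    minority = defaultdict(lambda: defaultdict(float))
    for month, language, weight in records:
        totals[month] += weight
        if language != dominant:
            minority[month][language] += weight
    return {month: {lang: w / totals[month] for lang, w in langs.items()}
            for month, langs in minority.items()}

# Toy weights (not from the dataset) for a country whose dominant language is Italian.
toy_records = [
    (1, "it", 90.0), (1, "en", 10.0),
    (8, "it", 70.0), (8, "en", 20.0), (8, "de", 10.0),
]
toy_shares = monthly_minority_share(toy_records, dominant="it")
```

In the toy input the English share doubles between January and August, the kind of summer increase that the text attributes to tourism.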
Our analysis allows us not only to identify the aggregate touristic fluxes, but also to infer the regions of origin on the basis of the observed language. Of course, the patterns we observe are likely to be slightly biased by the specificity of our observation point, so that, for example, the contribution of Dutch is likely to be consistently overestimated due to the high penetration of Twitter in the Netherlands. However, the possibility of observing seasonal fluxes is remarkable if we consider the low cost, both in terms of time and resources, that a Twitter survey requires compared to more traditional approaches. Moreover, monitoring social networks allows us to gain a real-time perspective on the fluxes, which is extremely hard to achieve through demographic studies.

Discussion
In this paper we have characterized the worldwide linguistic geography as observed from the Twitter platform, aggregating microblogging data at different scales, from country level down to the neighborhood scale. Although we show that Twitter penetration is highly heterogeneous and closely correlated with GDP, we find that the statistical usage pattern of the microblogging platform turns out to be independent of such factors as country and language. This feature allows us to address different issues, such as linguistic homogeneity at the country level, the geographic distribution of different languages in bilingual regions or cities, and the identification of linguistically specific urban communities. Focusing on specific case studies, we have shown that Twitter trends mirror census data quite accurately, even though specific deviations might emerge when comparing data that can be influenced by the adoption rate of the microblogging platform or by the fact that English is the most widely used language in Twitter. Finally, the analysis of temporal variations of the language composition of a given country opens up the possibility of identifying seasonal travel and mobility patterns in real time.
The presented results confirm the potential and opportunities offered by open access data, such as microblogging posts, in the characterization and analysis of demographic and social phenomena.

Data Collection
The dataset was obtained by extracting tweets from the raw Twitter Gardenhose feed [31]. The Gardenhose is an unbiased sample of 10% of the entire number of tweets and provides a statistically significant real-time view of all activity within the Twitter ecosystem. Twitter added support for explicit geotagging of tweets in November 2009, by providing API hooks that could be used by third-party developers to embed GPS coordinates within the metadata of each tweet. Since high quality GPS systems are increasingly common in mobile devices, this feature immediately became popular with mobile application developers and is currently available in hundreds of different Twitter clients. On average, about 1% of the tweets contain GPS information.

Language Detection
Automatically determining the language in which a given text was written is a problem of great practical importance for machine learning and data mining. Perhaps the best-known example is a feature of Google's popular web browser, Chrome, that offers to translate a page from its original language to the user's preferred language. The library that detects the original language of the page leverages Google's extensive experience with data mining. It has been extracted from the open source version of Chrome, which is currently in use in millions of browsers around the world, and made available separately as the "Chromium Compact Language Detector" [32]. To further ensure the accuracy of the results, we filter them by using an uncertainty threshold within the language detector.
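The thresholding step can be illustrated independently of any particular detector. In the sketch below the detector is passed in as a callable returning a `(language, confidence)` pair; this interface, the stand-in `toy_detector` and the threshold value of 0.9 are all assumptions made for illustration and do not reproduce the API of the Chromium Compact Language Detector or the threshold actually used.

```python
from typing import Callable, Optional, Tuple

# Hypothetical detector interface: text -> (language code, confidence in [0, 1]).
Detector = Callable[[str], Tuple[str, float]]

def detect_with_threshold(text: str, detector: Detector,
                          min_confidence: float = 0.9) -> Optional[str]:
    """Return the detected language, or None when the detector is too uncertain.

    min_confidence is an illustrative value, not the one used in the paper.
    """
    language, confidence = detector(text)
    return language if confidence >= min_confidence else None

def toy_detector(text: str) -> Tuple[str, float]:
    """Stand-in detector used only to exercise the filter."""
    lowered = text.lower()
    if "bonjour" in lowered:
        return ("fr", 0.97)
    if "hello" in lowered:
        return ("en", 0.95)
    return ("un", 0.10)

toy_labels = [detect_with_threshold(t, toy_detector)
              for t in ["Bonjour tout le monde", "hello world", "asdf qwerty"]]
```

Tweets for which the filter returns `None` would simply be left out of the language statistics.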

Geolocalization and Statistics
We restrict our analysis to tweets containing GPS coordinates, i.e., generated by using a smartphone with an Internet connection. This choice allows for the maximum geographical resolution, but inevitably reduces the volume of available signal. In fact, the data we have used for this paper constitute just about 1% of the signal we have collected, which in turn is approximately 10% of the total Twitter volume.
The amount of geolocalized tweets could be increased by considering self-reported information. In fact, users are encouraged to provide their location information in the user profile, but it is not subject to any format restriction. Moreover, Twitter platforms do not prompt the user for an update of this field, thus any change to this metadata field has to be spontaneous and made voluntarily. For this reason, the information in the user profile is sometimes erroneous or has low granularity. While the research community is on a continuous quest to understand how to mine and geocode this data, doing so brings about many challenges [33]. Moreover, when addressing temporal variations in mobility patterns, the use of smartphone GPS coordinates is required.
The metadata accompanying a tweet may also contain the geographical coordinates of a previous location in the field of self-reported location. These 'historical' locations might bias statistical measures involving mobility and/or fine graining, thus we considered them only in generating the language maps (Belgium, Catalonia, NYC). All analyses performed at the country level make use solely of live-GPS coordinates. We consider only those countries for which our signal is generated by at least 200 users, normalized by their activity and location. Thus, if a user emits 30% of her tweets from a given country, she will contribute 0.3 users to that country. 110 countries satisfy this minimum user threshold.
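The fractional user count and the country threshold can be sketched as follows. The function names and the input layout (per-user tweet counts by country) are hypothetical; only the threshold of 200 users and the 30% example come from the text.

```python
from collections import defaultdict

MIN_USERS = 200  # minimum number of (fractional) users per country, as stated in the text

def effective_users_by_country(user_country_counts):
    """user_country_counts: {user_id: {country: number of tweets}}.

    A user who emits 30% of her tweets from a country counts as 0.3 users there.
    """
    effective = defaultdict(float)
    for counts in user_country_counts.values():
        n_user = sum(counts.values())
        for country, n in counts.items():
            effective[country] += n / n_user
    return dict(effective)

def countries_above_threshold(user_country_counts, min_users=MIN_USERS):
    """Keep only the countries whose effective user count reaches min_users."""
    effective = effective_users_by_country(user_country_counts)
    return {c: w for c, w in effective.items() if w >= min_users}

# Toy input: one traveller splitting tweets 3/7 between two countries, one resident.
toy_counts = {"u1": {"FR": 3, "IT": 7}, "u2": {"IT": 5}}
toy_effective = effective_users_by_country(toy_counts)
toy_kept = countries_above_threshold(toy_counts, min_users=1.0)
```

The toy call lowers the threshold to 1.0 only so that the two-user example produces a non-trivial result.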
Finally, it is crucial to stress that every set of statistical measures performed in this paper is done at the user level, in order to reduce the noise that bots or cyborgs might add to the analysis. If not suitably addressed, in fact, their presence could lead to wrong conclusions about the day-to-day behavior of the average person [34].

Figures
Figure 1. Multiscale view of the geolocated Twitter signal. The large volume of geolocated Twitter traffic allows for a high resolution characterization of human behavior. A) Europe. B) Italy. C) Lazio region. D) Rome. The squares highlight the zoomed areas.

Figure 2. Ranking of countries by users per capita. Countries are ranked by the average number of Twitter users per 1,000 inhabitants.

Figure 3. Users and GDP per capita. Correlation between country-level Twitter penetration and GDP per capita.

Figure 4. User activity. Probability density $p(N)$ of user activity (number of daily tweets $N$) grouped by country (A) and language (B), and by country while considering English tweets exclusively (C). Different curves collapse naturally, without any functional rescaling, indicating the presence of a seemingly universal distribution of user activity, independent of cultural background.

Figure 5. Languages by number of users. Languages ranked by total number of users. For clarity, only languages with more than 30 users are shown.

Figure 6. Geographic distribution of languages around the world. A) Raw Twitter signal. Each color corresponds to a language. Densely populated areas are easily identified, while, as expected, languages are well separated among European countries. B) Dominant language usage. The color of each country indicates the fraction of users adopting the official language in tweets. Gray represents countries without a statistically significant signal.

Figure 7. Language share of the most active countries. Languages adopted by users from the Top 20 most active countries, ordered by number of English tweets.

Figure 8. Language polarization in Belgium and Catalonia, Spain. In each cell (600 m resolution) we compute the user-normalized ratio between the two languages being considered in each case. A) Belgium. B) Catalonia. The color bar is labeled according to the relative dominance of the language denoted by blue.

Figure 9. Language polarization in Montreal, QC, Canada. English and French are considered. In each cell (200 m × 200 m) we compute the user-normalized ratio between English and French (excluding all other languages). Blue: English. Yellow: French. The color bar is labeled according to the relative dominance of English over French.

Figure 10. Language polarization in New York City, NY, USA. The second language by district or municipality (in the case of New Jersey state) is shown. Blue: Spanish. Light green: Korean. Fuchsia: Russian. Red: Portuguese. Yellow: Japanese. Pink: Dutch. Grey: Danish. Coral: Indonesian.

Figure 11. Monthly variations in language use. Fraction of minority languages in specific countries as a function of the month. Increases in a specific language share indicate the presence of tourists visiting the country. Peaks are clearly visible during the local summer period.

Table 1. Basic metrics of the data set. Along with the total GPS signal, the fraction of live updates is reported (see Methods for details).