Mapping the Americanization of English in space and time

As global political preeminence gradually shifted from the United Kingdom to the United States, so did the capacity to culturally influence the rest of the world. In this work, we analyze how the world-wide varieties of written English are evolving. We study both the spatial and temporal variations of vocabulary and spelling of English using a large corpus of geolocated tweets and the Google Books datasets corresponding to books published in the US and the UK. The advantage of our approach is that we can address both standard written language (Google Books) and the more colloquial forms of microblogging messages (Twitter). We find that American English is the dominant form of English outside the UK and that its influence is felt even within the UK borders. Finally, we analyze how this trend has evolved over time and the impact that some cultural events have had in shaping it.


Introduction
With roots dating as far back as Cabot's explorations in the 15th century and the 1584 establishment of the ill-fated Roanoke colony in the New World, the British Empire was one of the largest empires in human history. At its zenith, it extended from North America to Asia, Africa and Australia, deserving the moniker "the empire where the sun never sets". However, as history has shown countless times, empires rise and fall due to a complex set of internal and external forces. In the case of the British Empire, its preeminence faded as the United States of America, one of its first colonies, took over the dominant role in the global arena.
As an empire spreads, so does the language of its ruling class. Thanks to its global extension, its late demise, and the rise of the US as a global actor, the English language enjoys an undisputed role as the global lingua franca, serving as the default language of science, commerce and diplomacy [1,2] (see Fig. 1). Given such an extended presence, it is only natural that English would absorb words, expressions and other features of local indigenous languages, resulting in dozens of dialects and topolects (language forms typical of a specific area) such as "Singlish" (Singapore), "Hinglish" (India), Kenyan English [3], and, most importantly, American English [4], a variety that includes within itself several other dialects [5,6].
The transfer of political, economic and cultural power from Great Britain to the United States has progressed gradually over the course of more than half a century, with World War II being the final stepping stone in the establishment of American supremacy. The cultural rise of the United States also implied the exportation of its specific form of English, resulting in a change in how English is written and spoken around the world. In fact, the "Americanization" of (global) English is one of the main processes of language change in contemporary English [7]. As an example, if we focus on spelling, some of the original differences between British and American English orthography (most of which are the result of Webster's reform [8]) have become somewhat blurred and, for instance, the tendency for verbs and nouns to end in -ize and -ization in America is now common on both sides of the Atlantic [9]. Likewise, a tendency for Postcolonial varieties of English in South-East Asia to prefer American spelling over the British one has been observed, at least, for Nigerian English [10], Singapore and Trinidad and Tobago [11], regarding spelling and lexis, for Indian English [12] and Bahamas [13], regarding syntax, and for Hong Kong [14], regarding phonology. In addition, a growing tendency for Americanization has been observed for Philippine English, which, despite being rooted in American English, has experienced a rise in the frequency of American forms [15]. Although this Americanization is found in different registers, web genres have been highlighted as a text-type where American forms are preferred [16]. Electronic communication has indeed been considered to play a role in linguistic uniformity [17].
It is in this sense that this paper will make a contribution to the study of the Americanization of English, since a corpus of 213,086,831 geolocated tweets will be used to study the spread of American English spelling and vocabulary throughout the globe, including regions where English is used as a first, second and foreign language.
The study of diatopic variation using Twitter datasets is a relatively new subject [18]. The use of geotagged microblogging data [19] allows one to quantitatively examine linguistic patterns on a worldwide scale, in automatic fashion and within conversational situations. The global extension and the real-time availability of the data constitute major methodological advantages over more traditional approaches like surveys and interviews [20]. Importantly, the resulting corpora are publicly available [21], although due to their nature most of the literature has been concerned with lexical variation (for an exception that addresses semantic and syntactic variation, see Ref. [22]). Thus, different variables can be mapped after carefully removing lexical ambiguities [23]. A Bayesian approach shows good agreement between baseline queries and survey responses [24]. Machine learning techniques applied to Twitter corpora reveal the existence of superdialects [25,26], which can be further analyzed with dialectometric techniques [27]. Linguistic evolution in social media appears to be strongly connected to demographics [28]. Age and gender issues can be additionally introduced in the analysis [29]. Moreover, an investigation of lexical alternations unveils hierarchical dialect regions in the United States [30]. Twitter can also be employed in the study of specific varieties departing from the standard form [31]. However, online social media are more suitable for a synchronic approximation to language variation. If one aims at understanding the diachronic evolution of language, one needs a corpus well established over time. This is available with the Google Books database [32], which has already been used for the analysis of relative frequencies that characterize word fluxes [33,34] or the applicability of Zipf's and Heaps' laws with different scaling regimes [35].
Here, we will complement our Twitter study of the Americanization of English with an analysis of the dynamical process that has been taking place since 1800.
In this paper we analyze how English is used around the world, in informal contexts, using a large scale Twitter dataset. Due to the written nature of our corpus we consider in detail how both the vocabulary and the spelling of common words vary from place to place in order to understand how American cultural influence is spreading around the world. We complement this synchronic analysis with a diachronic view of how the prevalence of British and American vocabulary and spelling has evolved over time in British and American publications using the Google Books dataset.

Figure 1. English tweets. A heatmap showing the location of geolocated English tweets in our dataset that match our keywords.

Datasets
The goal of this manuscript is to analyze how English is used across both time and space. We study the geographical variation of English by using the Twitter Decahose, from which we collect [36] all tweets written in English between May 10, 2010 and Feb 28, 2016 that contain geolocation information. The language is detected using the Chromium Compact Language Detection library as in Ref. [36]. The tweets are then mapped to a grid of cells of 0.25° × 0.25° spanning the globe, resulting in 30,898,072 tweets matching our list of words. A heatmap illustrating the geographical distribution of matching tweets is shown in Fig. 1. Out of the general dataset we further select those tweets with spelling and vocabulary features that allow us to discern the variety of English used (see below for a detailed description).
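As an illustration of the gridding step, the following is a minimal sketch of how a geolocated tweet could be assigned to a 0.25° × 0.25° cell. The function name `cell_index`, the row/column convention and the handling of the upper boundary are assumptions made for this example, not details taken from the original pipeline.

```python
import math
from typing import Tuple

CELL_SIZE = 0.25  # cell side in degrees, as stated in the text


def cell_index(lat: float, lon: float, size: float = CELL_SIZE) -> Tuple[int, int]:
    """Return the (row, col) of the grid cell containing a point.

    Rows count northwards from latitude -90, columns count eastwards
    from longitude -180 (an assumed convention for this sketch).
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    n_rows = round(180.0 / size)
    n_cols = round(360.0 / size)
    # Points lying exactly on the upper edge are folded into the last cell.
    row = min(math.floor((lat + 90.0) / size), n_rows - 1)
    col = min(math.floor((lon + 180.0) / size), n_cols - 1)
    return row, col
```

With this convention the globe is covered by 720 × 1440 cells, and counting tweets per `(row, col)` key yields the kind of density that a heatmap such as Fig. 1 displays.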
The temporal evolution of English is analyzed using the Google Books dataset [32] of books published by both British and American publishers. The dataset contains the number of times individual words were used in books scanned by Google and dating back to the 15th century. However, due to the poor statistics in earlier periods, we restrict our analysis to the period between 1800 and 2010. Both data sources are different in nature: Twitter contains more colloquial expressions, while the language recorded in the books is more formal. As a result, these two sources, in combination, can provide a useful perspective on the spatio-temporal patterns developed or developing in English.

Metrics
The polarization, $V_w^c$, for a concept $w$ in cell $c$ during the data collection period is defined as the ratio:

$$V_w^c = \frac{A_w^c - B_w^c}{A_w^c + B_w^c},$$

where $A_w^c$ ($B_w^c$) is the number of American (British) forms of the concept $w$ observed in cell $c$. The polarization is thus constrained to the $[-1, 1]$ domain, with $-1$ corresponding to purely British and $1$ to purely American forms.
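A minimal sketch of the concept polarization follows, assuming it is the normalized difference (A − B)/(A + B), which is the form consistent with the stated [−1, 1] range and its endpoints. The function name and the choice to return `None` for an unobserved concept are assumptions of this example.

```python
from typing import Optional


def polarization(american: int, british: int) -> Optional[float]:
    """Normalized difference between American and British counts.

    Returns None when the concept was not observed at all, since the
    ratio is undefined in that case (an assumption of this sketch).
    """
    total = american + british
    if total == 0:
        return None
    return (american - british) / total
```

For instance, three American forms against one British form give a polarization of 0.5, while equal counts give 0.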
The polarization of each cell, $V^c$, is then determined by taking the average polarization over all words observed in cell $c$:

$$V^c = \frac{1}{W_c} \sum_{w} V_w^c,$$

where $W_c$ is the number of different words observed in cell $c$. Similarly, the polarization score of a country is defined as the average polarization taken over all the cells within that country. By considering the average polarization we are able to compare countries of varying sizes.
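The two averaging steps can be sketched as follows, assuming each cell is represented by a dictionary mapping a concept to its (American, British) counts and that the concept polarization is the normalized difference (A − B)/(A + B). The data layout and function names are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, Optional, Tuple

Counts = Dict[str, Tuple[int, int]]  # concept -> (american, british)


def cell_polarization(counts: Counts) -> Optional[float]:
    """Average the concept polarizations over all concepts seen in a cell."""
    values = [(a - b) / (a + b) for a, b in counts.values() if a + b > 0]
    return mean(values) if values else None


def country_polarization(cells: Iterable[Counts]) -> Optional[float]:
    """Average the cell polarizations over all non-empty cells of a country."""
    values = [v for v in map(cell_polarization, cells) if v is not None]
    return mean(values) if values else None
```

Because each cell contributes one value regardless of how many tweets it holds, a densely populated city does not dominate the country score in this sketch.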
In the case of Twitter, the polarization signal is measured over the complete time period of the database, since it is not long enough to allow for large variations in the language use patterns. On the other hand, when the time evolution of written language is considered with Google Books, space is not relevant beyond the country of origin of the published book, and we add an index referring to the year $y$ considered. The polarization $V^y$ is then defined as:

$$V^y = \frac{1}{W_y} \sum_{w} V_w^y,$$

where $V_w^y$ is the concept polarization for year $y$ and $W_y$ is the number of different words observed across all the books published in the country considered, the US or the UK, during year $y$.
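The yearly polarization can be sketched in the same way, assuming per-year (American, British) counts for each concept have already been extracted from the Google Books word counts for one country. The input layout and the function name are hypothetical; the default year bounds follow the 1800 to 2010 window used in the analysis.

```python
from typing import Dict, Tuple

YearCounts = Dict[int, Dict[str, Tuple[int, int]]]  # year -> concept -> (american, british)


def yearly_polarization(
    counts_by_year: YearCounts, first_year: int = 1800, last_year: int = 2010
) -> Dict[int, float]:
    """Average concept polarization per year, restricted to the given window."""
    series = {}
    for year in sorted(counts_by_year):
        if not first_year <= year <= last_year:
            continue  # outside the analysed period
        values = [
            (a - b) / (a + b)
            for a, b in counts_by_year[year].values()
            if a + b > 0
        ]
        if values:
            series[year] = sum(values) / len(values)
    return series
```

Running this once on counts from US publishers and once on counts from UK publishers would give two time series of the kind compared in the Results section.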

Results
In our analysis, we consider two factors of differentiation between American and British English: Spelling and Vocabulary, with different word lists used for each case. The complete list of words and expressions used in each case can be found in the online supplementary material. It is the result of compiling information in reference books [9] and online sources such as the Oxford Dictionaries. The words in the list were subsequently checked in two widely-used representative corpora of British and American English [37,38]. Only pairs of words in which one of the members exhibits a significantly higher frequency in either of the two varieties were considered for inclusion in the list. Inflectional forms (e.g., solicitor, solicitors, solicitor's, solicitors') as well as derived forms (e.g., amphitheater) and compound forms (e.g., sportscenter) were also included in the search.
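To make the matching step concrete, here is a minimal sketch of counting American and British forms in a single message. The two-entry `WORD_LIST`, the tokenizer and the function name are hypothetical stand-ins; the actual list is the one in the supplementary material, and the original matching procedure may differ.

```python
import re
from typing import Dict, Tuple

# Hypothetical miniature word list: concept -> (American forms, British forms).
# Inflectional forms are listed explicitly, as the text describes.
WORD_LIST = {
    "truck/lorry": ({"truck", "trucks"}, {"lorry", "lorries"}),
    "color/colour": ({"color", "colors"}, {"colour", "colours"}),
}

# Lowercase alphabetic tokens, optionally with an internal apostrophe.
TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")


def count_forms(text: str) -> Dict[str, Tuple[int, int]]:
    """Return concept -> (american, british) counts for one message."""
    tokens = TOKEN.findall(text.lower())
    counts = {}
    for concept, (american, british) in WORD_LIST.items():
        a = sum(1 for t in tokens if t in american)
        b = sum(1 for t in tokens if t in british)
        if a or b:
            counts[concept] = (a, b)
    return counts
```

Summing these per-message counts over all tweets falling in a cell gives the American and British totals that enter the polarization.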
Let us start by considering how the Vocabulary used for common terms such as lorry/truck or motorway/freeway changes around the world by computing the ratio for each cell as described above. The results are plotted in Fig. 2. Unsurprisingly, we find that the British Isles are largely blue while the United States is predominantly red, as befits the representatives of each trend. Interestingly, in Western Europe, where English teaching has traditionally followed British norms, the American influence is undeniable. Most areas are depicted in various shades of red, while some of the largest international metropolises such as Madrid, Paris, Amsterdam, Berlin, Milan or Rome are visible in light shades, no doubt due to their role as tourist and transportation hubs, see Fig. 3 (left). A more marked British influence is easily seen in former colonies (see also Fig. 5) such as South Africa, Australia and New Zealand ("the only large areas in the Southern hemisphere where English is spoken as a native language" [9], and which have reached a very advanced phase of development, according to Schneider's 2007 Dynamic Model [39]) or India (where English is spoken as a non-native language, but which has followed an exonormative model, i.e., strongly based on the British rules [40]), which display large areas of blue side by side with telltale patches of white in the most international areas such as Pretoria, Melbourne, Sydney, Auckland, New Delhi or Mumbai. Furthermore, countries such as the Philippines (one of the few Postcolonial varieties of English with an American superstratum [39]), as well as Taiwan, South Korea and Japan (where English is spoken as a second language), attest to their strong American influence with full displays of red.
Regarding Spelling, the case for American influence becomes even stronger, as displayed in Fig. 4. The British Isles attain significantly lighter shades of blue, as do the former British colonies, with South Africa, Australia and New Zealand becoming predominantly red. This dichotomy between spelling and vocabulary, illustrated in Fig. 3 for Europe, is perhaps a testament to the conflicting forces of traditional formal education and media influence. Individuals who studied in school systems that subscribe to the British form of English are more prone to continue writing words in the way they originally learned them. However, through the influence of American-dominated television and film industries they have acquired new (American) vocabulary. This can be clearly seen in Fig. 5, where we plot the average polarization for both Vocabulary and Spelling for 30 countries around the world, including countries belonging to Kachru's [41] inner circle, i.e., where English is spoken as a native language (e.g., UK, Ireland), outer circle, i.e., where English is spoken as a second language (e.g., India, South Africa), and expanding circle, i.e., where English is spoken as a foreign language (e.g., Portugal, Finland, Russia). Interestingly enough, in all expanding circle territories the American orthography and vocabulary dominate, and the same happens, obviously, in the United States and in the Philippines, a former American colony. The bottom part of the figure includes inner and outer circle varieties, where American vocabulary is also chosen over British forms, with the notable exception of India, the UK and Ireland, whose green bars are always towards the left-hand (British) side of the ratio spectrum. India's alignment with the UK is clearly the result of an exonormative model and postcolonial prescriptivism in this former colony of the United Kingdom [40,42].
Surprisingly, we find that in some ex-colonies which still hold strong ties with the British empire, such as South Africa, Australia and New Zealand, the drift towards American vocabulary is unmistakable.
We now consider a temporal view of how English as a language is evolving. Using the word counts provided by the Google Books digitization efforts, we measure the Vocabulary and Spelling average ratio per year for books published by American and British publishing houses. An analysis of the resulting timelines, as shown in Fig. 6, provides several interesting insights. First, we can see that the divergence in spelling between the American and British forms has significantly increased in the last 200 years. Indeed, from this time series we can pinpoint the beginning of the trend to around 1828, when Noah Webster published An American Dictionary of the English Language [43] with the explicit goal of systematizing the way in which English was written in America. As [44] puts it: "He is certainly responsible for establishing (though not inventing) the common differences between traditional British and American spellings": the final -or versus -our in color, labor, savor, and the like; -er versus French -re in theater, center, meter; and the simplification of final -ck to -c, as in physic, music, logic. This is now considered to have been the first American-English dictionary, and it started the Merriam-Webster series of dictionaries that is still dominant today. The US vocabulary curve follows a similar but less pronounced trend, as it takes longer for new words to be created than for people to agree on a common spelling form.
Another interesting feature of these timelines is the pronounced "Britishization" of American English in the years following World War II, as seen by the declining slope that extends until after 1960. This can likely be explained by the large influx of European migrants that moved to America in search of a better life away from a destroyed or warring Europe. In the immediate aftermath of WWII, Congress passed the War Brides Act in 1946 and the Displaced Persons Act in 1948 to facilitate immigration to the US by the people affected by the war. It is estimated that between 1941 and 1950 over 1 million people [45], mostly of European descent, immigrated to the United States, which at the time had a population of 150 million. In the following decade, this number doubled to over 2 million [46]. Interestingly, while the ratio timelines within the United Kingdom had been trending towards becoming ever more British, we find a significant change of trend in the last 20 years of our dataset, corresponding to the period after the fall of the Berlin Wall and the end of the Cold War that left America as the world's only superpower. It is the status quo resulting from the aftermath of this trend that we are able to observe in the Twitter analysis above.

Conclusions
The way in which languages evolve in time and change from place to place has long been the focus of much interest in the linguistic community. With the advent of new and extensive corpora derived from large scale online datasets we are now able to take a more quantitative approach to this fundamental question. In this work we analyze two datasets that, when taken together, are able to provide a bird's eye view of the way English usage has been changing over time and in different countries.
The picture we are able to paint is particularly stark. The past two centuries have resulted in a clear shift in vocabulary and spelling conventions from British to American. This trend is especially visible in the decades following WWII and the fall of the Berlin Wall. Indeed, when we consider the current status quo as seen through the lens of Twitter, it becomes clear that only in the countries where British influence has been strongest, such as ex-colonies with a strong exonormative influence (in Schneider's terms [39]), are British conventions still dominant to some degree.
It should be noted that both datasets we utilize in our analysis are intrinsically biased. Books are typically written by cultural elites. Also, despite their increasing democratization, GPS-enabled mobile devices are, in many countries, only available to middle and higher economic strata. As a result, there are certainly factors of linguistic evolution we are missing, but the fact that both datasets agree on the general picture means that we are able to capture, at the very least, the underlying trends.

Acknowledgments
BG thanks the Moore and Sloan Foundations for support as part of the Moore-Sloan Data Science Environment at NYU. LL-P thanks the Spanish Ministry of Economy and Competitiveness for funding under the grants FFI2014-53930-P and FFI2014-51873-REDT.