Immigrant community integration in world cities

As a consequence of accelerating globalization, major cities all over the world are increasingly multicultural. The integration of immigrant communities may be affected by social polarization and spatial segregation. How are these dynamics evolving over time? To what extent are the different policies launched to tackle these problems working? These critical questions have traditionally been addressed by studies based on surveys and census data. Such sources are robust against spurious biases, but data collection is labor-intensive and rather expensive. Here, we conduct a comprehensive study of immigrant integration in 53 world cities by introducing an innovative approach: an analysis of the spatio-temporal communication patterns of immigrant and local communities based on language detection in Twitter and on novel metrics of spatial integration. We quantify the Power of Integration of cities (their capacity to spatially integrate diverse cultures) and characterize the relations between different cultures when acting as hosts or immigrants.


Introduction
Immigrant integration is a complex process involving a multitude of aspects such as religion, language, education, employment, accommodation, legal recognition and many others. Its study has a long tradition in sociology through concepts such as immigrant assimilation [1], structural assimilation [2] or immigrant acculturation and adaptation [3]. Over the last years, there have been advances in the definition of a common framework for immigration studies and policies [4], although the approach to this issue remains strongly country-based [5]. The outcome of the process depends on the culture of origin, the culture of the host society and the policies of the host country's government [6]. Traditionally, spatial segregation in the residential patterns of a certain community has been taken as an indication of ghettoization or lack of integration [7]. While this applies to immigrant communities, it can also affect minorities within a single country [8]. Spatial isolation is reflected in the economic status of the segregated community and in the social relationships of its members [9].
In global terms, while international migration flows have remained almost stable over the last 20 years [10,11], political and economic upheavals such as the Arab Spring and the Syrian civil war have brought the problem of migrants and their integration to the forefront of world news and even the academic press [12,13]. A good part of the newcomers concentrate in cities, particularly in the large metropolises known as World Cities. These are centers that attract specialized immigration, driving important social and cultural transformations in cities worldwide [14]. The concept of Global or World Cities emerged in the 1980s [15,16] to describe strategic territories that articulate the international economic structure.
According to Sassen [15], Global Cities are characterized not only by growing multiculturalism but also by rising social polarization, which ultimately materializes in increasing socio-spatial segregation and gentrification processes. This assertion is still under debate in the social sciences, and settling it requires further empirical evidence [17,18]. Furthermore, immigrant integration has been the focus of many research studies, most of them conducted from national perspectives, especially in European countries and the USA [5,6,8,19,20], and the field is still in dire need of information sources beyond national censuses [12,13,21].
In parallel, the last few years have brought a paradigm shift in the context of socio-technical data. Human interactions are being digitally traced, recorded and analyzed at large scale. Sources as varied as mobile phone records [22–34], credit card transactions [35] and Twitter data [36–38] have been used to study mobility and land use in urban areas. Most of these works have been carried out in the zones where data was available, mostly inside cities or single countries. Twitter data has, however, the particularity of extending beyond national borders and, therefore, it allows researchers to analyze mobility and city hierarchies at an international level [36,39]. Besides activity and mobility, the content of the tweets also carries a wealth of information, starting with the language in which the text is written. The spatial distribution of languages has been investigated in Refs. [40–42], which also explore the relations between languages through multilingual individuals, and in Refs. [43,44], where the spatial extension of Spanish and English dialects was examined. Of course, one of the weak points of Twitter as a data source is its representativeness. This question has been addressed in Refs. [37,45–47], which find acceptable coverage for the American, British and Spanish populations in terms of geographic allocation, race, religion and mobility, although the data shows a bias towards younger individuals. In this context, the combination of location and language detection is of special interest. It opens the door to characterizing foreign users on short visits and on temporary or permanent stays. Arribas-Bel [48] published a first exploratory work in this direction using Twitter and census data in Amsterdam. At the same time, the use of records of phone calls to foreign countries has provided a picture of communities with external connections in the area of Milan [49].
When it comes to immigrant integration, there are fewer works, but one that deserves mention is a recently published study [50] that looked at the social ties (friendships and affinities) between immigrant communities using data from Facebook. There have been diverse attempts to measure the degree of immigrant integration over the last years. One of them [51] introduces a quantitative index, the Composite Assimilation Index (CAI), that quantifies the degree of similarity between native- and foreign-born adults in the United States, based on US census data. In [50], a similar measure of integration is considered, based on the relative proportion of ties between immigrant people born in the US, compatriots living in the US, and inter-group friendships with immigrants from other countries.
In this work, we introduce a novel approach to quantify the spatial integration of immigrant communities in urban areas worldwide. By analyzing language in Twitter data, we are able to assign languages to each user paying special attention to those corresponding to migrant communities in the city considered. The individuals' digital spatio-temporal communication patterns allow us to define as well areas of residence. With this information, we perform a spatial distribution analysis through a modified entropy metric, as a quantitative way to measure the spatial integration of each community. The metric can be expressed in a bipartite network with the culture of origin in one side and the hosting cities, countries and languages in the other. These results lead us to categorize the cities according to how well they integrate immigrant communities and also to quantify how well hosting countries integrate people from other cultures.

Materials and methods
Fig 1. Africa has not been considered due to the lack of data. We cover each city with a square grid in order to keep a homogeneous spatial division over the whole urban area where the users are going to be distributed (b), selecting resident users and their most frequent location thanks to their activity over space and time (c). In addition, we assign the users' most probable native language (d) and perform a spatial analysis over the cities (e) to get information about the population distribution as a function of the language spoken by the users.
We selected 53 of the most populated cities in the world (see Fig 1a) and analyzed the geolocalized tweets originating in each city between October 2010 and December 2015, as captured from the Twitter API (see S2 File of the Supporting Information for an example of a query). The data was collected respecting Twitter's terms of service and privacy conditions. Several items are extracted from each tweet: user ID, geographical coordinates (latitude and longitude), date and time, and the text of the tweet. In order to get a coherent picture across the different time zones, we convert the Twitter UTC time into the local time zone of each city. Before starting the analysis, it is necessary to filter out non-human users from the dataset. This is fundamental to prevent the results from being polluted by signals coming from automatic tweet generators (bots), which are not rare in social networks [52]. We identified and disregarded tweets generated at the same time (to the precision of one second) by the same account. Moreover, we discard users who tweet more than three times per minute. Finally, we compute the speed of users moving between consecutive locations in order to filter out those traveling faster than a reasonable speed in urban areas (100 km/h or 62 mph).
This procedure leaves us with a total of 350.9 million tweets posted by 14.5 million users in the 53 cities (see Table A in the S1 File of the Supporting Information (SI) for detailed numbers per city).
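The three filtering rules described above can be sketched as follows. This is a minimal illustration, not the pipeline used for the study: the tuple layout, the function names, the use of fixed calendar-minute bins and the choice to drop every tweet that shares its second with another tweet of the same account are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

MAX_TWEETS_PER_MINUTE = 3   # threshold stated in the text
MAX_SPEED_KMH = 100.0       # threshold stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def clean_tweets(tweets):
    """tweets: iterable of (user_id, unix_seconds, lat, lon) tuples (hypothetical layout).

    Returns the tweets that survive the three rules.
    """
    by_user = defaultdict(list)
    for tweet in tweets:
        by_user[tweet[0]].append(tweet)

    kept = []
    for rows in by_user.values():
        # Rule 1: disregard tweets posted in the same second by the same account.
        per_second = Counter(r[1] for r in rows)
        rows = sorted((r for r in rows if per_second[r[1]] == 1), key=lambda r: r[1])
        if not rows:
            continue
        # Rule 2: discard users with more than three tweets in any minute
        # (assumption: fixed one-minute bins rather than a sliding window).
        per_minute = Counter(r[1] // 60 for r in rows)
        if max(per_minute.values()) > MAX_TWEETS_PER_MINUTE:
            continue
        # Rule 3: discard users moving faster than 100 km/h between consecutive tweets.
        too_fast = any(
            haversine_km(a[2], a[3], b[2], b[3]) / ((b[1] - a[1]) / 3600.0) > MAX_SPEED_KMH
            for a, b in zip(rows, rows[1:])
        )
        if too_fast:
            continue
        kept.extend(rows)
    return kept
```

After rule 1, all timestamps of a user are distinct, so the speed computation never divides by zero when timestamps are integer seconds.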
We will propose below a metric to assess the spatial segregation of immigrant communities that is not highly sensitive to the specific borders of the area studied. However, everything has its limits. The mix of local and immigrant population is different in urban and rural areas. It is thus important to strike a balance and ensure that the region considered contains the city, where the signal on immigrants is stronger, but does not extend unnecessarily far from it. This means that we should agree on a city definition that can be applied around the world and is large enough to include the whole metropolitan area. Unfortunately, generic definitions such as the Larger Urban Zone (LUZ) definition of Eurostat for Europe do not exist at the global scale. There are plenty of different ways of defining cities, with, for example, methods based on urban growth, percolation, attraction or fractal theory. All these methods require third-party data such as population, built-up area or flows of commuters that are not easily available in a consistent form everywhere. To sidestep this difficulty, we use a very pragmatic definition based only on the Euclidean distance and consider all activity within a frame of 60 × 60 km² centered on the barycenters listed in Table B of the S1 File of the SI to belong to the city itself, dividing each city area using an equally spaced grid of 500 × 500 m cells (Fig 1b).
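As an illustration of this city definition, the sketch below maps a coordinate to a cell of the 60 × 60 km frame. The function name and the flat-Earth (equirectangular) conversion from degrees to kilometres are assumptions made for the example; the text does not specify which projection was used.

```python
import math

FRAME_KM = 60.0                             # side of the square frame (from the text)
CELL_KM = 0.5                               # side of a grid cell (from the text)
CELLS_PER_SIDE = int(FRAME_KM / CELL_KM)    # 120 cells per side
KM_PER_DEG_LAT = 111.32                     # approximate length of one degree of latitude


def cell_index(lat, lon, center_lat, center_lon):
    """Return the (row, col) cell containing the point, or None if it is outside the frame.

    Assumption: a local equirectangular approximation around the city barycenter.
    """
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(center_lat))
    x = (lon - center_lon) * km_per_deg_lon + FRAME_KM / 2   # km east of the frame's west edge
    y = (lat - center_lat) * KM_PER_DEG_LAT + FRAME_KM / 2   # km north of the frame's south edge
    if not (0 <= x < FRAME_KM and 0 <= y < FRAME_KM):
        return None
    return int(y // CELL_KM), int(x // CELL_KM)
```

With these constants, the barycenter itself falls in cell (60, 60), and any point more than 30 km from the barycenter along either axis is discarded.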

Definition of the user's place of residence
As represented in Fig 1c, the place of residence of every user is defined as the most frequented grid cell between 8pm and 8am local time. To ensure that a user shows enough regularity and that he/she is really living in the city, and not just a visitor for a small period of time, we applied three filters: a minimum number of consecutive months of activity C, a minimum number of hours spent by the user in the most frequented cell N measured out of his/her consecutive tweets, and Δ as the ratio between N and the total number of hours of activity for each user (number of hours during which he/she has posted at least one tweet). The source code used to extract most visited locations from individual spatio-temporal trajectories is available online (https://github.com/maximelenormand/Most-frequented-locations).
Users who are active within a given city for at least three consecutive months are considered to be residents, which establishes the first condition, C ≥ 3 months. The values of the other two parameters were determined empirically. In Fig A of the S1 File (Supporting Information), we plot the evolution of the number of users left in the dataset as a function of Δ for different values of N (5, 10, 15 and 20) in each of the 53 cities. As the shape of the curves is similar for the different values of N, there does not seem to be a natural feature that would allow us to define a clear cutoff. We fix Δ ≥ 0.2 and N ≥ 5 as a trade-off between being relatively sure about the users' residence area and keeping enough users to have proper statistics. Table C in the S1 File of the Supporting Information lists the final number of residents per city after this data cleaning procedure. Note that there are at least 1000 reliable users per city.
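The three residence filters can be sketched as below. This is not the code from the repository cited above, only a simplified illustration under stated assumptions: hours are counted as distinct (date, hour) bins with at least one tweet, both N and the total number of hours are restricted to the 8pm to 8am window, and C is the longest run of consecutive calendar months with any activity.

```python
from collections import Counter
from datetime import datetime

MIN_MONTHS = 3     # C >= 3 consecutive months
MIN_HOURS = 5      # N >= 5 hours in the most frequented cell
MIN_RATIO = 0.2    # Delta >= 0.2


def is_night(dt):
    """True between 8pm and 8am local time."""
    return dt.hour >= 20 or dt.hour < 8


def longest_month_run(datetimes):
    """Length of the longest run of consecutive calendar months with activity."""
    months = sorted({dt.year * 12 + dt.month for dt in datetimes})
    best = run = 0
    previous = None
    for month in months:
        run = run + 1 if previous is not None and month == previous + 1 else 1
        best = max(best, run)
        previous = month
    return best


def residence_cell(records):
    """records: list of (local datetime, cell) pairs for one user (hypothetical layout).

    Returns the residence cell, or None if the user fails any filter.
    """
    if longest_month_run([dt for dt, _ in records]) < MIN_MONTHS:
        return None
    night_bins = {(dt.date(), dt.hour, cell) for dt, cell in records if is_night(dt)}
    if not night_bins:
        return None
    hours_per_cell = Counter(cell for _, _, cell in night_bins)
    cell, n_hours = hours_per_cell.most_common(1)[0]
    total_hours = len({(day, hour) for day, hour, _ in night_bins})
    delta = n_hours / total_hours
    return cell if n_hours >= MIN_HOURS and delta >= MIN_RATIO else None
```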

Language assignment
At this point, we are interested in introducing a method to determine which languages each user speaks, or at least in which languages he/she tweets. If any of these languages is characteristic of an immigrant community, this will most likely identify the user as a member of that community. To do this, the language of each tweet is detected using version 2.0 of the Chromium Compact Language Detector (CLD2), which returns the languages detected along with a confidence assessment. CLD2 implements a Bayesian classifier for detecting language from UTF-8 text. Twitter entities (URLs, mentions, hashtags) that may hinder language detection are removed, and only the remaining text is given as input to CLD2. To obtain reliable results, we keep only tweets for which the detector returned a language with a confidence level of at least 90%. Also, we aggregate close languages to take into account the uncertainty in the identification of mutually intelligible languages and dialectal varieties (see Table D in the S1 File of the Supporting Information for more details).
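The preprocessing step can be illustrated with a short sketch. The regular expressions below are assumptions chosen for the example, not the patterns used in the study, and the detector itself is deliberately left out: `keep_confident` simply filters (language, confidence) pairs as they might be returned by any detector.

```python
import re

# Illustrative patterns for the three kinds of Twitter entities named in the text.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def strip_entities(text):
    """Remove URLs, mentions and hashtags, then collapse whitespace."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    return " ".join(text.split())


def keep_confident(detections, min_confidence=90):
    """detections: list of (language, confidence_percent) pairs (hypothetical layout).

    Keeps only the languages detected with at least 90% confidence.
    """
    return [lang for lang, confidence in detections if confidence >= min_confidence]
```

For example, `strip_entities("Hola @ana mira esto https://t.co/abc #madrid qué tal")` leaves only the free text, which is what would be passed on to the language detector.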
As can be expected, there are users tweeting in more than one language. For each user, we build a dictionary with the number of occurrences of each language in his/her tweets. English is one of the most frequent languages per user because of its role as lingua franca for spreading information to the highest number of Twitter followers. Still, since we are interested in finding the language representative of the user's community of origin, we propose a language algebra to extract this information from the user's dictionary. Let us define as Local the official language of each city. There are cases in which more than one Local language coexists in the same city, like Catalan and Spanish in Barcelona, French and Flemish in Brussels or French and English in Montreal. The same occurs for Dublin and Singapore (see Table E in the S1 File of the SI for a complete list of cities and languages). After defining the Local languages in each city, we assign to each user his/her most frequent language, with some exceptions for bilingual and multilingual users. For bilingual users, we assign the language that differs from English and from the Local ones, unless English and Local are the only two languages in the dictionary. In this latter case, we define the user as a speaker of the Local language. For users with three languages, we adopt the same hypothesis and assign the third language when one or both of English and Local are in the dictionary. In general, we take the most frequent language in the dictionary other than English and the Local ones. If there are only Local languages and English, we keep the Local. English can only be assigned if it is the only language in the dictionary. The final numbers of users left for the analysis with a reliable residence cell, per language identified and per city, are displayed in Table F of the S1 File of the Supporting Information. We consider only languages with 30 users or more in each city.
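The language algebra above can be condensed into a small function. This is a sketch of the rules as stated in the text; the function name, the language codes and the handling of edge cases (ties are broken by dictionary order, and the most frequent Local language is returned when several are present) are assumptions.

```python
def assign_language(counts, local_languages, english="en"):
    """Pick the language representing a user's community of origin.

    counts: dict mapping language code -> number of tweets by the user.
    local_languages: the Local language(s) of the city.
    """
    local = set(local_languages)
    # General rule: most frequent language other than English and the Local ones.
    others = {lang: n for lang, n in counts.items() if lang != english and lang not in local}
    if others:
        return max(others, key=others.get)
    # Only Local languages and English are present: keep the Local.
    present_local = {lang: n for lang, n in counts.items() if lang in local}
    if present_local:
        return max(present_local, key=present_local.get)
    # English is assigned only when it is the only language in the dictionary.
    return english if english in counts else None
```

A user in Madrid with 50 tweets in English, 30 in Spanish and 5 in Italian is assigned Italian, whereas a user with only English and Spanish tweets is assigned Spanish regardless of which of the two is more frequent.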

Bipartite spatial integration network
To quantify the spatial segregation of each immigrant community in every city, we build a bipartite spatial integration network H (see Fig 2). Every language is connected to the cities where the corresponding immigrant community has been detected. The weight of an edge between language l and city c, $h_{l,c}$, corresponds to the level of spatial integration measured with a new metric inspired by the Shannon entropy, but modified to take into account the finite character of the sampling of communities in our Twitter database. Shannon entropy-like descriptors have been used before in this context, especially when considering the spatial segregation of ethnic minorities in US cities [53]. Recalling that the cities have been divided into equal-area grid cells and focusing first on one generic city c, we can directly calculate from the data the fraction of users of a certain community l having their residence in cell i, $p_{l,i}$. This allows us to define an entropy per language community l:
$$s_{l,c} = -\sum_{i=1}^{N} p_{l,i} \, \log\left(\frac{p_{l,i}}{\Delta x^2}\right) \qquad (1)$$
where N is the total number of cells and the index i runs over all the cells. $\Delta x^2$ is the area of the cells; it is added to make the entropy stable against changes of spatial scale, as proposed in Ref. [54]. We take as unit the area of our 500 × 500 m² cells and, thus, a change in cell size such as those shown in the Supporting Information for 1 × 1 and 2 × 2 square kilometers requires a correction factor of 4 and 16, respectively, as expressed in Eq (1). The distribution of the population is generally heterogeneous, so $s_{l,c}$ by itself does not tell us anything about characteristic features of the community l. To overcome this and also to take into account the finite sampling size, we introduce next a random null model.
The $n_{l,c}$ users associated with language l in city c are drawn at random over the city cells according to the total distribution of users to obtain new fractions $p^{r}_{l,i}$ for language l in each cell i, and then we evaluate the following entropy:
$$s^{rand}_{l,c} = -\sum_{i=1}^{N} p^{r}_{l,i} \, \log\left(\frac{p^{r}_{l,i}}{\Delta x^2}\right)$$
This process is repeated R times to smooth out fluctuations, and in this way we obtain an average $\langle s^{rand}_{l,c} \rangle$. Here, we are interested in the limit of a large number of realizations R, in which the users speaking language l would be distributed at random within the local population (fully integrated). The reason to repeat the procedure, instead of using the distribution of the full population in a single run, is to maintain the effect of the finite number of users speaking l. The speakers of this community l can be more or less concentrated in certain areas than the general population. To assess this effect, we define for each city c and detected language l the ratio:
$$\hat{h}_{l,c} = \frac{s_{l,c}}{\langle s^{rand}_{l,c} \rangle}$$
To make the metric comparable across cities, we further normalize $\hat{h}_{l,c}$ by the value obtained for the local language(s) spoken in city c, $\hat{h}_{loc,c}$ (Table E in the S1 File of the Supporting Information). If more than one local language is present in the city, the data for all these languages is aggregated to obtain a joint value of $\hat{h}_{loc,c}$. The final definition of the ratio of entropies is thus:
$$h_{l,c} = \frac{\hat{h}_{l,c}}{\hat{h}_{loc,c}}$$
In this way, the information provided takes the local population as baseline and will inform us whether a specific group is spatially segregated or not. According to this definition, $h_{l,c}$ is close to one when community l is spatially distributed like the local population, and it decreases towards zero as the community becomes more concentrated.
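The construction of the edge weights can be sketched in a few lines. The code below is an illustration under assumptions rather than the implementation used in the study: the function names are hypothetical, the entropy is written in the scale-corrected Shannon form with the cell area as a parameter, and the null model draws users with replacement from the residence cells of all users in the city.

```python
import math
import random
from collections import Counter


def spatial_entropy(cells, cell_area=1.0):
    """Entropy -sum_i p_i log(p_i / area) of a list of residence cells (one entry per user)."""
    n = len(cells)
    return -sum((k / n) * math.log((k / n) / cell_area) for k in Counter(cells).values())


def entropy_ratio(community_cells, all_cells, realizations=200, rng=None):
    """Entropy of the community divided by the average entropy of random samples.

    Each realization redistributes the same number of users over the cells
    according to the distribution of all users in the city.
    """
    rng = rng or random.Random(0)
    n = len(community_cells)
    s_rand = sum(
        spatial_entropy(rng.choices(all_cells, k=n)) for _ in range(realizations)
    ) / realizations
    return spatial_entropy(community_cells) / s_rand


def integration(community_cells, local_cells, all_cells, **kwargs):
    """Ratio for the community normalized by the ratio for the local language(s)."""
    return entropy_ratio(community_cells, all_cells, **kwargs) / entropy_ratio(
        local_cells, all_cells, **kwargs
    )
```

In this sketch a community whose users all live in a single cell has zero entropy and therefore a ratio of zero, while a community sampled at random from the whole population yields a ratio close to one. The average random entropy is zero only in degenerate cases (for instance, a city with a single occupied cell), which the sketch does not guard against.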

Evaluation of the migrant communities spatial distribution's accuracy
Twitter has the advantage of being a global source of data, but also the disadvantage of having several uncontrollable biases. Young people are usually over-represented [45,47], and the people belonging to the diverse communities most likely adopt the technology in different ways. If the use of geolocated Twitter is widespread in the host country, its adoption will depend on the maturity of the migrant community: second and third generations are more likely than first generations to behave as locals and to adopt technologies that are widespread in the host population. On the other hand, things may vary if the technology is already commonly accepted in the country of origin of the community. Certainly, there are communities that are not detected. According to the National Institute for Statistics of Spain (INE, http://www.ine.es), a total of 45,728 and 54,599 Chinese citizens were residing in the Barcelona and Madrid provinces, respectively, in 2016. Provinces are territorial divisions that enclose the urban areas and that loosely correspond to the area of analysis taken for our Twitter data. However, the number of users detected tweeting in Chinese with a valid residence cell is below the threshold of 30 and, therefore, this community does not appear in either of these cities. In the case of this particular group, there may be various reasons for this situation, including the relative novelty of Chinese migration to Spain, with most of these people belonging to the first generation, as well as the existence of alternatives to Twitter in China such as Sina Weibo. The important question here is thus not whether we find all the communities, but whether we are able to say something meaningful about those detected.
Going step by step, let us consider first the influence of the geographical area chosen on the structure of the bipartite network between language communities and cities, paying special attention to the weights of its links. For this, recall that we have selected areas of 60 × 60 km² around the barycenter of the 53 cities considered. These areas have been further divided into cells of 500 × 500 m², which are the basic units of the analysis. The 53 cities are large megalopolises; still, one can wonder whether a square frame of 60 km side is enough to cover all of them, or whether we are including rural areas that could pollute the results. To check the stability of the network as a function of the size of the city boundaries, we evaluate the relative error of the edge weights for different side sizes (20, 40, 80 and 100 km) using the original 60 × 60 km² frame as reference. In particular, the relative change $\epsilon_{l,c}$ of the link weights in the bipartite spatial integration network is computed as follows,
$$\epsilon_{l,c} = \frac{\left| h_{l,c} - h^{ref}_{l,c} \right|}{h^{ref}_{l,c}}$$
where $h^{ref}_{l,c}$ represents the edge weight for the 60 × 60 km² frame. Box plots displaying the distribution of $\epsilon_{l,c}$ values for different frame side sizes can be found in Fig 3a. The network weights are stable for frame side sizes ranging from 40 to 80 km. Beyond these values the differences increase, because the influence zone is either too limited or extends far away from the center into rural areas or other neighboring cities. The value of 60 km for the side size is thus a safe choice. It is also worth noting that the number of detected languages increases with the size of the frame. This number is, however, quite stable for box sizes ranging from 40 to 80 km (± 6% of the reference value). We perform the same analysis over the cell side size, taking the 500 m side as reference. Results are still quite stable when increasing the size to 1000 and 2000 meters, as shown in Fig G in the S1 File of the SI.
A next question to consider concerns the minimum number of users needed to obtain a stable measure of $h_{l,c}$. The number of users per community for whom we can detect a residence area is not very high (Table F in the S1 File of the SI), and in addition we have set a threshold of at least 30 users to accept the data of a community. Where does this value come from? To get a first impression of the effect that the number of users has on $h_{l,c}$, we select some of the most populous migrant communities, delete a fraction of their users at random and plot in Fig 3b the value of $h_{l,c}$ as a function of the remaining users. Every random extraction produces a different value of $h_{l,c}$, so in the plot we depict the average, with error bars obtained from the standard deviation. Besides, we mark with shaded areas the values between which $h_{l,c}$ lies for the extractions with the largest number of users. The results depend on the particular community, but in general the values of $h_{l,c}$ enter the shaded areas between 10 and 100 users; 30 corresponds to the middle ground in logarithmic scale. A more systematic check can be seen in Fig 3c. There, a scatter plot with every value of $h_{l,c}$ for language-city pairs is depicted as a function of the number of users associated with the particular community. Above 30 users, there is no longer a clear dependency between $h_{l,c}$ and the number of users, so the metric must reflect the spatial distribution of the communities. It is also possible to perform a more detailed check in a controlled environment by introducing a null model in which the local population is randomly but uniformly distributed across the grid forming the city, while the immigrant population can only appear in a subset $N_0$ of cells. In those cells the immigrants are also distributed uniformly and randomly. By tuning the number of immigrant users and $N_0$, one can explore how the metric $h_{l,c}$ reacts to finite numbers (see Fig 3d).
When the number of immigrants detected is smaller than $N_0$, they are indistinguishable from the local population and thus the ratio $h_{l,c}$ starts at one. As the number of immigrant users grows beyond $N_0$, the fact that their residence is restricted to a certain area of the city becomes evident and $h_{l,c}$ decays towards a fixed value. As can be seen in the inset of Fig 3d, the main control parameter of the null model is the ratio between the number of immigrant users and $N_0$. The curves showing $h_{l,c}$ as a function of the number of immigrants collapse when they are plotted as a function of this ratio. In general terms, the metric $h_{l,c}$ reaches a stable value once the number of immigrants is between 10 and 20 times larger than the number of cells where the community concentrates, $N_0$. This model is a worst-case scenario for testing $h_{l,c}$, since the immigrants are distributed uniformly, while in more realistic applications, if a ghetto exists, the concentration density will not be uniform. In this latter case, a lower number of users is required to measure the stable value of $h_{l,c}$.
Finally, we have also been able to compare the spatial distribution of the communities detected in three cities for which data from census offices was available. These cities are Barcelona, London and Madrid, and for the comparison we use data from the so-called Continuous Register Statistics in Spain and the Census Office in the UK. In the Spanish case, the information is collected when people residing in a certain area must inform the municipal authorities for tax purposes and to obtain social services such as health care. The smallest spatial units for this dataset are census tracts, so Twitter data must be translated into the same geographical units (see the Supporting Information for further details). We employ the Anselin Local Moran's I [55] to analyze the level of spatial correspondence of the main migrant communities. This metric provides information on the location, size and spatial coincidence of four types of clusters: a) high-high clusters of significant high values of a variable surrounded by high values of the same variable; b) high-low clusters of significant high values of a variable surrounded by low values of the same variable; c) low-high clusters of significant low values of a variable surrounded by high values of the same variable; and d) low-low clusters of significant low values of a variable surrounded by low values of the same variable. The details are included in the Supporting Information, but a summary with the most important results for a set of linguistic communities common to the three cities is shown in Table 1. The agreement between the location of the residence areas detected with Twitter and those registered in the census is in general good and significant, except for some of the immigrant communities, such as Arabic in Barcelona and Madrid or East-Slavic in Madrid, where the results lose significance and are compatible with a random distribution.

Power of integration
Once the limits of the data and the method to assess the spatial segregation levels of foreign communities have been checked, we can study what can be said about the way that the cities integrate the foreign groups detected in Twitter. To this end, and starting from the bipartite spatial integration network, we perform a clustering analysis based on the distribution of edge weights $h_{l,c}$. For each city c, the weights of the edges are sorted in descending order and stored in a vector $\tilde{E}_c$. This vector thus contains the information on how many foreign linguistic communities have been found in city c and quantifies how they are integrated. We can next compare the vectors $\tilde{E}_c$ of pairs of cities to assess whether they behave in a similar way with respect to the integration of external communities. Similarity metrics usually require the two vectors compared to have the same length. This difficulty can be overcome easily by adding zeros at the end of $\tilde{E}_c$ until reaching the maximum length observed in the network, $L_{max}$, which corresponds to London. We then perform a clustering analysis to find cities exhibiting a similar distribution of edge weights by using a k-means algorithm based on Euclidean distances. The results of the analysis are confirmed by repeating the cluster detection with a hierarchical clustering algorithm, which yields the same results (see Fig B in the S1 File of the SI). Fig 4a shows the three clusters (C1 in blue, C2 in red and C3 in green) obtained after applying the clustering algorithms. These three clusters are characterized by the different rate of decay of the entropy values in $\tilde{E}_c$, as can be seen in Fig 4b. The first cluster C1, which includes cities like London, San Francisco, Tokyo or Los Angeles, shows the slowest decay. These cities contain in general a number of communities that are spatially distributed closely mimicking the local population.
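The construction of the city vectors can be illustrated as follows. The function names and the toy weights in the usage note are hypothetical; the resulting equal-length vectors are what a k-means or hierarchical clustering routine based on Euclidean distances would take as input.

```python
import math


def padded_profile(weights, length):
    """Sort the edge weights of one city in descending order and pad with zeros."""
    profile = sorted(weights, reverse=True)
    return profile + [0.0] * (length - len(profile))


def city_profiles(edges):
    """edges: dict mapping city -> list of edge weights, one per detected language.

    Returns a dict mapping city -> vector of common length L_max.
    """
    l_max = max(len(weights) for weights in edges.values())
    return {city: padded_profile(weights, l_max) for city, weights in edges.items()}


def euclidean(a, b):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

For instance, with toy values `{"A": [0.5, 0.9, 0.7], "B": [0.2]}` city A keeps its three weights in descending order while city B becomes `[0.2, 0.0, 0.0]`, so a city with few detected communities is automatically far from a city with many well-integrated ones.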
At the other extreme, the cluster C3 comprises cities with few or no migrant communities and a high level of spatial segregation for the groups detected. In some cities of this cluster, such as Guadalajara or Lima, we could only detect the local languages after applying the filters. However, there are others, like Toronto, Miami, Dallas, Rome or Istanbul, for which the number of communities is comparable to that of the cities in the other clusters but the decay of the entropy is much faster. The communities in their respective $\tilde{E}_c$ are highly isolated in comparison with the local population or with similar communities in cities of C1.

Finally, there is a middle ground in the cluster C2, which contains cities such as New York, Paris, Philadelphia, Chicago and Sydney. We introduce a new metric in order to summarize the distribution of entropy and to assess the city's Power of Integration, $P_c$ (Table G of the S1 File of the Supporting Information). This metric is defined for each city from the following quantities: $L_c$ is the number of languages spoken in city c, $L_{max}$ is the maximum number of languages across the whole set of cities, $Q_2$ is the median value of entropy and IQR is the interquartile range, used as a measure of dispersion. $P_c$ is maximum when the median of the entropy ratio distribution is one or over, IQR = 0 and the number of languages hosted by city c is the maximum. At the other extreme, it tends to zero when there are no hosted languages, when the languages are spatially isolated with $Q_2$ = 0, or when IQR = 1, covering the full range of values. The top three ranking cities in each cluster according to the Power of Integration are displayed in Fig 4a. The full ranking of cities by their Power of Integration (Table G in the S1 File of the SI) shows that the metric is able to capture the spatial integration process within each urban area: cities belonging to cluster C1 have values of $P_c$ ranging from Tokyo's 0.41 to London's 0.79. The former city shows good integration of massive communities coming from South Korea, the Philippines and China. On the other side, the British capital shows almost full spatial mixing of a very large number of foreign communities.
Cities belonging to cluster C2 are characterized by values of $P_c$ ranging from Jakarta's 0.10 (a case that mixes segregation behaviors with spatial uniformity of most of the communities) to the 0.37 reached in the urban area of Philadelphia. There we find several communities that are uniformly spread within the city, whereas segregation mainly affects the Arabic-speaking community. The cluster as a whole shows the first signs of segregation in a scenario in which several communities are involved. Finally, cluster C3 groups cities in which the number of immigrant communities is low, the communities are not uniformly distributed within the urban area, or both, as shown by the very low values of $P_c$. Brussels's 0.01 is due to the low entropy values of the Turkish community in a scenario of few immigrant communities. Toronto, on the other hand, is characterized by a very high number of immigrant communities (comparable to cities found in cluster C2) that are not well integrated spatially within the urban environment. This leads to a $P_c$ value of 0.12. Note that the clusters are obtained directly from the similarity between the vectors $\tilde{E}_c$ of each city, and only later is their character explained by using the decay of the ratios $h_{l,c}$ in the vectors and $P_c$.
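The defining formula of the Power of Integration is not reproduced in the text above, so the following is only a hypothetical form chosen to match the limiting cases described there, not the authors' definition: it multiplies the fraction of languages hosted, the median capped at one, and one minus the interquartile range. The function name, the capping and the inclusive quantile method are all assumptions.

```python
import statistics


def power_of_integration(weights, l_max):
    """Hypothetical summary score in [0, 1] for one city.

    weights: entropy ratios of the communities hosted by the city.
    l_max: maximum number of languages hosted by any city.
    NOTE: this product form is an assumption consistent with the limiting
    cases in the text (maximum for full coverage, median >= 1 and IQR = 0;
    zero for no languages, median = 0 or IQR = 1). It is not the paper's formula.
    """
    if not weights or l_max <= 0:
        return 0.0
    if len(weights) == 1:
        q2, iqr = weights[0], 0.0
    else:
        q1, q2, q3 = statistics.quantiles(weights, n=4, method="inclusive")
        iqr = q3 - q1
    coverage = len(weights) / l_max
    return coverage * min(q2, 1.0) * (1.0 - min(iqr, 1.0))
```

Under this sketch a city hosting the maximum number of languages, all with a ratio of one, scores 1.0, and the score drops as fewer languages are hosted, as the median ratio falls, or as the ratios become more dispersed.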

Language integration network
The bipartite spatial integration network can also be projected onto the language side to gain insights into the level of integration of languages in the different countries (see Table H in the S1 File of the SI). We do the analysis at the country level because we assume that the integration of the immigrant communities is similar across the cities of the same country. When there is more than one city in a country, we take the average value of the entropy h_{l,c} to build the network. The best and the worst cases of integration are displayed in Fig 5 left and right. Before proceeding to the analysis, it is important to mention that English has been excluded from the network because of its role as lingua franca [56]. Moreover, English is dominant mainly in the worst links in terms of integration (see Fig C in the S1 File of the SI for more details). We select two thresholds for the level of integration of languages in countries. In the top set (Fig 5 left), the strong Power of Integration of UK cities (London and Manchester) gives the country a dominant role in integrating several communities in a spatially uniform way. Several patterns of uniform spatial integration appear, such as the Italian community in Venezuela and the Spanish-speaking communities in Germany, Singapore and Turkey. Turkey shows uniformly distributed communities of Spanish speakers (due to historical migrations of Spanish Jews dating as far back as the 15th century) and of Kurdish speakers (the largest ethnic minority in Istanbul). South-Slavic and East-Slavic communities keep their traditional presence in Russia and Germany. When the threshold on the link weights is increased, the UK leads in hosting diverse communities and some other patterns emerge, such as the German presence in Japan and the UK. By contrast (Fig 5 right), Arabic emerges as the most commonly spatially segregated community, followed by French-speaking communities, which appear to be spatially concentrated in other European countries such as Germany and Turkey.
Increasing the threshold further reveals more forms of segregation in Canada (East-Slavic, French and Tagalog), Australia (Malay and Japanese), Brazil (French) and the Philippines (Italian and Spanish). Note that segregation can occur at the two extremes of the economic spectrum: poor people may need to live in ghetto-like areas, but wealthier communities may also concentrate with respect to the general local population, as seems to be the case for the Italian- and Spanish-speaking minorities in the Philippines or the English-speaking community in Rome.
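The country-level projection and the selection of best and worst links can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names, the dictionary layout and the rounding rule for the band size are hypothetical. It relies only on the steps described in the text: averaging h_{l,c} over the cities of a country, excluding English, and splitting the ranked links into a top 10%, a next 10%, and the corresponding worst bands shown in Fig 5.

```python
from collections import defaultdict
from statistics import mean


def country_language_links(city_entropy, city_country, excluded=("English",)):
    """Average entropy per (language, country) link.

    city_entropy -- {(language, city): entropy h_{l,c}}
    city_country -- {city: country}
    excluded     -- languages left out of the network (English in the text)
    """
    grouped = defaultdict(list)
    for (language, city), h in city_entropy.items():
        if language in excluded:
            continue
        grouped[(language, city_country[city])].append(h)
    # Countries with several cities get the mean entropy over those cities.
    return {link: mean(values) for link, values in grouped.items()}


def rank_bands(links, fraction=0.10):
    """Split links into best / next-best / next-worst / worst bands.

    Each band holds `fraction` of the links (at least one link); the
    rounding rule is an assumption made for this sketch.
    """
    ranked = sorted(links, key=links.get, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return {
        "best": ranked[:k],              # top 10% (solid links)
        "next_best": ranked[k:2 * k],    # 10% to 20% best (dashed links)
        "next_worst": ranked[-2 * k:-k], # 10% to 20% worst (dashed links)
        "worst": ranked[-k:],            # bottom 10% (solid links)
    }
```

For very small networks the middle bands can overlap, so the sketch is meaningful only when the number of links is large compared with 1/fraction.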

Discussion
People are constantly moving within cities and countries, looking for jobs, experiences or simply better living conditions, and in doing so they face the challenge of integrating into the habits and laws of new local cultures. Migration flows have so far been studied by means of surveys and census data that cover topics ranging from the number of people living outside their country of birth to place of residence and features of the labor market. However, censuses and surveys have the disadvantages of a very high cost, geographical limitations and, typically, slow update frequencies. Recent works by experts in the area highlight the dire need for more agile data sources on the mobility and settlement patterns of immigrant and refugee communities. Rather than using these classical sources, in this work we explore the capability of online social networks to provide information about the integration of immigrant communities. In particular, we use Twitter to connect users to their place of residence and, via a language algebra, to determine their cultural background. This allows us to study how the spatial and linguistic characteristics of people vary within the cities they live in, and how cities spatially integrate the diversity of languages and cultures characteristic of today's global metropolises.

It is necessary to acknowledge the potential biases of the data: social network penetration differs across socio-economic groups, ages, generations and countries. This is precisely the reason why we do not detect all the possible communities in the cities under consideration. Still, we have introduced a method comprising a metric that is not very sensitive to small numbers of detected users. As the validation exercise shows, in the cities where we can compare with the census the results are significant for communities with more than 30 users. In general, this method measures how well different communities are spatially integrated or segregated within urban areas. Our findings provide a new way to observe the patterns of historical immigration to urban areas, and any potential changes that might arise in the areas of residence.

Fig 5 (caption, partial). In addition, we include an extra 10% of links (dashed lines) in the network, those between the 10% and 20% best links (their spread is shown in boxplot (b)). In the network, only nodes that belong to the top set are highlighted. Similarly, on the right, the worst levels of spatial integration of languages in countries are shown. We filter out the bottom 10% of links according to the entropy distribution (their spread of values is shown in boxplot (c)), and add an extra 10% of links to the network (dashed lines), those between the 10% and 20% worst in the ranking. Their spread is shown in boxplot (d). As before, only the nodes that belong to the worst set are highlighted. https://doi.org/10.1371/journal.pone.0191612.g005
We are able to move beyond the estimation of past, current and projected global flows toward a better understanding of integration phenomena at the city scale. Residents' online communications thus let us assess, in an indirect way, whether the cultural background has been kept inside communities, although this is affected to different degrees by local welcoming and hosting policies. This method adds a further tool to the toolkit of researchers in sociology and urbanism, and offers a direct, close to real-time view of the potential integration problems that may appear in different areas of cities, knowledge that can be of great value to public managers.
Supporting information
S1 File. PDF file containing the SI. This file includes 10 tables (Table A: