Delineating Geographical Regions with Networks of Human Interactions in an Extensive Set of Countries

Large-scale networks of human interaction, in particular country-wide telephone call networks, can be used to redraw geographical maps by applying algorithms of topological community detection. The geographic projections of the emerging areas in a few recent studies on single regions have been suggested to share two distinct properties: first, they are cohesive, and second, they tend to closely follow socio-economic boundaries and are similar to existing political regions in size and number. Here we use an extended set of countries and clustering indices to quantify overlaps, providing ample additional evidence for these observations using phone data from countries of various scales across Europe, Asia, and Africa: France, the UK, Italy, Belgium, Portugal, Saudi Arabia, and Ivory Coast. In our analysis we use the known approach of partitioning country-wide networks, and an additional iterative partitioning of each of the first level communities into sub-communities, revealing that cohesiveness and matching of official regions can also be observed on a second level if spatial resolution of the data is high enough. The method has possible policy implications on the definition of the borderlines and sizes of administrative regions.

t 11 is the set of pairs the same community under C and C ; t 01 is the set of pairs not in the same community under C but under the same community in C ; t 10 is the set of pairs in the same community under C but not under the same community in C ; t 00 is the set of pairs not in the same community under C and C ; and n 11 , n 01 , n 10 , n 00 are their respective cardinalities, and n 11 + n 01 + n 10 + n 00 = n(n − 1)/2. The R [3] and F [4] indices are then given by: R = 2(n 11 + n 00 ) n(n − 1) F = n 11 (n 11 + n 10 )(n 11 + n 01 ) (1) which are two ways of quantifying how well the partitions match pairs of locations. A perfect match between two partitions will have R, F = 1. For the case of two completely unrelated clusterings, both indices are in general strictly larger than zero, more so for R [4]. Therefore, to have a baseline, we calculated the average indices over 1000 random reshufflings of locations in given partitionings of administrative regions, denoted byR r andF r . The classical measures R, F are asymptotically invariant of the problem size n, and only weakly ninvariant for finite n, therefore being well suited for different data sets as in our case. However, their base lines can vary heavily (R between 0.5 and 0.95, F between 0 and 0.6), making their linearity and usefulness doubtful even after normalization [1]. To have a measure grounded in another, informationtheoretical approach, we also use the variation of information VI, defined as where H(C) is the Shannon entropy of partition C and I(C, C ) the mutual information between the two partitions C and C , where P (i) = |Ci| n is the probability that an element of T chosen at random belongs to community C i ∈ C, and P (i, j) = |Ci C j | n the probability that an element belongs to C i ∈ C and to C j ∈ C . The VI has mathematical properties that are in line with our general intuition of what "more different" and "less different" should mean for two clusterings of a set. Although the VI is bounded by log 2 n, we are not making use of its normalized form (division by log 2 n) since it does not allow comparisons between distances obtained on different data sets [1].

The Algorithm
To the extracted communication networks we apply a standard modularity optimization approach for community detection [5,6]. The approach scores all the edges of the network according to their relative strength with respect to the weight of the nodes they connect and aims to maximize the cumulative score inside the communities, preferring edges with a positive score and avoiding those with a negative score.
The particular optimization algorithm [7] (further referred as "Combo") is a novel enhancement of the technique used by [8]. The idea is an iterative improvement of the partitioning in terms of the modularity score, starting from a trivial case where all nodes are gathered into one community following three steps of improvement: 1) dividing a community into two new communities, 2) joining two communities into one, and 3) shifting a part of one community to another existing community. The technique of splitting the community for the purpose of the first and third operations is based on the Kernigan-Lin approach [9] in a way similar to the refinement procedure suggested by [5]. The algorithm effectively avoids getting stuck in local maxima and usually produces the best modularity scores compared to the majority of modularity optimization algorithms known so far including Newman's greedy optimization heuristic [10], Newman's spectral optimization with refinement [5], simulated annealing [11] and another fast aggregation algorithm for large networks recently suggested by [12] known as Louvain method. It allows to handle networks up to 5,000-10,000 nodes in a reasonable time which corresponds to the dimension of our networks. We use this particular algorithm here for an improved partitioning quality for the large networks, see Fig. S3 and Table S1. Although major qualitative properties of the partitioning remain the same and do not considerably depend on the particular algorithm used, see also section "Independence on the algorithm" below, as high as possible modularity scores are reached, together with an improved quality of boundary definitions and lowered levels of noise (less "islands").
The intriguing property of the modularity optimization approach is that the resulting network division has no predetermined number of partitions. Only the raw topological information of the input network determines the range of communities detected -the algorithm may detect any number of communities between 1 (the full network) and N , the number of nodes (where each community is made up of one single node). Further, the algorithm does not fix the sizes nor the distribution of sizes of the detected groups, and it is not limited by any spatial constraints. Note that on the one hand, in cases where the algorithm produces boundaries which match official boundaries this can carefully be interpreted as having a "natural" validation. On the other hand if the algorithm does not so, the reasons -apart from a genuine deviation of human interaction regions from official boundaries -could also include low population density near the border making boundaries visually floating but leaving modularity scores practically unchanged, and other possible minor statistical fluctuations. Due to such influences, the boundaries produced by the algorithm cannot always be treated as definitely exact, as they may be shifted slightly. However, the cores of detected regions have been shown to be stable [8].

Stability analysis
The findings presented in the main text are based on a particular partitioning algorithm, and on a number of data sources possibly featuring different measurement errors or modes of collection. Therefore, here we proceed with an analysis of the variations of outcomes from running the algorithm multiple times, and of the strength of the partitions under perturbations, to understand the robustness and limits of the approach.

Robustness of algorithm
The used "Combo" method involves a stochastic element and therefore does not necessarily produce the same partitioning result with every run [7]. To test possible variations, we ran the algorithm 10 times for Portugal. All 10 results gave exactly the same results, showing that for the case of large spatial networks of communications there are practically no variations.

Independence on the algorithm
To test the robustness of the partitions in respect to the clustering algorithms used, we applied three additional standard algorithms besides our "Combo" method: the Louvain method [12], Newman's greedy optimization heuristic [10], Newman's spectral algorithm with refinement [5]. Resulting clustering overlap indices with the administrative regions are shown in Table S1. Generally the indices show comparable levels of overlap, in most cases slightly increasing for partitioning with better modularity; while the index VI is lower when the similarity is higher. The finding of similarity to official regions is also qualitatively independent of the algorithm used, as well as the "natural balance" of the number of regions, see number of communities in Table S1. Finally, see Fig. S3, using the examplary case of Portugal (but holding for the other countries as well), the geographical cohesiveness of resulting communities is also confirmed by all algorithms -apart from some minor noise, resulting communities consist of spatially adjacent locations. Therefore, the main qualitative properties of the partitioning seem to be independent of the algorithm, but using higher performance algorithms helps improving the overall quality of the delineation of boundaries and reducing noise levels.

Robustness of network data
To probe the robustness of the resulting partitions to noise in the data collection, we ran community detection on several realizations of each network, each of which we perturbed with increasing levels of noise, and compared the results with official administrative areas.
We devised the perturbations based on a simple model for the formation of the network: we assumed that, for each of the N phone calls, source and destination nodes are chosen according to a multinomial distribution, in which connectivity likelihoods f ij are approximated by the observed fractionsf ij = N ij /N for boostrapping purposes.
As a simplifying step, we decided to treat each pair of nodes as independent of all other pairs, and thus governed by a binomial, which is the corresponding marginal distribution.
As a further step to ensure the numerical feasibility of the generation of each network, we approximated the binomial distribution by using the corresponding Gaussian, in accordance with the de Moivre-Laplace theorem.
Perturbed networks were thus generated as where n is a normal random variable with zero average and unit standard deviation, and w is an arbitrary weight which we varied to obtain a better understanding of the solidity of partitioning. When considering duration of calls D ij , the generated network incurs an additional factor: where D = ij D ij /N is the average duration of each call.
For each network, we varied w from 0 (original network) to 10.5; then, for each w we generated 10 networks, computed the corresponding community structure, and compared it to the original administrative area using the cluster overlap index R.
Results are shown in Supplementary Fig.S1. It should be noted that employing the alternative indices, F and V I, yields precisely the same qualitative behaviour.
In the case of Belgium, moderate perturbations (w 2) of the network result in sizable variations in the resulting partition, which stabilize when the noise increases. Even higher levels precipitate the overlap, as the underlying network is overcome by fluctuations. Similar analyses apply to Portugal and other networks.
The values of w at which steep slopes happen in Supplementary Fig.S1 can be interpreted as levels of noise which are just strong enough to shake subsets of links responsible for the stability of the partition; this point of view also justifies the corresponding net increase of standard deviation for R. However in all cases values of w for which the variations become considerable are substantially higher than the normal average value 1 expected for the random model introduced above.
The variations we observe in the resulting partitioning need not be strictly interpreted on the base of the multinomial model: while the latter describes noise from one particular process, variations in link weights can also do away with systematic biases from other sources, such as geographical variations in market share, or conspicuous deviations off ij , the observed fractions, from the "real" likelihood f ij . It is in this spirit that we can understand increases in R as the result of a more reliable network.    Figure S1. Cluster overlap index R comparing the noiseless partitions with partitions having different levels of noise, for Belgium and Portugal. For each noise level w we produced 10 realizations of the perturbed network. Markers and error bars denote the means and standard deviations of these realizations, respectively. Up to a noise level of 1, there is practically no difference in the produced partitions. For higher noise levels, R drops lower in Portugal than in Belgium because of the difference in spatial resolutions n. Figure S2. Partitioning of Portugal with different levels of noise. With increasing noise, pairs of partitions start to merge together. One can also notice the appearance of small disjoint "islands".