Spatial interactions in urban scaling laws

Analyses of urban scaling laws assume that observations in different cities are independent of the existence of nearby cities. Here we introduce generative models and data-analysis methods that overcome this limitation by modelling explicitly the effect of interactions between individuals at different locations. Parameters that describe the scaling law and the spatial interactions are inferred from data simultaneously, allowing for rigorous (Bayesian) model comparison and overcoming the problem of defining the boundaries of urban regions. Results in five different datasets show that including spatial interactions typically leads to better models and a change in the exponent of the scaling law.


Introduction
One of the pillars of the study of cities as complex systems is the existence of statistical laws that apply "universally" to urban regions in different locations [1][2][3][4]. Examples include the Zipf's law of city sizes, the gravitational law of population movement, and-the focus of this paper-scaling laws between observables y and the population x of cities. All these laws have their origin in the first half of the XX century and continue to be investigated in increasingly rich datasets [5][6][7]. In particular, the scaling law (1) was discussed for the area of cities since the 1940s [8], can be viewed as a form of increasing return to scale [1,9], and has been the subject of many recent studies [4,[10][11][12][13][14][15]. Originally, urban laws were seen as akin to the empirical laws of classical mechanics, the basis of a sociophysics theory [7,8]. A modern trace of this simplistic view is the fact that models and explanations of the origin of these laws are typically presented independently from the statistical analysis in support of their validity, e.g., the data analysis supporting (1) is based on straight-line fits of log y vs. log x regardless of the explanation for its appearance. This undermines the statistical nature of the laws (evident from the large fluctuations) and is unable to select between the many alternative models that "explain" their origin (which often predict different fluctuations and can thus be tested).
The need for careful data-analysis methods to investigate statistical laws in complex systems has been extensively discussed for power-law distributions such as Zipf's law [16][17][18] scrutiny is being applied to the methods used in scaling laws in urban systems [13,14,[19][20][21] and reveal the limitations of the traditional linear-fitting approach: it relies on several simplifying assumptions, it is unable to deal with y = 0 in the data, it makes it difficult to compare to alternative models and to assess whether the scaling is indeed non-linear (β 6 ¼ 1), and it treats each city equally so that results are sensitive to cut-offs and fluctuations in the data of the many small cities. These limitations motivated us to introduce in Ref. [19] a model of urban scaling that focuses on individuals instead of cities, effectively giving more weight to the largest cities. Fig 1 compares this and alternative fitting models for the dependence of the Gross Domestic Product (GDP, y) on the population of cities (x) in two countries. A limitation of data-analyses methods of scaling laws (1) that persists is that they ignore the crucial element of any urban data: their spatial component [1]. The importance of location for scaling laws has been recognized [11] and modelled [15], but not incorporated into the data analysis. Linear fitting and all methods proposed in Ref. [19] assume that observations in different cities are independent from each other and thus independent of their location. Not surprisingly, the scalings show spatially-correlated fluctuations [11] and are sensitive to the definition of city boundaries [13,14]. For instance, in the results in Fig 1 (left panel) we highlight one of Brazil's municipalities ("São Caetano do Sul"-SP) that lies within Brazil's largest metropolitan area (around "São Paulo"-SP). We see that the GDP of this municipality is much larger than expected by any of the models and it is natural to suspect that this is at least partially due to its proximity to other urban areas. This effect is enhanced by the fact that Brazil's data is aggregated according to administrative areas (municipalities), which often do not reflect connected urban regions. Still, the problem of defining appropriate urban areas is not trivial [5,14] and spatial proximity should play a role regardless of the chosen urban unit. In fact, Fig 1 (right panel) shows that in USA, where data is given for metropolitan areas, a similar effect appears (e.g., "San-José-Santa Clara"-CA close to "San Francisco"-CA, or "Trenton"-NJ between "New York City"-NY and "Philadelphia"-PA).
Here we propose the first framework to investigate scaling laws (1) that accounts simultaneously for the following three crucial points: (i) it is based on generative models (Sec. 2); (ii)  (11). The estimated scaling exponent β of the different models are shown in the caption. Cities close to large urban areas are highlighted. The City model gives more weight to larger cities and therefore leads to a value of β that is different from the linear fit [19]. it accounts for spatial interactions between different urban areas; and (iii) it allows for rigorous statistical analyses (Sec. 3), including model comparison and the inference of parameters. Results in 5 datasets from Brazil and USA show (Sec. 4) that, in most cases, models that account for spatial interactions provide a better description of the data and that the scaling exponent β depends on the spatial scaling, in agreement with previous observations [13,14] of the dependence of β on the urban unit.

Generative process
We are interested in modelling the process that generates data compatible with the scaling law (1). The starting point of our model is the widespread interpretation that Eq (1) reflects a change in people's efficiency (or consumption) depending on the amount of interactions available to them [12]. Accordingly, we consider a generative process in which tokens (e.g. a patent, a dollar of GDP, a piece of infrastructure) are assigned to (produced or consumed by) an individual person j with probability p p (j).
Consider j = 1, . . ., M persons living in i = 1, . . ., N cities, on which the population of the city i is given by x i and X ¼ P N i x i . A total of Y � ∑ i y i tokens are (randomly) assigned to the X persons. In the absence of any other information, this defines our first (null) model: (P) Per-capita model: All tokens Y are distributed with equal probability to all persons j as in a constant per-capita attribution, p(j) = 1/X. In this case, the probability p c that a token is attributed to city i is given by where c(j) is the city in which j lives and δ(x) = 1 for x = 0 (otherwise δ(x) = 0).
This model corresponds to a linear (trivial) scaling law, β = 1 in Eq (1). A super-linear β > 1 (sub-linear β < 1) scaling is obtained if a token is more likely to be assigned to someone living in a more (less) populous city. In this spirit, in Ref. [19] we assumed that the probability that a token is assigned to person j depends on the population around j as Here we generalize this idea to account for spatial interactions between j and other individuals j 0 that live in other cities (i.e., c(j)6 ¼c(j 0 )). We introduce a quantity A j , defined as the total attractiveness of individual j due to all its interactions, and use it as a weight on the probability of assignment of a token as where Z(β) is the normalization constant (i.e., P X j p p ðjÞ ¼ 1). If β = 1, the probability p p (j) is the same for all j as in the per-capita model and we recover Eq (2). For β > 1, p p (j) grows with the interactions A j in line with a super-linear scaling. For β < 1, p p (j) decays with A j in line with a sub-linear scaling.
The attractiveness of an individual A j certainly depends on a multitude of factors that could be included in the model, depending on data availability and research interest. Here, we focus on pairwise interactions a j,j 0 between individuals j and j 0 separated by a distance d = d j,j 0 . We obtain A j as the total interaction of j and all other individuals j 0 by summing over all j 0 The distance d j,j 0 � 0 does not need to be a distance in a mathematical sense and, in practice, depends on the availability of data. Below we use the geographic (geodesic) distance between cities (another natural choice would be the commuting time). The pairwise (spatial) interactions a j,j 0 is discussed below and will lead to three different specific models.

Spatial interactions
In order to explore the formalism above we now consider simple dependencies of the pairwise interaction a(d) on the distance d � d j,j 0 between two persons j, j 0 . In general, we are interested in functions a(d) that monotonically decay with d from a(0) = 1 to lim d ! 1 a(d) = 0. Choosing another value at a(0) leads to the same results because of the normalization of p p (j) in Eq (4). Our framework can be applied to any function a(d) suitable to model spatial relationships, data will reveal us which one is more suitable.
The simplest choice of a(d) is (C) City model: in which interactions occur only within the same city (d = 0). From Eq (5) we get A j = x c(j) , i.e., we recover the scaling law (1) and Eq (3) (the model of Ref. [19], Sec. 4.2).
Spatial interactions beyond city limits can be incorporated using more general functions a (d). Here we start this investigation with functions a(d;α) that depend on a single parameter α that is measured in the same units of d (e.g., km) and sets a scale for spatial interactions such that a(α;α) = 1/2 (i.e., at a distance d = α the interactions decay to a factor 0.5 of the interaction at the same city d = 0). Furthermore, we wish to recover the choice (6) in the limit of small α, i.e., a(d)!a C (d) in Eq (6) for α ! 0 + . Two choices of a(d;α) that satisfy these properties (and also a(0;α) = 1 and lim d ! 1 a(d, α) = 0 for any α) are: (G) Gravitational model: inspired by models of gravitational interactions (for large d the interactions decay as a * 1/ d 2 , one can also replace the power 2 by an additional parameter) [3,8,15].
(E) Exponential model: For α ! 1, the distances do not matter, everyone is equally linked to everyone else, and the P-model is retrieved. Altogether, the four models discussed above are summarized in Table 1

Likelihood
We now discuss how the likelihood of our models can be computed from the data. We assume that (x i , y i ) data is available at locations i = 1, . . ., N. We denote the locations i as cities but we stress that this does not need to correspond to any urban definition of cities as the spatial interaction between different regions can be accounted explicitly in our models by choosing an appropriate function a(d). We assume also that a measure of distance d i,i 0 between all pairs of cities is available (e.g., the geodesic distance between the centroid of the cities).
Besides their location (city), individuals are indistinguishable. Therefore, the probability p c (i) that a token is assigned to city i is given by a sum of p p (j) over persons j on city i (i.e. c(j) = i), which contains exactly x c(j) � x i terms where we used Eq (4) and consider that x i � 1 for all i. The last equation defines the attractiveness of an individual in city i as This can be thought also as the number of effective interactions available for an individual in city i so that A i = x i in the city model (6) and A i � x i otherwise (e.g., for the gravitational and exponential models). It depends only on the population x i of all cities and on the distances d i,i 0 between cities, e.g. through Eqs (7) or (8), and therefore A i can be computed independently of the data y i .
The expected number of tokens in city i is given by The probability of observing y i tokens in each city of size x i is a multinomial distribution This corresponds to the likelihood P(D|M, θ) of the data D � {y 1 , � � �, y N }-since the populations (x 1 , � � �, x N ) are fixed-for a given model class M and given parameters θ. It is convenient to write the log-likelihood as 3 Data analysis

General framework
The models described above contain strong simplifying assumptions Our focus on the scaling relationship led to the assumptions that individuals are identical and that the token assignments are independent. and therefore our approach here is not to test whether the data is compatible with them (we know it is not While in linear fitting the number of observations equals to the number of cities, our model focus on individuals j and tokens of output y (X = ∑x i , Y = ∑y i ) so that the number of observations is much larger and the expected fluctuations (for large cities) are much smaller than the fluctuations in the data. This accounts only to fluctuations of the (random) assignment of tokens and neglects fluctuations (present in the data) due to measurement imprecision and due to factors that are not part of our model.) but instead to compare the different models. This means that instead of the likelihood P(D|M, θ) that models generate the data D = {y 1  computed from the three terms in the right hand side: • P(D) depends only on the data, act as a normalization, and does not affect the choice between models.
• P(D|M, θ) is the likelihood and is evaluated numerically from Eq (13). This is facilitated by two observations: (i) the two first terms in the log-likelihood (13) are independent of the models so that the variation across M and θ depends only on the last term; (ii) in this last relevant term, the parameter α enters only in A i through the dependence on a(d) so that for a fixed α the dependence of the matrix d i,i 0 is reduced to the vector A i . It is thus computationally more efficient to fix α, compute A i once, and then consider variations in β.

Estimation of parameters θ = {α, β}
The best parameters θ = {α, β} of a given model M are the ones that maximize the posterior P (θ|D, M). Since the priors are constant, this is equivalent to the maximization of the log likelihood (13) in the space of admissible parameters set by the priors.

Model selection
In the comparison of the different model classes M we account for the fact that models have different (number of) parameters θ by computing P(M|D), or equivalently, the description length by integrating over all parameters θ of model M The description length D corresponds to the size (in number of bits, for based 2 logarithm) of the optimal encoding of data and model [22]. Since the priors P(θ|M) and P(M) are constant, the crucial computational step is the integration of the likelihood over the parameters θ. When the number of observations Y = ∑ i y i is large (often the case for relevant urban scaling analysis), the likelihood is expected to be sharply peaked around the maximum-likelihood parameters θ. In this case, the description length D is dominated by the maximum log-likelihood and further approximations can be used to compute D (e.g., the Bayesian Information Criterion). However, one should be careful using these approximations to compare nonnested models (e.g., G and E) and around parameters θ in which the priors are discontinuous (as in the relevant case of α = 0).

Data
We apply the models and data-analysis methods described above to five datasets from two different countries. For Brazil, the data on three observables y-GDP, deaths due to external (non-natural) causes, and deaths due to AIDS-in the year 2010 is given for thousands of municipalities (administrative boundaries). For USA, the data on two observables y-GDP and miles of roads-in the year 2013 are given for hundreds of metropolitan areas. The USA cases can be considered as the paradigmatic examples of super-and sub-linear urban scaling laws [12]. In both countries, the average distance between two urban units is of thousands of km. The results of our analysis are reported in Table 2. The data and codes used in this paper are available in Ref. [23]. The data was collected from censuses and governamental agencies, was used in Refs. [12,17], and is available with further documentation and all codes used in this paper in Ref. [23].

The effect of α
We start investigating the central question of this paper: does spatial proximity between cities help to explain the observations y studied in urban scaling? And, if so, does it affect the scaling exponent β? The results in Fig 2 demonstrate that the answer to both questions is positive in most (but not all) cases. The top row of the figure shows that the value of the (maximum likelihood) exponent β for a fixed α changes significantly with α. The bottom row shows that often (Brazil GDP, USA Roads, but not in USA GDP) the best model is observed for α > 0. In these cases, there is an interval in α for which the model with geographic distance has a larger likelihood than the α = 0 case, compatible with the idea that the spatial scales we are accounting in this interval are meaningful (i.e., distances of 10-100 km that are relevant to interacting people).
The addition of spatial interactions does not trivialize the urban scaling law, differently from the effect of city boundaries reported in Ref. [14]. In fact, the non-linear scaling exponent β is often enhanced by the spatial relation α, i.e., super-(sub-) linear scalings β > 1 (β < 1) in the usual approach (at α = 0) show an even larger (smaller) value of β for the maximum-likelihood value of α. For instance, for Brazil GDP the estimation of β in the non-spatial models are 1.05 (linear fitting) and 1.17 (city model) while in the spatial models it is 1.24 (gravitational model) and 1.21 (exponential model). The same effect is observed in the case of sublinear scaling in the data for USA Roads Lengths, see Table 2.

Comparing different models
In all our five datasets the models with non-linear scaling (C,G, and E, for which β 6 ¼ 1) are preferred over the per-capita (P) model (negative DD in Table 2). In four of the five datasets, the models with spatial interactions (α 6 ¼ 0 in the G and E models) are preferred over the one (C-model) that ignores it. The exception is the case of USA GDP, for which the estimated value of α is zero for the G model and very small (1.65 km) for the E model. The description length D of the C model is smaller than the one in the G, E models by 2 and 3 bytes, respectively, indicating that the largest likelihood of the data obtained with α = 1.65 in the E model is not sufficient to justify its increased model complexity.
The comparison of the Gravitational and Exponential models reveal that both show a very similar behaviour as a function of α (Fig 2), similar inferred model parameters α and β, and similar description lengths D (Table 2). This indicates that the conclusions are not very sensitive to the functional form of a(d), used to account for spatial interactions (as long as they satisfy the natural constraints we used to propose a(d)). The most important distinction we found is between models that ignore spatial interactions (linear fitting, C model, and α = 0) and those that account for it (α > 0 in the G and E models). α, β are the parameters in each model that best describe the data. D is the description length (15) of the model M (measured in bytes, B) and is used to compare different models (the smaller, the better). The description length is reported (in megabytes, MB) for M = P and the difference to DðM ¼ PÞ is reported as DD ¼ DðMÞ À DðPÞ for M = {C, G, E}. The uncertainty σ β in the estimation of β is σ β � 0.01 in all cases, computed using bootstrapping [19]. The data and codes used to obtain these results are provided in Ref. [23]. https://doi.org/10.1371/journal.pone.0243390.t002

Increased interactivity
We now investigate how the spatial models introduced here change the number of effective interactions of individuals. In the introduction we discussed how the GDP of cities close to large urban areas were underestimated. Our analysis reveals that spatial interactions were not a strong factor in the USA GDP data overall. This was different for Brazil GDP, where the best model is the Gravitational model with α = 14.6. For these parameters, in Fig 3 we show the increased attractiveness-or number of interactions, A i in Eq (10)-that individuals in different cities in Brazil experience. It fluctuates significantly from city to city because it is an intricate function of the location of all cities, but it is clear that smaller cities are more affected than larger cities. For the case of "São Caetano do Sul", the attractiveness of the inhabitants of this municipality is 43.6 times larger than assuming that interactions occur only within the municipality (i.e., A = 43.6 x for α = 14.6 in M = G). The GDP of this city is 11.0 BR$ (Billion reais), much larger than the per-capita expectation of 2.1 BR$. The city model improves this expectation to 2.7 BR  Table 2).
https://doi.org/10.1371/journal.pone.0243390.g003 $, still too low but better than the linear-fit estimation 1.6 BR$. The best spatial model (G model with α = 14.6 and β = 1.24) improves the prediction to 4.5 BR$. Therefore, we conclude that spatial interactions can explain a considerable amount of the GDP of this municipality, even more than the inclusion of the non-linear scaling (β > 0), but that other factors remain significant.

Discussions
We introduced models of urban scaling laws that account for spatial interactions between individuals in different locations and that allow for rigorous statistical inference and model comparison. Results in five databases reveal that spatial interactions between cities leads to improved models and change the estimation of the urban scaling parameter β. Our approach shows how the problem [13,14] of the effect of the definition of the urban unit (city boundaries) on scaling laws can be solved by including spatial interactions between different locations explicitly in the model and inference.
The framework introduced in this paper can be extended to account for more sophisticated models (of interactions), beyond the four simple models introduced here. This could include more detailed information about the proximity and connectivity between different urban areas (e.g., commuting time) and incorporate ideas proposed in models of scaling laws [15], in models of the growth of cities [2,3], and in methods to define boundaries of urban regions [5,14]. It would be interesting to use these models to explore datasets at different spatial resolutions (e.g., neighbourhoods) and when additional information on the population in each location is available. The crucial point is that additional parameters and models for interactions should be inferred from the data together with the parameter β of the urban scaling law, avoiding arbitrary choices and leaving to the data and model-comparison techniques the choice between different approaches.