Population changes in residential clusters in Japan

Population dynamics in urban and rural areas are different. Understanding factors that contribute to local population changes has various socioeconomic and political implications. In the present study, we use population census data in Japan to examine contributors to the population growth of residential clusters between years 2005 and 2010. The data set covers the entirety of Japan and has a high spatial resolution of 500 × 500 m2, enabling us to examine population dynamics in various parts of the country (urban and rural) using statistical analysis. We found that, in addition to the area, population density, and age, the shape of the cluster and the spatial distribution of inhabitants within the cluster are significantly related to the population growth rate of a residential cluster. Specifically, the population tends to grow if the cluster is "round" shaped (given the area) and the population is concentrated near the center rather than periphery of the cluster. Combination of the present results and analysis framework with other factors that have been omitted in the present study, such as migration, terrain, and transportation infrastructure, will be fruitful.


Introduction
Population change is a central precondition to be considered in policy making and urban planning. In urban areas with high population concentrations, decentralization policies may be designed to mitigate congestion and environmental problems [1]. In developing countries, rapid growth of the number of urban dwellers is forecasted to exacerbate water shortage [2]. In rural areas facing population aging and shrinkage, how to ensure convenience of public transportation [3] and health care services [4] is a crucial issue.
The choice of the residential location is a main determinant of spatial patterns of population changes over time. People have been suggested to choose the residential location by considering residential environment attributes such as the accessibility to workplace measured by commute distance [5][6][7], school quality [8,9], and the crime rate [8,10]. Residential mobility is also affected by the individual's life course and household attributes such as age and income [7,10], job change [5], marital status [11], the numbers of children and drivers [10], and home ownership [7,11]. PLOS  In addition to these factors, spatial characteristics of the city and inhabited areas, which shape socioeconomic and geographical environments, may also impact spatio-temporal patterns of population changes. For example, urban sprawl is considered to be a consequence of uncoordinated and unplanned urban development [12] and results in scattered spatial patterns of employment and residences in suburban areas [13][14][15][16]. These spatial patterns would cause a long commute time due to poor accessibility to workplaces [17]. In contrast, compact urban growth and the diversity of land uses within the region enhance the accessibility to both work and non-work activities [18,19]. If the accessibility to workplaces and other activities influences residential decision-making, spatial patterns of inhabited regions are expected to affect dynamics of population changes.
There have been studies relating the population size or its change to spatial patterns of urban areas. For example, the population size of a region was shown to obey a power-law relationship with the area of the region in Norfolk in England [20] (also see [21] for an analysis of approximately 70000 cities in the world). In 78 regions in Israel, the population growth rate in sprawl regions was higher than in compact regions, where the sprawl and compact regions were defined in part by the shape of their boundaries [22]. Fractal dimensions are also useful tools for relating the population size/growth and spatial patterns of residential areas. For example, the fractal dimension of the central part of Tel Aviv metropolis and its population size concomitantly increased over time, and the observed fractal dimension was larger than that of the wider Tel Aviv [23]. In 20 urban areas in the US, the fractal dimension and the population size were positively correlated [24].
To the best of our knowledge, past studies on the relationship between spatial characteristics of regions and population changes examined a single or a small number of metropolitan areas of interest. Therefore, it seems to be unknown whether the relationship between spatial characteristics of regions and population changes can be generalized to a large number of metropolitan and non-metropolitan areas, even within a country. To address this question, one needs longitudinal data of population density with a high spatial resolution. Remote sensing technologies and the recent prevalence of mobile phones offer promising data on population dynamics at relatively low cost [25][26][27]. For example, the spatial distribution of the number of workers estimated from mobile phone data closely matched the counterpart calculated from the US census data [28]. The population density can also be estimated from the amount of night-time lights in satellite imagery [29,30]. Such data enable estimation of short-term human mobility within a day or week [31,32].
However, the accuracy of data obtained with these technologies is unclear. Furthermore, the population dynamics estimated by these methods may be susceptible to changes in the accuracy and coverage of the technology over time. In the present study, we use population census data of Japan with a high spatial resolution measured five years apart. To date, census data are probably advantageous to mobile phone or satellite data in tracking long-term population changes with a high accuracy. In fact, census data have been used for evaluating the accuracy of other techniques [28,29].
We explore spatial factors that contribute to the population growth in local clusters of inhabited areas. We hypothesize that the shape of the cluster of inhabited patches significantly affects the population change in the cluster. To test the hypothesis, we carry out statistical analysis to relate population changes in a cluster over five years, from 2005 to 2010, to the cluster's shape and other demographic and socioeconomic variables. We resolve the aforementioned limitations of the previous studies by exhaustively analyzing clusters of inhabited areas across Japan and by using the census data with which the local populations are accurately estimated.

Data set
We used data obtained from the population census of Japan in 2005 and 2010; the census is conducted every five years. The data consist of demographic information on a grid of cells of 500 m × 500 m covering the entire Japan [33][34][35]

City clustering algorithm
To determine the boundary of an inhabited area, we applied the city clustering algorithm [21,[36][37][38]. The algorithm calculates the connected components of populated cells, i.e., cells that contain at least one inhabitant, where we have defined the adjacency of cells by the von Neumann neighborhood (i.e., each cell has four neighbors in the north, south, east, and west). To find the connected components, we used the "decompose.graph" function provided by the R package 'igraph' [39]. This function takes a list of the pairs of connected cells and returns the list of the connected components. We refer to each connected component as cluster. We obtained 24165 and 24707 clusters in 2005 and 2010, respectively. In the following analysis, we focused on population changes over time in the clusters identified in 2005, which we denote by c. In other words, we compared the number of inhabitants in each cluster c between 2005 and 2010. It should be noted that we did not use the clusters identified in 2010.

Dependent variable
We denote by n i (t) the number of inhabitants in cell i at time t. We investigated dynamics of the number of inhabitants in each cluster c identified in 2005 (Fig 1). To this end, we adopted i.e., the number of inhabitants in cluster c as of 2005, as the offset variable (see Eq (7)). In this manner, we aimed to compareñ cluster

Independent variables
We used the following independent variables for each cluster observed in 2005 to explain the population change between 2005 and 2010.
First, the area of the cluster (denoted by S and referred to as Area) is defined by the number of cells constituting the cluster. Second, the population density (referred to as Density) is equal to the number of inhabitants in the cluster divided by S.
We quantified the shape of the cluster by the following two indices. We defined what we refer to as Roundness, originally proposed in Ref. [40], as S divided by the area of the circle whose diameter is equal to the longest Euclidean distance between two cells belonging to the cluster. We measured the position of a cell by the two-dimensional coordinate of the center of the cell. For example, the clusters shown in Fig 2 have four cells and have the longest Euclidean distance equal to ffiffi ffi 5 p (in the unit of the linear length of a cell), yielding a Roundness value of 1.019. A cluster whose shape is close to a circle yields a large Roundness value. For a given S, the line-shaped cluster yields the smallest Roundness value. Roundness can be regarded as a simplified variant of the box-counting fractal dimension [41]. The second shape-related index, where L is the perimeter of the cluster. For a fixed S, Irregularity is small when the cluster is close to square-shaped. The perimeter was used for characterizing spatial patterns of urban regions [20]. Frenkel and Ashkenazi [22] applied Eq (3) to quantify the level of urban sprawl. We note that measures similar to Irregularity were proposed decades ago [42,43]. Because of the scaling relation S = L 2/Irregularity , Irregularity can be interpreted as the fractal dimension of the cluster [44,45]. We quantified the hypothesized efficiency of communication or transportation within a cluster by the following two indices. We defined the expected distance between uniformly randomly selected two inhabitants in the cluster by where d ij is the distance between cells i and j, and the denominator of Eq (4) is a binomial coef- 2 is the probability that randomly selected two inhabitants in cluster c belong to cells i and j. Because Eq (4) has the dimension of the length, it may give rise to multicollinearity with S in multivariate regression. To mitigate this potential problem, we divided Eq (4) by ffiffi ffi S p , which has a dimension of the length, to define We have assumed the normalization factor of ffiffi ffi S p to make Eq (5) dimensionless if clusters are two-dimensional (with a large Roundness and/or small Irregularity value). In fact, clusters may be line-shaped or fractal-like, in which case Eq (5) would have a dimension of the length to some power. However, we expect that Eq (5) is less correlated with S than Eq (4) is. Therefore, we adopted Eq (5) as a dependent variable and referred to it as characteristic length (CL). We also adopted the coefficient of variation, which is defined by the standard deviation divided by the mean, of the number of inhabitants in a cell belonging to the focal cluster. This index quantifies spatial heterogeneity in the distribution of inhabitants within the cell and is referred to as Heterogeneity. . We used the following two demographic dependent variables. First, Gender refers to the fraction of female inhabitants in the cluster. Second, we estimated the average age of the inhabitants in a cluster, referred to as Age, as follows. Because the data set did not contain the average age for each cell, we approximated it by the average age of inhabitants in the prefecture to which a cluster belongs. The average age of inhabitants in each prefecture is available from the prefecture-level population census data carried out in 2005 [46]. The prefecture of a cluster was defined as the prefecture to which the cell with the largest closeness centrality [47,48] in the cluster belongs. In the calculation of the closeness centrality, we regarded the cluster as a network in which a cell was a node and two nodes were adjacent if they shared a side. Using the R packages 'rjson' [49] and 'RCurl' [50], we submitted the latitude and longitude of the cell with the largest closeness centrality to the reverse geocoding service provided by National Agriculture and Food Research Organization, Japan [51] and detected the prefecture in which the cell was located. When the reverse geocoding service returned no output because the submitted cell was located in the sea or for other reasons, we used a different data set with which one can determine the prefecture to which cells of 1 km × 1 km belong [52]. The 1 km × 1 km cells in this data set and the 500 m × 500 m cells in the census data were coaligned with each other in the sense that the division of a 1 km × 1 km cell into four cells yielded four 500 m × 500 m cells in the census data. If the 1 km × 1 km cell to which the 500 m × 500 m cell in question belonged to multiple prefectures, we plotted the latitude/longitude of the 500 m × 500 m cell on the map provided by Geospatial Information Authority of Japan [53] and visually determined the prefecture. If multiple cells had the same largest closeness centrality value, we used the average latitude and longitude of these cells to determine the cluster's prefecture.
Although the procedure for calculating Age is complicated, we decided to include it in addition to Gender for two reasons. First, Gender and Age are not strongly correlated (see Correlation coefficients section for the result). Second, these variables are likely to impact the birth and death rates in a cell in different ways. As for Age, the birth rate is relatively high among women in their twenties and thirties [54]. Therefore, a cluster having a large fraction of individuals in reproductive ages is expected to have a relatively large rate of population growth. However, if the value of Age is even larger, the population growth rate within the cluster is expected to be smaller because the death rate increases with age [55,56]. The value of Gender may reflect the efficiency of matching between male and female depending on the sex-ratio balance. The extant results are mixed regarding whether a male-biased or female-biased sex ratio drives marriage squeeze [57,58]. However, to the least, marriage squeeze may negatively impact the fertility rate [59] especially in countries such as Japan, where people tend to have children after marriage. In fact, that percentage of children born out of wedlock in Japan has been around 2% [60] and much lower than in other countries [61].
As a socioeconomic factor, we used the fraction of workers in the tertiary industry in the prefecture to which the cluster belongs [46] and referred to it as Tertiary. We determined the prefecture of a cluster in the same manner as in the case of Age.

Regression models
For analysis of count data, a Poisson regression model is often used (e.g., [62]). This model assumes that the dependent variable (ñ cluster c ð2010Þ in the present case) obeys a Poisson distribution given by where the conditional mean μ c is determined by In Eq (7), Eq (2) is used as the offset variable, the logarithmic link function is used, β 0 is the intercept, β i (i = 1,. . ., 9) is a regression coefficient, X i (i = 3,. . ., 9) is the ith independent variable (i.e., Roundness, Irregularity, CL, Heterogeneity, Gender, Age, and Tertiary), and subscript c on the right-hand side indicates that the values of the independent variables are for cluster c.
In the Poisson regression model, the conditional mean of the dependent variable is assumed to be equal to its conditional variance. However, as we will show in Descriptive statistics section, the conditional variance of the dependent variable is considerably larger than its conditional mean for the present data. This situation is called the overdispersion, which we tested by running an overdispersion test [63,64] (see also [65] for the usage of the R package 'AER'). The overdispersion test is carried out based on the statistic Because the null hypothesis was rejected (Descriptive statistics section), we used the negative binomial regression model. A negative binomial regression model [62] assumes that the dependent variable obeys a negative binomial distribution given by where Γ(Á) is the gamma function, and θ is a parameter that is assumed to be the same for all clusters. In Eq (9), the conditional mean, μ c , is given by Eq (7). The variance of the distribution given by Eq (9) is m c þ m 2 c =y. To fit the model, we maximized the likelihood with respect to β i (i = 0, . . ., 9) (Eq (7)) and θ using the glm.nb() function in the R package 'MASS' [66].
The Area and Density variables obeyed long-tailed distributions (Fig 3(A) and 3(C); also see Descriptive statistics section). Therefore, in Eq (7), we logarithmically transformed Area and Density to improve linearity between the dependent and independent variables. In fact, the logarithm of Area obeyed a much less long-tailed distribution (Fig 3(B)), and the logarithm of Density obeyed a distribution that roughly looks like a normal distribution (Fig 3(D)). For these two independent variables, a 1% increase in Area (or Density) corresponds to a β 1 (or β 2 ) % increase in the number of inhabitants in 2010 in a cluster observed in 2005. For X i (i = 3, . . ., 9), an increase in X i by one unit increases the number of inhabitants exp(β i ) times. The distributions of these independent variables are shown in Fig 3(E)-3(K). We used the same offset term Eq (2) in the multivariate and univariate regressions.
We also searched the multivariate regression model that minimized the Akaike information criterion (AIC) among the models that had any of the independent variables as main effects and any of pairwise interaction terms between the independent variables. To avoid large variance inflation factor (VIF) values due to the pairwise interaction terms, we normalized all independent variables to have a zero mean [67]. We used the stepwise backward elimination method to find the best model, i.e., by sequentially excluding the least significant term in terms of the AIC [68].

Descriptive statistics
Statistics of the dependent, offset, and independent variables are shown in Table 1. We find that the area of a cluster, S, the number of inhabitants in a cluster, and the population density in a cluster are heterogeneously distributed, as suggested by large coefficient of variation (CV) values for these variables. Moreover, the skewness for these variables is large. This observation is confirmed by long-tailed distributions of these quantities shown in Fig 4. In the following statistical analysis, we restricted ourselves to the clusters whose areas are at least ten cells because the geometry of smaller clusters would be strongly affected by the spatial discreteness.
We ran the overdispersion test to confirm that the assumption of the Poisson distribution of the dependent variable was violated (p < 0.001). Therefore, in the following we report the result of the negative binomial regression model.

Correlation coefficients
The Pearson, Spearman, and Kendall correlation coefficients between pairs of independent variables are shown in Tables 2-4, respectively. The signs of almost all of the correlation coefficients are consistent across the three types of correlation coefficient. Table 2 indicates that log(Area) and Irregularity are strongly correlated (Pearson correlation coefficient = -0.794). This result is consistent with the positive correlation previously observed between the city size and the spatial compactness of the city measured by a fractal dimension [24,69]. However, we concluded that the multicollinearity problem was not present because the VIF values were sufficiently small (4.206 and 3.247 for log(Area) and Irregularity, respectively). In general, VIF values for independent variables should be less than 10, preferably less than 5, for multivariate regression analysis to be justified [70,71].

Regression analysis
The results of the negative binomial regression are shown in Table 5. The contributions of log (Area) and log(Density) were significant at the 0.1% level, Irregularity and Age at the 1% level, and CL at the 10% level. The other variables (i.e., Roundness, Heterogeneity, Gender, and Tertiary) were not significant. Table 5 also indicates that a 1% increase in Area and Density is associated with an increase in the number of inhabitants in a cluster in 2010 (as compared to 2005) by 0.0113% and 0.0227%, respectively. An increase in Irregularity, Age, and CL by 1% is associated with a decrease in the number of inhabitants in a cluster by 3.27 × 10 −4 (= 1-exp(-0.0327×0.01)) times, 2.40×10 −5 (= 1-exp(-0.0024×0.01)) times, and 3.62×10 −4 (= 1-exp(-0.0362×0.01)) times, respectively. Because the total population in Japan only changed by 0.23% between 2005 and 2010 (Data set section), the contribution of these factors to the population change is non-negligible.
The results for univariate regressions are also shown in Table 5. The signs of all the significant regression coefficients in the multivariate regression (i.e., negative binomial regression) were consistent with the results for the univariate regression, lending support to the results obtained from the multivariate analysis.
We carried out the model selection in terms of the Akaike Information Criterion (AIC) among the negative binomial regression models that were allowed to include any main effects and pairwise interaction terms. The regression coefficients of the selected model are shown in Population changes in residential clusters Table 6. The selected model contained all independent variables. The result that the main effects of log(Area), log(Density), and CL are significant is consistent with that for the multivariate regression. However, the main effects of Irregularity and Age, which were significant in the Population changes in residential clusters Population changes in residential clusters multivariate regression, were not significant in the selected model, while some interaction effects between other variables and Irregularity or Age were significant. This result implies that the effects of Irregularity and Age qualitatively depend on other variables. Lastly, the main effect of Heterogeneity, which was not significant in the multivariate regression, was significant in the selected model. On the basis of the results for the multivariate regression, univariate regression, and model selection, we conclude that the main effects of Area, Density, and CL are significant according to the different criteria. In other words, the population growth of a cluster is associated with an increase in Area, an increase in Density, and a decrease in CL. In addition, the main effects of Irregularity and Age were also significant in the multivariate and univariate regression (but not in the model selected by the AIC).

Summary
We searched for potential drivers of population changes in terms of demographic, geometrical, and other properties of a cluster of inhabited cells. Unsurprisingly, we found that the area and the population density of the cluster were positively correlated with the population growth rate.
In addition, we found that a shape parameter for the cluster, Irregularity, and the mean distance between inhabitants within the cluster, CL, had negative impacts on the population growth. Age also had a negative impact on the population growth. In contrast, the fraction of female inhabitants, Gender, and that of tertiary-industry workers, Tertiary, had no significant contribution. The present results suggest that the population change is predictable to a certain degree from spatial characteristics intrinsic to the cluster, irrespectively of demographic factors.

Effects of variables characterizing the shape and heterogeneity of a cluster
Roundness was significantly correlated with Area. This result is inconsistent with the previous result showing no significant correlation between the city size and the anisometry, where the anisometry was defined by the ratio of the length of the major axis and that of the minor axis of the ellipse including the city cluster [69] and hence similar to Roundness. This inconsistency may originate from the different terrains in different cities and countries, the pattern of centralization of the population to urban areas of Japan such as Tokyo [72] in the present study, or other reasons; we do not have a clear explanation. Because urban sprawl is often negatively associated with the compact city [17,22], it is intriguing to associate urban sprawl with Roundness or Irregularity. However, urban sprawl is not solely characterized by the shape of urban areas but also by a discontinuous development of suburban areas, which may reduce the intraand inter-region accessibility [13]. To relate our approach to urban sprawl, we probably need to consider relationships between different clusters and the role of each cluster in wider geographical regions.
CL had a negative impact on the population growth rate. By definition, CL is small when highly populated cells are located near the geographical center of a cluster (Fig 2(B)) rather than when they are located in the periphery of the cluster (Fig 2(A)). Therefore, our results suggest that a cluster's population tends to grow if many inhabitants are located near the center of the cluster. A previous study showed that the values of indices characterizing urban regions (e.g., Moran, Geary, and Gini coefficients) were sensitive to the distribution of inhabitants in a confined region [73]. The present study suggests that the spatial distribution of inhabitants may affect the population growth rate as well as such urban indices. Investigating this issue warrants future work.
We did not pay attention to the change in the shape of the cluster over years. In fact, processes of urban growth, which are characterized by, for example, the population size, economic performance, and development of transportation systems, occur in tandem with changes in the shape of urban areas (e.g., [23,74,75]). Socioeconomic factors reflected in the shape of urban areas may influence inhabitants' residential decision making, which may in turn change the shape of urban areas.

Effects of the population size of a cluster
We used the population of the cluster in 2005 as an offset variable, not independent variable. We additionally analyzed the following linear regression model with the population of the cluster in 2005 as an independent variable: logðñ cluster c ð2010Þ=n cluster c ð2005ÞÞ ¼ blogðn cluster c ð2005ÞÞ. The population of the cluster in 2005 was positively correlated with the growth rate in the cluster over the five years (β = 0.021, p < 0.001). This result is inconsistent with the previous studies showing a smaller growth rate for clusters with a larger population [36] and the lack of correlation between the population of administratively defined cities and their growth rate [76]. The reason for this discrepancy is unclear. It may be because of the different definitions of the cluster change in the two studies or the aforementioned centralization of the inhabitants to urban areas of Japan.

Comparison with the gravity model
The gravity model and its variants explain spatio-temporal migration and population changes in various data [77][78][79]. The statistical explanation of population changes that we have found is different from the mechanisms implied by the gravity model and its variants.
First, let us assume that the unit of analysis is a cluster. Then, the gravity model assumes that migratory population flows are influenced by the attractiveness (often identified with the number of inhabitants) of the origin cluster and the destination cluster. In contrast, we ignored any interaction between clusters. Therefore, in our statistical approach, the rate of population change does not depend on the population of different clusters, differently from the prediction obtained by the gravity model. We neglected effects of other clusters because we did not have migration data. However, this decision does not imply that migration effects are unimportant (see the Limitations section below for more discussion).
Second, the proposed mechanism is also different from that provided by the gravity model even if one uses a single cell as the unit of analysis and applies the gravity model to population dynamics within a cluster. Given the shape of a cluster, we found that the CL negatively impacted on the population growth rate. This result implies that the spatial distribution of inhabitants within a cluster affects the population growth rate. In contrast, the gravity model applied to population dynamics within a cluster would describe migration dynamics within a cluster. Because intra-cluster migration implies that the number of inhabitants is preserved over time, the gravity model applied to a cluster would not predict whether the population of the cluster tends to increase or decrease. In sum, the present analysis is orthogonal to what the gravity model aims to explain.

Limitations
An important limitation of the present study is that we did not have an access to migration data. In general, the population change is decomposed into the natural increase (i.e., births minus deaths) and the migratory increase (i.e., immigration minus emigration). Because the census data used in the present study did not include the information about the population flow, we could not distinguish between the natural and migratory increases. Another limitation is that some dependent variables (i.e., Age and Tertiary) were estimated at the prefecture level due to the lack of data at the level of single cells.
We did not consider other information such as land use as independent variables, either. For example, steeper slopes and higher elevations negatively impact on urban expansion [80,81]. Regarding transportation systems, the distance to highways and major roads negatively impact on urban expansion [80]. Network structures of transportation systems are also related to the urbanization [74,81]. For example, the treeness of street networks is negatively correlated with the metropolitan population [81]. Urban planning is also an important factor driving urban expansion. For example, Ref. [82] evaluated effects of urban master plans on urban expansion in Beijing between 1947 to 2008 and showed that the effects were positive in all periods. Further longitudinal analyses including any of these variables with an appropriate spatial resolution will be valuable.