The availability of reliable socioeconomic data is critical for the design of urban policies and the implementation of location-based services; however, often, their temporal and geographical coverage remain scarce. We explore the potential for insurance customers data to predict socioeconomic indicators of Swiss municipalities. First, we define a features space by aggregating at city-level individual customer data along several behavioral and user profile dimensions. Second, we collect official statistics shared by the Swiss authorities on a wide spectrum of categories: Population, Transportation, Work, Space and Territory, Housing, and Economy. Third, we adopt two spatial regression models exploring both global and local geographical dependencies to investigate their predictability. Results show consistently a correlation between insurance customer characteristics and official socioeconomic indexes. Performance fluctuates depending on the category, with values of R2 > 0.6 for several target variables using a 5-fold cross validation. As a case study, we focus on predicting the percentage of the population using public transportation and we discuss the implications on a regional scope. We believe that this methodology can support official statistical offices and it could open up new opportunities for the characterization of socioeconomic traits at highly-granular spatial and temporal scales.
Citation: Donadio L, Schifanella R, Binder CR, Massaro E (2021) Leveraging insurance customer data to characterize socioeconomic indicators of Swiss municipalities. PLoS ONE 16(3): e0246785. https://doi.org/10.1371/journal.pone.0246785
Editor: Shihe Fu, Xiamen University, CHINA
Received: June 12, 2020; Accepted: January 26, 2021; Published: March 3, 2021
Copyright: © 2021 Donadio et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Statistical data used in this research are available online at the Swiss federal statistical office website https://www.bfs.admin.ch/. The insurance costumer data can be provided for research purposes by the insurance company "La mobilière" https://www.mobiliere.ch/.
Funding: This study was supported by La Mobilière Insurance, Characterization of Quality of Space of Urban System, in the form of funding awarded to the HERUS Lab for the salary of a postdoc for the duration of 2 years and through coupling the data of insurance customers with statistically available data (proposal reference 18219_181217). The funder had no further role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have read the journal’s policy and have the following competing interests: La Mobilière Insurance, Characterization of Quality of Space of Urban System, provided funding to the HERUS Lab for the salary of a postdoc for the duration of 2 years, as well as support through coupling the data of insurance customers with statistically available data. This does not alter our adherence to PLOS ONE policies on sharing data and materials. There are no patents, products in development or marketed products associated with this research to declare.
National Statistical Institutes (NSIs) play an important role in modern societies to release precise information on social, environmental or economical activities  in the form of a census. The census records key aspects such as the population living in an area, their age, gender, income, and it enables predictive scenarios to estimate the need for schools, residential homes or public services. Censuses are an essential component to estimate the indicators that measure the progress towards the realization of the Sustainable Development Goals . Official statistics on socioeconomic status are increasingly addressing a significant modernization of their production process, nationally and internationally . This is also due to the opportunities offered by the use of new data sources, such as mobile phone data , social media , satellite images , credit card transactions  and others [8–10]. One of the main challenges for the NSIs is to coherently integrate new and traditional sources of investigation, with an increasingly widespread orientation towards the construction of registers of integrated elementary data. Three important challenges arise: i) the data collection methodology and quality, ii) privacy and legal issues and iii) the processing, storage and transfer of large data sets. Data sources such as social media, and mobile phone records, do not have a well-defined target population, structure and quality (see Section 2 for a literature review) that make difficult to apply traditional statistical methods based on sampling theory. Privacy and legal aspects pose another challenge: the prevention of the disclosure of the identity of individuals is regulated and enforced by international laws, and ensuring an appropriate level of privacy is challenging in case of heterogeneous, and multi-source large scale data streams . Copyright and data ownership  provide a barrier to open sharing platforms. Moreover, data processing represents an additional challenge due to the technological difficulties in the storage and the transfer of large amount of heterogeneous information ensuring security . In this context, countries are increasingly favouring alternative means of gathering information, instead of the traditional techniques of sending out printed forms, interviewing people in person, or via the use of online questionnaires. Alternatively, they are looking to indirect means of collecting data, taking advantage of a wide spectrum of administrative data streams that act as a proxy for the variables of interest.
In this direction, customers insurance records represent a valuable input to model the socioeconomic substrate of cities, and an opportunity for policy makers and researchers to broaden the scope of their studies. Social scientists raised the issue of representativeness and sampling bias of large scale digital data. For example, in  the authors show how age, gender, ethnicity, socioeconomic status, online experiences, and Internet skills, influence the social network sites that users generally adopt. This has implications for the extent of the conclusions that a study could claim given a particular audience. Like census data, insurance customers records share a similar size, reliability, and structural complexity . However, they differ in their spatio-temporal granularity and collection costs. In fact, the information of insurance customers is collected constantly by the provider while the census runs generally with a multi-yearly frequency due to its organizational costs. A downside is the proprietary nature of customers records that could invalidate the possible benefits for a broader community. However, we embrace the vision of initiatives like Data Collaboratives: https://datacollaboratives.org. that propose a new form of collaboration, beyond the public-private partnership model, in which participants from different sectors, in particular companies, exchange their data to create public value. In this research, we develop a methodology to predict socioeconomic indicators at a city level using individual customers data from an insurance company in Switzerland. Swiss municipalities, sometimes also called communities, are the lowest administrative level in the country. The responsibilities of the 2.212 (as of 1 January 2019) Swiss municipalities is decided by each Cantons. These may include the provision of local public services such as education, medical and social assistance, public transport and tax collection. Their degree of centralization depends on the choice of the single canton. Municipalities are generally governed by a council headed by a mayor (executive power) and by the general assembly of all adult Swiss residents (legislative power). Many cantons leave the larger municipalities the option of opting for a city parliament. Swiss citizenship is based on the citizenship of a municipality. Every Swiss citizen is, first, a citizen of a municipality (right of citizenship of the city or of origin) and, then, of a canton (right of cantonal or indigenous citizenship). For all these reasons, our analysis adopts the municipality as a spatial unit of reference.
We propose a two-steps process to predict a wide range of socioeconomic indicators. First, we compute a set of behavioral metrics using customers activity logs concerning housing properties and vehicles insured by “La Mobili‘ere” in 2017. Second, we use those microeconomics indicators as explanatory variables on two spatial regression models to predict 12 socioeconomic indices of 170 Swiss municipalities for which we have reliable official statistical data as ground truth. In this work, we focus on indices related to six different categories, i.e., Population, Transportation, Work, Space and Territory, Housing, and Economy. We show that insurance data customers can represent a valid resource to model socioeconomic indicators at scale.
The rest of the paper is organized as follows. Section 2 describes how insurance data can benefit the urban data science research agenda and it provides an overview of the previous work in this area. Section 3 illustrates the two data sets that we use in this paper: insurance and census data. In Section 4 we describe the methodology and the modeling framework. Section 6 discusses the results presented in Section 5 and it provides a critical view on the limitations of the implemented approach. Finally, Section 7 summarize the importance and the impact of this research and it provides insights for future directions.
2 Related work
Researchers across various disciplines including sociology, demography and public health have been keen on examining how society functions observing populations at scale. Socioeconomic indicators of cities, which were investigated before the digital era, collected data using field studies  or surveys . The focus of those studies was mostly finding correlations between demographic factors differences in urban and suburban areas , crime rate , population health , residential segregation  or waste production . In all these cases the results were based on the active participation of individuals to surveys, that may have been affected by the tendency of respondents to alter their behavior knowing that they are monitored. On the contrary, the digitalization of the modern society allows to model human behavior by means of indirect and less intrusive data collection methodologies, that are often a byproduct of services designed for different purposes. In a scenario where the collectivity produces every day more digital footprints than we are able to process, the majority of the recent studies focused on the use of big data streams to predict and study socioeconomic indicators of cities and countries. An extensive body of work adopt digital traces such as cell phone records [4, 21], social media posts , vehicle GPS traces  or credit card transactions [7, 24] to model human dynamics at scale. For example, it has been shown that social media data can predict the interplay between demographic attributes and gender gap , the monitoring  and assimilation of migrants , unemployment  rate, or health outcomes . Social media are increasingly used for demographic attribute prediction . Exploiting user interactions with web search engines, Weber and Castillo  inferred gender, income, and ethnicity, while Weber and Jaimes  exploited the US census to highlight the differences in navigational and search patterns among several demographic groups. Gender and age are also inferred using call detail records from smartphone devices over a large population.
Our contribution belongs to this line of work, however, to the best of our knowledge, it explores for the first time the use of insurance customer records to predict census variables. Insurance data have been mostly used to study the impact of specific diseases [32, 33], to propose models of customers fraud detection [34, 35], to explore the correlation between census-based socioeconomic indicators and injury causes  or to evaluate disparities within health care systems .
In this work, we make use of two data sets: 1) the customer activity logs of a Swiss insurance company and 2) the official socioeconomic indicators of all the Swiss municipalities with more than 10,000 inhabitants. The indicators are collected by the initiative called Union des villes Suisse: https://uniondesvilles.ch/ that publishes statistics on 173 Swiss municipalities every year since 2006. In this research, we focus on a cross-sectional snapshot for 2017 . The two datasets have different spatial aggregations: while the information on the insurance customers is at zip code level; the socioeconomic indicators have been collected at municipality level. To match the spatial granularity, we restrict our analysis to the 170 municipalities that are present in both datasets. Fig 1 compares the spatial coverage in the two cases, while Table 1 highlights differences and similarities across several dimensions.
(A) Insurance customers at zip code level: we count around 1.35M units which represent nearly 15% of the Swiss population. (B) Number of inhabitants per municipality from the Union of Swiss Cities dataset. Note that the geographical coverage of the official statistics is limited to the Swiss municipalities with at least 10K inhabitants covering 9.35% of the country’s surface. To ensure a match between the two data sources, we limit the scope of the analysis to this subset of cities.
3.1 Insurance data
The dataset contains the housing and vehicles insurance policies of 1,341,328 anonymized customers of La Mobili‘ere who were active during 2017. La Mobili‘ere is a Swiss insurance group (brands: Die Mobiliar, La Mobili‘ere, La Mobiliare) that is organized as a holding company headed by a cooperative. The company was founded in 1826, making it the oldest private insurance company in Switzerland. With a market share of over 29%, it is the leader in the personal property insurance market. For each user, we have three classes of information: i) demographic, e.g., age, gender, zip code of the residential area, employment and civil status; ii) cars, e.g., how many cars are insured, the brand and the price of the vehicles, as well as the record of the claims and the respective compensations; iii) housing, e.g., the number of private buildings or houses insured, the price of the building and the logs of the claims. Customers’ details are aggregated at the level of the 170 municipalities for which we have official census data using the zip code as spatial reference; this step leaves us with 568,426 customers matching the geographical boundaries. Table 2 summarizes the information available.
3.2 Swiss census data
The official statistics for the Swiss municipalities are collected and made available online within the initiative Statistics of Swiss Cities that is is the result of a collaboration between the Union of Swiss Cities and the Federal Statistical Office (FSO) https://www.bfs.admin.ch/l. The report is published in the first quarter of the year and it presents varied facets of the urban life; we focus on six domains: population (p), transportation (t), employment (w), space and territory (s), housing (h) and economy (e). We collect a total of 86 indicators for each municipality: 11 for transportation, 29 for population, 11 for employment, 8 for space and territory, 18 for housing and 9 for economy. From the original dataset, we focus on the key target variables that are not redundant and that can be a proxy for quality of life in cities, such as the unemployment rate , use of public transportation  or investment in culture . As a result of this process, we restrict the analysis to two indicators for each domain: the fraction of foreigners and the rate of beneficiaries of social assistance (p), the number of private cars per person and the fraction of commuters using public transportation (t), the unemployment rate and the unemployment rate among women (w), the percentage of areas covered by buildings or green areas (s), the vacancy rate and the average area per inhabitant (h), and the municipal debt and fraction of investment in culture (e). The complete list of selected target variables is summarized in Table 3.
As a validation step, we test the representativeness of the insurance data along four dimensions: (a) total population, (b) percentage of foreigners, (c) percentage of population aged 20-40, and (d) percentage of population aged 0-19. Fig 2 shows, for each dimension, a scatter plot and the corresponding Pearson’s correlation coefficient computed using the official census data and the La Mobili‘ere customers base. A high degree of correlation (ρ = 0.91) can be observed for the total population variable, meaning that the insurance dataset mimics quite well the population distribution at the municipality level. Focusing on age, we observe a strong relation for the case of customers in the age range 20-40 (ρ = 0.8) while the correlation disappears (ρ = −0.05) for customers aged 0-19. This behavior is expected since children and teenagers are not usually the owners of insurance policies on vehicles or houses. Last, we observe a solid relation with the percentage of foreigners (ρ = 0.6).
Each point corresponds to a different municipality. On the top, we report the correlation between the total number of inhabitants and customers (A) and between the fraction of foreigners versus the fraction of insured foreigners (B). On the bottom, we show the correlation between the fraction of the population aged between 20-40 years (C) and 0-19 years (D) with the fraction of customers within the same age range.
It is worth noting that socioeconomic processes often manifest non-random spatial patterns that make close areas more similar than distant ones. Moreover, spatial effects do not apply only to the case of neighboring areas; on the contrary, a consistent body of literature in geography define spatial relationships between aerial units as a function of distance . Often this choice depends on prior knowledge about the area under study or a conceptualization of the interactions between neighboring locations with regards to the target quantity. In this work, we refer to the Moran’s I statistic  to assess the presence of spatial autocorrelation in the census variables. Moran’s I measures the global spatial autocorrelation of an attribute y measured over n spatial units using the following relation: (1) where wi,j are the spatial weights, with being the average across spatial units, and s0 = ∑i∑j wi,j. In our experimental settings, the spatial weights are computed using endogenous adaptive bandwidths with a Gaussian kernel function implemented in the Python package pysal: https://pysal.readthedocs.io/en/latest/. Table 4 shows how all the selected target variables are positively spatially autocorrelated, ranging from I = 0.56 to I = 0.8. This implies that municipalities that are closer in space tend to share similar socioeconomic characteristics.
In this section, we describe the methodological steps of our predictive pipeline. After a features selection module, we adopt multivariate linear regression to predict the socioeconomic indicators of interest using two spatially-aware models that capture the global and local geographical dependencies.
4.1 Features selection
The first step in constructing cost-effective predictors is to select the features that will best predict a given outcome variable. For each of the socioeconomic indicators in Table 3, we select a subset of explanatory variables from the initial pool of covariates summarized in Table 5 using the LassoLarsIC module available in scikit-learn see https://scikit-learn.org. To reduce model complexity and to prevent overfitting, LassoLarsIC adopts the Least Absolute Shrinkage and Selection Operator  (LASSO) model for fit and it relies on the Least Angle Regression  (LARS) and the Bayes Information Criterion  (BIC) for model selection, trying to find the right trade-off between fitting performance and the complexity of the model. We explore alternative regressors with variable section, in particular the Elastic Net: see https://scikit-learn.org approach that exploits an iterative fitting procedure along a regularization path, and the Multi-task LASSO: see https://scikit-learn.org method that adopts L1/L2 mixed-norm as regularizer. These approaches tend to select a wider range of variables with a minimal increase to the end-to-end performance for some of the target variables. In the final experimental pipeline, we adopt the LassoLarsIC implementation due to its ability to strike the right trade-off across the use cases under study.
note https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoLarsIC.html that relies on Least Angle Regression and the Bayes Information Criterion for model selection and to find a trade-off between the goodness of fit and the complexity of the model.
Since variable selection methods may suffer from model instability or potential bias in parameter estimates and confidence intervals (especially relevant in explanatory modeling), we implemented the methodology and practical suggestions proposed in [46, 47] to control for these effects. In particular, we aim at estimating the stability of the selection procedure to random perturbations of training samples. We implemented a subsampling without replacement routine that randomly selects 80% of the initial datasets and run the selection procedure on the subsample. The subsampling technique has been extensively studied showing its asymptotic consistency even in cases where the classical bootstrap fails . We performed 500 subsampling iterations and we computed the stability estimator proposed by Nogueira et al.  that is a frequency-based statistics, ranging 0 to 1 and monotonically increasing as the stability of the feature selection grows. The idea is that the stability measure is a linear function of the sample variances with a strictly negative slope. According to the proposed framework, stability values above 0.75 represent an excellent agreement of the feature sets beyond chance, between 0.75 and 0.4 intermediate to good agreement, while values below 0.4 represent a poor agreement.
4.2 Spatial Lag Model
To characterize the influence of neighboring spatial units, we implement a Spatial Lag Model  (SLM) where the local effects are encoded adding a term that contains a spatially lagged version of the dependent variable. SLM is an instance of Spatial Autoregressive Models where the additional term is treated as an endogenous variable. More formally, this can be expressed in matrix notation, as follows: (2) where y is the vector of observations on the dependent variable, X is the matrix of observations on the exogenous variables, W is the spatial weighting matrix of known constants, β is the vector of regression parameters and λ is the scalar autoregressive parameter. The variable Wy is typically known as the spatial lag of y.
In literature, different approaches have been adopted to find the most appropriate spatial econometric model specification given an empirical use case. In this work, following the discussion in , we use a theoretical rather than data-driven approach to guide the decision. In our scenario, we hypothesize the presence of global endogenous effects where the indirect influence of a spatial unit falls on the entire set of locations, producing high-order effects observable even in spatial units that are not directly connected. We adopt the Spatial Lag Model over different specifications such as the Spatial Error Model, to take into account this dimension. For completeness, we also run the Lagrange Multiplier  (LM) diagnostics based on the OLS residuals in their standard and robust forms that are at the core of the methodology described in . Consistently, the SLM specification was selected for two thirds of the dependent variables in Table 3 as the most likely to account for the observed spatial effects.
Focusing on the weighting scheme, we adopt a k-nearest neighbors (KNN) approach in which each spatial unit is connected to a fixed number of k closest neighbors; the hyperparameter k is optimized via a grid search to find the optimal value that maximizes the average R2 across all the target variables. We experiment with alternative distance-based weighting schemes that adopt kernel functions with adaptive bandwidth obtaining comparable results. We use the PySAL package spreg to implement the SLM in our experimental settings.
4.3 Geographical Weighted Regression
Geographical Weighted Regression (GWR) is a local form of linear regression used to model spatially varying relationships. In regression analysis, the strength and direction of association is indicated by the regression coefficients, with one coefficient given for each variable in the dataset. In GWR, instead of one global coefficient for each variable, coefficients are able to vary according to space. This spatial variation in coefficients can reveal interesting patterns which otherwise would be masked. The general formula for a GWR is an extension of Eq 3 where one regression is calculated for each point using spatial weights. (3)
The index i indicates the location of the city of interest. GWR basically fits a set of β coefficients for each location: (4) where Wi is the diagonal matrix of the spatial weights, unique to location i. There are various schemes for calculating the weights, nearest neighbors, cubic or exponential kernels. The weights are computed by the following: (5) where dij is the Euclidean distance between municipalities i and j and b is the bandwidth of the kernel that has to be chosen. For each municipality we calculate a vector of weights and then regress using the final formula (whether linear or with interactions), in order to estimate all indicators for each municipality. The parameters forming the GWR will be the focus of the analysis because of the non-stationarity of the problem. It is interesting to explore how the influence of certain explanatory variables changes from city to city and whether there are underlying tendencies. We estimate a bandwidth for each city and model. The optimal bandwidth is estimated by minimizing an information criterion; in practice, we adopt a corrected version of the AIC that, in contrast with the original definition, is a function of sample size . In more details, in a GWR model with a bandwidth b, the AICc is given by: (6) where is the estimated standard deviation of the residuals; n is the number of observations and tr(S) is the trace of the hat matrix S. The hat matrix is the projection matrix from the observed y to the fitted values . Each row of the hat matrix is calculated as follows: (7)
To test the performance of the predictive pipeline we refer to an out-of-sample validation where the estimation of the regression parameters and the hyper-parameter tuning are performed on a training set using a 80% split and cross-validation, while the predictive performance is tested on the hold-out data (remaining 20%). To cope with the heterogeneity of the population distribution in our sample and to allow to train and test the models with a sample that is representative of the entire spectrum of population size, we implement a stratification approach. It is worth noting that using a random sampling strategy instead, we could end up in the cross validation procedure with splits that contain only highly or low populated municipalities, introducing a bias in the evaluation pipeline. In this direction, we partition the municipalities in three classes: communes with less than 25k inhabitants, those between 25k and 100k, and those with a population higher than 100k. Note that there are five cities that belong to the last class: Zurich, Geneve, Basel, Lausanne and Bern, which represent the five main Swiss cities. In each round of the cross-validation, the procedure ensures that each fold is a fair representation of the whole distribution balancing the three classes. We adopt a 5-cross validation accordingly. In the results section, we report the average and the standard deviation of the models performance due to cross-validation.
In the first part of this section, we present the results of the features selection process for each of the target indicators. After applying the LassoLarsIC method the number of selected features spans from 2 (for the Housing indicator h1) to 9 (for the Economy variable e2). In the column Features of Tables 6 and 7 we present the selected features for each model. Overall, we observe a fair degree of robustness to random perturbations with the measure of stability that varies across dimensions. In particular, p1 (0.82) and t2 (0.77) show the highest stability that reaches an excellent level of agreement; w2, h2, w1, p2, e2, s2, t1, s1, and h1 cover a range between good (0.72) and intermediate (0.43) stability (variables are listed in decreasing order), while e1 (0.22) shows a poor agreement. This low value indicates how the model characterizing the municipal debt e1 is highly dependent on variations of the training set to define significant determinants. Consistently, e1 is also the indicator with the lowest performance in the predictive task, indicating how the insurance data is not really able to capture its behavior.
Switching the focus on the predictive task, Tables 6 and 7 summarize the results of the Spatial Lag Model for all the indicators. We present the direction and the intensity of the relations along with confidence intervals; significant determinants are marked in bold.
In the Population domain, the fraction of foreigners (p1) is positively (β = 0.46) linked to the demographic feature f4 that represents the fraction of foreigners customers of La Mobili‘ere and, in the same direction, to the fraction of women f7 (β = 0.265). Moreover, the percentage of people that receive social assistance (p2) is positively linked to the unemployment rate f1 (β = 0.298), and the average number of claims per car f15 (β = 0.157). We observe a negative relation with average price of the cars f9 (β = −0.417) and the average number of rooms f23 (β = −0.327) as indirect proxy for the social class.
In the Transportation domain, the number of cars per 1000 inhabitants t1 shows a negative relation with the unemployment rate f1 (β = −0.137) and the average class of furniture f21 (β = −0.138). A positive link is found with the market share f6 (β = 0.381), the average premium of the cars f19 (β = 0.165) and the average years of construction f27 (β = 0.154). Concerning the commuters that use public transportation t2 we observe a negative link with the market share f6 (β = −0.101) and the fraction of house owners f3 (β = −0.239). This could be explained by the observation that individuals living in rental houses show a higher frequency of ride-sharing use and commuting using public transportation than those who own their houses . A positive relation is found with the percent of insured cars f20 (β = 0.261), the 95th percentile of the class of the insured furniture f22 (β = 0.120), and the average insured sum per building f25 (β = 0.091).
Focusing on the Work category, the unemployment rate w1 is positively connected to a set of demographics features, primarily the fraction of foreigners f4 (β = 0.187), the fraction of women f7 (β = 0.206) and, as expected, the unemployment rate of the La Mobili‘ere customers f1 (β = 0.220). We observe an opposite relation with the average CCM of the cars f13 (β = −0.125), and the average number of rooms f23 (β = −0.212). For the case of the women unemployment rate w2, the dominant features are related to the economic characteristics of the items insured, being the average premium of the cars f19 (β = 0.286) in a positive relation and the average price of the cars f9 (β = −0.250) or average insured premium f33 (β = −0.268) linked negatively. These observations tend to indicate gender differences in the insurance sector. The fraction of foreigners f4 (β = 0.150) and the customers unemployment rate f1 (β = 0.151) behave accordingly to the previous case.
Within the Space and Territory category, both the variables percentage of building area s1 and percentage of green area s2 are negatively connected to the fraction of house owner f3 (β = −0.286 and β = −0.270 respectively).
In the Housing domain, the vacancy rate h1 appears to be positively related to the percentage of insured cars f20 (β = 0.167), while the average area per inhabitant h2 shows several positive links to the average age f2 (β = 0.142), the fraction of house owners f3 (β = 0.174), the market share f6 (β = 0.118), the average class of furniture f21 (β = 0.137) and the average number of rooms f23 (β = 0.313). Higher values for h2 corresponds to lower unemployment rate f1 (β = −0.161).
Finally, in the Economy category the municipal debt e1 is positively related to the average year of constructions of the buildings f27 (β = 0.236) that is in accordance with the literature where modern buildings have been considered a proxy for economic status . Moreover, the fraction of investment in culture e2 is negatively connected to the average year of the car f11 (β = −0.209).
It is worth noting that for a group of indicators, the corresponding predictive models identify significant relations with expected determinants: this is the case of the pair (p1, f4) where the fraction of foreigners is explained using the information on the nationality of La Mobili‘ere customers. A similar case happen for the pairs (p2, f1), (w1, f1), and (w2, f1). However, we think that these not surprising relations do not undermine the validity of the experimental settings for several reasons. First, the considered models identify alternative predictors that are complementary and cross-domain to the target indicators. Second, the observation that a variable constructed from a sample of customers of an insurance company is able to predict a census indicator at the national level is not trivial. This represents another suggestion of the validity of the data collected as a proxy for socioeconomic status. Third, to quantitatively evaluate the impact of these not surprising variables, we compare the performance of the original models with a variation where we remove them. The accuracy in terms of R2 for both the SLM and GWR models remains substantially stable for all the indicators, with an average penalty of 0.034 and 0.014 for SLM and GWR, respectively. Refer to Fig A.13 in S1 Appendix for a detailed comparison.
After the analysis of determinants, we focus on comparing the performance of the global (SLM) and local (GWR) spatial models to a standard multivariate linear regressor (OLS) to quantify to benefit of exploiting spatial relations. We measure the performance using the coefficient of determination R2. As shown in Fig 3(A), both the spatial models outperform OLS across target indicators, with a gain in performance up to 30%. It is worth noting that GWR is able to achieve satisfactory results across categories with R2 values ranging from 0.49 for s2 to 0.83 in the case of w1 or h2. This provides a hint on the potential of insurance customers data to characterize socioeconomic processes embedded in space. GWR is useful as an exploratory technique, as it allows the relationships between independent and dependent variables to vary locally and thus captures contextual factors; however, its usefulness as a prediction tool is debated when it comes to model generalizability. To test the ability to perform out-of-sample predictions we turn to stratified cross-validation as described in Section 4.4. As shown in Fig 3(B) and 3(C), we observe a decrease of the overall performance; however, especially for certain target variables, we are still able to achieve a reasonable performance on the validation set, for example, h2 = 0.6, w1 = 0.53, p2 = 0.49, and t2 = 0.49. The values of the performances of the models are also reported in the Tables A.1 and A.2 in S2 Appendix.
(A). Comparison of the performance between the spatial regression models (GWR and SLM) and standard multivariate linear regression (OLS). (B-C) Performance using stratified cross-validation for the full (black triangles), the training and the validation sets respectively. (B) Geographical Weighted Regression and (C) Spatial Lag Model.
At last, to shed light on the contribution of the insurance datasets, we refer to a set of baseline models in which each target indicator is predicted using the variables of the other categories from the census data For instance, let us model the fraction of foreigners p1 using the explanatory variables t1, t2, …, e2 from Table 3. In Fig 4 we report a comparison between the performance of the census-based baseline and the insurance-based models for the cases of SML and GWR. We observe overall a comparable performance using our approach, with the baseline having a positive delta of 0.019 on average across indicators. This is expected because the baseline is based on official census data where cross-correlation effects are present. However, it is worth nothing that in our reference scenario the census is not available and, as such, the baseline approach not feasible. The observation that insurance customers records are able to achieve comparable results is an additional proof of the potential of this approach.
In the first part of the paper, we showed how to predict a wide range of socioeconomic indicators using insurance customers activity logs. In this section we shift the attention to a specific use case that has a strong impact on urban mobility and citizens well being: the relation between commuting and public transport (the variable t2 in our settings). The use of public transportation is an important contributing factor to urban sustainability; it has a heavy environmental footprint reducing air pollution and traffic congestion, among the others. It has also positive financial benefits for families and communities as a whole, higher level of security and direct positive effects on well-being and healthier habits. We chose transportation to exemplify our data analysis as it is the third most important contributor to greenhouse gas (GHG) emissions in the European Union .
In Switzerland, transportation accounts for around 31% of the overall GHG emissions . As such, the question of which variables are able to predict the use of public transport is a key issue. In Table 6 and in Table 8 we report the values of the predictors for the global and the GWR models respectively (an analysis of the GWR statistics for the target variables is reported in the Figs A.1 to A.12 in S1 Appendix and in Tables A.3 and A.4 in S2 Appendix). While for the global model, the estimates of the parameters are the same across municipalities, in the GWR case, each municipality has its own local parameters. Accordingly, we report average, standard deviation, minimum, and maximum values as summary statistics. For the GWR, we observe a high variability in the intercept, this is mainly due to the high spatial autocorrelation. The GWR adapts the intercept so it is closer to its neighbors, and thus achieves higher accuracy. More detailed diagnostic information on the regression, such as the kernel bandwidth is provided in Table 9. Turning the attention to the coefficients, we observe that the fraction of customers that own a house (f3) is negatively correlated with the target variable: as expected, it has been observed that individuals living in rental houses show a higher frequency of ride sharing and public transportation adoption than house owners . As expected, also the percentage of insured cars (f20) is negative linked to the probability of commuting via public transport; the more cars an individual possesses the less she turns on the public system when it comes to mobility. Moreover, we observed a higher public transport adoption in major cities, e.g., Zurich, Basel, Bern and Geneva as shown in Fig 5(b). This is consistent with our analysis, in fact, the fraction of house owners (f3) is lower in major cities, where prices are higher and people tend not to settle and start a family. Another variable that seems to be significative is the market share (f6). We believe that the market share feature is representative information because even if, the insurance company La Mobili‘ere is a national company, is not used equally across the Swiss country because of the competition between different insurance companies. Moreover, having an insurance contract is mandatory also for renting an apartment, the information of the market share of a given company can tell us important information about a certain kind of population living in that area. One of the main characteristic of GWR is that the inferred relationships vary by locality, that implies each municipality has a different fitting performance R2 and coefficients. In Fig 5 we show spatial distribution of GWR accuracy in different regions. Mapping the local R2 values could provide a useful tool to identify areas where the independent variables might or might not explain the phenomenon under study. This could be useful, for example, to identify contextual anomalies that are linked to specific characteristics of a community. While we are able to achieve good results in several cities, the performance for the Grisons and Ticino cantons are low. These cantons are fairly small and isolated regions. For example, Ticino is highly influenced by the adjacency to Italy this influence is not captured by the model. Two of the main cities in Ticino; Lugano and Belinzona have a very low use of public transport as shown in Fig 5. Another interesting aspect is that the local R2 shows a clustered behavior, with adjacent areas having similar performance. Note that these clusters tend to match administrative boundaries and we can clearly distinguish regions such as Lausanne, Basel and St. Gallen (light blue), central Switzerland (orange) and the Valais (dark blue) in Fig 5. This phenomenon might be linked to the inherent diversity in the communities leaving in the different areas of the country.
(A) Spatial distribution of the local coefficient of determination R2 using GWR to predict the fraction of commuters using public transportation t2. (B) Comparison between predicted and actual values of the percentage of commuters using public transport.
The approach proposed in this paper has few limitations that should be carefully discussed. In details:
- Sample bias. Even if we showed a fair level of representativeness along different dimensions, the input dataset contains information only on the fraction of population that owns an insurance policy with a specific company. Several segments of the population are left out of the analysis, adding a validity bias in the results, especially for indicators that cover a wider spectrum of the society.
- Spatial granularity mismatch. Official statistics are available at the level of municipality and only for a subset of the communes, while the insurance customers data provides information at the finer granularity of zip codes. From one side, we have complete knowledge for a subset of the areas, while from the other side, a partial view with a higher coverage. Our analysis is restricted to the intersection between the areas covered by the official statistics and the insurance dataset and, given the spatial units aggregation adopted, it doesn’t provides tools to model the heterogeneity of social processes at a micro-level, e.g., neighborhoods in cities.
- Temporal evolution. In our analysis we currently focus on a static snapshot covering a year of statistics. However, socioeconomic conditions vary over time and in which extent and how fast this change is reflected into the insurance data records is something not explored yet.
- Data availability and privacy. The current approach is based on the assumption that customers data is available to the researchers to tackle relevant challenges that have a broad social impact. This raises two main issues related to privacy and the compliance to the current legislation especially in the European framework, and the sharing policy. Proprietary data is usually exploited for commercial advantages and profit within the organization, and not available to the broad scientific community. To ground a methodology to model social phenomena on the availability of proprietary data that is not in control of the policy makers raises few concerns on the actual implementation in a real scenario.
In this paper we proposed 34 different characteristics of individual socio-economic behavior quantifiable through the dataset of anonymized insurance customers, and then evaluated them on the example of Swiss municipalities. We showed that those quantities could be used for estimating economic performance of the regions in the country, as proposed geographical regression models technique demonstrated to perform well on the validation samples for predicting major official statistical quantities for different categories such as Population, Transportation, Work, Space and Territory, Housing and Economy at the level of Swiss municipalities. This approach is applicable in cases when official statistics are not available or they are inconsistent, and the experimental pipeline demonstrated its ability to reach comparable performance to a scenario with complete knowledge. Advantages and disadvantages of local and global spatial regression models have been discussed extensively, highlighting the the potential of insurance customers data to characterize socioeconomic processes embedded in space. In future work, we aim at applying our approach to the modeling of temporal variations, which is especially useful to study processes of urbanization and gentrification. We also aim at developing models for estimating attributes at finer geographical resolutions such as districts or neighborhoods.
The authors would like to thank La Mobili‘ere insurance for partially supporting this research and for providing the anonymized customers dataset.
- 1. MacFeely S. The Big (data) Bang: Opportunities and Challenges for Compiling SDG Indicators. Global Policy. 2019;10:121–133.
- 2. IEAG U. A World that Counts–Mobilising the Data Revolution for Sustainable Development. New York: United Nations. 2014.
- 3. Struijs P, Braaksma B, Daas PJ. Official statistics and big data. Big Data & Society. 2014;1(1):2053951714538417.
- 4. Pappalardo L, Pedreschi D, Smoreda Z, Giannotti F. Using big data to study the link between human mobility and socio-economic development. In: 2015 IEEE International Conference on Big Data (Big Data). IEEE; 2015. p. 871–878.
- 5. Antenucci D, Cafarella M, Levenstein M, Ré C, Shapiro MD. Using social media to measure labor market flows. National Bureau of Economic Research; 2014.
- 6. Watmough GR, Marcinko CL, Sullivan C, Tschirhart K, Mutuo PK, Palm CA, et al. Socioecologically informed use of remote sensing data to predict rural household poverty. Proceedings of the National Academy of Sciences. 2019;116(4):1213–1218. pmid:30617073
- 7. Hashemian B, Massaro E, Bojic I, Arias JM, Sobolevsky S, Ratti C. Socioeconomic characterization of regions through the lens of individual financial transactions. PloS one. 2017;12(11):e0187031. pmid:29190724
- 8. Giannone D, Reichlin L, Small D. Nowcasting: The real-time informational content of macroeconomic data. Journal of Monetary Economics. 2008;55(4):665–676.
- 9. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012. pmid:19020500
- 10. Zhou K, Yang S. Understanding household energy consumption behavior: The contribution of energy big data analytics. Renewable and Sustainable Energy Reviews. 2016;56:810–819.
- 11. De Montjoye YA, Radaelli L, Singh VK, et al. Unique in the shopping mall: On the reidentifiability of credit card metadata. Science. 2015;347(6221):536–539. pmid:25635097
- 12. Ritter J, Mayer A. Regulating data as property: a new construct for moving forward. Duke L & Tech Rev. 2017;16:220.
- 13. Zhang Y, Chen X, Li J, Wong DS, Li H, You I. Ensuring attribute privacy protection and fast decryption for outsourced data security in mobile cloud computing. Information Sciences. 2017;379:42–61.
- 14. Sivarajah U, Kamal MM, Irani Z, Weerakkody V. Critical analysis of Big Data challenges and analytical methods. Journal of Business Research. 2017;70:263–286.
- 15. Florescu D, Karlberg M, Reis F, Del Castillo PR, Skaliotis M, Wirthmann A. Will ‘big data’transform official statistics. In: European Conference on the QualityofOfficial Statistics. Vienna, Austria; 2014. p. 2–5.
- 16. Schnore LF. The socio-economic status of cities and suburbs. American Sociological Review. 1963; p. 76–85.
- 17. Dennison G, Dodd V, Whelan B. A socio-economic based survey of household waste characteristics in the city of Dublin, Ireland. I. Waste composition. Resources, Conservation and Recycling. 1996;17(3):227–244.
- 18. Cameron S. The economics of crime deterrence: A survey of theory and evidence. Kyklos. 1988;41(2):301–323.
- 19. Galea S, Freudenberg N, Vlahov D. Cities and population health. Social science & medicine. 2005;60(5):1017–1033.
- 20. Clark WA. Residential segregation in American cities: A review and interpretation. Population research and Policy review. 1986;5(2):95–127.
- 21. Gonzalez MC, Hidalgo CA, Barabasi AL. Understanding individual human mobility patterns. nature. 2008;453(7196):779. pmid:18528393
- 22. Podobnik V, Ackermann D, Grubisic T, Lovrek I. Web 2.0 as a foundation for Social Media Marketing: global perspectives and the local case of Croatia. In: Cases on Web 2.0 in Developing Countries: Studies on Implementation, Application, and Use. IGI Global; 2013. p. 342–379.
- 23. Santi P, Resta G, Szell M, Sobolevsky S, Strogatz SH, Ratti C. Quantifying the benefits of vehicle pooling with shareability networks. Proceedings of the National Academy of Sciences. 2014;111(37):13290–13294. pmid:25197046
- 24. Sobolevsky S, Massaro E, Bojic I, Arias JM, Ratti C. Predicting regional economic indices using big data of individual bank card transactions. In: 2017 IEEE International Conference on Big Data (Big Data). IEEE; 2017. p. 1313–1318.
- 25. Fatehkia M, Kashyap R, Weber I. Using Facebook ad data to track the global digital gender gap. World Development. 2018;107:189–209.
- 26. Zagheni E, Weber I, Gummadi K, et al. Leveraging Facebook’s advertising platform to monitor stocks of migrants. 2017.
- 27. Dubois A, Zagheni E, Garimella K, Weber I. Studying migrant assimilation through facebook interests. In: International Conference on Social Informatics. Springer; 2018. p. 51–60.
- 28. Llorente A, Garcia-Herranz M, Cebrian M, Moro E. Social media fingerprints of unemployment. PloS one. 2015;10(5):e0128692. pmid:26020628
- 29. Cesare N, Grant C, Hawkins JB, Brownstein JS, Nsoesie EO. Demographics in social media data for public health research: does it matter? arXiv preprint arXiv:171011048. 2017.
- 30. Weber I, Castillo C. The demographics of web search. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. ACM; 2010. p. 523–530.
- 31. Weber C, Johnson M, Arceneaux K. Genetics, personality, and group identity. Social science quarterly. 2011;92(5):1314–1337.
- 32. Augustin M, Reich K, Glaeske G, Schaefer I, Radtke M. Co-morbidity and age-related prevalence of psoriasis: analysis of health insurance data in Germany. Acta dermato-venereologica. 2010;90(2):147–151. pmid:20169297
- 33. Heyer K, Herberger K, Protz K, Glaeske G, Augustin M. Epidemiology of chronic wounds in Germany: Analysis of statutory health insurance data. Wound Repair and Regeneration. 2016;24(2):434–442. pmid:26609788
- 34. Rawte V, Anuradha G. Fraud detection in health insurance using data mining techniques. In: 2015 International Conference on Communication, Information & Computing Technology (ICCICT). IEEE; 2015. p. 1–5.
- 35. Nian K, Zhang H, Tayal A, Coleman T, Li Y. Auto insurance fraud detection using unsupervised spectral ranking for anomaly. The Journal of Finance and Data Science. 2016;2(1):58–75.
- 36. Bell N, Arrington A, Adams SA. Census-based socioeconomic indicators for monitoring injury causes in the USA: a review. Injury Prevention. 2015;21(4):278–284. pmid:25678685
- 37. Berkowitz SA, Traore CY, Singer DE, Atlas SJ. Evaluating Area-Based Socioeconomic Status Indicators for Monitoring Disparities within Health Care Systems: Results from a Primary Care Network. Health services research. 2015;50(2):398–417. pmid:25219917
- 38. Brechbühl B, Müller L, Tschirren M. Statistiques des villes suisses 2017. Union des villes suisses UVS, Swiss Federal Statistical Office (FSO); 2017.
- 39. Glaeser E. Cities, productivity, and quality of life. Science. 2011;333(6042):592–594. pmid:21798941
- 40. Kong L. Introduction: culture, economy, policy: trends and developments. Geoforum. 2010;31(4):385.
- 41. Khan G, Qin X, Noyce DA. Spatial Analysis of Weather Crash Patterns. Journal of Transportation Engineering. 2008;134(5):191–202.
- 42. MORAN PAP. NOTES ON CONTINUOUS STOCHASTIC PHENOMENA. Biometrika. 1950;37(1-2):17–23. pmid:15420245
- 43. Santosa F, Symes WW. Linear Inversion of Band-Limited Reflection Seismograms. SIAM Journal on Scientific and Statistical Computing. 1986;7(4):1307–1330.
- 44. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann Statist. 2004;32(2):407–499.
- 45. Schwarz G. Estimating the Dimension of a Model. Ann Statist. 1978;6(2):461–464.
- 46. Sauerbrei W, Buchholz A, Boulesteix AL, Binder H. On stability issues in deriving multivariable regression models. Biometrical Journal. 2015;57(4):531–555. pmid:25501529
- 47. Heinze G, Wallisch C, Dunkler D. Variable selection—A review and recommendations for the practicing statistician. Biometrical Journal. 2018;60(3):431–449. pmid:29292533
- 48. Chernick MR. Bootstrap Methods: A Guide for Practitioners and Researchers. Wiley Series in Probability and Statistics. Wiley; 2011. Available from: https://books.google.it/books?id=UxDKh5Spwp8C.
- 49. Nogueira S, Sechidis K, Brown G. On the Stability of Feature Selection Algorithms. Journal of Machine Learning Research. 2018;18(174):1–54.
- 50. Anselin L. Spatial econometrics. vol. 310330.
- 51. Pinkse J, Slade ME. THE FUTURE OF SPATIAL ECONOMETRICS*. Journal of Regional Science. 2010;50(1):103–117.
- 52. Anselin L. Lagrange Multiplier Test Diagnostics for Spatial Dependence and Spatial Heterogeneity. Geographical Analysis. 1988.
- 53. Anselin L. Spatial Regression Analysis in R–A Workbook. Urbana. 2005.
- 54. Hurvich CM, Simonoff JS, Tsai CL. Smoothing parameter selection in nonparametric regression using an improved Akaike information criterion. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1998;60(2):271–293.
- 55. Hoaglin DC, Welsch RE. The hat matrix in regression and ANOVA. The American Statistician. 1978;32(1):17–22.
- 56. Zhang Y, Zhang Y. Exploring the relationship between ridesharing and public transit use in the United States. International journal of environmental research and public health. 2018;15(8):1763. pmid:30115835
- 57. Horowitz I. City politics and planning. Routledge; 2017.
- 58. Van Fan Y, Perry S, Klemeš JJ, Lee CT. A review on air emissions assessment: Transportation. Journal of cleaner production. 2018;194:673–684.
- 59. Saboori B, Sapri M, bin Baba M. Economic growth, energy consumption and CO2 emissions in OECD (Organization for Economic Co-operation and Development)’s transport sector: A fully modified bi-directional relationship approach. Energy. 2014;66:150–161.