User-generated content is a valuable resource for capturing all aspects of our environment and lives, and dedicated Volunteered Geographic Information (VGI) efforts such as OpenStreetMap (OSM) have revolutionized spatial data collection. While OSM data is widely used, comparatively little attention has been paid to the quality of its point-of-interest (POI) component. This work studies the accuracy, coverage, and trend worthiness of OSM POI data. We assess accuracy and coverage using another VGI source that utilizes editorial control: OSM data is compared to Foursquare data using a combination of label similarity and positional proximity. Using the example of coffee shop POIs in Manhattan, we also assess the trend worthiness of OSM data. A series of spatio-temporal statistical models is tested to relate the change in the number of coffee shops to home prices in certain areas. Overall, this work shows that, although not perfect, OSM POI data, and specifically its temporal aspect (changesets), can be used to drive urban science research and to study urban change.
Citation: Zhang L, Pfoser D (2019) Using OpenStreetMap point-of-interest data to model urban change—A feasibility study. PLoS ONE 14(2): e0212606. https://doi.org/10.1371/journal.pone.0212606
Editor: Michael Szell, IT University of Copenhagen, DENMARK
Received: October 23, 2017; Accepted: February 6, 2019; Published: February 25, 2019
Copyright: © 2019 Zhang, Pfoser. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Raw data of changelog can be downloaded from OpenStreetMap.org at http://wiki.openstreetmap.org/wiki/Planet.osm/full Home price data can be downloaded from Zillow.com at https://www.zillow.com/home-values/ Foursquare data can be crawled from API: https://developer.foursquare.com/docs/api/venues/search.
Funding: This work was supported by Department of Defense grant HM02101410004; the National Science Foundation, Grant No. 1637541; and George Mason University, Presidential graduate research scholarship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
The inexorable trend towards urbanization worldwide presents a pressing challenge to our understanding of the scale and speed of urban processes. Scale refers to the quantity of data involved given the changing spatial and temporal resolution of the data. Traditionally, urban theories examine a coarse level such as a whole city observed over decades. Nowadays we try to understand urban processes at the building and the citizen level. This trend creates a considerable data science research challenge. Complementing scale, speed refers to the analysis of data in real-time, which should lead to faster and more responsive decision making. These aspects give rise to the emerging data science field called Urban Analytics. As part of Urban Analytics we use novel types of data to evaluate contemporary and future cities through methods including GIS, Remote Sensing, Big Data and Geodemographics. In studying urban change, we try to assess and verify urban theories using data science methods. Urban change relates to the physical environment, including land use, infrastructure, business locations and other assets of a city. Such an effort necessitates basic data collection, which until recently has been the responsibility of public authorities. This creates a bottleneck with respect to scale and speed, since such efforts rely on considerable manpower and funding support. Updates to the data follow a 5+ year collection cycle. Studying urban change at the levels of granularity outlined above requires data of a higher spatial and temporal resolution. Depending on existing collection infrastructure and regulations, such data might not be available and/or trustworthy.
Our work tries to assess the suitability of user-generated content in the form of OpenStreetMap POI data as a means to infer urban change. Specifically, we will explore two aspects of the data: (i) accuracy and coverage and (ii) trend worthiness. The accuracy and coverage of OSM POI data is not always evident and has been the focus of a sizeable research community over the years (cf. Section 2). In this work, we compare OSM POI data to Foursquare data. Foursquare data, although a crowdsourced resource, exhibits some editorial control. Our comparison will use the name and location of POIs in both sources to reason about the respective coverage. Besides positional accuracy, a question we would like to answer is whether VGI is updated in a timely manner and can be used to capture trends and, in our specific context, urban change. To satisfy this so-called trend worthiness criterion, we assess whether VGI can match some commonly recognized trends or theories. Urban analytics often studies the interaction of an environment with social phenomena. Our approach is to use statistical modeling: we consider VGI trend worthy if the modeling error is acceptable while maintaining its statistical behavior. Such a test is useful in a social science context, since it would conditionally permit the use of VGI and, more generally, user-generated content in an otherwise data-poor environment. To this effect, we create statistical models based on population-based power law relationships in urban science. This underlying theory suggests that population growth is one of the fundamental parameters behind change to an urban environment. For example, some basic power-law functions can relate population size to urban phenomena such as electricity usage, salaries, road network length, intellectual output, and even the citizens’ average speed of walking and heart rate.
We evaluate a range of models in terms of their quality and how well they allow us to use OSM POI data as an indicator of urban change.
The remainder of this work is organized as follows. Section 2 discusses related work such as Power Law Relationships and how they capture urban phenomena as well as quality aspects and fitness-for-use of user-generated content. Section 3 describes the data sources, related collection methods, and provides some data visualizations. Section 4 outlines our methodology to assess data quality. Section 5 presents the results of the quality assessment, including error distribution, the performance of statistical modeling, and interesting observations such as “inverse coffee shop effects”. Section 6 concludes and gives directions for future research.
2 Related work
Our discussion of related work focuses on two aspects, (i) user-generated content and its quality and fitness-for-use issues, and (ii) urban science theory as related to our work.
2.1 User-generated content and quality
Given the advent of volunteered geographic information (VGI) and geospatial crowdsourcing, more and more relevant user-generated content becomes available. The most well-known geo-crowdsourcing effort is OpenStreetMap (OSM) (http://www.openstreetmap.org), colloquially referred to as the Wikipedia of Maps. OSM is a “free” vector dataset covering the entire planet. It started out as a road network dataset, but by now includes general spatial feature information (transportation networks, buildings, land use data) and, important for this effort, point-of-interest data. A concern with VGI and user-generated content is data quality. Given the lack of quality control in data collection, the error in the data could be substantial. OSM coverage and accuracy are for example examined in [9–12]. The work in  examines the accuracy and coverage of OSM by comparing it to British Ordnance Survey datasets. In this case, the error of OSM was found to be within 6m and the data had a 26% coverage. It should be noted that this study was conducted almost a decade ago. The authors of  assessed spatial coverage and ground-truth positional accuracy for five cities and towns in Ireland by comparing OSM data to Google and Bing maps. No method is proposed for POI data. A more systematic quality assessment is conducted in , which proposes metrics for geometric, attribute, semantic and temporal accuracy, as well as logical consistency, completeness, lineage, and usage. Most work, however, has focused on road networks, and comparatively little attention has been paid to POI data.
A recent OSM use cases survey  mentions an increasing number of data mining efforts in support of urban analytics. Urban form and function is but one area.  discusses different aspects of urban form and function and how they can be captured from user-generated content. Examples include map construction algorithms [15–17] and social media mining methods . [5, 19] developed algorithms to better infer land use from OSM. Some other studies evaluated the overall potential for using OSM to assess land use and land cover [20, 21]. Another study in  presents an Urban-Rural Index derived from OSM to create an objective understanding of rural and urban classification. To quantify urban change, business directories and geocoding have been used in various efforts [23–26].
2.2 OSM POI quality assessment
Focusing on city scale, various authors have examined POI data quality. In , the authors propose a spatial-semantic interaction methodology to analyze the internal reference of different types of features of OSM POIs. The method developed so-called variograms and clusters of different feature types. The similarity of two POI features is defined as the spatiotemporal co-occurrence of different feature types of the same POI. A spatial process statistic is calculated to see if the processes are independent or not. Similar to our challenge, the authors mention that a weakness in their approach is the absence of a reference dataset. Recently, fitness-for-use of POIs  is defined at the levels of geo referencing, i.e., is the location accurate and allows for inferring an unambiguous reference to a real-world entity. Unfortunately, the authors do not propose a systematic metric for their measure. An overview paper  compares different aspects of the quality of OSM data, such as coverage, accuracy, and historical change using methods developed by different researchers. This work includes a reference dataset, the BD TOPO database produced by IGN, and it compares the data to a dataset derived from Flickr. The ambition of this quality assessment is limited to accuracy and coverage. To the best of our knowledge, our work is the first study that assesses the “fitness-for-use” of OSM POI data for studying urban phenomena and change. While we also consider accuracy and coverage, our most important contribution relates to the examination of trend worthiness of OSM POI data as a means to assess urban change. We consider a POI dataset trend worthy if the relative rate of change in the number of POIs accurately reflects a change in the real world, while conceding that the number of POIs at any given time deviates from the actual number of POIs existing in the real world, i.e., actual coffee shops vs. coffee shop POIs recorded in OSM.
2.3 Power law relationships
The fundamental dynamics of cities can be related to power law scaling relationships, i.e., urban phenomena scale with population size based on a universal power law function $P = \alpha N^{\beta}$, in which $P$ is a specific kind of urban indicator, $N$ is the population size, and $\alpha$ and $\beta$ are scaling parameters. Based on the value of $\beta$, we can develop several growth models, which introduce different rates of growth for different cities. The authors here draw a comparison between the size of cities and the size of life forms and the energy demand each has to be sustainable. This work led to countless results utilizing this theory, including the shape of cities, mobility patterns, and urban metabolism theory.
Other works have examined coffee shops and their deep connection to our social life. Coffee shops are social capital , in that a good coffee shop can provide a stronger sense of community for the residents of a neighborhood. One can consider places such as coffee shops, plazas, market places etc. as “the heart of a community’s social vitality and the grassroots of democracy” . They represent a “Third Place” and are considered an essential part of society besides the workplace and home. With this social view of urban spaces,  provides a detailed review of how computing and analytics are augmenting people’s experiences of cities. The author argues that urban sensing and analytics will lead to considerable urbanization improvements. Some quantitative studies have empirically evaluated these visions. The authors of  investigate the effect of coffee shops on crime. The authors argue that coffee shops provide “an on-the-ground and visible manifestation of a particular form of gentrification… and lifestyle.”
With this theoretical background, we are confident in the use of power law relationships when using the change of POI data to study urban phenomena. We would like to point out that this work should not be considered the one and only approach to studying urban change, but rather proposes a method that leverages OSM POI data in this context.
Various datasets are used to provide for a sound quality assessment of OSM POI data. We assess (i) the accuracy and coverage of OSM as a POI data source by using/comparing it to respective Foursquare data, and we assess (ii) its trend worthiness by building statistical models that relate the change in coffee shops to a change in home prices. The following sections discuss each data source as well as the pre-processing steps needed to make the data actionable.
3.1 OSM coffee shop data
OSM allows anybody to freely edit a global map dataset. To keep track of the edits, OSM uses an independent data object called “changeset” to record changes from editing operations and users. Changesets record all changes such as tags, coordinates, and comments. Such a database then includes, for all OSM objects (nodes, ways, relations), the respective metadata information, as shown in Fig 1.
For our work, we are interested in changeset data that reflects actual real-world change, i.e., the addition or deletion of a coffee shop node in close temporal proximity to the actual opening or closing of the coffee shop. Since we do not have ground truth data, i.e., municipal records, we use quantitative information such as the monthly coffee shop count plot in Fig 2, which shows the overall change in OSM data.
Assuming OSM has matured as a dataset, the plot shows a slow growth in numbers after OSM’s inception, as not too many people were aware of it. Following 2010, the rate of growth in edits increases and after 2015 the growth slows again, suggesting that the POIs recorded in OSM start reflecting real world changes, i.e., coffee shops that opened were recorded in a timely manner. Based on these trends, we select coffee shop changesets for the period of 11/2014 to 11/2016 for our experimentation and study.
In a first step, we compute a seasonal average. Here, November, December, and January are defined as Winter, and so forth. Although not strictly correct, this definition has been used in  reporting on real estate pricing trends. We can define the seasonal coffee shop density as the number of coffee shops in a neighborhood divided by the area of the neighborhood (km²). We will use the term “Coffee Shop Density” in the remainder of this work. The changes to coffee shop density for each season are shown in Fig 3. We see that different neighborhoods seem to have different trends. While in most cases the coffee shop density is increasing, some neighborhoods, such as the Financial District and Soho, seem to suffer through periods of decreasing numbers. To better show different patterns between pairs of neighborhoods, a pairwise Pearson correlation (r score) is shown in Fig 4. About half of the neighborhood pairs are strongly correlated (Pearson’s r > 0.5). Only the Financial District is negatively correlated with most of the other neighborhoods. This mix of trends could also be an indication of the reliability of OSM data, i.e., the closing of shops is actually reflected in the crowdsourced OSM dataset.
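The seasonal aggregation and the pairwise comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the text only fixes Winter as November to January, so the remaining three-month groups, the function names, and the choice to label a winter by the year of its January are assumptions.

```python
from math import sqrt

# Assumed grouping: only Winter = Nov, Dec, Jan is stated in the text;
# the other seasons are inferred from "and so forth".
SEASON_OF_MONTH = {
    11: "Winter", 12: "Winter", 1: "Winter",
    2: "Spring", 3: "Spring", 4: "Spring",
    5: "Summer", 6: "Summer", 7: "Summer",
    8: "Fall", 9: "Fall", 10: "Fall",
}


def seasonal_density(monthly_counts, area_km2):
    """Average monthly coffee shop counts per season, divided by area.

    monthly_counts: iterable of (year, month, count) tuples.
    Returns {(season_year, season): shops per km^2}. November and December
    are assigned to the winter of the following year (an assumption) so
    that each winter stays contiguous.
    """
    buckets = {}
    for year, month, count in monthly_counts:
        season = SEASON_OF_MONTH[month]
        season_year = year + 1 if month in (11, 12) else year
        buckets.setdefault((season_year, season), []).append(count)
    return {key: (sum(vals) / len(vals)) / area_km2
            for key, vals in buckets.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Applying `pearson_r` to the seasonal density series of two neighborhoods gives one cell of a correlation matrix such as the one in Fig 4.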
3.2 Foursquare coffee shop data
We use Foursquare data as a means to verify the coverage of OSM POI data. Foursquare, a location-based social-media app, utilizes POIs as the location where the user “socializes” with friends. Users are encouraged to review and visit different locations as often as possible to claim them. Since POIs are a core data aspect of this app, it also exerts editorial control, i.e., POIs are actively curated. As such, we can consider this dataset a good reference dataset when it comes to evaluating crowd-sourced OSM data.
Unfortunately, Foursquare does not provide access to their entire database in a fashion similar to OSM. An API with limited service rates allows one to interact with the service and in our case to retrieve POI information. To account for some API limitations, we use two types of queries as detailed in Tables A and B in S1 Appendix. Type I uses the collected OSM data to retrieve all Foursquare POIs that have a matching label within a 50m radius. Type II uses a regular spatial grid (200m spacing) to retrieve all coffee shop POIs in relation to the centroid of each cell. This strategy ensures that fewer than 50 POIs are retrieved per request (Foursquare limit), while it also covers POIs that are not in the immediate area of our OSM POIs. The mapping of categories of OSM data and Foursquare data is shown in Table C in S1 Appendix. The resulting Foursquare POI dataset is obtained by fusing the results of these two query types. The location of all OSM POIs for query Type I and grid center points for Type II are shown in Figs 5 and 6.
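The grid used for Type II queries can be illustrated with a short Python sketch that generates cell centroids over a bounding box. It assumes the bounding box is given in a projected coordinate system with metre units; the function name and parameters are hypothetical, and the actual API requests are deliberately not shown.

```python
from math import ceil


def grid_centroids(min_x, min_y, max_x, max_y, spacing_m=200.0):
    """Centroids of a regular grid covering a projected bounding box.

    Coordinates are assumed to be in metres (projected CRS). Cells are
    spacing_m wide; the last row/column may extend past the bounding box.
    """
    n_cols = ceil((max_x - min_x) / spacing_m)
    n_rows = ceil((max_y - min_y) / spacing_m)
    return [
        (min_x + (col + 0.5) * spacing_m, min_y + (row + 0.5) * spacing_m)
        for row in range(n_rows)
        for col in range(n_cols)
    ]
```

Each centroid would then serve as the center of one venue search. To cover a full square cell from its centroid, the search radius has to be at least half the cell diagonal, i.e. about 141 m for 200 m spacing.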
In total, 851 Foursquare POIs were retrieved on June 30, 2018. Section 4 will show how Foursquare and OSM POIs match up. An interesting observation is that Foursquare has a very different definition of POI categories when compared to OSM. For example, while Foursquare has a “donut” category, OSM considers it “cafe” (Table C in S1 Appendix).
Ideally our ground-truth data should have a historical dimension. Unfortunately, neither Foursquare nor any other data sources besides OSM captures this aspect. As such, we are only able to compare the accuracy of the current POIs between sources.
3.3 Home prices
To model the relationship between coffee shop densities and home prices, we obtain a dataset from Zillow, a real estate listing Web site. Zillow provides a data analysis product called “Home Value Index” (https://www.zillow.com/research/zhvi-methodology-6032/), which captures home prices at different spatial granularities ranging from city to neighborhood levels. This data is based on listing prices posted on the web site, and as such, can also be considered crowdsourced. For Manhattan, different datasets are provided at the neighborhood level. We selected the “Median List Price Per Sq Foot” from the Home Value Index. This data is more resilient with respect to outliers and removes a house size bias. We calculate the seasonal average of “Median List Price Per Sq Foot”, which we refer to as home prices in the remainder of the paper. The fluctuations of this measure are shown in Fig 7. Seeing this data spatially, we observe that adjacent neighborhoods tend to have similar home prices. Fig 8 visualizes this relationship. Neighborhoods for which no data is available from Zillow, or which have no real estate market (Central Park), are left blank (white).
Our main objective is to assess the quality of user-generated content and argue for its use in urban science research. The following sections provide a detailed discussion of the overall methodology that is employed. Fig 9 provides an overview.
4.1 POI accuracy and coverage assessment
To assess the accuracy and coverage of OSM POI data, we compare it to a reference dataset. However, since there are no authoritative datasets available, we chose Foursquare POI data, which is crowdsourced data with some editorial control. Unfortunately, no historical versions of the data are available. Our dataset was retrieved on June 30, 2018. Comparing two POI sources that cover the same geographic area (Manhattan) and context (coffee shops) is not trivial, given a potential label mismatch (“Ben’s Cafe and Eatery” vs. “Ben’s”) and location uncertainty (a daunting problem for any user-generated content). In the following, we try to match POIs from both datasets using string similarity measures and location proximity.
4.1.1 Label similarity.
To compare POI labels between data sources, we selected the Longest Common Sub-sequence (LCS) method and the Levenshtein distance (cf. ). The Levenshtein distance has been used in similar contexts, e.g., . LCS gets its advantage over Levenshtein distance since it measures the difference of two strings that are compared to a common substring, while the Levenshtein distance uses one string as its reference. Given two crowdsourced data sources, it is unlikely that two labels for the same POI match up exactly. Both methods can provide an approximate match and can account for labels of varying length and spelling differences. Both measures are distance measures, which need to be converted to a similarity score. Similarity scores can be mapped to the interval [0, 1]. To formally define the similarity of two labels, the two Label Similarity Scores are defined as follows. (1) (2)
Here, $L_{LCS}$ is the length of the longest common sub-sequence and $L_{LD}$ is the number of changes needed to transform one label into the other (reference). $L_{FSQ}$ and $L_{OSM}$ capture the lengths of the Foursquare and OSM labels, respectively. Identical labels generate a score of 1 in both cases. Both functions are monotonically increasing as the length of a common sub-sequence increases (LCS), or the difference between strings decreases (LD), which makes both a valid and good similarity measure.
When calculating similarity, we have to consider some particularities of the data. In both datasets, common terms are “coffee”, “cafe”, and “cafeteria”, all referring to the same concept. Given the same POI, it might be labeled as “Starbucks” in one dataset and “Starbucks Coffee” in the other. In this case, its similarity scores would only be . As such we removed all commonly used terms from the labels to provide for a fair assessment (stop words). In addition, we also removed all white spaces and punctuation. The experiments use a rather strict similarity threshold of 0.9 for both methods (cf. Section 5).
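To make the label comparison concrete, the following Python sketch implements stop-word removal, the longest common sub-sequence length, and the Levenshtein distance. The two normalizations used to map the distances to [0, 1] are assumptions chosen to satisfy the properties stated above (identical labels score 1, and scores increase as labels become more alike); they are not necessarily the exact form of Eqs 1 and 2. The stop-word list contains only the three terms named in the text.

```python
import re
import unicodedata

STOP_WORDS = {"coffee", "cafe", "cafeteria"}  # terms named in the text


def normalize(label):
    """Lower-case, strip accents, drop stop words, spaces and punctuation."""
    text = unicodedata.normalize("NFKD", label)
    text = "".join(ch for ch in text if not unicodedata.combining(ch)).lower()
    tokens = re.findall(r"[a-z0-9]+", text)
    return "".join(t for t in tokens if t not in STOP_WORDS)


def lcs_length(a, b):
    """Length of the longest common sub-sequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def lcs_score(fsq_label, osm_label):
    """Assumed normalization: 2 * L_LCS / (L_FSQ + L_OSM)."""
    a, b = normalize(fsq_label), normalize(osm_label)
    if not a and not b:  # both labels consisted of stop words only
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))


def ld_score(fsq_label, osm_label):
    """Assumed normalization: 1 - L_LD / max(L_FSQ, L_OSM)."""
    a, b = normalize(fsq_label), normalize(osm_label)
    longest = max(len(a), len(b))
    if longest == 0:  # both labels consisted of stop words only
        return 1.0
    return 1 - levenshtein(a, b) / longest
```

With stop words removed, “Starbucks” and “Starbucks Coffee” receive a score of 1 under both measures, whereas “Ben’s Cafe and Eatery” and “Ben’s” stay well below the 0.9 threshold.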
4.1.2 Location similarity.
POIs are typically captured as point locations and the expectation is that the recorded coordinates do not match up exactly across data sources. To examine whether two coordinates capture the same POI (and without considering the label) one can use a buffer region to see whether one location is close to the other. The choice of a proper threshold is critical, as various coffee shops with the same name (chain) might be close by. Using a projected coordinate system, Euclidean distance can be used. Fig 9 gives an overview of the overall processing pipeline.
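The buffer test can be sketched as follows, assuming both POI sets have already been projected to a metric coordinate system. The function names, the dictionary layout, and the radius used in the usage note are hypothetical; the text does not state the matching threshold at this point.

```python
from math import hypot


def distance_m(p, q):
    """Euclidean distance between two projected (x, y) points in metres."""
    return hypot(p[0] - q[0], p[1] - q[1])


def candidates_within(osm_xy, fsq_points, radius_m):
    """Foursquare POIs inside the buffer around one OSM POI, nearest first.

    fsq_points: {poi_id: (x, y)} in the same projected CRS as osm_xy.
    Returns a sorted list of (distance, poi_id) pairs.
    """
    hits = []
    for poi_id, xy in fsq_points.items():
        d = distance_m(osm_xy, xy)
        if d <= radius_m:
            hits.append((d, poi_id))
    return sorted(hits)
```

For example, `candidates_within((0.0, 0.0), fsq_points, 50.0)` returns all candidates within an illustrative 50 m buffer. A label similarity check on these candidates would then resolve cases where several shops of the same chain are close by.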
4.2 Trend worthiness of OSM POI data
While the expectations are that OSM data is timely and has considerable coverage, it might not always perfectly match the real-world situation, i.e., accuracy and coverage might not always reflect reality. As such we want to assess whether such data captures overall urban trends and change using a statistical modeling approach.
We first introduce the concept of scaling relationships to model urban phenomena based on population size. This allows us then to relate change in coffee shop numbers and home prices to population and to eventually model the direct relationship between them. We use a range of spatial and temporal analysis methods and adjustments to try and improve the overall model fit. Section 5 will finally tell us the adjustments that work best and consequently the model that has the best fit.
4.2.1 Scaling relationship based on population.
The fundamental driving force behind urban change is human activity and population size. In , it is shown that larger metropolitan areas produce comparatively more wealth, innovation and activity following a power law function based on population size.
This power law scaling relationship is shown in Eq 3. Urban indicators differ in their scaling exponent β, which can be used to distinguish three types of urban phenomena. With (i) β ≈ 1 (linear growth) we can describe individual human needs, like jobs and water consumption, (ii) β < 1 (sublinear growth) characterizes infrastructure, and (iii) β > 1 (superlinear growth) signifies quantities related to social currencies, like innovation. The coefficient estimation is done using linear regression after a log transformation of the scaling relationship model (Eq 4). In the following equations, $N_t$ is the population at a certain time, $Y_0$ is the initial state of an indicator, $Y_t$ is the current state of an indicator, and $\epsilon_t$ is the error.

$$Y_t = Y_0 N_t^{\beta} \tag{3}$$

$$\log(Y_t) = \log(Y_0) + \beta \log(N_t) + \epsilon_t \tag{4}$$

Coming back to our problem of coffee shops and home prices, in  the authors found that coffee shops are a good indicator for gentrification. Gentrification is a common and controversial topic in politics and urban planning and refers to improving deteriorated urban neighborhoods through an influx of a wealthier demographic. Intuitively, the establishment of a coffee shop relies on certain population numbers (customers) to support it. To open a coffee shop, a much smaller investment is needed than, for example, for a supermarket. As such, the number of coffee shops is sensitive to relatively small changes in population numbers (volatility). It is easier to close a coffee shop than a supermarket when customers stay away. Another argument here is that coffee is a cheap commodity that is consumed frequently. Hence, it is not as susceptible to personal spending cuts as more expensive products such as entire meals, clothing or jewelry.
Power law relationship between coffee shops and home prices. Coffee shop numbers correlate with population numbers over time and place and can be treated as an indicator of human activity. Thus, they should follow a power law function of population. On the other hand, the real estate market is an economic phenomenon also related to human activity. With Eqs 3 and 4, we have two relations, one between coffee shops $c$ and population and one between home prices $p$ and population: $\log(c_t) = \log(c_0) + \beta_c \log(N_t)$ and $\log(p_t) = \log(p_0) + \beta_p \log(N_t)$. Combining them, we infer a function between coffee shops and home prices. Eq 5 also represents a power law scaling relationship and can be fitted using regression techniques. As a side benefit, since both datasets are user generated, it would also establish the usefulness of such data for the investigation of urban phenomena. (5)
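The estimation step, i.e., ordinary least squares on the log-transformed scaling relationship, can be illustrated with a small self-contained Python sketch. The function name and the synthetic numbers in the usage note are purely illustrative and not taken from the study.

```python
from math import exp, log


def fit_power_law(populations, indicators):
    """Estimate (Y0, beta) by least squares on log(Y) = log(Y0) + beta*log(N)."""
    xs = [log(n) for n in populations]
    ys = [log(y) for y in indicators]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope of the regression line is the scaling exponent beta.
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    # Intercept is log(Y0); transform back to the original scale.
    y0 = exp(mean_y - beta * mean_x)
    return y0, beta
```

Fitting synthetic data generated as `2 * N ** 1.15` for a few population sizes recovers an exponent close to 1.15, which would fall into the superlinear regime (β > 1) described above.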
In Eq 5, log(Pt) is the log transformed home price and log(Ct) is the log transformed coffee shop density. In the remainder of this paper, these log transforms are still referred to as “coffee shop density” and “home price”.
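The combination step can be written out explicitly by eliminating the population term from the two log-linear relations. The derivation below is a sketch under the assumption that $\beta_c \neq 0$; the symbol $\gamma$ for the resulting slope is introduced here for illustration only, while $\tau$ is used for the intercept in line with its later use in the text.

```latex
\begin{align*}
\log(c_t) &= \log(c_0) + \beta_c \log(N_t)
  \;\Longrightarrow\;
  \log(N_t) = \frac{\log(c_t) - \log(c_0)}{\beta_c} \\
\log(p_t) &= \log(p_0) + \beta_p \log(N_t) \\
          &= \underbrace{\log(p_0) - \frac{\beta_p}{\beta_c}\,\log(c_0)}_{\tau}
             \;+\; \underbrace{\frac{\beta_p}{\beta_c}}_{\gamma}\,\log(c_t)
\end{align*}
```

The result is again linear in the log-transformed variables, which is why the relationship between coffee shop density and home prices can be fitted with the same regression techniques as the population-based models.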
Fig 10 shows that different neighborhoods over time (shown using different colors and symbols) exhibit different patterns. The home prices of each neighborhood do not always increase as the coffee shop density increases. Fig 11 shows seasonal patterns across all neighborhoods and they seem to be more consistent. Different sub-figures represent different seasons in different years. Home prices seem to increase with coffee shop density for each season in general. In the lower-left corner, both, coffee shop density and home prices are low. In the upper-right corner, home price and coffee shop density have diverging trends.
These different patterns in both figures indicate that there might be other spatiotemporal effects at work beyond the basic scaling relationship model. We will use this model as our basic model and introduce a series of adjustments that utilize some commonly recognized temporal and spatial patterns in urban science.
4.2.2 Temporal trend and lag.
Global trend. Every city’s real estate market is affected by global market forces and the economy. The economic cycle is a commonly recognized trend and refers to market fluctuations over certain periods, e.g., ten years. While there are many modeling techniques to estimate economic cycles, our data covers only a two-year period of a typical ten-year cycle and as such it could cover any portion of the cycle (peak, bottom, etc.). We use the following general polynomial model for this trend: $\tau + \alpha_1 t + \alpha_2 t^2$. Here $t$ is an index capturing sequential seasons ($t \in [1, 8]$ for our two-year period), $\tau$ is the intercept term, and $\alpha_1$ and $\alpha_2$ are coefficients. This function can generalize the situation mentioned before. If $\alpha_2$ is zero (or not statistically significant), this model reduces to a linear trend model. If $\alpha_1$ and $\alpha_2$ are both zero (or not statistically significant), it becomes a stationary process.
Creating one model that captures all neighborhoods, we need to address the differences in magnitudes of prices existing in those neighborhoods. We observed that during a recession, the absolute devaluation of properties of higher value would be much more than the absolute devaluation of properties with lower value. The reverse can be said during boom periods. We use the mean value of the unit home price for the entire two year period , with n representing different neighborhoods to normalize home prices. The scaled polynomial model becomes . The modified scaling relationship model is then as follows. (6)
Seasonality. Many articles investigating the real estate market discuss the “seasonality” of home prices (cf. [42–44]). Following a typical approach, seasonality is modeled using dummy variables ([45, 46]). Our model can be restated as follows. (7)
Here, I(i) and wi are seasonal dummy variables and their coefficients, respectively. With seasonal dummy variables present, the intercept term τ of the scaling relationship model is dropped from the linear model, since a collinearity issue would exist between τ and seasonality. Both seasonality and τ are implicitly estimated by wi. A larger wi means a larger seasonality factor, and vice versa.
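The collinearity argument can be seen directly from the design matrix. The Python sketch below builds one indicator column per season; the season names other than Winter and the function name are assumptions made for illustration.

```python
SEASONS = ("Winter", "Spring", "Summer", "Fall")  # names beyond Winter assumed


def seasonal_dummies(season_labels):
    """One row per observation, one 0/1 indicator column per season."""
    return [[1 if label == name else 0 for name in SEASONS]
            for label in season_labels]


rows = seasonal_dummies(["Winter", "Spring", "Summer", "Fall", "Winter"])
# Every row sums to 1, so the four dummy columns add up to a constant
# column of ones, which is exactly what an intercept column would be.
row_sums = [sum(row) for row in rows]
```

Because the dummy columns already span the constant column, keeping a separate intercept would make the regression coefficients non-identifiable, which is why the intercept is absorbed into the seasonal coefficients.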
Temporal lag. The relationship between coffee shops and home prices could be that an increase of coffee shops either (i) leads to higher home prices, (ii) coincides, or (iii) follows home prices. As such, we can add a lag variable for coffee shop density to our model. (8)
In this new model, log(Ct−j) is the last j season’s difference to the current season’s coffee shop density and ηj is its coefficient.
4.2.3 Spatial trend and autoregression.
Spatial trend. Many physical or social phenomena, such as the earth’s gravity, snow thickness, and population are correlated with their location. Using this basic spatial analysis technique, the spatial trend in observational data is described by means of a two-dimensional polynomial equation . Despite some claims that it does not perform well for the case of real estate [48, 49], we still want to investigate its performance given that it is a basic geospatial method.
For Manhattan, Fig 8 (right) shows that home prices are higher in the south of the city than in the north. Two-dimensional coordinates themselves do not influence home prices. However, since many spatial phenomena in a city follow a certain spatial configuration, e.g., subway lines, tourism, etc., home prices might be affected by them and as such are implicitly correlated with location.
In our case of incomplete knowledge and variables, space might be a proxy for the missing information and spatial trend analysis might still be valuable for our model. Since coffee shop density could follow a spatial trend, a collinearity issue might exist between location and coffee shop density. The updated model using a polynomial for the x and y coordinates is as follows. (9)
Spatial autoregression model. In this model we consider spatial autocorrelation, i.e., the interdependence between different locations. It is widely used in economics, sociology, and biology when one needs to analyze the correlation of neighboring phenomena. It helps in establishing that neighborhoods sometimes exhibit similar characteristics at a certain spatial scale [50]. Recently, the authors of [51] argued that latent factors could be universal behind all of those autoregressive models. In a similar argument, spatial autocorrelation is linked to missing value estimation and interpolation [52]. For example, in the case of Manhattan there are some externalities, such as transportation hubs, parks, and venues, that affect more than one neighborhood. From the perspective of latent factors, autoregression models estimate additional hidden parameters that can increase the goodness-of-fit of our model. This is especially the case if the model’s residuals are not randomly distributed but spatially clustered, which indicates spatial interdependence.
There are two basic spatial autoregression models. Eq 10 is spatial lag based and includes lagged dependent variables. Eq 11 is spatial error based and includes lagged error terms (cf. [50]). In both equations, Yt is the n × 1 vector of observed home prices for each neighborhood at time t. W is the n × n spatial weight matrix. Xt is the n × m matrix of regressors, with m being the number of regressors in this model, including coffee shop density and the variables discussed in the adjustment models. β is the m × 1 vector of coefficients. ϵt is the error term. ρ is the coefficient of the spatial lag term. λ is the coefficient of the spatial error term. (10) (11)
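As a reading aid, the textbook forms of the spatial lag and spatial error models are written below with the symbols defined above. This is a sketch of the standard specifications and may differ in detail from Eqs 10 and 11; the auxiliary error vector u_t is a symbol introduced here for illustration only.

```latex
% Spatial lag model: the spatially lagged dependent variable W Y_t is a regressor
Y_t = \rho W Y_t + X_t \beta + \epsilon_t

% Spatial error model: the spatial dependence sits in the error term
Y_t = X_t \beta + u_t, \qquad u_t = \lambda W u_t + \epsilon_t
```

In the first form, ρ measures how strongly a neighborhood's home price depends on the prices of its neighbors. In the second, λ measures how strongly the unexplained part of the price is shared between neighbors.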
Modeling approach. In our study, we build models starting with a basic model and progressing to more involved models that include additional regressors. We do so in order to observe whether subsequent adjustments add power to the model. The simplest model uses the mean value of coffee shop density and home prices across all seasons. In addition, one model is built for each season. We use this simple approach to show the applicability of a scaling relationship model.
Next, a comprehensive model with data for all neighborhoods and seasons is introduced. It includes coffee shop density as an independent variable. Based on this model, different independent variables and spatial autoregression methods are added or deleted based on their respective p-value and modeling power.
To assess the performance of our various models, we choose the Akaike Information Criterion (AIC) [53] and the Bayesian Information Criterion (BIC) [54] instead of the typical R-squared metric, since AIC and BIC are more resilient to overfitting [55]. In our modeling approach, there are two sources of complexity. The first is an increasing number of independent variables. The second is that polynomial models risk being overly complex for our modeling case. Both AIC and BIC are information-based criteria for selecting among a finite set of models. They both reward a high likelihood and penalize an increasing number of parameters. As such, both are widely used for model comparisons in modern statistics. In general, the criterion for model comparison is that lower AIC or BIC values indicate a better model. Normally, BIC gives a higher score than AIC, as it penalizes model complexity more strongly. We use two strategies for adding or eliminating variables. First, if a variable (i) is not statistically significant (p-value), or (ii) a model using it has a similar or worse AIC or BIC value than a simpler model, then this variable is not included in the next (improved) model. Second, if the coefficient of coffee shop density changes significantly when new variables are added, the model is considered a bad one, since the new variables absorb the contribution of coffee shop density. This reasoning relates to the discussion of spatial trends and spatial autoregression, which might implicitly estimate coffee shop density through location.
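The two criteria can be sketched in a few lines. The example below uses the standard definitions AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L together with the Gaussian log-likelihood of a least-squares fit. The residual sums of squares, parameter counts, and sample size are hypothetical and only serve to show how a small gain in fit can be outweighed by the complexity penalty.

```python
import math


def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a least-squares fit with Gaussian errors."""
    sigma2 = rss / n  # maximum likelihood estimate of the error variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)


def aic(log_lik, k):
    """Akaike Information Criterion for k estimated parameters."""
    return 2 * k - 2 * log_lik


def bic(log_lik, k, n):
    """Bayesian Information Criterion for k parameters and n observations."""
    return k * math.log(n) - 2 * log_lik


# Hypothetical values: n observations, residual sums of squares of two models.
n_obs = 200
ll_simple = gaussian_log_likelihood(rss=12.0, n=n_obs)    # k = 3 parameters
ll_extended = gaussian_log_likelihood(rss=11.9, n=n_obs)  # k = 6 parameters

aic_simple, bic_simple = aic(ll_simple, 3), bic(ll_simple, 3, n_obs)
aic_extended, bic_extended = aic(ll_extended, 6), bic(ll_extended, 6, n_obs)

print(f"simple:   AIC={aic_simple:.1f}  BIC={bic_simple:.1f}")
print(f"extended: AIC={aic_extended:.1f}  BIC={bic_extended:.1f}")
```

In this made-up case the extended model fits slightly better but scores worse on both criteria, and the BIC gap is larger than the AIC gap because ln(n) exceeds 2 for any sample of eight or more observations.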
5 Experimental results
The ambition of this section is simple. We want to evaluate the POI Accuracy & Coverage and trend worthiness methods that we defined in Section 4.
5.1 Accuracy and coverage
The OSM dataset extract contains 529 coffee shops, while we were able to retrieve 851 locations from Foursquare. While the former strictly contains coffee shops, the latter also includes other types of venues (e.g., sandwich and donut shops) due to the API characteristics. Using a strict label similarity score threshold of 0.9 and a 50m proximity threshold for location matching, we were able to match 310 (LCS method) and 316 (Levenshtein method) OSM POIs to Foursquare data. Out of the 561 unmatched Foursquare POIs, we were able to match 210 (LCS) and 135 (Levenshtein) of them to other POIs in OSM. Overall, 351 (LCS) or 426 (Levenshtein) Foursquare POIs and 219 (LCS) or 394 (Levenshtein) OSM POIs could not be matched to a respective equivalent in the other data source. The matched OSM locations are shown in Fig 12.
To see the distribution of the label similarity scores SLCS and SLD and of the location proximity, Fig 13 shows the CDF of the label matching score for pairs of names within a distance of 50m. We can see that about 35% of them have SLCS = 1, which is much better than expected. It means that about 35% of the data can be matched exactly within 50m. Only 30% of the data has SLD = 1. Fig 14 shows the CDF of the distance between matched label pairs within the 50m proximity threshold. Close to 80% of all matching labels are within 30m of each other, and about 95% are within 40m for both methods. This suggests that the matched labels for both methods have a spatial accuracy of 30m to 40m.
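The matching rule described above (label similarity of at least 0.9 and a distance of at most 50m) can be sketched as follows. The exact normalization of SLCS and SLD is defined in Section 4 and is not repeated here, so dividing by the longer label length is an assumption, as are the lowercasing step, the dictionary layout of a POI, and the sample records.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def s_lcs(a, b):
    """Assumed normalization: LCS length over the longer label length."""
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0


def s_ld(a, b):
    """Assumed normalization: one minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1 - levenshtein(a, b) / longest if longest else 1.0


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere of radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    h = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371000.0 * math.asin(math.sqrt(h))


def is_match(osm, fsq, score_fn, min_score=0.9, max_dist_m=50.0):
    """True if two POIs agree in label and position under the thresholds."""
    score = score_fn(osm["name"].lower(), fsq["name"].lower())
    dist = haversine_m(osm["lat"], osm["lon"], fsq["lat"], fsq["lon"])
    return score >= min_score and dist <= max_dist_m


# Hypothetical POI records, roughly 22m apart.
osm_poi = {"name": "Joe's Coffee", "lat": 40.7500, "lon": -73.9900}
fsq_poi = {"name": "Joes Coffee", "lat": 40.7502, "lon": -73.9900}
print(is_match(osm_poi, fsq_poi, s_lcs), is_match(osm_poi, fsq_poi, s_ld))
```

Passing the score function as a parameter makes it easy to run the same matching procedure once per similarity measure, which mirrors the side-by-side LCS and Levenshtein figures reported above.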
5.2 Fitting scaling relationships
For our first model case, we fit one scaling relationship model per season. Additionally, we use one baseline model for the mean value of coffee shop densities and home prices across seasons. The two variables are shown in Fig 15. The blue line is the fitted regression line. The grey area is the confidence interval of the predicted values. The coefficients and model fitting results in Table 1 suggest that across different seasons the scaling factor β is stable at around 0.30. All the models have very small p-values (less than 0.01), which indicates that the estimated coefficients are statistically significant. These results are a strong indication of a scaling relationship between coffee shop densities and home prices. Fig 16 shows the normal probability plot of the residuals, which follow a normal distribution with a high goodness-of-fit (R-squared = 0.9389).
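Fitting a scaling relationship amounts to a linear regression in log-log space. The sketch below assumes the power-law form H = e^τ · C^β, i.e., log(H) = τ + β log(C), with H the home price and C the coffee shop density; the symbol H, the function name, and the synthetic data are illustrative assumptions and not the data of this study.

```python
import math


def fit_power_law(densities, prices):
    """Ordinary least squares fit of log(price) = tau + beta * log(density)."""
    xs = [math.log(c) for c in densities]
    ys = [math.log(h) for h in prices]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx              # scaling exponent (slope in log-log space)
    tau = mean_y - beta * mean_x  # intercept in log-log space
    return tau, beta


# Synthetic, noise-free data generated with an exponent of 0.3.
densities = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
prices = [500000.0 * c ** 0.3 for c in densities]

tau, beta = fit_power_law(densities, prices)
print(f"beta = {beta:.4f}, prefactor = {math.exp(tau):.0f}")
```

Because the synthetic data follow the power law exactly, the fit recovers the exponent used to generate them. With real data, the scatter around the regression line determines the width of the confidence band shown in Fig 15.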
A value of β ≈ 0.3 is reasonable, since this scaling relationship model is based on the coffee shop density vs. population and home price vs. population models (cf. Eq 5). The addition of one coffee shop would imply the addition of a sizeable population. Thus, our β is considerably smaller than the values identified in [6], which are directly related to population. Several weaknesses of our basic model are also evident. About half of the points are outside of the confidence interval, and the changing temporal pattern inside one neighborhood in Fig 10 is in stark contrast to the stable spatial pattern of one season in Fig 11. We will address this issue in the models when considering adjustments.
5.3 Model results with adjustments
Section 4 presented a series of models, which added spatiotemporal variables or spatiotemporal methods to the basic model. We use the labels M1, M2, etc. to distinguish the various models (cf. Table 2). The coefficients of parameters, p-values, AIC, and BIC of each model are shown in Table 3.
As mentioned in Section 4.2.1, M1 is the basic model considering only coffee shop density as an independent variable. β is estimated to be 0.3032, which is almost the same as in the case of the simple model using mean values. We will use this model as a baseline in our comparison to assess the various adjustments.
Temporal trend modeling (M2) has coefficients that are all significant, with p-values close to 0. Here, α1 and α2 represent global trends. To understand what that means, we can assign an arbitrary coffee shop density value (2) and a seasonal dummy (winter) to this model. The estimated global trend is shown in Fig 17. During those two years, real estate markets grew to a peak and shrank only somewhat towards the end of the period. The coefficient of coffee shop density, β, drops from 0.3032 in the basic model to 0.1824. This suggests that the global market change is correlated with the overall change in coffee shop density during the observation period. The model also captures seasonality. With w1 representing winter, home prices in winter have a higher value than during other seasons (cf. Fig 18). This seasonal pattern is the same across all our models. However, it is in contrast to some other reports, e.g., [56], which claim that prices are lowest during winter. Limited by our sparse data (only two years) and methods, we cannot conclusively explain this difference; we can only state that it is statistically significant. Moreover, this model has an AIC value of -40.9 and a BIC value of -16.8, which is a substantial decrease from M1’s AIC and BIC values of 19.8 and 28.7, respectively. As such, we are quite confident that global trends and seasonality impact our model.
Temporal lag modeling (M3) did not perform well, since none of the lag parameters has a significant p-value. Even though AIC slightly dropped to 18.0 (from 19.8 in M1), BIC increased to 33.3 (from 28.7 in M1). This also shows that M3 does not improve over M1. Changes to coffee shop density have no measurable effect beyond a single season (a three-month period).
Another interesting aspect is a potential spatial trend, i.e., do Manhattan’s home prices follow a two-dimensional distribution? M4 looks at this question. As we can see in Table 3, all θs are significant. Assigning arbitrary values to all the other parameters, we can generate a trend surface as shown in Fig 19. It shows a slope pattern with higher values in the Northwest and lower values towards the Southeast. The result does not seem intuitive at first. Assuming it is not a fundamental issue with the modeling methods themselves, a possible explanation is that only three neighborhoods in this study are located to the north of Central Park. Thus, the model is geared towards the trend in the southern area. Neighborhoods around Central Park typically also have high home prices. The overall power of this model improves considerably when compared to the next best one, M2, with AIC/BIC dropping to -61.5/-35.5, respectively. The coffee shop coefficient β = 0.2042 is only slightly higher than the 0.1824 for the case of M2. Still, this model seems to be quite a good fit.
M5 and M6 consider a spatial autocorrelation effect. Spatial autoregression is widely used in home price analysis. However, it might implicitly estimate latent factors, which are also estimated by coffee shop densities and spatial trends. The modeling results strongly support this effect. Both models have low AIC/BIC values, -65.3/-30.8 for M5 and -77.3/-33.6 for M6, which means that both are statistically good models. The variables in both models have very low p-values, which indicates that their effects are statistically significant. The θ coefficient values capturing spatial trends in both M5 and M6 are now considerably smaller (10^4×) when compared to M4. This means that spatial autoregression has taken over the interpolation power from spatial trends. The coffee shop density coefficient β is also smaller. However, M5 and M6 do have some differences. The autoregression coefficient, ρ, of M5’s lagged dependent variable is negative. This indicates a negative autocorrelation between neighbors. In contrast, the error term’s coefficient, λ, in M6 is positive, which suggests a positive effect between neighbors that is unexplained by coffee shop density or other trends. With an AIC/BIC of -77.3/-33.6, M6 improves over M5 on both criteria. It is hard to say which model, M5 or M6, is right or wrong, since both autoregression models work on latent factors. The models themselves do not explicitly give us information about the factors they estimate. Of course, there are other advanced modeling techniques that could be applied in future research and which might have more explanatory power.
5.4 Best model?
After comparing the p-values of coefficients, AIC and BIC, we can identify the two “best model” candidates. M4 is the best model without an autoregressive process given its low BIC score. M6 uses spatial autoregression and is the best in terms of overall AIC. M6 has a worse BIC score, since it is penalized for its complexity by using autoregression terms.
To visually compare the models in terms of prediction accuracy, we can use a Q–Q (quantile-quantile) plot, which is a probability plot that compares two probability distributions by plotting their quantiles against each other. Fig 20 plots the residuals of M1, M4, and M6 against an expected normal distribution. M1 shows some deviation in both tails and also in the middle. The residuals of M6 are worse in the right tail. M4 seems to be the best of the three models; its residuals fit a normal distribution very well. However, M6’s total range of residuals (0.8341) is the smallest, followed by M4 (0.8641) and M1 (1.032). Fig 20 also shows for each neighborhood (points with the same color) that residuals fluctuate more for M6 than for M4 and M1. This is also the reason behind M6’s lowest AIC value. This pattern means a better fit to the normal distribution inside a neighborhood.
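The points of a normal Q–Q plot and the residual range used in the comparison above can be computed as in the following sketch. The plotting positions (i − 0.5)/n are one common convention and an assumption here, and the residual values are invented for illustration.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Pair each sorted residual with the matching normal quantile."""
    ordered = sorted(residuals)
    n = len(ordered)
    # Reference distribution with the sample mean and standard deviation.
    reference = NormalDist(mean(ordered), stdev(ordered))
    theoretical = [reference.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


def residual_range(residuals):
    """Total spread of the residuals (maximum minus minimum)."""
    return max(residuals) - min(residuals)


# Hypothetical residuals of a fitted model.
residuals = [-0.41, -0.12, -0.03, 0.02, 0.08, 0.15, 0.42]
points = qq_points(residuals)

for theoretical_q, observed_q in points:
    print(f"{theoretical_q:+.3f}  {observed_q:+.3f}")
print(f"range = {residual_range(residuals):.2f}")
```

If the residuals are normally distributed, the pairs lie close to the diagonal; systematic departures at either end correspond to the tail deviations discussed for M1 and M6.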
In general, based on our modeling approach, both M4 and M6 would be good candidates for modeling coffee shop densities in relation to home prices. Even though M6 has a better AIC score, there is a concern that its autoregressive process implicitly estimates the coffee shop density. This complexity is penalized, as indicated by a larger BIC score than that of M4. For M6, the coffee shop density coefficient is β = 0.1508, which is almost half of the β of the baseline model. For M4, β = 0.2043. Since M4 more convincingly considers coffee shop density, we consider it to be the best overall model.
5.5 Inverse coffee shop effects
The following power law based scaling relationship examines an intriguing, albeit intuitive, observation about cities derived from our data. Using the simple model for mean home prices, a growth function can be obtained by calculating the derivative of the scaling relationship (cf. Eq 12). This growth function can help us in understanding the coffee shops’ contribution to local communities, i.e., how does adding one coffee shop affect home prices? (12)
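For orientation, the derivative of a generic power-law scaling relationship is worked out below. This is a sketch under the assumption that the simple model has the form log(H) = τ + β log(C) with a natural logarithm, where H denotes the mean home price (a symbol introduced here) and C the coffee shop density; it may differ in notation from Eq 12.

```latex
% Scaling relationship in its multiplicative form
H = e^{\tau} C^{\beta}

% Growth function: marginal change of home price per unit of coffee shop density
\frac{dH}{dC} = \beta \, e^{\tau} C^{\beta - 1}

% With \beta \approx 0.3 the exponent is about -0.7, so the marginal effect
% decreases as C grows
\frac{dH}{dC} \propto C^{-0.7}
```

Because β is smaller than 1, the exponent β − 1 is negative, so each additional coffee shop contributes less than the previous one.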
Eq 12’s generic structure captures a fundamental feature of coffee shops and how they are related to neighborhood changes. The function shown in Fig 21 indicates that a change in coffee shop density has an inverse sublinear effect on home prices. In underdeveloped neighborhoods with small coffee shop numbers (zero is not included in this discussion), home prices increase more rapidly when coffee shops are added. This effect corresponds to the left tail of the curve shown in Fig 21. In a highly developed neighborhood with many coffee shops, the impact of a new coffee shop is rather small. This also explains why different neighborhoods have different temporal correlations between coffee shop densities and home prices. The lower left corner of Fig 10 captures neighborhoods in which home prices grow faster with the addition of coffee shops. The upper right corner in the same figure shows neighborhoods that already have a high concentration, where the growth dynamic backed by coffee shops is less evident. The growth function provides an explanation for this phenomenon. It also confirms the intuition that the first coffee shop, for example a Starbucks, has the biggest impact on neighborhood development (Harlem vs. the Upper East Side). As such, our work supports the argument that one could really use coffee shop densities as an indicator for gentrification, as suggested in [36].
6 Conclusions
The study of urban change has always been of critical importance and has only become widely feasible in recent years due to the availability of open data and user-generated content. However, the quality of such data for this particular use case still needs to be systematically examined, as addressed in this work. By examining the accuracy and coverage of OSM POI data, we found that it compares favorably to other more authoritative data sources. Compared to Foursquare, 60% of the OSM POIs could be matched with high accuracy. A more important aspect we looked at is the trend worthiness of the data, i.e., even if it is not as accurate, does it still capture temporal trends such as the relative increase of coffee shops over time? Using statistical models that exploit the power law relationship of various factors in relation to population, we were able to relate coffee shop POI data to urban housing prices. The models that we derived allow us to show that even though OSM POI data (coffee shops in our case) might be incomplete, i.e., not all coffee shops are recorded in a timely manner, such data can still be used in urban analytics research. It is also interesting to observe that the estimated growth function decodes a generic process of urban change and shows that coffee shop data can be an important geo-social indicator.
We can give the following directions for future research. We can apply our models to evaluate further user-generated content datasets, such as restaurants, supermarkets, parking garages, and bike sharing systems. A goal could be to define a respective data matrix that shows which aspects are best suited to predict urban change. In addition, we want to extend this model to include all kinds of different urban phenomena and datasets including, but not limited to, crime, street vitality, and local business climate. Further, we want to improve our spatiotemporal modeling techniques. Various approaches exist that rely on more advanced statistical models, such as spatial-temporal autoregression [57] and spatial filtering [52]. Overall, although we focused on a very specific model in this work, we argue that the proposed approach (modeling plus user-generated content) has considerable potential for social science research.
References
- 1. Singleton AD, Spielman S, Folch D. Urban Analytics. SAGE; 2017.
- 2. Singh A. Review article digital change detection techniques using remotely-sensed data. International journal of remote sensing. 1989;10(6):989–1003.
- 3. Verburg PH, Schot PP, Dijst MJ, Veldkamp A. Land use change modelling: current practice and research priorities. GeoJournal. 2004;61(4):309–324.
- 4. Grant J, Perrott K. Where is the café? The challenge of making retail uses viable in mixed-use suburban developments. Urban Studies. 2011;48(1):177–195. pmid:21174898
- 5. Liu X, Long Y. Automated identification and characterization of parcels with OpenStreetMap and points of interest. Environment and Planning B: Planning and Design. 2016;43(2):341–360.
- 6. Bettencourt LM, Lobo J, Helbing D, Kühnert C, West GB. Growth, innovation, scaling, and the pace of life in cities. Proc of the National Academy of Sciences. 2007;104(17):7301–7306.
- 7. Goodchild MF. Citizens as sensors: the world of volunteered geography. GeoJournal. 2007;69(4):211–221.
- 8. Pfoser D. On user-generated geocontent. In: Proc. International Symposium on Spatial and Temporal Databases (SSTD); 2011. p. 458–461.
- 9. Haklay M. How good is volunteered geographical information? A comparative study of OpenStreetMap and Ordnance Survey datasets. Environment and planning B: Planning and design. 2010;37(4):682–703.
- 10. Zielstra D, Zipf A. Quantitative studies on the data quality of OpenStreetMap in Germany. In: Proc. of GIScience; 2010.
- 11. Ciepłuch B, Jacob R, Mooney P, Winstanley AC. Comparison of the accuracy of OpenStreetMap for Ireland with Google Maps and Bing Maps. In: Proc. 9th International Symposium on Spatial Accuracy Assessment in Natural Resources and Environmental Sciences; 2010. p. 337.
- 12. Girres JF, Touya G. Quality assessment of the French OpenStreetMap dataset. Transactions in GIS. 2010;14(4):435–459.
- 13. Arsanjani JJ, Zipf A, Mooney P, Helbich M. An introduction to OpenStreetMap in Geographic Information Science: Experiences, research, and applications. In: OpenStreetMap in GIScience. Springer; 2015. p. 1–15.
- 14. Crooks A, Pfoser D, Jenkins A, Croitoru A, Stefanidis A, Smith D, et al. Crowdsourcing urban form and function. Int’l Journal of Geographical Information Science. 2015;29(5):720–741.
- 15. Karagiorgou S, Pfoser D. On Vehicle Tracking Data-Based Road Network Generation. In: Proc. 20th ACM SIGSPATIAL GIS Conference; 2012. p. 89–98.
- 16. Ahmad M, Karagiorgou S, Pfoser D, Wenk C. Map Construction Algorithms. Springer-Verlag; 2015.
- 17. Ahmad M, Karagiorgou S, Pfoser D, Wenk C. A Comparison and Evaluation of Map Construction Algorithms using Vehicle Tracking Data. GeoInformatica Journal. 2015;19(3):601–632.
- 18. Karagiorgou S, Pfoser D, Skoutas D. Geosemantic Network-of-Interest Construction Using Social Media Data. In: Proc. GISCIENCE conf.; 2014. p. 109–125.
- 19. Touya G, Reimer A. Inferring the scale of OpenStreetMap features. In: OpenStreetMap in GIScience; 2015. p. 81–99.
- 20. Estima J, Painho M. Investigating the potential of OpenStreetMap for land use/land cover production: A case study for continental Portugal. In: OpenStreetMap in GIScience; 2015. p. 273–293.
- 21. Estima J, Painho M. Exploratory analysis of OpenStreetMap for land use classification. In: Proceedings of the second ACM SIGSPATIAL International Workshop on Crowdsourced and Volunteered Geographic Information; 2013. p. 39–46.
- 22. Schlesinger J. Using crowd-sourced data to quantify the complex urban fabric—OpenStreetMap and the urban–rural index. In: OpenStreetMap in GIScience; 2015. p. 295–315.
- 23. Angel A, Lontou C, Pfoser D, Efentakis A. Qualitative geocoding of persistent web pages. In: Proc. 16th ACM SIGSPATIAL GIS Conference; 2008. p. 1–10.
- 24. Bader MD, Ailshire JA, Morenoff JD, House JS. Measurement of the local food environment: a comparison of existing data sources. American Journal of Epidemiology. 2010;171(5):609–617. pmid:20123688
- 25. Carroll GR, Torfason MT. Restaurant Organizational Forms and Community in the US in 2005. City & Community. 2011;10(1):1–24.
- 26. Kubrin CE, Squires GD, Graves SM, Ousey GC. Does fringe banking exacerbate neighborhood crime rates? Criminology & Public Policy. 2011;10(2):437–466.
- 27. Mülligann C, Janowicz K, Ye M, Lee WC. Analyzing the spatial-semantic interaction of points of interest in volunteered geographic information. In: Proc. International Conference on Spatial Information Theory (COSIT); 2011. p. 350–370.
- 28. Jonietz D, Zipf A. Defining fitness-for-use for crowdsourced points of interest (POI). ISPRS International Journal of Geo-Information. 2016;5(9):149.
- 29. Touya G, Antoniou V, Olteanu-Raimond AM, Van Damme MD. Assessing crowdsourced POI quality: Combining methods based on reference data, history, and spatial relations. ISPRS International Journal of Geo-Information. 2017;6(3):80.
- 30. Batty M. The size, scale, and shape of cities. Science. 2008;319(5864):769–771. pmid:18258906
- 31. Simini F, González MC, Maritan A, Barabási AL. A universal model for mobility and migration patterns. Nature. 2012;484(7392):96. pmid:22367540
- 32. Kennedy C, Pincetl S, Bunje P. The study of urban metabolism and its applications to urban planning and design. Environmental pollution. 2011;159(8):1965–1973. pmid:21084139
- 33. Waxman L. The coffee shop: Social and physical factors influencing place attachment. Journal of Interior Design. 2006;31(3):35–53.
- 34. Oldenburg R. The great good place: Cafes, coffee shops, bookstores, bars, hair salons, and other hangouts at the heart of a community. Da Capo Press; 1999.
- 35. McKenna HP. Urbanizing the Ambient: Why People Matter So Much in Smart Cities. Enriching Urban Spaces with Ambient Computing, the Internet of Things, and Smart City Design. 2016; p. 209.
- 36. Papachristos AV, Smith CM, Scherer ML, Fugiero MA. More coffee, less crime? The relationship between gentrification and neighborhood crime rates in Chicago, 1991 to 2005. City & Community. 2011;10(3):215–240.
- 37. Renthop. Rental Seasonality 2016; 2017. Available from: https://www.renthop.com/study/national/seasonality-2016.html [cited 2019-02-10].
- 38. Pearson K. Note on regression and inheritance in the case of two parents. In: Proc. of the Royal Society of London. 1895;58:240–242.
- 39. Foursquare. Foursquare API documentation; 2019. Available from: https://developer.foursquare.com/docs [cited 2019-02-10].
- 40. Cormen TH. Introduction to algorithms. MIT press; 2009.
- 41. Kenton W. Economic Cycle; 2019. Available from: http://www.investopedia.com/terms/e/economic-cycle.asp [cited 2019-02-10].
- 42. Thwaites G, Wood R. The measurement of house prices. Bank of England Quarterly Bulletin. 2003.
- 43. McAvinchey ID, Maclennan D. A regional comparison of house price inflation rates in Britain, 1967-76. Urban Studies. 1982;19(1):43–57.
- 44. Diewert WE, et al. Alternative approaches to measuring house price inflation. Discussion Paper 10-10, Department of Economics, The University of British …; 2010.
- 45. Kuo CL. Serial correlation and seasonality in the real estate market. The Journal of Real Estate Finance and Economics. 1996;12(2):139–162.
- 46. Case KE, Shiller RJ. Prices of single family homes since 1970: New indexes for four cities; 1987.
- 47. Agterberg FP. Trend surface analysis. In: Spatial statistics and models; 1984. p. 147–171.
- 48. Case B, Clapp J, Dubin R, Rodriguez M. Modeling spatial and temporal house price patterns: A comparison of four models. The Journal of Real Estate Finance and Economics. 2004;29(2):167–191.
- 49. Bourassa S, Cantoni E, Hoesli M. Predicting house prices with spatial dependence: a comparison of alternative methods. Journal of Real Estate Research. 2010.
- 50. Anselin L. Spatial econometrics: methods and models. vol. 4. Springer Science & Business Media; 2013.
- 51. Stewart B. Latent factor regressions for the social sciences. Harvard University: Department of Government Job Market Paper. 2014.
- 52. Griffith DA. Spatial autocorrelation and spatial filtering: gaining understanding through theory and scientific visualization; Springer Science & Business Media; 2013.
- 53. Akaike H. Factor analysis and AIC. Psychometrika. 1987;52(3):317–332.
- 54. Schwarz G, et al. Estimating the dimension of a model. The annals of statistics. 1978;6(2):461–464.
- 55. Gayawan E, Ipinyomi RA. A comparison of Akaike, Schwarz and R square criteria for model selection using some fertility models. Australian Journal of Basic and Applied Sciences. 2009;3(4):3524–3530.
- 56. Zhou WX, Sornette D. Analysis of the real estate market in Las Vegas: Bubble, seasonal patterns, and prediction of the CSW indices. Physica A: Statistical Mechanics and its Applications. 2008;387(1):243–260.
- 57. Pace RK, Barry R, Gilley OW, Sirmans C. A method for spatial–temporal forecasting with an application to real estate prices. International Journal of Forecasting. 2000;16(2):229–246.