Abstract
Mobile location data has emerged as a valuable data source for studying human mobility patterns in various contexts, including virus spreading, urban planning, and hazard evacuation. However, these data are often anonymized overviews derived from a panel of traced mobile devices, and the representativeness of these panels is not well documented. Without a clear understanding of the data representativeness, the interpretations of research based on mobile location data may be questionable. This article presents a comprehensive examination of the potential biases associated with mobile location data using SafeGraph Patterns data in the United States as a case study. The research rigorously scrutinizes and documents the bias from multiple dimensions, including spatial, temporal, urbanization, demographic, and socioeconomic, over a five-year period from 2018 to 2022 across diverse geographic levels, including state, county, census tract, and census block group. Our analysis of the SafeGraph Patterns dataset revealed an average sampling rate of 7.5% with notable temporal dynamics, geographic disparities, and urban-rural differences. The number of sampled devices was strongly correlated with the census population at the county level over the five years for both urban (r > 0.97) and rural counties (r > 0.91), but less so at the census tract and block group levels. We observed minor sampling biases among groups such as gender, age, and moderate-income, with biases typically ranging from -0.05 to +0.05. However, minority groups such as Hispanic populations, low-income households, and individuals with low levels of education generally exhibited higher levels of underrepresentation bias that varied over space, time, urbanization, and across geographic levels. 
These findings provide important insights for future studies that utilize SafeGraph data or other mobile location datasets, highlighting the need to thoroughly evaluate the spatiotemporal dynamics of the bias across spatial scales when employing such data sources.
Citation: Li Z, Ning H, Jing F, Lessani MN (2024) Understanding the bias of mobile location data across spatial scales and over time: A comprehensive analysis of SafeGraph data in the United States. PLoS ONE 19(1): e0294430. https://doi.org/10.1371/journal.pone.0294430
Editor: Christos Nicolaides, University of Cyprus, CYPRUS
Received: May 29, 2023; Accepted: November 1, 2023; Published: January 19, 2024
Copyright: © 2024 Li et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: We have shared the data for this study with the public, which can be accessed at https://github.com/gladcolor/Advan_mobility_data_bias_exploration.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
Mobile location data has become increasingly important for understanding human mobility patterns in contemporary society. Modern smartphones are equipped with highly sensitive Global Positioning System (GPS) receivers that can provide accurate location data to installed applications, such as Google Maps [1] and social media platforms including Twitter, Facebook, and Instagram [2]. Such location data has become a vital source of geospatial big data for human mobility studies, allowing researchers to gain insights into travel trajectories, activity patterns, and behavior across large geographic areas with a high level of granularity [3, 4]. Several commercial companies, including SafeGraph, Cuebiq, X-mode, and Foursquare, have started to provide mobile location data. These data often do not uniquely identify individuals but rather provide an anonymized overview of their aggregated movement to protect individual privacy while still providing insights into broader patterns of human mobility. For example, mobile location data may be anonymized and aggregated at the neighborhood or census block level, rather than at the level of individual users.
SafeGraph Patterns data [5], or Advan Patterns since 2023 [6], has emerged as one of the most frequently utilized sources of mobile location data in academic research across multiple domains, particularly in the realm of urban sciences, public health, consumer behaviors, and environmental science. For instance, in urban sciences, the data has been used for analyzing human mobility patterns within and between various regions [7], evaluating transportation infrastructures and planning [8], analyzing transportation equity and socioeconomic disparities [9, 10], and assessing the accessibility of bus rapid transit [11]. Similarly, SafeGraph data has been extensively used in public health studies, including tracking the spread of infectious diseases [12, 13], monitoring social distancing behaviors, examining the effectiveness of control measures during the COVID-19 pandemic [14, 15], and investigating the impacts of non-pharmaceutical interventions [16]. In addition, such datasets play a pivotal role in environmental sciences, such as understanding factors influencing long-term park visitation [17], estimating visitors’ demographic status and their patterns in national parks [18, 19], and examining how urban socio-physical systems impact the resilience of cities [20]. Furthermore, numerous studies used SafeGraph data for marketing and consumer behavior research, especially to understand how people move and interact with businesses and commercial areas and predict consumer behaviors [21, 22].
One potential source of harm associated with the use of big data in research and applications is the risk of incorporating implicit biases into analyses that impact the accuracy and reliability of research findings [23]. Due to its nature as big data, mobile location data such as SafeGraph is particularly susceptible to sampling bias because deriving mobility datasets from mobile devices involves multiple successive sampling procedures, from the population to mobile phone owners and app users [24]. Sampling bias refers to the discrepancy between a sample and the population from which it was collected. It is a systematic error that cannot be alleviated by simply increasing the size of the sample [25]. In the context of SafeGraph data, ’sample’ refers to the panel of devices compiled within the dataset, while the ’population’ denotes the entirety of the U.S. population, and the bias can arise from multiple dimensions. Sampling bias can result in certain demographic and socioeconomic groups being overrepresented or underrepresented. For instance, if the data is collected only from users who have downloaded particular apps, it may not be representative of the entire population. In addition, the rate of access to smartphones can vary across diverse age groups and genders [26] and mobile location data is collected from individuals who have opted in to share their location data. Location is another dimension of the sampling bias, as the availability and quality of mobile location data can vary across various geographic regions. In certain geographic areas, the bias can result from a range of factors, including weaker mobile signals or a small number of cell towers or urban-rural settings. In addition, the popularity of certain location-based apps or services may also contribute to disparities in the types and quantity of location data collected in distinct geographical zones [27]. 
Finally, the bias is likely to change over time, as data collected at certain times of the day, days of the week, or seasons may capture different mobility patterns. For example, data collected during peak hours may overestimate the number of individuals using transportation infrastructure, while data collected during off-peak hours may underestimate overall mobility.
To alleviate concerns over potential data bias, SafeGraph provides a preliminary assessment of bias in its Patterns datasets, revealing high Pearson coefficients (>0.96) between the number of sampled devices and the population at the state and county level [28]. Similarly, aggregated tracked devices in the SafeGraph data exhibit an exceptionally strong association with the population at the national level (>0.99) for different demographic groups, including race, education attainment, and household income. Wang et al. [29] investigated the association between healthcare visits and neighborhood socioeconomics during the COVID-19 pandemic, and found that the sampling rates were balanced overall at the state level in North Carolina. Coston et al. [30] introduced external datasets to audit the bias of SafeGraph data. They examined the voter turnout data of North Carolina’s 2018 general election and found that older and non-white voters appear to be less captured in SafeGraph visits at polling locations. Their study demonstrated a workflow to leverage administrative data for mobility bias detection, yet the availability of such auditing data is limited, since few events that generate substantial foot traffic also collect demographic information. Although these studies provide useful insights into SafeGraph data bias, they do not provide a comprehensive evaluation of the bias or analyze the spatial and temporal aspects of identified bias at different spatial scales. In light of the potential impact of bias on research outcomes, it is imperative that researchers make a concerted effort to understand and mitigate bias in mobile location data such as SafeGraph and maintain the highest standards of rigor and accuracy in their studies.
While there is no universally accepted standard for validating mobile location data [24], conducting a systematic analysis of data bias at multiple geographical levels over a longitudinal study would be beneficial for the research community and improve practical applications of mobile location data.
To address this need, we conducted a comprehensive investigation of bias in mobile location data using the widely used SafeGraph Patterns [6] in the entire United States (US) as our study dataset. The research focused on examining the bias from multiple dimensions including spatial, temporal, urbanization, demographic, and socioeconomic over a five-year period from 2018 to 2022 at multiple spatial scales, including state, county, census tract, and census block group, and at different geographic settings of urban and rural areas. The bias examined covers commonly used demographic and socioeconomic variables such as sex, age, race/ethnicity, income, and educational attainment. The significance of this study lies in four key contributions: 1) a systematic assessment of bias from multiple dimensions across a wide spatiotemporal range (the entire US with monthly data of five years); 2) identification of population bias across a broad range of demographic and socioeconomic variables; 3) spatial and temporal analysis of the quantified bias at multiple spatial scales, and 4) a general analysis framework for evaluating the bias of mobile location data. By systematically documenting the bias at various geographic levels over a five-year period, the findings of this research offer valuable reference for future studies that leverage SafeGraph data or other mobile location datasets.
The remainder of this paper is organized as follows: Section 2 describes the study area and the methodology employed; Section 3 presents and discusses the results; Section 4 delves into the limitations of the presented research; and Section 5 summarizes our findings.
2. Data and method
2.1 Study area and data
We conducted a nationwide study encompassing the entire US, including Alaska and Hawaii, with geographic units at the census block group, tract, county, and state levels. The boundaries of these geographic units were defined by the US Census Bureau. The Census block group is the smallest publicly available geographic unit for sample data from the decennial census which typically has a population of 600 to 3,000 people. Our analysis utilized the SafeGraph Panel Overview data on monthly patterns from 2018 to 2022, spanning five years, sourced from [31]. We extracted the column of number_devices_residing for each block group from the panel data as a proxy for residents. This variable indicates the count of distinct devices observed with a primary nighttime location in the specified block group. SafeGraph determines the home location of a device by analyzing data for 6 weeks during nighttime hours (between 6 pm and 7 am) to identify a common nighttime location for the device which is then mapped to a census block group [31].
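SafeGraph's exact home-attribution algorithm is proprietary; the following is only a toy sketch of the idea described above (the most frequently observed nighttime block group over the observation window), with hypothetical ping data and illustrative CBG identifiers:

```python
from collections import Counter
from datetime import datetime

def infer_home_cbg(pings):
    """Infer a device's home census block group (CBG) as the most
    common CBG observed during nighttime hours (6 pm - 7 am).

    `pings` is a list of (timestamp, cbg_id) tuples; the input format
    is an illustrative assumption, not SafeGraph's actual schema.
    """
    night_cbgs = []
    for ts, cbg in pings:
        # Nighttime window per SafeGraph's published description
        if ts.hour >= 18 or ts.hour < 7:
            night_cbgs.append(cbg)
    if not night_cbgs:
        return None
    # The most frequently observed nighttime CBG is taken as "home"
    return Counter(night_cbgs).most_common(1)[0][0]

pings = [
    (datetime(2019, 5, 1, 23, 15), "450790103011"),
    (datetime(2019, 5, 2, 2, 40), "450790103011"),
    (datetime(2019, 5, 2, 14, 5), "450790114002"),  # daytime ping, ignored
    (datetime(2019, 5, 3, 22, 10), "450790103011"),
]
print(infer_home_cbg(pings))  # 450790103011
```

In the real pipeline this decision is made over six weeks of data and the winning CBG feeds the number_devices_residing count used throughout our analysis.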
The socioeconomic and demographic population data were extracted from the American Community Survey (ACS), 5-year estimates [32]. We used the ACS data of 2018 and 2019 at the four geographic levels of block group, tract, county, and state. While the ACS has employed new boundaries since 2020, SafeGraph data still used block group boundaries from the 2010 Census. As there is no reliable means to align the new boundaries with the previous ones [33–35], we relied on ACS 2019 data to analyze bias for 2020, 2021, and 2022.
We used two distinct sources for the urban-rural classification. To determine the urban classification for block groups, we applied a threshold that assigned an urban label to those areas where more than 50% of the block group’s land area fell within the urban polygons defined by the US Census Bureau. These polygons were delineated at the Census block level, using housing density as the primary criterion [36]. For tract level, we employed the urban-rural classification scheme from USDA ERS (2020a), which classified tracts with centroids located within Census urban polygons as urban [36]. For the county level, we used the classification scheme from USDA ERS (2020b), which contains two major categories (metro and nonmetro) and nine sub-categories based on population, urbanization degree, and adjacency to metro areas. The metro counties were classified as urban and the nonmetro counties were categorized as rural in this study.
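As an illustration of the tract-level rule (a tract is urban if its centroid falls inside a Census urban polygon), here is a self-contained sketch using a ray-casting point-in-polygon test. The polygons and coordinates are toy values; real Census boundaries would require proper GIS tooling:

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: is point (x, y) inside polygon `poly`
    (a list of (x, y) vertices in order)?"""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Edge crosses the horizontal ray through y
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def classify_tract(centroid, urban_polygons):
    """USDA ERS-style tract rule: urban if the tract centroid lies
    within any Census urban polygon, else rural."""
    x, y = centroid
    return "urban" if any(point_in_polygon(x, y, p) for p in urban_polygons) else "rural"

urban_poly = [(0, 0), (10, 0), (10, 10), (0, 10)]  # toy urban area
print(classify_tract((5, 5), [urban_poly]))   # urban
print(classify_tract((15, 5), [urban_poly]))  # rural
```

The block-group rule differs in that it thresholds the share of land area (>50%) inside urban polygons rather than testing the centroid.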
2.2 Analyze bias for the total population
In this analysis, we focused on the representativeness of the total population of the SafeGraph data without considering different population groups. Specifically, we used the sampling rate to analyze the spatial and temporal bias of the data for the whole population across five geographic levels (block group, tract, county, state, and nation) in the US. To compute the sampling rate for a specific geographic level, we summed the number of residing devices for each geographic unit and then divided this sum by the corresponding population of the unit. We hypothesized that data without spatial bias should exhibit a consistent sampling rate across space, and data without temporal bias should demonstrate a constant sampling rate over time.
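The sampling-rate computation described above amounts to a grouped sum over block groups. A minimal sketch, assuming block-group records keyed by 12-digit FIPS codes (the field names and toy values are ours, not SafeGraph's schema):

```python
def sampling_rates(block_groups, level_key):
    """Aggregate block-group device counts and populations to a higher
    geographic level and compute the sampling rate (devices / population).

    `block_groups`: list of dicts with illustrative keys 'fips'
    (12-digit block-group FIPS), 'devices', 'population'.
    `level_key`: number of leading FIPS digits defining the level
    (2 = state, 5 = county, 11 = tract, 12 = block group).
    """
    totals = {}
    for bg in block_groups:
        unit = bg["fips"][:level_key]
        d, p = totals.get(unit, (0, 0))
        totals[unit] = (d + bg["devices"], p + bg["population"])
    return {unit: d / p for unit, (d, p) in totals.items() if p > 0}

bgs = [
    {"fips": "450790103011", "devices": 90, "population": 1200},
    {"fips": "450790103012", "devices": 60, "population": 800},
    {"fips": "450830205001", "devices": 30, "population": 1000},
]
county_rates = sampling_rates(bgs, level_key=5)
print(county_rates)  # {'45079': 0.075, '45083': 0.03}
```

Note that summing before dividing (rather than averaging block-group rates) weights each block group by its population, matching the definition used in this study.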
To examine the temporal trend of the bias and identify the potential urban-rural disparities, we computed the five-year (2018–2022) monthly nation-level sampling rate and its associated urban-rural classification. Violin charts were generated to display the distribution of the sampling rate in each month in urban and rural areas. Furthermore, to examine the geographic disparities of the bias, we mapped the sampling rates at four geographic levels (block group, tract, county, and state) in the US. These maps aim to provide SafeGraph data users with an intuitive visual understanding of the representativeness of SafeGraph data for the entire population at different geographic levels.
2.3 Analyze socioeconomic and demographic bias for different population groups
In this analysis, we focused on examining the data’s representativeness of different population groups at the county and state levels. Specifically, we investigated whether the tracked mobile devices provided by SafeGraph are evenly distributed among 23 different population groups classified with demographic and socioeconomic variables. These variables cover five categories, including age, gender, race/ethnicity, education, and income (Table 1).
We hypothesized that data without bias should have the same sampling rate among different population groups. Subsequently, any differences in the sampling rate are viewed as bias. Following SafeGraph [28] and Wang et al. [29], we adopted an aggregation-based bias inspection approach to assess the bias among population groups (denoted as g) at county and state levels. Specifically, we assumed that in a block group (denoted as c), the sampled devices (D_c) have the same demographic characteristics as that block group, meaning that they share the same proportions of each population group (p_{g,c}). Next, we aggregated the sampled devices of each population group to the county (or state) level to calculate the county (or state) level bias (n is the number of block groups of the county or state). Specifically, for the population group g, the proportion of sampled devices relative to the total number of devices in the county or state (\hat{p}_g) can be obtained by Eq (1), where \sum_{c=1}^{n} D_c is the total number of sampled devices:

\hat{p}_g = \frac{\sum_{c=1}^{n} D_c \, p_{g,c}}{\sum_{c=1}^{n} D_c} \quad (1)

Similarly, the proportion of g relative to the total county or state population (p_g) can be obtained by Eq (2), where P_c is the population of block group c and \sum_{c=1}^{n} P_c is the total county (or state) population:

p_g = \frac{\sum_{c=1}^{n} P_c \, p_{g,c}}{\sum_{c=1}^{n} P_c} \quad (2)

Ideally, if the data has no bias, \hat{p}_g should equal p_g, indicating that their ratio is 1. By computing the difference between the actual ratio and 1 for each population group, we can estimate whether the sampled devices are evenly distributed among block groups, as illustrated in Eq (3):

bias_g = \frac{\hat{p}_g}{p_g} - 1 \quad (3)

A bias value far from 0 indicates high sampling bias, with positive values indicating over-representation (over-sampling) of a specific population group and negative values indicating under-representation (under-sampling). For example, a bias value of 0.05 for a specific population group indicates that this population group is 5% overrepresented in the data, while a bias value of -0.05 indicates 5% underrepresentation.
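The aggregation behind Eqs (1)–(3) can be sketched in a few lines. The dictionary keys below ('devices', 'population', 'shares') are illustrative containers for D_c, P_c, and the group proportions p_{g,c}; the numbers are toy values, not SafeGraph data:

```python
def group_bias(block_groups, group):
    """Aggregation-based bias (Eqs 1-3) for one population group over
    a set of block groups forming a county or state."""
    total_devices = sum(bg["devices"] for bg in block_groups)
    total_pop = sum(bg["population"] for bg in block_groups)
    # Eq (1): group's share of sampled devices, assuming devices in a
    # block group mirror that block group's demographic proportions
    p_hat_g = sum(bg["devices"] * bg["shares"][group]
                  for bg in block_groups) / total_devices
    # Eq (2): group's share of the census population
    p_g = sum(bg["population"] * bg["shares"][group]
              for bg in block_groups) / total_pop
    # Eq (3): deviation of the ratio from 1 (0 = unbiased)
    return p_hat_g / p_g - 1

bgs = [
    {"devices": 100, "population": 1000, "shares": {"hispanic": 0.10}},
    {"devices": 50, "population": 1000, "shares": {"hispanic": 0.40}},
]
print(round(group_bias(bgs, "hispanic"), 3))  # -0.2 (20% underrepresented)
```

In this toy example the heavily Hispanic block group is undersampled (50 devices per 1,000 residents versus 100 per 1,000), which surfaces as a negative county-level bias for the Hispanic group, mirroring the direction of bias reported in our results.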
To examine spatial and temporal patterns of the quantified biases, we mapped the bias of each population group at the county and state levels for each of the five years. The maps were designed to provide an intuitive visual representation of the socioeconomic and demographic bias across different spatial scales and over time for the US. We further created a heatmap to visualize the monthly trend of the bias of each population group from 2018 to 2022. The goal was to provide a clearer understanding of the disparities and changes in bias over time, and to identify areas and groups that may require additional attention in terms of data analysis.
Note that SafeGraph excludes children under age 16, so the sampling rate derived from its datasets is roughly an “adult sampling rate.” However, we retain the term “sampling rate” since it conveys an intuitive and clear idea: the ratio of sampled devices to the total population. Furthermore, in reality children usually stay with guardians or caregivers rather than moving independently, so human mobility studies usually cannot exclude children. SafeGraph datasets exclude children’s devices, but not children themselves. According to Sun et al. [37], about 75% of children aged 11 to 15 have a smartphone, and this age group makes up about 6.5% of the total US population. Therefore, about 4.9% (6.5% × 75%) of the population’s devices were excluded. In addition, children under 15 compose about 18% of the US population [38]. It is worth noting this issue for a better interpretation of our analysis.
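The back-of-envelope estimate above is simply the product of the two cited figures:

```python
# Rough share of devices excluded by SafeGraph's under-16 filter,
# using the figures cited in the text (approximate, not exact).
share_age_11_15 = 0.065   # share of US population aged 11-15 [38]
smartphone_rate = 0.75    # smartphone ownership in that age group [37]
excluded_device_share = share_age_11_15 * smartphone_rate
print(f"{excluded_device_share:.1%}")  # 4.9%
```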
3. Results and discussion
3.1 Sampling rate for the total population
3.1.1 Temporal dynamics.
The overall temporal trend of the monthly sampling rate across three categories of urban, rural, and the nation from 2018 to 2022 is shown in Fig 1. The national sampling rate is obtained by dividing the total number of devices of all block groups by the US population. SafeGraph used multiple criteria to identify the home block group of a device; one criterion is determining in which block group a device spent the most nights over the past six weeks. As a result, the devices tracked by the SafeGraph panel are dynamic, varying from block group to national level, due to the addition of new devices and the relocation of residents. For the urban and rural categories, the sampling rates are computed by dividing the total number of devices of block groups classified as urban or rural by the respective population counts. Throughout the five-year period, the sampling rate for all three categories displayed notable fluctuations, ranging from 4.5% to 14.5% with an average of 7.5%. The average sampling rates for the entire US across the five years were 7.2%, 8.1%, 7.2%, 6.6%, and 8.5%, respectively, from 2018 to 2022.
CBG: Census block group.
The rates show a significant increase between February and June of 2018, followed by a decline until September, and then another increase until May 2019. The trend also reveals a dramatic decline in March 2020 following the COVID-19 outbreak in the US. After the outbreak, the government issued several travel restriction measures, resulting in a significant decline in population movement [7, 9]. The reduced human movement might have resulted in lower sampling rates, as SafeGraph data is collected from mobile device applications, which rely on users opting to share their location data while using the applications. The sampling rates continued to decline for an extended period until October 2021, when a significant recovery occurred. The recovery trend reached its highest peak (13.4%) in May 2022, followed by a sharp decline to the lowest point (4.9%) in July 2022. This dramatic change may relate to a data provider disruption in May 2022 [34].
The sampling rates for the nation, urban, and rural areas exhibit consistent trends, with similar rates observed for the nation and urban areas. It is worth noting that prior to late 2019, the rural population was generally underrepresented, as indicated by the lower sampling rate. However, this trend reversed after late 2019, with the rural population becoming overrepresented and the urban population becoming underrepresented. The disparities between urban and rural areas also widened. The dynamics of the urban-rural disparities of the sampling rate suggest the importance of understanding the bias in the temporal dimension.
As the overall temporal trend shows clear temporal bias and urban/rural differences over time, we further calculated the monthly sampling rate for four geographic levels (state, county, tract, and block group) from 2018 to 2022. To compute the sampling rate for a particular geographic level, we first aggregated the number of devices located within each unit and then divided this total by the corresponding population of that unit. At the county, tract, and block group levels, we calculated separate rates for urban and rural areas, and then visualized the results for 2019 using violin charts in Fig 2. The results for 2018, 2020, 2021, and 2022 are presented in Figs A1, A3-A5 in S1 Appendix.
Urban-rural disparities are illustrated for the county, tract, and block group (CBG) levels.
At the county level, the result revealed that the largest rural-urban difference in sampling rate was observed in 2018 and 2019, with a higher sampling rate in urban areas than in rural areas. However, this difference began to decrease in March 2020, and by 2021, there was no discernible difference between urban and rural sampling rates. In 2022, we observed a slight reversal of this pattern, with a slightly higher sampling rate observed in rural areas. At the tract level, we initially observed similar patterns to the county level in 2018. However, the sampling rates equalized in March 2019, and starting in April 2019 the pattern reversed: rural areas had a higher sampling rate than urban areas. This pattern continued and gradually intensified throughout the years, with the most dramatic difference observed in 2022. The observed discrepancies in urban-rural sampling rate disparities are likely caused by spatial aggregation and the urban-rural classifications at different geographic levels. The block group level showed similar patterns over the five-year period as the tract level, which may be due to the fact that each tract contains an average of only three block groups.
These findings highlight the importance of analyzing urban-rural disparities in bias for specific geographic levels when using SafeGraph data. For instance, this analysis revealed that in 2019, the data underrepresented the rural population at the county level but overrepresented the rural population at the tract and block group levels. Note that the different classification methods for urban-rural at different geographic levels may affect the bias disparities.
3.1.2 Spatial distribution.
To examine the geographic disparities in how SafeGraph data represent the whole population, we calculated the yearly sampling rate for each geographic unit at the four geographic levels and mapped the result for 2019 in Fig 3. The maps for 2018, 2020, 2021, and 2022 are presented in Figs A6, A8-A10 in S1 Appendix.
Overall, the nine states in the Deep South (i.e., Alabama, Florida, Georgia, Louisiana, Mississippi, North Carolina, South Carolina, Tennessee, and Texas) and three states in the Midwest (i.e., Oklahoma, Kansas, Nebraska) have a higher concentration of areas with higher sampling rates, while the West and Northeast have more areas with lower sampling rates. This pattern was in general consistent over the five years and was clearly illustrated in the maps of county, tract, and block group levels. The maps exhibited a consistent pattern with Fig 1, where the overall sampling rate displayed an upward trend from 2018 (mean: 6.9%) and 2019 (7.1%) to 2020 (7.9%), decreased in 2021 (6.6%) before increasing again in 2022 (8.5%). The maps at the block group level revealed that across five years, the relatively low sampling rates (< 5%, dark blue) were generally concentrated in densely populated areas of the Northeast and West, while the relatively high sampling rates (>10%, dark red) were discretely distributed with spatial clustering in the South. This distribution and trend at the block group level were consistent with those observed at the tract and county levels.
The maps visually indicated that lower sampling rates tend to be concentrated in densely populated areas of the Northeast and West. To further investigate this observation, we conducted additional analyses to explore whether these geographic disparities are correlated with unit population across various levels of geography. Specifically, for each year and each geographic level (county, tract, and block group), we generated a scatter plot to visualize the relationship between the sampling rate and population. In addition, urban and rural areas were plotted separately to reveal potential urban/rural disparities in such an association. Fig 4 presents the results for 2019, while the results for 2018, 2020, 2021, and 2022 are presented in Figs A11, A13-A15 in S1 Appendix.
Fig 4 illustrates that there is insignificant or no association between the sampling rate and the population at all geographic levels, with r2 ranging from 0.001 to 0.02 for both urban and rural areas. This pattern is consistent across all five years (Figs A11-A15 in S1 Appendix). The lack of a discernible association between sampling rates and population indicates no systematic bias of sampling rate in these geographic levels or urban/rural settings in terms of population size. Therefore, the geographic disparities depicted in the maps (Fig 3) are likely driven by other factors, such as demographic and socioeconomic characteristics of the population, which is examined in Section 3.2.
3.1.3 Association between device count and population.
The above analysis reveals the bias of the sampling rate across different geographic levels and over time. It should be noted that a high sampling rate does not necessarily indicate better representativeness of the population. To assess how well SafeGraph data represents the whole population, we further conducted analyses by creating scatter plots for five years that showed the correlation between the census population and the sampled device count, with urban/rural classification at three geographic levels: county, tract, and block group. The result for 2019 is shown in Fig 5, and the results for 2018, 2020, 2021, and 2022 are presented in Figs A16, A18-A20 in S1 Appendix.
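The per-group correlations reported below reduce to a standard Pearson computation over units at each geographic level. A minimal sketch, using toy hypothetical county figures rather than the actual SafeGraph panel:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy county-level data (hypothetical): census population vs. device count.
# Device counts roughly follow a ~7% sampling rate, so r should be high.
pop = [10000, 25000, 50000, 120000, 300000]
devices = [700, 1900, 3600, 9500, 22000]
r = pearson_r(pop, devices)
print(round(r, 3), round(r ** 2, 3))
```

In practice these correlations are computed separately for urban and rural units at each level and year, which is how the r and r² values in Fig 5 and the appendix figures were obtained.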
Fig 5 shows the number of sampled devices is strongly correlated with the census population at the county level for both urban and rural counties (r2 = 0.972, Pearson r = 0.986 for urban; r2 = 0.872, r = 0.934 for rural). The association at the tract and block group levels becomes weaker than the county level, but it remains at a moderate level (r2 > 0.466, r > 0.683 for urban tracts; r2 > 0.689, r > 0.830 for rural tracts). This pattern aligns with the uneven geographic distribution of sampling rate at the tract and block group levels, as shown in the previous maps (Fig 3). SafeGraph [28] reported a strong correlation (r = 0.966) between the number of sampled devices and the population at the county level using one month of data (October 2019), which is consistent with our findings with the 5-year data. However, at the block group level, they reported a much lower correlation coefficient of 0.176 using that one month of data. This finding further highlights the importance of understanding the temporal dynamics of the bias in using SafeGraph data.
Figs A16, A18-A20 in S1 Appendix further revealed that 2019 is generally consistent with the other four years; however, a gradual decline was observed in the representativeness of urban areas, specifically at the tract and block group levels. For example, the r2 values for urban tracts from 2018 to 2022 were 0.651, 0.466, 0.539, 0.415, 0.375, while the r2 values for rural tracts remained relatively stable, ranging from 0.678 to 0.735. This finding further underscores the widening urban/rural disparities observed in Figs 1 and 3. Interestingly, at the county level, the data shows slightly better representativeness in urban areas, while at the tract and block group levels, the data perform better in rural areas.
3.2 Demographic and socioeconomic bias
3.2.1 Overall demographic and socioeconomic bias among population groups.
This analysis aimed to examine the representativeness of SafeGraph data for different population groups across different spatial scales. We calculated the demographic and socioeconomic biases among 23 different population groups with five categories, including age, gender, race/ethnicity, education, and income (Table 1), following the method detailed in section 2.3. Particularly, for each state and county in the US, we calculated the bias for each of the 23 population groups for each year from 2018 to 2022. The frequency distribution of the socioeconomic and demographic bias at the state and county levels in 2019 is presented in Fig 6, while the results for 2018, 2020, 2021, and 2022 are presented in Figs A21-A25 in S1 Appendix. The urban/rural disparities of the bias at the county level are also illustrated in these figures. Additionally, the median, minimum, and maximum bias of population groups for both county and state levels are reported in Tables 2 and 3.
As shown in Fig 6, most biases in 2019 fell within [-0.071, 0.034] at the state level and within [-0.056, 0.029] at the county level.
From 2018 to 2022, the median bias of most population groups was relatively low, within a range of about ±0.05. At the state level, the data shows minor bias in gender over the five-year period (within ±0.0012). Population groups aged 15–17, Hispanic populations, those with no schooling or no college education, and those with incomes below 50K are slightly underrepresented, with most bias values falling in [-0.059, 0.025]. The underrepresentation of the young groups is expected, as SafeGraph does not track the mobile devices of children under the age of 16 [39]. In 2019, other age groups, Black and Asian populations, those with bachelor's or graduate degrees, and those with incomes over 100K were generally overrepresented (about 0–0.02). Notably, the population aged 65 and over is not generally underrepresented in SafeGraph data from 2020 to 2022. This may be because the gap between mobile phone and internet users and non-users no longer widens with age in the US [40], or because this demographic group tends to consent to location-based cookie policies. The accessibility of internet infrastructure and the affordability of mobile phones have resulted in as many seniors using mobile phones as other age groups in the US. We also observed some unique patterns at the state level in 2022, where populations with lower education and income levels were overrepresented, while populations with higher education and income levels were underrepresented. This pattern contradicts the general understanding that mobile location data tends to underrepresent populations with low socioeconomic status.
In 2019, the overall bias for different population groups at the county level is generally consistent with the state level. The bias of the race/ethnicity variables (i.e., Black, Hispanic, and Asian) exhibits the largest variation among counties, while the bias of Male, Female, and White shows the least variation, with Female (median: 0.001–0.002) and White (median: 0–0.020) being slightly overrepresented. Over the five-year period from 2018 to 2022, the bias for gender and all age groups shows consistent patterns, while other variables show notable changes and even reversals of direction. For instance, the Black and Asian groups were overrepresented in 2018 (median: 0–0.088), but their representativeness decreased in the following years, as did that of the Hispanic group.
The bias for most population groups exhibits urban/rural disparities over the five years, with rural areas generally less well represented than urban areas, though with a few exceptions. The biases of minority groups (i.e., Black, Asian) show a higher frequency of lower values in rural counties than in urban counties, and this pattern remained consistent over the five years, except for the Black population in 2020 and 2021. For the Black and Hispanic populations, rural residents were less well represented than their urban counterparts in 2018; however, this gap reversed in the following four years.
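The urban/rural comparisons above reduce to summary statistics (median, minimum, maximum) of a per-county bias column split by urbanization class, as in the violin plots of Fig 6. A stdlib sketch with synthetic numbers:

```python
import statistics

# Synthetic per-county bias values for one population group, tagged urban/rural;
# not SafeGraph output.
county_bias = [
    ("A", True, -0.02), ("B", True, -0.04), ("C", True, -0.01),
    ("D", False, -0.09), ("E", False, -0.12), ("F", False, -0.07),
]

def summarize(rows, urban):
    """Median, min, and max bias for counties in the given urbanization class."""
    vals = [b for _, u, b in rows if u == urban]
    return statistics.median(vals), min(vals), max(vals)

print("urban:", summarize(county_bias, True))
print("rural:", summarize(county_bias, False))
```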
3.2.2 Spatial distribution of the demographic and socioeconomic bias.
To examine the geographic disparities of the demographic and socioeconomic bias, we mapped the bias of each population group from 2018 to 2022 at both state and county levels. The maps for 2019 are shown in Figs 7 & 8, while the results for 2018, 2020, 2021, and 2022 are presented in Figs A26–A35 in S1 Appendix.
As shown in Fig 7, the bias in gender shows no significant geographic disparities (light grey color) across the nation (within ±0.001). Bias in age exhibits marked geographic differences. The population group aged 15–17 is underrepresented in most states (median: -0.009), with relatively higher underrepresentation observed in the Northwest (e.g., Montana, North Dakota, and South Dakota). In contrast, population groups over the age of 17 were generally overrepresented in most states, with the highest overrepresentation in the West, the Northeast, and Texas and Florida in the South. These groups were underrepresented in only a few states, such as North Dakota (median: -0.043), Florida (median: -0.040), and Arizona (median: -0.036).
The bias in race/ethnicity varies across states. Among the four race/ethnicity groups, the data shows the best representativeness for the White population, with slight overrepresentation in some Deep South states (i.e., Alabama, Georgia, Louisiana, Mississippi, South Carolina) and underrepresentation in certain Northeastern (e.g., New York) and Western states (e.g., California). The bias for the Black population is distributed oppositely to that of the White population but generally shows overrepresentation: the Black population was significantly overrepresented in the Northeastern and Western states and slightly underrepresented in some Central states (e.g., Nebraska, Kansas, and Alabama). The Hispanic population was underrepresented in most states, particularly in some Central states such as Kansas (-0.120), Nebraska (-0.110), and Texas (-0.090), while being overrepresented in the Northeast and Alaska (0.07). The Asian population was overrepresented in most states, with the highest overrepresentation in the South and West.
Population groups with no high school education were underrepresented in most states, especially in the Midwest and South, and overrepresented in some Northeastern states (e.g., Connecticut, 0.088). Population groups with a college education or above were overrepresented, particularly in the South. The maps also reveal that population groups with an income of 50K–100K have the best representativeness across all states. However, population groups with incomes below 50K were highly underrepresented in the South and Midwest, while being slightly overrepresented in some states in the Northeast and West. Meanwhile, population groups with incomes over 100K were generally overrepresented, particularly in the South and Midwest, with the highest overrepresentation observed in Mississippi (0.107) and Arkansas (0.068), though some states in the Northeast and West exhibited less representation.
To uncover more detailed geographic disparities, we further visualized the bias of different population groups at the county level (Fig 8). The results show that the spatial distribution of the bias was generally consistent with that observed at the state level, except for the Black and Asian groups. At the county level, the underrepresented and overrepresented counties for these two groups were relatively dispersed, with no clear clustering trend (except in the Southwest, where they are overrepresented). At the state level, by contrast, the Black and Asian populations were overrepresented in many states, particularly in the West and Northeast.
The geographic disparities in other years are consistent with those of 2019 at both the county and state levels, albeit with some variations at the state level. For instance, the Black and Asian populations were overrepresented in most states in 2018 and 2019 but underrepresented in most states from 2020 to 2022. Specifically, from 2020 to 2022, the Black population was underrepresented in almost all Midwestern states, and the Asian population was underrepresented in a growing number of states. In 2018, Hispanics were underrepresented in only a few states, such as Nebraska (-0.078), Kansas (-0.066), Alabama (-0.048), and Texas (-0.046). Moreover, the pandemic exacerbated geographic disparities in the sampling bias for different population groups: in 2020 and 2021, minority and low-income groups were highly underrepresented in more states, while the White and high-income groups were highly overrepresented in more states.
Overall, the spatial distribution of the bias at the state and county levels across the five years shows that youth, minorities, low-income groups, and those with lower levels of education were more likely to be underrepresented in the South and Midwest, while their counterparts (i.e., the White population and groups with higher income and education levels) were more likely to be overrepresented in the South. Moreover, the geographic disparities of the bias vary across the years, with the pandemic exacerbating these disparities for different population groups.
3.2.3 Temporal trend of the demographic and socioeconomic bias.
Beyond the observed geographic disparities, the demographic and socioeconomic bias shows noticeable changes over the five years, as illustrated in sections 3.2.1 and 3.2.2. In this section, we further analyze the monthly trend of the bias of each population group from 2018 to 2022 at both the county and state levels using heatmaps (Fig 9).
The temporal trends at the county and state levels are similar throughout the five years. The monthly bias across groups shows patterns similar to the overall bias (Figs A21-A25 in S1 Appendix). Gender shows minimal bias and is stable over time. Adult groups (aged 18 and over) were slightly oversampled (0.01–0.03), and those aged under 18 were marginally underrepresented; these patterns are generally consistent over time. Among racial/ethnic groups, Hispanics were slightly overrepresented in 2018 and early 2019. However, their representation gradually decreased, with significant underrepresentation (-0.05 to -0.02) observed from the onset of the pandemic (March 2020) to July 2021. Asians and Whites are generally overrepresented, with a negligible monthly bias of about 0.01. Those with a bachelor's degree or higher were overrepresented (0.03), whereas those with lower education levels were underrepresented (-0.02). Similarly, the high-income group (>100K) was overrepresented (0.01–0.03), while the low-income population (<50K) shows minor underrepresentation. Median-income households (50K–100K) were well sampled, and the monthly bias among income groups was minor overall (-0.02–0.02). It is important to note that this negligible bias results from averaging over 3,220 counties or 52 states (weighted by population); the bias fluctuates among individual counties or states with large variance (see the violin plots in Fig 6).
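The note that the national monthly bias is a population-weighted average over thousands of counties explains why the national curve can look negligible while individual counties vary widely. A small illustration of that weighting effect, with invented numbers:

```python
def weighted_mean_bias(county_rows):
    """Population-weighted mean of per-county bias values.
    county_rows: iterable of (population, bias) pairs for one month/group."""
    total_pop = sum(p for p, _ in county_rows)
    return sum(p * b for p, b in county_rows) / total_pop

# Two small counties with large negative bias and one large county near zero:
# the weighted national figure stays close to zero despite the local extremes.
rows = [(30_000, -0.15), (20_000, -0.10), (950_000, 0.00)]
print(round(weighted_mean_bias(rows), 4))
```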
Understanding the temporal changes in demographic and socioeconomic bias is crucial. While some population groups exhibit a consistent pattern of either overrepresentation or underrepresentation, the bias of other groups varies across months. During the pandemic, individuals from low socioeconomic groups (including Black and Hispanic individuals, those with less than a college education, and those with a household income below $50K) were significantly more underrepresented than in other periods, as shown by the shift to darker blue colors following the pandemic's onset. This finding suggests that the COVID-19 outbreak may have widened disparities in the sampling representation of vulnerable groups. This is not unexpected, as the pandemic has highlighted digital inequalities that are well known in public health research [41, 42].
4. Limitations and future research
We conducted a comprehensive bias analysis of this emerging mobile location dataset. Although this effort represents one of the first attempts in the literature, it has some limitations. First, our bias detection approach relies on aggregation-based methods [28, 29], which are not adequate for detecting biases at fine-grained levels, such as census tracts and block groups with few sampled devices (often only dozens). We advocate for novel methods to detect sampling bias among population groups at these levels.
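The small-sample problem can be illustrated directly: with only dozens of devices in a block group, the observed share of any population group swings widely around the true share by sampling chance alone, so aggregation-based bias estimates are unreliable there. A quick simulation, synthetic and stdlib-only:

```python
import random

def share_spread(true_share, n_devices, trials, rng):
    """Min and max observed group share over repeated random draws of
    n_devices devices from a population with the given true share."""
    shares = [
        sum(rng.random() < true_share for _ in range(n_devices)) / n_devices
        for _ in range(trials)
    ]
    return min(shares), max(shares)

# The observed share of a 20% group ranges far more widely with 30 devices
# (a small CBG) than with 3,000 (a large county).
rng = random.Random(42)
print("n=30:  ", share_spread(0.2, 30, 500, rng))
print("n=3000:", share_spread(0.2, 3000, 500, rng))
```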
Second, our analysis is limited in its coverage of other components of the mobile location data. Specifically, this study focused on the sampled device count; other important components of the SafeGraph monthly Patterns dataset, such as sampled POIs and visit counts at those POIs, were not fully addressed. One potential avenue for future research could involve using high schools with known student counts to examine the bias of sampled high school POIs and their visit counts. Furthermore, it should be noted that SafeGraph adds Laplacian noise to critical columns in Patterns, such as visitor home CBG, for privacy reasons [31]. After this noise is added, the dataset only includes census block groups (CBGs) with a minimum of two devices, and counts of 2–4 devices are reported as four. While this noise addition is generally considered to have minimal effect on large-scale, aggregated data, its influence on smaller areas such as CBGs remains uncertain. Consequently, the specific impact of the added noise on data biases requires further investigation.
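The small-count handling described above (CBGs with fewer than two devices dropped, counts of 2–4 published as 4) can be sketched as a censoring function; the Laplace perturbation is shown only schematically, since the noise scale SafeGraph actually uses is not disclosed here:

```python
import math
import random

def censor_count(true_count):
    """Reported-count rule described in the text: home CBGs with fewer than
    two devices are omitted, and counts of 2-4 are published as 4."""
    if true_count < 2:
        return None          # CBG dropped from the dataset
    if true_count <= 4:
        return 4             # small counts reported as 4
    return true_count

def with_laplace_noise(count, scale, rng):
    """Illustrative Laplace perturbation (inverse-CDF sampling); the scale
    parameter here is an assumption, not SafeGraph's actual value."""
    u = rng.random() - 0.5
    noise = -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))
    return max(0, round(count + noise))

print([censor_count(c) for c in [0, 1, 2, 3, 4, 5, 12]])
# → [None, None, 4, 4, 4, 5, 12]
```

Note how small true counts become indistinguishable after censoring, which is exactly why the bias impact on CBG-level analyses is hard to assess.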
The third limitation of this study is the inconsistency in boundary data between the Census Bureau and SafeGraph. While the Bureau adopted updated geographic boundaries in 2020, SafeGraph continued to use the 2010 boundaries. For consistency, this study utilized the Census Bureau's 2019 boundaries and the associated ACS 2019 data. The Census Bureau provides CBG-level crosswalk files to relate the two Census vintages [35]. We found that approximately 173,585 of the 220,333 census block groups (CBGs) in Census 2010 have the same GEOID among the 242,335 CBGs in Census 2020. The extent to which these boundary changes have influenced the findings of this study remains unclear, as this issue was not examined in the current paper.
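The GEOID stability check described above amounts to a set intersection between the 2010 and 2020 CBG identifier lists from the Census relationship files. A toy sketch with made-up GEOIDs:

```python
# Made-up 12-digit CBG GEOIDs standing in for the Census relationship files;
# the real 2010 and 2020 lists come from the Census Bureau crosswalk [35].
cbg_2010 = {"450790103011", "450790103012", "450790104001"}
cbg_2020 = {"450790103011", "450790104001", "450790104002", "450790105001"}

unchanged = cbg_2010 & cbg_2020
print(f"{len(unchanged)} of {len(cbg_2010)} 2010 CBGs keep their GEOID "
      f"among {len(cbg_2020)} 2020 CBGs")
```

Applied to the full crosswalk, this comparison yields the 173,585-of-220,333 figure reported above.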
Finally, we used the ACS 2019 5-year estimates in analyzing the bias for the 2020–2022 SafeGraph datasets. While we believe that the marginal population change (a 1.01% nationwide increase from 2019 to 2022) [32] will not significantly affect our conclusions, these estimates may not be entirely accurate. Therefore, caution should be exercised when interpreting the results of our study, particularly with respect to population-level estimates. Further research is needed to refine our understanding of the impact of these factors on bias in mobile location data.
5. Conclusion
The use of mobile location data for investigating human mobility patterns has become increasingly important in various research domains, and SafeGraph is one of the most commonly used sources of such data. In this study, we comprehensively examined the sampling bias of SafeGraph Patterns across five dimensions (spatial, temporal, urbanization, demographic, and socioeconomic), covering the entire US over the five years from 2018 to 2022. While our analysis focused on SafeGraph data in the US, the approach detailed in this study can be readily applied to other mobile location data such as Advan Patterns [6], other geographic regions (e.g., Canada and Europe), and other time periods.
The SafeGraph Patterns dataset exhibited a fluctuating sampling rate over the past five years, with an average of 7.5%, which is relatively large given the size of the US population. The sampling rate was relatively uniform at the county level, and the number of sampled devices was strongly correlated with the census population for both urban (r > 0.97) and rural counties (r > 0.91), but less so at the tract and CBG levels. The sampling bias was generally minor among population groups such as gender, age, and moderate-income, with biases typically falling within the range of -0.05 and +0.05. However, minority groups such as Hispanic populations, low-income households, and individuals with low levels of education generally exhibited higher levels of underrepresentation bias that varied over space, time, urbanization, and across spatial scales. We also observed a notable increase in the underrepresentation of low socioeconomic groups following the COVID-19 pandemic, indicating that the COVID-19 outbreak may exacerbate pre-existing disparities in the representation of vulnerable groups within the data. These findings provide important insights for future studies that utilize SafeGraph data or other mobile location datasets, highlighting the need to thoroughly evaluate the spatiotemporal dynamics of the bias across spatial scales when employing such data sources.
To ensure the accuracy and validity of results when analyzing mobile location data, we recommend that future studies using such data sources carefully consider the sampling biases from multiple dimensions and employ appropriate approaches to mitigate them. These approaches may include applying statistical weighting to adjust the data toward the true distribution of the population of interest, conducting sensitivity analyses to assess the impact of sampling bias on the results, and combining mobile location data with other sources, such as social media and census data, to provide additional information about the characteristics of the population.
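The statistical weighting recommendation can be made concrete with post-stratification: each sampled device is weighted by the ratio of its group's census share to its sample share, so reweighted group totals match the census distribution. A hedged sketch of this standard correction, not the paper's own procedure:

```python
def poststratify_weights(sample_counts, census_counts):
    """Post-stratification weight for each group: (census share) / (sample
    share). Applying these weights to sampled devices makes the reweighted
    group distribution match the census distribution."""
    sample_total = sum(sample_counts.values())
    census_total = sum(census_counts.values())
    return {
        g: (census_counts[g] / census_total) / (sample_counts[g] / sample_total)
        for g in sample_counts
    }

# Hispanic devices are half as prevalent in the sample as in the census,
# so each one receives weight 2.0.
census = {"hispanic": 18_000, "non_hispanic": 82_000}
sample = {"hispanic": 900, "non_hispanic": 9_100}
weights = poststratify_weights(sample, census)
print({g: round(w, 2) for g, w in weights.items()})
```

Such weights assume the sampled devices within each group are otherwise representative of that group, which the biases documented above suggest should itself be verified.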
Supporting information
S1 Appendix. This appendix includes the results of the bias analysis for the five years from 2018 to 2022.
Note that figures for 2019 are also included to facilitate comparison with other years.
https://doi.org/10.1371/journal.pone.0294430.s001
(DOCX)
References
- 1. Rahman MM, Mou JR, Tara K, Sarkar MI. Real time Google map and Arduino based vehicle tracking system. In 2016 2nd International Conference on Electrical, Computer & Telecommunication Engineering (ICECTE) 2016 Dec 8 (pp. 1–4). IEEE.
- 2. Cao G, Wang S, Hwang M, Padmanabhan A, Zhang Z, Soltani K. A scalable framework for spatiotemporal analysis of location-based social media data. Computers, Environment and Urban Systems. 2015 May 1;51:70–82.
- 3. Aguilera A, Boutueil V. Urban mobility and the smartphone: Transportation, travel behavior and public policy. Elsevier; 2018 Nov 2.
- 4. Birenboim A, Shoval N. Mobility research in the age of the smartphone. Annals of the American Association of Geographers. 2016;106(2):283–291.
- 5. SafeGraph. 2023a. Available from: https://www.safegraph.com/
- 6. Barry E. SafeGraph Patterns is Now on Dewey as Advan Patterns. 2023 January 23. Available from: https://www.deweydata.io/blog/advan-patterns-now-available
- 7. Li Z, Huang X, Hu T, Ning H, Ye X, Huang B, et al. ODT FLOW: Extracting, analyzing, and sharing multi-source multi-scale human mobility. Plos one. 2021 Aug 5;16(8):e0255259. pmid:34351973
- 8. Goodspeed R, Yuan M, Krusniak A, Bills T. Assessing the Value of New Big Data Sources for Transportation Planning: Benton Harbor, Michigan Case Study. Urban Informatics and Future Cities. 2021:127–50.
- 9. Wang J, Kaza N, McDonald NC, Khanal K. Socio-economic disparities in activity-travel behavior adaptation during the COVID-19 pandemic in North Carolina. Transport Policy. 2022 Sep 1;125:70–8. pmid:35664727
- 10. Coleman N, Gao X, DeLeon J, Mostafavi A. Human activity and mobility data reveal disparities in exposure risk reduction indicators among socially vulnerable populations during COVID-19 for five US metropolitan cities. Scientific Reports. 2022 Sep 22;12(1):15814.
- 11. Singh SS, Javanmard R, Lee J, Kim J, Diab E. Evaluating the accessibility benefits of the new BRT system during the COVID-19 pandemic in Winnipeg, Canada. Journal of Urban Mobility. 2022 Dec 1;2:100016.
- 12. Chang S, Pierson E, Koh PW, Gerardin J, Redbird B, Grusky D, et al. Mobility network models of COVID-19 explain inequities and inform reopening. Nature. 2021 Jan;589(7840):82–7. pmid:33171481
- 13. Ning H, Li Z, Qiao S, Zeng C, Zhang J, Olatosi B, et al. Revealing geographic transmission pattern of COVID-19 using neighborhood-level simulation with human mobility data and SEIR model: A Case Study of South Carolina. International Journal of Applied Earth Observation and Geoinformation. 2023 Apr 1;118:103246. pmid:36908290
- 14. Yan Y, Malik AA, Bayham J, Fenichel EP, Couzens C, Omer SB. Measuring voluntary and policy-induced social distancing behavior during the COVID-19 pandemic. Proceedings of the National Academy of Sciences. 2021 Apr 20;118(16):e2008814118. pmid:33820846
- 15. Li Z, Li X, Porter D, Zhang J, Jiang Y, Olatosi B, et al. Monitoring the spatial spread of COVID-19 and effectiveness of control measures through human movement data: proposal for a predictive model using big data analytics. JMIR Research Protocols. 2020 Dec 18;9(12):e24432. pmid:33301418
- 16. Yang W, Shaff J, Shaman J. Effectiveness of Non-pharmaceutical Interventions to Contain COVID-19: A Case Study of the 2020 Spring Pandemic Wave in New York City. medRxiv (2020). URL https://doi.org/10.1101/2020.09. 2020;8.
- 17. Song Y, Newman G, Huang X, Ye X. Factors influencing long-term city park visitations for mid-sized US cities: A big data study using smartphone user mobility. Sustainable Cities and Society. 2022 May 1;80:103815.
- 18. Liang Y, Yin J, Pan B, Lin MS, Miller L, Taff BD, et al. Assessing the validity of mobile device data for estimating visitor demographics and visitation patterns in Yellowstone National Park. Journal of Environmental Management. 2022 Sep 1;317:115410. pmid:35751247
- 19. Kupfer JA, Li Z, Ning H, Huang X. Using mobile device data to track the effects of the COVID-19 pandemic on spatiotemporal patterns of national park visitation. Sustainability. 2021 Aug 20;13(16):9366.
- 20. Yabe T, Rao PS, Ukkusuri SV. Resilience of interdependent urban socio-physical systems using large-scale mobility data: Modeling recovery dynamics. Sustainable Cities and Society. 2021 Dec 1;75:103237.
- 21. Hou Y, Poliquin CW. The effects of CEO activism: Partisan consumer behavior and its duration. Strategic Management Journal. 2023 Mar;44(3):672–703.
- 22. Banerjee S, Krebs C, Bisgin N, Bisgin H, Mani M. Predicting customer poachability from locomotion intelligence. InProceedings of the 5th ACM SIGSPATIAL International Workshop on Location-based Recommendations, Geosocial Networks and Geoadvertising 2021 Nov 2 (pp. 1–4).
- 23. Griffin G, Mulhall M, Simek C, Riggs W. Mitigating bias in big data for transportation. J Big Data Anal Transp 2: 49–59.
- 24. Grantz KH, Meredith HR, Cummings DA, Metcalf CJ, Grenfell BT, Giles JR, et al. The use of mobile phone data to inform analysis of COVID-19 pandemic epidemiology. Nature communications. 2020 Sep 30;11(1):4961. pmid:32999287
- 25. Sharma A, Farhadloo M, Li Y, Gupta J, Kulkarni A, Shekhar S. Understanding COVID-19 Effects on Mobility: A Community-Engaged Approach. AGILE: GIScience Series. 2022 Jun 10;3:14.
- 26. Pew Research Center. Share of adults in the United States who owned a smartphone from 2015 to 2021, by age group. In Statista. 2021.
- 27. Ito M, Kawahara JI. Effect of the presence of a mobile phone during a spatial visual search. Japanese Psychological Research. 2017 Apr;59(2):188–98.
- 28. Squire R. What About Bias in the SafeGraph Dataset?. 2019 October 17. Available from: https://colab.research.google.com/drive/1u15afRytJMsizySFqA2EPlXSh3KTmNTQ#offline=true&sandboxMode=true
- 29. Wang J, McDonald N, Cochran AL, Oluyede L, Wolfe M, Prunkl L. Health care visits during the COVID-19 pandemic: A spatial and temporal analysis of mobile device data. Health & place. 2021 Nov 1;72:102679. pmid:34628150
- 30. Coston A, Guha N, Ouyang D, Lu L, Chouldechova A, Ho DE. Leveraging administrative data for bias audits: Assessing disparate coverage with mobility data for COVID-19 policy. InProceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency 2021 Mar 3 (pp. 173–184).
- 31. SafeGraph. Patterns. 2022b. Available from: https://docs.safegraph.com/docs/monthly-patterns
- 32. US Census Bureau. American Community Survey 5-Year Data (2009–2021). Census.Gov. 2022 December 8. Available from: https://www.census.gov/data/developers/data-sets/acs-5year.html
- 33. Berry L. 2016–2020 ACS Release Includes Important Updates to Census Boundaries. ArcGIS Blog. 2022 March 10. Available from: https://www.esri.com/arcgis-blog/products/arcgis-living-atlas/mapping/acs-2016-2020-updated-boundaries/
- 34. SafeGraph. SafeGraph—Advan Methodology Differences. 2022a. Available from: https://community.deweydata.io/t/safegraph-advan-methodology-differences/26163
- 35. US Census Bureau. Relationship Files. Census.Gov. 2021 October 28. Available from: https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html
- 36. US Census Bureau. Urban and Rural. Census.Gov. 2023 February 9. Available from: https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html
- 37. Sun S, Wang X, Wang D. Smartphone usage patterns and social capital among university students: The moderating effect of sociability. Children and Youth Services Review. 2023 Dec 1;155:107276.
- 38. Blakeslee SB, Vieler K, Horak I, Stritter W, Seifert G. Planting seeds for the future: scoping review of child health promotion apps for parents. JMIR mHealth and uHealth. 2023 Jul 20;11(1):e39929. pmid:37471125
- 39. SafeGraph. Privacy Policy. 2023b. Available from: https://www.safegraph.com/privacy-policy
- 40. Rice RE, Katz JE. Comparing internet and mobile phone usage: digital divides of usage, adoption, and dropouts. Telecommunications policy. 2003 Sep 1;27(8–9):597–623.
- 41. Watts G. COVID-19 and the digital divide in the UK. The Lancet Digital Health. 2020 Aug 1;2(8):e395–6. pmid:32835198
- 42. Lai J, Widmar NO. Revisiting the digital divide in the COVID‐19 era. Applied economic perspectives and policy. 2021 Mar;43(1):458–64. pmid:33230409