Open-source data pipeline for street-view images: A case study on community mobility during COVID-19 pandemic

Street View Images (SVI) are a common source of valuable data for researchers. Researchers have used SVI data to estimate pedestrian volumes, conduct demographic surveillance, and better understand built and natural environments in cityscapes. However, the most common source of publicly available SVI data is Google Street View. Google Street View images are collected infrequently, making temporal analysis challenging, especially in low population density areas. Our main contribution is the development of an open-source data pipeline for processing 360-degree video recorded from a car-mounted camera. The video data is used to generate SVIs, which can then be used as an input for longitudinal analysis. We demonstrate the use of the pipeline by collecting an SVI dataset over a 38-month longitudinal survey of Seattle, WA, USA during the COVID-19 pandemic. The output of our pipeline is validated through statistical analyses of pedestrian traffic in the images. We confirm known results in the literature and provide new insights into outdoor pedestrian traffic patterns. This study demonstrates the feasibility and value of collecting and using SVI for research purposes beyond what is possible with currently available SVI data. Our methods and dataset represent a first-of-its-kind longitudinal collection and application of SVI data for research purposes. Limitations and future improvements to the data pipeline and case study are also discussed.


Introduction
Street-level imagery is becoming an increasingly popular form of data for research [1]. In particular, Street View Images (SVI) as popularized by Google Street View are used in many studies [2,3]. Uses for SVI data include estimating demographics [4], evaluating the built environment [5], surveying plant species [6], measuring pedestrian volume [7], and many other applications [8][9][10].
While SVI data can provide many useful insights for researchers, it is not without its flaws. For corporate-collected images such as Google Street View or Tencent Street View, the availability of images depends on where the companies decide to collect data, and capitals within Seattle were chosen to ensure a representative sample of the overall population of Seattle [22]. While the drivers tried to make the surveys as consistent as possible, exogenous factors occasionally caused deviations from standard protocols. For example, during three of the surveys (05-29-2020, 06-18-2020, and 06-26-2020), protests over the murder of George Floyd caused parts of the survey route to be unnavigable.

Data Processing Pipeline
After video collection, the raw data is segmented into image data. The images are sub-sampled from video frames so that they are collected about every 4 meters. The images are then uploaded into the DesignSafe-CI Data Depot [24]. From DesignSafe, the images are transferred to the TACC Frontera high-performance computing cluster [25]. We completed all file transfers between the two services using Globus [26]. Without access to these services, or similar ones, the storage and computing requirements for this project would be intractable.
On Frontera, the images are orthorectified, and pedestrian detection is then performed on the orthorectified images. The orthorectification transforms each image from a single image in the equirectangular projection into two images in the rectilinear (gnomonic) projection [27]. Pedestrians are detected in each of the new images using a convolutional neural network (CNN) based on a pre-trained model from the Pedestron repository [20]. Our data represents a highly challenging detection task, as there is great variation in lighting, backgrounds, human poses, levels of occlusion, and crowd density from image to image and run to run. The Cascade Mask R-CNN architecture in the Pedestron repository performed well on the CrowdHuman data set, which represents a similar challenge to our data [21]. All testing and use of the CNN was performed using GPUs on the Frontera cluster. An example image after undergoing orthorectification and pedestrian detection is shown in Fig 1. Using one GPU node on Frontera, with four NVIDIA Quadro RTX 5000 GPUs, the entire process takes about 3 seconds per original 360° image. Given the 4 million images we collected, this takes about 3,300 hours of computing time. While this is not a small number, when running in parallel, the whole process can be completed in a matter of days. In comparison, a human taking 10 seconds per orthorectified image to count all the pedestrians would take over 22,000 hours to complete the same task. File compression/decompression for file transfer also takes a substantial amount of time. Since we used DesignSafe as our main data storage platform, we had to transfer files to and from the Frontera supercomputer to perform our pedestrian detection. To avoid overloading the file transfer system, we compressed the images from each run into a tar file prior to transferring the files to Frontera. This file compression/decompression can take several hours per run, but can be performed in parallel with the detection algorithm since they are on different systems. After compression, file transfer using Globus [26] takes minutes.
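The computing-time figures quoted above follow from simple arithmetic, which the short Python sketch below reproduces. The image count and per-image times come from the text; the number of nodes used for the wall-clock estimate is a hypothetical value chosen only for illustration.

```python
# Back-of-the-envelope check of the processing-time figures quoted in the text.
SECONDS_PER_HOUR = 3600

n_images = 4_000_000         # approximate number of original 360-degree images
gpu_seconds_per_image = 3    # orthorectification plus detection on one GPU node
views_per_image = 2          # each original image yields two rectilinear views
human_seconds_per_view = 10  # manual counting time per orthorectified image

gpu_hours = n_images * gpu_seconds_per_image / SECONDS_PER_HOUR
human_hours = n_images * views_per_image * human_seconds_per_view / SECONDS_PER_HOUR


def wall_clock_days(total_hours, n_nodes):
    """Idealized wall-clock days if the work splits evenly over n_nodes."""
    return total_hours / n_nodes / 24


print(f"GPU node-hours: {gpu_hours:,.0f}")
print(f"Manual hours:   {human_hours:,.0f}")
# 20 nodes is an assumed figure, not one reported in the text.
print(f"Days on 20 nodes: {wall_clock_days(gpu_hours, 20):.1f}")
```

The estimate ignores queue time, file transfer, and compression, so it should be read as a lower bound on elapsed time.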
In post-processing, the pipeline filters out low-confidence detections (defined as any detection with less than 80% confidence) and associates the remaining high-confidence detections with U.S. Census Bureau GEOIDs [19]. We arrived at this confidence level after tuning for the precision and recall of the CNN classifier. Specifically, the pipeline filters based on the output of the second to last layer of the CNN, known as a softmax layer. For a k-class classification problem, the softmax layer will output a k-dimensional probability vector, where the i-th entry of the vector gives the probability that the original input to the CNN belongs to class i.
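As an illustration of the confidence filter described above, the following minimal sketch computes a softmax probability vector and keeps only detections at or above the 80% threshold. The record layout (`image`, `logits`) and the two-class toy scores are assumptions made for the example; they are not the output format of the Pedestron model.

```python
import math


def softmax(logits):
    """Map k raw scores to a k-dimensional probability vector."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_detections(detections, threshold=0.80):
    """Keep detections whose pedestrian probability is at least the threshold."""
    kept = []
    for det in detections:
        # Assumption: index 0 of the score vector is the pedestrian class.
        score = softmax(det["logits"])[0]
        if score >= threshold:
            kept.append({"image": det["image"], "score": score})
    return kept


# Toy records with hypothetical field names and made-up scores.
detections = [
    {"image": "img_0001", "logits": [2.0, 0.0]},  # pedestrian probability about 0.88
    {"image": "img_0002", "logits": [0.5, 0.0]},  # pedestrian probability about 0.62
]
kept = filter_detections(detections)
print([d["image"] for d in kept])
```

Raising the threshold trades recall for precision, which is the tuning decision the text refers to.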
The final stage of post-processing is GEOID matching, where latitude and longitude metadata are cross-referenced to disjoint geographic regions (e.g., U.S. census tracts or block groups) and their respective GEOID codes. The cross-referencing code assumes the availability of shapefiles describing the geometry of the geographic regions. Aggregating the pedestrian detections according to U.S. Census Bureau GEOIDs [19] is necessary for analyses using sociodemographic data collected by the census. Additionally, the pedestrian detections can easily be cross-referenced with custom geometry defined using popular geographic information system software, such as the capitals data used in route construction and our analysis.
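The core of GEOID matching is a point-in-polygon test. The sketch below uses a plain ray-casting test on toy rectangles to show the idea; it is not the repository's implementation, which reads region geometry from shapefiles. The region labels and coordinates are invented for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Check whether a horizontal ray from the point crosses this edge.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def match_geoid(lon, lat, regions):
    """Return the label of the first region containing the point, else None."""
    for geoid, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return geoid
    return None


# Two adjacent toy regions; labels and coordinates are hypothetical.
regions = {
    "TRACT_A": [(-122.34, 47.60), (-122.33, 47.60), (-122.33, 47.61), (-122.34, 47.61)],
    "TRACT_B": [(-122.33, 47.60), (-122.32, 47.60), (-122.32, 47.61), (-122.33, 47.61)],
}
print(match_geoid(-122.335, 47.605, regions))
print(match_geoid(-122.325, 47.605, regions))
print(match_geoid(-122.300, 47.605, regions))
```

Because the regions are disjoint, at most one match exists per point. With many regions, a spatial index would avoid testing every polygon.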
Following the GEOID matching step, the pedestrian detections data is written to a tabular format file (e.g., comma-separated values). This file is an "analysis-ready" data product, in the sense that it is readable by most popular statistical analysis software (R, SPSS, Stata, etc.) and can be easily merged with other datasets using the GEOID column(s). A visual depiction of the entire pipeline is shown in Fig 2. Full code and a manual for following our process are available at https://github.com/marte292/rapid-data-pipeline.
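A minimal sketch of writing such a tabular file with Python's standard `csv` module is shown below. The column names and example values are hypothetical; the actual schema is documented in the repository.

```python
import csv
import io

# Hypothetical column names, chosen for illustration only.
FIELDS = ["survey_date", "latitude", "longitude", "confidence", "geoid_tract"]


def write_detections(rows, handle):
    """Write detection records to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = [
    # GEOIDs are kept as strings so that leading zeros survive the round trip.
    {"survey_date": "2020-05-29", "latitude": 47.605, "longitude": -122.335,
     "confidence": 0.91, "geoid_tract": "53033XXXXXX"},  # placeholder GEOID
    {"survey_date": "2020-05-29", "latitude": 47.606, "longitude": -122.325,
     "confidence": 0.84, "geoid_tract": "53033XXXXXX"},  # placeholder GEOID
]

buffer = io.StringIO()  # stands in for a file opened with newline=""
write_detections(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Storing the GEOID as text rather than as a number is what makes later merges with census tables reliable.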

Data Processing
All analysis is performed using the Python programming language version 3.11 [28]. The initial data product as outlined in the previous section is a list of detections, alongside the date of collection, geolocation, and GEOID. We also utilized a similar list of the images themselves with the same features. The last dataset we utilized is the median household income data and racial demographic data from the 2019 American Community Survey (ACS) 5-year estimates. We aggregated the detections and image data for each data collection survey at the census tract level, then matched each census tract's total number of detections and images to its respective demographic and income data.
We utilized the data from 36 of the 37 surveys, omitting data from 10-29-2020, when a heavy rain event caused the survey to be stopped early due to poor video quality. For each survey, we divided the number of detections in each census tract by the number of images collected in the tract to create a normalized 'detections per image' metric. This is a necessary step, as the number of images in each tract may change from survey to survey due to circumstances outside our control, such as construction or community events altering the route.
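The normalization described above can be sketched as follows, assuming the image and detection lists have already been tagged with a survey date and a tract GEOID. The field names and toy records are hypothetical.

```python
from collections import Counter


def detections_per_image(images, detections):
    """Return {(survey, geoid): detections / images} for every surveyed tract."""
    image_counts = Counter((im["survey"], im["geoid"]) for im in images)
    detection_counts = Counter((d["survey"], d["geoid"]) for d in detections)
    # Tracts with images but no detections get 0.0 rather than being dropped.
    return {key: detection_counts.get(key, 0) / n for key, n in image_counts.items()}


# Toy data: four images and three detections across two tracts in one survey.
images = [
    {"survey": "2020-05-29", "geoid": "TRACT_A"},
    {"survey": "2020-05-29", "geoid": "TRACT_A"},
    {"survey": "2020-05-29", "geoid": "TRACT_B"},
    {"survey": "2020-05-29", "geoid": "TRACT_B"},
]
detections = [
    {"survey": "2020-05-29", "geoid": "TRACT_A"},
    {"survey": "2020-05-29", "geoid": "TRACT_A"},
    {"survey": "2020-05-29", "geoid": "TRACT_A"},
]
rates = detections_per_image(images, detections)
print(rates)
```

Iterating over the image counts, rather than the detection counts, keeps tracts with zero detections in the dataset.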
The last step in data processing was to transform some of our data into categorical variables. The date of each survey was coded both as either a weekend or weekday and by season. The date was also coded as being either before or after the date that vaccines became publicly available. Income data was coded as one of the 5 levels that were used during route design. Lastly, the proportion of the census tract's population that identifies as non-white was coded as an indicator variable with '1' corresponding to areas that are 55.5% white or more. We determined this threshold using Jenks natural breaks optimization. This left us with a dataset of 3171 observations to be used for analysis. Each observation represented a census tract on a given survey date with a detections per image value, as well as values for each of the categorical variables defined above.
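A minimal sketch of this coding step is given below. Two details are assumptions and not taken from the text: seasons are assigned by calendar month (meteorological seasons), and the vaccine availability date is a placeholder parameter that should be replaced with the date used in the study.

```python
from datetime import date

# Placeholder only; the text does not state the exact date used.
VACCINE_DATE = date(2021, 4, 15)


def season(d):
    """Assign a season by calendar month (an assumed convention)."""
    if d.month in (12, 1, 2):
        return "winter"
    if d.month in (3, 4, 5):
        return "spring"
    if d.month in (6, 7, 8):
        return "summer"
    return "fall"


def code_survey_date(d, vaccine_date=VACCINE_DATE):
    """Code a survey date into the categorical variables described above."""
    return {
        "weekend": int(d.weekday() >= 5),  # Saturday is 5, Sunday is 6
        "season": season(d),
        "vaccine": int(d >= vaccine_date),
    }


def white_indicator(percent_white, threshold=55.5):
    """Return 1 for tracts that are 55.5% white or more, else 0."""
    return int(percent_white >= threshold)


print(code_survey_date(date(2020, 5, 29)))  # a survey date named in the text
print(code_survey_date(date(2021, 7, 3)))   # hypothetical later date
print(white_indicator(60.0), white_indicator(40.0))
```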

Initial Regression Analysis
Based on the known literature, we hypothesized that season, day of the week, COVID-19 vaccine availability, income level, and demographics would all have an impact on pedestrian traffic. We implemented a regression model to determine which of these factors are statistically significant (α = .05). The regression model is detailed below:
$$Y = \beta_0 + \beta_1 I_{\text{vaccine}} + \beta_{2..4} C_{\text{season}} + \beta_5 I_{\text{weekend}} + \beta_{6..9} C_{\text{incomelevel}} + \beta_{10} I_{\text{demographicindicator}}$$
where $Y$ is the detections per image for a date/census tract combination; $I_{\text{vaccine}}$ is an indicator for whether the vaccine was available on that date; $C_{\text{season}}$ is a categorical variable with 3 levels for summer, winter, and spring; $I_{\text{weekend}}$ is an indicator for whether it is the weekend; $C_{\text{incomelevel}}$ is a categorical variable with 4 levels for the 4 income brackets above the lowest bracket; and $I_{\text{demographicindicator}}$ is an indicator variable for whether the population is 55.5% white or more. $\beta_0$ is the baseline detections per image on a weekday, not in the summer, with the vaccine unavailable, in a census tract at the lowest income level and with a population less than 55.5% white. $\beta_1$ represents the change in detections per image from the vaccine becoming available, and $\beta_{2..4}$ represent the change for different seasons. $\beta_5$ represents the change from a weekday to the weekend, and $\beta_{6..9}$ represent the change to other income brackets. Lastly, $\beta_{10}$ represents the change in detections per image from an area that is less than 55.5% white to an area that is 55.5% white or more.
In addition to the above analysis, we subset the data by only looking at detections that occurred in an image with at least one other detection. Then we calculated detections per image again, and fit the above model again with the new response variable. This same process was followed for detections with at least two, three, and four other detections in the same image. The goal of these analyses was to see if there were different trends for larger groups of people when compared with the entire data set.
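The subsetting step can be sketched as below. One detail is an assumption: the denominator is taken to be the total number of images in the tract for that survey, as in the full-data metric, since the text does not specify it. The image identifiers are made up.

```python
from collections import Counter


def subset_detections_per_image(detection_image_ids, n_images, min_others):
    """Detections per image, counting only detections that share their image
    with at least `min_others` other detections."""
    per_image = Counter(detection_image_ids)
    kept = sum(count for count in per_image.values() if count >= min_others + 1)
    # Assumption: normalize by all images, not only those with detections.
    return kept / n_images


# Toy tract: 10 images, 6 detections spread over 3 of them.
ids = ["a", "a", "a", "b", "c", "c"]
for k in range(5):
    print(k, subset_detections_per_image(ids, n_images=10, min_others=k))
```

With `min_others=0` the function reduces to the original detections per image metric, so the same code serves all five response variables.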

Data pipeline
Our main contribution, the open-source data pipeline, is publicly available at https://github.com/marte292/rapid-data-pipeline. The repository contains a process manual with step-by-step instructions on how to implement the data pipeline in Python [28]. The required Python libraries and system requirements are provided. Additionally, we provide enough code for future researchers to implement the pipeline on their own systems, with their own file structure. The pipeline is capable of processing terabytes of image data and outputting an analysis-ready data product in a matter of days (using high-performance computing, such as a single GPU node on Frontera, an academic supercomputer) with minimal human input.

Case study
Using data from the Seattle street-level imagery campaign, we calculated the number of detections per image across all data collection surveys. The full results of the linear regression model for total detections per image are displayed in Table 1. They show that summer is the only significant seasonal effect. Additionally, the income bracket is a significant predictor, with wealthier areas seeing less pedestrian traffic. Finally, a census tract having a population that is 55.5% white or more is a significant positive predictor. All other variables are not significant, including vaccine availability.
For the regression models using a subset of data, the results are similar to the initial model. All models have the same significant predictors as the initial model. The model using the detections sharing an image with at least one other also had the weekend as a borderline significant, negative predictor. The models using detections sharing an image with at least 3 and 4 others had vaccine availability as a significant, positive predictor. The full results of the linear regression model for detections per image with at least 4 others are displayed in Table 2.

Comparison to Google Community Mobility Data
Given the ability to measure community mobility through pedestrian counts, there is potential value of our pipeline for social sciences and public health research [29,30]. At an individual level, higher physical activity is known to predict better physical [31,32] and mental health [33][34][35], and is associated with higher self-reported satisfaction and quality of life [36,37]. In an aggregate sense, mobility is theorized to be an intermediate variable through which socioeconomic deprivation affects vulnerability to infectious disease [38,39], resilience to disasters [40], and exposure to environmental hazards [41].
In light of this body of literature, we argue that the use of pedestrian counts to assess mobility could be a differentiating factor in researching social and health inequity. Two extremely common sources of mobility data during the COVID-19 pandemic have been Google Community Mobility Reports [42] and Apple Mobility Trends Reports [43].
While there have been improvements in recent years [44], there are known representation and self-selection biases with existing mobility data captured by smartphones and other internet-based data collection methods [45][46][47][48][49].
Given the large number of publications using smartphone data as the foundation for their work, a natural question is how our data compares to smartphone mobility data. Comparison between our data set and the still publicly available Google Community Mobility Reports data can reveal some of the similarities and differences between the two data sets [42]. Google Community Mobility data is reported at the county level in the United States. Since Seattle is in King County, Washington, the King County data is what we use to draw the comparison.
Google Community Mobility data does not provide raw mobility numbers, but rather is reported as a percentage change from the five-week period of Jan 3 to Feb 6, 2020. This data is collected from smartphones running the Android operating system with location history turned on, which is off by default. The data is baselined by day of week, so data from a given Monday is compared to the median of the five Mondays in the baseline window to calculate a percent change. Additionally, it is unclear how exactly Google quantifies mobility. Google states that it combines the number of visitors to a location with the amount of time spent in that location, but no specifics beyond that are provided.
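The day-of-week baselining described above can be illustrated with a short sketch. Because Google does not publish its underlying mobility measure, the numbers here are invented, and the function shows only the percent-change arithmetic, not Google's actual computation.

```python
from statistics import median


def percent_change_from_baseline(value, baseline_values):
    """Percent change of `value` relative to the median of the baseline values
    observed on the same day of the week."""
    base = median(baseline_values)
    return 100.0 * (value - base) / base


# Made-up mobility values for the five Mondays in the baseline window.
baseline_mondays = [100, 110, 90, 105, 95]
later_monday = 80  # made-up value for a later Monday

change = percent_change_from_baseline(later_monday, baseline_mondays)
print(f"{change:+.1f}%")
```

Using the median instead of the mean keeps a single unusual baseline day, such as a holiday, from shifting the reference level.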
Google mobility data is broken down into different categories. The category that most closely aligns with one of the categories used in our analysis is parks. Although Google's data classifies parks as official national parks and not the general outdoors, it does not indicate how it accounts for city or state parks. Our own data for park locations is based on the City of Seattle's official classifications.
Fig 4 shows a comparison of our detections per image data against Google Community Mobility data. Note that not all surveys are included because Google Community Mobility data stopped being provided on October 15, 2022. Overall, the trends between the two data sets are broadly similar (Pearson correlation of 0.387), lending further credibility to our data collection procedure. The more notable differences in the graph are from the months of November 2020 through August 2021, where the Google mobility data shows a larger drop followed by a larger increase in community mobility than was visible in our own data. One plausible explanation for this is the upward sampling bias that occurs when using smartphone data [50,51]. Our data set captures anyone on the street, including individuals experiencing homelessness who are less likely to have smartphones. This population was on the streets throughout the entirety of the pandemic, so they were consistently captured by our data collection efforts. This consistent baseline pedestrian count could lead to a lesser response to vaccine rollout and winter weather in our own data in comparison with Google's. Additionally, there is a known income gap in both vaccination rates and smartphone ownership [52,53]. This gap could drive the increase in the Google Mobility data during vaccine rollout.
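For reference, the Pearson correlation reported for Fig 4 is the standard sample correlation coefficient, sketched below in plain Python. The example assumes the two series have already been paired on common survey dates; the numbers are toy values, not the study data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy series paired by date: detections per image and percent change from baseline.
ours = [0.30, 0.25, 0.40, 0.35]
google = [-20.0, -35.0, 10.0, -5.0]
print(round(pearson(ours, google), 3))
```

The coefficient is unaffected by the different units and axis scales of the two series, which is why it suits a comparison between a ratio and a percent change.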

Implications, Limitations, and Extensions
Our results show that it is possible for researchers to collect and analyze longitudinal SVI data. The presented methods can be used to collect and process SVI data from 8 hours' worth of video in a matter of days. This time will only further decrease with faster data processing infrastructure and methods. These methods will allow novel longitudinal SVI data to be collected for research in a variety of application areas.
The results of the case study also bear further discussion. We demonstrated expected relationships between pedestrian traffic and temporal effects such as day of week and seasonal weather. Additionally, we showed that pedestrian traffic is inversely related to income, a known result during the COVID-19 pandemic [54,55]. Our results also showed that areas with larger white populations had higher pedestrian counts on average. This could be due to known trends, such as areas with larger non-white populations being more likely to stay home in response to government restrictions [56], or just due to local trends, as racial mobility trends tend to vary between cities [57]. These results validate our method with respect to established literature.
One new finding from our case study is that while overall pedestrian counts did not respond to vaccine availability, the subset of pedestrians who were in larger groups (4+ people in an image) did. The likely reason we did not see a response to the vaccine in the aggregate data is that our data only captures people who are outdoors. There is evidence that outdoor pedestrian activity varied across cities, frequently increasing at recreation locations like trails, during the early days of the pandemic [58,59]. Given these increases at some locations, a return to 'normal' pedestrian traffic may not mean an increase, but rather a change in traffic patterns. Our data captures this by showing that there was a significant increase in larger groups of people after the vaccine became available. This implies that people were more willing to be near each other outdoors after they had been vaccinated.
While the data pipeline presented here does represent a method for generating a novel data product, there are implementation challenges worth further discussion. For data collection, in addition to the time required to drive the route limiting the places of interest the route could reach, there were also many tradeoffs that had to be made when designing the route itself [22]. Despite having our survey route carefully designed to assess a representative sample of the Seattle population, some bias in route design is unavoidable. Since the route design included data from the American Community Survey aggregated at the census tract level, there is an implicit assumption of spatial homogeneity of the population within each census tract. Such bias is a manifestation of the well-known modifiable areal unit problem [60]. Since the majority of the route was primarily based on locations of interest throughout the city, this concern is somewhat mitigated.
In terms of processing, the pre-trained model we used required a substantial amount of high-performance computing time, and at times the data product generated was so large as to be unwieldy. Given the challenge our data set represents, using a model designed to be generalizable was necessary to attain good detection results. As many state-of-the-art models perform substantially worse out of sample, we had to be careful to choose a model that was designed to perform well in this situation, at the cost of slower computing times [61]. Another unforeseen challenge was regular updates to the software used by the video camera to process and segment the video data into images. Consistent image formatting was vital for the data processing pipeline to function, so regular quality checks were necessary to make sure the images were processed properly.
The data product created, pedestrian detections, has some limitations as well. First, our method only captures pedestrians who are outdoors and near enough to the street to be captured by the camera. This means that our data set does not include people who are indoors at these locations of interest, or who are too far from the street to be seen by the camera. While the changes over time in pedestrian traffic we observed are still meaningful, it is important to recognize that they do not capture all pedestrian activity. Similarly, our data cannot be interpreted as the actual number of pedestrians on the street. There is overlap in the image data, even when sub-sampled at 4-meter intervals and cropped during orthorectification. The orthorectified images only represent about 25% of the originals. However, this natural cropping is not enough to avoid the image overlap, and further cropping would risk information loss. Pedestrians that appear in the foreground of one image may end up in the background of another. There are also several known instances of cyclists keeping relative pace with the street-view vehicle for several blocks, resulting in numerous detections. These issues are easy to circumvent in analysis by comparing the relative number of detections, although at the cost of interpretability.
Even with the above limitations, the data pipeline presented in this paper can be directly applied or adapted to be used in a number of contexts. Potential applications of longitudinal SVI data in assessing the built environment [14], broad urban research [1,3,62], and health research [8] have been well-documented, as the temporal instability of existing SVI data is discussed as a limitation in all of these fields. Beyond this, it is possible to estimate population demographics [4] and other neighborhood-level statistics [13,63] using SVI data. As our ability to quickly and accurately parse scenes using computer vision improves [64], potential application areas will only increase in number.
Another field where longitudinal SVI data could contribute substantially is disaster research. There is a substantial body of research dedicated to empirical methods for modeling various aspects of disaster recovery [65]. Our methods could be applied in this field to quantify recovery using pedestrian detections as a metric for community mobility, or another metric assessing the built environment as appropriate. Similar work has been done using repeat photography after Hurricane Katrina [15], but our methods represent a substantial increase in generated data, allowing for a wider range of analyses. Spatial video data collection for disaster reconnaissance has also been done [66], but involves manual assessment of the captured video. Our methods demonstrate that a fully automated approach is possible, which would allow for more frequent data collection at a lower cost.

Conclusion
Regression analysis based on longitudinal SVI data showed that pedestrian traffic patterns changed in response to the availability of the COVID-19 vaccine. Our results demonstrate the feasibility and value of collecting SVI data as part of a longitudinal study. Longitudinal SVI data is capable of providing valuable insights in a variety of fields of study.

Fig 1. Sample images from the pedestrian detection data pipeline. The left image is an original 360° image from a data collection run. The image on the right is the right-hand side of the original image after orthorectification and pedestrian detection (both sides of the image are processed separately). There are two pedestrians that were detected by the algorithm (in red bounding boxes).

Fig 2. Flowchart of the data processing pipeline. The parts of the flowchart in gray occur on NHERI DesignSafe-CI, while the right-hand part in blue is done on the Frontera cluster.
Fig 3 shows the trends over time in detections per image for each survey, as well as in detections per image for the subset of detections sharing an image with at least 4 others. Fig 3 also marks the date when COVID-19 vaccines became publicly available in Washington state. While both graphs exhibit similar trends overall, notably after vaccine rollout the graph of detections sharing an image with at least 4 others exceeds the graph of detections per image in all cases. The spike in detections seen in June 2020 is due to the large-scale protests of police brutality that took place in Seattle in the aftermath of George Floyd's murder.

Fig 3. Time series data of the total detections per image (solid blue line, left axis), and detections per image for the subset of detections sharing an image with at least 4 others (orange dashed line, right axis). As the survey dates are irregular, all dates are included in the figure. Please note that the axis for total detections per image does not start at 0. This was done purposefully to facilitate comparison between the trends of the two graphs.

Fig 4. A depiction of our own detections per image data (blue, dashed; right axis) against Google Community Mobility data (orange, solid; left axis). The Pearson correlation between the two data sets is 0.387. The Google Community Mobility data is aggregated at King County, WA, while our data covers a survey route within Seattle, which belongs to King County. As the dates of surveys were irregular (e.g., due to weather conditions), all dates are included in the figure.

All other regression models are available in the supporting information.

Table 1. OLS Regression Results for Detections per Image.

Table 2. OLS Regression Results for Detections per Image for the detections subset sharing an image with at least 4 others.