Crowdsourcing architectural beauty: Online photo frequency predicts building aesthetic ratings

The aesthetic quality of the built environment is of paramount importance to the quality of life of an increasingly urbanizing population. However, a lack of data has hindered the development of comprehensive measures of perceived architectural beauty. In this paper, we demonstrate that the local frequency of geotagged photos posted by internet users on two photo-sharing websites strongly predicts the beauty ratings of buildings. We conduct an independent beauty survey in which respondents rate proprietary stock photos of 1,000 buildings across the United States. Buildings with higher ratings were more likely to be geotagged with user-uploaded photos in both Google Maps and Flickr. This correlation also holds for the beauty rankings of raters who seldom upload materials to the internet. Objective architectural characteristics that predict higher average beauty ratings of buildings also positively covary with their internet photo frequency. These results validate the use of localized user-generated image uploads on photo-sharing sites to measure the aesthetic appeal of the urban environment in the study of architecture, real estate, urbanism, planning, and environmental psychology.

Geotag coordinates are obtained from the Exchangeable Image File Format (Exif) metadata recorded by cameras and smartphones with GPS systems. These coordinates capture the position from which the photo was taken. However, both Panoramio and Flickr also allow users to pin down image locations on a digital map in the absence of Exif information. In both sources, the software encourages users to pin images to the position from which they were taken, but users may rely on alternative heuristics for photo location (Larson et al. (2015)). Nevertheless, since city street widths limit the distance between the photographer and the landmark, we can expect photos taken and uploaded around major urban buildings to have geotags proximate to the building's coordinates. Moreover, users taking photos from farther away tend to geotag the photographed building itself.
Recent research by Zielstra and Hochmair (2013) has carefully analyzed the positional accuracy of images in Flickr and Panoramio.1 These authors find Panoramio's geotags to be more accurate than Flickr's, since its users are more likely to own better cameras and to be more sophisticated in geolocating their photos. Importantly, a large share of the discrepancies are due to users geotagging the exact coordinates of the buildings they were capturing rather than the point from which they were shooting (27 and 15 percent for North American street buildings in Flickr and Panoramio, respectively). Therefore, positional errors are generally biased in the direction of making photo geotags closer to the target buildings. Moreover, differences between Flickr and Panoramio can be attributed to the type of photo that is being posted. For example, Flickr photos tend to show a significant amount of human activity compared to Panoramio, which shows more scenic and landmark photography. Therefore, we expect Panoramio to be a better proxy for architectural beauty. In any case, the number of photos geotagged around each building needs to be interpreted probabilistically: more photos in the vicinity of a building signal a higher probability of that building being depicted. Type I classification errors (including pictures that are not of the building measured) and type II classification errors (missing pictures of a building that are posted beyond the distances we explore in the paper) will typically bias downward the relationship between image uploads and the actual building beauty metric. Therefore, estimates in the article need to be interpreted as lower bounds for the quantitative impact of perfectly measured image uploads on building beauty. Since researchers will tend to focus on the relative ranking by building type rather than on the exact beauty rating numbers (which are by definition determined by the survey design), the downward bias may not be a problem in statistical applications that use image uploads, under the law of large numbers, to capture building beauty across building types.

1 Focusing on street buildings in the United States, Zielstra and Hochmair (2013) found the median distance between the camera position and the geotag of Flickr images to be 31 meters. For Panoramio, the distance was only 15 meters.
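To make the measurement concrete, the count of geotagged photos within a given radius of a building can be computed directly from coordinates with a haversine distance cutoff. The sketch below is illustrative only; the coordinates, the 50-meter threshold, and the function names are our own assumptions, not part of either website's data interface.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    r = 6_371_000  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def photos_within(building, photos, radius_m=50.0):
    """Count geotagged photos whose coordinates fall within radius_m of a building."""
    blat, blon = building
    return sum(1 for plat, plon in photos
               if haversine_m(blat, blon, plat, plon) <= radius_m)

# Hypothetical inputs: one building and three photo geotags.
building = (40.7484, -73.9857)
photos = [(40.7486, -73.9858),   # roughly 25 m away
          (40.7484, -73.9853),   # roughly 34 m away
          (40.7570, -73.9860)]   # roughly 1 km away: excluded
print(photos_within(building, photos))  # two of the three geotags fall within 50 m
```

In practice, such counts would be computed for every building in the sample and for several radii, producing the annuli used later in this appendix.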
For example, consider two sets of different building types, which we conventionally denominate "brick" and "concrete." Assume that each set contains a large and identical number of buildings, but that the total number of photos posted around "brick" buildings is double the number of photos posted around "concrete" buildings. Given the large number of observations, and if there is no reason to assume that geotagging errors by internet users depend on building materials, the actual number of true photos of "brick" buildings should also be double (the rate of misclassification across groups should be constant in large samples). Furthermore, given the results in the paper, the mean beauty of the "brick" buildings is likely to be substantially higher than the mean beauty of the "concrete" buildings, which is sufficient for many, if not most, applications. Nevertheless, the differences in image uploads between the two groups multiplied by the coefficients reported in Table 2 of the main text are likely to underestimate their mean beauty differences, due to attenuation bias.
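The logic above can be illustrated with a small simulation (all numbers hypothetical, not drawn from our data): a measurement error that does not depend on building type preserves the two-to-one group ratio in large samples, while noise in the observed photo counts attenuates a regression coefficient toward zero.

```python
import random

random.seed(42)

# Hypothetical setup: 10,000 buildings per group; "brick" buildings truly
# attract twice as many photos on average as "concrete" buildings.
n = 10_000
true_brick = [random.gauss(20, 4) for _ in range(n)]
true_concrete = [random.gauss(10, 4) for _ in range(n)]

# Misclassification that does not depend on building type: each observed
# count keeps a fixed share of true photos and picks up random stray photos.
def observe(count):
    return 0.8 * count + random.gauss(0, 3)

obs_brick = [observe(c) for c in true_brick]
obs_concrete = [observe(c) for c in true_concrete]

# The 2:1 group ratio survives in large samples...
ratio = (sum(obs_brick) / n) / (sum(obs_concrete) / n)
print(round(ratio, 2))  # close to 2.0

# ...but a regression of beauty on the noisy counts is attenuated.
def ols_slope(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

truth = true_brick + true_concrete
beauty = [0.1 * t + random.gauss(0, 0.5) for t in truth]  # true slope: 0.1
noisy = [observe(t) for t in truth]
print(ols_slope(truth, beauty))  # near the true slope of 0.1
print(ols_slope(noisy, beauty))  # smaller: attenuation bias
```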
Note that if researchers have a priori doubts about the randomness of measurement errors across categories (e.g., brick buildings tend to be in high-pedestrian-traffic areas or on narrower streets and would therefore receive more type I misclassifications), then they should explicitly control for the potential sources of covariance that generate such measurement errors (e.g., control for pedestrian traffic or street width). In most applications, conditioning on neighborhood effects (e.g., zip code or census tract fixed effects) should control for localized differences in photo-taking behavior that could give rise to other patterns of misclassification error.
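As an illustration of the fixed-effects remedy, area fixed effects can be implemented with a within-transformation: demean the outcome and the regressor within each area before regressing. The sketch below uses invented zip codes, photo counts, and beauty scores; it is a minimal example, not our estimation code.

```python
from collections import defaultdict

def demean_by_group(values, groups):
    """Subtract each group's mean from its members (within-transformation)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, groups):
        sums[g] += v
        counts[g] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    return [v - means[g] for v, g in zip(values, groups)]

# Hypothetical rows: (beauty score, photos within 50 m, zip code).
rows = [(7.1, 12, "10001"), (6.8, 9, "10001"), (5.2, 3, "60601"),
        (5.9, 5, "60601"), (8.0, 15, "94105"), (7.4, 11, "94105")]
beauty, photos, zips = (list(t) for t in zip(*rows))

# Demeaning within zip code is equivalent to including zip-code fixed effects.
y = demean_by_group(beauty, zips)
x = demean_by_group(photos, zips)
slope = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
print(round(slope, 3))  # within-zip association of photo counts with beauty
```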
Another interesting and related issue has to do with the possible spatial correlation of building beauty. If architectural beauty tends to be geographically clustered, then type I classification errors (including photos from adjacent buildings) may be more likely for edifices within "beautiful" clusters. Beauty clustering may not be a problem for many applications if such clustering is systematic. In this case, the beauty of adjacent buildings can be thought of as an additional predictor of a building's beauty. In other applications, a straightforward way to bypass this issue is to include area fixed effects, so that we study differences in the impact of building types within each cluster. However, our results in Table 2, Panel I, column 5, suggest that the additional information in annuli farther away from the building (which are more likely to capture adjacent buildings) does not generate statistically significant coefficients. In other words, there does not seem to be a substantial contribution to explaining building beauty from photos that are more likely to depict neighboring buildings. Note that if neighboring beauty were a strong predictor of a building's beauty, we should have found stronger correlations. After all, image uploads within 50 meters of one building are expected to be a noisier proxy for the building's beauty than image uploads encompassing wider areas.
In conclusion, while building beauty may indeed be spatially correlated, local image uploads within 50 meters appear to be a sufficient statistic for perceptions of a building's beauty.

A3 Robustness tests
This section presents additional robustness tests that validate the use of image uploads as a proxy for building beauty. As a first test, we present scatter plots of the relationship between image uploads and building beauty. Figure A2 displays the graphic version of the relationship shown in Table 2 for both photo-sharing websites. Interestingly, photo frequencies within the first 10 meters are less informative in Flickr. This could be consistent with the fact that Flickr users are more likely to upload coordinates directly from their phone, corresponding to the place from which they took the photo; a setback of 10 meters would therefore seem reasonable. Conversely, the Panoramio application makes it easy for users to pin photos to the exact geolocation of the buildings on a map. Alternatively, some of the differences in marginal significance between websites within the first 50 or 60 meters may just be random. Nevertheless, in both datasets, the sum of all photos at distances between 0 and 50 meters provides a strong predictor of building beauty, one that is strongly correlated across sites. Table A1 shows the results of the regression that explains building beauty using height, age, and architectural style dummies, as used in Table 3 of the main text.
We find that Spanish Colonial Revival and Beaux-Arts architectural styles receive on average higher beauty ratings, while Modernist and Early Modernist buildings receive the lowest.2 The fitted values from this regression (the explanatory variable values multiplied by the estimated coefficients) are used as our "predicted" component of beauty, and the residuals (orthogonal to the explanatory variables) as our measure of "residual" beauty.

Table A1 notes: The omitted architectural style, which serves as the baseline, is Modernism. Below each of our estimates and in parentheses, we report standard errors that are robust to heteroskedasticity and clustered on buildings. *** denotes a coefficient significant at the 1% level, ** at the 5% level, and * at the 10% level.

Figure A3: Estimated survey score marginal gains from pictures in range (Flickr)

As an additional robustness test, we repeat the exercise in Table 2, Panel I, column 5 of the main text, this time allowing for the range of controls that we used in other columns of the same table. Table A2 presents the estimates of image uploads using two-dimensional annuli ("donuts") of different widths around each building in our sample. We then regress building beauty simultaneously on the number of photos uploaded in each annulus. The fact that the coefficients on photo uploads within 50 meters are robust across specifications confirms the highly localized nature of the relationship between image uploads and building beauty, which suggests that we are not confounding contextual factors, regional differences, or neighborhood effects.
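The annuli specification can be mimicked on simulated data. In the sketch below (all counts and coefficients are invented for illustration), only the innermost ring truly drives the beauty score, and a simultaneous regression on all rings recovers that pattern, echoing the logic of the Table A2 exercise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: photo counts in three concentric rings around each
# building; only the innermost ring (0-50 m) truly affects beauty.
n = 500
ring_0_50 = rng.poisson(8, n)
ring_50_100 = rng.poisson(8, n)
ring_100_200 = rng.poisson(12, n)
beauty = 5.0 + 0.10 * ring_0_50 + rng.normal(0, 0.5, n)

# Regress beauty simultaneously on all three ring counts (plus a constant).
X = np.column_stack([np.ones(n), ring_0_50, ring_50_100, ring_100_200])
coefs, *_ = np.linalg.lstsq(X, beauty, rcond=None)
print(np.round(coefs, 2))  # the 0-50 m coefficient stands out; outer rings near 0
```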

A4 Respondents in Online Sample
As explained in the text, we used the services of a private vendor (Qualtrics) to conduct our survey online. We contracted an ex-ante random sampling of the US population, as opposed to much more expensive methods that explicitly stratify the sample to achieve average population characteristics ex-post. In practice, this implies that we will have sampling error with regard to matching national characteristics. In addition, the procedure could also include errors introduced by the online sampling methods of the private vendor. Heen et al. (2014) compare the demographic characteristics of samples drawn from several online survey platforms, including Amazon's Mechanical Turk. On average, they find that online surveys tend to oversample the highly-educated and white, but do well in other dimensions. In general, they conclude that "for many applications, the advantages of online surveys (e.g., the efficiency of data collection, lower economic costs, and acceptable approximations to population profiles) far exceed their disadvantages regarding external validity."

Table A2 notes: The number of photos within each annulus is shown in tens. Each column presents a different specification, and the bottom rows describe the covariates and sample restrictions of each model. Below each of our estimates and in parentheses, we report standard errors that are robust to heteroskedasticity and clustered on buildings.

Table A3 reports the demographic characteristics of our survey, conducted in 2013, and the American Community Survey (ACS) 5-year estimates from 2009 to 2013. Among the demographic characteristics, we have age reported in brackets (under 20, 20-30, 30-40, 40-50, and 50 or older) and gender. Our survey also includes the race of the respondent, which corresponds to one of the following categories: White/non-Hispanic, African American, Asian, Hispanic, and Other. The survey also reports the education level of respondents, which ranges from less than high school to high-school graduates and respondents with some or completed college. Finally, concerning geography, our survey reports whether respondents live in a metropolitan area and their state of residence.
Some of the survey characteristics do not deviate much from those reported in the census. However, as in Heen et al. (2014), the online survey did tend to oversample whites. The survey's largest discrepancy is with regard to metropolitan-area status. The frequency of self-reported metropolitan status is much lower than we would expect under random sampling. Such discrepancies could arise from differing conceptions among respondents of what constitutes a metropolitan area. However, we take the discrepancies at face value to assess the robustness of the findings.
In order to see whether the differences between the census sample and the survey in Table A3 affect the covariance between assessed beauty and online photo frequencies, we conduct additional exercises that reweight the survey data to match the frequencies of the census demographic categories. In these exercises, we eliminate the respondents who report "unknown" in the survey (amounting to 18 people). We conclude that, in our case, the raw online sample provides researchers with valid variation for investigating issues related to the environmental covariates of perceived architectural beauty. Moreover, the use of similar online rating exercises can provide researchers with a cost-effective way to study architectural beauty in other contexts as well. Conducting offline image ratings with large respondent samples (e.g., over 500) and many image ratings (exceeding 100 per respondent) might make these exercises prohibitively expensive, thereby curtailing such investigations. Reweighting exercises, as we do here, can then be conducted to assess the robustness of results to sampling conditions.
Covariates:

Rater effects; photo order effects.

Notes: The dependent variable is the average survey score. Observations are building- and rater-specific. Each column presents a different specification, where we vary the source of image uploads. The bottom rows describe the covariates in each model. Below each of our estimates and in parentheses, we report standard errors that are robust to heteroskedasticity and clustered at the building level. *** denotes a coefficient significant at the 1% level, ** at the 5% level, and * at the 10% level.

Figure A4: Top survey photos ranked by mean respondent scores.
Figure A5: Bottom survey photos ranked by mean respondent scores.
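The reweighting exercise described above amounts to simple post-stratification: each respondent receives a weight equal to the census share of their demographic group divided by that group's sample share. The sketch below uses invented race categories and shares purely for illustration; it is not our survey data.

```python
from collections import Counter

# Hypothetical survey of 100 respondents and hypothetical census targets.
survey_race = ["white"] * 70 + ["black"] * 10 + ["asian"] * 10 + ["hispanic"] * 10
census_share = {"white": 0.62, "black": 0.13, "asian": 0.06, "hispanic": 0.19}

counts = Counter(survey_race)
n = len(survey_race)
# Each respondent's weight: (census share of their group) / (sample share).
weights = [census_share[r] / (counts[r] / n) for r in survey_race]

# After weighting, the sample's group shares match the census targets.
weighted = Counter()
for r, w in zip(survey_race, weights):
    weighted[r] += w
total = sum(weighted.values())
print({r: round(weighted[r] / total, 2) for r in weighted})
```

The same weights would then be applied when averaging beauty ratings, so that the reweighted means mimic a sample with census demographics.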