How to Study the City on Instagram

We introduce Instagram as a data source for use by scholars in urban studies and neighboring disciplines and propose ways to operationalize key concepts in the study of cities. These data can help shed light on segregation, the formation of subcultures, strategies of distinction, and status hierarchies in the city. Drawing on two datasets of geotagged Instagram posts from Amsterdam and Copenhagen collected over a twelve-week period, we present a proof of concept for how to explore and visualize sociospatial patterns and divisions in these two cities. We take advantage of both the social and the geographic aspects of the data, using network analysis to identify distinct groups of users and metrics of unevenness and diversity to identify socio-spatial divisions. We also discuss some of the limitations of these data and methods and suggest ways in which they can complement established quantitative and qualitative approaches in urban scholarship.


Introduction
Since its launch in 2010, Instagram has quickly become one of the most widely used social networking platforms in the world [1]. In early 2015, it was reported that over 200 million users around the world use the service to share 70 million pictures per day [2]. As a visual-locative social medium, Instagram can be regarded as a participatory sensing system [3,4]. Its users produce data as they navigate their everyday lives, smartphones in hand. These data are centralized in the Instagram platform. As such, the platform serves as an unparalleled data source for social researchers. In this paper, we introduce methods that can be used by scholars to make sense of contemporary urban dynamics. In particular, our interest is to explore and visualize sociospatial patterns and divisions within the city. We apply these methods to two cities, Amsterdam and Copenhagen. This application is a proof of concept to demonstrate the possibilities, as well as the limitations, of these data and methods to inform the work of urban scholars.

Related Work
Our approach for studying social divisions draws on two literatures: social science literature on cities, and computer science literature on social media. Social scientists have long been interested in social divisions within the city. Building on the foundational work on urban social ecology by the Chicago School [5], they have mapped the uneven distribution of population groups and identified determinants and outcomes of economic and ethnic segregation [6][7][8].
The vast majority of studies have researched residential segregation through official registry data. While such studies provide a range of insights, it is widely acknowledged that these studies do not cover some essential dimensions of differentiation within the city, as they account more for where people reside than for how they move through urban space in their daily lives. The urban landscape is marked by numerous processes of enclave formation that do not revolve around places of residence, i.e. neighborhoods, but around places of work, leisure, or retail [9][10][11][12][13]. Researchers have relied on ethnography or survey research to grasp these subtle and dynamic patterns of segregation [14,15]. While these methods provide distinct advantages, they are arguably less suited to capture relations and flows outside the bounds of the research site [16][17][18].
User-generated data from location-based social networking platforms hold the promise of filling in the gaps in the picture presented by current research on urban dynamics [19]. As the product of a participatory sensing system, these data add to the methodological toolkit of scholars of the city. These data can be used to identify groups on the basis of observed behavior rather than using a predefined classificatory scheme, allowing for a more fine-grained and up-to-date breakdown of urban populations into subgroups.
However, researchers seeking to put these opportunities to use should be aware that social media data do not simply reflect the activities of urban dwellers. This is especially true for the Instagram data we utilize in this paper. Instagram users selectively represent their lifeworlds by showcasing images they feel are suited for circulation. This also means that they represent the city and their place within it in a curated manner. Users typically do not report on their visits to the supermarket or their commute to work. Instead they share images as part of strategies of distinction: they picture themselves with friends, in nice outfits, in places that are special to them [20]. In a word, they use Instagram to mark their place in the social structure and within the city. By associating with each other (by following, liking, or commenting) and tagging the same places, users form communities at the interface of online and offline spaces. By mapping these processes of association and place demarcation, we can investigate how communities emerge at this interface and create sociocultural domains. Such processes could before only be grasped through surveys or ethnographies of concrete settings, but now we can use social media data to investigate on a large scale and in detail how city dwellers associate with one another and form communities.
These data can also give insight into the places or sets of places in which different groups in the city spend their waking hours and into the role these places perform in the formation of groups and subcultures. In this context, Lofland understands a city's public realm to be made up of places in which city dwellers encounter strangers [21]. These places are quintessential sites of urban life because they require people to interact with others with which they have no intimate bonds [22]. As such, they also serve as sites to cultivate cosmopolitan habits [23]. Urban dwellers can also transform nominally public places into a group-specific domain. Such parochial places serve to solidify group identities and reaffirm boundaries [21]. Thus, identifying which places in a city are cosmopolitan and which are parochial is important for an understanding of patterns of encounter and enclavement in the city. In addition, divisions between groups can also occur in time rather than in space, for instance when places and areas become exclusive sites in a city's nightlife [24][25][26][27].
While several scholars working at the juncture between geography and computer science have begun using social media data [28][29][30][31], within the last half decade computer scientists have conducted most of the work taking advantage of location-based social networks [32]. Before the recent rise of Instagram, the focus was on geotagged tweets and data sourced from Foursquare, the check-in service. Cranshaw et al.'s Livehoods Project presents a method to study urban dynamics and structure through social media data using machine-learning techniques [33]. Their methodology aggregates individual data points (Foursquare checkins) using spatial clustering techniques to identify areas that emerge from the actions of city dwellers. Frias-Martinez et al. use geotagged tweets to study land use and sites of interest in New York City. On the basis of spatio-temporal patterns in the data, they distinguish areas in the city used primarily for leisure, business, or residential purposes [34]. Silva et al. use Foursquare data to create heatmaps to visualize urban dynamics on the basis of individual trajectories. The resulting visualization shows the overall likelihood of a city's inhabitants of transitioning between different types of spaces, such as public transit hubs and places of work [35,36].
More recently Silva et al. have moved from using Foursquare data to using Instagram data as the basis for a "participatory sensing system." (In [37] Silva et al. compare the two data sources.) In using Instagram data to study urban dynamics, they find spatio-temporal patterns to be correlated with routine activities of city dwellers. As such, it can serve to identify places or sets of places of cultural activity [38]. A series of interdisciplinary collaborations between computer scientists and art historians have also sought to make sense of Instagram in the context of the city. By analyzing large datasets of geotagged posts collected in cities throughout the world, these contributions seek to visualize differences in rhythms and content between cities [39][40][41]. Other contributions by researchers around Lev Manovich include http://selfiecity.net and http://on-broadway.nyc.
In sum, while scholars have undertaken promising forays, research so far has been limited. Especially when compared to Twitter, research on Instagram is in its infancy. There are important theoretical reasons for filling this lacuna, as Instagram data enable researchers to shed new light on processes that have long occupied scholars of cities, including the formation of subcultures, segregation, and the cosmopolitan or parochial nature of places within the city. Our contribution is to develop a number of methods to illuminate these processes and to provide a proof of concept for how these methods can be put to work.

Data
We utilize both network and spatial data sourced from Instagram. Network data allow us to identify groups, while spatial data allow us to map the places that Instagram users picture. Taken together, these data let us identify socio-spatial divisions by investigating the presence or absence of social groups in places throughout the city.
We collect both kinds of data from Instagram using the platform's application programming interface (API). For this purpose we built and used kijkeens [42], a tool that polls the Instagram API's location endpoint at regular intervals to gather all geotagged posts from an urban area, stores post metadata in a database, and, after a specified delay, gathers network data ("likes" and comments) for each of the posts. We created two datasets of posts published in Amsterdam and Copenhagen over a twelve-week period. In Amsterdam we collected 953,403 posts between 19 April and 12 July 2015, while in Copenhagen we collected 890,621 posts between 25 May and 17 August 2015. Because we are interested in everyday patterns of urban dwelling, we only considered posts by users who had posted in the city over a time spanning four weeks or longer to eliminate likely tourists. This cut the number of posts down to 442,246 and 507,445 posts, respectively.
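The filtering step described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "a time spanning four weeks or longer" means the gap between a user's first and last geotagged post in the city, and the function name `filter_local_posts` and the `(user, timestamp)` tuple layout are hypothetical.

```python
from datetime import datetime, timedelta


def filter_local_posts(posts, min_span=timedelta(weeks=4)):
    """Keep posts by users whose first and last post are at least `min_span` apart.

    `posts` is a list of (user, timestamp) tuples (hypothetical layout).
    """
    first, last = {}, {}
    for user, ts in posts:
        if user not in first or ts < first[user]:
            first[user] = ts
        if user not in last or ts > last[user]:
            last[user] = ts
    # Users whose activity in the city spans the minimum period count as locals.
    local_users = {u for u in first if last[u] - first[u] >= min_span}
    return [(u, ts) for u, ts in posts if u in local_users]


posts = [
    ("anna", datetime(2015, 4, 20)),
    ("anna", datetime(2015, 6, 1)),
    ("tourist", datetime(2015, 5, 1)),
    ("tourist", datetime(2015, 5, 4)),
]
local_posts = filter_local_posts(posts)
```

In this toy input, only the two posts by "anna" are retained, because the posts by "tourist" fall within a span of three days.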
We stored a variety of metadata about each of these posts. Most important for our purposes here are the data about social activity ("likes" and comments) and tagged locations. We stored the social activity on each post about 24 hours after the initial publication of the post. Since most of the activity on a post occurs within the first few hours of its life, this accounts for the bulk of the interactions garnered by the geotagged posts in our dataset. In Amsterdam, there were over 16 million interactions, of which 1.1 million were between local users in our dataset. In Copenhagen, the number of interactions was over 21 million, of which 1.8 million were between users in our dataset. It bears keeping in mind that the API returns a maximum of 140 likes per post, so we cannot capture all interactions for very popular users whose number of likes regularly exceeds this number. As a result of this restriction of the data, our analysis may underestimate the centrality of certain users in the overall network.

Methods
Our analysis combines several methods. We use network analysis to identify groups and metrics of unevenness and diversity to identify socio-spatial divisions.

Identifying Groups Using Network Analysis
How can we identify groups of city dwellers? Where previous scholars had to collect data for their analyses through painstaking community studies [43,44], we are able to use the network data captured on Instagram. Instagram users give recognition to others on the platform by liking and commenting on their posts. For the purposes of our network analysis, we understand reciprocated recognition (mutual liking and/or commenting) to constitute a social tie between two users. Research on social media use suggests that this provides a surer indicator of a social tie between users than mere followership [45]. We construct an undirected, unweighted network graph on the basis of our interaction data. Table 1 provides some metrics on the two city networks.
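To make the tie definition concrete, the following sketch derives undirected, unweighted edges from directed interaction records. It assumes each like or comment is available as a `(source_user, target_user)` pair; that layout and the function name `reciprocated_edges` are illustrative assumptions rather than part of the original pipeline.

```python
def reciprocated_edges(interactions):
    """Return the set of undirected ties backed by interactions in both directions.

    `interactions` is an iterable of (source_user, target_user) pairs, one per
    like or comment (hypothetical layout). Repeated interactions are collapsed,
    so the resulting graph is unweighted.
    """
    directed = {(a, b) for a, b in interactions if a != b}
    return {frozenset((a, b)) for a, b in directed if (b, a) in directed}


interactions = [
    ("a", "b"), ("b", "a"),              # mutual recognition: a tie
    ("a", "c"),                          # one-sided: no tie
    ("c", "d"), ("d", "c"), ("d", "c"),  # repeated interactions: still one tie
]
edges = reciprocated_edges(interactions)
```

Representing each tie as a `frozenset` keeps the edge set undirected without any bookkeeping about node order.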
We identify subgroups among Instagram users by applying a technique called community detection. The method we use, called the Louvain method of modularity optimization [46], progressively groups connected nodes in a network together until it reaches an optimal level of clustering. We use the igraph package [47] and the implementation of the Louvain algorithm by Traag [48]. We perform community detection on the largest connected component of each graph, which accounts for most nodes with reciprocated ties in both networks (93.9 percent for Amsterdam and 96.6 percent for Copenhagen). We consider only clusters of at least five hundred users. We chose this cutoff to keep the number of clusters manageable after determining that the clusters above this cutoff contain most nodes. Those interested in studying specific subcultures may want to include even some of the smaller, more marginal clusters, but that is not necessary for our purposes.
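The two bookkeeping steps around community detection, restricting the graph to its largest connected component and keeping only clusters above a size cutoff, can be sketched in plain Python. The community detection itself is not reimplemented here: the `membership` mapping below stands in for the output of whatever Louvain implementation is used, and all function names are hypothetical.

```python
from collections import Counter, defaultdict, deque


def largest_component(edges):
    """Return the node set of the largest connected component (breadth-first search)."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in component:
                    component.add(neighbor)
                    queue.append(neighbor)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def clusters_above_cutoff(membership, min_size=500):
    """Return the labels of clusters with at least `min_size` members.

    `membership` maps each node to a cluster label, as produced by a
    community detection routine (assumed input).
    """
    sizes = Counter(membership.values())
    return {label for label, size in sizes.items() if size >= min_size}


toy_edges = [(1, 2), (2, 3), (4, 5)]
giant = largest_component(toy_edges)
toy_membership = {1: "A", 2: "A", 3: "B"}
kept = clusters_above_cutoff(toy_membership, min_size=2)
```

The toy example uses a cutoff of two nodes only so that the effect is visible on a handful of nodes; the analysis in the text uses five hundred.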
Next, we characterize these groups and find what, aside from the overall network structure, makes them distinct. We tried out a variety of methods. At first, we sought to characterize clusters in an automated manner by using user profile data. Instagram users have the possibility of filling in a 150-character "biography" field. Our attempts to use text analysis techniques such as tf-idf [49] to characterize clusters on the basis of this textual data failed to yield reliable or valid results. Instead, we opted to rely on a combination of network analysis and manual classification to characterize groups. Despite some shortcomings, this seems the most suitable approach given the data we have. In a first step, we analyze the structure of subgraphs. We are particularly interested in the density of ties and in whether certain nodes stand out as hubs. If we find that subgraphs are tightly knit and organized around hubs, then we have a rationale for characterizing groups in terms of their most central users, which we can regard as group focal points [50]. To determine tie density of subgroups, we calculate the local average clustering coefficient and compare it to the clustering coefficient of a random Erdős-Rényi graph with an equal number of nodes and edges [51]. We report a logged ratio of clustering coefficients, $\log(C_{\text{subgraph}} / C_{\text{random}})$, to compare observed tie density to that of a random graph. If the clustering coefficient is significantly higher than in the random graph, we can conclude that tie density is high. To determine the extent to which subgraphs are organized around hubs, we inspect the centrality distribution of each subgraph and report goodness-of-fit measures (Kolmogorov-Smirnov distance) for three heavy-tailed distributions: power law, lognormal, and stretched exponential distributions [52]. We use PageRank as a measure of network centrality [53].
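The tie-density comparison can be illustrated with a small sketch. Two simplifications are assumptions on our part: instead of sampling one random Erdős-Rényi graph, the sketch uses the expected clustering coefficient of such a graph, which equals its edge density 2m / (n(n - 1)); and it takes the logarithm to base 10, since the base is not stated in the text. The function names are hypothetical.

```python
import math
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than two neighbors count as 0."""
    total = 0.0
    for node, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            continue
        # Count the ties that exist among the node's neighbors.
        links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def log_clustering_ratio(adj):
    """log10 of observed clustering over the expected clustering of an ER graph.

    Assumption: the random baseline is the analytic expectation (edge density),
    not a sampled random graph.
    """
    n = len(adj)
    m = sum(len(neighbors) for neighbors in adj.values()) // 2
    expected_random = 2 * m / (n * (n - 1))
    return math.log10(average_clustering(adj) / expected_random)


# Two triangles joined by a single bridge (nodes 2 and 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {i: set() for i in range(6)}
for u, v in toy_edges:
    adj[u].add(v)
    adj[v].add(u)

ratio = log_clustering_ratio(adj)
```

A positive value means the subgraph is more clustered than a random graph of the same size and density; on a base-10 scale, a value above 1 corresponds to a difference of more than an order of magnitude.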
As we will see, the subgraphs have a skewed, heavy-tailed centrality distribution (indicative of hubs) and high tie density (as compared to random graphs). We exploit these network features and characterize groups according to their focal points. We did so manually. The authors each looked at the data to inductively arrive at a characterization for each cluster and then compared results to fine-tune the results of this inductive process. Future research may want to develop a coding scheme for these purposes, but that is not something we could draw on here. In characterizing these central accounts, we first consider the user profile, and then we analyze the content of their pictures and the tagged locations. Often users list their profession or affiliation in their biography which we then only have to verify by studying their images, but other times we have to determine their social and cultural background through close study of the content of their images. Our analysis focused on the ten most central accounts. While a more exhaustive manual analysis undoubtedly would have revealed further nuances, we found that examining the ten most central accounts provided us with a good impression of the cluster in the sense that examining additional accounts did not lead us to fundamentally change our characterization.

Mapping Social Divisions and Interactions
How can we measure the level and nature of segregation and interaction between groups? Sociological studies of urban segregation have employed a number of different metrics. The index of dissimilarity (DIS) was long considered the gold standard of residential segregation measures [54], and it is particularly suited to capture the evenness of populations in an urban area [55]. The segregation of a minority group M across k different areal units i relative to a majority population W is measured as follows:

$$\mathrm{DIS} = \frac{1}{2} \sum_{i=1}^{k} \left| \frac{m_i}{M} - \frac{w_i}{W} \right|$$

Here $m_i$ and $w_i$ refer to the subpopulations of the minority and majority populations found in each areal unit, and $M$ and $W$ to the respective totals across all units. This index has a variety of characteristics to recommend it. Above all, it is easy to interpret. The value of DIS corresponds to the proportion of the minority group that would have to relocate to achieve a fully even distribution. We use the dissimilarity index to measure group segregation, since we are interested in how evenly groups are present in places around the city. In calculating this measure of evenness, we take each cluster as a "minority group" and compare it to the other clusters combined forming the majority group.
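As a worked illustration of the index, the sketch below computes DIS for one cluster against all other clusters combined. The nested dictionary of post counts per place and cluster, and the function names, are hypothetical; the calculation follows the standard definition of the index of dissimilarity, half the sum over places of the absolute difference between the two groups' shares.

```python
def dissimilarity_index(minority_counts, majority_counts):
    """Index of dissimilarity for two aligned lists of counts per areal unit."""
    minority_total = sum(minority_counts)
    majority_total = sum(majority_counts)
    return 0.5 * sum(
        abs(m / minority_total - w / majority_total)
        for m, w in zip(minority_counts, majority_counts)
    )


def cluster_dissimilarity(counts, cluster):
    """DIS of one cluster versus all other clusters combined.

    `counts` maps place -> {cluster label -> number of posts} (hypothetical layout).
    """
    minority = [by_cluster.get(cluster, 0) for by_cluster in counts.values()]
    majority = [
        sum(n for label, n in by_cluster.items() if label != cluster)
        for by_cluster in counts.values()
    ]
    return dissimilarity_index(minority, majority)


counts = {
    "park": {"A": 10, "B": 10},
    "club": {"A": 10, "B": 0},
    "cafe": {"A": 0, "B": 10},
}
dis_a = cluster_dissimilarity(counts, "A")
```

In this toy city, half of cluster A's posts would have to be posted from elsewhere (from the club to the cafe) to match the distribution of cluster B, so the index is 0.5.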
We are also interested in exchanges between groups. Two groups can be related to one another by being in frequent interaction with one another, for instance by liking and commenting on each other's posts. The strength of this relation is indicated by the weight of the edge connecting the two groups in the cluster graph. The edge weight is calculated by summing up reciprocated ties between members of both groups. We normalize edge weights by dividing an edge's weight by the combined number of nodes in the two clusters it connects.
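The normalization described above can be sketched as follows, assuming the reciprocated ties are available as node pairs and cluster assignments as a node-to-label mapping (both hypothetical layouts, as is the function name).

```python
from collections import Counter


def cluster_edge_weights(edges, membership):
    """Normalized tie strength between each pair of clusters.

    The raw weight is the number of reciprocated ties between members of the
    two clusters; it is divided by the combined number of nodes in both.
    """
    sizes = Counter(membership.values())
    raw = Counter()
    for a, b in edges:
        cluster_a, cluster_b = membership[a], membership[b]
        if cluster_a != cluster_b:  # ties within a cluster are not counted
            raw[frozenset((cluster_a, cluster_b))] += 1
    return {
        pair: weight / sum(sizes[label] for label in pair)
        for pair, weight in raw.items()
    }


membership = {"u1": "X", "u2": "X", "u3": "Y", "u4": "Y"}
ties = [("u1", "u3"), ("u2", "u3"), ("u1", "u2")]
weights = cluster_edge_weights(ties, membership)
```

Here two ties cross between clusters X and Y, which together contain four users, giving 0.5 mutual ties per user.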

Locating Cosmopolitan and Parochial Places
Which places facilitate encounters between members of different groups, and which are exclusive to members of the same group? We rank places from most parochial to most cosmopolitan by employing a diversity measure known as the divergence index. It compares the expected distribution of groups in a place given what we know about the overall distribution of these groups in the city as a whole to the observed distribution within that place. Instagram users can tag their posts with predefined locations, but they can also define their own place names (or at least they could during our period of data collection). We manually verified and, where necessary, merged place names. We only consider places tagged in at least 25 posts by at least 15 different users. If each group was represented in the proportion in which it is present in the city, we would have a situation of full diversity. The value of the measure would be 0. The further the expected and observed distributions diverge in a particular place, the less diverse that place can be said to be. The divergence index (DIV) is used in both the social sciences and the life sciences to measure population diversity [56,57]. The divergence index in areal unit i is defined as follows:

$$\mathrm{DIV}_i = \sum_{m} p_{im} \log \frac{p_{im}}{\pi_m}$$

In this equation, $\pi_m$ corresponds to the overall relative occurrence of cluster m, and $p_{im}$ refers to the relative occurrence of cluster m in areal unit i. For ease of interpretability, we report standardized divergence indices, which we calculate by dividing $\mathrm{DIV}_i$ by $\max(\mathrm{DIV})$ [56]. A value of 1 thus indicates maximum divergence (i.e., lowest diversity).
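A minimal sketch of the calculation follows. It makes three assumptions that the text does not settle: citywide shares are computed from the same place-level counts, the natural logarithm is used, and max(DIV) is taken to be the largest value observed across the places considered (rather than a theoretical maximum). The data layout and function names are hypothetical.

```python
import math
from collections import Counter


def divergence(place_counts, city_shares):
    """Divergence of one place's cluster mix from the citywide cluster shares."""
    total = sum(place_counts.values())
    div = 0.0
    for cluster, n in place_counts.items():
        if n == 0:
            continue  # absent clusters contribute nothing to the sum
        p = n / total
        div += p * math.log(p / city_shares[cluster])
    return div


def standardized_divergence(counts):
    """Divergence per place, divided by the largest observed value (assumption)."""
    totals = Counter()
    for by_cluster in counts.values():
        totals.update(by_cluster)
    grand_total = sum(totals.values())
    shares = {cluster: n / grand_total for cluster, n in totals.items()}
    raw = {place: divergence(by_cluster, shares) for place, by_cluster in counts.items()}
    top = max(raw.values())
    return {place: (d / top if top > 0 else 0.0) for place, d in raw.items()}


counts = {
    "park": {"A": 50, "B": 50},  # mixed: close to the citywide shares
    "club": {"A": 50, "B": 0},   # exclusive to one cluster
}
std = standardized_divergence(counts)
```

In this toy example the club, tagged by a single cluster, receives the value 1 (most parochial), while the park scores close to 0 (more cosmopolitan).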
Unlike other diversity measures based on concepts from information theory, DIV is not impacted by the number of subgroups being considered [56]. This is important because network structures differ between cities, so community detection can yield different numbers of subgroups. If we want diversity measures to be comparable, they must not be influenced by how many communities there are.

Groups
Community detection allowed us to identify twelve clusters of 500 or more users in Amsterdam (Tables 2 and 3) and sixteen clusters above the cutoff in Copenhagen (Tables 4 and 5). The users in these clusters account for more than two thirds of all users with reciprocated ties in the case of Amsterdam and more than three quarters of users in the case of Copenhagen.
All cluster subgraphs have clustering coefficients that are significantly higher than in a corresponding randomly generated graph. For all clusters, the difference exceeds an order of magnitude. This points to the high density of ties within the clusters. Furthermore, their centrality distributions hew closely to a heavy-tailed distribution (see also S1 Fig), which speaks to the existence of hubs within each cluster. This provides us with a rationale to characterize groups in terms of their most central users.
In the case of Amsterdam, we are able to characterize eleven out of twelve clusters using manual classification of central users. The groupness of the clusters generally is rooted in shared professions, interests, lifestyles, and hangouts. Some clusters, such as AMS5 (consisting mostly of high school students), have no strong tie to particular places and no common ways of identifying. They are, however, at a similar stage in their lives. We might say that the young people in this cluster have not (yet) developed a distinctive style or autonomous identity. Finally, AMS6, a cluster with an unusually high proportion of private accounts, has no basis for groupness that we can discern.
In Copenhagen, a greater number of clusters is characterized by shared stage of life rather than shared professions or hangouts. Part of the reason is that more teenagers are visible on Instagram in Copenhagen than in Amsterdam, indicating either a higher level of adoption of the social network among Danish youth or a lower aversion to setting accounts to public and using geotags. Nonetheless, several clusters are clearly defined by shared characteristics, interests or professions.
Comparing Amsterdam to Copenhagen, we can see some commonalities and differences. Both cities contain a sizeable cluster of image makers whose main occupation on the platform, whether vocationally or avocationally, is to picture the city in which they are based (AMS3 and CPH5). Places tagged by users in these clusters include well-known parks, buildings and structures. In both cities, clusters vary in size, in degrees of activity, and in popularity. The most active cluster in Amsterdam (AMS3) has an average of 26.1 geotagged posts per user during the twelve-week period under investigation and the least active (AMS8) has 11.6 posts per user. In Copenhagen, the most active cluster (CPH4) has an average of 22.2 geotagged posts per user, while the least active (CPH9) has only 7.2. Finally, the number of followers also varies between clusters. On average, follower numbers in Amsterdam are higher than in Copenhagen, and the spread of follower numbers is also greater. The Lifestyle Vanguard in Amsterdam (AMS2) has the highest average number of followers, and in contrast, the most popular Copenhagen cluster (CPH3) consists of high school students.

Social Divisions and Interactions
As indicated by the dissimilarity index (DIS) reported in Tables 2b and 3b, the presence of clusters in places around the city is uneven. To achieve an even groupwise distribution of posts, on average about one in three posts would have to be posted from elsewhere. The uneven presence of clusters throughout the city does not mean that they are completely isolated, however. Figs 1 and 2 show levels of interaction between clusters in the two cities. In Amsterdam, clusters AMS2 and AMS4 have 0.45 mutual ties per user, the strongest tie between two clusters. Given the similarity in lifestyle orientations between the two clusters (one consisting of the Lifestyle Vanguard, the other consisting of Cultural Entrepreneurs), this link is not too surprising. The next strongest tie is between clusters AMS4 and AMS7, the Cultural Entrepreneurs and the Party Buffs, another pair of clusters whose lifestyles, while not overlapping, appear to have an affinity. Clusters AMS1 and AMS3 have strong ties to AMS7 and AMS2, respectively. These five interconnected clusters are among the most popular, and they are each defined by shared lifestyles and professions rather than a common stage of life.
In Copenhagen, clusters CPH1 and CPH4 have the strongest tie, with 0.37 mutual relations per user. Both of these clusters are firmly grounded in design professions and affinities, though CPH4 has a higher degree of young parents whose family life appears on Instagram alongside their interest in design. Clusters CPH2 and CPH3, large clusters consisting of men and women in their teens and twenties attending secondary and postsecondary education institutions, also have a strong tie.

Cosmopolitan and Parochial Places
Users tagged 367 places in Amsterdam and 680 places in Copenhagen. The following visualizations show the places in order of diversity, focusing only on the thirty most tagged locations for ease of presentation (Fig 3 shows the distribution of the divergence indices for all tagged places in both cities). This way of presenting our data has the benefit of making patterns of group co-occurrence in places easily apparent. The most diverse places are tagged by members of all clusters, suggesting they are places of encounter between a wide array of different groups (i.e., cosmopolitan places). The least diverse are the exclusive domain of just one or two clusters (i.e., parochial places). Those with middling levels of diversity are overwhelmingly tagged by members of the same three to four groups. They are neither exclusive parochial domains, nor are they places of broad encounter.
The Rollende Keukens ("rolling kitchens") open air food cart festival in Amsterdam has near-proportional representation from all twelve clusters of Instagram users (see Fig 4). Like some other "places" we find in our data, the festival is actually a temporary happening, which in this case took place in the Westerpark, a public park in the west of Amsterdam. The location of the event might explain its broad appeal, as the list of the most cosmopolitan places includes the Vondelpark, Westerpark, Westergasfabriek (a cultural center located in the Westerpark), and the Museum Square, all of them public places. The parks and squares which Amsterdam Instagram users fondly picture and associate themselves with frequently become sites of encounter between different groups. These public places are followed by popular cafes and other hangouts, an independent concert venue, and two of the city's famous art museums. Several of these places have the strong presence of AMS3, the cluster of City Imagers who picture famous sights. Clusters AMS2 and AMS4 are also a constant presence, while clusters AMS5 and AMS6 are absent from several of them. As we move closer to the parochial end of the spectrum, there is a marked increase in music festivals, clubs, and large concert venues. Several of these places have a strong presence of members of clusters AMS1, AMS4, and AMS7. The most parochial places, which include restaurants, clubs, and a gym frequented by Health and Lifestyle Devotees (AMS11), are the preferred hangout of members of just a single cluster. The types of places featured among the most parochial suggest that the city is particularly segregated at nighttime, since they include several nightlife locations. Some clusters are completely absent from the most parochial places, including the Visual Professionals (AMS9), Cultural Explorers (AMS10), and the Coffee Aficionados (AMS12).
As in the case of Amsterdam, Copenhagen's public parks and places are frequently sites of encounter (see Fig 5). Among the most cosmopolitan places are the Faelledparken (Commons Park) and the Tivoli, an amusement park. Users also frequently tag the neighborhoods of Vesterbro, Nørrebro, and the redeveloped harbor area Islands Brygge. These are not so much places as they are areas, so we cannot conclude that they are places of encounter. They are, however, areas that Instagram users across the board associate themselves with by tagging them in their posts. The places with middling levels of diversity include hangouts popular among the younger users in clusters CPH2 and CPH3 (high school and college students), including a pedestrian shopping area and two bars in the downtown area. The most parochial places, finally, are the near-exclusive domain of a single cluster. The Trailerpark Festival, a music festival held in August, was tagged almost exclusively by users in cluster CPH1. Original Coffee, a coffee shop with locations in several neighborhoods, is the near-exclusive domain of CPH3 and two other clusters of teenage users, CPH8 and CPH9. Finally, Northmodern, a trade show dedicated to Danish design products, is overwhelmingly tagged by users in the Design & Family cluster (CPH4).

Discussion
Urban researchers interested in social divisions traditionally have had to choose between two options that each have considerable tradeoffs. On the one hand, they could study social-spatial divisions quantitatively with the tradeoff that they had to ask questions that can be answered through data drawn from official records. On the other hand, they could study more complex and dynamic processes through which subcultures create and claim spaces, but then they had to resort to time and labor-intensive methods that can only be applied in a limited number of places or on limited samples. By using data drawn from social media, researchers of cities can begin to investigate at a very large scale and in minute detail how urban dwellers form groups within and through urban space. Instagram, in particular, offers extraordinary opportunities to users to showcase where they are and whom they associate with. Through its API and some of the methods we and others have developed, the medium also offers extraordinary opportunities for researchers interested in investigating segregation, the formation of subcultures, strategies of distinction, and status hierarchies.
Notwithstanding these opportunities, we should mention some caveats [58][59][60][61]. The lives of Instagram users are not contained within the platform, so our access to their lives is very much incomplete. The representations on Instagram, moreover, are highly selective. It would be mistaken to consider Instagram as somehow representative of the sum total of city dwellers' uses of space. We should only look at Instagram if we are interested in what we can find there: the pictures and connections that selectively represent selected parts of the city from a selective group of urban dwellers. Our purpose was mainly to provide a "proof of concept" by developing a range of methods and demonstrating how Instagram data can shed new light on classic issues in the study of the city. While some of our findings would most likely prove robust (e.g., central public parks are widely popular and very cosmopolitan), other findings are more provisional. To benefit from the opportunities the data offer, it is necessary to carefully specify questions and complement Instagram with other sources of data.