Temporal and spatiotemporal investigation of tourist attraction visit sentiment on Twitter

In this paper, we propose a sentiment-based approach to investigate the temporal and spatiotemporal effects on tourists’ emotions when visiting a city’s tourist destinations. Our approach consists of four steps: data collection and preprocessing from social media; visitor origin identification; visit sentiment identification; and temporal and spatiotemporal analysis. The temporal and spatiotemporal dimensions include day of the year, season of the year, day of the week, location sentiment progression, enjoyment measure, and multi-location sentiment progression. We apply this approach to the city of Chicago using over eight million tweets. Results show that seasonal weather, as well as special days and activities such as concerts, affect tourists’ emotions. In addition, our analysis suggests that tourists experience greater levels of enjoyment at places such as observatories than at zoos. Finally, we find that local and international visitors tend to convey negative sentiment when visiting more than one attraction in a day, whereas the opposite holds for out-of-state visitors.


Identification of initial attraction visit dataset
Our entire Chicago visit dataset contains 8,034,025 geo-located tweets from 225,805 users, collected between May 16, 2014 and April 27, 2015 (with four days missing). First, we applied boundary-based identification by finding tweets located within an attraction's boundaries whose text contains at least one keyword related to that attraction. In this way, we identified 67,737 attraction visit tweets from 30,574 visitors. Second, we gathered tweets shared within six hours of an attraction visit while the visitor was still within the attraction's boundary. In this way, we identified an additional 8,630 tweets from 3,530 visitors. Finally, we applied distance-level identification by selecting tweets that contain the attraction's full name in their text and are located within one kilometer of the attraction's boundary. With this step, we gathered another 5,579 attraction visit tweets from 4,248 visitors. Visit data are shown in Table 1. We eliminated two attractions due to low numbers of tweets (< 50). In total, we gathered 81,908 attraction visit tweets from 32,559 unique visitors.
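The three identification steps above can be sketched as follows. This is a minimal illustration, not the implementation used in the study: the `Attraction` and `Tweet` classes, the rectangular bounding box standing in for an attraction boundary, and the reading of "within six hours" as a symmetric window around a boundary-based tweet are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    p1, p2 = radians(lat1), radians(lat2)
    a = (sin((p2 - p1) / 2) ** 2
         + cos(p1) * cos(p2) * sin(radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))


@dataclass(frozen=True)
class Attraction:
    # Hypothetical structure: the boundary is simplified to a bounding box.
    name: str
    keywords: tuple  # lowercase keywords related to the attraction
    south: float
    west: float
    north: float
    east: float

    def contains(self, lat, lon):
        return self.south <= lat <= self.north and self.west <= lon <= self.east

    def distance_km(self, lat, lon):
        # Distance to the nearest point of the box (0 when inside it).
        near_lat = min(max(lat, self.south), self.north)
        near_lon = min(max(lon, self.west), self.east)
        return haversine_km(lat, lon, near_lat, near_lon)


@dataclass(frozen=True)
class Tweet:
    user: str
    time: datetime
    lat: float
    lon: float
    text: str


def identify_visits(tweets, attraction, window=timedelta(hours=6), max_km=1.0):
    """Map each tweet attributed to `attraction` to the step that found it."""
    found = {}
    # Step 1: inside the boundary and mentioning a related keyword.
    for t in tweets:
        text = t.text.lower()
        if attraction.contains(t.lat, t.lon) and any(k in text for k in attraction.keywords):
            found[t] = "boundary"
    # Step 2: same user, still inside the boundary, within the time window
    # of one of that user's step-1 tweets (symmetric window is an assumption).
    anchors = list(found)
    for t in tweets:
        if t in found or not attraction.contains(t.lat, t.lon):
            continue
        if any(a.user == t.user and abs(t.time - a.time) <= window for a in anchors):
            found[t] = "time-window"
    # Step 3: full attraction name in the text, within max_km of the boundary.
    for t in tweets:
        if t in found:
            continue
        if (attraction.name.lower() in t.text.lower()
                and attraction.distance_km(t.lat, t.lon) <= max_km):
            found[t] = "distance"
    return found
```

With real attraction polygons, the `contains` and `distance_km` methods would be replaced by point-in-polygon and point-to-polygon distance tests; the control flow of the three steps would stay the same.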

Cleaning the outliers in the initial attraction visit dataset
To make sure that we gathered only visit-related tweets, we aimed to eliminate outliers from the initial visit tweets based on time of tweeting. Fig 1 illustrates how Chicago attraction visits are distributed over the course of the day. According to the figure, the majority of attraction visits occur between 9 AM and 11 PM, while the peak attraction visit timeframe is between 1 PM and 4 PM. This result intuitively reflects real-world attraction visit patterns, as many attractions are open during these times. Attraction visit tweets also follow a quite different pattern from both general Chicago and USA tweets, in that many of the tweets are shared after 5 PM until midnight. This is a positive indication that tweets from attraction visits are distinct from general tweets.
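An hour-of-day distribution of the kind described above can be computed with a few lines of code. The sketch below is illustrative only; the function name `hourly_share` is hypothetical, and it assumes the timestamps have already been converted to Chicago local time.

```python
from collections import Counter
from datetime import datetime


def hourly_share(timestamps):
    """Return the fraction of tweets falling in each local hour (0 to 23)."""
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    # An empty input yields all zeros rather than a division error.
    return [counts.get(h, 0) / total if total else 0.0 for h in range(24)]
```

Computing the same shares for attraction visit tweets and for a general tweet sample makes the two daily patterns directly comparable, since both are normalized by their own totals.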
To ensure that the temporality of attraction visits aligns with attractions' operating hours, we used the opening and closing hours of each attraction, gathered while compiling the attraction dataset. Excluding attractions that are open 24 hours a day, we filtered out the tweets shared outside the business hours of each attraction. We assume visitors can arrive an hour earlier than the opening hour and can

Identification of visitor origin
As mentioned in the main text, we used the location information provided in the Twitter profiles of the visitors in the attraction visit list. Table 2 shows the 30 most commonly reported location terms in their profiles. Almost 21% of visitors provided no location; the next most common entries are Chicago- and Illinois-related locations, which account for a relatively large share. The remainder mostly refers to major US cities and states. In total, we found 10,615 unique (case-insensitive) location strings from 31,924 visitors.
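Counting unique location strings case-insensitively, as described above, can be sketched as follows. The function name `location_frequencies` is hypothetical, and collapsing repeated whitespace is an additional normalization assumed for this example rather than a step stated in the text.

```python
from collections import Counter


def location_frequencies(profile_locations):
    """Count profile location strings, ignoring case and extra whitespace.

    Missing locations (None or empty) are counted under the empty string.
    """
    normalized = (" ".join((loc or "").split()).casefold()
                  for loc in profile_locations)
    return Counter(normalized)
```

Calling `most_common(30)` on the returned counter gives a ranking analogous to the one reported in Table 2, and `len()` of the counter gives the number of unique strings.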
We applied the two-step visitor origin identification approach to match location information with one of the three visitor origin categories. In the first step, we identified local and out-of-state visitors who provided structured location information, using the queries in Table 3 to mark these visitors. For the remaining 8,721 visitors, we used the Google Maps API to identify their corresponding visitor origins.
The US Independence Day celebration related positive tweets from Navy Pier, Lake Michigan, and Millennium Park, July 04, 2014. We modified the word cloud generation algorithm to account for Twitter jargon (e.g., hashtags) and expanded the stop-word dictionary. We capped word cloud output at 50 words to maintain readability.
Figure 4: The distribution of average daily sentiment values split into four seasons and three visitor types. All visitor types seem to follow the same seasonal sentiment trends explained in the main text. The primary difference is in the magnitude of these scores, where international visitors have lower median scores than the other two types.
Figure 5: The distribution of the average daily sentiment values across three visitor types. Local visitors and out-of-state visitors have very similar score distributions that are relatively higher than those of international visitors. Looking at the high-level statistics, we noticed that international visitors tend to express neutral sentiment most of the time, which lowers their scores.
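The two-step visitor origin identification can be sketched as below. This is an illustration under stated assumptions, not the study's procedure: the "City, ST" pattern is a hypothetical stand-in for the queries in Table 3, "local" is taken to mean Illinois, and `geocode` is a caller-supplied function (for instance a wrapper around a geocoding service) that returns a `(country_code, state_code)` pair or `None`; no real API call is shown.

```python
import re

US_STATES = set(
    "AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD "
    "MA MI MN MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC "
    "SD TN TX UT VT VA WA WV WI WY DC".split()
)

# Hypothetical structured pattern: a trailing ", ST" two-letter state code.
STATE_SUFFIX = re.compile(r",\s*([A-Za-z]{2})$")


def classify_origin(location, geocode):
    """Return 'local', 'out of state', 'international', or 'unknown'."""
    loc = (location or "").strip()
    if not loc:
        return "unknown"
    # Step 1: structured location strings such as "Chicago, IL".
    match = STATE_SUFFIX.search(loc)
    if match and match.group(1).upper() in US_STATES:
        return "local" if match.group(1).upper() == "IL" else "out of state"
    # Step 2: fall back to the geocoder for everything else.
    result = geocode(loc)
    if result is None:
        return "unknown"
    country, state = result
    if country != "US":
        return "international"
    return "local" if state == "IL" else "out of state"
```

A suffix rule of this kind is cheap but imperfect: a string such as "Berlin, DE" would be read as Delaware, so any real set of structured queries needs to handle such collisions before the fallback step.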