Performance of Social Network Sensors During Hurricane Sandy

Information flow during catastrophic events is a critical aspect of disaster management. Modern communication platforms, in particular online social networks, provide an opportunity to study such flow and a means to derive early-warning sensors, improving emergency preparedness and response. Performance of the social network sensor method, based on topological and behavioural properties derived from the "friendship paradox", is studied here for over 50 million Twitter messages posted before, during, and after Hurricane Sandy. We find that differences in users' network centrality effectively translate into a moderate awareness advantage (up to 26 hours), and that the geo-location of users within or outside of the hurricane-affected area plays a significant role in determining the scale of such advantage. Emotional response appears to be universal regardless of the position in the network topology, and displays characteristic, easily detectable patterns, opening the possibility of implementing a simple "sentiment sensing" technique to detect and locate disasters.


Introduction
Natural, man-made and technological disasters present a constant threat to society [1]. An increasing frequency, intensity and impact of such events are often attributed to the effects of climate change [2][3][4]. Consensus is growing that the likelihood and potential damage of natural disasters in the future will rise, and there is a need to adequately prepare for their consequences [5][6][7][8]. An integral part of such preparation efforts is an understanding of the information flow during disasters in order to derive early-warning sensors, track public awareness, gather emergency and relief information and predict human behaviour, such as escape panic [9][10][11]. Conveniently, online social media, like Twitter and Facebook, have matured into prominent communication platforms and provide an unprecedented opportunity to record and analyse vast amounts of information [12].
One particular network phenomenon, the "friendship paradox", may increase the efficiency of monitoring disaster-related information. The paradox was studied in a seminal work by Feld [26] and is known colloquially as "your friends have more friends than you do". Feld showed that a node in a social network has, on average, fewer links than its friends have on average.
This occurs because well-connected nodes are included multiple times in a set of "friends-of-friends", therefore boosting the corresponding average. The paradox and its strong form (formulated for median rather than mean averaging) were observed experimentally in the context of online social networks [27,28] and networks of co-authorships and citations [29]. Although the original focus of Feld's study was on psychological implications of the paradox (e.g. potential perception of inadequate social inclusion), his finding inspired a simple technique of forming a sample group with network centrality above what random sampling allows. This can be achieved without global knowledge of network topology by using friends of randomly selected people instead of the selected people themselves. Because centrality often correlates with other attributes (such as activity, popularity, health or income [27,29]), friends are subject to an earlier exposure to a contagion [30,31] or information propagating through the network [32].
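The sampling idea described above can be illustrated with a minimal sketch. The toy graph, the function names and the choice of one random friend per control member are assumptions made for illustration only; they are not taken from the study, and the sketch does not remove duplicate users from the sensor group.

```python
import random
from statistics import mean


def mean_friend_degree(graph, node):
    """Average number of links held by the friends of `node`."""
    return mean(len(graph[friend]) for friend in graph[node])


def form_groups(graph, sample_size, rng):
    """Control group: random nodes. Sensor group: one random friend of each.

    Hypothetical helper; only local knowledge (a node's own friend list)
    is needed, not the global network topology.
    """
    eligible = [node for node in graph if graph[node]]
    control = rng.sample(eligible, sample_size)
    sensor = [rng.choice(sorted(graph[user])) for user in control]
    return control, sensor


# Toy undirected network (assumed data): one hub and a short chain.
graph = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
    "d": {"hub", "e"},
    "e": {"d"},
}

avg_degree = mean(len(friends) for friends in graph.values())
avg_friend_degree = mean(mean_friend_degree(graph, node) for node in graph)
control, sensor = form_groups(graph, 3, random.Random(0))
```

In this toy network the average degree is below the average degree of friends, which is the paradox in miniature: the hub is counted once as a node but four times as somebody's friend.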
In a disaster, the ability to implement efficient and early detection of emergency information is extremely valuable and a sensor method technique based on the "friendship paradox" is attractive for that purpose. Existing experimental validations of the sensor method [30,32] confirm its applicability for endogenous processes, when the spread of contagion or exchange of information occurs only between the nodes of a network. Contrary to such a spread of infection or social network memes, information about disasters is carried simultaneously by many other external channels. Considering this interplay of exogenous and endogenous propagation modes, and factoring in the speed, scale and strong geographical nature of phenomena such as hurricanes, it is not immediately obvious whether the sensor method would perform reliably in a disaster. To address this central question, we study performance of the sensor method during Hurricane Sandy to establish if there is an early awareness advantage, what is its magnitude, and what is the effect of geographical location of users. Finally, while there is evidence that centrality correlates with measurable attributes [27][28][29], it is unclear if there is an underlying correlation with personality or behaviour traits. To study this, we employ sentiment analysis to explore differences between random control groups and corresponding sensor groups in terms of the timeliness and magnitude of their emotional response.

Context of the research: Hurricane Sandy and its digital traces on Twitter
The disaster event at the centre of our case study is Hurricane Sandy, the largest hurricane of the 2012 season and one of the costliest disasters in the history of the United States. Sandy was a late season hurricane that formed on October 22 2012 at 12:00 UTC about 500 km south-southwest of Kingston, Jamaica. It made its first landfall in Jamaica at 19:00 UTC on October 24 as a Category 1 hurricane, then as a Category 3 hurricane in Cuba at 05:30 UTC on October 25, subsequently weakening to Category 1 as it moved through the Bahamas. It continued to grow in size and, while moving northeast along the United States coast, re-intensified to maximum wind speeds of 85 knots at 12:00 UTC on October 29, about 350 km southeast of Atlantic City. Later that day the hurricane weakened into a post-tropical storm and made landfall at 23:30 UTC on October 29 near Brigantine in New Jersey. At the time of landfall the wind reached 70 knots and the storm surge was as high as 3.85 meters, with prevalent levels between 0.8 and 2.8 meters along the coast of New Jersey and New York. The storm surge was responsible for most of the damage to houses, totalling up to 650,000 destroyed or damaged buildings. Over 8.5 million people were affected by power losses that lasted for weeks in some areas. According to the National Hurricane Center report [33], Sandy caused 147 direct casualties along its path and brought damage in excess of $50 billion for the United States.

Raw datasets
Hurricane Sandy attracted extensive media coverage, both in traditional broadcasting media and online. In our work we look at digital traces of the hurricane on Twitter. The raw data collected for analysis comprises two principal sets of Twitter messages. The first one consists of messages with the hashtag "#sandy" posted between October 15 and November 12, a period that precedes the formation of the hurricane and extends beyond its landfall in the continental United States. The data includes the text of messages and a range of additional information, such as message identifiers, user identifiers, followees counts, re-tweet statuses, self-reported or automatically detected locations, timestamps, and sentiment scores. The second dataset has a similar structure and is collected within the same timeframe; however, instead of a hashtag, it includes all messages that contain one or more instances of specific keywords deemed to be relevant to the event and its consequences ("sandy", "hurricane", "storm", "superstorm", "flooding", "blackout", "gas", "power", "weather", "climate", etc.). The full list of keywords used to build up the dataset is provided in the Supplementary Information, Table S6. In total there are 52.55 million messages from 13.75 million unique users available for analysis.

Location data and geocoding
Raw data is filtered to include only those messages that contain location information. Since only a minor fraction of the messages (about 1.2% for the hashtag dataset and 1.5% for the keywords dataset) are geo-tagged by Twitter, we attempt to extract additional information from incomplete self-reported data in user profiles. Such data includes self-identification of a country, state, province, city or town, or any arbitrary combination of those items. We analyse only profile data, instead of searching for location-specific text within messages, to avoid the ambiguity of dealing with context of such in-text location mentions (hypothetical, past or future travels, messages about other people, abstract mentions of various places, etc.). After crosschecking partial location information against coordinates of all major administrative regions and cities worldwide, 46% of messages (or 43% of users) were encoded with location data. Precision of geocoding varies between the exact latitude and longitude of a user, as recorded by Twitter, and the coordinates of the centre of an administrative unit that returned a match to a self-reported location. The rate of location detection compares favourably with other studies, e.g. 6.6% detection rate in Mislove et al. [34], where Google Maps API interpretation of self-reported location strings was used to obtain coordinates.
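As a rough illustration of matching incomplete self-reported profile locations against a list of known places, consider the sketch below. The miniature gazetteer, its approximate coordinates, the separator characters and the function name are all hypothetical; the actual cross-checking used in the study covered all major administrative regions and cities worldwide and is not reproduced here.

```python
import re

# Hypothetical miniature gazetteer: place name -> approximate (lat, lon)
# of the centre of the administrative unit. Values are illustrative only.
GAZETTEER = {
    "new york": (40.71, -74.01),
    "new jersey": (40.06, -74.41),
    "toronto": (43.65, -79.38),
}


def geocode_profile(location):
    """Return coordinates for the first recognised part of a profile string.

    The profile text is split on common separators and every part is
    looked up in the gazetteer; None is returned when nothing matches.
    """
    if not location:
        return None
    parts = [part.strip().lower() for part in re.split(r"[,/|;]", location)]
    for part in parts:
        if part in GAZETTEER:
            return GAZETTEER[part]
    return None
```

A profile such as "Brooklyn, New York" resolves to the centre of the matched unit rather than to an exact position, which mirrors the varying geocoding precision noted above.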
We further filter for users from the United States and Canada, to reduce variations in time zones and languages of tweets. Hurricane path and extent data, shown in Figure 1, are utilised to distinguish users based on whether they were affected by Sandy directly. After filtering for geocoded messages the hashtag dataset includes 3.65 million messages from 1.24 million users (25.6% of them directly affected by the hurricane) and the keyword dataset includes 24.15 million messages from 5.98 million users (14.1% in the affected region).

Relevance filtering
The last filter that we impose on the raw data is content analysis to ensure a message is relevant to Hurricane Sandy. Our study of the awareness advantage relies on the time of the first hurricane-related tweet for each user, which must be determined as correctly as possible. Potential issues may arise equally from data incompleteness or excess. Incomplete data is a problem for the hashtag-based dataset, because hashtags are not used systematically in every message and some (or even all) relevant messages may be overlooked. In our case, the hashtag dataset includes 3.65 million messages, but the same users within the same period of time are represented by 11.07 million messages in the extended dataset. Although some additional messages are not related to the hurricane, many are, and must be included in the analysis. To avoid "false positives" (messages with no relevance to Hurricane Sandy) we implement simple filtering described below.
The evolution of Hurricane Sandy provides a convenient frame of reference for checking relevance. Since the hurricane was first classified as such and officially assigned its name on October 22, any keyword with a significant level of activity before that date should be filtered out to avoid inclusion of unrelated information. Figures S1 and S2 of the Supplementary Information summarize the histograms of messages matching specific keywords and demonstrate that the majority of them suffer from irrelevance noise. Regrettably, certain keywords of interest ("storm", "power" and "gas") are contaminated by irrelevant messages simply due to their general nature and/or multiple meanings: for instance, "storm" is mentioned in messages about small-scale local weather events, and "power" is used not just in the context of post-hurricane power outages, but also in the context of politics against the backdrop of the presidential election campaign. To eliminate noise from the datasets, only messages containing the word "sandy", either in a hashtag or keyword form, were included in the analysis. The effect of filtering is demonstrated in Figure 2, which compares histograms for messages without filtering, with moderate filtering (words include "sandy", "storm", "hurricane", "huracán", "superstorm" and "frankenstorm"), and with strict filtering ("sandy"). The results indicate that only the strict filtering succeeds in suppressing noise messages prior to the formation of the hurricane. Relevance filtering brings the total volume of data down to 4.51 million messages from 1.39 million unique users.
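The strict and moderate filters can be sketched as a simple keyword test. The tokenisation below (case-insensitive matching of whole words, with a leading "#" ignored) is an assumption; the text does not specify how matches were detected, and compound hashtags such as "#hurricanesandy" would not be caught by this version.

```python
import re

# Keyword sets quoted in the text for the two filtering levels.
STRICT_KEYWORDS = {"sandy"}
MODERATE_KEYWORDS = {
    "sandy", "storm", "hurricane", "huracán", "superstorm", "frankenstorm",
}

# \w matches Unicode word characters for str patterns, so "huracán"
# is kept as a single token; "#" is not a word character and is dropped.
TOKEN_PATTERN = re.compile(r"\w+")


def is_relevant(text, keywords):
    """True when at least one keyword appears as a whole word or hashtag."""
    tokens = {token.lower() for token in TOKEN_PATTERN.findall(text)}
    return not keywords.isdisjoint(tokens)
```

For example, a message about "power" in a political context passes neither filter, while "Big storm coming" passes the moderate filter but is rejected by the strict one.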

Lead-time in awareness
Arguably the most important aspect of the information flow during a disaster is awareness. Given the limitations of our dataset and the lack of retrospective studies into the link between disaster awareness and patterns of online activity, we assume that the time at which a person becomes aware of the hurricane and the time at which they first tweet about it are close to each other. To evaluate performance of the sensor method, we focus on the entry time $t$, defined as the time a user first appears in our dataset by posting a message relevant to Hurricane Sandy. Following the conventional terminology, we call an original random sample a control group, and a group formed from their friends a sensor group. Let us define the lead-time as the difference between the average entry times of the sensor group and its corresponding control group: $\Delta t = t_S - t_C$, where $t_S$ and $t_C$ are the average entry times of the sensor and control groups respectively, with negative lead-times indicating an awareness advantage. We estimate lead-times of sensor over control groups across the range of sample sizes from 500 to 100,000 users. Control groups are formed by random selection from the pool of users with known geographical location. Sensor groups are formed from the friends (followees) of users in their corresponding control groups; the two groups are of the same size and without user duplication. In all the analysis reported below and in the horizontal axes of figures, times are given as offsets in hours with respect to 00:00 UTC on October 30, which is approximately the time of Sandy's landfall on the continental United States.
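The entry time and lead-time definitions translate directly into a short computation. The sketch below assumes messages are available as (user, time) pairs with times already expressed as hour offsets from the landfall reference; the function names and the sample data are illustrative.

```python
from statistics import mean


def entry_times(messages):
    """Map each user to the time of their earliest relevant message.

    `messages` is an iterable of (user_id, hours_from_landfall) pairs.
    """
    first_seen = {}
    for user, time in messages:
        if user not in first_seen or time < first_seen[user]:
            first_seen[user] = time
    return first_seen


def lead_time(first_seen, control, sensor):
    """Average sensor entry time minus average control entry time.

    A negative value means the sensor group entered earlier on average.
    """
    t_control = mean(first_seen[user] for user in control)
    t_sensor = mean(first_seen[user] for user in sensor)
    return t_sensor - t_control


# Illustrative data only (hours relative to landfall).
messages = [
    ("u1", -10.0), ("u1", 5.0), ("u2", 20.0), ("u3", -30.0), ("u4", -6.0),
]
first_seen = entry_times(messages)
delta_t = lead_time(first_seen, control=["u1", "u2"], sensor=["u3", "u4"])
```

Only the first message of each user contributes, so later activity (the second message from "u1" here) does not affect the lead-time.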
To start our assessment of the sensor method we look at basic indicators, such as the entry time, the total number of messages, and the counts of friends and followers for each user. We observe that users with early entry times are characterised by an increased level of activity, as seen in Figure 3A.
Early entrants also have higher network centrality expressed by their in-degree (number of followers) and out-degree (number of friends or followees), shown in Figure 3B. These direct relationships between entry times and other characteristics are especially pronounced in the pre-landing stage of the hurricane, weakening in the post-landing stage.
An example of a distribution of messages over time is shown in Figure 4 for a random control sample and its corresponding sensor group. The inset presents a histogram of tweets, with a negligible level of initial activity that builds up and peaks on the day of landfall, slowly falling off afterwards. The pattern is largely the same for control and sensor groups, except for the absolute level of activity, with the sensor group being more active. The cumulative distribution function of entry times shows that the sensor group curve is shifted to the left, confirming earlier entry times.
The size of the sample in this particular example is 5,000 users, but all sample sizes considered in the study exhibit a similar left shift in the cumulative distribution and an elevated level of activity.
The magnitude of the shift and the scale of difference in the activity level both decrease when the size of a sample increases.
Preliminary findings discussed above indicate a link between awareness and centrality, which results in early awareness of sensor groups. It is important to estimate the magnitude of the lead-time and the influence of other factors, in particular the size of the sample and geographical location of users. Lead-time as a function of sample size is presented in Figure 5 (sampling without control over location is given in panel A as a solid black line). In the range of sample sizes considered, the lead-time varies between -11 and -5 hours, with sensor groups consistently showing earlier average entry time. An increase in the size of the sample shortens the lead-time and reduces its variance, as previously reported in other studies [32], which is explained by the asymptotic convergence of control and sensor group properties to those of the whole population. Results for key metrics, including tweeting activity levels, lead-times and entry times, are summarized in the Supplementary Information, Table S1.
We repeat the analysis with direct control over the location of users. Although it is difficult to accurately identify the area directly affected by the hurricane (because of the multitude of its effects including winds, storm surge, rain, snow, gas and electricity outages), the path and the extent of hurricane force winds provide a good approximation. The track and wind radii data obtained from the National Hurricane Center [35] are used to outline the affected area, which is shown in Figure 1.
The border of this area is used for selective sampling of individuals directly hit by Hurricane Sandy. Such geographically selective sampling is possible in four combinations: both control and sensor groups are within the affected area; control groups in and sensor groups out; control groups out and sensor groups in; and finally both groups out of the hurricane-affected area.
Key statistics for these combinations are summarised in Figure 5 and Tables S2-S5 in the Supplementary Information. It can be seen that geography strongly affects awareness. All four combinations of the geographical origin for control and sensor groups change the lead-time magnitude compared to sampling without regard to location (the solid black line in Figure 5A). The change is moderate if both groups are drawn from the same geographical pool, with affected pairs giving slightly longer and unaffected pairs slightly shorter lead-times (see labelled orange and blue trends in Figure 5A). The change is strong for mixed combinations, to the extent that the lead-time reverses its sign (and indicates lagging) when the control group is within and the sensor group is outside of the affected area (green line in Figure 5A). The longest lead-times arise for the combination of two factors: geographical relevance and central position in the network topology. This case is illustrated by the purple trend in Figure 5A for control groups formed outside (random position in the topology and low geographical relevance) and sensor groups within the disaster area (high geographical relevance and central position in the network topology). It could be argued that the direct relevance of the event influences one's behaviour in seeking, transmitting and generating information more than one's position as a central node of a social network. A similar explanation has been proposed for other digital traces of Hurricane Sandy, i.e. photographs posted on Flickr [36], where the number of pictures peaks close to the landing time and suggests that observed severity of the disaster may motivate people to document it with higher intensity. Finally, it is noteworthy that the entry times for the sensor groups located inside the affected area are actually negative and correspond to the pre-landing phase of the hurricane, see Tables S2 and S4 in the Supplementary Information.
In summary, our experiments show that the sensor method results in an awareness advantage of between 3 and 26 hours, depending on the sample size and geographical origin of the groups.
To evaluate statistical significance of the lead-times obtained above, we compare them to the null model where the timestamps of all messages in our database are randomly shuffled. Such a null model preserves the correlation between centrality and normal tweeting frequency, serving as an upper limit on the performance of the sensor method assuming that every user tweets about the disaster shortly after becoming aware of it. Comparison is presented in Figure 5B, with the null model lead-times exceeding those in the actual data. This confirms that the spread of the Sandy-related information on Twitter is not purely viral and endogenous, as in that case the actual lead-times would outperform the null model [32]. Future development of more complex null models, better suited for exogenous processes, may be required to adequately test experimental results.
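A minimal sketch of such a timestamp-shuffling null model is given below, assuming messages are stored as (user, time) pairs. The function name is hypothetical, and the study's exact shuffling procedure may differ in detail.

```python
import random


def shuffle_timestamps(messages, rng):
    """Randomly reassign the observed timestamps among all messages.

    Each user keeps the same number of messages (so more active users
    still draw more timestamps), but any link between who a user is
    and when they tweeted is destroyed.
    """
    users = [user for user, _ in messages]
    times = [time for _, time in messages]
    rng.shuffle(times)
    return list(zip(users, times))


# Illustrative data only (hours relative to landfall).
observed = [("u1", -40.0), ("u1", -3.0), ("u1", 12.0), ("u2", 8.0), ("u3", 30.0)]
shuffled = shuffle_timestamps(observed, random.Random(42))
```

Because an active user draws several timestamps, the earliest of them tends to be early, which is how the shuffled data retains the dependence of entry time on tweeting frequency.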

Sentiment study
We demonstrated above that the sensor method is generally successful in selecting users with high centrality, activity and awareness. An important question remains if they also differ in their emotional response. To study this, we employ several sentiment analysis techniques. Primarily, we use the sentiment scores generated by a proprietary algorithm from a data provider, analytics company Topsy Labs [37]. During the sentiment analysis of a message each word is matched against a dictionary of keywords and assigned a weight that reflects its emotional impact. Total weights are calculated, normalised by the word count and returned as either a relative sentiment (average of all scores taking into account their sign) or an absolute sentiment (average of absolute values). In our analysis we use the relative sentiment, because it is indicative both of the strength and polarity of sentiment. We also use discrete scores (1, 0 or -1) to distinguish positive, neutral and negative messages and to monitor their fractions in the stream of messages posted.
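The dictionary-based scoring described above can be sketched as follows. The lexicon and its weights are invented for illustration (the Topsy dictionaries are proprietary), whitespace tokenisation is a simplification, and deriving the discrete score from the sign of the relative sentiment is an assumption.

```python
def sentiment_scores(text, lexicon):
    """Return (relative, absolute) sentiment normalised by word count.

    relative: signed weights summed and divided by the number of words.
    absolute: absolute weights summed and divided by the number of words.
    """
    words = text.lower().split()
    if not words:
        return 0.0, 0.0
    weights = [lexicon.get(word, 0.0) for word in words]
    relative = sum(weights) / len(words)
    absolute = sum(abs(weight) for weight in weights) / len(words)
    return relative, absolute


def discrete_score(relative):
    """Map a relative sentiment to 1 (positive), 0 (neutral) or -1 (negative)."""
    return (relative > 0) - (relative < 0)


# Hypothetical lexicon; the weights are illustrative only.
LEXICON = {"safe": 1.0, "scared": -2.0, "flooded": -1.0}
relative, absolute = sentiment_scores("we are safe but scared", LEXICON)
```

The example message contains one positive and one stronger negative word, so its relative sentiment is negative while its absolute sentiment reflects the total emotional weight regardless of sign.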
Since the exact details of the sentiment detection algorithm (i.e. the dictionaries of emotion words and their respective weights) were not published by Topsy and thus cannot be fully reproduced, we verify the analysis with two additional techniques freely available for academic research. The first one is the general-purpose text analysis tool Linguistic Inquiry and Word Count (LIWC) [38], which is widely used for detection and classification of emotions in texts. On the most basic level, LIWC provides frequencies of occurrence of positive and negative emotional markers in texts. To combine these two measures into a single metric of sentiment polarity we follow Taboada et al. [39] and use the difference between the positive score and the negative score scaled by a factor of 1.5, a procedure that compensates for a statistical prevalence of positive emotions. The second tool is SentiStrength by Thelwall et al. [40], developed specifically for sentiment classification in short online messages characterised by the frequent use of non-standard spelling, slang, abbreviations and emoticons. Our comparison shows that all three techniques produce highly consistent temporal sentiment trends, shown in Figure 6, that differ only in the scaling factor and, in the case of SentiStrength, a moderate vertical offset (an upward shift of approximately 10% of the peak value is applied to bring the trend in line with LIWC and Topsy). We conclude that all of these techniques are equally adequate for the study, and the behaviour detected is robust regardless of the specific measurement tool applied.
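The combination of the two LIWC frequencies into a single polarity value amounts to a one-line formula, sketched here. The function and constant names are illustrative, and the inputs are assumed to be the positive and negative emotion frequencies reported for one text.

```python
# Scaling factor for the negative score, as quoted in the text.
NEGATIVE_SCALE = 1.5


def liwc_polarity(positive_freq, negative_freq, scale=NEGATIVE_SCALE):
    """Positive frequency minus the scaled negative frequency."""
    return positive_freq - scale * negative_freq
```

With this weighting, a text needs more than 1.5 times as many positive as negative markers to register as positive overall.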
The temporal evolution of sentiment is tracked as follows: we discretize time into non-overlapping bins of equal duration and take an average of relative sentiment scores for messages posted during each time step. To suppress the noise, we use basic smoothing by a three-point running average, when a value in a time-series is averaged with its nearest neighbours. Typical hourly sentiment trends are shown in Figure 7A-D for the control and sensor groups drawn from various combinations of affected or unaffected areas. The trends are quite noisy, making it necessary to analyse sentiment behaviour over large samples, in this particular instance of 100,000 users.
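The binning and smoothing steps can be sketched as below. Two details are assumptions not stated in the text: the first and last points are averaged with their single available neighbour, and bins without messages are simply omitted rather than interpolated.

```python
from collections import defaultdict
from math import floor


def binned_mean_sentiment(messages, bin_hours=1.0):
    """Average relative sentiment in non-overlapping bins of equal width.

    `messages` is an iterable of (hours_from_landfall, relative_sentiment).
    Returns a dict keyed by bin index, in increasing order.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for time, score in messages:
        index = floor(time / bin_hours)
        sums[index] += score
        counts[index] += 1
    return {index: sums[index] / counts[index] for index in sorted(sums)}


def three_point_average(values):
    """Average each value with its nearest neighbours (two points at the ends)."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Calling `binned_mean_sentiment(messages, bin_hours=24.0)` gives the daily discretisation, while the default produces hourly trends.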
Sentiment features a noticeable diurnal oscillation pattern, previously reported in the analysis of daily and seasonal variations in online activity [41]. Discretising time by days produces smoother curves of a lower temporal resolution, as the ones shown in Figure 7E Figure 7B and C (or Figure 7F and G).
The composition of the message stream evolves with time, as illustrated in Figure 8. The fractions of positive and negative messages are relatively stable, oscillating daily at a certain level. During the disaster phase, the number of negative messages grows at the expense of the positive ones and results in a distinct negative overall sentiment, which lasts approximately from -100 to +100 hours. This increase in the frequency of negative messages at the expense of positive ones potentially gives an opportunity for a simple and universal monitoring technique. Checking the sentiment of a randomly selected sample for the negative overall sentiment, where the share of negatives grows at the expense of positives, may suffice to detect both the occurrence and location of an emergency or disaster. More broadly, the same concept may be applicable to any topic of prominence reflected in the interactions online. For instance, the period of negative sentiment around -300 hours is due to the October 17 presidential debates about the hotly disputed topic of energy policy. The sharp drop in sentiment at +220 hours is due to the weather related tweets discussing the November northeastern storm and the associated snowfall at a time when people still suffered severe consequences from Sandy, including power outages. Notably, this sharp drop is more pronounced in groups drawn from the disaster-affected region.
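One way to turn this observation into a monitoring rule is sketched below. The decision rule (flag a time bin when the fraction of negative messages exceeds the fraction of positive ones) is an assumed simplification for illustration, not a threshold taken from the study, and the sample data is invented.

```python
from collections import Counter


def composition(labels):
    """Fractions of positive (1), neutral (0) and negative (-1) messages."""
    counts = Counter(labels)
    total = len(labels)
    return {label: counts[label] / total for label in (1, 0, -1)}


def negative_bins(labels_by_bin):
    """Return the time bins in which negatives outnumber positives.

    `labels_by_bin` maps a bin (e.g. hours from landfall) to the list of
    discrete sentiment scores of the messages posted in that bin.
    """
    flagged = []
    for time_bin in sorted(labels_by_bin):
        labels = labels_by_bin[time_bin]
        if not labels:
            continue
        fractions = composition(labels)
        if fractions[-1] > fractions[1]:
            flagged.append(time_bin)
    return flagged


# Illustrative data only: bins keyed by hours relative to landfall.
sample = {-200: [1, 1, 0, -1], 0: [-1, -1, 0, 1], 100: [0, 0, 1]}
flagged = negative_bins(sample)
```

Applying the same rule separately to messages from different regions would give the localisation aspect of the technique, since only regions with a negative-dominated stream would be flagged.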
As an illustration of the sentiment sensing technique, we apply it to the United States in the period between October 21 and November 7. On a regular spatial grid all messages are aggregated hourly and the average sentiment is calculated to obtain a spatio-temporal evolution of density and sentiment of tweets (see Video S1 in the Supplementary Information online). Two snapshots for

Discussion
In this empirical study we found that the method based on the "friendship paradox" is generally successful in forming sensor groups with an awareness advantage over the randomly selected control groups. The magnitude of the lead-time varies with the size of a sample, showing an advantage of up to 11 hours in small samples of 500 users. This advantage shortens to 5 hours when the size increases to 100,000 users. Lead-time can change significantly when geographical restrictions are imposed on the formation of control and sensor groups, especially if one of them is from the disaster-affected area and the other is not. Maximum advantage detected in our study was about 26 hours and resulted from the combination of high network centrality and geographical relevance.
Additional study of sentiment revealed that the emotional response was universal and followed a similar temporal evolution pattern in both control and sensor groups. The stream of messages changed its composition during the active phase of the disaster, and the increased fraction of negative messages pushed the average sentiment into negativity. Similar behaviour was observed on the shorter scale during the observation period and was linked to other prominent events (the presidential debates and the northeastern snowstorm). Features demonstrated by the sentiment are promising in terms of developing a universal sensing technique that does not require any preconditioning in the form of specific keywords to monitor.
Our study presents a first empirical investigation of the sensor method in a network where information propagates in a mixed mode, both endogenously and exogenously, and factors like a relatively short time scale and strong geographical nature of a disaster affect performance of the method. The lead-times we obtained may be sufficient for individuals to improve their own preparedness for a threat like a hurricane (warn others, stockpile water, food, medicines, fuel, batteries, protect properties, etc.), but are unlikely to give authorities enough time to adjust their global large-scale response. Nonetheless, the importance of the efficient pathway for the propagation of emergency information, provided by sensor groups, should not be underestimated. Early exposure to witness reports is a factor in compliance with authorities, because behaviour in disasters is often collective [42] and evacuation decisions depend heavily on the perception of risk and peer influence [43]. There is also an inherent resistance towards recognising potential dangers [44] and online interactions may either facilitate better response to threats or create undesired consequences, like panic [45,46]. Management of the information flow must therefore be included in the communications part of emergency planning guidelines [47].
It should be understood that the specific nature of a threat would be reflected in the performance of the sensor method. Since hurricanes, compared to other disasters, evolve slowly and are characterised by exceptional predictability using modern atmospheric simulation techniques, the value of the advantage on the scale of hours may be questionable as an early warning. But this might not be the case in other scenarios without extended warning time and predictability: for instance earthquakes [17][18][19][20][21], terrorist attacks, technological catastrophes, forest fires and flash floods. In events like these, a social network may effectively serve as a primary source of information from eyewitnesses [48], as well as a medium for its distribution.
The sentiment sensing technique proposed here is an attractive alternative to the existing methods of disaster detection on Twitter [20,21], which are based on the monitoring of message frequency and text mining for event-specific keywords. Our approach is indifferent to the nature of a disaster and does not require any filtering of a message stream to extract event-related tweets. However, additional study is needed to evaluate how well such a technique performs in an unfiltered stream of information.
One important limitation to the reliability of quantitative estimates of the lead-time via digital traces on social media lies in the assumption of a direct correlation between awareness and online activity on the topic. Such correlation needs additional experimental validation by traditional sociological methods. Regional and demographic differences are likely to exist in online behaviour, based on the communication platform adoption rate and patterns of use across groups from different regions, age, socio-economic status or cultural heritage. However, rigorous validation of these assumptions is impossible on the basis of the data that we use and is therefore beyond the scope of our research.
Overall, our findings confirm the potential of the sensor method for efficient early detection of emergency information and offer a new sentiment sensing technique for detection and localisation of disasters.

Author contribution statement
The authors contributed equally to the analysis performed, interpretation of results and preparation of the manuscript.

Competing financial interests
The authors declare no competing financial interests.

Data accessibility
The data was obtained through the analytics company Topsy Labs and, due to Twitter's policy, is not available for re-distribution. We verified the results of the proprietary sentiment classification algorithm by Topsy with other openly available tools (LIWC, SentiStrength) to guarantee reproducibility of our findings. All control and sensor samples, which contain a limited subset of information required for aggregated analysis without user-identifying details, may be provided upon request.

Figure captions
Figure 1: The path of the hurricane from the moment of its formation until dissipation is accompanied by the approximate extent of the hurricane force winds. Three threshold levels distinguished by shading correspond to the hurricane forces between Categories 1 and 3 (34, 50 and 64 knots respectively). An outer extent of the Category 1 winds is outlined in red and serves as a border of the area directly affected by the hurricane.
Figure 2: Strong filtering avoids early "noise" of irrelevant messages that may skew the estimate of an entry time. The most reliable form of such filtering is achieved only by including those messages that have "sandy" as a part of a text or as a hashtag.
Figure 3: Analysis shows that users who appear in the dataset early demonstrate a higher level of activity and are characterized by higher counts of friends and followers (occupy a more central network position). These features are especially pronounced in the pre-landing stage of the hurricane history (landing time is taken as a reference zero point). The fact that both activity and centrality correlate with entry time (awareness) suggests that the "friendship paradox" holds and sensor groups have an advantage of awareness lead-time, the magnitude of which is to be established.
Figure 8: We monitor the fraction of positive (solid green line for control and dashed green for sensor groups) and negative messages (solid red for control and dashed red for sensor groups) in the total volume of all tweets. During the most severe stage of the hurricane and after its landing the composition undergoes transition from predominantly positive to predominantly negative.
Figure (spatial sentiment snapshots): The density and polarity of sentiment is highlighted by green (positive) or red (negative). At the early stage, prevalent sentiment is either neutral or positive, and the interest in the hurricane is comparatively low, except for the Miami area. Close to the landing time (bottom) the sentiment in the area affected by the hurricane is overwhelmingly negative.
1. Tables S1-S5, with summary of lead-times and other metrics for all samples of various sizes and different combinations of geographical origin for control and sensor groups
2. Table S6, with a list of keywords in messages included in the extended dataset
3. Figures S1 and S2, with histograms of use for specific keywords to illustrate relevance filtering
Also, Video S1, illustrating the sentiment sensing technique, will be uploaded and made available online.
Figure S1, with decreasing level of keyword relevance. Note the frequent incidence of use before 22nd October 2012.