Topicality and Social Impact: Diverse Messages but Focused Messengers

Are users who comment on a variety of matters more likely to achieve high influence than those who delve into one focused field? Do general Twitter hashtags, such as #lol, tend to be more popular than novel ones, such as #instantlyinlove? Questions like these demand a way to detect topics hidden behind messages associated with an individual or a hashtag, and a gauge of similarity among these topics. Here we develop such an approach to identify clusters of similar hashtags by detecting communities in the hashtag co-occurrence network. Then the topical diversity of a user's interests is quantified by the entropy of her hashtags across different topic clusters. A similar measure is applied to hashtags, based on co-occurring tags. We find that high topical diversity of early adopters or co-occurring tags implies high future popularity of hashtags. In contrast, low diversity helps an individual accumulate social influence. In short, diverse messages and focused messengers are more likely to gain impact.


INTRODUCTION
Online social media provide a platform on the Internet in which people can easily and cheaply exchange messages. A great body of information is generated and tracked digitally, creating unprecedented opportunities for studying user-generated content and information diffusion processes [23,44]. Messages in social media involve a variety of topics. Content, messages, or ideas are deemed semantically similar if they discuss, comment on, or debate the same topic; conversely, we can detect a topic by clustering a group of similar messages. In this paper we study the topical diversity of content and content creators: messages and messengers. We develop a method to detect topics and propose a way to distinguish messengers with diverse interests from those with focused attention, as well as messages on general matters from those on particular domains. Thus we are able to tell which categories of users and messages have better chances to gain influence and impact.

In this study we use data from Twitter, one of the most popular social media platforms where Internet memes are generated, multiplied, and propagated. A user follows others to subscribe to the information they share. This generates a social network structure along which messages spread. Each Twitter user can post short messages called tweets, which may contain explicit topical tags (words or phrases following a hash symbol, '#') named hashtags. By using hashtags, people explicitly declare their interests in the corresponding discussions and help others with similar preferences find appealing content.
The social network topology is determined by how people are connected. Each individual is represented as a node and each following relationship as an edge linking a pair of users (see the bottom layer in Fig. 1). Hashtags spread among people through these social connections and can be mapped into a semantic space, in which each node is a tag and similar ones are coupled, forming topic clusters (see the top layer in Fig. 1). By examining which topics are attached to a user's messages, we can infer her interests; by examining the topics of tags that co-occur with a given hashtag, we can learn what that hashtag is about. In reality we are able to observe the social network structure and information diffusion flows, but not topic formation in the semantic space. To the best of our knowledge, the connection between these two layers of information diffusion is not yet well explored [39,36].
Let us highlight the main contributions of this paper:
• We develop a way to extract topics from online conversation. A network of hashtags is built by counting how many times a pair of hashtags appear together in a post. Communities (clusters of densely connected nodes) in such a network are found to represent topics well, as sets of semantically related hashtags.
• Given a user, we gauge the diversity of his topical interests by examining to which topic clusters each of his hashtags belongs. We can thus distinguish users with diverse interests from those with focused attention. The topical diversity of a hashtag is measured similarly, by considering its co-occurring hashtags.
• When a hashtag is adopted by people with diverse interests, or co-occurs with other tags on assorted themes, it is more likely for the tag to become popular. One interpretation is that diversity increases the probability of the hashtag being exposed to different audience groups. We show that the topical diversity of early adopters and that of co-occurring tags are both good predictors of the future popularity of hashtags.
• In contrast, high topical diversity is not a helpful factor in the growth of individual social impact. Focusing on one or a few topics may be a sign of expertise. Inactive users attract followers by mentioning a variety of topics, while active users tend to obtain many followers by maintaining focused topical interests. Focused topical preferences promote the content appeal of ordinary users and celebrities alike.

BACKGROUND
One prerequisite task for identifying topical interests of messages and messengers is to identify topics. Several studies examined the recognition of topics in the online scenario and social media [25,49,40,1,14]. Leskovec et al. [25] grouped short, distinctive phrases by single-rooted directed acyclic graphs used as signatures for different topics. Features extracted from content, metadata, network, and their combinations were leveraged to detect events in social streams [2,14]. Another approach is based on the discovery of dense clusters in the inferred graph of correlated keywords, extracted from messages in a given time frame [1,43]. Here we adopt a similar strategy to identify clusters of similar hashtags by detecting communities in the network topology [8,30] on account of topic locality.
Topic locality in the Web describes the phenomenon that most Web pages tend to link to related content [12,26]. The effect of topic locality is utilized in focused Web crawlers [27], collaborative filtering [16,17], interest discovery in social tagging [38,3], and many other applications [19,29,43,45]. In our scenario, topic locality refers to the assumption that semantically similar hashtags are more likely to be mentioned in the same messages and therefore to be close to each other in the hashtag co-occurrence network.
We see a growing literature on discovering user interests and topics [21,33,28,10,45,50]. A common approach is to use a vector representation generated from all the posts by a user to represent her interests. Then whether a user would be interested in a newly incoming message is determined by the similarity between the feature vectors of user interests and the message [10,46]. LDA has also been applied to extract user interests from user-generated content [45]. Java et al. [21] looked into communities of users in the reciprocal Twitter follower network and summarized user intent into several categories (daily chatter, conversations, information sharing, and news updates); a user could talk about various topics with friends in different communities. Michelson and Macskassy discovered entities mentioned in tweets according to predefined folksonomy-based categories to allocate topics so as to build an entity-based topic profile [28]. The diversity of user interests has not yet been thoroughly investigated. An exception is the work of An et al., who explored which news sources Twitter users are following and correlated the observation with the diversity of their political opinions [4].
In this paper we propose a simple but powerful method to detect topics and infer user interests, as well as definitions of topical diversity of users and content.
Hashtag popularity has been examined from various perspectives, including the innate attractiveness of hashtags [7,37], the network diffusion processes [11,15,32,5,47,48], user behavior [46,20,51], and the role of influentials along with their adoption patterns [22,6]. Romero et al. [36] predicted the popularity of a tag based on the social connections of its early adopters, but did not consider topicality and connections among tags.
We believe that the proposed measurement of topical diversity would prompt new approaches to the prediction of future hashtag popularity. Several previous studies have supported our intuition. For example, network diversity was shown to be positively correlated with regional economic development [34,13]; community diversity at the early stage tends to boost the chances of a meme going viral [47,48].
Many methods for quantifying social impact and identifying influential users have been proposed. User influence can be quantified in terms of high in-degree in the follower network [9,41], information forwarding activity [35,41], seeding larger cascades [22,6], or topical similarity [43,45].

METHODS
In this section we describe our dataset and define several key concepts to facilitate the subsequent discussion.

Dataset
We collected public tweets from January to March 2013. We set the first two months as the observation period and the last month as the test period; the former is used to build the topic network and quantify user topical interests, and the latter is used to evaluate the results of prediction tasks. Table 1 shows several basic statistics about the dataset, which is publicly available at carl.cs.indiana.edu/data/index.html#topic2014.
Hashtags during March 2013 are used for prediction tasks. We are interested in newly emergent tags, so that we are able to identify the start of their lifetime and track their growth for at least three weeks. We select hashtags that do not appear during January and February 2013, but are used by at least three distinct users during March 2013. In addition, only tags with the first tweet observed during the first week of March are considered, so that we can track their usage during the whole month. Eventually, 509,868 hashtags (3.03% of all hashtags in March) were chosen as emergent hashtags.
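The selection rule for emergent hashtags can be sketched as follows. This is a minimal illustration rather than the code used in the study: the record layout `(timestamp, user, hashtag)`, the function name `select_emergent`, and the constant names are assumptions introduced here; only the three criteria (absent in January and February, at least three distinct users in March, first tweet within the first week of March) come from the text.

```python
from collections import defaultdict
from datetime import datetime

OBS_END = datetime(2013, 3, 1)         # end of the observation period (Jan-Feb)
FIRST_WEEK_END = datetime(2013, 3, 8)  # first tweet must appear before this date
TEST_END = datetime(2013, 4, 1)        # end of the test period (March)

def select_emergent(records, min_users=3):
    """records: iterable of (timestamp, user, hashtag) tuples (hypothetical layout)."""
    seen_before = set()        # tags already used during the observation period
    first_seen = {}            # earliest March timestamp per tag
    users = defaultdict(set)   # distinct March adopters per tag
    for ts, user, tag in records:
        if ts < OBS_END:
            seen_before.add(tag)
        elif ts < TEST_END:
            users[tag].add(user)
            if tag not in first_seen or ts < first_seen[tag]:
                first_seen[tag] = ts
    return {
        tag for tag, adopters in users.items()
        if tag not in seen_before
        and len(adopters) >= min_users
        and first_seen[tag] < FIRST_WEEK_END
    }
```

A single pass over the records keeps memory proportional to the number of distinct hashtags and adopters, which matters for a corpus of this size.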

Topic Clusters
Hashtags are explicit topic identifiers on Twitter that are invented autonomously by millions of content generators. Since there is no predefined consensus on how to name a topic, multiple duplicate hashtags may be developed to represent the same event, theme, or object. For instance, #followback, #followfriday, #ff, #teamfollowback, and #tfb are all about asking others to follow someone back or suggesting people to follow; #tcot, #ttxcot, #twcot, and #ccot label politically conservative groups on Twitter. To reduce the duplication, we shift attention from single hashtags to more general categories, clusters of semantically similar hashtags, that we call topic clusters.
With the topic locality assumption that semantically similar hashtags are more likely to appear in the same tweets together, such topic clusters are expected to be densely connected. We detect these clusters by finding communities in the hashtag co-occurrence network. First we recover the network by only considering hashtags used by at least three distinct users and co-occurrences observed in at least three messages. We do this to filter out noise from accidental co-occurrence and spam. The recovered network contains 974,529 nodes and 7,325,492 edges. Then communities are detected using the Louvain community detection method [8], which was selected because of its efficiency. We obtain 37,067 communities (level 2 of the hierarchical structure found by the Louvain method). As exemplified in Table 2, communities in the hashtag co-occurrence network capture coherent topics. At the macroscopic level we can still observe strong topic locality (see Fig. 2).
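The network recovery step described above can be sketched with the standard library alone. The input layout (a list of `(user, hashtags)` pairs) and the function name `build_cooccurrence_network` are assumptions made for illustration; the two thresholds of three are the ones stated in the text. Community detection itself is not reimplemented here.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence_network(tweets, min_users=3, min_cooccur=3):
    """tweets: iterable of (user, hashtags) pairs (hypothetical layout).

    Returns {(tag_a, tag_b): weight} with tag_a < tag_b, where weight is the
    number of messages in which the two tags appear together.
    """
    tag_users = defaultdict(set)   # distinct users per hashtag
    pair_counts = Counter()        # messages per unordered hashtag pair
    for user, tags in tweets:
        unique = sorted(set(tags))           # count each pair once per message
        for tag in unique:
            tag_users[tag].add(user)
        pair_counts.update(combinations(unique, 2))
    kept = {t for t, us in tag_users.items() if len(us) >= min_users}
    return {
        pair: w for pair, w in pair_counts.items()
        if w >= min_cooccur and pair[0] in kept and pair[1] in kept
    }
```

The resulting weighted edge list could then be handed to any Louvain implementation, for example `louvain_communities` in NetworkX; which level of the Louvain hierarchy to keep is a separate choice that this sketch does not address.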

Diversity of User Interests
Given a messenger u, we can track the sequence of hashtags (with repetition) that he used in the past, $h_1, h_2, \ldots, h_{n_u}$. Each hashtag $h_i$ is attached to a topic $T(h_i)$, given by:

$$T(h_i) = C(h_i) \qquad (1)$$

where $C(h)$ is a community containing h in the hashtag co-occurrence network. The set of distinct topics associated with all of u's hashtags is denoted as $T_u$, with $T(h_i) \in T_u$. The topical diversity of a user's interests can be estimated by computing the entropy of hashtags across topics:

$$H_1(u) = -\sum_{T \in T_u} P(T) \log P(T) \qquad (2)$$

where $P(T)$ is the fraction of u's $n_u$ hashtags that belong to topic T. Table 3 compares two people, both having used 10 distinct hashtags 20 times. User A was interested in trendy Twitter-specific tags almost exclusively (low H1), while user B paid attention to a set of very diverse conversations about countries, movies, books, and horoscope (high H1). Note that the opposite (and wrong!) conclusion, $H_1(A) > H_1(B)$, would be drawn had we measured entropy based on hashtags rather than topic clusters.
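The contrast between topic-level and hashtag-level entropy can be made concrete with a small sketch. The function name `topical_entropy`, the topic labels, and the two toy users are invented for illustration (they are not the users of Table 3), and the natural logarithm is an assumption, since the text does not state the base.

```python
import math
from collections import Counter

def topical_entropy(hashtags, topic_of):
    """Entropy (natural log) of a hashtag sequence across topic clusters.

    topic_of maps each hashtag to its topic label (hypothetical mapping).
    """
    counts = Counter(topic_of[h] for h in hashtags)
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy data (invented): both users post 20 hashtags drawn from 10 distinct tags.
user_a = [f"#a{i}" for i in range(10)] * 2                 # uniform usage, one topic
user_b = ["#b0"] * 11 + [f"#b{i}" for i in range(1, 10)]   # skewed usage, four topics

topic_of = {f"#a{i}": "twitter" for i in range(10)}
topic_of.update({"#b0": "countries", "#b1": "countries", "#b2": "countries",
                 "#b3": "movies", "#b4": "movies",
                 "#b5": "books", "#b6": "books", "#b7": "books",
                 "#b8": "horoscope", "#b9": "horoscope"})
identity = {h: h for h in topic_of}   # treats every hashtag as its own topic

h_topic_a = topical_entropy(user_a, topic_of)   # all tags in one cluster
h_topic_b = topical_entropy(user_b, topic_of)   # spread over four clusters
h_tag_a = topical_entropy(user_a, identity)     # hashtag-level entropy
h_tag_b = topical_entropy(user_b, identity)
```

In this toy case the topic-level measure ranks user B as more diverse, while the hashtag-level measure ranks user A higher, which is the reversal the text warns about.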

Diversity of Content
Similarly, given a hashtag h, we recover the sequence of other hashtags (with repetition) that co-occurred with it, $h_1, h_2, \ldots, h_{m_h}$. Each co-occurring hashtag (co-tag) $h_i$ is assigned to topic $T(h_i)$ based on the topic cluster to which it belongs (see Equation 1). Then the co-tag diversity of h, H2(h), is measured in the same way as the user diversity H1 (see Equation 2).
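A sketch of the co-tag diversity computation follows. It assumes tweets are available as lists of hashtags and that a hashtag-to-topic mapping has already been obtained; the names `cotag_sequence`, `cotag_diversity`, and the toy data are illustrative, and the natural logarithm is again an assumption.

```python
import math
from collections import Counter

def cotag_sequence(target, tweets):
    """All other hashtags (with repetition) in tweets that contain target."""
    seq = []
    for tags in tweets:
        if target in tags:
            seq.extend(t for t in tags if t != target)
    return seq

def cotag_diversity(target, tweets, topic_of):
    """Entropy of the co-tags of target across topic clusters (natural log)."""
    counts = Counter(topic_of[t] for t in cotag_sequence(target, tweets))
    n = sum(counts.values())
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# Toy data (invented): "#x" co-occurs with three "news" tags and one "sports" tag.
toy_tweets = [["#x", "#a"], ["#x", "#b"], ["#y", "#a"], ["#x", "#a", "#c"]]
toy_topics = {"#a": "news", "#b": "sports", "#c": "news"}
h2_x = cotag_diversity("#x", toy_tweets, toy_topics)
```

A hashtag that never co-occurs with another tag has an empty co-tag sequence, for which this sketch returns 0.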

PREDICTING HASHTAG POPULARITY
Do diversity measures help us detect hashtags that will go viral in the future? In this section we explore whether the topical diversity of a hashtag's adopters or co-tags predicts its future popularity. Hashtags in Twitter can be treated as channels connecting people with shared interests, because hashtags label and index messages, enabling people to easily retrieve information and broadcast to certain groups. As illustrated in Fig. 3, users with focused interests are linked with few groups, while people who care about diverse issues are exposed to a larger number of interest groups through hashtag channels. We expect the latter category of users to play a critical bridging role, connecting many groups in the network. This would allow them to spread innovative information to multiple groups, as suggested by the weak tie hypothesis [18], thus boosting the diffusion of hashtags [31,47,48]. In other words, we hypothesize that if a hashtag has early adopters with diverse topical interests, it is more likely to go viral.

Prediction via User Diversity
Given a hashtag h, we track the users who adopt it within t hours after h is created and compute the average interest diversity among these early adopters as a simple predictor. Irrespective of how long we track, we observe a positive correlation between the average user diversity and the future popularity of the hashtags, measured as the total number of adopters after one month (see Fig. 4a). To better evaluate the predictive power of adopter diversity, let us run a simple prediction task based on information at the early stage to forecast which hashtags from the test period will be popular in the future. A hashtag is deemed popular if the number of distinct adopters at the end of the test period is above a given threshold. Our evaluation algorithm has three steps:
i) For each feature, we compute its value for each newly emergent hashtag h in the test period based on the set of early adopters of h within t hours after the birth of h. A hashtag is born when the first tweet containing it appears. The feature is either a measure of user characteristics averaged among early adopters, or a linear combination of several such measures. We track adoption events for t = 1, 6, and 24 hours since birth.
ii) Hashtags are ranked by the feature values in descending order.
iii) We set a percentile threshold for labeling popular hashtags. The most popular hashtags are deemed "viral." Based on this ground truth, we can measure false positive and true positive rates and draw a receiver-operating-characteristic (ROC) plot. The area under the ROC curve (AUC) is our evaluation metric. The higher the AUC value, the better the feature as a predictor of future hashtag popularity.
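The three evaluation steps above can be sketched in a few functions. All names (`early_adopter_mean`, `label_viral`, `auc`) and data layouts are assumptions made for illustration. The AUC is computed through its rank interpretation (the probability that a randomly chosen viral hashtag scores higher than a randomly chosen non-viral one, with ties counted as one half), which equals the area under the ROC curve; the quadratic pairwise loop is chosen for readability and would be replaced by a rank-based formula on a large dataset.

```python
def early_adopter_mean(adoptions, t_hours, user_value):
    """Step i: average a user attribute over adopters within t_hours of birth.

    adoptions: list of (time_in_hours, user); birth is the earliest time.
    user_value: dict user -> attribute (e.g. topical diversity H1).
    """
    birth = min(ts for ts, _ in adoptions)
    early = {u for ts, u in adoptions if ts - birth <= t_hours}
    return sum(user_value[u] for u in early) / len(early)

def label_viral(final_adopters, top_fraction):
    """Step iii (ground truth): the top fraction by final adopter count is viral."""
    ranked = sorted(final_adopters, key=final_adopters.get, reverse=True)
    k = max(1, int(round(top_fraction * len(ranked))))
    viral = set(ranked[:k])
    return {h: h in viral for h in final_adopters}

def auc(scores, labels):
    """Steps ii-iii: area under the ROC curve obtained by ranking on scores."""
    pos = [scores[h] for h in scores if labels[h]]
    neg = [scores[h] for h in scores if not labels[h]]
    if not pos or not neg:
        raise ValueError("need at least one viral and one non-viral hashtag")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a feature that ranks hashtags no better than chance, which is why relative differences between features are the quantity of interest.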
We consider several user attributes of early adopters that have been shown in the literature to be strong predictors of virality [9,41,42,35,48]. These include the number of early adopters n, number of followers fol (potential audience), and number of tweets twt that a user has produced during the observation period (activity). We additionally consider the diversity of topical interests, H1. The goal of our experiment is not to achieve the highest accuracy (a task for which different learning algorithms could be explored). We aim to compare the predictive powers of different features. Therefore we focus on the relative differences between AUC values generated by single or combined features rather than on the absolute AUC values. AUC values measured using different features are listed in Table 4. Among individual features, n is the most effective. When we combine it with other features, fol yields high AUC consistently, but the differences are very small. The performance of the diversity metric is competitive, matching the top results in several experimental configurations. These results are not particularly sensitive to the popularity threshold or the duration of the early observation window.

Prediction via Content Diversity
In this section we examine whether the future popularity of a hashtag is affected by the topical diversity of its early co-occurring tags. How people apply hashtags to label their messages depicts their topical interests and determines the topology of the tag co-occurrence network. In Fig. 1, a link between the topic layer and the social layer of the network marks an association between a user and a hashtag. This tag may attract an audience in the social network. The co-occurrence of two tags extends the audience groups of both. For example, the co-occurrence link in Fig. 1 exposes user A to the blue topic and user B to the red cluster. Therefore we expect a hashtag to be exposed to more potential adopters, making it more likely to go viral, if it often co-occurs with many other hashtags. To test this hypothesis, we measure the number m of co-tags. Furthermore, if co-tags are very popular, we would expect a stronger effect because they would provide a larger audience. We therefore measure the popularity of co-tags in terms of numbers of tweets T and adopters A during the observation period. And if co-tags are about diverse topics, this may further boost the effect by extending the audience to many groups with small overlap. In conclusion, we hypothesize that many popular co-tags about diverse topics should be a sign that a hashtag will grow popular.
Given an emergent hashtag h, we track other tags that co-occur with h within t hours after h is born and measure the topical diversity H2 of these co-tags. We observe a positive correlation between the diversity of early co-tags and the future popularity of the tag (see Fig. 4b). Then we apply the same method as in Sec. 4.1 to test the predictive power of different traits associated with early co-tags. In this case, the prediction features for each target hashtag are computed based on early co-tags instead of adopters. Again, the goal of our experiment is to compare the predictive powers of different features, thus we examine the relative differences in AUC values generated by the various traits. The results are reported in Table 5. The number m of co-tags observed in the early stage is the best single predictor of virality. When we combine m with a second feature, co-tag diversity provides the best results irrespective of the threshold or the duration of the early observation window. Interestingly, m and H2 are both about diversity and perform better than the popularity-based features T and A.

Summary
In the discussion above, we evaluate the predictive powers of two categories of features for identifying future popular hashtags. These two sets of features, based on early adopters and co-tags, have different effectiveness. By comparing the AUC values in Tables 4 and 5, we find that adopter features yield better results. However, they also require additional prerequisite knowledge: in addition to tracking hashtag co-occurrences for building the topic network, we also need to record user-generated content. The features built upon early co-tags are less expensive, but the performance is slightly worse; a possible interpretation for this is that few tweets in the observation window may contain co-occurring tags, while they all have associated users. Therefore co-tag features are more sparse. Depending on what type of information is available, one might choose either approach or a combination of both.

SOCIAL INFLUENCE
High topical diversity of adopters and co-occurring tags is a positive sign that a hashtag is growing popular, as shown in the previous section. However, does high topical diversity also signal a growth in individual influence? On one hand, when an individual talks about various topics, she may have contact with many others through shared interests or hashtags, thus attracting more attention (see Fig. 3). On the other hand, focused interest may enhance expertise in specific fields, thus increasing the content interestingness and retweetability. In this light, low diversity triggered by expertise might help people become popular. In this section we evaluate these two contradictory hypotheses.

Table 4: AUC of prediction results using different adopter features within t early hours. Prediction features include the number of followers (fol), the number of tweets (twt), the diversity of topical interests of adopters (H1), and the number of early adopters (n). The threshold is expressed as a top percentile of most popular hashtags that are deemed viral for evaluation purposes. Best results for each column are bolded.

Table 5: AUC of prediction results using different features among co-tags within t early hours. Prediction features include the number of tweets containing the co-tags (T), the number of co-tag adopters (A), the diversity of co-tags (H2), and the number of observed co-tags (m). The threshold is expressed as a top percentile of most popular hashtags that are deemed viral for evaluation purposes.
0.55 0.55 0.57 0.58 0.60 0.63 0.66 0.70 0.75 0.74 0.81 0.86
† A linear combination with coefficients determined by regression fitting using least squared error.

Some people are more influential than others in persuading friends to adopt an idea, an action, or a piece of information. The concept of social influence has been discussed extensively in social media research. Most studies in the literature treat being active [9], having many followers [9,35], being able to trigger large cascades [22,6], or being retweeted or mentioned a lot [9,41,35] as signals of high social influence. Which user characteristics make people popular and influential? Does the diversity of individual topical interests play a role in the social influence processes? Let us consider several individual properties:

Number of retweets (RT)
How many times an individual is retweeted during the observation time period. We consider RT as a direct indicator of social influence, since it quantifies how many times the user succeeds in making others adopt and spread information (see footnote 2). The number of retweets is dependent on the length of the observation window, because we believe that social influence is accumulated in time and requires long-term endeavor [9].

Number of followers (fol)
The number of followers suggests how many people can potentially view a message once the user posts it.
2 Due to the settings of the Twitter API, the number of retweets per user that we collect includes all the retweeters in every cascade.That is, suppose user B retweets user A and then C retweets B; both tweets are counted in RT for A, even though C did not directly retweet A. However, since the majority of information cascades are very shallow [6], RT is a good approximation of the direct retweet count.

Number of tweets (twt)
The number of tweets generated by the user; the higher the number, the more active the user.
Content interestingness (β)
How interesting is the content posted by the user. Lerman studied the interestingness of online content on Digg and defined it as "the probability it will get retweeted when viewed" [24]. To measure β in the Twitter context, we assume that the value of RT for an individual is proportional to the number of tweets twt he produced, the number of followers fol, the chance α that a message is seen by a follower, and the appeal of the content. Treating α as a constant for simplicity, we obtain

$$\beta \propto \frac{RT}{twt \cdot fol}$$

Diversity of interests (H1)
See Sec. 3.3.
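As a small worked example of the interestingness measure β, the sketch below computes retweets per tweet per follower, that is, β up to the constant factor contributed by α. The function name `interestingness`, the handling of users with no tweets or no followers, and the numbers are assumptions made for illustration.

```python
def interestingness(rt, twt, fol):
    """Content interestingness up to the constant 1/alpha: RT / (twt * fol)."""
    if twt == 0 or fol == 0:
        return None   # undefined without tweets or potential viewers (assumption)
    return rt / (twt * fol)

# Two hypothetical users with the same retweet count and activity:
small_audience = interestingness(rt=50, twt=10, fol=100)    # 50 / 1000
large_audience = interestingness(rt=50, twt=10, fol=1000)   # 50 / 10000
```

With equal retweets and activity, the user with the smaller audience receives the higher score, because each exposure is more likely to have led to a retweet.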
Table 6 lists the results of a linear regression estimating how many times a user is retweeted according to several user features. Intuitively, users with many followers are more likely to spread their messages and thus get retweeted more frequently, because they have many more potential viewers. The number of followers is the most important factor, as supported by the largest positive coefficient in the regression. The number of generated tweets also has a positive coefficient in the regression, implying that being active helps users get retweeted more. The result confirms several existing studies suggesting that high social influence requires long-term, consistent effort [9,41]. The interestingness of the story is positively correlated with social influence as well, although not as strongly as the other factors. Finally, the negative coefficient of diversity in Table 6 suggests that users with diverse interests tend to have low influence. This supports the hypothesis that social influence is topic-sensitive, requiring expertise in a specific field [45]; posting about the same topic is more effective for gaining social influence, compared to commenting on many different subjects. In summary, people can acquire social influence by having a big audience group, being productive, creating interesting content, and staying focused on a field. Unfortunately, it seems that there is no simple recipe for success.
We illustrate how several user properties are related to the number of followers and the topical diversity of user interests in Fig. 5. Most users have a small number of followers and low entropy (Fig. 5a). Active users tend to have high diversity, as expected from the nature of entropy (Fig. 5b). The number of followers is shown in Fig. 5c to be a powerful factor to get retweeted more often, consistent with the regression results in Table 6. Finally, the content interestingness appears to be correlated with the number of followers but not strongly with user diversity (Fig. 5d).
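A regression on Z-score-normalized variables, as in Table 6, yields standardized coefficients whose magnitudes can be compared across features. The sketch below shows one way to obtain them with the standard library only, by solving the normal equations; it is not the fitting procedure used in the study, whose details (any variable transformations, software) are not given in the text. The function names and the toy data are illustrative.

```python
import statistics

def zscore(xs):
    """Normalize values to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mu) / sd for x in xs]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def standardized_ols(features, target):
    """features: dict name -> values. Returns dict name -> standardized coefficient.

    No intercept is needed because all variables are centered by the Z-score.
    """
    names = list(features)
    cols = [zscore(features[n]) for n in names]
    y = zscore(target)
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in cols]
    return dict(zip(names, solve(xtx, xty)))

# Toy data (invented): y = 2 * f1 - f2 with uncorrelated f1 and f2.
toy = {"f1": [1, 2, 3, 4], "f2": [1, -1, -1, 1]}
coef = standardized_ols(toy, [1, 5, 7, 7])
```

The sign of each coefficient gives the direction of the association and its magnitude the relative weight, which is how the negative coefficient of diversity in Table 6 is read.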

Active vs. Inactive Users
Let us explore how the number of followers a user can attract is affected by the diversity of topical interests. The entropy measure for diversity is biased by user activity: generating more tweets with more hashtags tends to yield higher entropy. Thus we group users by productivity, so that individuals in the same group have comparable values of topical diversity. For users in the same group, we compute the Spearman rank correlation between the number of followers and diversity. We use Spearman because, unlike Pearson, it does not require that both variables be normally distributed. According to Fig. 6, low-engagement users attract followers by talking about various topics, while active users tend to obtain many followers by maintaining focused topical interests. For the most active users, topical diversity is not relevant; many of these accounts are spammers and bots.
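The grouped correlation analysis can be sketched as follows. Spearman's coefficient is computed as the Pearson correlation of average ranks; the binning of users by order of magnitude of their tweet count is an assumption made for illustration (the text does not specify the bins), as are the function names and the toy users.

```python
import math
from collections import defaultdict

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

def spearman_by_activity(users, bin_of):
    """users: list of (twt, fol, h1); bin_of maps twt to a group key (hypothetical)."""
    groups = defaultdict(list)
    for twt, fol, h1 in users:
        groups[bin_of(twt)].append((fol, h1))
    return {g: spearman([f for f, _ in v], [h for _, h in v])
            for g, v in groups.items() if len(v) > 2}

# Toy users (invented): (tweets, followers, diversity H1).
toy_users = [(5, 10, 0.1), (6, 20, 0.5), (7, 30, 0.9),
             (500, 10, 0.9), (600, 20, 0.5), (700, 30, 0.1)]
by_group = spearman_by_activity(toy_users, lambda twt: int(math.log10(twt)))
```

Grouping before correlating keeps the activity bias of the entropy measure from dominating the result, since users within a group have comparable numbers of tweets.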

Celebrities and Ordinary Users
When looking into the effect of interest diversity on content appeal, we need to control for the number of followers, since our interestingness measure is strongly correlated with the number of followers (see Fig. 5d). The negative correlations shown in Fig. 7 suggest that in general, focused posts promote content appeal. One possible interpretation is that people follow someone for a reason. Content has to be consistent in order to match such expectations; i.e., one is less likely to share a tip on cosmetics from a politician. This effect is stronger for users with few followers and for celebrities; people with moderate popularity generate retweets with both focused and diverse content.

CONCLUSION
We proposed methods to identify topics using Twitter data by detecting communities in the hashtag co-occurrence network, and to quantify the topical diversity of user interests and content, defined by how tags are distributed across different topic clusters. We found that popular hashtags tend to have adopters who care about various issues and to co-occur with other tags of diverse themes at the early stage. One practical application evaluated in this paper is to predict viral hashtags using features built upon the topical diversity of early adopters or co-tags. In the prediction using information on early adopters, the performance of topical diversity is competitive with other user features when combined with the number of early adopters. In the prediction with early co-occurring hashtags, features about diversity, including the number of early co-tags and their topical diversity, outperform the popularity-based features. However, high topical diversity is not a positive factor for individual popularity. High social influence is more easily obtained by having a big audience group, producing lots of interesting content, and staying focused. In short, diverse messages and focused messengers are more likely to generate impact.
The interesting observation that high diversity helps a hashtag grow popular but does not help develop personal authority originates from the different mechanisms by which a hashtag and a user attract attention. In the diffusion process of a hashtag, adopters with diverse interests play a role as bridges connecting different groups and thus improve the visibility of the tag. These results are consistent with Granovetter's theory [18], as well as our recent findings on the strong link between community diversity and virality [47]. On the other hand, a user gains social influence through expertise or authority within a cohesive group with common interests.
Topical diversity provides a simple yet powerful way to connect the social network topology with the semantic space extracted from online conversation. We believe that it holds great potential in applications such as predicting viral hashtags and helping users strengthen their online presence.

Figure 1: We can represent the topics of online conversations in social media by a multi-layer network. The social network connects people. In the topic network, nodes represent hashtags that are linked when they co-occur; clusters represent topics (shown in colors). A person and a hashtag are connected when the person uses the hashtag.

Figure 2: Examples of connected topic clusters of related themes: (a) news and politics, (b) sports, (c) soccer, and (d) music and entertainment. Each node represents a cluster of hashtags on the topic as labelled; the area is proportional to the number of hashtags that the topic cluster contains; the color is assigned according to the degree so that high degree is more red and low degree is more blue. All these examples support the existence of topic locality.

Figure 3: User A has diverse topical interests, each connected group corresponding to a social circle with common interests. User B displays more focused interests.

Figure 4: (a) Correlation between the average topic-based entropy H1 of adopters in the first 24 hours and the total number of future hashtag adopters. (b) Entropy H2 of hashtags co-occurring in the first 24 hours across topic clusters as a function of future popularity of emergent hashtags.

Table 6: Linear regression estimating how many times a user is retweeted. For efficiency, the regression is based on a random sample of 10% of the users (N = 2,171,624). Variables are normalized by Z-score. *** p < 0.001

Figure 5: Heatmaps of (a) the number of users, (b) the number of tweets generated, (c) how many times a user is retweeted, and (d) the content interestingness of a user, as a function of the diversity of topical interests, H1, and the number of followers in the observation window.

Figure 6: Spearman rank correlation between the number of followers and the topical diversity of user interests as a function of user activity. All shown correlation values are significant (p < 0.05).

Table 1: Basic statistics of the dataset, which is split into two periods: observation and testing. About 13% of the tweets contain hashtags.

Table 3: Comparison of two users with different diversity of topical interest.
Best results for each column are bolded.