Structural invariants and semantic fingerprints in the “ego network” of words

Well-established cognitive models from anthropology have shown that, due to the cognitive constraints that limit our "bandwidth" for social interactions, humans organize their social relations according to a regular structure. In this work, we postulate that similar regularities can be found in other cognitive processes, such as those involving language production. In order to investigate this claim, we analyse a dataset containing tweets of a heterogeneous group of Twitter users (regular users and professional writers). Leveraging a methodology similar to the one used to uncover the well-established social cognitive constraints, we find regularities at both the structural and semantic levels. At the structural level, we find that a concentric layered structure (which we call ego network of words, in analogy to the ego network of social relationships) captures very well how individuals organise the words they use. The size of the layers in this structure grows regularly (approximately 2-3 times with respect to the previous one) when moving outwards, and the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of layers of the user. For the semantic analysis, each ring of each ego network is described by a semantic profile, which captures the topics associated with the words in the ring. We find that ring #1 has a special role in the model: it is semantically the most dissimilar and the most diverse among the rings. We also show that the topics that are important in the innermost ring are also predominant in each of the other rings, as well as in the entire ego network. In this respect, ring #1 can be seen as the semantic fingerprint of the ego network of words.


Introduction
In humans, language production is a deliberate and conscious action. However, it relies on many invisible mental processes that allow the construction of sentences in a very short time. For example, these cognitive processes are at play during the word retrieval stage, when the brain has to efficiently process, in a few milliseconds, its lexicon in order to find the right word, among thousands of others, that best fits the concept that needs to be expressed [1]. In order to achieve this impressive performance, cognitive strategies that exploit language properties, such as word frequency (e.g. when the most frequently used words are retrieved more quickly [2,3]), are activated. In this paper, we set out to find traces of these cognitive patterns in written production with a data-driven approach. To this end, we rely on the ego network model, which has already uncovered the cognitive limits of another human activity: socialisation.

The social ego network model
Anthropologists have shown that the number of meaningful social relationships that humans can maintain is not only limited to 150 [4] (the famous Dunbar's number) but is also stable over time. The discovery of this regularity in human activity stems from the observation that, in different species of primates, there exists a correlation between the size of the neocortex (the part of the brain dedicated to high-level cognitive functions such as socialisation, language, etc.) and the average size of groups in natural environments. Extrapolating the expected size of a human group from the size of the human brain, as well as studying historical data such as the maximum size before fission of autonomous communities [5], Dunbar's number consistently emerges. It was then shown that these 150 active social relationships can be further subdivided into 4 concentric circles [6,7], the innermost one containing the most intimate social relationships [8], the outermost one enclosing all 150 social relationships. The typical sizes of these concentric circles are 5, 15, 50, and 150, respectively, with a constant scaling ratio of about 3 between consecutive circles. Note that the portion of a circle not included in its inner circles is referred to as a ring. This hierarchical structure of social relationships is called "ego network". Recent studies based on data collected from online social networks have shown that online relationships are subject to the same laws as offline ones: the size of the ego network (i.e., the total number of social relationships) remains in the same order of magnitude as Dunbar's number, which indicates that the cognitive constraint yielding this number is not overridden by a communication medium that facilitates social interactions [8][9][10][11].
In OSNs (Online Social Networks), the typical number of circles is slightly higher than 4, due to the presence of an additional circle in the center of the ego network (containing about 1.5 people), but the scaling ratio is preserved at around 3 (Fig. 1).

Fig 1. The ego network of social relationships. The green dot symbolizes the ego and the black dots the alters with whom the ego maintains an active social relationship. A layer also contains the alters of the inner layers, unlike the rings.

From social ego networks to ego networks of words
The ego network model highlights the regularity of the structure of social relations, both in real life and in OSNs. In this paper, we adopt an analogous approach to investigate the regularities and invariants manifesting cognitive constraints in language production. Specifically, we conjecture that a similar structure, which we call "ego network of words", may also be used to describe the way humans use words, and that this structure may provide very significant information to characterise the peculiarities of individuals, similarly to the social dimension. In fact, it is known [12] that many traits of social behavior (resource sharing, collaboration, diffusion of information) are chiefly determined by the structural properties of social ego networks.
The motivation for this analogy is twofold. First, the use of words is, much like socialisation, a process that involves the use of cognitive resources; thus, we conjecture that the ego network model may have wider applicability in describing how humans allocate cognitive resources, for example to language. Second, language is a social activity, whose emergence is potentially linked to the surge in active human relationships from the 50 of our closest primate relatives to the 150 of humans. This theory, known as the social gossip theory of language evolution [13], postulates that language facilitates the grooming of social relations by reaching several peers at the same time. In addition, there is already well-established knowledge of a number of empirical cognitive limits affecting language, such as the bounded size of our vocabulary (which is consistently limited to approximately 42,000 words for a native 20-year-old English speaker [14]), as well as Zipf's law [15], which states that the frequency of a word is inversely proportional to its position in the frequency table for most human writings. We therefore choose to study the individual distribution of vocabulary, by forming concentric circles of words according to their frequency of use by the ego in question. Then, going beyond words as units of language, we focus on the topics to which the words refer. We thus complement the structural analysis with a semantic study, which completes our cognitive analysis framework. In the same way that the social ego network model has been used to provide a different perspective on social network analysis (such as for information diffusion [16]), we want to leverage the ego networks of words as microscopes to discover novel properties of language production.

Contribution and key findings
The main contribution of this work is the structural and semantic analysis of the ego networks of words for Twitter users. By using the ego network model, in this paper, we uncover complex structures showing that the cognitive effort to organise one's vocabulary is limited in many ways. We choose a corpus of text made up of tweets because it allows us to work with a varied sample of "authors" (e.g. more varied than a corpus of newspaper articles). Moreover, as Twitter is dedicated to the exchange of very short messages (280 characters), it is a medium that is very favourable to spontaneous reactions, with a more natural style and a reduced writing time. This time constraint is more likely to reveal human behaviour, in analogy with the social domain, where time limitations have been shown to significantly affect social cognitive constraints [13]. For our data-driven analysis, we collected tweets from generic as well as specialised Twitter users (Section 3). Using the ego-network-of-words model, we are able to find evidence of a structural regularity in the frequency of word usage by each individual (Section 4). The semantic analysis (Section 5) also establishes the existence of additional invariants, but most importantly it uncovers the role of the innermost layer as the semantic fingerprint of the whole ego network, i.e., this layer groups together the most important topics on which the user is active. This strengthens the analogy with the social version of the ego network model, where the innermost layers include the most important social relationships of a person.
The key findings of the paper are the following.
• Similarly to the social case, we found that a regular concentric, layered structure (which we call ego network of words in analogy to the ego networks of the social domain) captures very well how an individual organizes their cognitive effort in language production. Specifically, words can typically be grouped into between 5 and 7 layers of decreasing usage frequency moving outwards, regardless of the specific class of users (regular vs professional).
• One structural invariant is observed for the size of the layers, which approximately doubles when moving from layer i to layer i + 1. The only exception is the innermost layer, which tends to be approximately five times smaller than the next one. This suggests that the innermost layer, the one containing the most used words, may be drastically different from the others.
• A second structural invariant emerges for the external layers. Users with more layers organise their innermost layers differently, without significantly modifying the size of the most external ones. In fact, while the size of all layers beyond the first one increases linearly with the size of the most external layer, the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of layers of the user.
• The semantic analysis of the words contained in the ego networks confirms that layer #1 is exceptional in the ego networks of words: it generates proportionally more topics than the other rings, these topics are more diverse, and its overall semantic profile is the most different with respect to those of other rings.
• In addition, topics that are important in ring #1 tend to be important in other rings as well (we call this the pulling power of ring #1). Thus, layer #1, despite being the smallest, can be seen as the semantic fingerprint of the ego network of words.
• The topics that are primary in some rings tend to be stronger than average among the primary and non-primary topics in the semantic profile of the other rings. This shows that, while layer #1 provides a particularly strong signal about prevalence in the ego networks, weaker signals show a more complex structure of influence among topics "resident" in different layers of the ego network of words.
This paper extends our prior publication in [17], where the structural analysis was carried out. Specifically, in this paper, we also present an extensive semantic analysis of the ego network of words. This allows us to provide a much more comprehensive understanding of the model, and highlight ways to characterise specificities of individuals as they emerge from their use of words, in addition to structural invariants observed through the structural properties of the ego networks.

Related work
To the best of our knowledge, no work has been published yet on models of individual word organisation similar in spirit to ours (i.e., exploring the analogy with the social ego network model). However, some work has already been done on individual word frequency distribution by extending the notion of Zipf's law [18]. Based on Zipf's law, some have tried to find a generative model that could explain such a regularity through human cognition [19], or simply through the way the limited capacity of our memory naturally constrains our long-term use of words [20]. More generally, vocabulary size is often studied in the context of language learning for both children and adults, as well as to detect possible cognitive impairments [21]. For the semantic part, we have not identified any previous work on modelling user interests with a stratified approach, such as ours, that relies on the ego network of words. Most publications are about topic recommendations (relying upon a wide range of techniques, such as hashtag analysis [22], LDA [23] or ontology databases [24]), and about the emergence and monitoring of trending topics on Twitter [25,26].

The dataset
The analysis is built upon four datasets extracted from Twitter, using the official Search and Streaming APIs (note that, at the time of download, the number of downloadable tweets was limited to the most recent 3200 tweets per user). Each of them is based on the tweets issued by users in four distinct groups:
Random users #1 This group has been collected by sampling among the accounts that posted a tweet or a retweet in English with the hashtag #MondayMotivation at download time, on January 16th, 2020. This hashtag was chosen in order to obtain a diversified sample of users: it is broadly used and does not refer to a specific event or political issue. This group contains 5183 accounts after bot filtering.
Random users #2 This group has been collected by sampling among the accounts that posted a tweet or a retweet in English, from the United Kingdom (we set up a filter based on the language and country), at download time on February 11th, 2020. This group contains 2733 accounts after bot removal.
These four groups are chosen to cover different types of users: the first two contain accounts that use language professionally (journalists and science writers) and the other two contain regular users, who are expected to be more colloquial and less controlled in the language they use. Since the random user accounts are not handpicked as in the first two groups, we need to make sure that they represent real humans. The probability that an account is a bot is calculated with the Botometer service [27], which implements a state-of-the-art bot detection algorithm. This probability that the account is not human, called the "complete automation probability" (CAP), is based not only on linguistic features, such as grammatical tags or the number of words in a tweet, but also on language-agnostic features like the number of followers or the tweeting frequency [28]. There is no standard CAP threshold to easily separate bots from humans: it depends on the expected balance of precision and recall. That is why we discard accounts with a CAP higher than 0.5, which considerably limits the number of false negatives (undetected bots). The Botometer service achieves a performance of 0.95 AUC on standard bot detection datasets [27]. With this configuration, the algorithm detects 29% of bot accounts in the dataset of random users #1 and 23% in the dataset of random users #2.
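As a concrete illustration, the CAP-based discarding step can be sketched as follows. This is a minimal sketch: the account handles and CAP values are made-up placeholders, and in practice the scores would come from the Botometer service.

```python
# Discard accounts whose Botometer "complete automation probability"
# (CAP) exceeds 0.5; the remaining accounts are treated as human.
# The handles and CAP values below are made-up placeholders.
CAP_THRESHOLD = 0.5

def filter_bots(cap_scores, threshold=CAP_THRESHOLD):
    """Return the set of accounts considered human (CAP <= threshold)."""
    return {account for account, cap in cap_scores.items() if cap <= threshold}

cap_scores = {"alice": 0.02, "bob": 0.77, "carol": 0.31, "dave": 0.93}
humans = filter_bots(cap_scores)
bot_share = 1 - len(humans) / len(cap_scores)  # fraction discarded as likely bots
```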
In our analysis, we only consider the timelines of active Twitter accounts, i.e., users that tweet regularly. Since this preprocessing step largely follows the standard approach in the related literature [29,30], further details are left to the Supporting information. Please note that we discard retweets with no associated comments, as they do not include any text written by the target user, and tweets written in a language other than English (since most of the NLP tools needed for our analysis are optimised for the English language).
Extracting user timelines with the same observation period

As discussed above, for each user in our datasets we retrieved the most recent 3200 tweets (due to the Twitter API limitation), which constitute the observed timeline of the user. The time period covered by these tweets varies according to the frequency with which the account tweets: for very active users, the last 3200 tweets will only cover a short time span. Since random users are generally more active, their observation period is shorter, and this may create a significant sampling bias. In fact, the length of the observation period affects the measured word usage frequencies (specifically, we cannot observe frequencies lower than the inverse of the observation period). In order to guarantee a fair comparison across user categories and to be able to compare users with different tweeting activities without introducing biases, we choose to work on timelines with the same duration, by restricting them to an observation window T. To obtain timelines that have the same observation window T (in years), we delete all those with a duration shorter than T and remove tweets written more than T years ago from the remaining ones.
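The two windowing operations just described (discarding timelines shorter than T, then trimming tweets older than T) can be sketched as follows. The timestamps are invented for illustration, and the window is anchored at each user's most recent tweet as a stand-in for the download time.

```python
from datetime import datetime, timedelta

def restrict_timeline(tweets, T_years=1.0):
    """tweets: list of (timestamp, text) pairs, in any order.
    Returns the timeline restricted to the last T years,
    or None if the timeline covers less than T years."""
    if not tweets:
        return None
    tweets = sorted(tweets)                      # oldest first
    window = timedelta(days=365.25 * T_years)
    newest = tweets[-1][0]
    if newest - tweets[0][0] < window:           # too short: discard the user
        return None
    cutoff = newest - window
    return [(ts, txt) for ts, txt in tweets if ts >= cutoff]

timeline = [(datetime(2019, 1, 1), "a"),
            (datetime(2019, 8, 1), "b"),
            (datetime(2020, 2, 1), "c")]
windowed = restrict_timeline(timeline, T_years=1.0)  # keeps the last 2 tweets
```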
Increasing T reduces the number of users we can keep for our analysis (see Fig. 2): for a T larger than 2 years, that number is halved, and for a T larger than 3 years, it falls below 500 for all datasets. On the contrary, the average number of tweets per timeline increases linearly with T (Fig. 3). The choice of an observation window thus results from a trade-off between a high number of timelines per dataset and a large average number of tweets per timeline. To simplify the choice of T, we only consider round numbers of years. Table 1 shows that, beyond 3 years, the number of users falls below 100 for some datasets. On the other hand, the number of tweets for T = 1 year remains acceptable (> 500). Since we value the diversity of users (in order to limit any bias in the selection of Twitter accounts) over the number of tweets available, we choose T = 1 year for the entire paper. Results with other values of T can be found in [17]. We note that random users tweet more frequently than others. This difference tends to smooth out when the observation period is longer (Table 1). This can be explained by the fact that the timelines with the highest tweeting frequency are excluded in that case because their observation period is too short (which further supports the fact that a smaller T reduces the selection bias of users).

Structural analysis of the ego network of words
In this section, we focus on the analysis of structural properties of the ego network of words, highlighting structural invariants in language production. Note that, in the social domain, pure structural properties of ego networks were instrumental [12] in characterising many traits of social behavior (resource sharing, collaboration, diffusion of information). For this reason, we believe it is important to assess them in the language domain as well, before moving on (Section 5) to more complex and domain-specific analyses. We first describe the methodology we use for our analysis in Section 4.1, then we discuss the results in Section 4.2. For ease of reading, the notation used in this section is summarised in Table 2. The section reports only the most significant results obtained by analysing the structural properties of the ego network. Interested readers are referred to [17] for additional results.

Methods
For each user, acting as ego, we want to build their ego network of words. To this aim, we first extract individual words from the user's tweets (Section 4.1.1), then we build the actual ego network from these words (Section 4.1.2).

Word extraction
Since the analysis focuses on words and their frequency of use, we take advantage of NLP techniques for extracting them. As a first step, all the syntactic marks that are specific to communication in online social networks (mentions with @, hashtags with #, links, emojis) are discarded (see Supporting information for a summary). Once the remaining words are tokenized (i.e., identified as words), those that are used to articulate the sentence (e.g., "with", "a", "but") are dropped. In linguistics, this type of word is called a functional word, as opposed to lexical words, which have a meaning independent of the context. These two categories involve different cognitive processes (syntactic for functional words and semantic for lexical words), different parts of the brain [31], and probably different neurological organizations [32]. We are more interested in lexical words because their frequency in written production depends on the author's intentions, whereas functional word frequencies depend on language characteristics. Functional words may also depend on the style of an author (which is why they are often used in stylometry). Still, whether their usage requires a significant cognitive effort is arguable, hence in this work we opted for their removal. Moreover, lexical words represent the largest part of the vocabulary. Functional words are generally called stop-words in the NLP domain, and we simply used an existing list from the library spaCy [33] to remove them. As this work leverages word frequencies as a proxy for discovering cognitive properties, we need to group words derived from the same root (e.g. "work" and "worked") in order to calculate their number of occurrences. This operation can be achieved with two methods: stemming and lemmatization.
Stemming algorithms generally remove the last letters of a word using complex heuristics, whereas lemmatization uses a dictionary and a real morphological analysis of the word to find its normalized form. Stemming is faster, but it may cause mistakes such as overstemming and understemming. For this reason, we choose to perform lemmatization with the help of the package WordNetLemmatizer from the library NLTK [34] (which leverages the lexical database WordNet). Once we have obtained the number of occurrences for each word base, we remove all those that appear only once, to leave out the majority of misspelled words. The Supporting information contains examples of the entire preprocessing part.
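The whole extraction step can be sketched as below. To keep the sketch self-contained and runnable, a toy stop-word set and a toy lemma map stand in for spaCy's stop-word list and NLTK's WordNetLemmatizer, and the tweets are invented.

```python
import re
from collections import Counter

# Toy stand-ins: the actual pipeline uses spaCy's stop-word list and
# NLTK's WordNetLemmatizer; these tiny tables keep the sketch self-contained.
STOP_WORDS = {"a", "an", "the", "with", "but", "i", "my", "on", "to", "and"}
LEMMAS = {"worked": "work", "working": "work"}

def extract_words(tweets):
    """Return {lemma: occurrences}, with single-occurrence words removed."""
    counts = Counter()
    for tweet in tweets:
        # Strip OSN-specific marks: mentions, hashtags, and links.
        text = re.sub(r"@\w+|#\w+|https?://\S+", " ", tweet.lower())
        for token in re.findall(r"[a-z']+", text):   # crude tokenization
            if token in STOP_WORDS:                  # drop functional words
                continue
            counts[LEMMAS.get(token, token)] += 1    # lemmatize
    # Words appearing only once are dropped (mostly misspellings).
    return {w: n for w, n in counts.items() if n > 1}

tweets = ["I worked with @bob on the #NLP demo https://t.co/x",
          "Working on my demo again, work work!"]
word_counts = extract_words(tweets)   # {'work': 4, 'demo': 2}
```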
In the remainder of the paper, when we talk about the "words" of a user, we refer to the set of words left after removing functional words and after lemmatization.

Building the ego network of words
Let us focus on a user j. When studying the social cognitive constraints [29], the contact frequency between two people was taken as a proxy for their intimacy and, as a result, for the cognitive effort invested in nurturing the relationship. Similarly, the frequency f_i at which user j uses word i is considered here as a proxy of their "relationship". Frequency f_i is given by n_ij/T, where n_ij denotes the number of occurrences of word i in user j's timeline, and T denotes the observation window of j's timeline in years (T = 1 year in our case, as discussed in Section 3.1). Using this frequency definition, we now investigate whether the words of a user can be grouped into homogeneous classes and whether different users feature a similar number and sizes of classes. To this aim, for each user, we leverage a clustering algorithm to group words with a similar frequency. The selected algorithm is Mean Shift [35] because, as opposed to Jenks [36] or k-means [37], it is able to automatically detect the optimal number of clusters. In order to account for the long-tailed nature of frequencies, a standard log-transformation is applied to the frequency values prior to the Mean Shift run.
Thus, for each user, we feed the user's words to Mean Shift. The output of the clustering process is one value τ(e) for each ego network e, which describes the optimal number of classes (clusters) into which the word frequencies can be split. We rank each cluster by its position in the frequency distribution: cluster #1 is the one that contains the most frequent words, and the last cluster is the one that contains the least used words. Following the convention of the social ego network model discussed in Section 1, these clusters can be mapped into concentric layers (or circles), which provide a cumulative view of word usage. Specifically, layer L_i includes all clusters from the first to the i-th. Layers provide a convenient grouping of words used at least at a certain frequency. We refer to this layered structure as the ego network of words. Note that, since layers in ego networks are cumulative (i.e., they include all words used at least at a certain frequency), we will use the term "ring" to refer to their non-overlapping portion: for example, ring #2 contains all words that are in L_2 but not in L_1 (see Table 4 for the general formula). For the sake of example, let us focus on the second cluster identified by Mean Shift: cluster #2 corresponds to ring #2 in the ego network, and the union of ring #1 and ring #2 corresponds to the 2nd layer of the ego network. Another typical metric that is analysed in the context of social cognitive constraints is the scaling ratio ρ_i between layers i and i − 1, which, as discussed earlier, corresponds to the ratio between the sizes of consecutive layers (see Table 4 for its formula). The scaling ratio is an important measure of regularity, as it captures a relative pattern across layers, beyond the absolute values of their sizes. Taken together, the optimal number of layers τ(e), the layer sizes, and the scaling ratios characterise the structure of the ego network of words.
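Given the ranked cluster sizes produced by Mean Shift for one ego, the layers, rings, and scaling ratios follow directly from the definitions above. A minimal sketch (the cluster sizes below are hypothetical, not measured values):

```python
from itertools import accumulate

def layers_rings_ratios(cluster_sizes):
    """cluster_sizes: word counts of the clusters ranked from most to
    least frequent (cluster #1 first). Returns the cumulative layer
    sizes |L_i|, the ring sizes, and the scaling ratios
    rho_i = |L_i| / |L_{i-1}|."""
    layers = list(accumulate(cluster_sizes))   # L_i = union of clusters 1..i
    rings = list(cluster_sizes)                # ring i = L_i minus L_{i-1}
    ratios = [layers[i] / layers[i - 1] for i in range(1, len(layers))]
    return layers, rings, ratios

clusters = [10, 50, 110, 330, 900]        # hypothetical 5-cluster ego network
layers, rings, ratios = layers_rings_ratios(clusters)
# layers -> [10, 60, 170, 500, 1400]; the first ratio is high (6.0),
# while the following ones stay in the 2-3 range.
```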

Results
Here we study the ego networks of words in our four datasets, following the methodology described above.
The histograms of the obtained optimal number of layers τ are shown in Fig. 4. It is interesting to note that, despite the heterogeneity of users (in terms of tweeting frequency), the distributions are always quite narrow, with peaks appearing consistently between 5 and 7 clusters. Similarly to the social constraints case, also for language production, we observe a fairly regular and consistent structure. This is the first important result of the paper, hinting at the existence of structural invariants in cognitive processes.
We now study the size of the layers identified in Fig. 4. For the sake of statistical reliability, we only consider those users whose optimal number of layers (as identified by Mean Shift) corresponds to the most popular number of layers (red bars) in Fig. 4. This allows us to have a sufficient number of samples in each class. Fig. 5 shows the average layer sizes for every dataset. For a given number of clusters, we observe again a striking regularity across the datasets, meaning that each layer has approximately the same size regardless of the category of users. Fig. 6 shows the scaling ratio of the layers in language production. We can observe the following general behavior: the scaling ratio starts with a high value between layers #1 and #2, but always gets closer to 2-3 as we move outwards. This empirical rule holds whatever the dataset (and whatever the observation period [17]). This is another significant structural regularity, quite similar to the one found for social ego networks, and a further hint of cognitive constraints behind the way humans organise the words they use.
In order to further investigate the structure of the word clusters, we compute the linear regression coefficients between the total number of unique words used by each user (corresponding to the size of the outermost layer) and the individual layer sizes. Due to space limits, in Table 3 we only report the exact coefficients for the journalists' dataset (but analogous results are obtained for the other categories) and in Fig. 7 we plot the linear regression for all the user categories. Note that the size of the most external cluster is basically the total number of words used by an individual in the observation window. It is thus interesting to see what happens when this number increases, i.e., whether users who use more words distribute them uniformly across the clusters or not. Table 3 shows two interesting features.
First, it shows another regularity, as the size of all layers linearly increases with the most external cluster size, with the exception of the first one (Fig. 7). Moreover, it is quite interesting to observe that the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of clusters. This indicates that users with more clusters split, at a finer granularity, words used at the highest frequencies, i.e., they organise differently their innermost clusters, without modifying significantly the size of the most external ones.
As a final comment on Fig. 6, please note that the innermost layer tends to be approximately five times smaller than the next one. This suggests that this layer, containing the most used words, may be drastically different from the others (as also evident from Table 3). The characterization of this special layer will be the main focus of the next section.

Discussion
We summarise below the main results of the section.
• Individual distributions of word frequencies are divided into a consistent number of groups. Since word frequencies impact the cognitive processes underlying word learning and retrieval in the mental lexicon [38], these groups can be an indirect trace of these processes' properties. The number of groups is only marginally affected by the class (specialized or generic) to which the users belong.
• Structural invariants in terms of layer sizes and scaling ratio are observed, similarly to the well-known results from the social domain [29]. Specifically, we found that the size of the layers approximately doubles when moving from layer i to layer i + 1, with the only exception of the first layer.
• Users with more layers organise their innermost layers differently, without significantly modifying the size of the most external ones: the second-last and third-last layers consistently account for approximately 60% and 30% of the used words, irrespective of the number of clusters of the user.

Semantic analysis of the ego network of words
We have treated words as simple tokens so far. However, words have meanings and they can be linked to specific topics. In this section, we want to go beyond words and investigate which topics they refer to and how they are distributed in the different rings of the ego network. The analysis of this section revolves around the concept of semantic profile of a ring (in the ego network of words), which captures the topics associated with the words in the ring. Once semantic profiles are obtained, we are able to address the following high-level question: are all rings similar in the topics they contain, or does the ego network organize the topics in its rings in a specific way? For the convenience of the reader, we summarise in Table 4 the notation used throughout the section.

How to build semantic profiles
In this section, we describe how we carry out the semantic analysis of the ego network of words. First, in Section 5.1.1, we motivate our selection of the BERTopic framework for topic extraction. Then, in Section 5.1.2, we illustrate the steps for topic extraction. At the end of this process, each word occurrence in the ego network is associated with a specific topic. Accounting for the popularity of each topic in the rings of the ego network, in Section 5.1.3 we build the semantic profile of each ego network ring as the topic distribution of the words in that ring.

Preliminaries
To calculate a semantic profile, we choose to consider the meaning of each word in its context rather than using a semantic dictionary [39] (a dataset where each word is mapped to a semantic category), which would not be able to detect more complex topics and would miss some meanings of a polysemous word. We acknowledge that a lot of effort has been put into ontologies in order to understand more precisely the interests of users, specifically on Twitter. Ontologies map knowledge of specific domains: examples are Athena [24], a semantic web database extracted from a news portal that can be used for news recommendation purposes [40], and the BBC ontologies extracted from the BBC corpus of news, which allow politically-oriented topic mining [41]. However, even if their drawbacks (such as the rigidity of the knowledge model) can be partly fixed by coupling them with embedding-based models [42], we prefer having maximum freedom in the topic identification process: we use a transformer-based model, BERT [43], which is the current state of the art in text embedding, and then an unsupervised method to detect topics.

Extraction of the topics
In order to handle polysemous words correctly, we must consider the ring of an ego network not merely as a set of distinct words, each associated with a frequency of use, but as a set of words with a given number of occurrences (from which the frequency is derived), each occurrence belonging to a user's tweet. We aim to associate each word occurrence with a topic. We first classify (in an unsupervised way) the tweets by topic using the BERTopic framework [44]; then all the word occurrences that constitute a tweet are assigned the same topic as the tweet itself (Fig. 8).
For the current analysis, we chose to focus only on ego networks with six rings, the case covering the most users. As described in the following, the BERTopic framework sequentially applies BERT [43] for tweet embedding, UMAP [45] for dimensionality reduction, and HDBSCAN [46] for clustering the tweet embeddings in a low-dimensional subspace.

Table 4. Notation used throughout the section.
e ∈ E: ego network e, belonging to the set of all ego networks E
c ∈ C: topic c, belonging to the set of all topics C
m ∈ T: tweet m, belonging to the set of all tweets T
P_m: semantic profile of tweet m, according to HDBSCAN
P_m(c): likelihood that tweet m belongs to topic c, according to the semantic profile of the tweet
W(e, r): set of non-distinct words in ring r of ego network e
W_u(e, r): set of distinct words in ring r of ego network e
W(e, w_u): set of occurrences of the unique word w_u in the ego network e
O(e, r): number of word occurrences in ring r of ego network e
o(w_u, e): number of occurrences associated with the unique word w_u of ego network e
P^(e)_r: semantic profile of ring r in ego network e
N(e, r): number of topics discussed in ring r of ego network e
N_norm(e, r): N(e, r) normalised by the total number of word occurrences in r
H(e, r): entropy of the semantic profile P^(e)_r
S^{r_y}_{TOP(r_x, r_y)}: strength of topics that are primary for both r_x and r_y in r_y's semantic profile
S^{r_y}_{TOP(r_x), BOTTOM(r_y)}: strength of topics that are primary for r_x but not for r_y in r_y's semantic profile
σ^{r_y}_{TOP(r_x, r_y)}: strength of topics that are primary for both r_x and r_y with respect to the average strength of primary topics in r_y's semantic profile
σ^{r_y}_{TOP(r_x), BOTTOM(r_y)}: strength of topics that are primary for r_x but not for r_y with respect to the average strength of non-primary topics in r_y's semantic profile
5.1.2.1 Tweet embedding with BERT. BERT [43], which achieves state-of-the-art performance in natural language understanding, is used to assign each tweet a point in the embedding space that serves as a vector representation of its semantic meaning. BERT is a bidirectional transformer developed by Google, trained on the BookCorpus [47] and English Wikipedia. It therefore relies on all the linguistic knowledge learned from a very large corpus to perform this task. BERT yields embeddings with 768 dimensions.

Dimensionality reduction with UMAP.
In order to mitigate the curse of dimensionality (to which clustering algorithms based on k-nearest neighbours are particularly sensitive [48]), we use the UMAP dimensionality-reduction algorithm (with settings n_neighbors=15, n_components=5, metric='cosine', via the Python package umap v0.1.1) to reduce the embedding space down to five dimensions, as recommended in the BERTopic framework [44]. UMAP, like the t-SNE [49] algorithm, is able to capture latent non-linear dimensions, but in a more scalable way.

Clustering with HDBSCAN.
HDBSCAN [46] is also able to find non-linear cluster structures based on density, as well as outliers, like DBSCAN (Fig. 9). However, instead of deciding the contours of a cluster based on a fixed density threshold, HDBSCAN uses hierarchical clustering (single linkage) to find the most stable partition. Here we use HDBSCAN with the following settings: min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True, via the Python package hdbscan v0.8.26. Thanks to the BERT embedding, the clusters of tweets we obtain are semantically homogeneous and therefore represent the dominant topics of the dataset. Under these conditions, we can consider that a cluster corresponds to a topic. Table 5 shows the percentage of outliers detected by HDBSCAN, which corresponds to the percentage of tweets that cannot be associated with a specific topic. Since this percentage is quite high, even with the most conservative configurations (those with the fewest outliers), we also assess the cluster configuration (i.e., the topic assignment) induced by a soft clustering approach.
(Fig. 9 caption: in the first case, each point is classified as either belonging to a single cluster (coloured points) or as an outlier (grey point), whereas in the second case each point is assigned a likelihood of belonging to each cluster, and takes the colour of the cluster to which it most likely belongs.)
Indeed, HDBSCAN allows two types of clustering: hard clustering, which classifies each tweet in one and only one cluster (or as an outlier), and soft clustering, which measures the proximity of a tweet to several different clusters. The advantage of the latter is that this proximity can be obtained even for outliers, which allows us to integrate them into the analysis. When used for soft clustering, HDBSCAN provides, for each point (tweet) m, a probability distribution P_m such that P_m(c) is the likelihood that this point belongs to the cluster (topic) c, with Σ_{c ∈ C} P_m(c) ≤ 1 (C being the set of topics). Thus, with soft clustering, the tweet is not assigned a single topic but a probability distribution over all the topics. For clarity, in the case of hard clustering, where the tweet m is directly assigned one topic c_m, we use the same notation P_m, with P_m(c_m) equal to 1 and zero otherwise. We will use these two configurations (hard clustering and soft clustering) to build two separate semantic profiles for each ego network ring. In Supporting information we discuss in detail why hard clustering is better suited for our analysis.

Topic reduction. As can be seen in Table 5, the different datasets feature a different number of topics. In order to be able to compare the datasets, we reduce each of them to the same target number of topics (this reduced set of topics, which is different for each dataset, will be denoted C from now on). Starting from the full set of topics extracted by HDBSCAN, our goal is to merge them together until we obtain the target number of topics. To do so, the following operation is repeated: merge the smallest cluster c_1 (in the hard-clustered configuration) with the cluster c_2 to which c_1 is semantically the closest. This semantic similarity is calculated as follows: all the tweets of a cluster are grouped into a single document, then a TF-IDF vector is calculated for each of these documents. The similarity between two topics is the cosine of their TF-IDF representations. The probability of the new topic c_1 ∪ c_2 is accordingly updated, for each tweet m, as P_m(c_1 ∪ c_2) = P_m(c_1) + P_m(c_2). As the clusters are merged step by step, the average similarity between them increases, as can be seen in Figure 10. In the case of journalists and science writers, we see that going beyond 100 topics no longer allows the emergence of topics that are radically different from the others, while 100 still enables an acceptable number of topics to be isolated. Thus, in order to compare the results across the different datasets, we have chosen to limit the number of topics to 100 for each of them. For the sake of comparison, the 100 topics obtained in the hard clustering configuration are also used for topic reduction in the soft clustering case. This operation narrows the semantic fields addressed in a dataset down to one hundred topics while provoking the fewest possible changes in the topic assignment.
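The merging loop described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the authors' implementation: the function names are ours, each cluster document is the concatenation of its tweets, TF-IDF is a simple count × log(N/df) scheme, and the smallest cluster is repeatedly absorbed by its most cosine-similar neighbour.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """One TF-IDF vector per cluster document (all tweets of a topic joined)."""
    tfs = [Counter(doc.split()) for doc in docs]
    n = len(docs)
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # document frequency: count each term once per doc
    idf = {t: math.log(n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in tf.items()} for tf in tfs]

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def reduce_topics(clusters, target):
    """Repeatedly merge the smallest cluster into its semantically closest
    neighbour (cosine of TF-IDF vectors) until `target` clusters remain.
    clusters: dict topic_id -> list of tweets."""
    clusters = {k: list(v) for k, v in clusters.items()}
    while len(clusters) > target:
        ids = list(clusters)
        vecs = dict(zip(ids, tfidf_vectors([" ".join(clusters[i]) for i in ids])))
        smallest = min(ids, key=lambda i: len(clusters[i]))
        closest = max((i for i in ids if i != smallest),
                      key=lambda i: cosine(vecs[smallest], vecs[i]))
        clusters[closest].extend(clusters.pop(smallest))
    return clusters
```

The soft-clustering probability update P_m(c_1 ∪ c_2) = P_m(c_1) + P_m(c_2) is omitted here for brevity; it would simply add the two columns of each tweet's topic distribution.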

Extraction of the semantic profile
We define the semantic profile of an ego network ring as the distribution of the topics to which the word occurrences contained in the ring belong (multiple occurrences of the same word may come from different contexts and thus refer to different topics). Note that this analysis is carried out at the ring level, and not at the circle level, because circles are concentric and cumulative: the semantic profiles of circles would by construction contain overlapping topics, hence biasing the analysis (similarly to counting topics twice). After the preprocessing described in the previous section, each word occurrence is associated with one topic (or several, in the soft-clustered case), thus we can compute, for each ring of an ego network, a topic distribution based on the word occurrences it contains. Let W(e, r) be the set of word occurrences contained in ring r of the ego network e, and m(w) the tweet that the word occurrence w belongs to. The probability P^(e)_r(c) of observing topic c in ring r of ego network e is defined as follows:

P^(e)_r(c) = (1 / |W(e, r)|) Σ_{w ∈ W(e, r)} P_{m(w)}(c),

where Σ_{c ∈ C} P^(e)_r(c) = 1. P^(e)_r is therefore a probability distribution, which constitutes the semantic profile of ring r in ego network e (depicted in Fig. 11). For this reason, we will also refer to P^(e)_r(c) as the share of c in the semantic profile P^(e)_r of r. This semantic profile will be the starting point for all subsequent analyses in this section. In Supporting information, we provide four tables (one for each dataset) that detail, for every topic, the most characteristic words and the average share in the rings. Note that two different semantic profiles can be built, depending on whether topics are assigned using hard or soft clustering. In Supporting information we show that the use of soft clustering (and thus the inclusion of outliers) does not improve the reliability of the analysis: it gives too much importance to noisy data, which favours the emergence of very general "super topics" that dominate all semantic profiles. We therefore present in Section 5.3 only the results obtained with hard clustering.
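Under hard clustering, this computation reduces to averaging the (one-hot) tweet topic profiles over the word occurrences contained in a ring. A minimal sketch, with a data layout and function name of our own choosing:

```python
from collections import defaultdict

def ring_semantic_profile(occurrences, tweet_profile):
    """occurrences: list of (word, tweet_id) pairs contained in one ring.
    tweet_profile: dict tweet_id -> {topic: probability} (one-hot under hard
    clustering). Returns the ring's topic distribution, i.e., its profile."""
    shares = defaultdict(float)
    for _word, tweet_id in occurrences:
        for topic, p in tweet_profile[tweet_id].items():
            shares[topic] += p
    n = len(occurrences)  # |W(e, r)|
    return {topic: s / n for topic, s in shares.items()}
```

With hard-clustered (one-hot) tweet profiles, the resulting shares sum to one, as required for a probability distribution.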

Metrics for the analysis of semantic profiles
After following the steps described in Section 5.1, we end up with a semantic profile for each ring of an ego network. In the following we discuss (i) how to characterise individual semantic profiles (Section 5.2.1), (ii) how to compare semantic profiles (Section 5.2.2), and (iii) how to leverage semantic profiles to investigate the role of the most important topics (Section 5.2.3).

Characterization of the semantic profile
Let us consider a ring r of ego network e for which we have extracted the semantic profile as discussed above. The semantic profile tells us how many distinct topics the words in ring r touch upon. Formally, the number of topics associated with a given ring can be calculated as follows:

N(e, r) = Σ_{c ∈ C} 1[P^(e)_r(c) > 0],

where P^(e)_r(c) denotes the probability of observing topic c in the semantic profile P^(e)_r of ring r, and 1 is the indicator function. Note, though, that N(e, r) may offer only a partial perspective. In fact, rings have very different sizes (as discussed in Section 4), and it is expected to be much easier for larger rings (i.e., rings containing many words) to span a larger range of topics. For this reason, we will compare N(e, r) with its normalised version:

N_norm(e, r) = N(e, r) / |W(e, r)|,

where we weigh the number of topics "generated" by the ring by the number of word occurrences contained in the ring (denoted with |W(e, r)|).
N(e, r) and N_norm(e, r) account for the mere presence of topics, regardless of their frequency of use. To capture the latter dimension, we next measure the entropy of P^(e)_r. Recalling that P^(e)_r is in fact a probability distribution, its Shannon entropy reflects its diversity: the entropy (and diversity) is maximum if a ring contains all topics equally (i.e., with the same value of P^(e)_r(c) for every c), while the entropy is minimum if a ring contains only one topic. So, the greater the entropy, the greater the diversity. Denoting with H(e, r) the entropy of ring r in ego e, its definition is as follows:

H(e, r) = − Σ_{c ∈ C} P^(e)_r(c) log P^(e)_r(c).

For the 100 topics we consider, the minimum entropy is 0 and the maximum entropy is about 4.60 (i.e., log 100). In Section 5.3, the averages of N(e, r), N_norm(e, r), and H(e, r) across all ego networks will be presented, i.e., N̄(r) = (1/|E|) Σ_{e ∈ E} N(e, r) (and analogously for the others).
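The three per-ring metrics can be sketched in a few lines of Python. This is a toy illustration with our own function names; the entropy uses the natural logarithm, so the 100-topic maximum is log 100 ≈ 4.605, matching the value quoted above.

```python
import math

def topic_count(profile):
    """N(e, r): number of topics with a non-zero share in the ring's profile."""
    return sum(1 for p in profile.values() if p > 0)

def normalised_topic_count(profile, n_occurrences):
    """N_norm(e, r): topics per word occurrence contained in the ring."""
    return topic_count(profile) / n_occurrences

def profile_entropy(profile):
    """H(e, r): Shannon entropy (natural log) of the topic distribution."""
    return -sum(p * math.log(p) for p in profile.values() if p > 0)
```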

Comparing the semantic profiles of different rings
Once we know which topics are covered by each ring of an ego network, the first step is to find out whether the semantic profile differs from one ring to another or, instead, whether the topic distribution is homogeneous over the whole ego network. Since all semantic profiles are based on the same 100 topics, it is easy to obtain a distance measure to compare the rings with one another. Recalling that the semantic profile is a probability distribution, for this purpose we can use the Jensen-Shannon (JS) divergence [50], which allows us to calculate the proximity between the 100-topic distributions obtained previously. The corresponding JS distance is conventionally obtained as the square root of the JS divergence [51]. The JS divergence is basically a symmetric version of the well-known Kullback-Leibler (KL) divergence, which is a standard metric for capturing the distance between probability distributions. For a tagged ego e, the KL divergence D_KL between two semantic profiles P^(e)_{r_i} and P^(e)_{r_j} is defined as:

D_KL(P^(e)_{r_i} || P^(e)_{r_j}) = Σ_{c ∈ C} P^(e)_{r_i}(c) log( P^(e)_{r_i}(c) / P^(e)_{r_j}(c) ).

From D_KL, the JS divergence can be obtained as:

JS(P^(e)_{r_i}, P^(e)_{r_j}) = ½ D_KL(P^(e)_{r_i} || M) + ½ D_KL(P^(e)_{r_j} || M), with M = ½ (P^(e)_{r_i} + P^(e)_{r_j}),

and the JS distance δ_JS(P^(e)_{r_i}, P^(e)_{r_j}) is its square root. Once we have obtained δ_JS(P^(e)_{r_i}, P^(e)_{r_j}), we compute its average across all ego networks in the standard way, i.e., δ̄_JS(r_i, r_j) = (1/|E|) Σ_{e ∈ E} δ_JS(P^(e)_{r_i}, P^(e)_{r_j}).

The role of the most important topics
Once the semantic profiles have been characterised and compared, we can check whether some topics are more important than others and, if this is the case, whether they play a special role in the ego network's rings. We consider whether topics can be divided into two classes, i.e., "important" and "not important" topics for each ring. To do so, we cluster the topics according to their presence in the specific ring under study, i.e., according to the values of P^(e)_r(c), where c ∈ C. To this aim, we use the Jenks algorithm [52], which finds natural breaks in the frequency distribution (similarly to k-means, we have to specify k, the number of groups we want to obtain). We rely on the Silhouette score [53] to validate the clustering results. Since we just want to find one natural break that separates important topics from the others, we set k = 2.
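A minimal sketch of the JS distance between two ring profiles defined over the same 100 topics (the helper names are ours; the 0 · log 0 terms are treated as 0):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence over a shared topic set (0 log 0 := 0)."""
    return sum(pc * math.log(pc / q[c]) for c, pc in p.items() if pc > 0)

def js_distance(p, q):
    """Jensen-Shannon distance between two semantic profiles that are
    defined over the same set of topics."""
    m = {c: 0.5 * (p[c] + q[c]) for c in p}  # mixture distribution M
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```

The distance is zero for identical profiles and symmetric in its arguments, which is what makes the pairwise ring-comparison matrices of Section 5.3 mirror-symmetric around the diagonal.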
Topics are thus split into two groups, one with high shares and one with low shares in the ring's semantic profile. The former is the set of important (or primary) topics of ring r in ego e, referred to as U^r_e; the latter contains the non-primary topics. A first metric of interest is the coverage K^{r_y}_{TOP(r_x)}, i.e., the total share, in the semantic profile of ring r_y, of the topics that are primary in ring r_x:

K^{r_y}_{TOP(r_x)} = Σ_{c ∈ U^{r_x}_e} P^(e)_{r_y}(c).

We then measure the strength, in the semantic profile of r_y, of the topics that are primary in both r_x and r_y, defined as their average share:

S^{r_y}_{TOP(r_x, r_y)} = (1 / |U^{r_x}_e ∩ U^{r_y}_e|) Σ_{c ∈ U^{r_x}_e ∩ U^{r_y}_e} P^(e)_{r_y}(c).

Analogously, we can study the opposite effect, i.e., the strength of topics that are important in r_x but not in r_y in the semantic profile of r_y. In this case, the formula is the following:

S^{r_y}_{TOP(r_x), BOTTOM(r_y)} = (1 / |U^{r_x}_e \ U^{r_y}_e|) Σ_{c ∈ U^{r_x}_e \ U^{r_y}_e} P^(e)_{r_y}(c).

All the above metrics capture the pulling power of ring r_x on ring r_y. Another interesting perspective is whether topics that are primary elsewhere tend to be more or less dominant than the average topic in U^{r_y}_e. To this aim, we leverage the following:

σ^{r_y}_{TOP(r_x, r_y)} = S^{r_y}_{TOP(r_x, r_y)} − (1 / |U^{r_y}_e|) Σ_{c ∈ U^{r_y}_e} P^(e)_{r_y}(c),

where we basically compute the difference between the strength of topics that are primary in both r_x and r_y and the average strength of all primary topics in r_y. The complementary perspective is whether topics that are primary elsewhere tend to be more or less dominant than the average non-primary topic in r_y. To this aim, we leverage the following:

σ^{r_y}_{TOP(r_x), BOTTOM(r_y)} = S^{r_y}_{TOP(r_x), BOTTOM(r_y)} − (1 / |C \ U^{r_y}_e|) Σ_{c ∈ C \ U^{r_y}_e} P^(e)_{r_y}(c),

which follows the same line of reasoning as σ^{r_y}_{TOP(r_x, r_y)}.
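These per-ring quantities can be sketched as follows. Two caveats: we substitute an exhaustive one-dimensional two-class break (equivalent in spirit to Jenks with k = 2) for the Jenks implementation cited in the text, and we take "strength" to be the mean share of the selected topics, which is our assumption; all names are ours.

```python
def split_primary(profile):
    """Two-class split of topic shares: a stand-in for Jenks with k = 2,
    choosing the break that minimises within-group squared deviations.
    Returns the set of primary topics (the high-share group)."""
    ranked = sorted(profile.items(), key=lambda kv: kv[1], reverse=True)
    values = [v for _, v in ranked]

    def ssd(vals):  # sum of squared deviations from the group mean
        if not vals:
            return 0.0
        mean = sum(vals) / len(vals)
        return sum((v - mean) ** 2 for v in vals)

    best = min(range(1, len(values)),
               key=lambda i: ssd(values[:i]) + ssd(values[i:]))
    return {c for c, _ in ranked[:best]}

def coverage(primary_x, profile_y):
    """K: total share, in ring y's profile, of topics primary in ring x."""
    return sum(p for c, p in profile_y.items() if c in primary_x)

def strength_top(primary_x, primary_y, profile_y):
    """S: mean share, in ring y's profile, of topics primary in both rings."""
    both = primary_x & primary_y
    return sum(profile_y[c] for c in both) / len(both) if both else 0.0
```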

Results
In this section, we study the semantic profiles in the ego networks of the Twitter users in our four datasets (Section 3).

Ring #1 is special in the ego networks of words
We start our analysis by studying how topics are associated with the different rings. For each ego network e, we compute the number of topics per ring (N(e, r), and its normalised version N_norm(e, r)) and the entropy H(e, r). These metrics are then averaged across all egos, as described in Section 5.2, and 95% confidence intervals are shown. In Fig. 12 (a), we can observe that the number of topics grows towards the external rings (from about 11 in ring #1 to over 16 in ring #6). However, not all rings contain the same number of word occurrences (Fig. 12 (b)): as seen previously in Section 5.1.2, each word occurrence contributes equally and independently to the calculation of the topic distribution. Therefore, a ring containing more word occurrences is more likely to contain more distinct topics. When we normalise by word occurrences (N_norm(r)), the maximum of the normalised topic count (Fig. 12 (c)) is observed in the first ring. Thus, ring #1 stands out as the ring that generates proportionally more topics than the other rings. In order to validate this hypothesis, we need to rule out that this result is a mere side effect of the structure of the ego networks, rather than a tell-tale sign of how humans pick the words in their innermost ring. In other words, we want to test whether keeping the ego network structure unchanged but swapping the words across the rings would still yield the same result regarding ring #1. To this aim, we designed a null model where the ego network structure remains the same but the words are shuffled (more details in the grey box below). In Fig. 12 (d), we show N_norm(r) for the null model of ego networks. Since the maximum of N_norm(r) is obtained at a different ring than in the real case, we can deduce that ring #1 is special not as a side effect of the ego network structure but due to the nature of the words it contains.
To further confirm this finding, note also that the number of topics per word occurrence is significantly lower for innermost rings in the null model with respect to the outermost rings whereas the opposite is true for real ego networks. This is a second element that hints at the peculiar role of innermost rings in real-life ego networks of words.

Building a null model of an ego network.
In order to show that the result is not determined solely by the structure of the ego network (independently of the word organisation inside it), we build "null", artificial ego networks based on the existing ones. Let o(w_u, e) be the number of occurrences of the word w_u in ego e, such that the number of word occurrences in a ring r of a given ego e is defined as:

O(e, r) = Σ_{w_u ∈ W_u(e, r)} o(w_u, e).    (14)

The shuffling process can be seen as a succession of random swaps of words in the ego network. Let us consider a word w_x with X occurrences in ring r_x, and another word w_y with Y occurrences in ring r_y. During the shuffling process, assume the two words are swapped. In the new ego network, the number of occurrences of w_x is forcibly set to the original number of occurrences of w_y, and vice versa:

o(w_x, e) ← Y,   o(w_y, e) ← X.

That way, Eq (14) is preserved. Words are shuffled along with the topic distribution P^(e)_{w_u} they have in the original dataset. The topic distribution associated with a unique word w_u is calculated from its occurrences w ∈ W(e, w_u): each occurrence w is associated with a topic c_w ∈ C such that P_{m(w)}(c_w) = 1. Hence,

P^(e)_{w_u}(c) = (1 / o(w_u, e)) Σ_{w ∈ W(e, w_u)} 1[c_w = c].

The new topic distribution of a given ring r is then the weighted average of the topic distributions P^(e)_{w_u} of the unique words w_u ∈ W_u(e, r) that compose that ring after shuffling.
The full process is summarized with a toy example in Fig. 13.
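The shuffling step can be sketched as follows: ring sizes (in occurrences) stay fixed, while word identities, each carrying its own topic distribution (kept elsewhere in the data), are randomly reassigned across the occurrence slots. This is a toy illustration with a data layout of our own choosing, not the authors' implementation.

```python
import random

def shuffle_ego_network(rings, seed=0):
    """rings: dict ring_id -> list of (word, n_occurrences) slots.
    Words are randomly reassigned across slots; every slot keeps its
    occurrence count, so O(e, r) is preserved for each ring, as required
    by the null model (Eq 14)."""
    rng = random.Random(seed)
    words = [w for slots in rings.values() for w, _ in slots]
    rng.shuffle(words)
    it = iter(words)
    return {r: [(next(it), n) for _w, n in slots]
            for r, slots in rings.items()}
```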
To extend our study beyond the mere number of topics per ring, we now investigate the diversity in the way topics are distributed, leveraging the entropy of the semantic profiles defined in Section 5.2.1. This is a way of measuring the semantic diversity of the words that compose a ring (as a metric like the average pairwise semantic distance would be), but based on the semantic profile that we have previously calculated. Fig. 14 (left) shows different levels of entropy depending on the ring: H(r) grows towards the outer rings and is significantly lower in the innermost ring (for all datasets). This means that the outermost rings are, on average, semantically richer than the innermost ones. Then, we compare these results with those obtained from the null model (Fig. 14, right), to find out whether the differences in entropy are related to the intrinsic structure of the ego network. We find that the entropy of the null model is the same as that of the original model for all rings except ring #1, where the null model entropy is lower. This means that, even if words are organised in the ego network such that the diversity of topics grows toward the outermost rings, the diversity in ring #1 is higher than what we would expect if words were randomly assigned to rings, which is consistent with the previous findings of this section.
We now carry out a pairwise comparison of the semantic profiles of rings, using the JS distance described in Section 5.2.2, which we plot in Fig. 15. As one can expect, the diagonal is filled with zeros, since the distance is calculated between two identical semantic profiles, and the upper triangle mirrors the lower triangle, since the distance is symmetric. All datasets exhibit the same features:
• The first row and column always contain the highest values. This means that ring #1 (i.e. the innermost ring) is always the most distant from the other rings. In other words, ring #1 is the most characteristic ring.
• The lower values are always the distance between ring #5 and #6. Thus, the pairs of most similar rings are always among the outermost ones.
• For one row or column, the lowest value is always neighbouring the diagonal: given one ring x, the least distant ring is always the previous ring x − 1 or the following one x + 1. This means that two rings close to each other are more likely to be similar.
The first observation is very important because it shows that the topic distribution associated with the most used words of a Twitter user (those in the innermost ring) is different from that associated with the least used words. This makes ring #1 unique in two ways: it generates proportionally more topics than the other rings (Fig. 12 (c)), and its topic distribution is the furthest away from those of the other rings (Fig. 15). This hints at a significantly higher "semantic generative role" of inner rings as opposed to outer ones: each word occurring in an inner ring is able to "generate" more topics on which the user engages. And these topics, on which the user focuses most (inner rings feature a higher frequency of use of words), generate a distribution that is quite distinct from the one of the outermost rings, on which the user engages far less. Take home message for Section 5.3.1: Ring #1 is special in the ego network of words: it generates proportionally more topics than the other rings, its topic diversity is proportionally higher than expected, and its semantic profile is the most different with respect to the other rings. This suggests that ring #1 may be the semantic fingerprint of the ego network of words.

The role of primary topics from ring #1
In the previous section, we discovered that ring #1 is special. It therefore makes sense to investigate which topics are most important in this ring and whether they tend to be equally important in the other rings. This will also allow the reader to familiarise themselves with the methodology, before we generalise the analysis to the other rings in Section 5.3.3. We measure the overall importance of r_1's primary topics in another ring r_y by computing K^{r_y}_{TOP(r_1)} (see Section 5.2.3), varying r_y from the innermost to the outermost layer. Fig. 16 shows the coverage of r_1's primary topics in the other rings, across all the ego networks. K^{r_y}_{TOP(r_1)} corresponds to the blue bars in the figure and accounts for approximately 50% of each ring and of the whole ego network (last bar). This small set of topics (5-6, on average), which fills almost the entire innermost ring, thus plays a big role in the entire ego network as well. To verify whether the reverse statement is true (i.e., whether topics that are important in the whole ego network are also important in ring #1), we build a new set of topics U_e grouping the most important topics in the whole ego network and calculate K^{r_y}_{TOP(e)}. Fig. 17 highlights the coverage of those topics across the rings. Although, in general, all primary topics at the level of the ego network are well represented in all rings, we observe a slight predominance in ring #1, as the innermost ring contains the biggest share of the most important topics of the ego network. This means that topics that are important to the ego network are over-represented in the innermost ring, i.e., an important topic discussed by a Twitter user is very likely to belong to U^{r_1}_e. Take home message for Section 5.3.2: The results in Figs. 16 and 17 show a tight relationship between the primary topics of ring #1 and those of the whole ego network. This observation is all the more interesting as ring #1 is semantically the most different from all the others (Section 5.3.1), confirming the special role of this ring in the ego network of words.

Pulling power of primary topics
Let us now focus on the primary topics of a generic ring r_x (i.e., those in U^{r_x}_e). They can also appear in another ring r_y, where they may or may not belong to U^{r_y}_e. In the first case, the topics are primary in both rings; in the latter, they are primary only in r_x. We now tackle the following question: which is the ring whose primary topics are most dominant among the primary topics of another ring? This involves measuring the strength, in the semantic profile of r_y, of the topics that are important for both r_y and r_x. Using the notation of Section 5.2.3, this is equivalent to studying S^{r_y}_{TOP(r_x, r_y)} for all possible pairs r_x, r_y. We show S^{r_y}_{TOP(r_x, r_y)} on the left side of Table 6. The diagonal is left blank for the sake of clarity (we are interested in the results when r_x ≠ r_y). For a given r_y, the largest value is written in bold. We can clearly observe that the primary topics that are also primary in r_1 almost always have the largest share in the semantic profiles of the rings. Beyond the fact that the set of important topics in ring #1 is also important in the other rings (Section 5.3.2), the table shows that these topics are on average the most likely to be important in all the other rings. Now we tackle the complementary question: what is the pulling power of the primary topics of a ring on the non-primary topics of another ring? We measure this via S^{r_y}_{TOP(r_x), BOTTOM(r_y)}, which is shown in the right part of Table 6.
From the left side of Table 6, we know which is the ring whose primary topics have the highest pulling power on the primary topics of the others. But do they have a higher-than-average strength with respect to the primary topics of the ring as a whole (i.e., regardless of whether they are primary in other rings or not)? To investigate this, we show σ^{r_y}_{TOP(r_x, r_y)} in Table 7. In the table, all the numbers are positive. This means that, on average, among the most important topics of a ring r_y, if a topic also belongs to the important topics of another ring r_x, its strength is likely to be higher than the average strength of the important topics of r_y. A t-test has been performed to assess whether these differences are statistically significant: in all cases, we obtained p-value < .001. On the right side of the table we show σ^{r_y}_{TOP(r_x), BOTTOM(r_y)}, which captures whether topics that are primary elsewhere but not in r_y tend to have a higher share among the least important topics of r_y. In this case, too, the numbers are positive. This means that, on average, among the least important topics of a given ring r_y, a topic is more likely to have a higher strength if it belongs to the important topics of another ring r_x. Again, the p-values are smaller than .001, confirming that these results are not due to statistical fluctuations.

Table 6. Pulling power of primary topics. On the left, S^{r_y}_{TOP(r_x, r_y)} for all r_x, r_y pairs in our datasets. On the right, S^{r_y}_{TOP(r_x), BOTTOM(r_y)}. In bold, the highest value per column, corresponding to the r_x for which the pulling power on r_y is highest.

Take home message for Section 5.3.3: Studying the role of primary topics, we have learned the following.
• Primary topics from ring #1 tend to dominate among the primary topics of other rings. This shows the pulling power of the innermost ring, confirming its special role in the ego network. Vice versa, primary topics from ring #1 do not seem to dominate among non-primary topics of other rings.
• The topics that are primary in some rings tend to be stronger than average among the primary and non-primary topics in the semantic profile of another ring. This effect is especially acute when considering primary topics from ring #1 with respect to generic primary topics in other rings.

Discussion
The study of the semantic profiles of the rings of the ego network confirms the relevance of the ego network of words model. This model allowed us to isolate the specific features of the topics associated with the words in the innermost ring. Indeed, the semantic profile of ring #1 is not only the most unique (the most semantically distant from the others), but it is also characterised by a larger-than-expected entropy and number of generated topics, when compared with a null model. The most important topics of ring #1 are not merely a set of topics that are also important in the other rings: for every ring, an important topic is more likely to be predominant if it is also important in the innermost ring. Hence, despite the small number of unique words and word occurrences it contains, the innermost ring strongly "predicts" the most important topics of the entire ego network. In light of these results, we can conclude that the semantic profile of the innermost ring r_1 is also the semantic fingerprint of the whole ego network of words.

Table 7. Pulling power of primary topics that are also primary elsewhere vs the "average" primary / non-primary topic. On the left, σ^{r_y}_{TOP(r_x, r_y)} for all r_x, r_y pairs in our datasets. On the right, σ^{r_y}_{TOP(r_x), BOTTOM(r_y)}. The highest value per column is in bold.
As has been done with social ego networks (using structural properties to study information diffusion [16], or to perform link prediction [54]), we can use the structural and semantic invariants of the ego network of words to investigate some classical data science problems, with a focus on natural language processing. The semantic fingerprint could be used to identify specific Twitter users, or groups of users, with a non-trivial interest distribution for certain topics (e.g. a mix of important topics in the innermost rings and marginal topics in the outermost rings). It could also be used for link prediction, under the assumption that users with the same topics of interest in the innermost ego network circles are more likely to follow one another (the principle of homophily), or for word recommendation in a typing assistance tool. Since we identified some semantic invariants (e.g. the role of important topics in ring #1), we could leverage these properties to identify outliers deviating from the standard and detect non-human behaviours. Finally, in the context of topic mining, we could exploit the fact that ring #1 contains the important topics of the entire ego network to save time by considering only the words in this innermost ring.

Conclusion
Inspired by previous work modelling the cognitive constraints that regulate personal social relations, in this paper we investigate, through a data-driven approach, whether a regular structure can also be found in the way people use words, as a symptom of cognitive constraints in their mental processes. Based on a corpus of tweets written by both regular and professional users, we have shown that, similarly to the social case, a concentric layered structure (which we name "ego network of words") captures very well how an individual organises their cognitive effort in language production, and reveals some structural invariants in the way people organise their own vocabulary. Among these invariants, we can list (i) the number of layers (between 5 and 7), (ii) their regular growth from the centre of the word ego network outward (the innermost layer is five times smaller than the following one, while all the other layers approximately double in size moving outward), and (iii) the size of the external layers (which is quite stable, with the two penultimate layers accounting for approximately 30% and 60% of the words in the model, respectively, regardless of the total number of layers).
Then, going beyond words as units of language, we performed a semantic analysis of the ego network of words. Each ring of each ego network is described by a semantic profile that captures the topics associated with the words in the ring. We have found that ring #1 plays a special role in the model: it is semantically the most dissimilar out of the six, and also the one that generates, proportionally, the largest number of topics. We also showed that the topics that are important in the innermost ring are also predominant in each of the other rings, as well as in the entire ego network. In this respect, ring #1 can be seen as the semantic fingerprint of the ego network of words. Finally, we found that the topics that are primary in some rings tend to be stronger than average among the primary and non-primary topics in the semantic profiles of the other rings. This shows that, while layer #1 provides a particularly strong signal about prevalence in the ego network, weaker signals reveal a more complex structure of influence among topics "resident" in different layers of the ego network of words.

S1 Supporting information

S1.1 Data preprocessing: filtering out inactive Twitter users

In order to be relevant to our work, a Twitter account must be active, which we define as an account that has not been abandoned by its user and that tweets regularly. An account is considered abandoned, and is discarded, if the time since its last tweet is significantly bigger (we set this threshold at 6 months, as previously done in [30]) than the largest period of inactivity observed for the account. We also consider tweeting regularity, measured by counting the number of months in which the user has been inactive. The account is tagged as sporadic, and discarded, if these months represent more than 50% of the observation period (defined as the time between the user's first tweet in our dataset and the download time).
We also discard accounts whose entire timeline is covered by the 3200 tweets we are able to download, because their Twitter behaviour might not have stabilised yet (it is known that tweeting activity needs a few months after an account is created to stabilise).
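The two filtering rules can be sketched as follows. This is one possible reading of the "significantly bigger" criterion (gap since the last tweet exceeding the largest historical gap by more than 6 months); the 6-month and 50% thresholds are the ones stated above, while function names and the example account are ours:

```python
from datetime import datetime, timedelta

SIX_MONTHS = timedelta(days=182)  # the 6-month threshold from the text

def is_abandoned(tweet_times, download_time):
    """Abandoned: the silence since the last tweet exceeds the account's
    largest historical inactivity period by more than 6 months."""
    times = sorted(tweet_times)
    gaps = [b - a for a, b in zip(times, times[1:])]
    largest_gap = max(gaps, default=timedelta(0))
    return download_time - times[-1] > largest_gap + SIX_MONTHS

def is_sporadic(inactive_months, observation_months):
    """Sporadic: inactive months exceed 50% of the observation period
    (first tweet in the dataset up to the download time)."""
    return inactive_months > 0.5 * observation_months

# Hypothetical account: monthly tweets through June 2020, downloaded
# in March 2021 (about 9 months of silence).
history = [datetime(2020, m, 1) for m in range(1, 7)]
print(is_abandoned(history, datetime(2021, 3, 1)))  # → True
print(is_sporadic(7, 12))  # → True
```

An account failing either test is removed from the dataset before the structural and semantic analyses.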

S1.2 Ruling out soft clustering for the creation of semantic profiles
As discussed in the body of the paper, the hard clustering approach to topic extraction leaves many words unassigned (Table 5). We have thus also tested soft clustering, whereby each word occurrence is assigned a probability distribution over the 100 topics. In Fig S1 we plot the fraction of the semantic profile covered by the top-x topics in the ring (where the top-x topics are computed based on the semantic profile P_r^(e)). Unlike hard clustering, soft clustering gives non-zero values to the least important topics of the ring. While soft clustering allows us to include all tweets in our analysis, it has a very negative side effect: as we show in the rest of this section, very generic topics become prevalent and mask the more characteristic topics that hard clustering reveals, particularly in the innermost rings. Note that this side effect makes all rings look alike in terms of the number of active topics, as shown by the fact that all distribution curves overlap in the right-hand side plots of Fig S1. To better investigate this aspect, we extract the important topics as described in Section 5.2.3. With two classes (important vs non-important), we obtain an average silhouette score of 0.9, confirming a good cluster configuration. We show these results for the Journalists dataset, but similar conclusions can be drawn for the others. In Fig S2, we compare the level of importance of the 5 most dominant topics in the dataset (those that are important in the largest number of rings, regardless of ego and ring rank) under soft and hard clustering. The figure shows that soft clustering allows some topics to dominate the whole Journalists dataset: with soft clustering, topics 93, 51, 55, 95 and 72 are important for all six rings (the ego line is filled with coloured squares) in more than 50% of the ego networks. This, instead, is not the case when using hard clustering.
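The top-x coverage plotted in Fig S1 can be sketched as follows. The profile vectors below are hypothetical stand-ins for the real semantic profiles P_r^(e) produced by the topic assignments:

```python
def topx_coverage(profile, x):
    """Share of a ring's total topic mass held by its x strongest topics,
    given the ring's semantic profile as a vector of topic weights."""
    total = sum(profile)
    if total == 0:
        return 0.0
    return sum(sorted(profile, reverse=True)[:x]) / total

# Hypothetical 100-topic profiles: hard clustering leaves most topics at
# exactly zero, soft clustering spreads non-zero mass over many topics.
hard = [0.6, 0.3, 0.1] + [0.0] * 97
soft = [0.05] * 20 + [0.0] * 80
print(topx_coverage(hard, 3))  # → 1.0
print(topx_coverage(soft, 3))  # only a small share covered by the top-3
```

With hard clustering a handful of topics can cover the whole profile, whereas with soft clustering the same top-x accounts for a much smaller share, which is why the curves of all rings tend to overlap.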
The dominating topics under soft clustering turn out to be very generic. This is confirmed by looking at the most characteristic words of these topics in Table S3. For example, topics 93 and 51, which were already among the most frequent in the hard clustering case, are omnipresent in the soft clustering case, together with topic 95, which is also generic but does not appear with hard clustering. We can therefore conclude that the price of fully including tweets in our topic analysis through soft clustering is an increased noise level for all ego networks, in the form of a set of very generic topics that blur the real semantic characteristics of the rings. This is why we set aside the soft clustering results and kept only the semantic distributions resulting from the hard clustering of HDBSCAN. Note that, in light of these results, the fact that we use only a small subset of the available tweets does not affect the relevance of our analysis: what we exclude are the tweets related to "noise" topics, in the sense that they are not able to strongly characterise the Twitter behaviour of users, and we focus only on tweets that strongly belong to topics, i.e., on the semantically characteristic part of users' Twitter activity.

Fig S2. For each topic, a grid is drawn in which a coloured square means that the corresponding topic belongs to the most important topics of ring X of ego network Y. These topics are important for all six rings (the line is fully coloured) for respectively 49%, 28%, 19%, 21% and 9% of all the ego networks of the dataset in the hard clustering configuration (left), and for 100%, 75%, 75%, 74% and 68% in the soft clustering configuration.

Table S1. Hashtags, links, emojis in the datasets. In the process of word extraction, each tweet is decomposed into tokens, which are usually separated by spaces.
These tokens generally correspond to words, but they can also be links, emojis and other markers specific to online language, such as hashtags. The table reports the percentage of hashtags, links and emojis, i.e., the tokens filtered out from the datasets.

S1.3 Additional tables

Original tweet content | List of words after pre-processing
The @Patriots say they don't spy anymore. The @Eagles weren't taking any chances. They ran a "fake" practice before the #SuperBowl | spy, anymore, chance, run, fake, practice
#Paris attacks come 2 days before world leaders will meet in #Turkey for the G20. |