Measure of Node Similarity in Multilayer Networks

The weight of links in a network is often related to the similarity of the nodes. Here, we introduce a simple tunable measure for analyzing the similarity of nodes across different link weights. In particular, we use the measure to analyze homophily in a group of 659 freshman students at a large university. Our analysis is based on data obtained using smartphones equipped with custom data collection software, complemented by questionnaire-based data. The network of social contacts is represented as a weighted multilayer network constructed from different channels of telecommunication as well as data on face-to-face contacts. We find that even strongly connected individuals are not more similar with respect to basic personality traits than randomly chosen pairs of individuals. In contrast, several socio-demographic variables show a significant degree of similarity. We further observe that similarity might be present in one layer of the multilayer network and simultaneously be absent in the other layers. For a variable such as gender, our measure reveals a transition from similarity between nodes connected by links of relatively low weight to dissimilarity between nodes connected by the strongest links. We finally analyze the overlap between layers in the network for different levels of acquaintanceship.


Introduction
Are two connected individuals more similar than a pair of strangers? Over the last decades, advances in data collection methods have provided new opportunities for research on human behavior [1], including the topic of homophily, i.e., whether a pair of connected individuals tends to be more similar than pairs of randomly selected individuals. For instance, it is now possible to observe social interaction across multiple channels, e.g., by combining data describing face-to-face contacts with data from online social organizations or smartphone data [2][3][4][5]. Multiple networks formed from the simultaneous interaction in different channels are often called multiplex or multilayer networks [6]. Homophily has been observed with regard to many different variables. Examples span socio-demographic variables (e.g., age, gender, ethnicity), variables describing behavioral patterns (e.g., drinking behavior, smoking behavior, physical activity), variables representing attitudes, beliefs, or opinions (e.g., about politics and sport), and personality traits such as extraversion [7][8][9][10][11][12]. It remains an open question, though, whether homophily becomes more pronounced between more strongly connected individuals. Here, we introduce an extended similarity measure with a tunable parameter, which allows us to check for homophily across links with a broad spectrum of weights. Based on the measure, we find a moderate degree of homophily with respect to behavioral patterns but no significant homophily with regard to the basic personality traits conscientiousness, agreeableness, and neuroticism.
Most commonly, homophily is investigated via likeability ratings about strangers, via a comparison of personality reports from a dyad, triplet etc. of acquaintances, or via network analyses. Recent studies based on personality reports by well-acquainted persons did find overlap between acquaintances concerning the levels of some of the basic personality traits [13][14][15]. Network studies focusing on observable variables such as gender or cigarette use have suggested that similarity in this regard is important for friendly acquaintanceship [16]. Overall, research so far suggests similarity between pairs of friends or acquaintances, but the detailed conclusions concerning homophily tend to differ depending on the methodology. In addition, the similarity of nodes, as we shall see below, is strongly related to the strength of the link connecting them.
For an accurate understanding of homophily, a long-term and detailed monitoring of social networks is needed for several reasons. In order to reveal a complete picture of homophily, it is essential to gain insights into the similarity at all levels, e.g., from best friends, acquaintances, to people in the network one hardly likes or spends time with. These distinctions are possible in weighted network analyses.
Here, we investigate the similarity of connected individuals in a multilayer social network, with connections based on phone calls, text messages, and physical proximity (Bluetooth). We estimate the similarity between connected persons within a specified network with regard to socio-demographic variables (sex, age, body mass index), behavioral patterns (physical activity, alcohol drinking, and smoking behavior), attitudes concerning politics and religion, and, ultimately, basic personality traits in terms of the Big Five, i.e., conscientiousness (e.g., being organized, precise, thorough), agreeableness (e.g., being kind, sympathetic, warm), neuroticism (e.g., being anxious, moody, touchy), openness to experience (e.g., being creative, philosophical, unconventional), and extraversion (e.g., being active, sociable, talkative). We focus on the Big Five as personality traits since they reflect an 'integrative descriptive taxonomy for personality research' [17].

Results
This work rests on a unique dataset. We have mapped out the social network between 659 freshman students starting in the year 2013 at the Technical University of Denmark and running over 24 months [5]. Using state-of-the-art smartphones equipped with custom data collection software, we have collected the communication patterns within this densely connected population across a number of channels [18]. Specifically, we measure telecommunication networks (phone calls, text messages), online social networks (Facebook connections and interactions), and networks based on physical proximity. The physical proximity networks are measured via the Bluetooth signal strength, and can be used as a proxy for face-to-face meetings [19]. As a complement to the network data, we also collect information on geo-spatial mobility using GPS, as well as a number of more technical probes.
In addition to the automated data collection, we have also acquired extensive questionnaire-based data on participants' personality and behavior, comprising the following questionnaires: Big Five Inventory [17], Rosenberg Self Esteem Scale [20], Narcissistic Admiration and Rivalry Questionnaire [21], Satisfaction With Life Scale [22], Rotter's Locus of Control Scale [23], UCLA Loneliness scale [24], Self-efficacy [25], Perceived Stress Scale [26], Major Depression Inventory [27], The Copenhagen Social Relation Questionnaire [28], and Positive and Negative Affect Schedule [29], as well as several general health-, attitude-, and behavior-related questions.
Here, we consider three different types of social interaction networks based on calls, text messages, and physical proximity, respectively. We introduce a tunable link weight based on the strength of the interactions. To explain our definition of a link weight, let us start by considering the call network. The weight of a directed link from person i to person j is given by
$$w_{ij} = \frac{n_{ij}^{\alpha}}{\sum_{k} n_{ik}^{\alpha}},$$
where $n_{ij}$ represents the total number of accepted calls from person i to person j, and the sum runs over all persons k called by person i. Links therefore take a value in the interval $w_{ij} \in [0, 1]$, and the sum of weights of outgoing links from any person equals unity. The power α is used to test if homophily is more pronounced between individuals who interact more frequently than between individuals who do not interact that often. The case α = 0 corresponds to a network where all outgoing links of a person have equal weight. For intermediate α values, we predominantly test for similarity on the strongest links and, ultimately, for large values, e.g., for α ≈ 2, we only consider the strongest outgoing link for each individual. The network of text messages (SMS network) is constructed in a similar fashion, but with $n_{ij}$ representing the number of text messages sent from person i to person j. From the data on physical proximity, we can determine the time a pair of individuals has spent together. We say that a person i has spent an amount of time Δt together with person j if two consecutive Bluetooth scans are separated by a time Δt and, in addition, both scans estimate person j to be within approximately three meters distance. The link weight between i and j is
$$w_{ij} = \frac{T_{ij}^{\alpha}}{\sum_{k} T_{ik}^{\alpha}},$$
where $T_{ij}$ is the total time that j has been within the three meter limit of i. In general, the proximity data contain information about a large number of more or less random encounters during lectures and classes. To prevent these encounters from dominating our data, we use proximity data sampled only on weekends or from 6pm to 12am on weekdays. 
We place no such restrictions on the call and SMS data. Finally, we construct a symmetric weight from the two directed weights by taking the average weight of the two directed links. From these three types of interaction, we construct the corresponding networks, see Fig 1. Here the size and color of the nodes are determined by the sum of link weights connecting to the node, while the width of a link is given by the square root of the link weight. The visual representation reveals that the networks tend to be dominated by a relatively small set of links with strong weights. In order to analyze homophily, we construct vectors $(x_i, x_j, w_{ij})$ for each link in the network, where $x_i$ represents a variable (e.g., a personality trait) associated with person i. The degree of homophily is estimated by a generalization of the intraclass correlation coefficient (ICC). The ICC quantifies the similarity of the variables $x_i$ and $x_j$ for the connected persons i and j in the network. Similarly to the Pearson correlation coefficient, the ICC is a measure of the tendency for $x_i$ and $x_j$ to assume similar values relative to their average value. Normally, the ICC is computed under the assumption that persons are either connected or not. Here we modify the ICC by including the weight of interactions $w_{ij}$ between persons. The weighted ICC, here denoted by r, is then computed for a network, $(x_i, x_j, w_{ij})$, from the expressions
$$r = \frac{t}{s}, \qquad \bar{x} = \frac{1}{2W} \sum_{(i,j)} w_{ij} (x_i + x_j), \qquad W = \sum_{(i,j)} w_{ij},$$
$$s = \frac{1}{2W} \sum_{(i,j)} w_{ij} \left[ (x_i - \bar{x})^2 + (x_j - \bar{x})^2 \right], \qquad t = \frac{1}{W} \sum_{(i,j)} w_{ij} (x_i - \bar{x})(x_j - \bar{x}),$$
where the sums run over all links (i, j) in the network. The auxiliary variable s measures the variance within the sample, including both variables $x_i$ and $x_j$, and the variable t is a measure of the covariance of $x_i$ and $x_j$. Note that the contribution of each link to the variance is weighted by $w_{ij}$. In general, the weighted correlation coefficient provides a basic measure of the importance of homophily in social interactions. In Fig 2, we show the ICC where all weights are proportional to the activity on the link, i.e., α = 1. 
The error bars are estimated using bootstrapping: for each value of α and for each network layer (call, SMS, and Bluetooth), we generate 10,000 reference networks by randomly reshuffling the links. We then measure the correlation coefficient in these reference networks. The fraction of reference networks with an ICC larger than that of the true network provides us with a measure of the p-value.
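The tunable link weight and the weighted ICC described above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the study. It assumes that the directed weight is the interaction count raised to the power α and normalized over each person's outgoing links, and that the weighted ICC takes the usual pairwise form with each link weighted by w_ij. The function names and the toy count matrix are hypothetical.

```python
import numpy as np


def directed_weights(counts, alpha):
    """Tunable directed weights: counts[i, j]**alpha, normalised per row.

    counts[i, j] is the number of interactions from i to j (0 = no link).
    Assumed form: w_ij = n_ij**alpha / sum_k n_ik**alpha.
    """
    counts = np.asarray(counts, dtype=float)
    linked = counts > 0
    # Only existing links are raised to the power alpha, so that
    # alpha = 0 gives equal weights on real links (0**0 would be 1).
    powered = np.where(linked, counts ** alpha, 0.0)
    row_sums = powered.sum(axis=1, keepdims=True)
    # Persons without outgoing links keep an all-zero row.
    return np.divide(powered, row_sums,
                     out=np.zeros_like(powered), where=row_sums > 0)


def symmetric_weights(counts, alpha):
    """Average of the two directed weights of each pair."""
    w = directed_weights(counts, alpha)
    return 0.5 * (w + w.T)


def weighted_icc(x, w):
    """Weighted pairwise intraclass correlation for a symmetric weight matrix."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # Each undirected link is counted once via the upper triangle.
    i, j = np.nonzero(np.triu(w, k=1))
    wij = w[i, j]
    total = wij.sum()
    mean = np.sum(wij * (x[i] + x[j])) / (2.0 * total)
    s = np.sum(wij * ((x[i] - mean) ** 2 + (x[j] - mean) ** 2)) / (2.0 * total)
    t = np.sum(wij * (x[i] - mean) * (x[j] - mean)) / total
    return t / s


if __name__ == "__main__":
    # Hypothetical call counts: person 0 calls three others 1, 4 and 5 times.
    counts = [[0, 1, 4, 5],
              [1, 0, 0, 0],
              [4, 0, 0, 0],
              [5, 0, 0, 0]]
    for alpha in (0, 1, 2):
        print(alpha, directed_weights(counts, alpha)[0])
```

For the person with 1, 4, and 5 calls to three contacts, α = 0 gives weights of 1/3 each, α = 1 gives 0.1, 0.4, and 0.5, and α = 2 gives 1/42, 16/42, and 25/42, so increasing α shifts the weight toward the strongest link.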
We observe in Fig 2 that there is no pronounced homophily for the personality traits conscientiousness, agreeableness and neuroticism, even when we consider only the strongest links. In Fig 3, we test the importance of link strength by varying the parameter α, i.e., we test for homophily by considering all social interactions equally important (α = 0) or by weighting frequent interactions higher (α > 0). We see that for the Big Five personality traits, only extraversion has ICCs which are significantly different (p < 0.05) from zero in all layers. The p-values for the computed ICCs are listed in S3 Text, and a description of how the p-values are computed can be found in Materials and Methods. In the SMS layer, the ICC for extraversion ranges from values around zero when all links have equal weights to values around 0.2 for α = 2. For both the extraversion and openness traits, the proximity and call layers result in ICCs that are lower than the ICC of the text message layer. The ICCs for agreeableness, conscientiousness and neuroticism are, for almost all values of α, not significantly different from zero and are bounded above by approximately 0.12.
Homophily is pronounced in the phone call network for the variables capturing smoking and drinking behaviors. Here the ICCs are significantly different (p < 0.05) from zero and achieve values larger than 0.3 in the call layer and values up to 0.2 in the SMS layer. This is in contrast to the other variables in our study, where homophily is most pronounced in the SMS layer. The variables representing attitudes concerning politics and religion show a weak or no correlation. Less surprisingly, we see an over-representation of social interaction between individuals of the same sex for calls and text messages when α = 0; the ICCs attain values around 0.2. Moreover, we observe that for increasing values of α, the stronger links in the text message network more frequently connect individuals of different sex, i.e., we see a transition from a positive ICC to a negative ICC as α is increased. Interestingly, although the correlation is slightly smaller, the proximity data at the same time show that individuals with frequent face-to-face encounters tend to be of the same sex.
Fig 2. Similarity of connected individuals in the network. The bars show the intraclass correlation coefficient for the different variables and for the networks formed from call activity, SMS activity and proximity data. The lines have a range of one standard deviation. We find no similarity with regard to the personality traits conscientiousness, agreeableness and neuroticism, but we find a weak similarity with regard to extraversion and dissimilarity with respect to openness. In general, a stronger similarity is found for socio-demographic, behavior-, and attitude-related variables.

Discussion
Multilayer networks. The overlap between the three layers in the multilayer network can be estimated from the pairwise Pearson correlation coefficients $r_{p,k\ell}$ of the link weights $w^{L}_{ij}$ in two layers $L_k$ and $L_\ell$.
We find that for α = 1 the correlation coefficient is 0.75 between the call and SMS layers, 0.53 between the call and proximity layers, and 0.47 between the SMS and proximity layers. A similar approach has previously been suggested in Ref. [30] where, instead of the link weights, the degree of the nodes in the individual layers was considered. Using the link weights, we can now, by tuning the parameter α, test the overlap between the layers for different levels of acquaintanceship. In Fig 4, we show the pairwise correlation between the three layers for different values of α. As expected, there is a significant overlap between the layers, but they also differ enough to be treated as more than a fluctuation of a single network. Interestingly, the overlap changes with the parameter α, which raises a fundamental question in the analysis of multiplex networks: which weights are the right ones to use? The unweighted case α = 0 leads to correlations that differ from those at larger α values. In fact, strong links might not necessarily be present or strong in all layers, e.g., two persons who communicate frequently might prefer phone calls to text messages. In the intermediate range, interaction could be more equally distributed across the channels or layers. In other words, the degree of multiplexity in our network is tunable and depends on whether strong or weak links are favored. This observed sensitivity in overlap could have implications for community detection algorithms on multiplex networks [31,32] or for the structural reducibility of overlapping layers [33]. We further note that the proximity (Bluetooth) layer is more densely connected than the other layers, in particular because the participants in the study meet at more informal gatherings on the university campus or have encounters which could be either spontaneous or of a less personal character, such as study groups. 
This could be one reason for the weaker similarity seen for most of the variables in the proximity data in Figs 2 and 3.
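As an illustration of the layer overlap discussed above, the following sketch computes the Pearson correlation between the link weights of two layers. It is a hedged example rather than the study's implementation. In particular, the choice to correlate over node pairs that carry a link in at least one of the two layers is an assumption, and the function name and toy matrices are hypothetical.

```python
import numpy as np


def layer_overlap(w_a, w_b):
    """Pearson correlation of the link weights in two layers.

    w_a and w_b are symmetric weight matrices over the same set of nodes.
    Assumption: the correlation runs over node pairs i < j that carry a
    link in at least one of the two layers.
    """
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    # Upper triangle without the diagonal: each pair is counted once.
    upper = np.triu(np.ones(w_a.shape, dtype=bool), k=1)
    present = upper & ((w_a > 0) | (w_b > 0))
    return np.corrcoef(w_a[present], w_b[present])[0, 1]


if __name__ == "__main__":
    # Two hypothetical three-node layers with different strong links.
    calls = np.array([[0.0, 0.2, 0.5],
                      [0.2, 0.0, 0.3],
                      [0.5, 0.3, 0.0]])
    sms = np.array([[0.0, 0.5, 0.2],
                    [0.5, 0.0, 0.3],
                    [0.2, 0.3, 0.0]])
    print(layer_overlap(calls, calls))
    print(layer_overlap(calls, sms))
```

Recomputing the weight matrices for several values of α and calling the function for each pair of layers would trace out the kind of α-dependence of the overlap described in the text.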
Here, we have performed an extensive mapping of similarity in a large social network based on detailed records of social interactions over a time span of nearly two years. From the frequency of interactions, all links in the network are assigned a weight, which we have been able to tune in order to look for homophily across varying levels of acquaintanceship. We show that tuning the weights can reveal new features of the node similarity. For the variables describing alcohol use, cigarette use, and extraversion, we see that individuals are more similar when they interact strongly. In contrast, if the weights are disregarded, we see little or no similarity. Interestingly, the similarity of individuals is not monotonically increasing with the frequency of interaction for all variables, e.g., the intraclass correlation coefficient with regard to gender transitions from positive to negative values. The analysis of our data does not provide any evidence that the basic personality traits agreeableness, conscientiousness, neuroticism and, to some degree, openness are an important factor in the formation of social networks. In fact, we find a small or non-existent correlation between these personality traits and social interaction, even when we only consider individuals that interact very frequently. Finally, the measure we have introduced shows that the degree of multiplexity in our network is tunable as we vary the balance between weak and strong links.

Materials and Methods
In constructing the multilayer network, we include links from participants that meet minimum requirements with respect to the total time window in which they are active and their level of activity. In particular, we require that the data recording period is longer than 3 months and associated with at least 170 calls, 950 text messages and 200 hours of Bluetooth interaction. These numbers correspond to the typical social activity of a person during a 3-month period, which, we believe, is a reasonable time scale for the resolution of social behavior. These requirements reduce the dataset to 659 participants and are introduced to avoid adding noisy links to the network. The average user in the study has been active for 530 days, has been part of 952 phone calls, and has exchanged 5313 text messages. The average number of hours that a user has been in the proximity of others is 1073. The proximity network is based on asynchronous Bluetooth scans by each smartphone every 5 minutes, which are collected into 5-minute time bins and symmetrized. Many of the recorded interactions are with people outside the study and therefore cannot be included in the analysis of homophily. In the call and SMS data, the total weight of a single individual therefore depends on the fraction of calls or text messages that are directed to other participants in the study.
The significance of our estimated ICCs has been computed in the following way. For each value of α and each layer in the network, we generate 10,000 reference layers (i.e., networks) by shuffling the links within a layer. We then measure the ICC in these reference layers. The fraction of reference layers with an intraclass correlation coefficient larger than that of the original network layer provides us with an estimate of the p-value. A table of all computed p-values has been included in S3 Text.
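The significance test can be illustrated with a short Python sketch. The exact shuffling scheme is not spelled out here, so the example substitutes a simple node-label permutation: the trait values are randomly reassigned to nodes while the links and their weights stay fixed. That substitution, the pairwise form of the weighted ICC, the function names, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def weighted_icc(x, i, j, w):
    """Weighted pairwise ICC for links (i[k], j[k]) with weights w[k]."""
    total = w.sum()
    mean = np.sum(w * (x[i] + x[j])) / (2.0 * total)
    s = np.sum(w * ((x[i] - mean) ** 2 + (x[j] - mean) ** 2)) / (2.0 * total)
    t = np.sum(w * (x[i] - mean) * (x[j] - mean)) / total
    return t / s


def permutation_p_value(x, i, j, w, n_ref=10_000, seed=0):
    """Fraction of reference layers whose ICC exceeds the observed ICC.

    Assumption: a reference layer is generated by permuting the trait
    values over the nodes, keeping topology and weights unchanged.
    """
    x = np.asarray(x, dtype=float)
    i = np.asarray(i)
    j = np.asarray(j)
    w = np.asarray(w, dtype=float)
    rng = np.random.default_rng(seed)
    observed = weighted_icc(x, i, j, w)
    exceed = 0
    for _ in range(n_ref):
        shuffled = rng.permutation(x)
        if weighted_icc(shuffled, i, j, w) > observed:
            exceed += 1
    return observed, exceed / n_ref


if __name__ == "__main__":
    # Hypothetical layer: two chains of five nodes, one trait value per chain.
    i = [0, 1, 2, 3, 5, 6, 7, 8]
    j = [1, 2, 3, 4, 6, 7, 8, 9]
    w = np.ones(len(i))
    x = [0.0] * 5 + [1.0] * 5
    icc, p = permutation_p_value(x, i, j, w, n_ref=1000)
    print(icc, p)
```

The one-sided comparison mirrors the description above, where the p-value is the fraction of reference layers with an ICC larger than that of the original layer. A test for negative ICCs would count the reference layers with a smaller ICC instead.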
This study was reviewed and approved by the appropriate Danish authority, the Danish Data Protection Agency (Reference number: 2012-41-0664). The Data Protection Agency guarantees that the project abides by Danish law and also considers potential ethical implications. All subjects in the study provided written informed consent.