Distinctiveness Centrality in Social Networks

The determination of node centrality is a fundamental topic in social network studies. In contrast to established metrics, which identify central nodes based on their brokerage power, the number and weight of their connections, and the ability to quickly reach all other nodes, we introduce five new measures of Distinctiveness Centrality. Those new metrics attribute a larger value to keeping a connection with the network periphery instead. We compute their main features, i.e. range and expected value for the class of scale-free networks. The results show that they are all consistent with the goal of keeping the network periphery connected, and provide a viewpoint of centrality alternative to that of established metrics.


Introduction
The determination of node centrality is a fundamental and popular topic in social network studies [1][2][3], which has never stopped attracting the interest of scholars, e.g. [4][5][6]. The concept of centrality has been interpreted in many ways and several metrics have been proposed to study the positional power of social actors [7,8]. Similarly, different validation approaches were used to assess the role of these metrics in the identification of influential nodes. Three of the most famous centrality metrics (i.e. degree, closeness and betweenness centrality) were described by Freeman. While degree counts how many direct connections a node has, closeness and betweenness are calculated considering also indirect connections. Closeness is measured as the reciprocal of the sum of the lengths of the shortest paths between a node and all other nodes in the graph; it gives an idea of how quickly a social actor can reach its peers. Betweenness centrality counts how many times a node lies on the shortest paths that interconnect the other nodes, thus serving as a bridge and acquiring brokerage power.
Scholars like Burt [10] also posed the question of whether having a dense ego-network is beneficial to social capital. He showed that individuals may hold positional advantages or disadvantages based on the network they are embedded in (i.e. on the connections among their peers). In particular, missing links among the actors in a node's neighborhood (structural holes) are often seen as an advantage, as the node can act as a mediator, use a divide-et-impera strategy, or combine ideas from different sources and come up with the most innovative one [11]. Accordingly, a high ego-network closure is often seen as a constraint to the brokerage power of the ego, who cannot mediate among its peers.
Several variations of the above-mentioned metrics were proposed [1,12], as well as different algorithms for their fast computation on large graphs [13]. Indeed, metrics such as weighted betweenness centrality are costly to compute, with one of the most efficient algorithms requiring $O(n + m)$ space and running in $O(nm + n^2 \log n)$ time, where $n$ is the number of nodes and $m$ is the number of links [14].
The majority of centrality metrics tend to attribute higher influence to nodes that are highly connected or which are connected to other important nodes, like in the case of eigenvector centrality [15]. The idea is that the daughter of the United States president will be important since the moment she is born, if her mother is already the president. It does not matter how many direct connections she has; few links to extremely important peers are enough. In general, more connections and connections to important nodes (or hubs) are seen as an advantage. Connections to the network periphery, on the other hand, are often regarded as less important.
In this paper, we question this last assumption and propose a new set of metrics, which we call Distinctiveness Centrality, that attribute more importance to nodes which have links to loosely connected others. While we still recognize the pivotal importance of traditional centrality metrics, we also believe that there may be contexts in which connections to peripheral nodes should be valued more. For example, it might be the case that nodes with more peripheral connections keep the network together, avoiding fragmentation. These nodes may be the only ones who can reach certain peers and be used as seeds for the diffusion of practices that promote health in the population. This is partially aligned with Borgatti's approach for the identification of key players [16]. In other applications, for example when analyzing word co-occurrence networks [17] to evaluate brand importance [18], brands with connections to distinctive words may be more important as they show unique traits that distinguish them from competitors. They convey a different brand image. These are just a few examples showing the need for new centrality metrics, which can favor non-redundant connections towards loosely connected nodes. Accordingly, we introduce a new set of indicators that capture the value of distinctive connections and add to the information explained by traditional centrality measures. Distinctiveness centrality is also very fast to compute, even for large graphs.
The remainder of the paper is organized as follows: in the next section we define a set of five measures of distinctiveness centrality and compare them with well-known centrality indicators, to show that the information they capture is different. Subsequently we present their properties, in terms of variation range and expected values. We derive lower and upper bounds that could be used for normalization, to allow the comparison of scores obtained on different networks. In the last section we discuss our findings and proposals for future research.

Metrics definition
In this section we define five metrics, which were all conceived following the same logic: they all attribute a higher importance to nodes which are strongly connected to loosely connected peers, so that they make the network periphery more reachable. In the computation of network centrality, all our metrics penalize connections to hubs or nodes which are very well connected by themselves. The concept of degree centrality is reinterpreted following this logic.
Let us consider a network that we represent through a weighted undirected graph $G$, which is described by the triplet $G = (V, E, W)$. Let $V$ be the set of vertices, of cardinality $|V| = n$, $E = \{(x, y) : x, y \in V, x \neq y\}$ be the set of edges, and $W$ be the set of weights associated to the edges. If the vertices $i$ and $j$ are not connected, we assume $w_{ij} = 0$. In the following, we use the terms vertices and nodes, and arcs and edges, interchangeably.
For the purpose of illustrating the computation and the specific features of each metric, in this section we employ a 6-node toy network, which can be seen in Fig 1. For the generic node $i \in V$ we introduce the following five distinctiveness centrality metrics (for the sake of clarity we omit the subscript $i$ in all subsequent definitions, but each metric definition is to be referred to node $i$), where $g_j$ is the degree of node $j$ and $I(f)$ is the indicator function, which equals 1 if $f$ is TRUE, i.e. if the edge connecting node $i$ to node $j$ exists, and 0 otherwise.
Weighted Distinctiveness Centrality. It is defined as
$$D_1 = \sum_{j=1, j \neq i}^{n} w_{ij} \log_{10} \frac{n-1}{g_j}. \quad (1)$$
This metric is similar to weighted degree centrality [8], as it sums the weights of all arcs connected to a node. However, weights here are penalized based on the number of connections that the peers of a node have. The minimum weight takes place when node $i$ is connected to a node connected to all other nodes, so that the latter exhibits the maximum possible degree $g_j = n-1$ and the weight is rescaled by $\log_{10}\frac{n-1}{n-1} = 0$. The rationale is that node $i$ adds the minimum improvement possible to the reachability of node $j$ by connecting to it, since node $j$ is already connected to all other nodes. We get instead the maximum weight when node $i$ is the only node connected to node $j$, since we have $\log_{10}\frac{n-1}{1} = \log_{10}(n-1)$. The rationale here is that node $j$ would be unreachable if it were not connected by node $i$. In Table 1, we see that employing D1 provides us with a different view of the centrality of each node, with respect to both the degree and the weighted degree. Node A, though having the same degree and weighted degree as nodes C and D, is considered as more central through the metric D1, since it allows us to reach node E, which would otherwise be isolated. On the other hand, node E, though having the same degree and weighted degree as node F, is considered as more central than node F through the metric D1, since it connects to a less connected node (node A versus node B).
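The computation of D1 can be illustrated with a minimal sketch. The graph below is a hypothetical 5-node example (it is not the toy network of Fig 1, whose edges are not listed in the text), and the helper names `build_adjacency` and `d1` are introduced here for illustration only.

```python
import math

# Hypothetical weighted undirected edges (node, node, weight); not the network of Fig 1.
EDGES = [("A", "B", 3.0), ("A", "C", 1.0), ("B", "C", 2.0),
         ("C", "D", 4.0), ("D", "E", 5.0)]

def build_adjacency(edges):
    """Return a dict mapping each node to a dict {neighbor: weight}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def d1(adj, node):
    """Sum of arc weights, each rescaled by log10((n - 1) / degree of the neighbor)."""
    n = len(adj)
    return sum(w * math.log10((n - 1) / len(adj[j]))
               for j, w in adj[node].items())

adj = build_adjacency(EDGES)
scores = {v: d1(adj, v) for v in adj}
for v in sorted(scores, key=scores.get, reverse=True):
    print(v, round(scores[v], 4))
```

In this example node D obtains the largest score, since it is the only connection of the peripheral node E.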
The second metric, D2, is defined as
$$D_2 = \sum_{j=1, j \neq i}^{n} \log_{10} \frac{n-1}{g_j} \, I(w_{ij} > 0). \quad (2)$$
This metric can be seen as degree centrality [8] adjusted through the same logarithmic term used in D1. Alternatively, it can be seen as a variant of D1 where arc weights are not considered, but just the number of connections a node has. Similarly to what happens with D1, the minimum contribution to the metric takes place when the node ($j$) connected to the node of interest is connected to all the other nodes in the network (in that case we have $g_j = n-1$), while the maximum contribution takes place when the node of interest is the only connection of the node at hand ($j$), so that we have $g_j = 1$. Over the set of all possible topologies of size $n$, the minimum value of D2 is obtained in a star topology for the terminal nodes (for which we have $D_2 = 0$) and the maximum is obtained for the hub (for which we have $D_2 = (n-1)\log_{10}(n-1)$). The rationale is again that a node is more important the more it allows us to connect nodes whose reachability is scarce. The values of D2 for our toy network are reported again in Table 1. As can be seen, the ranking assigned to nodes under this metric is the same as under D1. Of course, in the case of unweighted networks, the two metrics D1 and D2 return exactly the same values.
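The two extreme values of D2 on a star topology can be checked with a short sketch. The function names `star` and `d2` and the size `n = 6` are assumptions made here for illustration.

```python
import math

def star(n):
    """Adjacency sets of a star with hub 0 and terminals 1..n-1."""
    adj = {0: set(range(1, n))}
    for t in range(1, n):
        adj[t] = {0}
    return adj

def d2(adj, node):
    """Unweighted variant of D1: sum of log10((n - 1) / g_j) over the neighbors j."""
    n = len(adj)
    return sum(math.log10((n - 1) / len(adj[j])) for j in adj[node])

n = 6
adj = star(n)
hub_score = d2(adj, 0)        # every neighbor has degree 1
terminal_score = d2(adj, 1)   # the only neighbor is the hub, of degree n - 1
print(hub_score, (n - 1) * math.log10(n - 1))
print(terminal_score)
```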
Global Weight Distinctiveness Centrality. It is defined as
$$D_3 = \sum_{j=1, j \neq i}^{n} w_{ij} \log_{10} \frac{\frac{1}{2}\sum_{k=1}^{n}\sum_{l=1, l \neq k}^{n} w_{kl}}{\left(\sum_{k=1, k \neq j}^{n} w_{jk}\right) - w_{ij} + 1}. \quad (3)$$
Here again the index is made of a sum of terms, where just the nodes adjacent to the node of interest are included. Each adjacent node is accounted for through the weight of the arc connecting it to the node of interest. However, that weight is itself weighted by a logarithmic term that introduces a penalization for those nodes that are highly connected and with large arc weights. In fact, the denominator in the logarithm argument is the sum of the arc weights for the arcs connected to the node adjacent to the node of interest, excluding the arc connecting it to the node of interest. The numerator of the logarithm argument is just a normalization factor (the sum of all arc weights in the graph), introduced to consider the proportion of the total weights that is attributable to the connections of node $j$. The major difference with respect to D1 is that the arc weight, rather than the degree, is considered in the penalization factor. The values obtained for our toy network are shown in Table 1. We observe that the ranking is again that obtained under D1 or D2, but node A is now much closer to node B.
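A minimal sketch of D3 follows, written from the verbal description above: each arc weight is multiplied by the base-10 logarithm of the total weight of the graph divided by the weighted degree of the neighbor, minus the shared arc weight, plus 1. The 3-node path used here is a hypothetical example, and the function names are illustrative.

```python
import math

def build_adjacency(edges):
    """Return a dict mapping each node to a dict {neighbor: weight}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def d3(adj, node):
    """Arc weights penalized by the weighted degree of each neighbor."""
    # Each undirected arc appears twice in the adjacency dict, hence the division by 2.
    total = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2.0
    score = 0.0
    for j, w_ij in adj[node].items():
        strength_j = sum(adj[j].values())  # weighted degree of neighbor j
        score += w_ij * math.log10(total / (strength_j - w_ij + 1.0))
    return score

# Hypothetical path A - B - C with weights 2 and 3.
adj = build_adjacency([("A", "B", 2.0), ("B", "C", 3.0)])
scores = {v: d3(adj, v) for v in adj}
print(scores)
```

Node B, whose two neighbors have no other connections, gets the full weight of both arcs multiplied by the logarithm of the total weight.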
Weighted Proportional Distinctiveness Centrality. This metric is defined as
$$D_4 = \sum_{j=1, j \neq i}^{n} w_{ij} \frac{w_{ij}}{\sum_{k=1, k \neq j}^{n} w_{jk}}. \quad (4)$$
D4 shares with D3 the use of arc weights, but differs in the choice of the penalization factor, which is now the simple ratio of the weight of the arc connecting to the adjacent node to the sum of the weights of the arcs connected to that adjacent node. We therefore expect the metric to be large for nodes that are strongly connected to nodes that are poorly connected. For our toy network we obtain the results shown in Table 1.
This measure uses a different logic in penalizing connections of a node to peers that are highly connected. It was conceived considering arc weights like in weighted degree centrality. However, the weight of the arc that connects nodes $i$ and $j$ ($w_{ij}$) is rescaled based on the proportion of that weight with respect to the weighted degree centrality of node $j$. For example, if node A is connected to node B with an arc of weight 10, and the latter has no other connections, the weight of this arc will be considered in full for the calculation of D4. On the other hand, if B is connected to other nodes and has a weighted degree centrality of 100, then the arc between A and B will contribute with a value of $10 \times (10/100)$ to the sum which determines D4 of A.
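The numerical example above can be reproduced with a short sketch. The third node C and the weight 90 of the arc between B and C are assumptions, chosen only so that the weighted degree of B equals 100.

```python
def build_adjacency(edges):
    """Return a dict mapping each node to a dict {neighbor: weight}."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def d4(adj, node):
    """Each arc weight is rescaled by its share of the neighbor's weighted degree."""
    return sum(w * w / sum(adj[j].values()) for j, w in adj[node].items())

# Case 1: B has no other connections, so the arc weight counts in full.
alone = d4(build_adjacency([("A", "B", 10.0)]), "A")

# Case 2: B also has a (hypothetical) arc of weight 90, so its weighted degree is 100.
shared = d4(build_adjacency([("A", "B", 10.0), ("B", "C", 90.0)]), "A")

print(alone, shared)
```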

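The fifth metric, D5, is defined as
$$D_5 = \sum_{j=1, j \neq i}^{n} \frac{1}{g_j} \, I(w_{ij} > 0). \quad (5)$$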
This metric just considers the reciprocals of the degrees of the adjacent nodes. Again, the rationale is that adjacent poorly connected nodes count more, so that the most influential nodes are those connecting poorly connected nodes. Its values for our toy network are shown in Table 1. The ranking of nodes is consistent with that obtained under the other metrics, so that they can be seen to share the same overall goal; of course, the ranking may differ when we go to more complex topologies. Finally, though we have considered non-degenerate topologies so far, all distinctiveness centrality metrics produce a score of 0 for isolated nodes, by adopting the natural convention that the weight of a non-existing arc is 0. The definition of distinctiveness centrality could also be easily extended to directed networks, considering the sets of arcs leaving and reaching each node. Table 1 additionally shows the values of some of the most popular centrality metrics, i.e. non-normalized betweenness and closeness [2] and eigenvector centrality [15]. Calculations were made using the Python NetworkX package [19]. The values of Burt's constraint and effect size metrics [10,11] are also reported in the table, computed considering edge weights. From a quick comparison of the values reported in the table, we see that the information captured by each metric is different, as well as their determination of influential nodes. This is also true for the effect size measure, whose conceptualization is based on the concept of redundancy: an ego has redundancy if her contacts are connected to each other as well. The rankings produced by the five distinctiveness centrality measures are the same, even if the distances among scores change from one measure to the other. Distinctiveness centrality rankings differ from those obtained through constraint, effect size, degree, closeness, betweenness and eigenvector centrality.
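The reciprocal-degree metric and the zero score assigned to isolated nodes can be sketched as follows. The graph (a hub H with three terminals plus an isolated node z) and the function name `d5` are hypothetical.

```python
def d5(adj, node):
    """Sum of the reciprocals of the degrees of the adjacent nodes."""
    # For an isolated node the sum runs over no terms and returns 0.
    return sum(1.0 / len(adj[j]) for j in adj[node])

# Hypothetical unweighted graph stored as adjacency sets.
adj = {
    "H": {"a", "b", "c"},
    "a": {"H"},
    "b": {"H"},
    "c": {"H"},
    "z": set(),  # isolated node
}

scores = {v: d5(adj, v) for v in adj}
print(scores)
```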
In order to extend the comparison of distinctiveness centrality with these other measures, we generated 1000 random scale-free networks, according to the Barabási-Albert preferential attachment model (with 50 nodes and 2 edges that are preferentially attached to existing nodes with high degree, when the network grows). We used the NetworkX Python package [19]. Weights of existing arcs were assigned through a uniform selection of random integers in the range [1,20]. For each network we calculated the Spearman's rank correlation coefficients of all metrics, to see how similar centrality rankings were. Average correlations are reported in Table 2. We see that no perfect overlaps ($\rho = 1$ or $\rho = -1$) are present, which means that no two metrics are perfectly interchangeable (i.e. redundant). We see that our distinctiveness metrics correlate better with degree and betweenness (as expected, since the degree is used to a large extent, and the new metrics prize the bridging power) but poorly with closeness (since they capture quite different properties). Among the distinctiveness metrics, the maximally correlated metrics are D1 and D3 ($\rho = 0.9915$), which are both edge weight-related, and D2 and D5, which are instead degree-related. As expected, rankings produced by our metrics are fairly similar to each other, which means that they are consistent with the same goal (attributing more value to bridging to the network periphery).

Metrics range
Established centrality measures, such as degree, closeness, and betweenness, share a common property of being subject to normalization, so that they take values in the [0,1] range. This property is desirable, since it allows us to make centrality statements of the low-high kind (i.e., if a centrality measure is close to 0, we can state the centrality is low, while the reverse can be stated if the centrality measure is close to 1). In addition, it also allows us to compare the centrality of networks of different sizes.
We would like this property to hold also for our new centrality measures. In order to perform a normalization, however, we need to be able to set an upper bound for the unnormalized metrics, so that the metric can be normalized by dividing its unnormalized value by that upper bound. A proper normalization factor has to depend on the network size only. Since we are dealing with networks where the edges may be weighted, we allow the normalization factor to depend on the maximum edge weight also. The upper bound has to be computed over the whole set of possible network topologies. Here we limit ourselves to the case of connected networks, where no node is isolated, so that $g_i \geq 1, \forall i$.
We can derive such an upper bound for the metrics defined in the Metrics definition section. Before starting, we note that a topology-related quantity appearing in three out of the five metrics (precisely, D1, D2, and D5) is the degree, which appears through its reciprocal and summed over all the nodes. Its value is therefore maximum for a node connected to as many nodes as possible, each having the lowest possible degree. This is exactly what happens for the hub in a star topology, since it is connected to all the other $n-1$ nodes, which in turn have a degree equal to 1. In the following we will therefore consider the hub of a star topology as the node maximizing the metrics mentioned above.
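Normalization by the value taken at the hub of a star can be sketched for the two unweighted metrics. The example assumes the star-hub values $(n-1)\log_{10}(n-1)$ for D2 and $n-1$ for D5 as normalization factors; the 4-node path graph and all function names are hypothetical, and the sketch requires $n \geq 3$ so that the D2 factor is positive.

```python
import math

def path_graph(n):
    """Adjacency sets of a path 0 - 1 - ... - (n-1); a hypothetical example graph."""
    adj = {v: set() for v in range(n)}
    for v in range(n - 1):
        adj[v].add(v + 1)
        adj[v + 1].add(v)
    return adj

def d2(adj, node):
    n = len(adj)
    return sum(math.log10((n - 1) / len(adj[j])) for j in adj[node])

def d5(adj, node):
    return sum(1.0 / len(adj[j]) for j in adj[node])

def normalized(adj):
    """Return {node: (normalized D2, normalized D5)} using the star-hub values."""
    n = len(adj)
    d2_max = (n - 1) * math.log10(n - 1)  # D2 of the hub of a star with n nodes
    d5_max = n - 1                        # D5 of the hub of a star with n nodes
    return {v: (d2(adj, v) / d2_max, d5(adj, v) / d5_max) for v in adj}

norm = normalized(path_graph(4))
for v, (nd2, nd5) in norm.items():
    print(v, round(nd2, 4), round(nd5, 4))
```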
On the other hand, if we look for a lower bound on a metric, the same line of reasoning considered above leads us to look for a node connected to as few nodes as possible, each having the largest possible degree. The nodes possessing such a feature are the terminal nodes in a star topology. In fact, a terminal node is connected just to the hub, so that it has the smallest degree possible and is connected to the node with the largest degree possible at the same time.
Table 2. Spearman's correlation of centrality metrics
Global Weight Distinctiveness Centrality. In the case of the metric D3, we can consider separately its components in order to arrive at its upper bound. It is clear, by inspection of the metric, that the maximum of D3 is reached if the weight $w_{ij}$ is as high as possible, since it appears both as a factor inside the sum and as a subtractive term in the denominator of the logarithm argument. At the same time we note that the term $\sum_{k=1, k \neq j}^{n} w_{jk}$ in the denominator is the sum of the weights of the edges connected to the node neighbor to the node of interest (i.e., that for which we compute the metric), so that this sum has to be minimum to achieve the maximum of D3. Instead, the numerator $\frac{1}{2}\sum_{k=1}^{n}\sum_{l=1, l \neq k}^{n} w_{kl}$ is simply the sum of the weights of all the edges in the network, which is maximized if we maximize all those weights. Summing up, the maximum of D3 is achieved when: a) the weights of the edges connected to the node of interest are maximum; b) the weights of the edges connected to the neighbors of the node of interest are minimum (excepting the edges connecting the neighbors to the node of interest); c) all the weights of the other edges are maximum. These three conditions are met when we consider the hub in a star topology, and all the edges have the same weight (which is then the maximum). In this case, we have $\left(\sum_{k=1, k \neq j}^{n} w_{jk}\right) - w_{ij} = 0$ for each neighbor $j$, so that
$$D_3 \leq (n-1)\max(w_{ij}) \log_{10}\left[(n-1)\max(w_{ij})\right].$$
This upper bound again depends just on the network size and on the maximum weight, so that it can be employed as a normalization factor. Following the same approach, we can find a lower bound for D3 as well. In this case, we must try to: a) minimize the weight $w_{ij}$; b) maximize the sum of the weights of the other edges of node $j$; c) minimize all the other weights. This is accomplished if we consider a terminal node in a star network, whose connection to the hub has the minimum weight, while all the edges connecting the other terminal nodes to the hub have maximum weight.
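The lower bound is then found as
$$D_3 \geq \min(w_{ij}) \log_{10} \frac{\frac{1}{2}\sum_{k=1}^{n}\sum_{l=1, l \neq k}^{n} w_{kl}}{(n-2)\max(w_{ij}) + 1}.$$
Weighted Proportional Distinctiveness Centrality. For the metric D4, we obtain the upper bound $D_4 \leq (n-1)\max(w_{ij})$, since each neighbor shares the weight $w_{ij} = w_{ji}$ with node $i$, and $\min\left(\sum_{k=1, k \neq i}^{n} w_{jk}\right) = 0$ when each neighbor of node $i$ is not connected to any other node. As can be easily envisaged, that upper bound is reached for the hub of a star topology, when all the edges have equal weight. As for the metric D1, the metric D4 is amenable to being normalized only if a bound can be imposed a priori on the edge weights.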
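The star configurations used for the bounds on D3 can be evaluated numerically. The sketch below applies the verbal definition of D3 to a star with $n = 6$ nodes and weights in $[1, 20]$ (both values are assumptions), and compares the hub score with the closed form $(n-1)\,w\,\log_{10}[(n-1)\,w]$ obtained by evaluating the definition for equal weights $w$.

```python
import math

def weighted_star(weights):
    """Star with hub 0; weights[t - 1] is the weight of the edge between hub and terminal t."""
    adj = {0: {}}
    for t, w in enumerate(weights, start=1):
        adj[0][t] = w
        adj[t] = {0: w}
    return adj

def d3(adj, node):
    total = sum(w for nbrs in adj.values() for w in nbrs.values()) / 2.0
    return sum(w * math.log10(total / (sum(adj[j].values()) - w + 1.0))
               for j, w in adj[node].items())

n, w_min, w_max = 6, 1.0, 20.0

# Upper-bound configuration: hub of a star where every edge has the maximum weight.
hub_score = d3(weighted_star([w_max] * (n - 1)), 0)
closed_form = (n - 1) * w_max * math.log10((n - 1) * w_max)

# Lower-bound configuration: terminal with the minimum weight, all other edges maximum.
terminal_score = d3(weighted_star([w_min] + [w_max] * (n - 2)), 1)

print(hub_score, closed_form, terminal_score)
```

With a minimum weight of 1 the numerator and denominator of the logarithm coincide for the terminal node, so its score is 0.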
We have to note that, contrary to the metrics examined so far, the lower bound on D4 is not zero. Actually, we get the lower bound on D4 for a node that is connected to just one other node with the minimum edge weight, while that neighbor is connected in turn to all the other nodes with the maximum edge weight. This is the case of a terminal node in a star topology when its edge to the hub has the minimum weight while all the edges from the hub to the other terminals have the maximum weight. We therefore get the following lower bound
$$D_4 \geq \frac{\min(w_{ij})^2}{\min(w_{ij}) + (n-2)\max(w_{ij})}.$$
Table 3. Normalized distinctiveness centrality metrics for the toy network of Fig 1
For the metric D5, the maximum over the set of network topologies again depends just on the network size, so that the metric is amenable to being normalized (by the way, the normalization factor is exactly the same as for the degree centrality). For the minimum value of D5, we can easily recognize that the minimum is reached when the node of interest is connected to just a single node that has instead the largest degree possible. This is what happens to any terminal node in a star topology, so that the minimum value of D5 is $\frac{1}{n-1}$. The overall range for this metric is then
$$\frac{1}{n-1} \leq D_5 \leq n-1. \quad (18)$$
Normalization is again possible just by dividing by $n-1$, i.e. relying just on the size of the network. The normalized metrics for the toy network of Fig 1 are reported in Table 3. As can be seen, the dynamics of D4 appears to be more compressed with respect to the other metrics (in particular D2 and D5). This is probably due to the presence of the weights in that metric, which leads to a multiplicative increase of the normalization factor.
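The extreme values of D4 and D5 on a star can be verified with a small sketch. The size $n = 6$ and the weight range $[1, 20]$ are assumptions, and the functions follow the verbal definitions of the two metrics given in the Metrics definition section.

```python
def weighted_star(weights):
    """Star with hub 0; weights[t - 1] is the weight of the edge between hub and terminal t."""
    adj = {0: {}}
    for t, w in enumerate(weights, start=1):
        adj[0][t] = w
        adj[t] = {0: w}
    return adj

def d4(adj, node):
    """Arc weight times its share of the neighbor's weighted degree."""
    return sum(w * w / sum(adj[j].values()) for j, w in adj[node].items())

def d5(adj, node):
    """Sum of the reciprocals of the neighbors' degrees."""
    return sum(1.0 / len(adj[j]) for j in adj[node])

n, w_min, w_max = 6, 1.0, 20.0
adj = weighted_star([w_min] + [w_max] * (n - 2))

d4_terminal = d4(adj, 1)  # terminal with the minimum weight
d4_closed_form = w_min ** 2 / (w_min + (n - 2) * w_max)
d5_hub, d5_terminal = d5(adj, 0), d5(adj, 1)

print(d4_terminal, d4_closed_form)
print(d5_terminal / (n - 1), d5_hub / (n - 1))  # normalized D5 extremes
```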

Expected value of metrics
In the Metrics range section, we have computed the range of each metric over the set of nodes for any possible topology and weight distribution. The maximum may be employed to normalize the metrics and get values in the [0,1] range. However, normalization acts over the set of all possible topologies and is not indicative of the values actually taken by the metrics, which can be well below their extreme value. In this section, we provide formulas for the expected value of the five metrics we have proposed, so that we get a picture of where the bulk of the values taken by the metrics lies. We employ the scale-free topology, as described in the work of Barabási and Albert [20]. In scale-free networks, the degree follows a power-law distribution. The network grows with new nodes attaching to already existing nodes, following a scheme known as preferential attachment.
Since all the expressions involved in the metrics definitions are random sums (the number of terms in the sum is itself a random variable), we employ Wald's identities (see, e.g., Section 34.14. 2.11 of [21]). For the random sum $Y = \sum_{j=1}^{N} X_j$, where $N$ is a random variable, and $X_j \sim X, \forall j = 1, 2, \ldots, N$, with $N$ and $X$ independent of each other, we have
$$E[Y] = E[N] \cdot E[X]. \quad (20)$$
In scale-free networks the degree distribution follows a power law, i.e. the probability function of the degree $G$ is $P[G = k] = \alpha k^{-\gamma}$, where $\alpha = \left(\sum_{k=1}^{n-1} k^{-\gamma}\right)^{-1}$. Preliminarily, we compute the expected value and the variance of $N$, since the sum in all metrics runs over a number of terms equal to the degree of the node $i$, which likewise follows the power law distribution:
$$E[N] = \alpha \sum_{k=1}^{n-1} k^{-(\gamma-1)}, \qquad \mathrm{Var}[N] = \alpha \sum_{k=1}^{n-1} k^{-(\gamma-2)} - \left(\alpha \sum_{k=1}^{n-1} k^{-(\gamma-1)}\right)^2.$$
Weighted Distinctiveness Centrality. We start with computing the expected value of the metric D1. We have a random sum of a product, so that we can write its expected value as the following product (due to the independence of $W$ and $Y$):
$$E[D_1] = E[N] \cdot E[W] \cdot E[Y],$$
where $W$ is the random variable representing the weight, and $Y$ is the following random variable
$$Y = \log_{10} \frac{n-1}{G}.$$
If we assume a uniform distribution for the weight ($W \sim U(0, 1.0)$), the expected value $E[W]$ is the midpoint of the support. For the random variable $Y$ we have
$$E[Y] = \alpha \sum_{k=1}^{n-1} k^{-\gamma} \log_{10} \frac{n-1}{k}.$$
An approximate expression is given in Equation (26), though in the following we prefer computing the exact values.
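The quantities entering Wald's identity can be evaluated exactly for a finite network. The sketch below is an illustration under stated assumptions: $n = 50$, $\gamma = 3$, a weight expectation of 0.5 (uniform weights on the unit interval), and independence between the degree of the node and the degrees of its neighbors. The function names are hypothetical.

```python
import math

def degree_pmf(n, gamma):
    """Truncated power law P[G = k] = alpha * k**(-gamma) for k = 1..n-1."""
    alpha = 1.0 / sum(k ** -gamma for k in range(1, n))
    return {k: alpha * k ** -gamma for k in range(1, n)}

def expectations(n, gamma):
    """Return (E[N], E[Y]) with Y = log10((n - 1) / G)."""
    pmf = degree_pmf(n, gamma)
    e_n = sum(k * p for k, p in pmf.items())
    e_y = sum(p * math.log10((n - 1) / k) for k, p in pmf.items())
    return e_n, e_y

n, gamma = 50, 3.0          # assumed network size and power-law exponent
e_n, e_y = expectations(n, gamma)
e_w = 0.5                   # assumed E[W] for weights uniform on (0, 1)

expected_d2 = e_n * e_y         # Wald's identity with X = Y
expected_d1 = e_n * e_w * e_y   # Wald's identity with X = W * Y, W independent of Y
print(round(e_n, 4), round(e_y, 4), round(expected_d2, 4), round(expected_d1, 4))
```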
Some of the terms involved in that expression tend to a limit value as the network size grows (i.e., when $n \to \infty$). This entices us to see how that index behaves when the network size grows. By recalling the definition of the Riemann Zeta function $\zeta(\gamma) = \sum_{k=1}^{\infty} \frac{1}{k^{\gamma}}$, we recognize that for a very large network we have
$$\lim_{n \to \infty} \alpha = \frac{1}{\zeta(\gamma)}$$
and similarly
$$\lim_{n \to \infty} \sum_{k=1}^{n-1} k^{-(\gamma-1)} = \zeta(\gamma - 1).$$
Unfortunately the term $\sum_{k=1}^{n-1} k^{-\gamma} \log_{10} k$ does not seem to show a finite asymptotic value. We can get an approximate expression for it, if we compute the associated integral, which can be solved by integration by parts, and then revert to the discrete form.
In Fig 3, we see a trend similar to that of D1, though the range of values is very different and a bit compressed with respect to changes of $\gamma$.
Global Weight Distinctiveness Centrality. We can now consider the D3 metric, defined in Equation (3), which we can write in a more synthetic form, for the sake of easing the derivations that follow.

For very large networks, this metric tends to a finite limit, hence independent of the actual number of nodes, since

Discussion and conclusions
The set of distinctiveness centrality metrics we presented in this paper could be used in multiple settings, namely in all cases where it is important to value the role of nodes which are connected to low-degree peers. These nodes are often a bridge to reach the network periphery. We have additionally evaluated the upper and lower bound of each metric, as well as their expected values for scale-free networks. As we show in the Metrics definition section, the node influence determined by distinctiveness centrality is different from that of degree, weighted degree, closeness, eigenvector, and betweenness centrality, and from that of Burt's [10,11] constraint and effect size. The information captured by our metrics is new.
In the field of Semantic Network Analysis, Fronzetti Colladon [18] recently presented the Semantic Brand Score (SBS), a measure of brand importance which is calculated from the analysis of potentially big textual data. While it is not in the scope of this paper to discuss the constructs of brand importance, we maintain that our distinctiveness centrality metric (D2) could be considered as an alternative to degree centrality for the measurement of Diversity (one of the components of the SBS). Indeed, the calculation of the SBS is based on the construction of a network of co-occurring words, where nodes are words that appear in the analyzed texts and links among them are determined by the frequency of their co-occurrences. For example, if the sentence "it is a beautiful day" appears 7 times, the word nodes "beautiful" and "day" will be connected by an edge of weight 7. In this context the SBS dimension of Diversity counts how many different textual associations exist for each node, and in particular for those nodes that are considered "brands" in the analysis. Diversity is operationalized through degree centrality [2], without penalizing the connections of the brand node to high-degree nodes. In our view, it could be useful to distinguish brands with common textual associations (shared with many other nodes) from brands that have more exclusive relationships with specific words. To this purpose, distinctiveness centrality (D2) could be considered as a reasonable candidate. The idea of adjusting the SBS Diversity metric is also aligned with the logic behind the term frequency-inverse document frequency (TF-IDF) normalization process that is very often used in text analysis [24,25]. According to Robertson [26], words within a document can be divided into those with eliteness and those without. TF-IDF helps us understand how important a word is to a document, which is part of a corpus.
Specifically, we can consider a matrix where text documents are represented by rows and columns are the words in the corpus. This matrix is populated by values that reflect the frequency of appearance of each word in each document. However, frequency is not sufficient to understand the importance of a word to a document, just as Prevalence is not sufficient to define the SBS. There might be words, such as "and", which add little meaning to the discourse and appear with high frequency in all documents. To identify distinctive words, frequency values are transformed into TF-IDF values, which increase proportionally to the number of times a word appears in a document and are offset by the number of documents in the corpus that contain that word. This is what D2 and our other distinctiveness centrality metrics do: they attribute more importance to the links that more strongly connect a node with low-degree peers; in the case of a word network, strong links to distinctive words are privileged.
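The analogy with TF-IDF can be made concrete with a minimal sketch. It uses one common variant of the weighting, the raw term count multiplied by the natural logarithm of the number of documents over the number of documents containing the word; the three short documents are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus; every document contains the word "and".
docs = [
    "the brand is bold and bright",
    "the brand is cheap and plain",
    "the rival is loud and bright",
]

def tf_idf(docs):
    """Return one dict {word: tf * log(N / df)} per document."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each word appears.
    doc_freq = Counter(w for toks in tokenized for w in set(toks))
    scores = []
    for toks in tokenized:
        counts = Counter(toks)
        scores.append({w: c * math.log(n_docs / doc_freq[w])
                       for w, c in counts.items()})
    return scores

scores = tf_idf(docs)
print(scores[0])
```

Words that occur in every document, such as "and", receive a score of 0, while a word that appears in a single document receives the largest score, mirroring the way distinctiveness centrality discounts links to high-degree nodes.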
Future research could further explore the properties of our newly defined centrality indicators, for example studying their expected values on network topologies other than scale-free, such as random [27] or small-world networks [28]. Lastly, the scores and rankings produced by our metrics could be more extensively compared with those of other centrality measures.