A methodology and theoretical taxonomy for centrality measures: What are the best centrality indicators for student networks?

In order to understand and represent the importance of nodes within networks better, most of the studies that investigate graphs compute the nodes’ centrality within their network(s) of interest. In the literature, the most frequent measures used are degree, closeness and/or betweenness centrality, even if other measures might be valid candidates for representing the importance of nodes within networks. The main contribution of this paper is the development of a methodology that allows one to understand, compare and validate centrality indices when studying a particular network of interest. The proposed methodology integrates the following steps: choosing the centrality measures for the network of interest; developing a theoretical taxonomy of these measures; identifying, by means of Principal Component Analysis (PCA), latent dimensions of centrality within the network of interest; verifying the proposed taxonomy of centrality measures; and identifying the centrality measures that best represent the network of interest. Also, we applied the proposed methodology to an existing graph of interest, in our case a real friendship student network. We chose eighteen centrality measures that were developed in SNA and are available and computed in a specific library (CINNA), defined them thoroughly, and proposed a theoretical taxonomy of these eighteen measures. PCA showed the emergence of six latent dimensions of centrality within the student network and saturation of most of the centrality indices on the same categories as those proposed by the theoretical taxonomy. Additionally, the results suggest that indices other than the ones most frequently applied might be more relevant for research on friendship student networks. Finally, the integrated methodology that we propose can be applied to other centrality indices and/or other network types than student graphs.


Abstract
In social network analysis (SNA), several measures were developed to assess a node's centrality in a graph. Among these measures, we find connectivity or degree-based measures, geodesic distance-based indices, geodesic path-based scores and centrality measures related to a node's neighborhood. Moreover, studies that investigate networks use mostly degree, closeness and/or betweenness […] (e.g., access to information), and the consideration (or not) of neighborhood properties such as prestige (e.g., Song & al., 2015; Lü & al., 2016; Ghazzali & Ouellet, 2017; Ashtiani & al., 2018). We have observed that such taxonomies of centrality measures are rare in the literature. Therefore, this paper proposes a theoretical classification of several centrality indices.

The need for a validation of taxonomies
In the literature, we found that the rare taxonomies of centrality measures were not systematically verified (e.g., on real data and/or by thorough methodologies). As taxonomies are a kind of ontology, which is a concept used in knowledge engineering and has been defined as 'a formal specification of a shared conceptualisation' (Borst, 1997), their appeal to a community (i.e. sharedness) and their fit with the reality they represent (i.e. conceptualisation) need to be validated (Guarino & al., 2009). As such, they need to be sound, complete, lucid and laconic (Guizzardi, 2007). An ontology is sound when there is no construct excess and each of its constructs matches an underlying reality in the intended universe of discourse (e.g. a student network). For example, a sound taxonomy of centrality measures for student networks only contains theoretical categories that faithfully represent a centrality notion or dimension for student networks in the real world. A complete ontology has no construct deficit and hence has a construct for each aspect of the underlying reality. For example, by means of an exhaustive list of theoretical categories, a complete taxonomy of centrality measures would reflect each centrality dimension existing within student networks (i.e. including those that have not been observed yet). A lucid ontology has no construct overload (i.e. homonymy) and hence only constructs that map to (at most) a single aspect of the underlying reality. For example, a lucid taxonomy would not contain a theoretical category that would refer to several latent dimensions of student centrality. A laconic ontology has no construct redundancy (i.e. synonymy) and hence at most one construct for each aspect of the underlying reality. For example, a laconic taxonomy would contain only one theoretical category for each centrality dimension existing within student networks. 
In the absence of these criteria, the taxonomy could lead to ambiguous interpretations of the centrality measures pertaining, for instance, to a student network. Therefore, this paper aims to develop a methodology that verifies the proposed theoretical taxonomy on real data, i.e., a student network.
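The four criteria can be read as simple checks on a mapping between theoretical categories and observed dimensions. The following is a minimal sketch under stated assumptions: the function name `check_taxonomy`, the category labels and the dimension numbers are hypothetical and are not taken from the paper's tables.

```python
def check_taxonomy(category_to_dims, observed_dims):
    """Evaluate the four criteria on a category -> set-of-dimensions mapping.

    category_to_dims: dict mapping each theoretical category to the set of
    observed dimensions it is matched with (hypothetical input format).
    observed_dims: set of dimensions found in the data.
    """
    covered = set().union(*category_to_dims.values())
    owners = {
        d: [c for c, dims in category_to_dims.items() if d in dims]
        for d in observed_dims
    }
    return {
        # sound: no construct excess, every category matches some dimension
        "sound": all(dims & observed_dims for dims in category_to_dims.values()),
        # complete: no construct deficit, every dimension has a category
        "complete": observed_dims <= covered,
        # lucid: no overload, a category maps to at most one dimension
        "lucid": all(len(dims & observed_dims) <= 1
                     for dims in category_to_dims.values()),
        # laconic: no redundancy, a dimension has at most one category
        "laconic": all(len(cats) <= 1 for cats in owners.values()),
    }


# Toy mapping (made up for illustration only).
toy_taxonomy = {
    "distance": {1, 2},   # overloaded: refers to two dimensions
    "connectivity": {3},
    "prestige": {3},      # redundant with "connectivity"
    "topology": {4},
}
toy_result = check_taxonomy(toy_taxonomy, observed_dims={1, 2, 3, 4, 5})
```

In this toy case the taxonomy is sound but neither complete (dimension 5 has no category), lucid, nor laconic.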

Student networks: the centrality question
Links between student centrality within their peer network and education outcomes are the object of many studies. This research concerns performance and achievement (e.g., Thomas, 2000; Yang and Tang […] biological networks in Ashtiani & al., 2018), the studies dedicated to student networks mostly conceptualized centrality as (1) the simplest measure of centrality, i.e., the degree centrality or number of ties maintained, (2) the closeness centrality, i.e., the distance to other nodes in the network, and/or (3) the betweenness index, i.e., the number of times that a node is located on shortest paths linking other nodes. Among these studies, some results showed positive effects of student centrality on education outcomes while others demonstrated no or even negative impacts. For instance, with regard to student performance, while the degree and closeness centralities seemed to have a positive effect on achievement, the impact of betweenness centrality appeared to be less clear. […] different centrality indices do not always match (e.g., a node has high scores on some centrality measures, but average or low scores on other indicators of centrality) (Kiss & Bichler, 2008; Landherr & al., 2010). In order to identify central actors within a network, it is therefore crucial to test several centrality measures (i.e., based on the specific type of network and based on the centrality definitions), and to identify the most appropriate indices among those measures (Kiss & Bichler, 2008; Landherr & al., 2010; Batool & Niazi, 2014). Finally, as stated earlier, many metrics other than degree, closeness and betweenness were developed in SNA in order to assess a node's centrality within a graph. However, student networks were rarely represented using those alternative centrality measures. 
One additional objective of this paper is to highlight which centrality measures might be the most informative for student networks. This question has received little attention in student network research, and the literature on the topic mainly concerns types of networks other than student graphs (e.g., biological networks, terrorist cells, customer networks). As Ashtiani & al. (2018) argue for biological networks, we argue there is a need for guidelines pertaining to the relevance of centrality measures for student networks. This work aims at selecting the best-suited centrality measures for representing a student network or for studying the impacts of student networks on education outcomes such as student performance. Using relevant centrality indices might enable a deeper understanding of student networks and of the mechanisms within those networks.

[…]

(2) Which type of centrality might be suitable to investigate when we study the links between student centrality and educational outcomes such as learning, academic performance, dropout. Indices considered as irrelevant in regard to our network and to research questions related to the network were not selected for further analyses;
(3) For the remaining indices: we analyzed whether there were any measures whose formulas were highly similar (i.e., that differ only by very few parameters). For instance, the communicability betweenness centrality, the flow betweenness centrality, the load centrality and the stress centrality are each variants of the betweenness centrality. In those cases, we chose the centrality index figuring in the highest number of documents on Google Scholar (e.g., for the betweenness centrality, the measure that was proposed by Freeman in 1979);
(4) For the remaining indices: in order to continue the methodological process with a reasonable number of indices, centrality measures that figured in very few documents on Google Scholar were not selected for further analyses.

Among the complete list of suitable centralities presented in Appendix 1, the set of the chosen centrality measures is composed of the following eighteen indices. We consider a centrality measure that takes into account the edge direction (i.e., that can be computed separately on the incoming and on the outgoing ties) as two distinct indices:

Computation of the centrality measures on a real network
We used the igraph package (Csardi & Nepusz, 2006)
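To make the three most common indices concrete, here is a minimal pure-Python sketch on a directed toy graph. It does not reproduce the igraph or CINNA computations used in the study: the function names, the toy nominations and the closeness convention for unreachable nodes (number of reachable nodes divided by the sum of their distances) are assumptions made for illustration.

```python
from collections import deque


def reverse(adj):
    """Flip every tie so that in-centralities can reuse the out- functions."""
    rev = {v: [] for v in adj}
    for u, targets in adj.items():
        for v in targets:
            rev[v].append(u)
    return rev


def bfs_distances(adj, source):
    """Shortest-path lengths from source, following outgoing ties."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def out_closeness(adj, node):
    """Reachable nodes divided by the sum of their distances (assumed convention)."""
    dist = bfs_distances(adj, node)
    total = sum(d for n, d in dist.items() if n != node)
    return (len(dist) - 1) / total if total else 0.0


def betweenness(adj):
    """Shortest-path betweenness for a directed, unweighted graph (Brandes)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            stack.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in adj}
        while stack:                  # accumulate dependencies backwards
            w = stack.pop()
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Hypothetical nominations: every node must appear as a key.
nominations = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
out_degree = {v: len(t) for v, t in nominations.items()}
in_degree = {v: len(t) for v, t in reverse(nominations).items()}
```

Calling `out_closeness(reverse(nominations), "D")` gives the in-closeness of D under the same convention, which illustrates why a direction-sensitive measure is treated as two distinct indices.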

The PCA as methodological tool
PCA is a factorial analysis method which uses the correlations (i.e., inter-dependencies) between variables (in our case, the centrality indices) to reduce the p-dimensional space of these variables into a k-dimensional space (with k < p). PCA results in a minimal number of principal components (i.e., factorial axes or latent dimensions) that corresponds to maximum data dispersion, these principal components being linear combinations of the initial variables. First, we performed PCA on our centrality indices to highlight the k latent dimensions of centrality within our student network. […]

The data were collected in October 2016 at Saint-Louis University in Brussels, Belgium. 574 first-generation freshmen students (i.e., students registered in their first year of studies and for the first time) were surveyed about their friendship ties at university. In the survey, the students' friends were described as the 'persons with whom students spend personal time, with whom they interact on a regular basis (in face to face, by phone or on online social medias), that they see outside classes, that they trust, and/or with whom they share their personal issues' (Thomas, 2000; Cho & al., 2007; Hommes & al., 2012). The nodes graph, i.e., the student network, was drawn from the collected data. Since the survey was not mandatory, students who did not participate could nevertheless be cited as ties, a case of missing or non-respondent actors (Robin & al., 2004). A thorough analysis of our graph reveals that 296 students were nominated at least once by the 574 respondents but did not complete the survey. According to Wasserman & Faust (1994), SNA methods require the complete recording of interactions between actors belonging to the studied network. Using the respondent only approach (i.e., in our case deleting the nominations that correspond to the 296 students who did not

Correlations between the centrality measures
We computed the eighteen centrality measures for each node (i.e., for each student). The ρ correlation coefficients between each pair of centrality measures, together with their significance levels, are shown in Table 2

The centrality latent dimensions within student networks and the verification of the taxonomy
Three conditions are necessary for PCA to be relevant. First, the variables must be correlated, which seems to be the case as shown above. Second, Bartlett's test verifies whether highly […]

Table 4 shows the variables and their factor loading on the component for which their saturation is the highest. The closeness, residual closeness, eccentricity and geodesic k-path out-centralities are correlated with the first factorial axis. According to the definitions and formulas (in Appendix 2) of these four indices, this first component or latent dimension might therefore reflect the ease with which a node reaches the other nodes, connects them and transmits information throughout the network. The second dimension, highly correlated with Kleinberg's authority and hub centrality scores and with the eigenvector prestige score, relates to centrality through the number of connections with prestigious friends. The geodesic k-path and residual closeness in-centralities, the betweenness and the PageRank score load on the third factor. This dimension might therefore denote the ability to control the received information, and a node's degree of significance, especially by being located on the (local) shortest paths converging toward the node. The fourth dimension (the maximum neighborhood components and the cross-clique connectivity) is linked to the degree of cross-connectivity of a student and of his or her neighbors. The fifth component, highly correlated with the eccentricity and closeness in-centralities, relates to the ease with which a node is reached by the other nodes in the network and to its ability to receive information. The sixth and last dimension reflects the degree of bottleneck, i.e., the degree of confluence through a given student.
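The assignment rule behind this reading (each index is attached to the component on which it saturates the most) can be sketched as follows. This is an illustration only: the loading values and index names are made up, and taking the largest loading in absolute value is an assumption, since the sign convention is not stated here.

```python
def assign_to_components(loadings):
    """Map each variable to the 1-based component where |loading| is largest."""
    assignment = {}
    for name, row in loadings.items():
        best = max(range(len(row)), key=lambda k: abs(row[k]))
        assignment[name] = best + 1
    return assignment


def group_by_component(assignment):
    """Invert the assignment: component -> sorted list of variables."""
    groups = {}
    for name, component in assignment.items():
        groups.setdefault(component, []).append(name)
    return {c: sorted(names) for c, names in sorted(groups.items())}


# Hypothetical loadings on three components (not the values of Table 4).
toy_loadings = {
    "index_a": [0.91, 0.12, -0.05],
    "index_b": [0.84, -0.30, 0.10],
    "index_c": [0.15, -0.88, 0.02],
    "index_d": [0.20, 0.25, 0.79],
}
toy_groups = group_by_component(assign_to_components(toy_loadings))
```

Each resulting group is then interpreted from the definitions of the indices it contains, as done above for the six dimensions.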

In order to verify the proposed taxonomy, we compared the theoretical classification (in Table 1) with the six centrality dimensions that emerged from the PCA, i.e., with their composition in terms of the eighteen centrality indices (in Table 4).

Within the taxonomy, four indices are gathered within a first category (i.e., category number 1 in Table 1), which is built on the criterion of a geodesic distance-based formula and on the criterion of access to information as centrality corollary: the eccentricity, closeness, residual closeness and geodesic k-path centralities. Those four indices, but computed for the outgoing ties only, saturate on the first latent dimension in the PCA (Table 4), which therefore matches category number 1 in Table 1. This theoretical category is therefore validated, but only for centralities computed on the nominations that are made by a node. Moreover, the closeness and eccentricity centralities that are computed on the incoming ties (i.e., the nominations received by a node) and that both saturate on the fifth factorial axis (Table 4) seem to form a subset within the first theoretical category in Table 1.

Then, in the theoretical taxonomy, we assigned the residual closeness and the geodesic k-path centralities to the first category, but also to a second family of indices (i.e., category number 2 in Table 1) that are based on a geodesic-path formula and that rely on information control and diffusion. As seen above, residual closeness and geodesic k-path out-centralities have been verified as being part of theoretical category number 1. However, they are validated for theoretical category number 2 when they are computed on the incoming ties. They form a latent construct (i.e., the third dimension in Table 4) together with the betweenness index, the latter being also validated for theoretical category number 2 of centrality measures. 
As shown for the eccentricity and the closeness centralities, according to the nature of the ties (i.e., in- versus out-), the residual closeness and geodesic k-path indices therefore seem to be divided into two distinct categories.

Then, the second latent dimension resulting from the PCA ( […] respectively on the degree of connectivity and on the prestige of the connections, which both reflect power and influence.

The fifth theoretical category in Table 1 concerned centrality indices that take the topological properties of the neighborhood into account in their formula and that relate to information diffusion and to cohesiveness roles. The cross-clique connectivity and the maximum neighborhood components (in- and out-) were proposed as being part of this category, which is confirmed by the PCA, those three indices gathering on the same latent factor (i.e., the fourth dimension in Table 4). It should be noted that since a clique is composed of three or more nodes, the cross-clique connectivity was also proposed as part of the third theoretical category in Table 1, which is based on the number of connections. However, the PCA confirms that cross-clique connectivity belongs to the same category as the MNC scores. […] Table 4). Although based on shortest paths in their algorithm, bottlenecks therefore seem to measure a different type of centrality than the residual closeness, geodesic k-path and betweenness indices. Second, we expected the PageRank score to be validated within the same theoretical category as the eigenvector and Kleinberg's authority and hub scores, since the PageRank formula takes into account the prestige of the incoming ties when computing a node's centrality. Instead, PCA shows a maximum saturation of PageRank on the same factorial axis as the geodesic k-path (in-), the residual closeness (in-) and the betweenness centralities. 
Table 5 summarizes the above: the first latent dimension validates the first theoretical category, but for indices computed on the outgoing ties only, while the fifth dimension matches, but only for incoming ties, with two centrality measures that were proposed within theoretical category number 1. A single dimension (i.e., dimension number 2) validates the proximity of indices that were proposed as belonging to two theoretical categories (i.e., the third and fourth categories in Table 1). Then, except for the PageRank score (for which we expected a saturation on the second dimension) and except for the two Bottleneck indices (which both saturate on the sixth dimension), the third dimension matches theoretical category number 2. Finally, the fifth theoretical category of centrality measures is validated by the fourth PCA dimension.

Table 5. Centrality dimensions that emerged from the PCA and theoretical classification: Comparison.

Centrality measures
Latent dimensions (Table 4) Taxonomy categories (Table 1) Eccentricity (out-) 1 1 Closeness (out-) Residual closeness (out-) Geodesic k-path (out-) Eigenvector prestige score 2 3 & 4 Hub score Authority score Page rank 6 2 Bottleneck (out-)

As detailed above, the first category is based on geodesic distance and access to information, the second category is based on geodesic path and diffusion of information, the third category is based on connectivity and power, the fourth category is based on neighborhood's prestige and on power, and the fifth category is based on neighborhood's topology and diffusion of information.

The most representative measures of centrality for student networks
One of the objectives of the paper was to find the best centrality measures, i.e., the most representative and significant indices, when we investigate and represent student networks. As in Ashtiani & al. (2018) for biological networks, our goal was therefore to establish, among a set of centrality indices, the measures that best distinguish the central students from the peripheral ones.

Then, we computed the average contribution of each centrality measure to the factorial plane (i.e., the average contribution on the six factorial axes). We compared each average contribution to a threshold of (1/18) × 100 ≈ 5.55%, i.e., the contribution expected from each centrality measure if all eighteen indices contributed equally. Values higher (resp. lower) than 5.55% indicate a contribution that is above (resp. below) the theoretical average contribution. Table 6
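The comparison against the uniform threshold can be sketched as follows. Two assumptions are made here and are not confirmed by the text: the contribution of a variable to an axis is taken as its squared loading divided by the sum of squared loadings on that axis, and the average over axes is a simple unweighted mean (an eigenvalue-weighted mean is another common choice). The loadings are made up.

```python
def axis_contributions(loadings):
    """Percent contribution of each variable to each axis.

    Assumed definition: 100 * loading^2 / sum of squared loadings on the axis,
    so that contributions on any axis add up to 100.
    """
    n_axes = len(next(iter(loadings.values())))
    col_sums = [sum(row[k] ** 2 for row in loadings.values())
                for k in range(n_axes)]
    return {
        name: [100.0 * row[k] ** 2 / col_sums[k] for k in range(n_axes)]
        for name, row in loadings.items()
    }


def average_contributions(loadings):
    """Unweighted mean contribution of each variable over all axes (assumption)."""
    contrib = axis_contributions(loadings)
    return {name: sum(vals) / len(vals) for name, vals in contrib.items()}


# Hypothetical loadings of three indices on two axes.
toy_loadings = {
    "index_a": [0.9, 0.1],
    "index_b": [0.8, 0.3],
    "index_c": [0.2, 0.9],
}
threshold = 100.0 / len(toy_loadings)   # uniform share, 100 / p
averages = average_contributions(toy_loadings)
above_threshold = sorted(n for n, v in averages.items() if v > threshold)
```

With eighteen indices, the same `threshold` expression gives 100 / 18, the uniform share used in the text.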

V. Discussion
We applied the integrated methodology that we developed, i.e., (1) […] correlations between several centrality measures (e.g., between the eccentricity and the closeness centralities, between the geodesic k-path centrality and the betweenness index, between the closeness centrality and the eigenvector prestige score…). Concerning the centrality dimensions that exist within student networks, our results show the emergence of six latent constructs: (1) the ability of a student to reach and to transfer information, (2) its ability to be reached and to receive information, (3) its significance for the network structure and its control over the information flow, (4) its importance through the number of connections with prestigious students, (5) its degree of cross-connectivity, and (6) its position as a confluent node. First, these results indicate that a centrality measure that is computed for the incoming links seems to differ, in its meaning and in its impacts on nodes, from the same centrality measure computed for the outgoing ties. Our results show that the eccentricity, closeness, residual closeness and geodesic k-path centralities that are computed for the outgoing ties saturate on a different latent construct than the eccentricity and the closeness centralities that are computed for the incoming links. In regard to student networks, this result implies that a student who is close to the other nodes of the network through outgoing ties is not automatically close to them through incoming connections. This also indicates that in student networks, access to information might differ depending on a node's incoming and outgoing ties. Then, according to the nature of the ties (i.e., in- versus out-), the residual closeness and geodesic k-path centralities are also divided into two categories or dimensions. 
In other words, the number of nodes that a student can reach, the student being located on (local) geodesic paths, might differ from the number of nodes that can reach the student, also through (local) geodesic paths. Moreover, the fact that residual closeness and geodesic k-path centralities are divided into two dimensions shows that the outgoing links seem important for the access to information, while the incoming ties appear relevant for information control. In conclusion, a node might be highlighted as significant when centralities are computed on its incoming (resp. outgoing) links, but not shown as central when its outgoing (resp. incoming) ties are used in the computations. Second, PCA shows that PageRank saturates on the same factorial axis as the geodesic k-path (in-), the residual closeness (in-) and the betweenness centralities. As far as student networks are concerned, these results suggest that students highlighted as central, because they are cited by many other students having a high PageRank score, are also significant through a high number of neighbors that can reach them (i.e., those neighbors being located at most k steps away from them). On this basis, we might infer that students having a high PageRank score are geographically close to the student(s) they cite, and belong to the same neighborhood as these students' closest neighbors.

Related to the validation of the theoretical taxonomy, except for the Bottleneck and PageRank scores, the five proposed categories of the theoretical classification are for the most part validated, since they match the latent dimensions highlighted by the PCA. Applying the integrated methodology to real data nevertheless made it possible to improve the taxonomy by adding some granularity, for instance by showing that the direction of the ties should be considered in a theoretical classification of centrality measures. Specifically, regarding the four necessary criteria of taxonomies (i.e., sound, complete, lucid and laconic), the methodology, tested on real data, showed that:
- The proposed taxonomy seems to be sound (i.e., does not contain useless constructs) since each proposed theoretical category of indices matches one latent dimension of centrality within the real network.
- In order to be complete (i.e., to cover each aspect of the centrality notion within a student network), an additional category within the theoretical taxonomy should be proposed for the Bottleneck centrality, which seems to cover a particular type of centrality.
- In order to be lucid (i.e., containing categories that map to (at most) a single aspect of centrality), as explained above, categories that take the direction of the ties into account should be added to the taxonomy.
- In order to be laconic (i.e., with no construct redundancy), the two categories 'Connectivity based' (i.e., category number 3 in Table 1) and 'Prestige of Neighborhood' (i.e., category number 4 in Table 1) should be merged into a single construct, since they do not seem to relate to different aspects of centrality (i.e., since the indices proposed for both categories, namely the eigenvector prestige score and Kleinberg's authority centrality scores, saturate on only one latent dimension). 
[…] component measure (out-), and (4) Kleinberg's authority score. Using these indices is expected to allow for capturing and investigating different dimensions of student centrality (i.e., its degree of confluence, its ability to be reached and to receive information, its degree of cross-connectivity and its centrality through prestigious connections), while making sure to select the best centrality candidates, i.e., those which reflect a maximum of variability between students.

Two limitations of this research must be pointed out. First, as explained above, 50% of the information contained in a variable must be preserved in the factorial plane in order to consider its representation as sufficient and of good quality. Each variable met this requirement, but in comparison with the other centrality measures (for which the representation was above or equal to 78%), the PageRank score and the bottleneck indices presented lower percentages of contained information (59% for the PageRank score, 59% for the bottleneck computed on the outgoing ties and 63% for the bottleneck computed on the incoming ties). If a variable is not well represented in the factorial plane, its proximity to other variable(s) on the factorial plane may be misinterpreted (i.e., proximity may be taken for correlation whereas it is not the case in reality) (Tufféry, 2012). In our case, while the PageRank score has a percentage of contained information that is above 50% (the critical threshold), its proximity to the geodesic k-path centrality (in-), to the residual closeness centrality (in-) and to the betweenness should be considered with caution. The observed proximity between the two bottleneck scores should also be validated by subsequent studies. 
The second limitation relates to the high proportion (i.e., 34%) of missing actors or non-respondents (i.e., students that were cited by the respondents, but that did not complete the survey). Several authors (e.g., Huisman, 2009; Žnidaršič & al., 2012) showed that high levels of survey non-response impact the structural properties of social networks and might cause underestimation of the computed coefficients (Kossinets, 2006). Moreover, as explained above, the complete recording of interactions between actors belonging to the studied network is required in SNA. In order to fulfill this condition, two methodological approaches can be used: the complete cases approach or respondent only approach (i.e., deleting the nominations corresponding to students that do not complete the survey) or the imputation approach (i.e., imputing ties for the missing actors). Since the respondent only approach might produce more biased estimates than the imputation approach (e.g., Huisman, 2009; Wang & al., 2016; Gile & Handcock, 2017), we chose to impute ties for the non-respondents by means of ERGMs, which perform better than simpler imputation procedures (Huisman & al., 2018).

Three perspectives could be pursued in future research. First, the theoretical taxonomy could be tested on other student networks, but also within contexts other than educational settings, i.e., on other types of networks (e.g., organizational, biological …), in order to verify its validity (i.e., its soundness, lucidity and laconic nature) and to generalize its results. Also, in order to ensure its completeness and to include centrality measures that have not been observed yet, measures other than those that were chosen in this paper should be tested. Second, our study should be replicated on other student networks in order to validate the best centrality measures when we investigate such graphs. 
[…] argue that the choice and identification of the best centrality candidates should be the first step when investigating networks and when identifying their key players. The integrated methodology that we propose might therefore be useful for future studies related to networks and node centralities (i.e., to test any set of indices on any type of network).

In this research, we proposed a theoretical taxonomy of a chosen set of centrality indices that we thoroughly described, we highlighted latent centrality dimensions that exist within student networks, we verified and improved the proposed taxonomy on an existing student network, and we pointed out which centrality measures should be used when investigating student networks. We also presented an integrated methodology that allows pursuing these objectives and that might be applied to other centrality indices and/or to other types of networks. Our results demonstrate that for student networks, the direction of the ties (i.e., the incoming versus outgoing links) should be considered in the centrality computations, since it brings more information about a student's centrality within his or her peer network. Results also show that when dealing with student networks, the centrality measures that best represent the six factorial axes that emerged from the PCA (i.e., the six indices that saturate the most on each factorial axis) should be integrated in future studies, since they seem to cover different latent dimensions of centrality. Finally, our results encourage using indices other than those usually employed when investigating student networks.