On the calculation of betweenness centrality in marine connectivity studies using transfer probabilities

Betweenness has been used in a number of marine studies to identify portions of sea that sustain the connectivity of whole marine networks. Herein we highlight the need for methodological exactness in the calculation of betweenness when graph theory is applied to marine connectivity studies based on transfer probabilities. We show the inconsistency of calculating betweenness directly from transfer probabilities and propose a new metric for the node-to-node distance that resolves it. Our argument is illustrated by both simple theoretical examples and the analysis of a literature data set.


Introduction
In the last decade, graph theory has increasingly been used in ecology and conservation studies [1] and particularly in marine connectivity studies (e.g., [2] [3] [4] [5] [6]). Graphs are a mathematical representation of a network of entities (called nodes) linked by pairwise relationships (called edges). Graph theory is a set of mathematical results that allow one to calculate different measures identifying nodes, or sets of nodes, that play specific roles in a graph (e.g., [7]). The application of graph theory to the study of marine connectivity typically consists in representing portions of sea as nodes; the edges between these nodes then represent transfer probabilities between these portions of sea.
In many marine connectivity studies, it is of interest to identify specific portions of sea through which a relevant amount of the transfer across a graph passes. A well-known graph theory measure is frequently used for this purpose: betweenness centrality. In the literature, high values of this measure are commonly assumed to identify nodes sustaining the connectivity of the whole network. For this reason, a high value of betweenness has been used in the framework of marine connectivity to identify migration stepping stones [2], genetic gateways [16], and marine protected areas ensuring good connectivity between them [5].
Our aim in the present letter is to highlight some errors that can occur in implementing graph theory analysis. In particular, we focus on the definition of edges when one is interested in calculating betweenness centrality and other related measures. We also point out two papers in the literature in which this methodological inconsistency can be found: [3] and [5].
In Materials and Methods we introduce the graph theory concepts essential for our scope. In Results we present our argument on the basis of the analysis of a literature data set. In the last section we draw our conclusions.

Materials and methods
A simple graph G is a couple of sets (V, E), where V is the set of nodes and E is the set of edges. The set V represents the collection of objects under study that are pairwise linked by an edge a_ij, with (i, j) ∈ V, representing a relation of interest between two of these objects. If a_ij = a_ji, ∀(i, j) ∈ V, the graph is said to be 'undirected', otherwise it is 'directed'. The second case is the one we deal with when studying marine connectivity, where the edges' weights represent the transfer probabilities between two zones of sea (e.g., [3] [4] [5] [6]).
If more than one edge in each direction between two nodes is allowed, the graph is called a multigraph. The number of edges between each pair of nodes (i, j) is then called the multiplicity of the edge linking i and j.
The in-degree of a node k, deg⁺(k), is the sum of all the edges that arrive at k: deg⁺(k) = Σ_i a_ik. The out-degree of a node k, deg⁻(k), is the sum of all the edges that start from k: deg⁻(k) = Σ_j a_kj. The total degree of a node k, deg(k), is the sum of the in-degree and out-degree of k: deg(k) = deg⁺(k) + deg⁻(k).

In a graph, there can be multiple ways (called paths) to go from a node i to a node j passing by other nodes. The weight of a path is the sum of the weights of the edges composing the path itself. In general, it is of interest to know the shortest or fastest path between two nodes, i.e. the one with the lowest weight. But it is even more instructive to know which nodes participate in the greatest number of shortest paths. In fact, this permits measuring the influence of a given node over the spread of information through a network. This measure is called the betweenness value of a node in the graph. The betweenness value of a node k, BC(k), is defined as the fraction of shortest paths existing in the graph, σ_ij, with i ≠ j, that effectively pass through k, σ_ij(k), with i ≠ k ≠ j:

BC(k) = (1 / ((N − 1)(N − 2))) Σ_{i ≠ k ≠ j} σ_ij(k) / σ_ij,  with (i, j, k) ∈ V,   (Eq 1)

where N is the number of nodes in the graph. Note that the condition i ≠ k ≠ j means that betweenness is not influenced by direct connections between the nodes. Betweenness is normalized by the total number of possible connections in the graph once node k is excluded, (N − 1)(N − 2), so that 0 ≤ BC ≤ 1.

Although the interpretation of betweenness is seemingly straightforward, one must be careful in its calculation. In fact, its interpretation is sensitive to the node-to-node metric one chooses to use as edge weight. If, as is frequently the case in marine connectivity studies, one uses transfer probabilities as edge weights, betweenness loses its original meaning. Based on additional details on their methods, personally given by the authors of [3] and [5], this was the case in those studies.
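As a minimal illustration, the degree definitions above can be sketched in a few lines of Python; the three-node graph and its edge weights are invented for the example:

```python
# Toy directed graph: keys are (origin, destination) pairs, values are the
# edge weights a_ij (e.g., transfer probabilities). All numbers are invented.
a = {("A", "B"): 0.5, ("B", "C"): 0.2, ("A", "C"): 0.1}

def in_degree(k):
    """deg+(k): sum of the weights of the edges arriving at k."""
    return sum(w for (i, j), w in a.items() if j == k)

def out_degree(k):
    """deg-(k): sum of the weights of the edges starting from k."""
    return sum(w for (i, j), w in a.items() if i == k)

def degree(k):
    """Total degree: in-degree plus out-degree."""
    return in_degree(k) + out_degree(k)
```

For instance, node C receives 0.2 + 0.1 = 0.3, node A emits 0.5 + 0.1 = 0.6, and node B has total degree 0.5 + 0.2 = 0.7.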
In those cases, the edge weight would decrease when the probability decreases, and the shortest paths would be the sums of the edges with the lowest values of transfer probability. As a consequence, high betweenness would be associated with the nodes through which a high number of improbable paths pass: exactly the opposite of the original purpose of betweenness. Hence, defining betweenness using Eq 1 directly on transfer probabilities (the case of [3] and [5]) leads to an inconsistency that affects the interpretation of betweenness values. Alternative definitions of betweenness, accounting for all the paths between two nodes and not just the most probable one, have been proposed to analyze graphs in which the edge weight is a probability [8], avoiding the above inconsistency.
Herein, we propose to solve the inconsistency while keeping the original betweenness definition, by using a new metric for the edge weights instead of modifying the betweenness definition. The new metric transforms transfer probabilities a_ij into a distance, conserving the original meaning of betweenness by ensuring that a larger transfer probability between two nodes corresponds to a smaller node-to-node distance. Hence, the shortest path between two nodes effectively is the most probable one, and high betweenness is associated with the nodes through which a high number of probable paths pass.
In the first place, in defining the new metric, we need to reverse the order of the probabilities, so that higher values of the old metric a_ij correspond to lower values of the new one. In the second place, we must consider three other facts: (i) transfer probabilities a_ij are commonly calculated with regard to the positions of the particles only at the beginning and at the end of the advection period; (ii) the probability to go from i to j does not depend on the node the particle came from before arriving in i; and (iii) the calculation of the shortest paths implies the summation of a variable number of transfer probability values. Note that, as the a_ij values are typically calculated on the basis of the particles' positions at the beginning and at the end of a spawning period, we are dealing with paths whose values are calculated taking into account different numbers of generations. Therefore, the transfer probabilities between sites are independent from each other and should be multiplied with each other when calculating the value of a path. Nevertheless, the classical algorithms commonly used in graph theory analysis calculate the shortest paths as the summation of the edges composing them (e.g., the Dijkstra algorithm [17] or the Brandes algorithm [18]). Therefore, these algorithms, if directly applied to the probabilities at play here, are incompatible with their independence.
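The incompatibility can be seen with a two-edge path: under independence, the probability of the whole path is the product of its edge probabilities, whereas a shortest-path algorithm sums the edge weights. A minimal numeric check, with invented probabilities:

```python
# Two independent transfer probabilities along a hypothetical path i -> l -> j
# (values invented for the example).
a_il, a_lj = 0.3, 0.05

# Independence: the probability of the whole path is the product of its edges.
p_path = a_il * a_lj          # 0.015

# A classical shortest-path algorithm would instead sum the edge weights,
# yielding a value with no probabilistic meaning for the path.
raw_sum = a_il + a_lj         # 0.35
```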
A possible workaround could be to forgo the algorithms in [17] and [18] and use instead the 10th algorithm proposed in [19]. Therein, the author suggests defining the betweenness of a simple graph via its interpretation as a multigraph, and shows that the value of a path can be calculated as the product of the multiplicities of its edges. When the multiplicity of an edge is set equal to the weight of the corresponding edge in the simple graph, one can calculate the value of a path as the product of its edges' weights a_ij. However, this algorithm selects the shortest path on the basis of the number of steps (or hop count) between a pair of nodes (Breadth-First Search algorithm [20]). This causes the algorithm to fail in identifying the shortest path in some cases. For example, in Fig 1 it would identify the path ACB (2 steps with total probability 1 × 10⁻⁸) when, instead, the most probable path is ADEB (3 steps with total probability 1 × 10⁻⁶). See Table 1 for more details.
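This failure mode can be reproduced with a short sketch: running Dijkstra over hop counts (all edge weights equal to 1) selects ACB, while weighting the edges by log(1/a_ij) recovers the most probable path ADEB. The individual edge probabilities below are assumptions chosen only to match the stated path totals, not values taken from Fig 1:

```python
import heapq
import math

# Edge probabilities reproducing the Fig 1 situation (invented values whose
# products match the stated path probabilities).
a = {
    ("A", "C"): 1e-4, ("C", "B"): 1e-4,                     # ACB: 2 hops, prob 1e-8
    ("A", "D"): 1e-2, ("D", "E"): 1e-2, ("E", "B"): 1e-2,   # ADEB: 3 hops, prob 1e-6
}

def dijkstra_path(weights, src, dst):
    """Return the path from src to dst minimizing the sum of edge weights."""
    best = {src: 0.0}
    heap = [(0.0, src, [src])]
    while heap:
        dist, u, path = heapq.heappop(heap)
        if u == dst:
            return path
        for (i, j), w in weights.items():
            if i == u:
                nd = dist + w
                if j not in best or nd < best[j]:
                    best[j] = nd
                    heapq.heappush(heap, (nd, j, path + [j]))
    return None

# Hop count (unit weights) reproduces the Breadth-First Search choice:
hops = {e: 1.0 for e in a}
print(dijkstra_path(hops, "A", "B"))   # ['A', 'C', 'B'], the improbable path

# The weights log(1/a_ij) make the most probable path the shortest:
d = {e: math.log(1 / p) for e, p in a.items()}
print(dijkstra_path(d, "A", "B"))      # ['A', 'D', 'E', 'B']
```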
However, by changing the metric used in the algorithms, it is possible to calculate the shortest path in a meaningful way with the algorithms in [17] and [18]. In particular, we propose to define the weight of an edge between two nodes i and j as:

d_ij = log(1 / a_ij),   (Eq 2)

This definition is the composition of two functions: h(x) = 1/x and f(x) = log(x). The use of h(x) allows one to reverse the ordering of the metric in order to make the most probable path the shortest. The use of f(x), thanks to the basic properties of logarithms, allows the use of classical shortest-path finding algorithms while dealing correctly with the independence of the transfer probabilities. It is worth mentioning that the values d_ij = ∞, coming from the values a_ij = 0, do not influence the calculation of betweenness values via the Dijkstra and Brandes algorithms. Note that, for any (i, l, j) ∈ V,

d_il + d_lj = log(1 / a_il) + log(1 / a_lj) = log(1 / (a_il a_lj)),

so that the summed weight of a path is the logarithm of the inverse of its total probability, thus being suitable to be used in conjunction with the algorithms proposed by [17] and [18]. Also, note that both a_ij and d_ij are dimensionless.

Eq 2 is the only metric that allows one to consistently apply the algorithms in [17] and [18] to transfer probabilities. Other metrics would make the weight decrease when the probability increases: for example, 1 − a_ij, 1/a_ij, −a_ij, or log(1 − a_ij). However, the first three do not account for the independence of the transfer probabilities along a path. Furthermore, log(1 − a_ij) takes negative values, as 0 ≤ a_ij ≤ 1. Therefore, it cannot be used to calculate shortest paths, because the algorithms in [17] and [18] would either endlessly go through a cycle (see Fig 2a and Table 2) or choose the path with more edges (see Fig 2b and Table 2), hence arbitrarily lowering the value of the paths between two nodes.
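The failure of, e.g., 1 − a_ij to respect independence can be checked with a small numeric example (all probabilities invented): a two-hop path of total probability 0.36 is more probable than a direct edge of probability 0.25, yet summing 1 − a_ij declares the direct edge shorter, while the metric of Eq 2 ranks the two paths consistently with their probabilities:

```python
import math

# A hypothetical two-hop path (probabilities 0.6 then 0.6) versus a direct
# edge of probability 0.25; values invented for the example.
two_hop = [0.6, 0.6]
direct = [0.25]

def path_probability(path):
    """Independent transfers: the path probability is the product."""
    return math.prod(path)

def weight_one_minus(path):
    """Candidate metric 1 - a_ij, summed along the path."""
    return sum(1 - a for a in path)

def weight_log(path):
    """Metric of Eq 2, d_ij = log(1/a_ij), summed along the path."""
    return sum(math.log(1 / a) for a in path)

# The two-hop path is the more probable one (0.36 > 0.25) ...
assert path_probability(two_hop) > path_probability(direct)
# ... yet 1 - a_ij declares the direct edge "shorter" (0.75 < 0.80):
assert weight_one_minus(direct) < weight_one_minus(two_hop)
# The metric of Eq 2 ranks paths consistently with their probabilities:
assert weight_log(two_hop) < weight_log(direct)
```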

Results
The consequences of using the raw transfer probability (a_ij) rather than the distance we propose (d_ij) are potentially radical. To show this, we used 20 connectivity matrices calculated for [21]. They were calculated from Lagrangian simulations using a 3D circulation model with a high horizontal resolution of 750 m [22]. Spawning was simulated by releasing 30 particles in the center of each of 32 reproductive sites (hereafter identified as nodes) for benthic polychaetes along the shore of the Gulf of Lion (NW Mediterranean Sea), on the 30 m isobath, every hour from January 5 until April 13 in 2004 and 2006. The proportions of particles coming from an origin node and arriving at a settlement node after 3, 4 and 5 weeks were weight-averaged to compute a connectivity matrix for larvae with a competency period extending from 3 to 5 weeks.
Note that the connectivity matrices' values strongly depend on the circulation present in the Gulf during the period of the dispersal simulations. The typical circulation of the Gulf of Lion is a westward current regime [25]; this was the case for matrices #7, #11, #12, #15 and #17. However, other types of circulation are often observed. In particular, matrix #1 was obtained after a period of reversed (eastward) circulation, a regime less frequent than the westward one [26]. Matrices #14, #10 and #13 correspond to a circulation pattern with an enhanced recirculation in the center of the gulf. Finally, matrices #2, #3, #5, #6, #8, #9, #14, #16, #18, #19 and #20 correspond to a rather mixed circulation with no clear pattern.
As an example, in Fig 3 we show the representation of the graph corresponding to matrix #7. The arrows starting from a node i and ending at a node j represent the direction of the element a_ij (in Fig 3a) or d_ij (in Fig 3b). The arrows' color code represents the magnitude of the edges' weights. The nodes' color code indicates the betweenness values calculated using the metric a_ij (in Fig 3a) or d_ij (in Fig 3b). In Fig 3a, the edges corresponding to the lower 5% of the weights a_ij are represented. These are the larval transfers that, though improbable, are the most influential in determining high betweenness values when using a_ij as metric. In Fig 3b, the edges corresponding to the lower 5% of the weights d_ij, i.e. the most probable larval transfers, are represented.
Furthermore, a positive correlation between the degree of a node and its betweenness is expected (e.g., [23] and [24]). However, we find that the betweenness values calculated on the 20 connectivity matrices containing a_ij have an average correlation coefficient of −0.42 with the total degree, −0.42 with the in-degree, and −0.39 with the out-degree. Instead, betweenness calculated with the metric of Eq 2 has an average correlation coefficient of 0.48 with the total degree, 0.45 with the in-degree, and no significant correlation with the out-degree (p-value > 0.05).
As we show in Fig 4, the betweenness values of the 32 nodes calculated using the two node-to-node distances a_ij and log(1/a_ij) differ drastically from each other. Moreover, in 10 out of 20 connectivity matrices, the correlation between the node rankings based on betweenness values with the two metrics was not significant. In the 10 cases in which it was significant (p-value < 0.05), the correlation coefficient was lower than 0.6 (data not shown). Such partial correlation is not unexpected, as the betweenness of a node with many connections can be similar when calculated with a_ij or d_ij if among these connections there are both very improbable and highly probable ones, as for node 21 in the present test case. Furthermore, it is noticeable that, if one uses the a_ij values (Fig 4a), the betweenness values are much more variable than the ones obtained using d_ij (Fig 4b). This is because, in the first case, the results depend on the most improbable connections, which, in the ocean, are likely to be numerous and unsteady.

Conclusion
We highlighted the need for methodological exactness in the calculation of betweenness when graph theory is applied to marine connectivity studies based on transfer probabilities. The inconsistency comes from the need to reverse the probability when calculating shortest paths: if this is not done, one treats the most improbable paths as the most probable ones. We showed the drastic consequences of this methodological error on the analysis of a published data set of connectivity matrices for the Gulf of Lion [21].
On the basis of our study, results in [3] and [5] might also be affected. A re-analysis of [3] would not affect the conclusions drawn by the authors about the small-world characteristics of the Great Barrier Reef, as these are purely topological characteristics of a network. Regarding [5], according to Marco Andrello (personal communication), the impact is limited due to the particular topology of the network under study, which forces most of the paths, both probable and improbable, to follow the Mediterranean large-scale steady circulation (e.g., [27]). As a consequence, sites along the prevalent circulation pathways have high betweenness when using either a_ij or d_ij. However, betweenness values of sites influenced by smaller-scale circulation will vary significantly according to the way betweenness is calculated.
To solve the highlighted inconsistency, we proposed a node-to-node metric that provides a meaningful way to calculate shortest paths, and consequently betweenness, when relying on transfer probabilities issued from Lagrangian simulations and on the algorithms proposed in [17] and [18]. The new metric reverses the probability ordering, permits the value of a path to be calculated as the product of its edges, and accounts for the independence of the transfer probabilities. Moreover, this metric is not limited to the calculation of betweenness alone: it is also valid for every graph theory measure related to the concept of shortest paths, for example shortest cycles, closeness centrality, global and local efficiency, and average path length [28].