Potential Theory for Directed Networks

Uncovering the factors underlying network formation is a long-standing challenge for data mining and network analysis. In particular, the microscopic organizing principles of directed networks are less understood than those of undirected networks. This article proposes a hypothesis named potential theory, which assumes that every directed link corresponds to a decrease of one unit of potential and that subgraphs with definable potential values for all nodes are preferred. Combining the potential theory with the clustering and homophily mechanisms, we deduce that the Bi-fan structure, consisting of 4 nodes and 4 directed links, is the most favored local structure in directed networks. Our hypothesis receives strong support from extensive experiments on 15 directed networks drawn from disparate fields, as indicated by the most accurate and robust performance of the Bi-fan predictor within the link prediction framework. In summary, our main contribution is twofold: (i) we propose a new mechanism for the local organization of directed networks; (ii) we design the corresponding link prediction algorithm, which not only tests our hypothesis but also finds direct applications in missing link prediction and friendship recommendation.


Introduction
Many social, biological and technological systems can be well described by networks, where nodes represent individuals and links denote the relations or interactions between nodes. The study of the structure and functions of networks has therefore become a common focus of many branches of science [1]. A big challenge attracting increasing attention in the past decade is to uncover the mechanisms underlying the formation of networks [2]. Macroscopic mechanisms include the rich-get-richer [3], the good-get-richer [4], the stability constraints [5], and so on, while microscopic mechanisms include homophily [6], clustering [7], balance theory [8], and so on. Mechanisms can also play a part in regulating the mesoscopic structure, like the formation and transformation of groups and communities [9][10][11]. Real networks usually result from a hybrid of several mechanisms; for example, new nodes may form links according to the rich-get-richer mechanism while, simultaneously, new links among old nodes could be a consequence of the clustering mechanism [12].
The so-called clustering mechanism states that two nodes have a high probability of forming a link between them if they share some common neighbors [13]. This mechanism is indirectly supported by increasing evidence of high clustering coefficients in disparate networks [7] (the clustering coefficient of a node is defined as the density of links among its neighbors, and the clustering coefficient of the network is the average of all nodes' clustering coefficients [14]). Through an investigation of a social network consisting of 43,553 university members, Kossinets and Watts [15] found direct evidence that two students sharing more common acquaintances are more likely to become acquainted with each other.
The clustering mechanism also works for directed networks; for example, in Twitter, more than 90% of new links are added between nodes sharing at least one common neighbor [16]. In addition, evolving network models driven by common neighbors can reproduce some significant features of both directed and undirected networks [17,18].
The homophily mechanism describes the observed tendency of people to communicate with others of similar profiles or experiences [6]. Experiments on social networks strongly support this mechanism. Positive evidence comes from various examples, such as an acquaintance network of university members [15], a large-scale instant-messaging network containing $1.8 \times 10^{8}$ individuals [19], friendship networks of a set of American high schools [20], a social network of a cohort of college students on Facebook [21], and so on. A variety of characteristics, such as race, tastes in music and movies, grade, age, location, language and shared experience, are significant to link formation. The homophily mechanism also plays a role in other kinds of networks; for example, in directed document networks, links (e.g., hyperlinks between web pages and citations between articles) tend to connect documents with similar content [22]. In some literature, the clustering mechanism is considered a special case of the homophily mechanism, where two nodes having some common neighbors are recognized as being in similar network surroundings. In this article, we prefer to distinguish these two mechanisms. Recent experiments on directed social networks show that the clustering mechanism may be even stronger than the homophily mechanism [23].
The reciprocity mechanism is the tendency of nodes to respond to incoming links by creating links back to the source [24]. It is a specific mechanism for some directed networks, but it is not applicable everywhere. For example, the reciprocity mechanism plays a significant role in the growth of the social networks of a Facebook-like community [25] and Flickr [26], but it has much less impact on Slashdot [27] and it does not work at all on food webs [28].
This article focuses on directed networks. Examples of directed networks are numerous: the world wide web is made up of directed hyperlinks, food webs consist of directed links from predators to prey, and in microblogging social networks, fans form links pointing to their opinion leaders. High reciprocity is a specific property of some directed networks. In addition, the formation of directed links also obeys the aforementioned mechanisms; for example, users in Twitter are likely to form links to neighbors of their neighbors and to friends of their friends of similar ages, in accordance with the clustering and homophily mechanisms [16]. Apart from a few representative works on local organizations (e.g., loops, small-order subgraphs, etc.) of directed networks [29][30][31][32][33], link formation in directed networks has received less attention and is less well understood than in undirected networks. Here we propose a hypothesis of link formation for general directed networks, named potential theory. Combining the potential theory with the clustering and homophily mechanisms, we deduce a certain preferred subgraph. We apply the link prediction approach [34] to verify our deduction. That is, we hide a fraction of links and predict them by assuming that a link generating more preferred subgraphs has a higher probability of existing (see details in Methods and Materials). Experiments on disparate directed networks, ranging from large-scale social networks containing millions of individuals to small-scale food webs consisting of about a hundred species, show that prediction according to the preferred subgraph is more accurate and robust than prediction according to other comparable subgraphs. Besides offering insights into the underlying mechanism of directed network formation, our work could find applications in friendship recommendation for social networks and missing link prediction for biological networks.

Potential Theory
A graph is called potential-definable if each node can be assigned a potential such that, for every pair of nodes $i$ and $j$, if there is a link from $i$ to $j$, then $i$'s potential is one unit higher than $j$'s. Clearly, a single link is potential-definable, whereas a graph containing reciprocal links is not. Figure 1 illustrates some example graphs of orders 2 to 4, where graphs (a) and (c) are not potential-definable and graphs (b) and (d) are potential-definable. Notice that the condition "potential-definable" is only meaningful for very small graphs, since a graph consisting of many nodes is very unlikely to be potential-definable. Although potential-definable networks are always acyclic, directed acyclic networks [35] are usually not potential-definable. For example, the feed-forward loop is a directed acyclic network but is not potential-definable.
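The definition above can be checked mechanically. The following is a minimal sketch, not the authors' code: the function name `assign_potentials` and the convention of fixing one node per weakly connected component at potential 0 are assumptions made here for illustration (only potential differences are constrained by the definition).

```python
from collections import defaultdict, deque

def assign_potentials(links):
    """Return {node: potential} if the graph is potential-definable, else None.

    links: iterable of (i, j) pairs, each meaning a directed link i -> j.
    Hypothetical helper; potentials are fixed only up to an additive
    constant per weakly connected component (the start node gets 0).
    """
    adj = defaultdict(list)
    for i, j in links:
        adj[i].append((j, -1))  # following i -> j lowers the potential by one
        adj[j].append((i, +1))  # walking the link backwards raises it by one
    potential = {}
    for start in adj:
        if start in potential:
            continue
        potential[start] = 0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nbr, delta in adj[node]:
                expected = potential[node] + delta
                if nbr not in potential:
                    potential[nbr] = expected
                    queue.append(nbr)
                elif potential[nbr] != expected:
                    return None  # two walks demand different potentials
    return potential

single_link = [("a", "b")]
reciprocal = [("a", "b"), ("b", "a")]
feed_forward_loop = [("a", "b"), ("b", "c"), ("a", "c")]
print(assign_potentials(single_link))
print(assign_potentials(reciprocal))
print(assign_potentials(feed_forward_loop))
```

The three toy graphs correspond to the cases named in the text: a single link admits potentials, while the reciprocal pair and the feed-forward loop do not.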
The potential theory claims that a link that can generate more potential-definable subgraphs is more significant and thus has a higher probability of appearing. Our definition of subgraph is more general than the traditional one. Consider a directed graph $D(V,E)$, where $V$ and $E$ are the sets of nodes and directed links. A graph $D'(V',E')$ is called a deduced subgraph of $D$ (usually called an induced subgraph in graph theory) if $V' \subseteq V$ and $E'$ contains all the links in $E$ that connect two nodes in $V'$. Our definition only requires $V' \subseteq V$ and $E' \subseteq E$; that is, $E'$ need not include all links connecting nodes in $V'$. As shown in figure 2, (b), (c) and (d) are subgraphs of (a) according to our definition, but only (b) is a deduced subgraph of (a).
Figure 1. Graphs (b) and (d) are potential-definable, and the numbers labeled beside nodes are example potentials. Graphs (a) and (c) are not potential-definable: if we set the top nodes' potential to be 1, some nodes' potentials cannot be determined under the constraint that a directed link is always associated with a decrease of one unit of potential. doi:10.1371/journal.pone.0055437.g001
Since any graph containing reciprocal links is not potential-definable, we do not take the reciprocity mechanism into account here. The clustering mechanism prefers short loops (not necessarily directed loops) and only works in local surroundings, so we only consider loop-embedded subgraphs of orders 3 and 4. Two nodes connected by reciprocal links are not treated as a loop. To avoid repeated counting, we only consider the minimal loop-embedded subgraphs, that is, those that do not themselves contain loop-embedded subgraphs. Figure 3 illustrates all six different minimal loop-embedded subgraphs of orders 3 and 4. These subgraphs are named after Ref. [29], but our motivation is different from motif analysis and we adopt a different definition of subgraph (Ref. [29] only considers deduced subgraphs). Among these six subgraphs, only Bi-fan and Bi-parallel are potential-definable. Since we generally cannot obtain the explicit attributes of nodes, the homophily mechanism here only refers to the homogeneity in topology related to the potential levels. In a potential-definable subgraph, two nodes with the same potential cannot directly connect to each other, and thus the homophily mechanism only works when we consider each subgraph as a whole. Specifically, a subgraph is more homogeneous if the nodes therein occupy fewer potential levels.
In Bi-fan, the links are equivalent to each other and the nodes have two different potentials, while in Bi-parallel the links differ (two go from high-potential nodes to moderate-potential nodes, and the other two go from moderate-potential nodes to low-potential nodes) and the nodes have three different potentials. According to the assigned potentials, the Bi-fan structure is more homogeneous (it has fewer potential levels) than the Bi-parallel structure, so the homophily mechanism prefers the former.
In summary, taking into account the potential theory together with the clustering and homophily mechanisms, we expect the Bi-fan subgraph to be the most preferred one, and a link that can generate more Bi-fan subgraphs should have a higher probability of existing. This hypothesis receives strong support, as indicated by the most accurate and robust performance of the Bi-fan predictor within the link prediction framework. Figure 4 illustrates the selection procedure for the final winner, Bi-fan, as well as the respective contributions of the three mechanisms.
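As a worked illustration of the potential levels just described, label the four nodes of each subgraph $a, b, c, d$. The labels, the symbol $\phi$ for the potential, and the choice of 0 as the lowest potential are assumptions made here for illustration; only potential differences are fixed by the definition.

```latex
\begin{align*}
\text{Bi-fan } (a \to c,\ a \to d,\ b \to c,\ b \to d):&\quad
  \phi(a)=\phi(b)=1,\ \ \phi(c)=\phi(d)=0
  \quad\Rightarrow\quad 2 \text{ potential levels},\\
\text{Bi-parallel } (a \to b,\ a \to c,\ b \to d,\ c \to d):&\quad
  \phi(a)=2,\ \ \phi(b)=\phi(c)=1,\ \ \phi(d)=0
  \quad\Rightarrow\quad 3 \text{ potential levels}.
\end{align*}
```

In both cases every link lowers the potential by exactly one unit, and Bi-fan spans fewer levels than Bi-parallel.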

Experimental Results
Corresponding to these six subgraphs, we obtain 12 individual predictors by removing one link from every subgraph ($S_1$-$S_{12}$, see figure 5). To evaluate the accuracy of a predictor, a network is divided into two parts: a training set and a testing set. We call a pair of nodes with no link between them in the network a nonexistent link; all links can then be classified into three categories: observed links are those in the training set, missing links are those in the testing set, and nonexistent links are the remaining ones. The missing links and nonexistent links together constitute the set of non-observed links. A good predictor will assign higher scores to missing links than to nonexistent ones. We adopt the Area Under the receiver operating characteristic Curve (AUC) to evaluate the prediction accuracy: a higher AUC value corresponds to a better predictor. Details about the link prediction algorithm and the evaluation metric are given in Methods and Materials. Table 1 shows the prediction accuracy, measured by AUC values, of all 12 individual predictors. In 14 of the 15 real networks, all except Youtube, the predictor $S_5$ performs best. The advantage of $S_5$ over the others is usually remarkable, while for Youtube the performance of $S_5$ is very close to that of the optimal one, $S_{12}$. The last row of Table 1 shows the average AUC values, which again emphasize the great advantage of $S_5$. Roughly speaking, the very simple rule that a link generating more Bi-fan subgraphs has a higher probability of existing is nearly 90% right. Table 2 compares the prediction accuracy of some hybrid predictors. As explained in Methods and Materials, the predictor $S_1+S_2+S_3$ means that the score of a non-observed link is defined as the total number of $S_1$, $S_2$ and $S_3$ structures created by the addition of this link. In fact, the six predictors in Table 2 correspond to the six minimal loop-embedded subgraphs in figure 3. Therefore, Table 2 directly gives the comparison of the six candidate subgraphs. Again, Bi-fan wins.
The results presented in Table 1 and Table 2 reveal another significant advantage of the Bi-fan structure: its high robustness. That is to say, even when the predictor $S_5$ is not the best in some cases, its performance is very close to the optimal one. In contrast, any other predictor, whether individual or hybrid, is very sensitive to the network structure and will occasionally give very bad predictions.
Table 1. AUC values of the 12 predictors shown in figure 5.

Discussion
This article studied the underlying mechanism of link formation in directed networks. We presented a hypothesis named potential theory, which claims that a link that can generate more potential-definable subgraphs has a higher probability of appearing. This mechanism cannot be used on its own to infer network structure, because there are too many potential-definable subgraphs (e.g., directed paths of any length are potential-definable). Therefore, we also take into account two well-known local mechanisms: clustering and homophily. By combining the three mechanisms, we infer that Bi-fan is the most preferred subgraph in directed networks. In a comparison of the link prediction accuracies of the 12 individual predictors and of the six minimal loop-embedded subgraphs, Bi-fan performs best: it not only achieves a higher AUC value than the others but is also robust, namely, across disparate testing networks its performance is either the best or very close to the best. Notice that although the experimental results provide supportive evidence, they can only be considered a necessary condition, not a sufficient condition or a solid proof, for the potential theory.
The local driving mechanisms underlying directed network formation are less understood than those for undirected networks. This kind of study is thus of theoretical significance, and our work provides insights into the microscopic architecture of directed networks. Although the potential theory is more complicated than the clustering and homophily mechanisms as well as the balance theory, its meaning is easy to grasp: the potential-definable property implies a local hierarchy, and the potential value of a node indicates its level in the hierarchical structure. For example, directed loops are not hierarchy-embedded, whereas a directed path is strictly hierarchically organized; the former is not potential-definable and the latter is potential-definable. Hierarchical organization is a well-known macroscopic feature of many undirected [36,37] and directed [38,39] networks, and our work indicates that, in directed networks, nodes tend to be locally self-organized in a hierarchical manner. We conjecture that this kind of microscopic hierarchical organization contributes to the macroscopic hierarchical structure. In the near future, we will study more data sets in more detail to check whether the potential theory and our hypothesis about hierarchical organization are valid, and to determine the applicable range of the potential theory (for which networks it works and to what extent it can explain network formation).
Lastly, we emphasize that the link prediction problem is fundamental to both information filtering and network analysis [34,40], and it has numerous applications. In this work, we applied the link prediction approach to evaluate the driving mechanisms of network formation; at the same time, our method can be directly applied to predicting missing links and recommending friendships for large-scale directed networks, since the accuracy of our method is much higher than that of the common-neighbor-based methods, as indicated by the performance of predictors $S_1$, $S_2$, $S_3$ and $S_4$.
Figure 6. Illustration of the scores of links according to our method. The red dashed arrows are probe links. If we adopt the predictor $S_1$, the scores for $n_1 \to n_3$ and $n_4 \to n_2$ are $S_1(n_1 \to n_3) = 2$ ($n_1 \to n_5 \to n_3$ and $n_1 \to n_2 \to n_3$) and $S_1(n_4 \to n_2) = 0$, respectively. More examples are as follows: $S_2(n_1 \to n_3) \supseteq \{n_1 \to n_2 \leftarrow n_3\}$; $S_5(n_4 \to n_2) \supseteq \{n_4 \to n_5 \leftarrow n_1 \to n_2\}$; $S_6(n_4 \to n_2) \supseteq \{n_4 \to n_5 \to n_3 \leftarrow n_2\}$; $S_9(n_4 \to n_2) \supseteq \{n_4 \to n_5 \to n_3 \to n_2\}$. doi:10.1371/journal.pone.0055437.g006

Link Prediction Algorithm
Given a directed network $D(V,E)$, the fundamental task of a link prediction algorithm is to rank all non-observed links in the set $U \setminus E$, where $U$ is the universal set containing all $|V|(|V|-1)$ possible directed links. If one wants to find missing links or recommend friendships, one can choose the links with the highest ranks. The mainstream method is to assign each non-observed link a score, with higher-scoring links ranked ahead.
We design the predictors corresponding to the six minimal loop-embedded subgraphs shown in figure 3. By removing one link from every subgraph, we get twelve predictors, as shown in figure 5. Adopting the predictor $S_i$ means that the score of a non-observed link $u \to v$ is defined as the number of $i$th subgraphs created by the addition of this link. Notice that a link may generate ten 3-FFLs, but its role in them can differ. For example, these ten 3-FFLs may include two $S_1$, three $S_2$ and five $S_3$. So if we adopt the predictor $S_2$, the score of this link is three. Therefore, if we would like to see the contribution of a link to the created 3-FFLs, we can adopt the predictor $S_1+S_2+S_3$, which means that the score of a non-observed link is defined as the total number of $S_1$, $S_2$ and $S_3$ structures created by this link, equivalent to the number of created 3-FFLs. Figure 6 illustrates a simple example of how we calculate the scores.
Given a predictor, we can rank all the non-observed links according to their scores. To evaluate the algorithmic performance, we randomly divide the observed links $E$ into two parts: the training set $E^T$ is treated as known information, while the testing set (probe set) $E^P$ is used for testing, and no information therein is allowed to be used for prediction. Clearly, $E = E^T \cup E^P$ and $E^T \cap E^P = \emptyset$. In our experiments, the training set always contains 90% of the links, and the remaining 10% of the links constitute the testing set.
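The scoring rule above can be sketched for two of the predictors, using the path patterns named in the Figure 6 caption: $S_1(u \to v)$ counts paths $u \to w \to v$, and $S_5(u \to v)$ counts paths $u \to a \leftarrow b \to v$ (each of which becomes a Bi-fan once $u \to v$ is added). This is a minimal sketch, not the authors' implementation; the function names are hypothetical, and the toy link list is an assumption reconstructed only from the paths mentioned in that caption, so the actual figure may contain other links.

```python
from collections import defaultdict

def build_adjacency(links):
    """Return (succ, pred): out-neighbor and in-neighbor sets of each node."""
    succ, pred = defaultdict(set), defaultdict(set)
    for i, j in links:
        succ[i].add(j)
        pred[j].add(i)
    return succ, pred

def score_s1(succ, pred, u, v):
    """Number of two-step paths u -> w -> v in the training graph."""
    return len((succ.get(u, set()) & pred.get(v, set())) - {u, v})

def score_s5(succ, pred, u, v):
    """Number of paths u -> a <- b -> v, i.e. Bi-fans closed by adding u -> v."""
    count = 0
    for a in succ.get(u, set()) - {u, v}:
        for b in pred.get(a, set()) - {u, v, a}:
            if b in pred.get(v, set()):
                count += 1
    return count

# Assumed toy training graph (probe links n1 -> n3 and n4 -> n2 are hidden).
toy_links = [("n1", "n5"), ("n5", "n3"), ("n1", "n2"),
             ("n2", "n3"), ("n3", "n2"), ("n4", "n5")]
succ, pred = build_adjacency(toy_links)
print(score_s1(succ, pred, "n1", "n3"))  # 2, matching the Figure 6 caption
print(score_s1(succ, pred, "n4", "n2"))  # 0, matching the Figure 6 caption
print(score_s5(succ, pred, "n4", "n2"))
```

Storing both successor and predecessor sets keeps each score a matter of a few set lookups around the candidate link, which is one plausible way to keep the computation local.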
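The random 90%/10% division described above might be sketched as follows. The function name `split_links`, the `seed` parameter, and the rounding of the probe-set size are assumptions for illustration; the text only specifies a random split with 10% of links in the probe set.

```python
import random

def split_links(links, probe_fraction=0.1, seed=0):
    """Randomly split observed links into (training set, probe set).

    Hypothetical helper: the probe set gets round(probe_fraction * |E|)
    links, and the two sets are disjoint and together cover all links.
    """
    links = sorted(set(links))       # deduplicate and fix an order
    rng = random.Random(seed)        # seeded for reproducibility
    rng.shuffle(links)
    n_probe = int(round(probe_fraction * len(links)))
    probe = set(links[:n_probe])
    train = set(links[n_probe:])
    return train, probe

observed = [(i, i + 1) for i in range(20)]
train, probe = split_links(observed)
print(len(train), len(probe))
```

Only the training set should then be passed to the predictor, since no information from the probe set may be used for prediction.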

Evaluation Metric
We use a standard metric, the area under the receiver operating characteristic (ROC) curve [41], to test the accuracy of link prediction algorithms. It is usually abbreviated as the AUC (Area Under Curve) value. This metric can be interpreted as the probability that a randomly chosen missing link (a link in $E^P$) is given a higher score than a randomly chosen nonexistent link (a link in $U \setminus E$). In the implementation, among $n$ independent comparisons, if the missing link has the higher score $n'$ times and the missing link and the nonexistent link have the same score $n''$ times, we define the AUC value as [34]:

$$\mathrm{AUC} = \frac{n' + 0.5\,n''}{n}.$$
If all the scores are generated from an independent and identical distribution, the AUC value should be about 0.5. Therefore, the degree to which the AUC value exceeds 0.5 indicates how much better the algorithm performs than pure chance.
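The sampled AUC defined above can be sketched as below. The function name, the default number of comparisons, and the `seed` parameter are assumptions for illustration; `score` stands for any predictor that maps a link to a number.

```python
import random

def auc_by_sampling(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC = (n' + 0.5 * n'') / n from n random comparisons.

    score: callable mapping a link (u, v) to its predictor score.
    probe_links: missing links (the probe set).
    nonexistent_links: node pairs with no link in the full network.
    """
    rng = random.Random(seed)
    probe_links = list(probe_links)
    nonexistent_links = list(nonexistent_links)
    higher = ties = 0
    for _ in range(n):
        s_missing = score(rng.choice(probe_links))
        s_nonexistent = score(rng.choice(nonexistent_links))
        if s_missing > s_nonexistent:
            higher += 1      # contributes to n'
        elif s_missing == s_nonexistent:
            ties += 1        # contributes to n''
    return (higher + 0.5 * ties) / n

probe = [("a", "b")]
nonexistent = [("b", "a"), ("a", "c")]
perfect = lambda link: 1.0 if link in probe else 0.0
constant = lambda link: 0.0
print(auc_by_sampling(perfect, probe, nonexistent, n=100))   # 1.0
print(auc_by_sampling(constant, probe, nonexistent, n=100))  # 0.5
```

A predictor that always ranks missing links first reaches 1.0, while one that cannot distinguish the two classes yields 0.5 through the tie term.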

Data Description
Our experiments include 15 real directed networks drawn from disparate fields. Details are as follows and the basic structural features are presented in Table 3. If a network is unconnected, we only consider its largest weakly connected component.
Notation for Table 3: $|V|$ and $|E|$ are the numbers of nodes and links, $k^{\mathrm{in}}_{\max}$ and $k^{\mathrm{out}}_{\max}$ are the maximum in-degree and out-degree over all nodes, and $\langle k \rangle$ is the average degree of all nodes (the average in-degree equals the average out-degree). $\langle d \rangle$ and $C$ are the 90-percentile effective diameter [56] and the clustering coefficient for directed networks [57], respectively.
Biological networks. Three of them are food webs, representing predator-prey relations, and another one is a neural network of C. elegans.
• C. elegans [45] - A neural network of the nematode worm C. elegans, in which an edge joins two neurons if they are connected by either a synapse or a gap junction.
Information networks. We consider networks of documents, where a directed link from $i$ to $j$ means that document $i$ cites document $j$, and a network of weblogs, where a directed link stands for a hyperlink.
• FriendFeed [50] - FriendFeed is an aggregator that consolidates updates from social media and social networking websites, social bookmarking websites, blogs, microblogging services, etc. Members can manage their social networking content with one FriendFeed account and follow others' updates. This data set captures the who-follows-whom relationships.
• Epinions [51] - Epinions.com is a who-trust-whom online social network of a general consumer review site. Members of this site can decide whether to "trust" each other.
• Slashdot [52] - Slashdot.org is a technology-related news website known for its specific user community. This site allows individuals to tag each other as friends or foes.
• Wikivote [53,54] - Wikipedia is a free encyclopedia written collaboratively by volunteers around the world. Active users can be nominated to be administrators. A public vote begins after some users are nominated. Other users can express a positive, negative or neutral opinion on each candidate. The candidate with the most votes is promoted to admin status. This process implies a social network in which users are nodes and a vote cast by one user on another defines a directed link. The data come from 2,794 elections on the English Wikipedia.
• Twitter [55] - Twitter is an online social networking service where users can post texts of up to 140 characters. It also allows users to "follow" other users, whereby a user can see updates from the users he follows on his Twitter page. In this network, a link from user A to user B means that user A is following user B. The data used here is a sample from the whole dataset in [55].
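The preprocessing step mentioned at the start of this section, keeping only the largest weakly connected component of a network that is not connected, could be sketched as follows. The function name is hypothetical and this is an illustrative sketch under that reading, not the authors' preprocessing code.

```python
from collections import defaultdict, deque

def largest_weakly_connected_component(links):
    """Return the links whose endpoints lie in the largest weak component.

    Link directions are ignored when finding components, then the
    original directed links inside the largest component are kept.
    """
    links = list(links)
    undirected = defaultdict(set)
    for i, j in links:
        undirected[i].add(j)
        undirected[j].add(i)
    seen, best = set(), set()
    for start in undirected:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:                      # breadth-first search
            node = queue.popleft()
            for nbr in undirected[node]:
                if nbr not in component:
                    component.add(nbr)
                    queue.append(nbr)
        seen |= component
        if len(component) > len(best):
            best = component
    return [(i, j) for i, j in links if i in best and j in best]

print(largest_weakly_connected_component([(1, 2), (3, 2), (4, 5)]))
```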