Uncovering and Testing the Fuzzy Clusters Based on Lumped Markov Chain in Complex Network

Identifying clusters, namely groups of nodes with comparatively strong internal connectivity, is a fundamental task for deeply understanding the structure and function of a network. By means of a lumped Markov chain model of a random walker, we propose two novel ways of inferring the lumped markov transition matrix. Furthermore, some useful results are proposed based on the analysis of the properties of the lumped Markov process. To find the best partition of complex networks, a novel framework including two algorithms for network partition based on the optimal lumped Markovian dynamics is derived to solve this problem. The algorithms are constructed to minimize the objective function under this framework. It is demonstrated by the simulation experiments that our algorithms can efficiently determine the probabilities with which a node belongs to different clusters during the learning process and naturally supports the fuzzy partition. Moreover, they are successfully applied to real-world network, including the social interactions between members of a karate club.


Introduction
The theory of network science has significantly improved our understanding of complex systems. In the last fifteen years, an explosive growth of interests and activities on the structure and dynamics of complex networks is appearing. One of the most favorable but challenging tasks in network science is cluster analysis, which is aimed at revealing possible partitions of a network into subsets of nodes (clusters, or communities) [1]. Markov chains are frequently used as analytic models in the quantitative evaluations of stochastic systems. Examples of their use may be found in diverse areas such as computer, biological, physical and social sciences as well as in business, economics and engineering [2][3][4]. The fundamental property characterizing the model, referred to as the Markov property, is that given the present, past and future transitions of the system are independent of each other [5,6]. The information that is often sought from this model is either the transient or stationary probability of the system being in a given state. When the number of states is small, it is relatively easy to obtain the transient and stationary solutions allowing the prediction of the system behavior. However, as models become more complex the process of obtaining these solutions becomes much more difficult. There is also a wide class of situations, where the modeler does not need information about each state of the system but about classes of states only. This leads to the consideration of a new process, to be called the aggregated or lumped, whose states are the state classes of the original Markov chain. The new stochastic process need not to be Markovian. In order to be able to utilize all the power of the Markov chain theory, it is important to be able to claim that for a given initial distribution the aggregated process has the Markov property.
In a previous paper [7], a k-means approach is proposed to partition the networks based on optimal prediction theory proposed by Chorin and coworkers [8,9]. The basic idea is to associate the network with the random walker Markovian dynamics [10], then introduce a metric on the space of Markov chains (stochastic matrices), and optimally reduce the chain under this metric. The final minimization problem is solved by an analogy to the traditional k-means algorithm [11,12] in clustering analysis. This approach is motivated by the diffusion maps [13] and MNCut algorithms in imaging science [5].
To give a coarse definition about the study of complex networks from the viewpoints of applied mathematics, it is about the research of dynamical systems on graphs. The graph structure may be fixed, or time-varying; the dynamical system may be deterministic, or stochastic. Since these networks are typically very complex, it is of great interest to see whether they can be reduced to much simpler systems [5][6][7][13][14][15][16][17][18][19][20]. And in a broader aspect, it is also closely related to the model reduction theory of differential equations [21,22]. The concept of lumpability on hard partitions is a useful tool to analysis the dynamic of the network based on the a coarse grain view. Lumped Markov chain model of the random walker (i.e., a reduced-order Markov chain in which the clusters of the original network become nodes) which is easily derived from the original (highorder) Markov chain model. This notion of proximity of 'nodes' in the lumped markov chain reflects the intrinsic geometry of the meta-node set in terms of connectivity of the cluster in a diffusion process.
In the current paper, we first analyze the property of lumped markov chain and propose two novel ways of inferring the lumped markov transition matrix. Some useful results are proposed based on the analysis of the properties of the lumped markov process. Furthermore, we construct two algorithms -the steepest descent method with projection (SDP) and the reduced conjugate gradient method with projection (CGP) -from minimizing the objective function under the generalized framework in this paper. According to two choices of projection operators P1, P2, we obtain the formulations -SDP1, SDP2, CGP1, CGP2 -which have been applied to two artificial networks, including the ad hoc network, as well as real-world network, the karate club network. The proposed algorithms are easy to be implemented with reasonable computational effort and the final results do make sense in the considered models. It is demonstrated by these experiments that the algorithms can always perform successfully during the learning process and lead to a good clustering result.

Materials and Methods
We will start with a brief review of markov random walks on complex network [8,9]. Let G(S,E) be a network with N nodes and M edges, where S is the nodes set, E~fe(x,y)g x,y[S is the weight matrix and e(x,y) is the weight for the edge connecting the nodes x and y. A simple example of the weight matrix is given by the adjacency matrix: e(x,y)~0 or 1, depending whether x and y are connected. We can relate this network to a discrete-time Markov chain with stochastic matrix p with entries p(x,y) given by where d(x) is the degree of the node x [10][11][12][13]23]. This Markov chain has stationary distribution and it satisfies the detailed balance condition m(x)p(x,y)~m(y)p(y,x): We take a partition of S as S~| K k~1 S k with S k \S l~6 0 if k=l. Our aim is to aggregate the nodes in each cluster in order to give the exact expression of lumped markov chain. To do so, first, we regard each set S k in the state space S~fS 1 , . . . ,S K g as corresponding to the nodes of a N-nodes networkĜ G(S,E), where E~fê e(S k ,S l )g S k ,S l [S , and the weightê e(S k ,S l ) on the edge that connects S k and S l is defined aŝ e e(S k ,S l )~X It can be easily shown thatp p is a stochastic matrix on the state space S and satisfies a detailed balance condition with respect tom m, i.e.m m k :p p kl~m m l :p p lk : This construction of a lumped markov chain is shown in Figure 1.
To give a more clear probability expression, we denote such probability function as r k (x) to represent the probability which the node x belongs to the k-th cluster with. Naturally we need the assumption that for all x [ S.
Thus, the lumped markov transition matrix can be rewritten based on mp Another approach of deriving the lumped markov transition matrix is based on lifted transition matrix. Any lumped stochastic matrixp p can be naturally lifted to the space of stochastic matrices on the original state space S viã where Figure 1. Lumped markov process of a network. For a given partition S~S 1 S S2 S S3 on a network G(S,E), we define a lumped networkĜ G ,E ð Þ by aggregating all nodes belonging to a subset S k into a meta-node. The new weightsê e(S k ,S l ) are computed via weight averaging the transition probabilities between nodes x[S k and y[S l ,k,l~1-3, and the lumped Markov chain with transition probabilitiesp p(S k ,S l ) can also be obtained. doi:10.1371/journal.pone.0082964.g001 The basic idea in [24,25] is to introduce a metric, also called the Hilbert-Schmidt norm, in the space of stochastic matrices. The optimal partition and the corresponding reduced Markov chainp p is found by minimizing subject to the constraint Eq. (7).
To minimize the objective function J in Eq.(12), we definê which has the similar form asp p kl in hard clustering. Thenp p Ã kl is indeed a stochastic matrix because P z[S m k (z)~1 for all k. Furthermore we have thatp p Ã kl satisfies the detailed balance condition with respect tom m, that iŝ With the above background, we have the following basic results.
The expressions of LJ Lp p and LJ Lr according to Eq.(12) are given by where r~(r k (x)) k~1,..., The diagonal matrices I m , Im m are N|N and K|K respectively, with entries where d(x,y) and d kl are both Kronecker delta symbols. The proof of Lemma 1 can be found in Appendix_S1.pdf. Eq.(15a) and Eq.(15b) give the partial derivatives of the objective function in Eq. (12), which are the critical points of constructing gradient methods.
Based upon this formulation, we can find the optimalp p kl for any fixed partition fS 1 ,:::,S N g. From the optimality condition we can obtain Eq. (8). It indicates that when the partition fS 1 ,:::,S K g is known, the minimizer of Eq. (11) is unique, which can be given by Eq. (8), and the correspondingp p is the stochastic matrix in the class Eq.(9) which provides the best rank-K approximation of the original one under the metric Eq.(11).

SDP
In this section, based on the lumped markov transition matrixp p and corresponding results derived before, we devote to find ways to uncover partitions. An obvious choice to solve the constrained optimization Eq.(11) is the steepest descent method [12]. However, the components ofp p and r may become negative and non-normalized during the descent procedure for our problems, which makes the probabilistic interpretation useless. To ensure the nonnegativity and normalization conditions forp p and r, we add a projection step after each renewal. This leads to the following Steepest Descent method with Projections(SDP).
(1) Set up the initial state r (0) as the indicator matrix for each node in the network with the k-means algorithm in [7], n~0. (2) Perform the following iteration until Er (nz1) {r (n) EƒTOL: Here P is some type of projection operator which maps a real vector into a probability vector, aw0 is the learning rate and TOL is a prescribed tolerance. If v i0 v0, we take Pv i0~0 and make projection again to a dimension reduced hyperplane P i=i0 v i~1 . Repeat the projection procedure to a lower and lower dimensional hyperplane until no component is negative.
The application of projection operators for SD can simply solve the original constrained optimization Eq.(11), while to solve it exactly may involve much more complexity. Note that with this SDP method, we can not guarantee thatp p and r are the exact minimizer of the original problem Eq.(11) with non-negative and normalization constraints, but we take them as an approximate minimizer. The numerical results show that this strategy works fine for many examples.
The learning rate a was usually chosen that started from a reasonable initial value and then reduced to zero with the iteration number n in such a way that 0ƒa(n)ƒ1, and The typical example of such case is a(n)~a 0 =n, where a 0 is a positive constant [26]. Another choice is to fix the learning rate a as a positive constant [24,25], which we utilize here, since the initial partition is good enough that the objective function Eq.(12) descends much more slowly when the learning rate becomes smaller, while larger values of a cause blow up. The number of the iteration steps is difficult to be estimated, which may depend on the structure of the network itself, the

CGP
Another choice is to minimize the objective function using a simplified formulation of traditional conjugate gradient method, which is frequently used in machine learning [27]. It can be also regard as the above steepest descent method with a non-zero momentum term, which leads to the following reduced Conjugate Gradient method with Projections(CGP).
(1) Set up the initial state r (0) as the indicator matrix for each node in the network with the k-means algorithm in [13], n~0.   (2) Perform the following iteration until Er (nz1) {r (n) EƒTOL: Here P is some type of projection operator which maps a real vector into a probability vector, a,bw0 is the learning rates and TOL is a prescribed tolerance. (3) The final r (n) gives the fuzzy partition for each node.
We again note that this is just a reduced form of conjugate gradient method, and it is demonstrated by simulation experiments that such method performs more efficiently than SD, just like the superiority traditional conjugate gradient has. The learning rate a and b are still chosen as constants by experience due to the same reason mentioned above. The computational cost in each iteration of CGP is the same as SDP, which is also O(K 2 (MzN)) for bothp p and r. Associating the two projections described above with SDP and CGP respectively, we refer to the derived algorithms: SDP1, SDP2, CGP1, CGP2, as the fuzzy partitioning algorithms for networks.

Results
In this section, simulation experiments on artificial network, the ad hoc network with 128 nodes, is carried out to demonstrate the performance of the proposed algorithms, via comparing the clustering results with some priori quantities. Moreover, the algorithms is applied to real-world network, the social interactions between members of a karate club.

Ad hoc network with 128 nodes
The first example is the ad hoc network with 128 nodes. The ad hoc network is a benchmark problem used in many papers [7,16,17,22]. It has a known partition and is constructed as follows. Suppose we choose N~128 nodes, split them into 4 clusters with 32 nodes each. Assume that pairs of nodes belonging to the same clusters are linked with probability p in , and pairs belonging to different clusters with probability p out . These values are chosen so that the average node degree d is fixed at d~16. In other words, p in and p out are related as 31p in z96p out~1 6: ð22Þ We will denote S 1~f 1 : 32g,S 2~f 33 : 64g,S 3~f 65 : 96g,S 4~f 97 : 128g.
To test on a less diffusive network, we take z out~9 6p out~5 and generate the network according to Eq.(1). This network has a fuzzy clustering structure that some nodes should have immediate weights belonging to different clusters. We set the parameters by K~4,TOL~10 {6 , and the learning rates a P1~5 ,a P2~6 ,b P1~0 :52,b P2~0 :51 in this model computation. J k{means~6 :0687 is obtained after initialization [12]. The numerical results are shown in Table 1. Here we compare r k (x) with an interesting quantity, the degree fraction r E k (x), which is defined as where E k (x) is the number of nodes that are connected with x and lie in cluster S k . Thus we have P 4 k~1 E k (x)~d(x). With this definition, r E k (x) means the fraction of the edges connected with the node x in the k-th cluster. Note that this is not the same as the clustering probability, even though it is an interesting quantity to be compared with. We expect that the degree fraction Eq.(2) is close to our result r k (x) for network though generally this assertion The association probability that each node belongs to different clusters. r R and r Y means the probability belonging to red or yellow colored cluster in Figure 4 respectively. doi:10.1371/journal.pone.0082964.t003 needs to be justified or disconfirmed theoretically. To verify this fact, we define the mean and maximal L ? -error of r e ? r~m ax for error comparing. Table 1 shows that the deviation between these two is about 0:2. Obviously CG algorithm improves the convergence rate of SD. The projection P1 has the smallest e m r , while the projection P2 reaches a better minimum which indicates a more accurate result. In Figure 2 we plot the probability distribution function (pdf) of r k and r E k (k~1,2,3,4). We observe that the shape of the pdf for r k or r E k are almost the same. Note that all the r k 's have a lower peak centered at about 0.85, which corresponds to the nodes in this cluster, and a higher peak centered at about 0.05, which corresponds to the other nodes outside of this cluster. The case for r E k is similar but with the lower peak centered at about 0.7 and the higher peak centered at about 0.1. We note here that the center 0:7 corresponds to the choice of the parameters z out =d~5=16 ¼ : 0:3. If we classify the nodes according to the majority rule, i.e. if m~arg max k r k (x) for a given node x then we set x[S m , we obtain the 4-cluster partition exactly for this model. This also verifies the accuracy of our algorithms, but fuzzy algorithms give more detailed information for each node.

Karate club network
This network was constructed by Wayne Zachary after he observed social interactions between members of a karate club at an American university [28]. Soon after, a dispute arose between the clubs administrator and main teacher and the club split into two smaller clubs. It has been used in several papers to test the algorithms for finding clusters in networks [7,[14][15][16][17][18].
There are 34 nodes in karate club network, where each node represents one member in the club. In Zachary's original partition, each node belongs to only one sub-club after splitting. We label it as red or yellow color in the figures to show its attribute in the graph representation. From the viewpoint of the fuzzy clustering, the attribute of each node is no longer an indicator function but rather a discrete probability distribution. In our following notations, the association probability r R and r Y means the probability of each node belonging to red or yellow colored cluster respectively.
We set the parameters by K~2,TOL~10 {6 ,a P1~1 :5, a P2~2 :0,b~0:6. Here J k{means~4 :1798 is obtained after initialization [7]. The numerical results are shown in Table 2. It can be demonstrate again that CGP is more efficient, and P2 can reach a smaller value of J min . Figure 3 shows the convergence history during 5-30 iterations for the methods. It can be obviously seen that CGP, which decreased the objective function faster than SDP, performs more efficiently. The final association probabilities are presented in Table 3, where r R and r Y are the probability of belonging to the red or yellow colored group shown in Figure 4 respectively. Comparing r R or r Y between P1 and P2, we find that almost all the errors are less than 10 {2 , but the association probability r is quite different from the 0-1 distributions obtained in the k-means algorithm. Now let us compare the association probability r R and r Y obtained by our methods with the original partition result obtained by Zachary. In [28], Zachary gave the partition S Y~f 1 : 8,11 : 14,17,18,20,22g and S R~f 9,10,15,16, 19,21,23 : 34g. If we classify the nodes according to the majority rule, i.e., if r R (x)wr Y (x) then we set x[S R , otherwise we set x[S Y , we obtain the same partition as Zachary's (see Figure 4(a)). We note that this hard partition deduced by fuzzy partition is more reasonable than the result of k-means(see Figure 4(b)), since node 10 is classified correctly this time.
However, we have more detailed information in fact. From Table 3, we find r Y~1 for nodes f5 : 7,11 : 13,17 : 18,22g, which lie at the boundary of the yellow colored group; and r R~1 for nodes f15 : 16,19,21,23 : 27,30,33g, which mostly lie at the boundary of the red colored group. The others belong to the red and yellow colored groups with nonzero probability, especially the nodes f3,9,10,14,20,31g have more diffusive probability and they play the role of transition nodes between the red and yellow colored groups. We can visualize the data r more transparently with the color vector Eq.(23) for each node. We can conclude from Figure 4(c) that how much probability each of the 34 members stands by both parts with. Members f5 : 7,11 : 13,17 : 18,22g and f15 : 16,19,21,23 : 27,30,33g have an obvious attitude on following their leader, i.e. the club administrator or the main teacher. Others such as f3, 9,10,14,20,31g hold neutralism that they can support either leader according their weights.

Discussion
We address the expression of lumped markov transition matrix for networks with two novel method in this paper. This can also be considered as a generalization of markov random walk dynamic in statistics for the networks. We successfully constructed the steepest descent method with projection (SDP) and the reduced conjugate gradient method with projection (CGP). They are derived to search for the local minimum of the objective function in Eq. (12) under the fuzzy clustering framework, which is extended from a deterministic framework for network partition based on the optimal prediction of a random walker Markovian dynamics [7]. The simulation experiments have shown that the algorithms can efficiently determine the fuzzy partition matrix. Partitioning the network with thresholding operation according to the node's maximal weight can give a more reasonable clustering result than the previous k-means algorithm [7]. We use two datasets(Ad hoc network with 128 nodes and Karate club network) to validate algorithms and achieve good results Numerical results show that our algorithms with two different projections produce similar results, while the CGP2 algorithm has better efficiency and accuracy. Moreover, the algorithms succeed in real-world learning tasks.

Supporting Information
Text S1 The ad hoc network is a benchmark problem used in many papers [8], [9], [12], [16]. It has a known partition and is constructed as follows. Suppose we choose N = 128 nodes, split them into 4 clusters with 32 nodes each. Appendix S1 The Proof of Lemma 1. (PDF)