Detecting Overlapping Protein Complexes by Rough-Fuzzy Clustering in Protein-Protein Interaction Networks

In this paper, we present a novel rough-fuzzy clustering (RFC) method to detect overlapping protein complexes in protein-protein interaction (PPI) networks. RFC focuses on a fuzzy relation model rather than a graph model by integrating fuzzy sets and rough sets, employs the upper and lower approximations of rough sets to deal with overlapping complexes, and calculates the number of complexes automatically. A fuzzy relation between proteins is established and then transformed into a fuzzy equivalence relation. Non-overlapping complexes correspond to the equivalence classes of this equivalence relation. To obtain overlapping complexes, we calculate the similarity between one protein and each complex, and then determine whether the protein belongs to one or multiple complexes by computing the ratio of each similarity to the maximum similarity. To validate RFC quantitatively, we test it on the Gavin, Collins, Krogan and BioGRID datasets. Experimental results show a good correspondence to the reference complexes in the MIPS and SGD databases. We then compare RFC with several previous methods, including ClusterONE, CMC, MCL, GCE, OSLOM and CFinder. Results show that the precision, sensitivity and separation are 32.4%, 42.9% and 81.9% higher than the mean of the five methods in four weighted networks, and 0.5%, 11.2% and 66.1% higher than the mean of the six methods in five unweighted networks. RFC works well for protein complex detection and provides a new insight into network division, and it can also be applied to identify overlapping community structure in social networks and LFR benchmark networks.


Introduction
In the past several years, large-scale proteomics experiments have produced many PPI data sets from different organisms [1]. These data sets are generally represented as undirected weighted or unweighted networks, with proteins as nodes and interactions as edges. An edge weight estimates the reliability of the corresponding interaction. Protein-protein interactions play significant roles in the cell's structural components and in processes ranging from transcription, splicing and translation to cell cycle control [2]. It is essential to extract overlapping protein complexes or regulatory pathways from PPI networks in order to investigate disease-related genes and drug targets.
Densely connected regions in a graph can be identified by unsupervised clustering methods. However, many clustering methods are not ideal for PPI networks [1]. Some proteins have multiple functions and may therefore belong to more than one complex. Recently, many clustering algorithms have been proposed to detect overlapping protein complexes in PPI networks [1,3,4,5,6,7]. Each of them has limitations: some algorithms only work on unweighted networks and can be applied to weighted data sets only after binarizing them by deleting edges whose weights are below a given threshold, while others require the number of complexes to be specified in advance [8,9]. Although the notion of overlapping protein complexes is easy to understand, constructing an effective algorithm to detect them is highly non-trivial for two reasons: first, the number of protein complexes is unknown for a given PPI network; second, a protein complex should contain many reliable interactions among its subunits, and it should be well separated from the rest of the PPI network [1].
Fuzzy sets and rough sets have been widely applied in many fields, such as fuzzy clustering [10,11], rough k-means clustering [9,12,13,14,15], fuzzy c-means clustering [16,17], rough-fuzzy c-means clustering [18,19,20] and dynamic rough clustering [21,22]. Among the most notable approaches to clustering problems are c-means clustering and its derivatives. However, those algorithms are mainly applied to two-dimensional microarray gene data, image data and forest cover data rather than three-dimensional network data, and they mainly adapt rough set and fuzzy set theory to c-means clustering [18]. Those algorithms have the following weaknesses. First, the number of clusters c is an input parameter, and an inappropriate choice of c may yield poor results. In most cases, it is difficult to assess the number of clusters in the original data sets, so diagnostic checks have to be performed repeatedly to determine it when performing c-means. Second, the choice of the initial cluster centers has a great impact on the clustering results; if the initial centers are poorly chosen, the algorithm may fail to produce effective clusters. Third, the algorithm repeatedly reassigns samples and recomputes the cluster centers, so its running time grows when the data set is very large.
To address the clustering of three-dimensional data in PPI networks and the weaknesses of c-means clustering, we present a novel method based on rough-fuzzy clustering (RFC) to detect overlapping protein complexes in PPI networks. RFC integrates the merits of fuzzy sets and rough sets and focuses on a fuzzy relation model rather than a graph model. RFC uses fuzzy sets to create a fuzzy relation between nodes, transforms the fuzzy relation into a fuzzy equivalence relation, and then creates equivalence classes, which correspond to non-overlapping protein complexes. The upper and lower approximations of rough sets are used to decide whether a protein belongs to one or more complexes, which yields overlapping complexes. RFC obtains the number of clusters automatically as the number of equivalence classes, removing the need to select the number of clusters in advance. RFC also has an advantage on data sets with a large number of prototypes.
To test RFC's performance, we apply it to identify overlapping and non-overlapping community structure in artificial synthetic networks and social networks. To evaluate RFC quantitatively, we apply it to detect overlapping protein complexes in four weighted yeast data sets [23,24,25] and five unweighted yeast data sets [23,24,25,26], and then we execute six other popular clustering methods (ClusterONE [1], CMC [27], MCL [28], GCE [29], OSLOM [30] and CFinder [3]) in the same data sets. Predicted complexes derived by the seven methods are separately compared with reference complexes from the Munich Information Centre for Protein Sequence (MIPS) [31] and the Saccharomyces Genome Database (SGD) [32]. Finally, results derived by the seven methods are compared with some evaluation criteria to assess RFC.

Materials and Methods
The definitions of rough-fuzzy clustering
Prior to providing a detailed description of our algorithm, we introduce some terminology used in the following sections. Let $G = (V, E)$ be an undirected graph, where $V$ is a set of nodes and $E$ is a set of edges.
Definition 1. Let $N(u)$ be the set of neighbors of node $u$, and let $Sim(u, v)$ denote the similarity of the node pair $u$ and $v$. The similarity between nodes is defined from their shared neighbors: if $u$ and $v$ are not direct neighbors, $Sim(u, v) = 0$; if $u$ and $v$ are direct neighbors, the more neighbors $u$ and $v$ share, the larger the value of $Sim(u, v)$; if $u$ and $v$ are the same node, $Sim(u, v) = 1$. That is, $0 \le Sim(u, v) \le 1$. If two nodes have similar topological structure, they may share similar functions [11]. Similarity in network topological structure thus determines the degree of similarity between a pair of nodes.
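To make the properties listed in Definition 1 concrete, the following Python sketch uses a Jaccard-style ratio over closed neighborhoods. This particular formula is an assumption, chosen only because it satisfies the three stated properties (zero for non-adjacent nodes, one for identical nodes, and growth with the number of shared neighbors); it is not necessarily the formula of Definition 1. The names `sim` and `adj` and the toy graph are hypothetical.

```python
def sim(adj, u, v):
    """Shared-neighbor similarity of nodes u and v.

    ASSUMPTION: Jaccard ratio over closed neighborhoods. This is an
    illustrative stand-in, not necessarily the formula of Definition 1.
    """
    if u == v:
        return 1.0                       # a node is fully similar to itself
    if v not in adj[u]:
        return 0.0                       # non-adjacent nodes have similarity 0
    nu, nv = adj[u] | {u}, adj[v] | {v}  # closed neighborhoods N[u], N[v]
    return len(nu & nv) / len(nu | nv)


# Hypothetical toy graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sim(adj, 1, 3), sim(adj, 3, 4), sim(adj, 1, 4))
```

Any other function with the same three properties could be substituted without changing the later steps, which only consume the resulting similarity values.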
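Definition 2. Let $V$ be a nonempty set and $R$ be an equivalence relation on $V$. For each $v \in V$, the equivalence class of object $v$ for $R$ is defined as follows [12]:
\[ [v]_R = \{ u \in V \mid (u, v) \in R \} \]
Definition 3. For a set $X \subseteq V$, the upper and lower approximations of $X$ for $R$ are defined as follows, respectively [12]:
\[ \overline{R}(X) = \{ v \in V \mid [v]_R \cap X \neq \emptyset \}, \qquad \underline{R}(X) = \{ v \in V \mid [v]_R \subseteq X \} \]
Here, $\overline{R}(X)$ is the upper approximation of $X$ for the equivalence relation $R$, and $\underline{R}(X)$ is the lower approximation of $X$ for the equivalence relation $R$. Obviously, $\emptyset \subseteq \underline{R}(X) \subseteq X \subseteq \overline{R}(X)$. $BN_R(X) = \overline{R}(X) - \underline{R}(X)$ is called the boundary region of $X$ for the equivalence relation $R$, and their relationship is shown in Figure 1.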
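The approximations of Definition 3 can be computed directly from the equivalence classes. The following Python sketch assumes the standard rough set definitions (the lower approximation is the union of the classes contained in X, the upper approximation is the union of the classes that intersect X); the function name and the example partition are hypothetical.

```python
def approximations(classes, X):
    """Return (lower, upper, boundary) of X for a partition into classes."""
    X = set(X)
    lower, upper = set(), set()
    for c in classes:
        c = set(c)
        if c <= X:      # class lies entirely inside X
            lower |= c
        if c & X:       # class shares at least one object with X
            upper |= c
    return lower, upper, upper - lower


# Hypothetical partition of V = {1, ..., 5} and a target set X.
classes = [{1, 2}, {3, 4}, {5}]
lower, upper, boundary = approximations(classes, {1, 2, 3})
print(lower, upper, boundary)
```

In this toy case the class {3, 4} is only partly inside X, so it contributes to the upper approximation and to the boundary region but not to the lower approximation.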
Let $u$ be an object of set $X_i$. It is obvious in Figure 1 that the upper and lower approximations of $X_i$ are only a few subsets of $V$. The family of the $k$ upper and lower approximations of the $X_i \in V/R$, $i = 1, \ldots, k$, necessarily meets the following basic rough set properties [12]:
Property 1: An object $u$ can be a part of at most one lower approximation.
Property 2: $u \in \underline{R}(X_i) \Rightarrow u \in \overline{R}(X_i)$.
Property 3: $u$ is not a part of any lower approximation $\Leftrightarrow$ $u$ belongs to two or more boundary regions.
Figure 1. The relationship among set X and its possible lower approximation, upper approximation and boundary region for equivalence relation R. In the figure, we provide the relationship among set $X$, lower approximation $\underline{R}(X)$, upper approximation $\overline{R}(X)$ and boundary region $BN_R(X)$. The internal region of the red curve represents set $X$, the internal region of the yellow line represents the lower approximation $\underline{R}(X)$, the green region represents the boundary region $BN_R(X)$, the internal region of the blue line represents the upper approximation $\overline{R}(X)$, and the whole region represents the universal set. doi:10.1371/journal.pone.0091856.g001
The next step is to determine whether an object belongs to the boundary region or to the lower approximation of a set. For each object $u$, let $S(u, X_i)$ be the similarity between $u$ and any set $X_i$.
Definition 4. $S(u, X_i)$ denotes the similarity between node $u$ and set $X_i$; it is computed from the pairwise similarities $Sim(u, v)$ obtained by Definition 1. The ratio $S(u, X_j)/S(u, X_i)$ is used to decide the assignment of $u$ as follows [12,13]:
1. If $S(u, X_i)$ is the maximum for $1 \le i, j \le k$ and $S(u, X_j)/S(u, X_i) \ge$ threshold (where $k$ denotes the number of sets, that is, the number of equivalence classes), then $u \in BN_R(X_i)$ and $u \in BN_R(X_j)$. Furthermore, $u$ is not a part of any lower approximation. This criterion ensures that Property 3 is satisfied.
2. Otherwise, $u \in \underline{R}(X_i)$ such that $S(u, X_i)$ is the maximum for $1 \le i \le k$. In addition, by Property 2, $u \in \overline{R}(X_i)$. This criterion also satisfies Property 1.
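The ratio-based assignment described above can be sketched in Python as follows. The sketch takes the similarities between one node and every set as given, since it does not depend on how they are computed. The function name `assign`, the returned dictionary layout, the handling of a node with zero similarity to every set, and the numeric values are all assumptions made for illustration.

```python
def assign(similarities, threshold):
    """Assign one node to a lower approximation or to boundary regions.

    similarities: dict mapping a set index i to the similarity S(u, X_i).
    Returns {"lower": i or None, "boundary": [indices]}.
    """
    best = max(similarities, key=similarities.get)
    s_max = similarities[best]
    if s_max <= 0:
        # ASSUMPTION: a node unrelated to every set stays unassigned.
        return {"lower": None, "boundary": []}
    # Sets whose similarity is close enough to the maximum (ratio test).
    close = [j for j, s in similarities.items()
             if j != best and s / s_max >= threshold]
    if close:
        # Criterion 1: the node lies in several boundary regions.
        return {"lower": None, "boundary": sorted([best] + close)}
    # Criterion 2: the node lies in exactly one lower approximation.
    return {"lower": best, "boundary": []}


# Hypothetical similarity values; 0.4 / 0.9 is about 0.444.
print(assign({1: 0.9, 2: 0.0, 3: 0.4}, threshold=0.4))
print(assign({1: 0.0, 3: 0.5}, threshold=0.4))
```

Because a node is returned either with one lower approximation or with two or more boundary regions, never both, the output respects Properties 1 and 3 by construction.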

The rough-fuzzy clustering method
The RFC consists of the following major steps, as shown in Figure 2.
(1) The graph (Figure 3) is represented by an adjacency matrix $N$, which is then transformed into the fuzzy matrix $R$ by calculating the similarities between any two nodes (Definition 1). Obviously, $R$ is reflexive and symmetric.
Here, $\lambda_1 \in (0.82, 1]$, and these objects are classified into four equivalence classes: $X_1 = \{1, 2, 3\}$, $X_2 = \{4\}$, $X_3 = \{5, 6\}$, $X_4 = \{7\}$. If $\lambda_2 \in (0.387, 0.444]$ and $i = 1, 2$ and $4$, $S(u_7, X_i)/S(u_7, X_3) = 0 \le 0.387$. Therefore, $u_7$ belongs to the lower approximation of $X_3$, so $u_5$, $u_6$ and $u_7$ belong to the same equivalence class $X_3$. If $\lambda_2 \in (0.387, 0.444]$, $S(u_4, X_3)/S(u_4, X_1) = 0.444 \ge \lambda_2$. Therefore, $u_4$ belongs to the boundary region of $X_1$ and $X_3$, and $X_2 = \emptyset$. In this case, the non-overlapping sets $\underline{R}(X_1) = \{1, 2, 3\}$ and $\underline{R}(X_2) = \{5, 6, 7\}$ and the overlapping sets $BN_R(X_1) = BN_R(X_2) = \{4\}$ are obtained.
(5) Merge the sets that overlap to a very high extent in comparison with their sizes [1]. We evaluate the extent of overlap between each pair of sets by formula 10 and merge any two sets whose overlapping score is above a specific threshold. The merging threshold is set to 0.64, which means that the intersection is at least 80% of the size of the set if the two sets are equal in size.
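The passage from a reflexive, symmetric fuzzy matrix to equivalence classes at a given threshold can be sketched as follows. The sketch assumes the usual max-min transitive closure to obtain a fuzzy equivalence relation and a cut that keeps pairs whose closure value is at least the threshold; the text does not spell out either choice, so both are assumptions. The function names and the 3 x 3 matrix are hypothetical and unrelated to the seven-node example above.

```python
def transitive_closure(R):
    """Max-min transitive closure of a reflexive, symmetric fuzzy matrix."""
    n = len(R)
    T = [row[:] for row in R]
    while True:
        # Max-min composition of T with itself.
        C = [[max(min(T[i][k], T[k][j]) for k in range(n))
              for j in range(n)] for i in range(n)]
        if C == T:          # fixed point reached: T is transitive
            return T
        T = C


def cut_classes(T, lam):
    """Equivalence classes of the crisp relation {(i, j): T[i][j] >= lam}."""
    seen, classes = set(), []
    for i in range(len(T)):
        if i in seen:
            continue
        cls = {j for j in range(len(T)) if T[i][j] >= lam}
        classes.append(cls)
        seen |= cls
    return classes


# Hypothetical fuzzy similarity matrix for three nodes (indices 0, 1, 2).
R = [[1.0, 0.8, 0.2],
     [0.8, 1.0, 0.5],
     [0.2, 0.5, 1.0]]
T = transitive_closure(R)
print(cut_classes(T, 0.6), cut_classes(T, 0.5))
```

The number of classes returned by the cut is the number of clusters, so no cluster count has to be supplied in advance; lowering the threshold can only merge classes, never split them.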
We have discussed the details of RFC. The range of suitable values of $\lambda$ is relatively larger and more flexible than in fuzzy clustering, and the clustering results are relatively stable for different $\lambda$. In the following sections, RFC is applied to artificial synthetic networks, social networks and PPI networks.

Parameter settings
In the algorithm, the threshold $\lambda_1$ is used to divide networks into non-overlapping modules. $\lambda_1$ is closely related to the magnitude of the similarities between nodes in all kinds of networks. Based on the analysis of the algorithm and a large number of experiments, we obtain $\lambda_1$ from the following quantities: $Sim(u, v)$, obtained by Definition 1, represents the similarity between nodes; $avg(Sim)$ represents the mean of the similarities of all pairs of nodes; and $Count(Sim(u, v) > avg(Sim))$ represents the number of similarity values that are greater than the mean $avg(Sim)$.
The threshold $\lambda_2$ is applied to determine whether a node belongs to one or multiple modules. In this article, it is an adjustable value. Based on a large number of experiments, $0.8\lambda_1 \le \lambda_2 \le 0.9\lambda_1$ is a good choice.

Evaluation criteria
Different criteria proposed by earlier studies are applied to evaluate RFC. The criteria are defined to assess the similarity between predicted modules and reference modules. The first measure is Normalized Mutual Information (NMI), an information-theoretic measure that quantifies the closeness of two groups of sets and has been widely used in clustering algorithms and machine learning [30,34,35,36]. It is defined in terms of $H(X)$ and $H(Y)$, the entropies of the random variables $X$ and $Y$, and $H(X, Y)$, their joint entropy.
Here, for a random variable $X$ with $n$ outcomes $(x_1, \ldots, x_n)$, $p(x_j)$ is the probability mass function of outcome $x_j$, and $p(y_i, x_j)$ is the probability that $Y = y_i$ and $X = x_j$.
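For reference, the entropies named above have the standard Shannon form shown below. The last line is one common normalization of the mutual information; it is given here only as an assumption, since several NMI variants exist (including variants designed for overlapping covers) and the text does not fix which one is meant.

```latex
\begin{align}
H(X)    &= -\sum_{j=1}^{n} p(x_j)\,\log p(x_j), \\
H(X, Y) &= -\sum_{i}\sum_{j} p(y_i, x_j)\,\log p(y_i, x_j), \\
I(X; Y) &= H(X) + H(Y) - H(X, Y), \\
\mathrm{NMI}(X, Y) &= \frac{2\, I(X; Y)}{H(X) + H(Y)}
  \quad \text{(assumed normalization)}.
\end{align}
```

With this normalization the score is 1 when the two groupings coincide and 0 when they are statistically independent.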
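The second measure is the overlapping score between predicted and reference complexes, which is defined as follows [37]:
\[ OS(p, k) = \frac{|p \cap k|^2}{|p| \, |k|} \]
Here, $p \in P$ is a predicted complex and $k \in K$ is a reference complex. $P$ is the set of predicted complexes and $K$ is the set of reference complexes.
After defining the overlapping score $OS(p, k)$ between a predicted complex and a reference complex, precision, recall and F1 measure are defined as follows [37]:
\[ Precision = \frac{OPN_p}{|P|} \qquad (13) \]
\[ Recall = \frac{OPN_k}{|K|} \qquad (14) \]
\[ F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (15) \]
Here, $OPN_p$ is the number of predicted complexes $p$ for which $OS(p, k) \ge v$ holds for at least one reference complex, and $OPN_k$ is the number of reference complexes $k$ for which $OS(p, k) \ge v$ holds for at least one predicted complex.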
The overlapping threshold $v = 0.25$ is chosen because it means that the intersection is at least half of the complex size if the two complexes are equal in size [1]. Precision is the fraction of the predicted complexes that match known complexes. Recall is the fraction of known complexes that match predicted complexes. The F1 measure gives a reasonable combination of both precision and recall. Taking the known complexes as the reference classification, sensitivity is the fraction of members of the ith known complex that are found in the jth predicted complex. Clustering-wise sensitivity (Sn) is defined as follows [1,37]:
\[ Sn = \frac{\sum_{i=1}^{n} \max_{j} T_{ij}}{\sum_{i=1}^{n} num_i} \]
Here, $n$ is the number of known complexes, $T_{ij}$ is the number of common proteins between the ith known complex and the jth predicted complex, and $num_i$ is the number of proteins belonging to the ith known complex. The positive predictive value (PPV) is the fraction of members of the jth predicted complex that belong to the ith known complex. PPV is defined as follows [37]:
\[ PPV = \frac{\sum_{j=1}^{m} \max_{i} T_{ij}}{\sum_{j=1}^{m} \sum_{i=1}^{n} T_{ij}} \]
Here, $m$ is the number of predicted complexes and $n$ is the number of known complexes. The geometric accuracy (Acc) balances sensitivity and positive predictive value. It is obtained as the geometric mean of Sn and PPV [37]:
\[ Acc = \sqrt{Sn \times PPV} \]
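The overlap-based measures (overlapping score, precision, recall and F1) can be sketched in Python as follows. The sketch assumes that the overlapping score is the squared intersection size divided by the product of the two complex sizes, which is the form consistent with the two thresholds quoted in the text (half overlap gives 0.25, 80% overlap gives 0.64 for equal-sized sets). Function names and the toy complexes are hypothetical.

```python
def overlap_score(p, k):
    """Assumed overlapping score: |p & k|^2 / (|p| * |k|)."""
    p, k = set(p), set(k)
    if not p or not k:
        return 0.0
    return len(p & k) ** 2 / (len(p) * len(k))


def precision_recall_f1(predicted, reference, v):
    """Precision, recall and F1 at overlap threshold v."""
    # Predicted complexes matching at least one reference complex.
    opn_p = sum(any(overlap_score(p, k) >= v for k in reference)
                for p in predicted)
    # Reference complexes matched by at least one predicted complex.
    opn_k = sum(any(overlap_score(p, k) >= v for p in predicted)
                for k in reference)
    precision = opn_p / len(predicted) if predicted else 0.0
    recall = opn_k / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical predicted and reference complexes (protein identifiers).
predicted = [{1, 2, 3, 4}, {5, 6}, {9, 10}]
reference = [{1, 2, 3}, {5, 6, 7, 8}]
precision, recall, f1 = precision_recall_f1(predicted, reference, v=0.25)
print(precision, recall, f1)
```

Two equal-sized sets of five proteins sharing four members score 16/25 = 0.64 under this formula, which is the arithmetic behind the merging threshold used in the clustering step.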
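We employ separation to evaluate the one-to-one correspondence between predicted complexes and known complexes. The separation of the ith known complex and the jth predicted complex, and the resulting overall separation, are defined as follows [1,2,37]:
\[ Sep_{ij} = \frac{T_{ij}}{\sum_{j'=1}^{m} T_{ij'}} \times \frac{T_{ij}}{\sum_{i'=1}^{n} T_{i'j}} \]
\[ Sep_{co} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sep_{ij}}{n}, \qquad Sep_{cl} = \frac{\sum_{i=1}^{n} \sum_{j=1}^{m} Sep_{ij}}{m}, \qquad Sep = \sqrt{Sep_{co} \times Sep_{cl}} \]
Here, $n$ is the number of known complexes, $m$ is the number of predicted complexes, and $T_{ij}$ is the number of common proteins between the ith known complex and the jth predicted complex.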
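The contingency-table measures (Sn, PPV, Acc and separation) can be sketched in Python as follows. The formulas used are the commonly cited definitions of these clustering-wise scores and are an assumption about the exact form intended here; the function name and the toy complexes are hypothetical, and the sketch assumes that both lists are non-empty.

```python
from math import sqrt


def clustering_scores(known, predicted):
    """Return (Sn, PPV, Acc, Sep) from known and predicted complexes."""
    n, m = len(known), len(predicted)
    # T[i][j]: number of proteins shared by known i and predicted j.
    T = [[len(set(k) & set(p)) for p in predicted] for k in known]
    row = [sum(T[i]) for i in range(n)]
    col = [sum(T[i][j] for i in range(n)) for j in range(m)]
    total = sum(row)

    sn = sum(max(T[i]) for i in range(n)) / sum(len(k) for k in known)
    ppv = (sum(max(T[i][j] for i in range(n)) for j in range(m)) / total
           if total else 0.0)
    acc = sqrt(sn * ppv)

    # Sep_ij = (T_ij / row_i) * (T_ij / col_j), summed over non-zero cells.
    sep_sum = sum(T[i][j] ** 2 / (row[i] * col[j])
                  for i in range(n) for j in range(m) if T[i][j])
    sep = sqrt((sep_sum / n) * (sep_sum / m))
    return sn, ppv, acc, sep


# Hypothetical known and predicted complexes.
known = [{1, 2, 3}, {4, 5}]
predicted = [{1, 2}, {3, 4, 5}]
sn, ppv, acc, sep = clustering_scores(known, predicted)
print(sn, ppv, acc, sep)
```

In this toy case protein 3 is placed with the wrong complex, which lowers all four scores below 1; a perfect one-to-one match would give a separation of exactly 1.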

Results
To validate RFC's feasibility, we apply it to artificial networks, social networks and protein interaction networks. In artificial networks, we compare its performance with that of the best algorithms currently available. The algorithms GCE [29] and OSLOM [30] are selected for a fair comparison on LFR benchmark networks. To further verify the performance of our method, we apply RFC to the Karate club network [38] and the Dolphins network [39].
To evaluate RFC quantitatively, we apply it to four weighted and five unweighted large-scale yeast PPI datasets (see Table 1), and compare the predicted complexes with two sets of reference complexes, MIPS [31] and SGD [32] (see Table 2). We also compare the RFC results with those of six other popular methods, MCL [28], CFinder [3], ClusterONE [1], GCE [29], OSLOM [30] and CMC [5,27], in order to test the performance of extracting overlapping complexes. The similarity in weighted networks is defined by the weight of the edge, and the similarity in unweighted networks is calculated by Definition 1.

Artificial networks
The LFR benchmark [36] is a class of benchmark graphs that account for the heterogeneity in the distributions of node degrees and community sizes. It can be applied to overlapping communities by assigning to each node the same number of neighbors in different communities. To simplify things, we suppose that each node belongs to the same number of communities [30]. The mixing parameter $\mu$, used as the independent variable, is the ratio of the number of external neighbors of a node to the total degree of the node [30]. Small values of $\mu$ indicate well-separated communities, whereas large values of $\mu$ indicate communities that are highly mixed with each other.
RFC is tested and compared with two recent methods: GCE [29], based on greedy clique expansion, and OSLOM [30], based on a local optimization method. The two methods perform well on LFR benchmark graphs with overlapping communities. The change of NMI with the mixing parameter $\mu$ for the three algorithms is presented in Figure 4. In all tests on LFR benchmark graphs, the mixing parameter $\mu$ varies from 0.1 to 0.9 in steps of 0.1, each point is computed from 100 realizations, and the mean NMI is reported as the result. As $\mu$ increases, communities become more and more fuzzy and it gets harder for any method to detect the modules correctly. We find that RFC performs competitively in comparison with GCE and OSLOM.
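The mixing parameter defined above, the fraction of a node's neighbors that lie outside its own communities, can be computed per node as in the following Python sketch. Treating a neighbor as external when it shares no community with the node is an assumption made here for the overlapping case; the function name, graph and community labels are hypothetical.

```python
def mixing_parameter(adj, communities, node):
    """Fraction of a node's neighbors that share no community with it."""
    neighbors = adj[node]
    if not neighbors:
        return 0.0                      # isolated node: no external links
    own = communities[node]
    # ASSUMPTION: a neighbor is external if it shares no community label.
    external = sum(1 for nb in neighbors if not (communities[nb] & own))
    return external / len(neighbors)


# Hypothetical graph with two communities "a" and "b" joined by edge 1-4.
adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1, 5}, 5: {4}}
communities = {1: {"a"}, 2: {"a"}, 3: {"a"}, 4: {"b"}, 5: {"b"}}
print([round(mixing_parameter(adj, communities, v), 3) for v in sorted(adj)])
```

In a benchmark graph every node is generated with approximately the same value, so averaging this quantity over all nodes gives a quick check of how strongly the planted communities are mixed.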

Social networks
Although RFC performs well on artificial networks, we also select two real-world networks for further evaluation.

Karate club network
Zachary observed 34 members of a karate club at a US university over three years [38]. During that time, node 1 (the club's instructor) and node 34 (the club's president) disagreed on the price of karate lessons. Ultimately the club split into two organizations: one group consisted of the supporters of the president and the other of the supporters of the instructor. In fact, some individuals had friendships in both groups; that is, these individuals may be overlapping nodes.
Here we use an unweighted version of the network to test RFC and attempt to determine the factions involved in the split of the club. RFC performs well in detecting the two well-known communities, which are centered at node 1 and node 34, respectively. Nodes 9, 10, 20, 28 and 29 are shared between the two groups. The communities coincide with the overlapping nodes 9, 10 and 20 observed by Sun et al. [10], with the exception of nodes 28 and 29, which Sun et al. assign to the community of the club's president. However, node 28 and node 29 have neighbors 3 and 34, respectively. Neighbor 34 is the club's president in one community, while neighbor 3 plays a pivotal role in the other community. Therefore, it is reasonable that nodes 28 and 29 are overlapping. The detailed community structure of the network is shown in Figure 5.

Dolphins network
The second example we discuss is the network studied by the biologist Lusseau [39], who divided a group of dolphins into two groups according to their age. There are 62 nodes and 159 edges in the network. RFC finds two communities with four overlapping nodes (8,29,31,40), which can be seen in Figure 6. The partition of the two communities by RFC agrees with the separation observed by David Lusseau.

PPI networks
First, we test six of the methods mentioned above (RFC, ClusterONE, CMC, MCL, OSLOM and CFinder) on the weighted Gavin, Collins and Krogan datasets. Table 3 gives the detailed benchmark results on the Gavin dataset when MIPS is used as the gold standard. The detailed benchmark results on the Collins and Krogan datasets are provided in Table S1. Figure 7 compares the six algorithms on the weighted Gavin, Collins and Krogan datasets using the MIPS gold standard. The results of RFC are compared with those of ClusterONE, CMC, MCL, OSLOM and CFinder. The precision, sensitivity and separation are 35.8%, 48.3% and 75.9% higher than the mean of the five other methods in the four weighted networks. Table 4 gives the detailed benchmark results on the Gavin dataset when SGD is used as the gold standard. The detailed benchmark results on the Collins and Krogan datasets are provided in Table S2. Figure 8 compares the six algorithms on the weighted Gavin, Collins and Krogan datasets using the SGD gold standard. The results of RFC are compared with those of ClusterONE, CMC, MCL, OSLOM and CFinder.
Then we test all seven methods mentioned above on the unweighted Gavin, Collins, Krogan and BioGRID datasets. Table 5 gives the detailed benchmark results on the Gavin dataset when MIPS is used as the gold standard. The detailed benchmark results on the Collins, Krogan and BioGRID datasets are provided in Table S3. Figure 9 compares all seven algorithms on the unweighted Gavin, Collins, Krogan and BioGRID datasets using the MIPS gold standard. RFC results are compared with ClusterONE, CMC, MCL, OSLOM, GCE and CFinder results. The precision, F1 measure, sensitivity, accuracy and separation are 0.1%, 16.1%, 10.5%, 9.6% and 60.5% higher than the mean of the six other methods in the five unweighted networks. Table 6 gives the detailed benchmark results on the Gavin dataset when SGD is used as the gold standard. The detailed benchmark results on the Collins, Krogan and BioGRID datasets are provided in Table S4. Figure 10 compares all seven algorithms on the unweighted Gavin, Collins and Krogan datasets using the SGD gold standard. RFC results are compared with ClusterONE, CMC, MCL, OSLOM, GCE and CFinder results. The precision, F1 measure, sensitivity, accuracy and separation are 2.7%, 26.6%, 11.8%, 10.1% and 69.8% higher than the mean of the six other methods in the five unweighted networks.

Conclusion and Discussion
In this paper, we present a novel method based on rough-fuzzy clustering to detect overlapping and non-overlapping protein complexes in PPI networks. RFC is based on a fuzzy relation model, which is transformed into equivalence classes to detect non-overlapping protein complexes. We further apply the upper and lower approximations of rough sets to decide whether each node in the network belongs to one or multiple complexes. Ultimately, each resulting set corresponds to an overlapping protein complex.
RFC is tested on artificial networks, social networks and PPI networks; it provides a new insight into network division and accurately recovers communities in artificial networks. To determine whether these results are robust, we perform comparative benchmarks on a range of LFR graphs with overlapping communities, and find that RFC performs competitively in comparison with GCE and OSLOM. To complete our evaluation, we test RFC and six other popular clustering algorithms on five unweighted PPI networks and four weighted PPI networks, and compare the results with the MIPS and SGD gold standard datasets separately. We find that the three quality scores (accuracy, sensitivity and separation) obtained by RFC are clearly larger than those obtained by the six other methods.
Our results indicate that RFC outperforms six other popular algorithms in three respects: more predicted complexes match known complexes (higher accuracy), more known complexes match predicted complexes (higher sensitivity), and the one-to-one mapping with reference complexes is better (higher separation). RFC results have a significant overall advantage, especially on the Gavin and Collins datasets, whose node numbers are close to those of the reference complexes. ClusterONE, OSLOM, GCE and MCL yield the scores closest to RFC.
There exist several rough-fuzzy clustering algorithms in previous studies [8,14,17,18,40], such as rough c-means clustering (RCM) [13,15], rough-fuzzy c-means clustering (RFCM) [8,18] and rough-fuzzy possibilistic c-means clustering (RFPCM) [17]. These algorithms are mainly based on rough-fuzzy c-means clustering and its derivatives, and they are used to cluster coexpressed genes or functionally similar genes from microarray gene expression data sets. Recently, fuzzy-rough supervised gene clustering algorithm (FRSAC) has been proposed in [40] to detect groups of co-regulated genes whose expression is strongly associated with sample categories. The research objects of these clustering algorithms are two-dimensional gene expression data, that is, each row represents a gene and each column a sample. In those algorithms, the function of fuzzy sets is to handle overlapping partitions, and rough sets deal with uncertainty, vagueness, and incompleteness in class definition.
To the best of our knowledge, a fuzzy clustering algorithm was first proposed in [11] to detect overlapping and non-overlapping communities in social networks. In that algorithm, the results are sensitive to the choice of two thresholds, and it is difficult to choose accurate thresholds in large social networks and PPI networks. If the first threshold is not precise enough, some nodes that should belong to a community may not belong to any equivalence class, so these nodes will not be allocated to the community. If the second threshold is not accurate enough, overlapping nodes that should belong to two or more communities may not be allocated to those communities unless they are highly correlated with them. Therefore, choosing the threshold values is difficult in large social networks and PPI networks, and a poor choice causes inaccuracy by excluding some edge nodes.
To overcome these weaknesses, we propose a new algorithm, RFC, which differs from the algorithms above in its algorithmic basis, in the structure of the objects being clustered, and in the roles of rough sets and fuzzy sets. More specifically, RFC is not based on c-means clustering, and the research objects of RFC are three-dimensional network data. In RFC, fuzzy sets are used to create a fuzzy equivalence relation and to obtain the number of clusters automatically by counting the equivalence classes. Rough sets are used to determine whether each node belongs to one or multiple complexes. The computing process of RFC indicates that the range of suitable values for the two thresholds in RFC is relatively larger and more flexible than in the fuzzy clustering algorithm [11]. By introducing the upper and lower approximations of rough sets, it is also easier to detect the edge nodes of a community or a complex than with the fuzzy clustering algorithm. The most significant advantage of RFC is that its separation is larger than that of the other algorithms, indicating a better one-to-one correspondence between predicted complexes and known complexes.
Protein complexes are key components that perform cellular functions associated with specific diseases [41]; for example, proteins shared among multiple complexes tend to be drug targets [41]. In biological networks, some critical genes or motifs participate in multiple biological processes, implying the existence of overlapping modules. Studying the overlapping modules in networks is critical since it helps to reveal the relationship between structure and function. In future work, we will focus on detecting human protein complexes with RFC to investigate disease-related genes and drug targets.