Detecting protein complexes from protein-protein interaction (PPI) networks is a challenging task in computational biology. A vast number of computational methods have been proposed to undertake this task. However, each computational method is developed to capture one aspect of the network. The performance of different methods on the same network can differ substantially, and even the same method may perform differently on networks with different topological characteristics. The clustering result of each computational method can be regarded as a feature that describes the PPI network from one aspect. It is therefore desirable to utilize these features to produce a more accurate and reliable clustering. In this paper, a novel Bayesian Nonnegative Matrix Factorization (NMF)-based weighted Ensemble Clustering algorithm (EC-BNMF) is proposed to detect protein complexes from PPI networks. We first apply different computational algorithms to a PPI network to generate a set of base clustering results. Then we integrate these base clustering results into an ensemble PPI network in the form of a weighted combination. Finally, we identify overlapping protein complexes from this network by employing a Bayesian NMF model. When generating the ensemble PPI network, EC-BNMF automatically optimizes the weights so that the ensemble algorithm delivers better results. Experimental results on four PPI networks of Saccharomyces cerevisiae verify the effectiveness of EC-BNMF in detecting protein complexes. EC-BNMF provides an effective way to integrate different clustering results for more accurate and reliable complex detection. Furthermore, EC-BNMF has a high degree of flexibility in the choice of base clustering results. It can be coupled with existing clustering methods to identify protein complexes.
Citation: Ou-Yang L, Dai D-Q, Zhang X-F (2013) Protein Complex Detection via Weighted Ensemble Clustering Based on Bayesian Nonnegative Matrix Factorization. PLoS ONE 8(5): e62158. https://doi.org/10.1371/journal.pone.0062158
Editor: Vladimir N. Uversky, University of South Florida College of Medicine, United States of America
Received: December 24, 2012; Accepted: March 18, 2013; Published: May 2, 2013
Copyright: © 2013 Ou-Yang et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This project was supported in part by National Natural Science Foundation of China (NSFC) (11171354), the Ministry of Education of China (SRFDP-20120171110016). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Protein-protein interactions (PPI) are fundamental to the biological processes within cells. Most proteins form complexes to carry out biological tasks. Protein complexes can help us to predict the functions of proteins. There is evidence that many disease mechanisms involve protein complexes. Therefore, in the post-genomic era, predicting protein complexes is crucial. To address this problem, several experimental methods have been developed for detecting protein complexes. For instance, Tandem Affinity Purification (TAP) with mass spectrometry can capture stable protein complexes, whereas Protein-fragment Complementation Assay (PCA) can be used to study temporal and spatial dynamics of protein interactions. However, as noted in earlier studies, these methods have inherent limitations, such as being very time consuming. Because of these experimental limitations, it is necessary to develop computational approaches that can serve as useful complements to the experimental methods for detecting protein complexes.
Recently, high-throughput methods such as two-hybrid systems and mass spectrometry have been developed to detect a large number of protein interactions, which enable the construction of PPI networks and make it possible to understand cellular organization at the network level. A PPI network can generally be modeled as an undirected graph, where nodes represent proteins and edges represent pairwise interactions. Previous studies analyzed the graph topology of PPI networks and discovered that dense regions of the network may represent complexes. These observations support the rationale of identifying protein complexes by detecting clusters in a PPI network.
In recent years, a vast number of computational approaches based on graph clustering have been applied to PPI networks for protein complex identification. These graph clustering algorithms mainly rely on analysis of the structural topology of PPI networks to identify protein complexes, and can be roughly divided into three categories: density-based approaches, graph partition-based approaches and hierarchical clustering algorithms. Several comprehensive reviews are available in the literature. The clustering result of a graph clustering algorithm is a set of clusters. In PPI networks, these clusters correspond to two types of modules: protein complexes and functional modules. A protein complex is a group of proteins that interact with each other at the same location and time. A functional module consists of proteins that participate in the same biological process or perform the same cellular function while binding each other at the same or different location and time. Here, we do not distinguish protein complexes from functional modules, since we only use the PPI network as the underlying dataset for the mining task and the protein interaction data under consideration do not provide temporal and spatial information.
Unfortunately, due to the complex structure of the PPI network, the inner structure of protein complexes is still elusive. Given a PPI network, different protein complex identification algorithms may obey different optimization criteria and yield diverse clustering results, since each of them has been developed to capture one aspect of the network and neglects other network properties. Furthermore, the observed PPI networks obtained from high-throughput methods are quite noisy and therefore may not represent the real situation. The performance of each protein complex identification algorithm depends heavily on network characteristics.
In fact, it is hard to find a protein complex identification algorithm that works well across networks with diverse properties. Each algorithm has its own advantages and limitations. Density-based approaches focus on detecting densely connected subgraphs in PPI networks. A typical example in this category is CFinder, which detects the k-clique percolation clusters as complexes. However, true complexes in the organism are not limited to densely connected substructures. As pointed out by Qi et al., complexes with sparsely connected substructures also exist in the PPI network. Therefore, traditional density-based approaches may miss many biologically meaningful complexes with low density. Additionally, because they lack a global measurement, density-based algorithms cannot produce satisfactory results. Graph partition-based approaches such as MCL and RNSC search for the best partition of a network. These algorithms are not able to discover overlapping complexes since they only support hard clustering. However, it is generally accepted that some proteins may perform different biological functions while interacting with different partners. Thus graph partition-based approaches cannot accurately capture the real structure of complexes in PPI networks. Hierarchical clustering algorithms can discover the hierarchical structure in a PPI network, which is important for understanding the global structure of functional organization. However, both bottom-up and top-down hierarchical approaches are sensitive to noisy data, and it is well known that the interaction data obtained from high-throughput methods may be quite noisy and contain a considerable fraction of false positives. Furthermore, like graph partition-based approaches, hierarchical approaches cannot generate overlapping clusters either.
In order to generate more reliable solutions, protein complex identification algorithms should ideally exploit all features of the network and account for properties of the partitions, such as overlaps and hierarchy. However, very few algorithms are capable of taking all these factors into consideration. Note that the clustering result of each algorithm can be regarded as a feature of the PPI network, which describes the network from one aspect. A natural question is whether we can utilize these features. Thus, we study a basic problem in this paper: provided that a PPI network is described by several clustering results computed by different computational methods, how can these clustering results be integrated for accurate and reliable complex detection?
Ensemble clustering is a well-known data analysis technique that addresses this problem. In the machine learning literature, ensemble clustering has been proposed as an effective approach to strengthen the quality of simple clustering algorithms. There are reasons to believe that ensemble clustering may benefit from the integration of base clustering results. Hence, we would like to apply ensemble clustering to detect protein complexes in PPI networks. In this paper, the base clustering results are obtained by applying different protein complex identification algorithms to the same PPI network. However, most existing ensemble clustering algorithms focus on naive combination frameworks. That is, they treat each base clustering result equally. But given a PPI network, some clustering results may be more reliable than others. Thus, different base clustering results should not be treated equally.
In light of the aforementioned challenges, to effectively utilize the information contained in different clustering results, we introduce a weighted ensemble approach which assigns a weight to each base clustering result. However, there is no prior information for deciding the values of these weights. Inspired by the agglomerative fuzzy k-means clustering algorithm and ensemble manifold regularization, we determine the values of these weights automatically through an optimization process, so that the ensemble clustering can produce a better-quality solution.
Clustering analysis by nonnegative matrix factorization (NMF) has achieved remarkable progress in the past decade. Recently, it has been employed in cancer class discovery and gene expression analysis. As a matrix decomposition technique, NMF produces a low-dimensional approximation of a nonnegative matrix in the form of nonnegative factors, which can be formulated as $V \approx XY$, where $V$ is the nonnegative data matrix and $X$ and $Y$ are the nonnegative factors. The nonnegativity of these factors allows them to be interpreted as a soft clustering of the data. When NMF is used as a clustering algorithm, estimating the optimal number of clusters (columns of X or rows of Y) is still a serious issue. Tan and Févotte formulated a Bayesian approach to determine the effective number of columns of X (or rows of Y) via automatic relevance determination. Recently, Psorakis et al. applied this model to social networks for community detection. Compared with previous algorithms, the Bayesian NMF model has several advantages. First, each node is associated with a membership distribution over communities, which represents its propensity to belong to each community. Therefore, it supports overlap between communities. Second, it does not suffer from the resolution limit. Third, it is easy to implement and fast enough for large data sets. However, a simple application of the Bayesian NMF model to PPI networks may not obtain competitive results, since many protein interactions detected by high-throughput methods may be false positives, which will mislead the detection of complexes.
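To make the soft-clustering reading of NMF concrete, the following minimal sketch factorizes a small nonnegative matrix with the classical multiplicative updates for the squared-error objective. It is plain NMF, not the Bayesian model used later in this paper; the names `V`, `nmf`, `n_iter` and the toy matrix are assumptions introduced only for illustration.

```python
import numpy as np

def nmf(V, k, n_iter=500, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix V (n x m) as X @ Y with X, Y >= 0."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    X = rng.random((n, k)) + eps
    Y = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates: factors stay nonnegative because every
        # quantity in the ratio is nonnegative.
        Y *= (X.T @ V) / (X.T @ X @ Y + eps)
        X *= (V @ Y.T) / (X @ Y @ Y.T + eps)
    return X, Y

# Toy matrix with two obvious blocks (hypothetical data).
V = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
X, Y = nmf(V, k=2)

# Each row of X can be read as a soft membership over the k components.
soft_membership = X / X.sum(axis=1, keepdims=True)
```

Here the number of components `k` has to be supplied by hand, which is exactly the model-selection issue that the automatic relevance determination prior is meant to address.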
With these motivations, in this paper we propose a novel Bayesian Nonnegative Matrix Factorization-based weighted Ensemble Clustering algorithm (EC-BNMF) for identifying protein complexes. EC-BNMF can integrate multiple clustering results (features) of a PPI network and produce a more accurate and informative clustering. In addition, EC-BNMF allows proteins to be shared among complexes, which is closer to reality. By applying EC-BNMF to four yeast PPI networks, we show that EC-BNMF is competitive with state-of-the-art algorithms in detecting protein complexes. Furthermore, the experimental results verify the effectiveness of EC-BNMF in detecting multi-functional proteins.
In recent years, several approaches based on ensemble clustering have been applied to PPI networks for the purpose of detecting protein complexes. To weight the edges of the PPI network and measure the reliability of the corresponding interactions, Asur et al. first introduced two similarity metrics: a clustering coefficient-based metric and a betweenness-based metric. After improving the quality of the data, they used three conventional graph partition-based algorithms (repeated bisections, direct k-way partitioning and multilevel k-way partitioning) to generate six base clustering results. These base clustering results all consist of k clusters, where k is the predefined number of clusters of each base clustering. Then they described two different techniques, pruning and weighting, to eliminate noisy clusters. Finally, they developed a consensus method based on principal component analysis to solve the clustering problem. In order to discover multi-functional proteins, they also designed an adaptation to allow for soft clustering. However, the base clustering algorithms are all partition-based methods. Thus they may not be able to fully capture the structure of the network, and their performance may depend heavily on the quality of the two similarity metrics. Moreover, the base clustering algorithms and the consensus algorithm all need the number of clusters to be predefined, but the true number of complexes is always unknown.
Another ensemble framework for detecting protein complexes was proposed by Greene et al. Using different numbers of dimensions, they first generated a collection of non-negative matrix factorizations. Then they proposed a hierarchical meta-clustering algorithm to aggregate these factorizations and produce a disjoint hierarchy of meta-clusters. Finally, they transformed these results into a soft hierarchical clustering of the original dataset. Most recently, Lancichinetti and Fortunato presented a systematic study of consensus clustering in complex networks. They demonstrated that consensus clustering can be used to cope with the stochastic fluctuations in the results of clustering techniques. Given a network and a clustering algorithm S, they first applied S to the network repeatedly and obtained a set of partitions. Then they computed a consensus matrix D based on the cooccurrence of nodes in clusters of the base partitions. After filtering out small entries in D, they applied S to D repeatedly and produced a new set of partitions, which generated a new consensus matrix. The procedure is iterated until a unique partition is reached, which cannot be altered by further iterations. Both of these algorithms focus on generating more accurate and stable results out of a set of partitions delivered by a specific method. Greene et al. developed an algorithm to identify protein complexes from several clustering results, but they did not select among the base clustering results, whereas Lancichinetti and Fortunato extracted reliable information from base clustering results and used the original algorithm to detect communities.
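The consensus matrix used in this kind of procedure can be sketched as follows. This is a minimal illustration, assuming each partition is given as a list of cluster labels (one per node) and that an entry of D is the fraction of partitions in which two nodes share a cluster; the function name, the threshold name `tau` and the toy partitions are hypothetical.

```python
import numpy as np

def consensus_matrix(partitions, tau=0.0):
    """Fraction of partitions in which each pair of nodes shares a cluster.

    partitions: sequence of label lists, one list per partition.
    tau: entries below this value are set to zero (assumed filtering step).
    """
    labels = np.asarray(partitions)            # shape: (n_partitions, n_nodes)
    same = labels[:, :, None] == labels[:, None, :]
    D = same.mean(axis=0)                      # average co-occurrence
    np.fill_diagonal(D, 0.0)                   # ignore self co-occurrence
    D[D < tau] = 0.0                           # filter out weak evidence
    return D

# Three hypothetical partitions of four nodes.
partitions = [[0, 0, 1, 1],
              [0, 0, 0, 1],
              [0, 1, 1, 1]]
D = consensus_matrix(partitions)
D_filtered = consensus_matrix(partitions, tau=0.5)
```

In the iterative scheme described above, the clustering algorithm would then be run again on the filtered matrix; that loop is omitted here.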
Given a PPI network with N proteins, we model it as an undirected simple graph with a set of nodes V and a set of edges E, where nodes represent proteins and edges represent pairwise interactions. The graph can be represented by an $N \times N$ adjacency matrix A, where $A_{ij} = 1$ if there is an edge between proteins i and j, and $A_{ij} = 0$ otherwise. In this way, the problem of detecting protein complexes is cast as clustering the nodes into groups.
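As a small illustration of this representation, the sketch below builds the symmetric 0/1 adjacency matrix from a list of pairwise interactions. The protein identifiers and the function name are hypothetical.

```python
def adjacency_matrix(proteins, interactions):
    """Return the symmetric 0/1 adjacency matrix of an undirected simple graph."""
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    A = [[0] * n for _ in range(n)]
    for u, v in interactions:
        if u == v:
            continue  # a simple graph has no self-loops
        i, j = index[u], index[v]
        A[i][j] = A[j][i] = 1  # undirected: set both directions
    return A

proteins = ["P1", "P2", "P3", "P4"]
interactions = [("P1", "P2"), ("P2", "P3"), ("P3", "P1")]
A = adjacency_matrix(proteins, interactions)
```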
The task of ensemble clustering is to obtain a comprehensive consensus clustering by integrating diverse and independent clustering results (here we call them base clustering results). Each base clustering result is generated by a computational algorithm (here we call them base clustering algorithms). As some of the base clustering results do not cover all proteins in the PPI network (e.g., those of MCODE), we set each unclustered protein to be a singleton cluster. Therefore, each base clustering result contains all of the proteins in the PPI network. Given a PPI network, there are several ways to obtain a collection of clustering results. They can be generated by a given approach with different initializations, or by different approaches.
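The singleton convention described above can be sketched as follows, assuming each base clustering result is given as a list of protein sets; the function and protein names are hypothetical.

```python
def pad_with_singletons(clusters, proteins):
    """Add a singleton cluster for every protein not covered by any cluster."""
    covered = set().union(*clusters)
    padded = [set(c) for c in clusters]
    padded.extend({p} for p in proteins if p not in covered)
    return padded

proteins = ["P1", "P2", "P3", "P4", "P5"]
base_result = [{"P1", "P2", "P3"}]            # leaves P4 and P5 unclustered
padded = pad_with_singletons(base_result, proteins)
```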
In this section, we propose a novel Bayesian Nonnegative Matrix Factorization-based weighted Ensemble Clustering algorithm (EC-BNMF) to perform efficient protein complex detection. EC-BNMF consists of two phases: a generation phase, which extracts useful information from several base clustering results and generates an ensemble PPI network, and a complex detection phase, in which a Bayesian NMF-based ensemble clustering is employed to detect protein complexes from the ensemble PPI network. The flow-chart of the algorithm is shown in Figure 1.
Figure 1. EC-BNMF consists of two phases: a generation phase which integrates several base clustering results into an ensemble PPI network and a complex detection phase which accomplishes the detection of protein complexes.
Constructing an ensemble PPI network
Given an original PPI network and a collection of base clustering results, the goal of this phase is to extract useful information from these data. Here, each base clustering result is regarded as a “feature” of the original PPI network, which provides a description of this network. In order to analyze these “features” from a network perspective, we construct a feature network from each base clustering result. In each feature network, two proteins are connected if they have cooccurred in the same cluster at least once. Therefore, if we accept the popular definition of protein complexes as subgraphs with a high internal edge density and a low external edge density, it is easy to identify protein complexes from the feature network, since it consists only of a set of fully connected subgraphs. Furthermore, since the original PPI network contains important information, we also treat the original PPI network as a feature network. In this way, we obtain a set of feature networks that describe the original PPI network from different aspects.
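The construction of a feature network can be sketched as below. This is a minimal illustration, assuming proteins are indexed 0..n-1, each cluster is a set of indices, and the diagonal is left at zero (the text does not specify the diagonal); the function name is hypothetical.

```python
import numpy as np

def feature_matrix(clusters, n):
    """0/1 co-membership matrix: entry (i, j) is 1 if proteins i and j
    appear together in at least one cluster."""
    S = np.zeros((n, n))
    for cluster in clusters:
        idx = np.fromiter(cluster, dtype=int)
        S[np.ix_(idx, idx)] = 1.0   # fully connect the members of this cluster
    np.fill_diagonal(S, 0.0)        # assumption: no self-links
    return S

# One hypothetical base clustering result over five proteins.
clusters = [{0, 1, 2}, {2, 3}, {4}]
S = feature_matrix(clusters, n=5)
```

Each cluster becomes a fully connected subgraph, and singleton clusters contribute no edges.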
We use a feature matrix $W^{(q)}$ to represent the adjacency matrix of the q-th feature network. According to the definition of the feature network, each entry of $W^{(q)}$ denotes whether the corresponding pair of proteins has been clustered together. Thus $W^{(q)}$ is a block-diagonal matrix after some permutations (except the adjacency matrix of the original PPI network). As mentioned above, the goal is to extract useful information from these feature networks and generate an ensemble PPI network that is rich in information. Thus the task becomes the problem of combining these feature matrices into an ensemble matrix W which corresponds to an ensemble PPI network. To effectively utilize the information provided by these feature networks, we propose a novel weighted combination framework to generate the ensemble PPI network. We hope that the ensemble PPI network can approximate the intrinsic structure of the original PPI network, so we assume that the ensemble PPI network is a weighted combination of these feature networks. This assumption is equivalent to the following constraint:
$$W = \sum_{q} u_q W^{(q)}, \quad \text{subject to } \sum_{q} u_q = 1,\ u_q \ge 0. \quad (1)$$
Here W is the ensemble matrix corresponding to the ensemble PPI network, the sums run over all feature networks, and $U = (u_q)$ is the vector of weights. Therefore, the problem of generating an ensemble PPI network becomes the problem of learning the optimal linear combination of the feature matrices.
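A weighted combination of this kind can be sketched as follows, assuming the weights are nonnegative and sum to one. The function name and the three toy feature matrices (two base clusterings plus the original adjacency matrix `A`) are hypothetical.

```python
import numpy as np

def ensemble_matrix(feature_matrices, weights):
    """Weighted combination of feature matrices with weights on the simplex."""
    u = np.asarray(weights, dtype=float)
    if np.any(u < 0) or not np.isclose(u.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    stacked = np.stack([np.asarray(S, dtype=float) for S in feature_matrices])
    # Contract the weight vector against the first axis of the stack.
    return np.tensordot(u, stacked, axes=1)

S1 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
S2 = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]], dtype=float)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
W = ensemble_matrix([S1, S2, A], [0.5, 0.25, 0.25])
```

Pairs supported by every feature network receive the largest ensemble weight, while pairs supported by only one receive that network's weight.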
In order to prevent the weight vector U from overfitting to one feature network (that is, one of the feature networks is weighted at 1 and all other feature networks are weighted at 0), we introduce a regularization term R, which is the sum of the negative entropy of the weight of each feature matrix. It penalizes solutions that put maximal weight on a single feature network. We will later show how to automatically estimate the optimal weights.
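The effect of a negative-entropy penalty can be illustrated with a toy problem. The sketch below assumes a cost that is linear in the weights, for which the minimizer over the simplex is known in closed form (a softmax of the negated costs); it is not the updating rule derived in this paper, and the names `costs` and `lam` are hypothetical.

```python
import numpy as np

def negative_entropy(u):
    """Sum of u_q * log(u_q), with 0 * log(0) taken as 0."""
    u = np.asarray(u, dtype=float)
    nz = u[u > 0]
    return float(np.sum(nz * np.log(nz)))

def entropy_regularised_weights(costs, lam):
    """Minimizer of sum(costs * u) + lam * negative_entropy(u) on the simplex."""
    z = -np.asarray(costs, dtype=float) / lam
    z -= z.max()                     # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()

def regularised_cost(u, costs, lam):
    return float(np.dot(costs, u) + lam * negative_entropy(u))

costs = [1.0, 2.0, 3.0]              # hypothetical per-network costs
u_sharp = entropy_regularised_weights(costs, lam=0.01)    # nearly one-hot
u_flat = entropy_regularised_weights(costs, lam=1000.0)   # nearly uniform
u_mid = entropy_regularised_weights(costs, lam=1.0)
```

With a small tradeoff value almost all weight goes to the cheapest network, which is the degenerate solution the regularizer is meant to avoid; a larger value spreads the weight across networks.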
Detecting protein complexes from the ensemble PPI network via Bayesian NMF-based clustering algorithm
It is worth stressing that each entry $W_{ij}$ of the ensemble matrix represents the evidence provided by the base clustering results that proteins i and j belong to the same complexes. Therefore, in the ensemble PPI network, the stronger the interaction between two proteins, the more likely they are to perform the same biological functions. In other words, if two proteins interact strongly, they have high propensities for the same complexes. Based on these characteristics of the ensemble PPI network, we develop a Bayesian NMF-based clustering algorithm to detect protein complexes from this network, which can utilize the group information provided by the edges. In this section, we outline the main idea of this algorithm.
Assuming we have obtained the ensemble matrix W through the generation phase, the task is to detect protein complexes from the ensemble PPI network to which W corresponds. In other words, given a protein, we attempt to find the groups it belongs to. Since such group memberships are always unknown, we can only infer them from the observed network. Here, each entry $W_{ij}$ of W denotes the nonnegative count of interactions between proteins i and j. Suppose there are K complexes in the PPI network. For each protein i, following earlier work, we introduce a parameter $h_{ik}$ to indicate the strength of protein i's membership of complex $k$. A higher value of $h_{ik}$ means protein i is more likely to be in complex $k$. The important point is that protein i may have a high value of $h_{ik}$ for more than one complex, so our method allows proteins to belong to multiple complexes. Furthermore, not all of the complexes need to have proteins associated with them, hence K just represents the upper bound on the number of complexes.
Let $H = [h_{ik}]$ be the $N \times K$ protein-complex propensity matrix. According to the definition of $h_{ik}$, the value of $\sum_{k} h_{ik} h_{jk}$ represents the possibility that proteins i and j belong to the same complexes. As mentioned above, the value of $W_{ij}$ also represents the evidence that proteins i and j should be clustered together. Therefore, the pairwise interactions described in $W$ are affected by these unobserved nonnegative parameters, which can be described as $\hat{W} = HH^{T}$, where $\hat{W}_{ij} = \sum_{k=1}^{K} h_{ik} h_{jk}$. Each element $h_{ik} h_{jk}$ indicates the contribution of complex $k$ to $\hat{W}_{ij}$. Following earlier work, we assume that the likelihood of a single element $W_{ij}$ of the matrix is given by $p(W_{ij} \mid \hat{W}_{ij}) = \mathcal{P}(W_{ij}; \hat{W}_{ij})$, where $\mathcal{P}(\cdot\,; \hat{W}_{ij})$ is the Poisson probability density function with rate $\hat{W}_{ij}$.
In practice, given a network, the number of complexes is initially unknown. To address this problem, following earlier work, we place automatic relevance determination priors on the columns of H. The effect of these priors is to pick out the relevant columns of H that best account for the observed interactions.
Following the generative model described above, we write down the probability that an ensemble PPI network is generated:
$$p(W \mid H) = \prod_{i,j} \mathcal{P}(W_{ij}; \hat{W}_{ij}) \quad (2)$$
where W is the ensemble matrix and $\hat{W}_{ij}$ is the Poisson rate of entry $(i, j)$. Following earlier work, we assign independent Half-Normal priors on each column of H with zero mean and variance $\beta_k$:
$$p(H \mid \beta) = \prod_{i=1}^{N} \prod_{k=1}^{K} \mathcal{HN}(h_{ik} \mid 0, \beta_k) \quad (3)$$
where for $x \ge 0$, $\mathcal{HN}(x \mid 0, \beta_k) = \sqrt{2/(\pi \beta_k)}\, \exp\left(-x^{2}/(2\beta_k)\right)$, and for $x < 0$, $\mathcal{HN}(x \mid 0, \beta_k) = 0$. From Equation (3) we find that the elements of the $k$-th column of H are associated with a variance-like parameter $\beta_k$ (also known as the relevance weight), which controls the relevance of the corresponding complex in accounting for the observed interactions. When the value of $\beta_k$ is small, all the elements of the $k$-th column of H are close to zero, which means this column is irrelevant and can be removed from the factorization. Through this filter, we obtain a more parsimonious model which indicates the optimal number of clusters.
Similar to the model used in our previous works, the generative model introduced above will be sensitive to the choice of $\beta_k$. To alleviate this problem, under the assumption that the $\beta_k$ are independent, each relevance weight $\beta_k$ is given an inverse-Gamma prior, which is conjugate to the Half-Normal distribution. Therefore the joint distribution of $\beta = (\beta_1, \ldots, \beta_K)$ is:
$$p(\beta) = \prod_{k=1}^{K} \mathcal{IG}(\beta_k \mid a, b) = \prod_{k=1}^{K} \frac{b^{a}}{\Gamma(a)}\, \beta_k^{-(a+1)} \exp\left(-\frac{b}{\beta_k}\right) \quad (4)$$
where a and b are the (nonnegative) shape and scale hyperparameters respectively. We set a and b to be constant for all $k$. In this way, the model may not be very sensitive to the choice of a and b. Taking all these factors into consideration, we adopt a Bayesian network model to describe the generation process of an ensemble PPI network, and the resulting product is of the form:
$$p(W, H, \beta) = p(W \mid H)\, p(H \mid \beta)\, p(\beta) \quad (5)$$
For an ensemble PPI network, we estimate the values of H and $\beta$ by maximizing the joint probability in Equation (5). Substituting Equations (2), (3) and (4) into Equation (5), taking the negative logarithm and dropping constants, we obtain the objective function of the Bayesian NMF-based clustering algorithm:(6)where H is the protein-complex propensity matrix. A graphical model describing the dependence between all these parameters is illustrated in Figure 2. Since and , and the value of is between 0 and 1, we assume for simplicity. Therefore, the term will disappear.
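To make the structure of such an objective concrete, the sketch below evaluates a negative log joint built from the three ingredients named in the text: a Poisson likelihood with rate $HH^{T}$, Half-Normal priors on the columns of H with variances `beta`, and inverse-Gamma priors on `beta` with hyperparameters `a` and `b`. It is a hedged illustration under stated assumptions, not a transcription of objective function (6): the sum is taken over all ordered pairs (i, j), additive terms that do not depend on H or `beta` are dropped, and all names are hypothetical.

```python
import numpy as np

def neg_log_joint(W, H, beta, a, b, eps=1e-12):
    """Negative log joint of W, H and beta, up to additive constants."""
    W = np.asarray(W, dtype=float)
    H = np.asarray(H, dtype=float)
    beta = np.asarray(beta, dtype=float)
    N = H.shape[0]
    W_hat = H @ H.T + eps                       # Poisson rates (assumed H H^T)
    # Poisson term: rate - count * log(rate); the count-only term is dropped.
    poisson = np.sum(W_hat - W * np.log(W_hat))
    # Half-Normal prior on each column k of H with variance beta[k].
    half_normal = np.sum((H ** 2).sum(axis=0) / (2.0 * beta)
                         + 0.5 * N * np.log(beta))
    # Inverse-Gamma prior on each relevance weight.
    inv_gamma = np.sum(b / beta + (a + 1.0) * np.log(beta))
    return float(poisson + half_normal + inv_gamma)

W = np.ones((2, 2))
good_H = np.array([[1.0], [1.0]])               # reproduces W exactly
poor_H = np.array([[0.1], [0.1]])               # underestimates every entry
value_good = neg_log_joint(W, good_H, beta=[1.0], a=1.0, b=1.0)
value_poor = neg_log_joint(W, poor_H, beta=[1.0], a=1.0, b=1.0)
```

A column of H whose entries are all near zero contributes almost nothing to the Poisson term, so the prior terms can drive its relevance weight toward its minimum, which is how irrelevant complexes are pruned.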
Figure 2. A graphical model that describes the generation process of an ensemble PPI network with weighted adjacency matrix W in terms of the latent structure H, the components of which are generated using a half-normal distribution with zero mean and relevance weights $\beta$. The rectangles are used to group random variables that repeat. The number of repetitions is shown in the top right corner.
Protein complex detection via Bayesian NMF-based weighted Ensemble Clustering
Integrating the above two phases, we obtain a novel Bayesian NMF-based weighted Ensemble Clustering algorithm. The constructed ensemble PPI network is a weighted undirected network, and each element of its adjacency matrix W represents the probability of proteins i and j belonging to the same complex. The Bayesian NMF model assumes that the joint membership of two proteins in the same complex raises the probability of a link existing between them. Therefore, it can effectively identify protein complexes from the ensemble PPI network. Next, we first introduce the objective function of the Bayesian NMF-based weighted Ensemble Clustering algorithm. Then we discuss how to optimize this model and estimate the values of the model parameters. Finally, we use this model to detect protein complexes from PPI networks through the estimators of these model parameters.
Objective function of Bayesian NMF-based weighted Ensemble Clustering.
Adding the regularizer R to objective function (6) and substituting W using Equation (1), we obtain the objective of the Bayesian Nonnegative Matrix Factorization-based weighted Ensemble Clustering (EC-BNMF) algorithm:(7)
Here coefficient is the tradeoff parameter which controls the balance between objective function (6) and regularizer R.
Solution to Bayesian NMF-based weighted Ensemble Clustering.
Minimizing the objective in (7) under the constraints forms a constrained nonlinear optimization problem. To optimize it, we alternately update H, $\beta$ and U, as in earlier work. In this procedure, we first fix the value of U and optimize H and $\beta$. Then we fix H and $\beta$ and optimize U. We repeat this alternating update procedure until the solution converges. In the following we describe the details.
Given U, (7) degenerates to (6), so we minimize it with respect to H and $\beta$. Following earlier work, we adopt the multiplicative update rule to estimate H and $\beta$, which is widely accepted as a useful algorithm for solving nonnegative matrix factorization problems.
After updating H and $\beta$, we fix them and turn to the update of U with respect to (7). By solving the constrained optimization problem, we obtain the following updating rule for the weights:(10)
The updating rules (8) can maintain the nonnegativity of the parameters to be inferred. The elements of H will always be nonnegative during the iteration if we initialize H with nonnegative values. For the detailed derivation of the three updating rules, please refer to Text S1.
From protein-complex propensity matrix to protein complexes.
Inspecting updating rule (9), each relevance weight $\beta_k$ is bounded from below during each iteration, and it attains this bound when the $k$-th column of H is a zero vector, which means the $k$-th complex is pruned out of the model. After convergence, we take the number of complexes to be the number of columns that satisfy the following condition:(11)
Here, the threshold in condition (11) needs to be predefined. If a column of H does not satisfy this condition, it is regarded as an irrelevant complex and can be filtered out. As mentioned above, each column of H contains N nonnegative real values representing each protein's degree of participation in the corresponding complex. After filtering out the irrelevant columns of H, following earlier work, we obtain protein complexes from H by choosing a membership threshold and assigning a protein to a complex if its membership weight for that complex exceeds this threshold. In this way, we obtain the resultant protein-complex membership matrix $H^{*}$, where $H^{*}_{ik} = 1$ if $h_{ik}$ exceeds the membership threshold and $H^{*}_{ik} = 0$ otherwise. Here, $H^{*}_{ik} = 1$ means protein i is assigned to detected complex $k$. As in earlier work, we only consider the identified complexes that have at least three members, since complexes with two proteins are already present in the protein interaction data. After completing these steps, we obtain the optimal number of complexes, which is the number of columns of H* that contain at least three elements equal to 1.
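The membership thresholding step can be sketched as follows. This illustration covers only the conversion of a propensity matrix into a binary membership matrix and the minimum-size filter; the relevance filter of condition (11) is not included, and the names `tau`, `min_size` and the toy matrix are hypothetical.

```python
import numpy as np

def extract_complexes(H, tau, min_size=3):
    """Threshold a propensity matrix and keep complexes with enough members.

    Returns the binary membership matrix restricted to kept columns and the
    list of complexes as sets of protein indices.
    """
    H = np.asarray(H, dtype=float)
    membership = (H > tau).astype(int)             # 1 if weight exceeds tau
    keep = membership.sum(axis=0) >= min_size      # drop small complexes
    H_star = membership[:, keep]
    complexes = [set(np.flatnonzero(col).tolist()) for col in H_star.T]
    return H_star, complexes

# Hypothetical propensities of five proteins for three candidate complexes.
H = np.array([[0.9, 0.0, 0.2],
              [0.8, 0.1, 0.7],
              [0.7, 0.0, 0.6],
              [0.0, 0.9, 0.6],
              [0.0, 0.8, 0.0]])
H_star, complexes = extract_complexes(H, tau=0.5)
```

In this toy case proteins 1 and 2 are assigned to both surviving complexes, which shows how thresholding each column independently yields overlapping complexes.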
We summarize the overall algorithm in Figure 3. In this paper, we iteratively update H, $\beta$ and U according to the updating rules (8), (9) and (10) until they satisfy a stopping criterion. Let $\beta^{(t)}$ and $\beta^{(t-1)}$ be the vectors of relevance weights at the current and previous iterations respectively. The algorithm is stopped whenever the change between $\beta^{(t)}$ and $\beta^{(t-1)}$ falls below a user-defined tolerance parameter. For simplicity, we set the value of this tolerance parameter to be the same as the threshold in condition (11). Furthermore, we limit the calculation procedure to a maximum of 150 iterations for practical purposes. That is, we stop iterating when the tolerance criterion is met or the number of iterations reaches 150. Here, the typical value of 1E-6 is selected as the value of the tolerance parameter and the threshold. In order to avoid a local minimum, we repeat the algorithm 50 times with random initial conditions and choose the result that gives the lowest value of objective function (7).
Here, we also consider two special cases of our model. First, if we fix every weight to the same value, the ensemble matrix is the unweighted average of the feature matrices, and the weighted combination framework degenerates to the naive combination framework. In this case, each feature network is treated equally. Second, if the weight of the original PPI network is set to 1, the ensemble matrix W becomes A, the adjacency matrix of the original PPI network. In this case, our model is equivalent to applying the Bayesian NMF model to the original PPI network.
In this section, we evaluate the effectiveness of EC-BNMF in detecting protein complexes. Before presenting the results of our comparative experiments, we first describe the PPI networks and validation metrics that are used. Then we discuss the effect of parameters and the benefits of weighted ensemble clustering. Next, we investigate the performance improvement brought by integrating diverse clustering results. Finally, we compare EC-BNMF with other ensemble clustering algorithms and evaluate the overlapping protein complexes detected by EC-BNMF.
EC-BNMF is tested using four PPI networks from S. cerevisiae. The yeast S. cerevisiae is a highly effective model organism that presents an ideal opportunity to test the performance of a newly proposed algorithm, since a great many of its protein complexes are known. In this paper, we concentrate our analysis on the following four high-throughput derived PPI networks: a highly reliable database published by Collins et al., two experimental yeast PPI networks published by Gavin et al. and Krogan et al. respectively, and the entire set of physical interactions in yeast from BioGRID. Here we use Collins, Gavin, Krogan and BioGRID to refer to these four networks. For simplicity, we extract only the largest connected component from each of the four networks. The corresponding features of the four networks are listed in Table 1. As can be seen from this table, these four networks have different topological properties, so we use them as model datasets to test the overall performance of EC-BNMF.
Gold standard protein complexes
To measure the accuracy of the detected complexes, we choose two widely used benchmark complex reference sets as gold standards. One is downloaded from the MIPS database, and the other is derived from the Gene Ontology annotations of the Saccharomyces Genome Database. Following Brohée and Van Helden's study, we use the 220 filtered yeast protein complexes from the MIPS database as our first reference set, which we call MIPS complexes. In addition, since the complexes in the MIPS database do not cover all the proteins in the considered network, we also use a second, independent reference set, which we call SGD complexes. SGD complexes are generated from the SGD database following the procedure described by Nepusz et al.
The MIPS complexes were downloaded from http://rsat.bigre.ulb.ac.be/rsat/data/publisheddata/brohee_2-006_clustering_evaluation/index_tables.html/. The SGD annotations and GO structure were downloaded from the Gene Ontology database (http://www.geneontology.org/) on 24 April 2012. To avoid inconsistencies in the complex membership of the same protein across the two reference sets, we test the two reference sets separately. For both reference sets, to avoid selection bias, we filter out the proteins that are not contained in the network at hand. Furthermore, only complexes with at least 3 and no more than 100 members are considered. In Table 2 we summarize the statistics of these reference sets with respect to each PPI network.
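The filtering of a reference set described above can be sketched as follows. The sketch assumes that the size bounds are applied after proteins absent from the network have been removed; the function name `filter_reference` and the toy data are hypothetical.

```python
def filter_reference(complexes, network_proteins, min_size=3, max_size=100):
    """Restrict reference complexes to network proteins, then apply size bounds."""
    filtered = []
    for members in complexes:
        present = set(members) & network_proteins
        if min_size <= len(present) <= max_size:
            filtered.append(present)
    return filtered

# Toy data (hypothetical): proteins X, Y and Z are not in the network.
network_proteins = {"A", "B", "C", "D"}
reference = [{"A", "B", "C", "X"}, {"A", "D"}, {"B", "C", "D", "Y", "Z"}]
kept = filter_reference(reference, network_proteins)
```

Here the second complex is discarded because it has fewer than three members, while the other two are kept with their out-of-network proteins removed.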
We evaluate the performance of a protein complex identification algorithm by judging how well the predicted complexes correspond to the known complexes. In this study, three independent quantitative measures are used to assess the similarity between a set of predicted complexes and a set of reference complexes. The first is the f-measure, which is defined as the harmonic mean of Precision and Recall. The other two are the Jaccard and PR metrics proposed by Song and Singh. Among these three measures, the f-measure assesses the similarity between predicted complexes and reference complexes at the complex level: Recall measures what fraction of the reference complexes are matched by the predicted complexes, and Precision measures what fraction of the predicted complexes are matched by the reference complexes. The Jaccard and PR metrics, in contrast, measure how well the predicted complexes correspond to reference complexes at the complex-protein pair level, and thus take into account the number of proteins in each complex. The value of each measure varies between 0 and 1, and a higher value means a better overlap. For more details about these three scoring measures, please refer to Text S2. Together, these evaluation metrics indicate how well a protein complex identification algorithm detects protein complexes from PPI networks.
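As a small worked illustration of the f-measure, the sketch below computes it from matched counts. The rule that decides when a predicted complex matches a reference complex is given in Text S2 and is not reproduced here, so the counts in the example are hypothetical inputs.

```python
def f_measure(n_matched_pred, n_pred, n_matched_ref, n_ref):
    """Harmonic mean of Precision and Recall computed from matched counts."""
    precision = n_matched_pred / n_pred if n_pred else 0.0
    recall = n_matched_ref / n_ref if n_ref else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 30 of 60 predicted complexes match a reference complex
# (Precision 0.5), and 40 of 100 reference complexes are matched (Recall 0.4).
score = f_measure(30, 60, 40, 100)
```

With these counts the f-measure equals 2 x 0.5 x 0.4 / 0.9, which is 4/9.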
Choice of parameters
There are five parameters K, , a, b and that need to be predefined in our algorithm. K is the maximum number of complexes. Note that EC-BNMF can filter out irrelevant complexes, so the value of K can be taken sufficiently large. Here we empirically set for Collins, Gavin and Krogan, and for BioGRID. is the threshold used to obtain protein complexes from the protein-complex propensity matrix, and we find experimentally that always leads to reasonable results on the four networks. The shape hyperparameter a affects the optimization of the objective function (7) only through the updating rule (9), where its influence is moderated by the number of nodes N. Therefore, we choose a to be small compared to N. Experimental results also confirm that smaller values of a lead to better results. In this paper, we fix and vary the value of b to find the best result for each network. Another key parameter is , which controls the effect of the regularization term . The parameter controls the relative differences between feature networks. Setting forces all feature networks to be given equal weight, whereas setting discards the regularization term. From updating rule (10) we can see that the effect of depends on the value of . To facilitate the selection of , we set the value of to be in proportion to . That is, we set . To find a suitable value of , we just need to vary the value of and evaluate the corresponding performance. In summary, the key parameters that affect the performance of EC-BNMF are b and .
To fully understand how these two parameters affect the performance, we investigate how the performance changes as their values change. To this end, we vary the values of b and for each PPI network, and compare the corresponding experimental results in terms of Jaccard, PR and f-measure with respect to the two reference sets. For each PPI network, we try different combinations of values of () and b ().
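The role of the threshold can be illustrated with a minimal sketch. It assumes that a protein is assigned to a complex whenever its propensity is at least the threshold and that columns with no member are dropped; the function name, the matrix values and the threshold value 0.5 are hypothetical.

```python
def complexes_from_propensity(H, proteins, tau):
    """Threshold an N x K propensity matrix into possibly overlapping complexes."""
    n_complexes = len(H[0])
    complexes = []
    for k in range(n_complexes):
        members = {p for p, row in zip(proteins, H) if row[k] >= tau}
        if members:  # columns with no member are treated as irrelevant
            complexes.append(members)
    return complexes

# Toy propensity matrix (hypothetical): 3 proteins, K = 3 candidate complexes.
H = [[0.9, 0.0, 0.0],
     [0.8, 0.7, 0.0],
     [0.0, 0.6, 0.0]]
proteins = ["P1", "P2", "P3"]
detected = complexes_from_propensity(H, proteins, tau=0.5)
```

In this toy example protein P2 belongs to both detected complexes, and the third column is discarded, which mirrors how a sufficiently large K can be chosen without fixing the number of complexes in advance.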
For each network, the harmonic mean of six scores (Jaccard, PR and f-measure with respect to MIPS and SGD complexes) is used to measure the performance of EC-BNMF. Figure 4 shows the corresponding results with respect to various values of and b on the four PPI networks. As shown in Figure 4, for a fixed value of , as the value of b increases, the harmonic mean scores increase initially and decrease after reaching the maximum, and this is true for all four PPI networks. On the other hand, for a fixed value of b, as the value of increases, the harmonic mean scores increase initially and decrease after reaching the maximum, but the change is not very pronounced. This phenomenon is partly owing to the choice of prior information . With this prior information, the model is not very sensitive to small changes in . In fact, if is large enough, the weight assigned to each feature network is nearly equal. Thus the performance will not change much as increases. In our model, is used to adjust the weights, and it only affects the quality of the ensemble PPI network. Unless a considerable fraction of base clustering results are poor, a small change in cannot lead to a big change in performance. From Figure 4, we can see that EC-BNMF is relatively stable with respect to its parameters. In fact, the performance of EC-BNMF is very sensitive to the choice of . To reduce the sensitivity, we assign independent inverse-Gamma priors on each such that EC-BNMF can automatically estimate the optimal value of . Through the updating rule (9), EC-BNMF can adaptively adjust the value of . Thus, EC-BNMF is not very sensitive to the choice of b. Nevertheless, both b and contribute to improving the performance of EC-BNMF.
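The selection of a parameter setting by the harmonic mean of the six scores can be sketched as follows. The score values and parameter pairs are hypothetical and serve only to show that the harmonic mean penalizes a setting with one poor score.

```python
from statistics import harmonic_mean

def best_setting(results):
    """Return the parameter pair whose six scores have the largest harmonic mean."""
    return max(results, key=lambda params: harmonic_mean(results[params]))

# Hypothetical scores: keys are (regularization parameter, b) pairs, and values
# are the six scores (Jaccard, PR and f-measure for MIPS and for SGD).
results = {
    (1.0, 2.0): [0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
    (1.0, 4.0): [0.6, 0.6, 0.6, 0.6, 0.6, 0.1],
}
chosen = best_setting(results)
```

The second setting has the higher arithmetic mean, but its single low score pulls its harmonic mean below 0.5, so the first setting is chosen.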
Performance of EC-BNMF on protein complex detection with respect to different values of b and measured in terms of the harmonic mean score. The -axis denotes the value of , the -axis denotes the value of , and the -axis denotes the value of the harmonic mean of the three measure scores of both MIPS and SGD complexes. (A) Collins network. (B) Gavin network. (C) Krogan network. (D) BioGRID network.
We can see from Figure 4 that the optimal results are obtained when and for the Collins network, and for the Gavin network, and for the Krogan network, and and for the BioGRID network. In the following, unless otherwise stated, the complexes detected by EC-BNMF are obtained with these optimal parameter values for the four PPI networks.
In this section, we systematically evaluate the proposed model on the protein complex detection task. EC-BNMF strives to combine several partitions of a network into a more desirable clustering result. These partitions can be obtained from a single clustering algorithm with different initializations or from the application of different clustering algorithms to a network. In this paper, we focus on combining the results of different algorithms, since different algorithms may discover different patterns in a given network and increase the information available for ensemble clustering. Therefore, we choose ten state-of-the-art algorithms as base clustering algorithms: CFinder, CMC, ClusterONE, COPRA, DPClus, MCL, MCODE, MINE, RNSC, and SPICi. A brief description of these algorithms and their parameter settings is given in Text S3. The websites from which we downloaded the corresponding software are listed in Text S3 Table 1.
Weighted combination versus naive combination.
In this section, we investigate the benefits of performing weighted combination when constructing the ensemble PPI network. As a baseline for comparison, we test the performance of naive combination which is a special case of EC-BNMF, where . Here, we call it naive ensemble clustering (NEC).
We apply EC-BNMF and NEC on four PPI networks, and compare their performance. Through updating rules (8) and (9), we can find that the performance of NEC depends on the choice of parameter b. For each PPI network, the results of NEC are obtained over the best tuned parameters. Figure 5 shows the comparative performance of EC-BNMF and NEC on four PPI networks in terms of the three measures (Jaccard, PR and f-measure) according to MIPS and SGD complexes. From Figure 5, we can see that EC-BNMF leads to better performance on all the four PPI networks.
Comparison of the performance of EC-BNMF, naive ensemble clustering (NEC) and Bayesian NMF model (BNMF) on four PPI networks with respect to (A) MIPS gold standard and (B) SGD gold standard.
In real cases, the performance of each clustering algorithm is based on the topological characteristics of the network under consideration. Given a network and a collection of clustering results, some clustering results may perform well in recapitulating protein complexes while others may not. Therefore, constructing an ensemble PPI network by simply averaging is inadequate since the information provided by poor clustering results may be unreliable, and may affect the performance of ensemble clustering.
Ensemble PPI network versus original PPI network.
To demonstrate the benefits of using the ensemble PPI network, we consider the individual performance of applying the Bayesian NMF model (BNMF) to the original PPI network. That is, we assume which is the adjacency matrix of the original PPI network. It is worth noting that the Bayesian NMF model is also a popular clustering algorithm. Thus, it is of great interest to test the performance of the Bayesian NMF model on the original PPI networks. We apply BNMF and EC-BNMF to the four PPI networks and compare their performance. According to updating rules (8) and (9), the performance of BNMF depends on the choice of b. For each PPI network, to be fair, the results of BNMF are obtained with the best tuned parameters. The comparison of these two algorithms is displayed in Figure 5.
The results shown in Figure 5 verify the benefits of using ensemble PPI networks. It can be observed that EC-BNMF leads to better performance than the complexes inferred individually by applying BNMF to the original PPI network. These results once again show that the performance of a complex identification algorithm depends on its optimization criteria and the characteristics of the network under consideration. BNMF is based on the assumption that if there is an edge between two proteins, they may belong to the same complex. However, the data obtained from high-throughput methods are believed to be quite noisy, and many interactions may be false positives. Therefore, if the network does not satisfy this assumption, the complexes detected by BNMF may be less reliable. In EC-BNMF, we integrate the results of different clustering algorithms and generate an ensemble PPI network. Each edge in the ensemble PPI network represents the evidence (provided by base clustering results) that the corresponding proteins belong to the same complex. Thus the Bayesian NMF model can find more reliable complexes.
Furthermore, we can see from Figure 5 that BNMF has better performance than NEC on Krogan and BioGRID. In NEC, all base clustering results are treated equally, thus the information provided by unreliable clustering results may mislead the detection of complexes.
Quantitative comparison with base clustering algorithms.
To further evaluate the competitiveness of EC-BNMF in detecting protein complexes, we compare its results with those of the base clustering algorithms, since these algorithms are also among the most popular methods. For all the clustering results, we only consider clusters that have at least three elements. Table 3 presents the comparative performance of different clustering algorithms on the four PPI networks. The details of these algorithms are presented in Text S3. The results are obtained with the best tuned parameters for each algorithm. Notably, for BioGRID, CFinder cannot give a clustering result within 48 hours, so it does not take part in the generation phase for the BioGRID dataset, and the corresponding result is not listed in this table.
As shown in Table 3, for all four PPI networks, EC-BNMF performs competitively with the other methods in terms of the three measures (Jaccard, PR and f-measure) with respect to MIPS and SGD complexes. Furthermore, similar to the results shown in , , , none of the ten base clustering algorithms dominates the other methods on all networks according to the three measures. In particular, on Collins, MCODE consistently outperforms other methods, while RNSC also has a competitive performance. On Gavin, SPICi performs better than other methods with respect to SGD complexes, while CMC and RNSC have good performance with respect to MIPS complexes. On Krogan, CMC outperforms other methods with respect to MIPS complexes, and DPClus outperforms other methods with respect to SGD complexes. On BioGRID, SPICi and ClusterONE output higher quality clusters than others. These results illustrate that different approaches have complementary strengths. The effectiveness of EC-BNMF in detecting protein complexes is mainly due to its ability to capture information from multiple clustering results in a unified inference procedure. This is achieved by allocating proper weights to base clustering results and seeking the consistent dense regions among these results to output more accurate and reliable clustering results.
One may have noticed that for some PPI networks such as Collins, the evaluation scores obtained by EC-BNMF are close to those of some base clustering algorithms. This may be because the clustering results of the base clustering algorithms are very similar on this PPI network. EC-BNMF is an ensemble algorithm whose performance depends on the base clustering results. Therefore, if the base clustering results are close to each other, the performance improvement may not be very noticeable. However, stability is one of the advantages of EC-BNMF. Even though some methods can obtain performance similar to that of EC-BNMF with respect to a single measure or a single gold standard, they cannot perform well on all PPI networks. For example, compared with other base clustering algorithms, the complexes detected by MCODE are more accurate on Collins, whereas on the other three PPI networks, the complexes detected by MCODE are not very accurate. On Krogan, the complexes detected by CMC are more accurate than those of other base clustering algorithms with respect to MIPS complexes. When considering SGD complexes, the complexes detected by DPClus are more accurate. But on the other three PPI networks, the complexes detected by CMC and DPClus may not match the known complexes well. Furthermore, EC-BNMF allows overlaps between protein complexes. Although some base clustering algorithms perform well on some PPI networks (RNSC performs well on Gavin, and SPICi performs well on Gavin and BioGRID), they cannot discover overlapping complexes. Viewed in this light, we provide an alternative method to identify protein complexes, which can discover overlapping complexes and has stable performance on networks with different topological features.
To evaluate the overall performance of each algorithm, we integrate the measurement results of each algorithm on different PPI networks into a final score by weighted combination. The weight of each PPI network is the number of proteins in this network divided by the total number of proteins in all four PPI networks. Let , , and denote the weights of Collins, Gavin, Krogan, and BioGRID respectively. Then we can calculate their values: , , and . The measurement results of an algorithm on a PPI network can be viewed as a 6-dimensional vector (Jaccard, PR and f-measure with respect to MIPS and SGD). The final score of each algorithm is a weighted combination of its measurement results on the four PPI networks (e.g., the final score of MCL can be computed by . Here, , , , and denote the final score of MCL and the measurement results of MCL on Collins, Gavin, Krogan and BioGRID respectively). Note that CFinder is run on only three PPI networks (Collins, Gavin and Krogan), so its final score is calculated according to its performance on these three networks (here, the weights of these three networks are , and respectively). To be fair, the average performance of EC-BNMF on these three networks is also calculated. The final scores of different algorithms are listed in Table 4. From Table 4, we can see that the overall performance of EC-BNMF is competitive with that of the base clustering algorithms.
Table 3 also lists the values of weights inferred by EC-BNMF for each feature network. From Table 3, we can see that algorithms with better performance generally receive higher weights, while the poorer ones receive lower weights. For instance, MCODE has the best performance on Collins, and thus it obtains the highest weight, 0.1450. COPRA performs the worst on Collins, so it gets the lowest weight, 6.437e-6, which is close to zero. Therefore, EC-BNMF is able to effectively utilize the information contained in different clustering results. In addition, for all four PPI networks, we find that COPRA always has poor performance in detecting protein complexes, so the weights assigned to it are always close to zero. These results demonstrate that EC-BNMF is robust when combining different clustering results.
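The weighted final score can be sketched as follows. The sketch assumes that the weights are renormalized over the networks on which an algorithm was actually run, which is how the three-network case is handled here; the protein counts and score vectors are hypothetical and are shortened to two dimensions for readability.

```python
def final_score(scores_by_network, protein_counts):
    """Combine per-network score vectors, weighting each network by its size."""
    networks = list(scores_by_network)
    total = sum(protein_counts[name] for name in networks)
    dim = len(scores_by_network[networks[0]])
    combined = [0.0] * dim
    for name in networks:
        weight = protein_counts[name] / total  # fraction of all proteins
        for d in range(dim):
            combined[d] += weight * scores_by_network[name][d]
    return combined

# Hypothetical protein counts and score vectors for two networks.
protein_counts = {"NetA": 1000, "NetB": 3000}
scores = {"NetA": [0.4, 0.8], "NetB": [0.8, 0.4]}
overall = final_score(scores, protein_counts)
```

With weights 0.25 and 0.75, the combined vector is approximately (0.7, 0.5), so larger networks dominate the final score.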
Comparison with other ensemble clustering algorithms.
In this section, we compare EC-BNMF with the Ensemble NMF clustering algorithm, which was also developed for clustering PPI networks. For each PPI network, we use the default parameter settings in the software except for two parameters: the range for selecting the number of clusters in each factorization, and the maximum number of leaf nodes in the final soft hierarchy. For Collins, we set the range to be and . For Gavin, we set and . For Krogan, we set and . Since the true number of complexes for each network is unknown, and the authors did not clearly state how to determine the number of complexes in their paper, we use the leaf nodes in the final soft hierarchy as clusters. Furthermore, there is no prior information about , so we select three numbers for each network and test their performance. For Collins and Gavin, we set to be 80, 100 and 120. For Krogan, we set to be 100, 120 and 140. As mentioned in , the Ensemble NMF clustering algorithm is considerably more computationally complex than standard hierarchical clustering techniques. We do not list its results on BioGRID since it cannot give a clustering result within 48 hours. The comparison of these two algorithms is shown in Figure 6. As mentioned above, Asur et al. also developed an ensemble clustering algorithm for clustering PPI networks. We do not compare EC-BNMF with the ensemble clustering algorithm proposed by Asur et al. because too many parameters need to be predefined and how to determine their values is not clearly stated.
Performance of EC-BNMF in comparison with CMC, ClusterONE, SPICi and Ensemble NMF on four PPI networks in terms of PR, Jaccard and f-measure with respect to (A) MIPS gold standard and (B) SGD gold standard.
Given a PPI network and a collection of clustering results, once the ensemble PPI network has been obtained, the task becomes detecting protein complexes from this ensemble PPI network, which is a weighted undirected network. Besides the Bayesian NMF model, several methods, such as CMC, can deal with weighted networks. To illustrate the advantages of EC-BNMF, we design a heuristic comparison. We apply CMC, ClusterONE, and SPICi to the ensemble PPI network and evaluate their performance according to the evaluation criteria described above. All three algorithms are able to detect complexes from weighted PPI networks directly and output the results in a reasonable time. Other algorithms that can deal with weighted PPI networks are not considered since they cannot output the results in a reasonable time. Furthermore, CMC, ClusterONE and SPICi cannot automatically update the value of weights U, so we apply these three algorithms to the naive ensemble PPI network where . The results of these three algorithms are obtained with the best tuned parameters. The comparison is shown in Figure 6.
It can be observed from Figure 6 that EC-BNMF performs competitively with the other compared algorithms in terms of the three measures on the four PPI networks. Compared with its performance on the original PPI network in Table 3, SPICi shows a notable gain in accuracy on the ensemble PPI network, demonstrating that incorporating diverse feature networks (base clustering results) yields improved performance. EC-BNMF can construct an informative ensemble PPI network and effectively utilize the information contained in it, and thus it can output more accurate and reliable results. The poor performance of CMC and ClusterONE on the ensemble PPI network suggests that they may not be well suited to handling such data. As shown in Figure 6, the Ensemble NMF clustering algorithm cannot output competitive results. The Ensemble NMF clustering algorithm utilizes only a single algorithm to produce the base clustering results, and thus it can capture only one aspect of the data. Furthermore, with more appropriate parameters the performance of the Ensemble NMF clustering algorithm might improve, but we have no criteria for selecting them.
Detecting multi-functional proteins.
In fact, some proteins are believed to exhibit different functions while interacting with different partners. Therefore, an approach for protein complex detection should be able to accommodate proteins that are present in more than one complex. As we have mentioned above, EC-BNMF allows a protein to belong to more than one complex. To illustrate the effectiveness of EC-BNMF in detecting multi-functional proteins, we draw support from the functional annotations for the multi-clustered proteins detected by EC-BNMF, with respect to Gene Ontology (GO) database . Complete lists of the functional annotations for the multi-clustered proteins detected by EC-BNMF on four PPI networks can be found in Table S1. Furthermore, similar to , we test whether topological and functional features can distinguish multi-clustered proteins from mono-clustered proteins. The corresponding results are shown in Figure 7, Figure 8 and Figure 9.
For degree, the distributions of mono- and multi-clustered proteins are represented by boxplots (line = median). (A) Collins network. (B) Gavin network. (C) Krogan network. (D) BioGRID network.
For betweenness, the distributions of mono- and multi-clustered proteins are represented by boxplots (line = median). (A) Collins network. (B) Gavin network. (C) Krogan network. (D) BioGRID network.
Quantitative comparison of the number of Gene Ontology terms associated with mono- and multi-clustered proteins. The distributions of mono- and multi-clustered proteins are represented by boxplots (line = median). (A) Collins network. (B) Gavin network. (C) Krogan network. (D) BioGRID network.
From Figure 7 and Figure 8, we can see that multi-clustered proteins have, on average, a higher degree and a higher node betweenness, and this is true for all four PPI networks (Wilcoxon test, for Collins, both for degree and betweenness. For Gavin, both for degree and betweenness. For Krogan, both for degree and betweenness. For BioGRID, both for degree and betweenness). From Figure 9, we can observe that multi-clustered proteins are, on average, annotated to more GO terms than mono-clustered proteins, in terms of three ontologies (Biological Process, Cellular Component and Molecular Function. For Collins, for Cellular Component. For Gavin, for Cellular Component. For Krogan, for Cellular Component. For BioGRID, for Cellular Component). For a detailed analysis of the Wilcoxon test, please refer to Text S4. Based on the preceding discussion, we find that multi-clustered proteins are involved in a larger number of processes than mono-clustered proteins. Therefore, EC-BNMF is effective in detecting multi-functional proteins.
To highlight the advantages of EC-BNMF in detecting multi-functional proteins, we present an illustrative example of how three complexes with known overlaps are detected by CFinder, ClusterONE, MCODE and EC-BNMF in Figure 10 A, B, C and D. For more examples, please refer to Text S5. The clusters shown in Figure 10 are drawn from the clustering results of CFinder, ClusterONE, MCODE and EC-BNMF on Collins. The green circle nodes represent RNA polymerase I, the yellow rectangle nodes represent RNA polymerase II, the blue triangle nodes represent RNA polymerase III, and the light purple parallelogram nodes represent proteins with other functions. Shaded areas represent the clusters detected by the corresponding method. It can be observed from Figure 10 A, B and C that CFinder, ClusterONE and MCODE cannot correctly detect these three overlapping complexes. Each of them has its own advantages and limitations. In particular, they all cluster RNA polymerase I and III together.
The RNA polymerase I, II, III detected by (A) CFinder. (B) ClusterONE. (C) MCODE. (D) EC-BNMF on Collins. Proteins are labeled according to the complex they belong to: green circle nodes represent RNA polymerase I, yellow rectangle nodes represent RNA polymerase II, blue triangle nodes represent RNA polymerase III and light purple parallelogram nodes represent proteins with other functions. Proteins shared by all the three complexes are labeled with red hexagon, while proteins shared by RNA polymerase I and III are labeled with purple diamond. Shaded areas represent the clusters detected by the corresponding method. This figure is plotted with software Cytoscape .
As can be seen from Figure 10D, the clusters obtained by EC-BNMF correctly separate these three complexes. Furthermore, four proteins (YOR224C, YOR210W, YBR154C and YPR187W) common to RNA polymerase I, II and III are correctly classified. Two other proteins (YNL113W and YPR110C) that are shared by RNA polymerase I and III are also correctly classified. This example demonstrates that EC-BNMF can integrate diverse base clustering results into a more accurate and reliable result. One may have noticed that the cluster associated with RNA polymerase III contains the cluster associated with RNA polymerase I. The reason lies in the following two facts. First, these two complexes share more proteins and have dense interactions. Second, most of the base clustering results cluster these two complexes together. Thus in the ensemble PPI network, they tend to fall into the same cluster. Nevertheless, EC-BNMF can correctly identify the proteins belonging to RNA polymerase I and the proteins shared among these three complexes.
Discussion and Conclusion
The identification of protein complexes provides rich biological information for gaining insight into the working mechanisms of the cell and for revealing disease mechanisms. In recent years, numerous mathematical and computational algorithms have been proposed to tackle this problem. Most of these algorithms are designed to explore specific structures in the network. They are based on different optimization criteria and different assumptions about the inner structure of protein complexes. Therefore, a single algorithm can capture only one aspect of the PPI network. As Song and Singh mentioned in their study, no single algorithm performs best on all networks. For example, many researchers consider densely connected subgraphs as protein complexes. Under this assumption, protein complexes should be highly connected internally and sparsely connected with the rest of the network. However, this view cannot fully describe the characteristics of the PPI network since, besides densely connected substructures, complexes with sparsely connected substructures also exist (e.g., linear shapes). Furthermore, traditional protein complex identification algorithms that do not support overlap between complexes cannot reflect the biological reality. Besides, how to determine the number of complexes in a PPI network is still an open question.
To address these problems, in this study, an alternative method (EC-BNMF) is proposed to identify protein complexes. EC-BNMF is a novel weighted ensemble clustering algorithm which can integrate the clustering results of different protein complex identification algorithms and generate an accurate and reliable clustering result. Unlike conventional ensemble clustering algorithms that treat each base clustering result equally, EC-BNMF is a weighted ensemble clustering algorithm which can automatically estimate the optimal weights of different base clustering results. Therefore, base clustering results that obtain higher weights may be more reliable and can be regarded as important features. On the contrary, base clustering results with lower weights may be less reliable features and they may be far away from real cases. With these weights, we can do selections among features, and output more reliable results. Thus, stability is one of the advantages of EC-BNMF. Experimental results on four yeast PPI networks well verify the stability and effectiveness of EC-BNMF in detecting protein complexes. Further, EC-BNMF allows overlaps between protein complexes, which is closer to the reality.
In fact, as far as we known, two other ensemble clustering algorithms ,  have been developed to detect protein complexes from PPI networks. We do not compare EC-BNMF to Asur method  not only for their method need to pregiven the number of complexes which is always unknown, but also for there are many parameters need to be predefined which are not clearly mentioned in their paper and there is no public software available. Hence, we compare our model with Ensemble NMF . Furthermore, to demonstrate the effectiveness of EC-BNMF, we also design some heuristic comparisons. For instance, we apply Bayesian NMF model on the original PPI network and apply CMC, ClusterONE and SPICi on the ensemble PPI network. Anyhow, our analysis show that EC-BNMF can strengthen the quality of simple algorithms, and obtain more accurate results.
As an ensemble clustering algorithm, on the one hand, EC-BNMF has improved performance than individual clustering algorithms, and can alleviate the interference of unreliable clustering results. On the other hand, the performance of EC-BNMF depends on the base clustering results. If all these results are generated by random or computed by poor clustering algorithms, they may far away from real cases. In such a case, the performance of EC-BNMF may also be poor. To alleviate this problem, we also regard the original PPI network as a feature network. Therefore, if most of the base clustering results are unreliable, EC-BNMF can assign higher weight on the original PPI network and assign lower weights on these bad results. In this way, the performance of EC-BNMF is less dependent on the base clustering results. However, a key aspect of EC-BNMF is its ability of integrating multiple features of the PPI network and generating more reliable results. We are concerned with how to get an accurate and informative clustering, therefore, we choose some popular algorithms as base clustering algorithms since their results can effectively describe the network. In addition, the multi-functional proteins discovered by EC-BNMF are also relied on base clustering results. Based on these base clustering results, EC-BNMF can filter out unreliable multi-functional proteins and add in more reliable multi-functional proteins.
We now count the overall time cost of the updating process in Equations (8), (9) and (10). The time cost for updating H is , where N is the number of proteins, and K is the number of complexes. The time cost for updating is and the time cost for updating U is . Therefore the overall time cost of EC-BNMF is , where T is the number of iterations. Since the parameter H is sparse, the actual time cost is much smaller than . In addition, before performing our ensemble algorithm, we need to compute the base clustering results, which is time consuming. Nevertheless, this research is still meaningful for the following reasons. First, as mentioned in , in the context of understanding and exploiting the structure in PPI networks, cluster analysis is used as an “offline” process, where producing an accurate and reliable clustering is the primary goal. Second, when generating base clustering results with protein complex identification algorithms, we can use the software provided by their authors, which is written in C++ or Java. Third, with the rapid development of computer hardware, we are able to carry out large numbers of operations. Fourth, the generation of base clustering results can be parallelized: different clustering results can be generated simultaneously on modern multi-core processors, which reduces the running time.
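The parallel generation of base clustering results can be sketched as follows. This is only an illustration under stated assumptions: the two clustering functions are hypothetical stand-ins for the external programs, and `run_base_clusterings` is a name introduced here. A thread pool is used on the assumption that each real base algorithm would be launched as a separate external process; CPU-bound pure-Python algorithms would need a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor


def cluster_by_parity(edges):
    """Stand-in for an external tool: groups nodes by even/odd index."""
    nodes = sorted({n for e in edges for n in e})
    return [[n for n in nodes if n % 2 == 0], [n for n in nodes if n % 2 == 1]]


def cluster_by_edge(edges):
    """Stand-in for an external tool: treats every edge as a two-node cluster."""
    return [sorted(e) for e in edges]


def run_base_clusterings(edges, algorithms, max_workers=4):
    """Run every base algorithm on the same network concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all jobs first so they can run at the same time.
        futures = {name: pool.submit(fn, edges) for name, fn in algorithms.items()}
        # Collect the results; result() blocks until the job has finished.
        return {name: fut.result() for name, fut in futures.items()}


toy_edges = [(0, 1), (1, 2), (2, 3)]  # illustrative network
results = run_base_clusterings(
    toy_edges,
    {"parity": cluster_by_parity, "edge": cluster_by_edge},
)
```

Because the base algorithms are independent of one another, the wall-clock time of this step is bounded by the slowest algorithm rather than by the sum of all running times, provided enough cores are available.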
In this paper, we use the Poisson distribution, the Half-Normal distribution and the inverse-Gamma distribution to model the generation process of the ensemble PPI network. Other distributions, such as the Bernoulli, binomial and exponential distributions, could also be tried. As an ensemble clustering algorithm, our model is flexible. It would be of great interest to use this model for other clustering-based tasks, such as exploring modules in gene regulatory networks and cell signaling networks.
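To make the roles of the three distributions concrete, the following sketch samples a small network from one plausible arrangement of them. The arrangement is an assumption made for illustration and is not taken from the model derivation: each column of H gets a variance parameter drawn from an inverse-Gamma distribution, the entries of H are Half-Normal with that variance, and each edge weight is Poisson with rate given by the inner product of the two proteins' rows of H. The ensemble weights are omitted, and the hyperparameters `a` and `b` and all function names are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson sample with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_ensemble_network(n, k, a, b, seed=0):
    """Sample (betas, H, X) for n proteins and k complexes under the assumed model."""
    rng = random.Random(seed)
    # beta_c ~ inverse-Gamma(a, b): reciprocal of a Gamma(shape=a, scale=1/b) draw.
    betas = [1.0 / rng.gammavariate(a, 1.0 / b) for _ in range(k)]
    # h_ic ~ Half-Normal with variance parameter beta_c, i.e. |N(0, beta_c)|.
    h = [[abs(rng.gauss(0.0, math.sqrt(betas[c]))) for c in range(k)]
         for _ in range(n)]
    # x_ij ~ Poisson(sum_c h_ic * h_jc), drawn once per unordered pair so X is symmetric.
    x = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            rate = sum(h[i][c] * h[j][c] for c in range(k))
            x[i][j] = x[j][i] = sample_poisson(rate, rng)
    return betas, h, x


betas, h, x = sample_ensemble_network(n=5, k=2, a=3.0, b=1.0)
```

Replacing the Poisson draw with, for example, a Bernoulli draw would correspond to one of the alternative distributions mentioned above; the rest of the sketch would stay the same.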
Complete lists of the functional annotations for the multi-clustered proteins detected by EC-BNMF on the four PPI networks. For each protein, we present the GO terms associated with it, in terms of three ontologies (Biological Process, Cellular Component and Molecular Function).
Detailed inference of the solution to Bayesian NMF-based weighted Ensemble Clustering.
The three metrics used for evaluating the predicted protein complexes.
Brief description and detailed parameter settings of the base clustering algorithms.
Comparison of the number of Gene Ontology annotations between mono-clustered and multi-clustered proteins.
We would like to thank the associate editor and the reviewers for their insightful comments.
Conceived and designed the experiments: LOY DQD XFZ. Analyzed the data: LOY DQD XFZ. Wrote the paper: LOY DQD XFZ.
- 1. Qi Y, Balem F, Faloutsos C, Klein-Seetharaman J, Bar-Joseph Z (2008) Protein complex identification by supervised graph local clustering. Bioinformatics 24: i250–i268.
- 2. Li X, Wu M, Kwoh C, Ng S (2010) Computational approaches for detecting protein complexes from protein interaction networks: a survey. BMC Genomics 11: S3.
- 3. Schwikowski B, Uetz P, Fields S (2000) A network of protein-protein interactions in yeast. Nat Biotechnol 18: 1257–1261.
- 4. Zhang XF, Dai DQ (2012) A framework for incorporating functional interrelationships into protein function prediction algorithms. IEEE/ACM Trans Comput Biol Bioinform 9: 740–753.
- 5. Vanunu O, Magger O, Ruppin E, Shlomi T, Sharan R (2010) Associating genes and protein complexes with disease via network propagation. PLoS Comput Biol 6: e1000641.
- 6. Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, et al. (1999) A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 17: 1030–1032.
- 7. Tarassov K, Messier V, Landry C, Radinovic S, Molina M, et al. (2008) An in vivo map of the yeast protein interactome. Science 320: 1465.
- 8. Ji J, Zhang A, Liu C, Quan X, Liu Z (2012) Survey: Functional module detection from protein-protein interaction networks. IEEE Trans Knowl Data Eng PP: 1.
- 9. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 98: 4569–4574.
- 10. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, et al. (2002) Systematic identification of protein complexes in saccharomyces cerevisiae by mass spectrometry. Nature 415: 180–183.
- 11. Tong A, Drees B, Nardelli G, Bader G, Brannetti B, et al. (2002) A combined experimental and computational strategy to define protein interaction networks for peptide recognition modules. Science 295: 321.
- 12. Bader GD, Hogue CW (2003) An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2.
- 13. Barabási A, Oltvai Z (2004) Network biology: understanding the cell's functional organization. Nat Rev Genet 5: 101–113.
- 14. Brohee S, Van Helden J (2006) Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinformatics 7: 488.
- 15. Wang J, Li M, Deng Y, Pan Y (2010) Recent advances in clustering methods for protein interaction networks. BMC Genomics 11(Suppl 3): S10.
- 16. Song J, Singh M (2009) How and when should interactome-derived clusters be used to predict functional modules and protein function? Bioinformatics 25: 3143–3150.
- 17. Adamcsek B, Palla G, Farkas I, Derényi I, Vicsek T (2006) Cfinder: locating cliques and overlapping modules in biological networks. Bioinformatics 22: 1021–1023.
- 18. Enright AJ, Dongen SV, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30: 1575–1584.
- 19. King A, Pržulj N, Jurisica I (2004) Protein complex prediction via cost-based clustering. Bioinformatics 20: 3013–3020.
- 20. Girvan M, Newman M (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99: 7821–7826.
- 21. Ravasz E, Somera A, Mongru D, Oltvai Z, Barabási A (2002) Hierarchical organization of modularity in metabolic networks. Science 297: 1551–1555.
- 22. Bader G, Hogue C (2002) Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotechnol 20: 991–997.
- 23. Cho Y, Hwang W, Ramanathan M, Zhang A (2007) Semantic integration to identify overlapping functional modules in protein interaction networks. BMC Bioinformatics 8: 265.
- 24. Ahn Y, Bagrow J, Lehmann S (2010) Link communities reveal multiscale complexity in networks. Nature 466: 761–764.
- 25. Strehl A, Ghosh J (2003) Cluster ensembles-a knowledge reuse framework for combining multiple partitions. J Mach Learn Res 3: 583–617.
- 26. Topchy A, Jain A, Punch W (2005) Clustering ensembles: Models of consensus and weak partitions. IEEE Trans Pattern Anal Mach Intell 27: 1866–1881.
- 27. Li M, Ng M, Cheung Y, Huang J (2008) Agglomerative fuzzy k-means clustering algorithm with selection of number of clusters. IEEE Trans Knowl Data Eng 20: 1519–1534.
- 28. Geng B, Tao D, Xu C, Yang L, Hua X (2012) Ensemble manifold regularization. IEEE Trans Pattern Anal Mach Intell 34: 1227–1233.
- 29. Lee D, Seung H (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401: 788–791.
- 30. Kim H, Park H (2007) Sparse non-negative matrix factorizations via alternating non-negativity-constrained least squares for microarray data analysis. Bioinformatics 23: 1495–1502.
- 31. Tan VYF, Févotte C (2009) Automatic relevance determination in nonnegative matrix factorization. In: Gribonval R, editor, SPARS'09-Signal Processing with Adaptive Sparse Structured Representations. Saint Malo, Royaume-Uni: Inria Rennes-Bretagne Atlantique. Available: http://hal.inria.fr/inria-00369376.
- 32. MacKay D (1995) Probable networks and plausible predictions-a review of practical bayesian methods for supervised neural networks. Netw-Comput Neural Syst 6: 469–505.
- 33. Psorakis I, Roberts S, Sheldon B (2010) Soft partitioning in networks via bayesian non-negative matrix factorization. NIPS.
- 34. Asur S, Ucar D, Parthasarathy S (2007) An ensemble framework for clustering protein-protein interaction networks. Bioinformatics 23: i29–i40.
- 35. Greene D, Cagney G, Krogan N, Cunningham P (2008) Ensemble non-negative matrix factorization methods for clustering protein-protein interactions. Bioinformatics 24: 1722–1728.
- 36. Lancichinetti A, Fortunato S (2012) Consensus clustering in complex networks. Scientific Reports 2.
- 37. Tan V, Févotte C (2012) Automatic relevance determination in nonnegative matrix factorization with the beta-divergence. IEEE Trans Pattern Anal Mach Intell PP: 1.
- 38. Zhang XF, Dai DQ, Li XX (2012) Protein complexes discovery based on protein-protein interaction data via a regularized sparse generative network model. IEEE/ACM Trans Comput Biol Bioinform 9: 857–870.
- 39. Zhang XF, Dai DQ, Ou-Yang L, Wu MY (2012) Exploring overlapping functional units with various structure in protein interaction networks. PLoS One 7: e43092.
- 40. Seung D, Lee L (2001) Algorithms for non-negative matrix factorization. Adv Neural Inf Process Syst 13: 556–562.
- 41. Nepusz T, Yu H, Paccanaro A (2012) Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods 9: 471–472.
- 42. Collins S, Kemmeren P, Zhao X, Greenblatt J, Spencer F, et al. (2007) Toward a comprehensive atlas of the physical interactome of saccharomyces cerevisiae. Mol Cell Proteomics 6: 439–450.
- 43. Gavin AC, Aloy P, Grandi P, Krause R, Boesche M, et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature 440: 631–636.
- 44. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, et al. (2006) Global landscape of protein complexes in the yeast saccharomyces cerevisiae. Nature 440: 637–643.
- 45. Stark C, Breitkreutz B, Reguly T, Boucher L, Breitkreutz A, et al. (2006) Biogrid: a general repository for interaction datasets. Nucleic Acids Res 34: D535–D539.
- 46. Stark C, Breitkreutz BJ, Chatr-aryamontri A, Boucher L, Oughtred R, et al. (2011) The biogrid interaction database: 2011 update. Nucleic Acids Res 39: D698–D704.
- 47. Mewes HW, Amid C, Arnold R, Frishman D, Güldener U, et al. (2004) Mips: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 32: D41–D44.
- 48. Cherry JM, Adler C, Ball C, Chervitz SA, Dwight SS, et al. (1998) Sgd: Saccharomyces genome database. Nucleic Acids Res 26: 73–79.
- 49. Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, et al. (2008) Gene ontology annotations at sgd: new data sources and annotation methods. Nucleic Acids Res 36: D577–D581.
- 50. Ashburner M, Ball C, Blake J, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. Nat Genet 25: 25–29.
- 51. Liu G, Wong L, Chua H (2009) Complex discovery from weighted ppi networks. Bioinformatics 25: 1891–1897.
- 52. Gregory S (2010) Finding overlapping communities in networks by label propagation. New J Phys 12: 103018.
- 53. Altaf-Ul-Amin M, Shinbo Y, Mihara K, Kurokawa K, Kanaya S (2006) Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinformatics 7: 207.
- 54. Rhrissorrakrai K, Gunsalus KC (2011) Mine: module identification in networks. BMC Bioinformatics 12: 192.
- 55. Jiang P, Singh M (2010) Spici: a fast clustering algorithm for large biological networks. Bioinformatics 26: 1105–1111.
- 56. Becker E, Robisson B, Chapple C, Guénoche A, Brun C (2012) Multifunctional proteins revealed by overlapping clustering in protein interaction network. Bioinformatics 28: 84–90.
- 57. Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, et al. (2007) Integration of biological networks and gene expression data using cytoscape. Nat Protocols 2: 2366–2382.