Identifying protein complexes is an important and challenging task in proteomics, and it contributes greatly to our knowledge of the molecular mechanisms underlying cellular activities. However, the inherent organization and dynamic characteristics of cellular systems have rarely been incorporated into existing algorithms for detecting protein complexes, because of the limitations of protein-protein interaction (PPI) data produced by high-throughput techniques. The availability of time-course gene expression profiles enables us to uncover the dynamics of molecular networks and improve the detection of protein complexes. To this end, this paper proposes a novel algorithm, DCA (Dynamic Core-Attachment). It detects protein-complex cores comprising continually expressed and highly connected proteins in a dynamic PPI network, and then forms each protein complex by adding attachments with high adhesion to the core. The integration of the core-attachment feature into the dynamic PPI network is responsible for the superiority of our algorithm. DCA has been applied to two different yeast dynamic PPI networks, and the experimental results show that it performs significantly better than state-of-the-art techniques in terms of prediction accuracy, hF-measure and biological statistical significance. In addition, the identified complexes with strong biological significance provide potential candidate complexes for biologists to validate.
Citation: Shen X, Yi L, Jiang X, He T, Yang J, Xie W, et al. (2017) Identifying protein complex by integrating characteristic of core-attachment into dynamic PPI network. PLoS ONE 12(10): e0186134. https://doi.org/10.1371/journal.pone.0186134
Editor: Shekhar C. Mande, National Centre For Cell Science, INDIA
Received: April 18, 2017; Accepted: September 26, 2017; Published: October 18, 2017
Copyright: © 2017 Shen et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper.
Funding: This research is supported by the National Natural Science Foundation of China (No. 61532008), the Self-determined Research Funds of CCNU from the Colleges’ Basic Research and Operation of MOE (No. CCNU14A02008, CCNU17TS0003, CCNU16JYKX018), the International Cooperation Project of Hubei Province (No. 2014BHE0017), and the National Key Research and Development Program of China (SQ2017YFSF090207). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Cellular functions are carried out by protein complexes, formed by multiple proteins aggregating together, rather than by individual proteins. Identifying protein complexes has significant implications for revealing the principles of protein organization within the cell [1, 2]. Protein complexes can help us predict protein functions. Accumulated evidence suggests that protein complexes are involved in many disease mechanisms. Tracking protein complexes could reveal important insights into modular mechanisms and improve our understanding of disease pathways.
In proteomics, large-scale protein-protein interaction (PPI) data have been produced by high-throughput techniques such as yeast two-hybrid (Y2H) and affinity purification. Typically, PPI data are abstracted as a complex network in which proteins are nodes and interactions are edges. Such networks have a modular structure, which has prompted the emergence of many computational approaches for detecting protein complexes.
Most current methods are based solely on network clustering [8–10] or integrate multiple biological data sources [11–16]. For example, Palla et al. proposed the CPM (Clique Percolation Method) algorithm, which detects overlapping dense groups of nodes as protein complexes by continuously merging adjacent k-cliques (complete subgraphs containing k vertices) in PPI networks. Review articles [1, 2, 18] survey the contributions in this area, which are significant for revealing the principles of protein organization within cells.
Protein complexes consist of highly connected proteins, but they are more than that. Previous work indicates that protein complexes have a core-attachment structure, which has given rise to many complex identification algorithms based on this principle, for instance COACH, CoreAttach and PCD-GED. However, these approaches often neglect the inherent temporal nature of cellular activities. Cellular systems are highly dynamic and responsive to stimuli from the external environment. Han JD et al. demonstrated dynamically organized modularity in the yeast PPI network. It is therefore important to move from the analysis of static PPI networks to the analysis of dynamic networks.
In this paper, we propose a new algorithm, called DCA (Dynamic Core-Attachment), to identify protein complexes by integrating their inherent organization into a dynamic PPI network. Protein-complex cores are formed by continually expressed and highly connected proteins. We subsequently generate protein complexes by appending attachments to the protein-complex cores. The integration of the core-attachment feature into the time-evolving PPI network is responsible for the superiority of our algorithm. Experimental results on two PPI data sets of Saccharomyces cerevisiae show that DCA outperforms existing computational methods in terms of prediction accuracy, hF-measure and biological statistical significance.
Materials and methods
To capture the dynamics of protein complexes, time-course gene expression data are integrated into the original static PPI network to generate a dynamic PPI network using the three-sigma method. In brief, the method has two steps. First, a gene is considered active at a time point only if its expression value exceeds a threshold calculated according to the three-sigma principle. Second, the active proteins at that time point and their connections in the static network constitute a sub-network. The resulting time series of sub-networks forms the dynamic network. Please refer to the literature for more detail.
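The two steps above can be illustrated with a minimal Python sketch. It is not the construction used in the paper: the exact three-sigma threshold is deferred to the literature, so the sketch assumes a simple per-gene threshold of mean plus `k` standard deviations, and the function names, the parameter `k` and the toy data are all hypothetical.

```python
from statistics import mean, pstdev


def active_time_points(profile, k=3.0):
    """Return the time points at which a gene counts as active.

    ASSUMPTION: threshold = mean + k * standard deviation of the gene's own
    profile. The paper's exact three-sigma formula is not reproduced here.
    """
    threshold = mean(profile) + k * pstdev(profile)
    return {t for t, value in enumerate(profile) if value > threshold}


def build_dynamic_network(static_edges, expression, k=3.0):
    """Build one sub-network (a set of edges) per time point.

    static_edges: iterable of (protein, protein) pairs from the static PPI network.
    expression:   dict mapping protein -> list of expression values over time.
    An edge is kept at time t only if both of its proteins are active at t.
    """
    n_points = len(next(iter(expression.values())))
    active = {p: active_time_points(profile, k) for p, profile in expression.items()}
    snapshots = []
    for t in range(n_points):
        snapshots.append({
            (u, v) for u, v in static_edges
            if t in active.get(u, ()) and t in active.get(v, ())
        })
    return snapshots


# Toy data (hypothetical): three proteins, four time points.
toy_expression = {"A": [1, 1, 9, 9], "B": [1, 9, 9, 1], "C": [5, 5, 5, 5]}
toy_edges = [("A", "B"), ("B", "C")]
toy_snapshots = build_dynamic_network(toy_edges, toy_expression, k=0.5)
```

In the toy data, A and B are both above their thresholds only at the third time point, so the edge A-B appears in that snapshot alone, while the constantly expressed C never exceeds its own mean and is never active.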
Our DCA algorithm operates in four phases on the dynamic network. DCA first identifies protein-complex cores and then applies an outward-growing strategy to produce protein complexes by adding attachments to the cores. We first briefly introduce some basic terminology and then describe the proposed method for protein complex detection in detail.
Proteins in a complex core play a central role: they are highly connected, share functions of the same classification, and are relatively stable, meaning that they remain active for a relatively long time. Under this assumption, their edges have higher edge clustering coefficients (ECC, Eq (1)), and the stability of a protein's activity (AT) is defined as the time span between the starting and ending time points of its active state. For example, if a protein's activity starts at time point 6 and it becomes inactive at time point 9, its active time span (AT) is 3. To capture these biological properties of protein complexes, we weight the PPI network by combining ECC and AT, as shown in Eq (2).

$$ECC_{ij} = \frac{Z_{ij}}{\min(k_i - 1,\; k_j - 1)} \tag{1}$$

In Eq (1), $Z_{ij}$ is the number of common neighbors of the two interacting proteins i and j, $k_i$ and $k_j$ are the degrees of i and j, and $\min(k_i - 1, k_j - 1)$ is the theoretical maximum number of triangles containing both nodes. ECC ranges from 0 to 1, and a greater value indicates a closer relationship between the nodes and their neighbors. In Eq (2), $N_v$ is the set of neighbors of node v, and AT ranges from 0 to 1 after normalization. The parameter α controls the relative contribution of ECC and AT. The two measures are complementary and consistent with each other. First, because of false negatives in protein interaction data, some interactions within a protein-complex core receive a lower ECC, so it is reasonable to raise their weight through a greater AT. Conversely, some interactions outside the protein-complex core receive a higher weight because of false-positive interactions, so it is reasonable to lower their weight through a lower AT. Second, the greater the value of either ECC or AT, the greater the likelihood that the proteins participate in central biological functions of the protein-complex core.
The attachments of a protein complex participate in different protein complexes and perform a variety of functions in a supporting role. Nevertheless, as part of a whole protein complex, they still have a relatively close relationship with the complex core. We define this relationship as adhesion, as shown in Eq (3).
Here, s is a neighbor of Core. Adhs_Core describes the closeness between a protein-complex core and one of its neighbors, so we use it to measure the likelihood that a protein should be included in a core as an attachment.
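As an illustration of these quantities, the following Python sketch computes the edge clustering coefficient as described for Eq (1) (common neighbors divided by min(k_i − 1, k_j − 1)) and the active time span from the worked example. The function `node_weight` is only a hypothetical stand-in for Eq (2): it assumes a convex combination of the mean ECC over a node's edges and its normalized AT, which may differ from the paper's actual formula. The network is assumed to be given as a symmetric adjacency dictionary.

```python
def edge_clustering_coefficient(adj, i, j):
    """ECC of edge (i, j): common neighbours over min(k_i - 1, k_j - 1)."""
    denom = min(len(adj[i]) - 1, len(adj[j]) - 1)
    if denom <= 0:
        # ASSUMPTION: an edge that cannot lie in any triangle gets ECC 0.
        return 0.0
    return len(adj[i] & adj[j]) / denom


def active_time_span(start, end):
    """AT: span between the start and end time points of the active state."""
    return end - start


def node_weight(adj, v, at_norm, alpha=0.6):
    """HYPOTHETICAL stand-in for Eq (2), not the paper's formula:
    alpha * (mean ECC over the edges from v to N_v) + (1 - alpha) * normalised AT.
    """
    if not adj[v]:
        return (1 - alpha) * at_norm[v]
    mean_ecc = sum(edge_clustering_coefficient(adj, v, u) for u in adj[v]) / len(adj[v])
    return alpha * mean_ecc + (1 - alpha) * at_norm[v]


# Toy network (hypothetical): a triangle A-B-C with a pendant node D on C.
toy_adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
ecc_ab = edge_clustering_coefficient(toy_adj, "A", "B")  # edge inside the triangle
ecc_cd = edge_clustering_coefficient(toy_adj, "C", "D")  # pendant edge
span = active_time_span(6, 9)                            # the example from the text
```

In the toy network the triangle edge A-B reaches the maximum ECC, while the pendant edge C-D cannot close any triangle and receives 0, which is the contrast between core and peripheral edges that the weighting relies on.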
As shown in Fig 1, for each snapshot of the dynamic PPI network, the DCA algorithm first calculates the weight of each protein node according to its stability of activity and edge clustering coefficient (lines 3–6). Second, each node with weight greater than β is consolidated with its neighbors to form a protein-complex core (lines 7–11). Third, for each protein-complex core, we select reliable attachments that cooperate with it to form a protein complex (lines 12–19). Because gene expression data are periodic, the identified protein complexes contain a large number of near-duplicates, so the last step is a redundancy-filtering procedure (lines 20–23). The computational complexity of the DCA algorithm is O(N^2) for given parameters α, β and γ, where N is the number of nodes in the network.
Our DCA algorithm operates on a time-evolving PPI network and takes its temporal features into account. In addition, the predicted protein complexes may overlap with each other, since attachments typically participate in multiple protein complexes to carry out specific biological functions.
In order to verify the validity of the proposed DCA algorithm, we use two yeast PPI networks: the DIP (version 20101010) and Krogan_extended data sets. After filtering out a small number of proteins that have no gene expression profile, the former contains 24278 interactions among 4969 proteins, while the latter includes 12399 interactions among 3153 proteins. Gene expression data over three successive metabolic cycles are available from GEO (Gene Expression Omnibus) under accession number GSE3431. This data set includes the expression profiles of 9335 probes at 36 time points. It is integrated with the static PPI network to construct the dynamic PPI network. The set of known protein complexes is derived from CYC2008 and contains 349 complexes after removing those not covered by the PPI network. CYC2008 is widely used as a reference set of protein complexes for evaluating protein complex prediction and allows precise standardized functional descriptions of genes.
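The overall control flow of the four phases can be outlined in Python as follows. This is a rough sketch under several stated assumptions rather than the algorithm of Fig 1: node weights are taken as precomputed input, the core rule (seed plus its neighbors with weight above β), the adhesion measure (fraction of core members adjacent to a candidate) and the redundancy filter (an overlap-score cut-off) are all hypothetical choices, and the thresholds `adhesion_min` and `overlap_max` are illustrative names that are not meant to correspond to the paper's parameter γ.

```python
def overlap(a, b):
    """Overlap score |a & b|^2 / (|a| * |b|), used here to spot near-duplicates."""
    return len(a & b) ** 2 / (len(a) * len(b))


def dca_sketch(snapshots, weights, beta, adhesion_min, overlap_max):
    """Outline of core detection, attachment selection and redundancy filtering.

    snapshots: list of symmetric adjacency dicts (node -> set of neighbours),
               one per time point of the dynamic network.
    weights:   node -> weight (phase 1 is assumed to have been done already).
    adhesion_min, overlap_max: HYPOTHETICAL thresholds, not the paper's parameters.
    """
    complexes = []
    for adj in snapshots:
        for seed in adj:
            if weights.get(seed, 0.0) <= beta:
                continue
            # Phase 2 (assumed rule): core = seed plus its high-weight neighbours.
            core = {seed} | {u for u in adj[seed] if weights.get(u, 0.0) > beta}
            if len(core) < 2:
                continue
            # Phase 3 (assumed adhesion): fraction of core members adjacent to s.
            border = set().union(*(adj[c] for c in core)) - core
            attachments = {
                s for s in border
                if len(adj[s] & core) / len(core) >= adhesion_min
            }
            complexes.append(frozenset(core | attachments))
    # Phase 4 (assumed filter): visit larger complexes first, drop near-duplicates.
    kept = []
    for c in sorted(set(complexes), key=lambda c: (-len(c), sorted(c))):
        if all(overlap(c, k) < overlap_max for k in kept):
            kept.append(c)
    return kept


# Toy snapshot (hypothetical): A and B are high-weight and share neighbours C, D.
toy_snapshot = {
    "A": {"B", "C", "D"}, "B": {"A", "C", "D"},
    "C": {"A", "B"}, "D": {"A", "B", "E"}, "E": {"D"},
}
toy_weights = {"A": 0.9, "B": 0.8, "C": 0.3, "D": 0.4, "E": 0.1}
toy_complexes = dca_sketch([toy_snapshot], toy_weights,
                           beta=0.55, adhesion_min=0.5, overlap_max=0.8)
```

Because every seed may grow its own complex and the same proteins recur across snapshots, the same or nearly the same complex is produced many times; the final loop is where such repeats are collapsed, which mirrors the role of the redundancy-filtering step.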
Metrics for evaluating identified protein complexes
Three evaluation metrics, namely F-measure, GO enrichment analysis and hF-measure, are used in this paper to test the performance of the DCA algorithm.
The overlapping score OS(pc, kc) between a predicted complex pc and a known complex kc is defined in Eq (4):

$$OS(pc, kc) = \frac{|pc \cap kc|^2}{|pc| \times |kc|} \tag{4}$$

where $|pc \cap kc|$ is the number of proteins involved in both complexes pc and kc, and |pc| and |kc| are the numbers of proteins in complex pc and complex kc, respectively. Two protein complexes are considered matched if their overlapping score is greater than or equal to a given threshold, which is set to 0.2, as in many other studies. In particular, OS(pc, kc) = 1 indicates that the two complexes pc and kc match perfectly.
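The matching rule can be expressed in a few lines of Python. The sketch assumes that the overlapping score takes the form commonly used in this literature, the squared size of the intersection divided by the product of the two complex sizes; the function names and the toy complexes are illustrative.

```python
def overlapping_score(pc, kc):
    """OS(pc, kc) = |pc & kc|^2 / (|pc| * |kc|), assuming the common definition."""
    pc, kc = set(pc), set(kc)
    if not pc or not kc:
        return 0.0  # ASSUMPTION: an empty complex matches nothing
    return len(pc & kc) ** 2 / (len(pc) * len(kc))


def is_match(pc, kc, threshold=0.2):
    """Two complexes match when their overlapping score reaches the threshold."""
    return overlapping_score(pc, kc) >= threshold


# Toy complexes (hypothetical protein names).
predicted = {"P1", "P2", "P3", "P4"}
known_close = {"P3", "P4", "P5"}               # shares two proteins: OS = 4 / 12
known_far = {"P4", "P6", "P7", "P8", "P9"}     # shares one protein:  OS = 1 / 20
```

With the threshold of 0.2, the predicted complex matches `known_close` but not `known_far`, which shows that a single shared protein is generally not enough for a match.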
The predicted protein complex sets identified by the various algorithms are each compared against the known protein complex set to obtain the Sensitivity (Sn) and Specificity (Sp) of each algorithm, which are typically used to evaluate protein complex identification. Let true positives (TP) denote the number of predicted protein complexes that match known complexes, false positives (FP) the number of predicted complexes that match none, and false negatives (FN) the number of known protein complexes that match none of the predicted ones. Sn and Sp are then defined by Eq (5) and Eq (6), respectively. Their harmonic mean, known as the F-measure (Eq (7)), is often used to assess the overall accuracy of a method.

$$Sn = \frac{TP}{TP + FN} \tag{5}$$

$$Sp = \frac{TP}{TP + FP} \tag{6}$$

$$F\text{-measure} = \frac{2 \times Sn \times Sp}{Sn + Sp} \tag{7}$$

A larger Sn indicates, to some extent, that more known protein complexes are recognized, while a higher Sp shows that a higher percentage of predicted protein complexes match known protein complexes.
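As a worked check of these definitions, the short Python sketch below computes Sn, Sp and F-measure from the three counts, taking Sn = TP / (TP + FN), Sp = TP / (TP + FP) and their harmonic mean. The example plugs in the DIP counts reported in the results (885 predicted complexes, 515 of them matched, 118 of the 349 known complexes matched); deriving FP and FN from those counts by subtraction is an assumption made here for illustration.

```python
def sn_sp_f(tp, fp, fn):
    """Return (Sn, Sp, F-measure) from true/false positive and false negative counts."""
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    f = 2 * sn * sp / (sn + sp)
    return sn, sp, f


# DIP counts from the results section:
#   TP = matched predicted complexes, FP = unmatched predicted complexes,
#   FN = known complexes matched by no prediction (ASSUMED to be 349 - 118).
sn, sp, f = sn_sp_f(tp=515, fp=885 - 515, fn=349 - 118)
print(round(f, 3))  # 0.632
```

Under these assumptions the result agrees with the F-measure of 0.632 reported for DCA on the DIP data, and the same calculation with the Krogan_extended counts (818 predicted, 558 matched, 90 known complexes matched) gives about 0.683.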
To evaluate the statistical significance of the identified protein complexes, many researchers annotate their main biological functions using the p-value given in Eq (8). Given a predicted protein complex containing C proteins, the p-value is the probability of observing k or more proteins from the complex by chance in a biological function shared by F proteins, from a total genome size of N proteins:

$$p\text{-value} = 1 - \sum_{i=0}^{k-1} \frac{\binom{F}{i}\binom{N-F}{C-i}}{\binom{N}{C}} \tag{8}$$

The lower the p-value, the stronger the biological significance of the complex; a complex with a p-value greater than 0.01 is considered insignificant. Generally speaking, larger protein complexes have smaller p-values.
hF-measure is a metric that evaluates clusters more finely and distinctly. It uses functional annotation information from the GO database to measure the similarity between the components of protein complexes. There are two versions of this metric: a topology-free measurement, hF-measureTf, and a topology-based measurement, hF-measureTb. Unlike F-measure, hF-measureTf and hF-measureTb can discriminate between different types of errors.
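The enrichment p-value described here is the upper tail of a hypergeometric distribution, and a minimal Python sketch of it is given below. Rather than computing one minus the lower tail, the sketch sums the upper tail directly with exact integer arithmetic, which is mathematically equivalent but avoids losing very small p-values to floating-point cancellation. The function name and the toy numbers are illustrative; in practice a tool such as GO::TermFinder would be used.

```python
from math import comb


def enrichment_p_value(N, F, C, k):
    """Probability of seeing k or more of the F annotated proteins in a complex
    of C proteins drawn at random from a genome of N proteins.
    """
    total = comb(N, C)
    # Sum P(X = i) for i = k .. min(C, F); comb() returns 0 for impossible draws.
    upper = sum(comb(F, i) * comb(N - F, C - i) for i in range(k, min(C, F) + 1))
    return upper / total


# Toy numbers (hypothetical): genome of 10 proteins, 4 annotated with the term,
# a complex of 3 proteins of which 2 carry the annotation.
p_toy = enrichment_p_value(N=10, F=4, C=3, k=2)
```

For the toy numbers, 40 of the 120 possible complexes contain at least two annotated proteins, so the p-value is 1/3 and such a complex would not be considered significant at the 0.01 cut-off.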
Results and discussion
For the gene expression data used in this paper, which cover 36 time points, there are on average 1043 (SD = 240) active proteins at each time point. Mapping them onto the static PPI networks yields the distribution of the numbers of proteins and interactions in the sub-networks shown in Table 1.
Using the two data sets, DIP and Krogan_extended, we applied our DCA algorithm to two yeast dynamic PPI networks constructed with the three-sigma method and performed comprehensive comparisons with existing competing algorithms, including DPC, TS-OCD, CAMSE, ClusterONE, SPICi, COACH, CoreAttach, CPM and MCODE. For all these methods, the parameters are set to their default empirical values, while for DCA we recommend α = 0.60, β = 0.55, γ = 1.4. Table 2 shows basic information about the predictions of the various methods on the two dynamic PPI networks. On the DIP data, DCA predicted 885 complexes with an average size of 8.2, of which 515 match 118 real complexes; on the Krogan_extended data, it predicted 818 complexes with an average size of 8.7, of which 558 match 90 real complexes.
Comparative sensitivity and specificity
Fig 2, Fig 3, Fig 4 and Fig 5 show the overall comparison in terms of Sn, Sp and F-measure. On the DIP data, the F-measure of DCA is 0.632, which is 23.2% higher than that of the next-best algorithm, CAMSE, on the static network and 23.7% higher than that of CAMSE on the dynamic network. Similarly, on the Krogan_extended data, the F-measure of DCA is 0.683, which is 32.6% and 17.5% higher than that of CAMSE on the static and dynamic networks, respectively. DCA achieves the highest F-measure by providing the highest specificity and comparable sensitivity, which shows that our method predicts protein complexes accurately.
To substantiate the biological significance of the predicted protein complexes, we calculate their p-values with the SGD GO::TermFinder tool. Table 3 and Table 4 show the distribution of p-values. On the DIP data, 853 out of 885 (96.4%) complexes predicted by DCA are significant with p-value ≤ 0.01, and DCA predicts a higher proportion of significant complexes than the other eight algorithms. For example, in the interval (0, 1e-15], DCA obtains 79 (8.9%) complexes, while the other algorithms achieve only 4~74 (1.1%~4.9%). This is consistent with the results on the Krogan_extended data, where DCA achieves 119 (14.6%) significant complexes. Many of our predicted complexes are found to match the known complexes well. Because the benchmark is incomplete, our unmatched predicted complexes, especially those with low p-values, may provide candidate complexes for biologists to validate.
As shown in Table 5, although the hF-measure of the DCA algorithm is slightly lower than that of TS-OCD, it is 2.1%~12.4% higher than those of the other seven algorithms on the DIP dynamic network and 2.8%~14.5% higher on the Krogan_extended dynamic network. Therefore, DCA performs significantly better than most state-of-the-art techniques. In addition, Table 6 and Table 7 list ten predicted protein complexes with high hF-measure and low p-value on the two dynamic PPI networks. The topological structures of the first complex in each of the two tables are illustrated in Fig 6; their GO terms are “tRNA transcription from RNA polymerase III promoter” and “ncRNA transcription”, respectively. From the above analysis we can see that DCA uncovers much useful biological knowledge.
Fig 7 illustrates an example of a predicted complex whose core consists of the eight proteins in circle A. The proteins in blocks B, C and D are the attachments of this complex at different time points. Its GO annotation is “3'-5'-exoribonuclease activity” (GO:0000175), with p-value 6.93e-20 and hF-measureTb 0.64.
Protein complexes, composed of multiple highly related proteins, are key molecular entities that perform specific cellular functions. The increasing amount of protein-protein interaction data has enabled us to identify protein complexes from PPI networks. However, current computational methods rarely consider both the inherent organization and the dynamics of protein complexes. This paper presents a new algorithm, DCA, for mining protein complexes from dynamic PPI networks. Its main advantage is that it combines the temporal features of the network with the core-attachment structure of complexes. The evaluation and analysis of our predictions demonstrate the following advantages of DCA over state-of-the-art competing approaches. First, our method differs fundamentally from other approaches in its attention to the inherent dynamic organization of protein complexes, which is often neglected in existing algorithms. Taking the dynamics of the cellular system into account brings the model closer to reality. Second, DCA achieves a significantly higher F-measure than existing methods, so our predicted complexes match the benchmark complexes well. In addition, DCA performs well on other metrics such as p-value and hF-measure, indicating that it predicts protein complexes accurately. Our identified complexes are therefore likely to be true complexes and may help biologists gain novel biological insights. Although time-course gene expression data are very helpful for exploring dynamic protein complexes, many other factors, such as living conditions and tissue specificity, need to be considered for deeper research on living systems. Models that integrate more dimensions of biological data are important for uncovering the mysteries of life.
- 1. Srihari S, Yong CH, Patil A, Wong L: Methods for protein complex prediction and their contributions towards understanding the organisation, function and dynamics of complexes. FEBS Lett. 2015; 589:2590–2602. pmid:25913176
- 2. Chen B, Fan W, Liu J, Wu FX: Identifying protein complexes and functional modules—from static PPI networks to dynamic PPI networks. Brief Bioinform. 2014; 15:177–194. pmid:23780996
- 3. Zhang XF, Dai DQ: A framework for incorporating functional interrelationships into protein function prediction algorithms. IEEE/ACM Trans Comput Biol Bioinform. 2012; 9:740–753. pmid:22084148
- 4. Lage K, Hansen NT, Karlberg EO, Eklund AC, Roque FS, Donahoe PK, et al: A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proc Natl Acad Sci U S A. 2008; 105:20870–20875. pmid:19104045
- 5. Yu H, Lin CC, Li YY, Zhao Z: Dynamic protein interaction modules in human hepatocellular carcinoma progression. BMC Syst Biol. 2013; 7 Suppl 5:S2.
- 6. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y: A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A. 2001; 98:4569–4574. pmid:11283351
- 7. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, et al: Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002; 415:141–147. pmid:11805826
- 8. Chen B, Shi J, Zhang S, Wu FX: Identifying protein complexes in protein-protein interaction networks by using clique seeds and graph entropy. Proteomics. 2013; 13:269–277. pmid:23112006
- 9. Nepusz T, Yu H, Paccanaro A: Detecting overlapping protein complexes in protein-protein interaction networks. Nat Methods. 2012; 9:471–472. pmid:22426491
- 10. Tingting H, Li P, Hu X, Shen X: A novel proteins complex identification based on connected affinity and multi-level seed extension. In 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 2014.
- 11. Chen B, Wu FX: Identifying protein complexes based on multiple topological structures in PPI networks. IEEE Trans Nanobioscience. 2013; 12:165–172. pmid:23974659
- 12. Li M, Wu X, Wang J, Pan Y: Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data. BMC Bioinformatics. 2012; 13:109. pmid:22621308
- 13. Tang X, Wang J, Liu B, Li M, Chen G, Pan Y: A comparison of the functional modules identified from time course and static PPI network data. BMC Bioinformatics. 2011; 12:339. pmid:21849017
- 14. Zhang Y, Lin H, Yang Z, Wang J, Liu Y, Sang S: A method for predicting protein complex in dynamic PPI networks. BMC Bioinformatics. 2016; 17:229. pmid:27454775
- 15. Ou-Yang L, Zhang X, Dai D, Wu M, Zhu Y, Liu Z, et al: Protein complex detection based on partially shared multi-view clustering. BMC Bioinformatics. 2016; 17:371. pmid:27623844
- 16. Zhang Y, Lin H, Yang Z, Wang J: Construction of dynamic probabilistic protein interaction networks for protein complex identification. BMC Bioinformatics. 2016; 17:186. pmid:27117946
- 17. Palla G, Derényi I, Farkas I: Uncovering the overlapping community structure of complex networks in nature and society. Nature. 2005; 435:814–818. pmid:15944704
- 18. Wang J, Peng X, Peng W, Wu F: Dynamic protein interaction network construction and applications. Proteomics. 2014; 14:338–352. pmid:24339054
- 19. Dezso Z, Oltvai ZN, Barabasi AL: Bioinformatics analysis of experimentally determined protein complexes in the yeast Saccharomyces cerevisiae. Genome Res. 2003; 13:2450–2454. pmid:14559778
- 20. Wu M, Li K: A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinformatics. 2009:169. pmid:19486541
- 21. Leung HC, Xiang Q, Yiu SM, Chin FY: Predicting protein complexes from PPI data: a core-attachment approach. J Comput Biol. 2009; 16:133–144. pmid:19193141
- 22. Lakizadeh A, Jalili S, Marashi S: PCD-GED: Protein complex detection considering PPI dynamics based on time series gene expression data. J Theor Biol. 2015; 378:31–38. pmid:25934349
- 23. Przytycka TM, Singh M, Slonim DK: Toward the dynamic interactome: it's about time. Brief Bioinform. 2010; 11:15–29. pmid:20061351
- 24. Han JJ, Bertin N, Hao T, Goldberg DS, Berriz GF, Zhang LV, et al: Evidence for dynamically organized modularity in the yeast protein–protein interaction network. Nature. 2004; 430:88–93. pmid:15190252
- 25. Yong CH, Wong L: From the static interactome to dynamic protein complexes: Three challenges. J Bioinf Comput Biol. 2015; 13:1571001.
- 26. Wang J, Peng X, Li M, Pan Y: Construction and application of dynamic protein interaction network based on time course gene expression data. Proteomics. 2013; 13:301–312. pmid:23225755
- 27. Xenarios I, Salwinski L, Duan XJ: DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 2002; 30:303–305. pmid:11752321
- 28. Krogan NJ, Cagney G, Yu H, Zhong G, Guo X, Ignatchenko A, et al: Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature. 2006; 440:637–643. pmid:16554755
- 29. Tu B, Kudlicki A, Rowicka M, McKnight S: Logic of the Yeast Metabolic Cycle. Science. 2005; 310:1152–1158. pmid:16254148
- 30. Pu S, Wong J, Turner B, Cho E, Wodak SJ: Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37:825–831. pmid:19095691
- 31. Li M, Wu X, Pan Y, Wang J: hF-measure: A new measurement for evaluating clusters in protein–protein interaction networks. Proteomics. 2013; 13:291–300. pmid:23193073
- 32. Li M, Chen W, Wang J, Wu F, Pan Y: Identifying dynamic protein complexes based on gene expression profiles and PPI networks. Biomed Res Int. 2014; 2014:1–10.
- 33. Ou-Yang L, Dai D, Li X, Wu M, Zhang X, Yang P: Detecting temporal protein complexes from dynamic protein-protein interaction networks. BMC Bioinformatics. 2014; 15:335. pmid:25282536
- 34. Jiang P, Singh M: SPICi: a fast clustering algorithm for large biological networks. Bioinformatics. 2010; 26:1105–1111. pmid:20185405
- 35. Adamcsek B, Palla G, Farkas IJ: CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics. 2006; 22:1021–1023. pmid:16473872
- 36. Bader D, Hogue C: An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics. 2003.
- 37. Boyle EI, Weng S, Gollub J, Jin H, Botstein D, Cherry JM, et al: GO::TermFinder—open source software for accessing Gene Ontology information and finding significantly enriched Gene Ontology terms associated with a list of genes. Bioinformatics (Oxford, England). 2004; 20:3710–3715.