Identifying protein complex by integrating characteristic of core-attachment into dynamic PPI network

Identifying protein complexes is an important and challenging task in proteomics, and it contributes greatly to our knowledge of the molecular mechanisms underlying cellular activities. However, the inherent organization and dynamic characteristics of the cell system have rarely been incorporated into existing algorithms for detecting protein complexes, because of the limitations of protein-protein interaction (PPI) data produced by high-throughput techniques. The availability of time-course gene expression profiles enables us to uncover the dynamics of molecular networks and improve the detection of protein complexes. To this end, this paper proposes a novel algorithm, DCA (Dynamic Core-Attachment). It detects protein-complex cores comprising continually expressed and highly connected proteins in a dynamic PPI network, and then forms each protein complex by adding attachments with high adhesion to the core. The integration of the core-attachment feature into the dynamic PPI network is responsible for the superiority of our algorithm. DCA has been applied to two different yeast dynamic PPI networks, and the experimental results show that it performs significantly better than state-of-the-art techniques in terms of prediction accuracy, hF-measure and biological statistical significance. In addition, the identified complexes with strong biological significance provide potential candidate complexes for biologists to validate.


Introduction
Cellular functions are carried out by protein complexes, formed by multiple proteins aggregating together, rather than by individual proteins. Identifying protein complexes has significant implications for revealing the principles of protein organization within the cell [1,2]. Protein complexes can help us to predict the functions of proteins [3]. Accumulated evidence suggests that protein complexes are involved in many disease mechanisms [4]. Tracking protein complexes could reveal important insights into modular mechanisms and improve our understanding of disease pathways [5].
In proteomics, large-scale protein-protein interaction (PPI) data have been produced by high-throughput techniques such as yeast two-hybrid (Y2H) [6] and affinity purification [7]. Typically, PPI data are abstracted as a complex network model in which each protein is regarded as a node and each interaction as an edge. Such a network has a characteristic modular structure, which has prompted the emergence of many computational approaches for detecting protein complexes.
Most current methods are based on network clustering alone [8][9][10] or integrate multiple kinds of biological data [11][12][13][14][15][16]. For example, Palla et al. proposed the CPM (Clique Percolation Method) algorithm, which detects overlapping dense groups of nodes as protein complexes by continuously merging maximal connected subgraphs containing k vertices in PPI networks [17]. Review articles [1,2,18] provide insight into the contributions in this area, which are significant for revealing the principles of protein organization within cells.
We know that a protein complex consists of highly connected proteins, but it is much more than that. The literature [19] indicates that a protein complex has a characteristic core-attachment structure, which has given rise to many protein complex identification algorithms based on this principle, for instance COACH [20], CoreAttach [21] and PCD-GED [22]. However, these approaches often neglect the inherent time-sequential feature of cellular activities. Cellular systems are highly dynamic and responsive to stimuli from the external environment [23]. Han et al. demonstrated dynamically organized modularity in the yeast PPI network [24]. It is therefore important to make the transition from analyzing static PPI networks to analyzing dynamic networks [25].
In this paper, we propose a new algorithm, called DCA (Dynamic Core-Attachment), to identify protein complexes by integrating their inherent organizations into dynamic PPI network. Protein-complex cores are formed by continually expressed and highly connected proteins. We subsequently generate protein complexes by appending attachments into the protein-complex cores. The integration of core-attachment feature into the time-evolving PPI network is responsible for the superiority of our algorithm. Experimental results using two PPI data sets of Saccharomyces cerevisiae show that our DCA method outperforms existing computational methods in terms of prediction accuracy, hF-measure and statistical significance in biology.

Materials and methods
To capture the dynamics of protein complexes, time-course gene expression data are integrated into the original static PPI network to generate a dynamic PPI network with the three-sigma method [26]. In brief, the method has two steps. First, a gene is considered active at a given time point only if its expression value is greater than a threshold calculated according to the three-sigma principle. Second, the active proteins at this time point and their connections in the static network constitute a sub-network. As a result, the time series of sub-networks forms a dynamic network. Please refer to [26] for more detail.
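The two steps above can be sketched as follows. This is a minimal illustration and not the implementation of [26]: the function names are hypothetical, and the threshold form in `active_threshold` (mean plus three standard deviations, damped by a variance-dependent factor) is an assumption about how the three-sigma principle is applied. The exact definition should be taken from [26]; a different rule can be supplied through `threshold_fn`.

```python
from statistics import mean, pstdev


def active_threshold(values, k=3.0):
    """Per-gene activity threshold (ASSUMED form, see [26] for the real one)."""
    mu = mean(values)
    sigma = pstdev(values)
    # Damping factor: close to 1 for strongly fluctuating genes, 0 for flat ones.
    damping = 1.0 - 1.0 / (1.0 + sigma ** 2)
    return mu + k * sigma * damping


def build_dynamic_network(static_edges, expression, threshold_fn=active_threshold):
    """Return one (active_proteins, edges) snapshot per time point.

    static_edges: iterable of (protein_a, protein_b) pairs.
    expression:   dict mapping protein -> list of expression values over time.
    """
    n_points = len(next(iter(expression.values())))
    thresholds = {p: threshold_fn(v) for p, v in expression.items()}
    snapshots = []
    for t in range(n_points):
        # Step 1: a protein is active at t if its expression exceeds its threshold.
        active = {p for p, v in expression.items() if v[t] > thresholds[p]}
        # Step 2: keep only static interactions whose two endpoints are both active.
        edges = {(a, b) for a, b in static_edges if a in active and b in active}
        snapshots.append((active, edges))
    return snapshots
```

With the 36 time points of GSE3431, this would yield 36 snapshots, one sub-network per time point.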
Our DCA algorithm operates in four phases based on the dynamic network. DCA first identifies protein-complex cores and then applies an outward growing strategy to produce protein complexes by including attachments into the protein-complex cores. We will first briefly introduce some basic terminologies and then describe in detail our proposed method for protein complex detection.
Following the core-attachment view, we assume that the proteins in a protein-complex core are highly connected and have a relatively long duration of activity. Based on this assumption, two quantities are used. First, the edges among core proteins have a higher edge clustering coefficient (referred to as ECC, Eq (1)). Second, the stability of a protein's activity (referred to as AT) is defined as the time span between the starting and ending time points of its active state. For example, if a protein becomes active at time point 6 and becomes inactive at time point 9, its active time span (AT) is 3. To characterize these biological properties of protein complexes effectively, we weight the PPI network by combining ECC and AT, as shown in Eq (2).
In Eq (1), Z_ij represents the number of common neighbors of the two interacting proteins i and j, while min(k_i - 1, k_j - 1) equals the theoretical maximum number of triangles containing the two nodes, where k_i and k_j are the degrees of i and j. ECC ranges from 0 to 1, and a greater value indicates a closer relationship among the nodes and their neighbors. In Eq (2), N_v contains the neighbors of node v, and AT ranges from 0 to 1 after normalization. The parameter α controls the contribution of ECC relative to AT.
ECC and AT are complementary and consistent with each other. First, because of false negatives in protein interaction data, some interactions inside a protein-complex core will have a lower ECC, so it is reasonable to increase their weight when AT is greater. Conversely, because of false positives, some interactions outside the protein-complex core will have a higher ECC, so it is also reasonable to decrease their weight when AT is lower. Second, the greater the value of either ECC or AT, the greater the likelihood that the proteins participate in central biological functions in the protein-complex core.
The attachments of a protein complex participate in different protein complexes and play a variety of supporting roles. Nevertheless, as part of a whole protein complex, they still have a relatively close relationship with the complex core. We define this relationship as adhesion, shown in Eq (3), where s is a neighbor of Core. Adh_s_Core describes the closeness between a protein-complex core and one of its neighbors, so we use it to measure the likelihood that a protein should be included in a core as its attachment.
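The two ingredients of the weight can be sketched in code. From the description of Eq (1), the edge clustering coefficient is ECC_ij = Z_ij / min(k_i - 1, k_j - 1). The sketch below assumes an adjacency-set representation, returns 0 when the denominator is 0, measures AT as the longest run of consecutive active time points, and normalizes AT by the largest span observed; these last three choices and all function names are assumptions for illustration. The combination of ECC and AT in Eq (2) is not reproduced here because its exact form is not recoverable from the text.

```python
def edge_clustering_coefficient(adj, i, j):
    """ECC of edge (i, j): common neighbors over min(k_i - 1, k_j - 1)."""
    z_ij = len(adj[i] & adj[j])  # common neighbors of i and j
    max_triangles = min(len(adj[i]) - 1, len(adj[j]) - 1)
    # ASSUMPTION: an edge that cannot be part of any triangle gets ECC 0.
    return z_ij / max_triangles if max_triangles > 0 else 0.0


def active_time_span(active_flags):
    """Longest run of consecutive active time points (ASSUMED reading of AT)."""
    best = run = 0
    for flag in active_flags:
        run = run + 1 if flag else 0
        best = max(best, run)
    return best


def normalize_spans(spans):
    """Scale spans into [0, 1] by the largest span (ASSUMED normalization)."""
    longest = max(spans.values(), default=0)
    return {p: (s / longest if longest else 0.0) for p, s in spans.items()}
```

For the example in the text, a protein that is active at time points 6, 7 and 8 and inactive from time point 9 has a span of 3.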

DCA algorithm
As shown in Fig 1, DCA works on each snapshot of the dynamic PPI network. First, it calculates the weight of each protein node according to its stability of activity and edge clustering coefficient (lines 3~6). Second, each node with weight greater than β is consolidated with its neighbors to form a protein-complex core (lines 7~11). Third, for each protein-complex core, reliable attachments cooperating with it are selected to form a protein complex (lines 12~19). Because of the periodic properties of gene expression data, the identified protein complexes contain a large number of highly similar ones, so the last step is a redundancy-filtering procedure (lines 20~23). The computational complexity of DCA is O(N^2) for given parameters α, β and γ, where N is the number of nodes in the network.
DCA operates on a time-evolving PPI network and takes its time-sequential feature into account. In addition, the predicted protein complexes may overlap with each other, since attachments typically participate in multiple protein complexes to carry out specific biological functions.

Experimental data
To verify the validity of the proposed DCA algorithm, we use two yeast PPI networks: the DIP [27] (version 20101010) and Krogan_extended [28] data sets. After filtering out a small number of proteins that have no gene expression profile, the former contains 24278 interactions among 4969 proteins, while the latter includes 12399 interactions among 3153 proteins. Gene expression data over three successive metabolic cycles are available from GEO (Gene Expression Omnibus) under accession number GSE3431 [29]. This dataset includes the expression profiles of 9335 probes at 36 different time points. It is integrated into the static PPI network to construct the dynamic PPI network. The known protein complex set is derived from CYC2008 [30] and contains 349 complexes after removing those that are not covered by the PPI network. CYC2008 is widely used as a reference set of protein complexes for evaluating protein complex prediction and allows precise, standardized functional descriptions of genes.

Metrics for evaluating identified protein complexes
Three evaluation metrics, namely F-measure, GO enrichment analysis and hF-measure, are used in this paper to test the performance of the DCA algorithm.
The Overlapping Score (OS) [12], defined in Eq (4), is usually used to assess the match between a predicted protein complex pc and a known protein complex kc. Here |pc ∩ kc| represents the number of proteins involved in both complexes pc and kc, while |pc| and |kc| represent the number of proteins involved in complex pc and complex kc, respectively. Two protein complexes are considered to match if their overlapping score is greater than or equal to a given threshold, which is set to 0.2, as in many other studies [12]. In particular, OS(pc, kc) = 1 indicates that the two complexes pc and kc match perfectly.
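A minimal sketch of the matching test follows. It assumes that Eq (4) takes the form commonly used in the complex-detection literature, OS(pc, kc) = |pc ∩ kc|^2 / (|pc| × |kc|), which is consistent with the properties stated above (a value of 1 for identical complexes and a usual threshold of 0.2). The function names are illustrative.

```python
def overlapping_score(pc, kc):
    """OS(pc, kc) = |pc & kc|^2 / (|pc| * |kc|), ASSUMED form of Eq (4)."""
    pc, kc = set(pc), set(kc)
    if not pc or not kc:
        return 0.0  # an empty complex cannot match anything
    shared = len(pc & kc)
    return shared ** 2 / (len(pc) * len(kc))


def is_match(pc, kc, threshold=0.2):
    """Two complexes match when their overlapping score reaches the threshold."""
    return overlapping_score(pc, kc) >= threshold
```

For example, a predicted complex of three proteins that shares two proteins with a known complex of four proteins scores 4/12, which is above the 0.2 threshold.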
The predicted protein complex sets identified by various algorithms are separately compared against the known protein complex set, by which we can obtain the performance of algorithms on Sensitivity (Sn) and Specificity (Sp). They are typically employed to evaluate the identification of protein complexes. Let true positives (TP) denote the number of predicted protein complexes that match with known complexes, false positives (FP) denote the number of unmatched ones, and false negatives (FN) denote the number of known protein complexes which match with none of the predicted ones, then Sn and Sp can be defined as Eq (5) and Eq (6), respectively. The harmonic mean of Sn and Sp, also known as F-measure Eq (7), is often used to assess the overall accuracy of various methods [12].
Larger Sn to some extent indicates that more known protein complexes could be recognized, while higher Sp shows that higher percentage of predicted protein complexes match with known protein complexes.
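Written out from the definitions of TP, FP and FN above, Eqs (5) to (7) presumably take the standard form below. This is a reconstruction from the surrounding text rather than a copy of the original typesetting.

```latex
\begin{align}
Sn &= \frac{TP}{TP + FN} \tag{5} \\
Sp &= \frac{TP}{TP + FP} \tag{6} \\
F\text{-measure} &= \frac{2 \cdot Sn \cdot Sp}{Sn + Sp} \tag{7}
\end{align}
```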
To evaluate the statistical significance of the identified protein complexes, many researchers annotate their main biological functions by using the p-value formulated in Eq (8) [26]. Given a predicted protein complex containing C proteins, the p-value is the probability of observing, by chance, k or more proteins from the complex in a biological function shared by F proteins, out of a total genome size of N proteins. The lower the p-value, the stronger the biological significance of the complex, while a complex with a p-value greater than 0.01 is deemed insignificant. Generally speaking, larger protein complexes have smaller p-values.
hF-measure is a measurement that evaluates clusters more finely and distinctly [31]. It uses functional annotation information in the GO database to measure the similarity between components of protein complexes. There are two versions of this metric: one is the topology-free measurement hF-measure_Tf, and the other is the topology-based measurement hF-measure_Tb. Unlike F-measure, hF-measure_Tf and hF-measure_Tb can discriminate between different types of errors.
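The probability described above is the upper tail of a hypergeometric distribution. The sketch below assumes that Eq (8) has the usual form p = 1 - Σ_{i=0}^{k-1} C(F, i) C(N - F, C - i) / C(N, C) and computes the equivalent sum from k upward, which avoids subtracting two nearly equal numbers. The function name is illustrative, and the code is not the GO::TermFinder implementation.

```python
from math import comb


def enrichment_p_value(N, F, C, k):
    """P(at least k of the C complex proteins carry a function shared by F of N).

    N: total number of proteins in the genome.
    F: number of proteins annotated with the function.
    C: size of the predicted complex.
    k: number of complex proteins annotated with the function.
    """
    total = comb(N, C)  # all ways to draw a complex of size C
    upper = min(C, F)   # cannot observe more hits than this
    tail = sum(comb(F, i) * comb(N - F, C - i) for i in range(k, upper + 1))
    return tail / total
```

A complex would then be called significant for a function when the returned value does not exceed 0.01.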

Results and discussion
For the gene expression data with 36 time points used in this paper, there are on average 1043 (SD = 240) active proteins at each time point. After mapping them onto the static PPI networks, the distribution of the numbers of proteins and interactions in the sub-networks is shown in Table 1.
Using the two data sets, DIP and Krogan_extended, we applied DCA to two yeast dynamic PPI networks constructed with the three-sigma method [26] and performed comprehensive comparisons with existing competing algorithms, including DPC [32], TS-OCD [33], CAMSE [10], ClusterONE [9], SPICI [34], COACH [20], CoreAttach [21], CPM [35] and MCODE [36]. For all these methods, the parameters are set to their default empirical values, while for DCA we recommend α = 0.60, β = 0.55 and γ = 1.4. Table 2 shows the basic information of the predictions made by the various methods on the two dynamic PPI networks. On DIP data, DCA predicted 885 complexes with an average size of 8.2, of which 515 match 118 real complexes; on Krogan_extended data, it predicted 818 complexes with an average size of 8.7, of which 558 match 90 real complexes.

Comparative sensitivity and specificity
On DIP data, the F-measure of DCA is 0.632, which is 23.2% higher than that of the next-best algorithm, CAMSE, on the static network and 23.7% higher than that of CAMSE on the dynamic network. Similarly, on Krogan_extended data, the F-measure of DCA is 0.683, which is 32.6% and 17.5% higher than that of CAMSE on the static and dynamic networks, respectively. DCA achieves the highest F-measure by providing the highest specificity and comparable sensitivity, which shows that our method predicts protein complexes accurately.
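The reported F-measures can be checked from the counts given in this section, using the definitions of TP, FP and FN from the metrics section. The sketch assumes that the reference set of 349 known complexes applies to both networks; the function name is illustrative.

```python
def sn_sp_f(n_predicted, n_predicted_matched, n_known, n_known_matched):
    """Return (Sn, Sp, F-measure) from complex counts."""
    tp = n_predicted_matched            # predicted complexes matching a known one
    fp = n_predicted - tp               # predicted complexes matching none
    fn = n_known - n_known_matched      # known complexes matched by no prediction
    sn = tp / (tp + fn)
    sp = tp / (tp + fp)
    return sn, sp, 2 * sn * sp / (sn + sp)


# Counts quoted in the text: 349 known complexes (ASSUMED for both networks).
dip = sn_sp_f(885, 515, 349, 118)
krogan = sn_sp_f(818, 558, 349, 90)
print("DIP:             Sn=%.3f Sp=%.3f F=%.3f" % dip)
print("Krogan_extended: Sn=%.3f Sp=%.3f F=%.3f" % krogan)
```

Under these assumptions the computed F-measures round to 0.632 for DIP and 0.683 for Krogan_extended, in agreement with the values reported above.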

P-value analysis
To substantiate the biological significance of the predicted protein complexes, we calculate their p-values with SGD's GO::TermFinder tool [37]. Table 3 and Table 4 show the distribution of p-values. Using DIP data, 853 out of 885 (96.4%) complexes predicted by DCA are considered significant, with a p-value not exceeding 0.01, and DCA predicts a higher proportion of significant complexes than the other eight algorithms. For example, in the interval (0, 1e-15], DCA obtains 79 (8.9%) complexes, while the other algorithms only achieve 4~74 (1.1%~4.9%). This result is consistent with the results on Krogan_extended data, where DCA achieves 119 (14.6%) significant complexes. Many of our predicted complexes are found to match the known complexes well. Because the benchmark is incomplete, our non-matched predicted complexes, especially those with low p-values, may provide potential candidate complexes for biologists to validate.

HF-measure analysis
As shown in Table 5, although the hF-measure of DCA is slightly lower than that of TS-OCD, it is 2.1%~12.4% higher than that of the other seven algorithms on the DIP dynamic network and 2.8%~14.5% higher on the Krogan_extended dynamic network. Therefore, DCA performs significantly better than most state-of-the-art techniques. In addition, Table 6 and Table 7 provide ten predicted protein complexes with high hF-measure and low p-value on the two dynamic PPI networks. The topological structure of the first complex in each of the two tables is illustrated in Fig 6; their GO terms are "tRNA transcription from RNA polymerase III promoter" and "ncRNA transcription", respectively. From the above analysis we can see that DCA detects much useful biological knowledge. Fig 7 illustrates an example of a predicted complex, whose core consists of the eight proteins in circle A. The proteins in blocks B, C and D are the attachments of this complex at different time points. Its GO annotation is "3'-5'-exoribonuclease activity" (GO:0000175), with p-value 6.93e-20 and hF-measure_Tb 0.64.

Conclusions
Protein complexes, which comprise multiple highly related proteins, are key molecular entities that perform specific cellular functions. The increasing amount of protein-protein interaction data has enabled us to identify protein complexes from PPI networks. However, current computational methods rarely consider both the inherent organization and the dynamics of protein complexes. This paper presents a new algorithm, DCA, for mining protein complexes from dynamic PPI networks. Its main advantage is that it combines the time-sequential feature of the network with the core-attachment structure of complexes. The evaluation and analysis of our predictions demonstrate the following advantages of DCA over state-of-the-art competing approaches. First, our method is fundamentally different from other approaches in its insight into the inherent dynamic organization of protein complexes, which is often neglected in existing algorithms. Considering the dynamics of the cell system brings the model closer to reality. Second, DCA achieves a significantly higher F-measure than existing methods; thus, our predicted complexes match the benchmark complexes very well. In addition, DCA also performs very well on other metrics such as p-value and hF-measure, indicating that it can predict protein complexes accurately. Our identified complexes are therefore likely to be true complexes that can help biologists gain novel biological insights. Although time-course gene expression data are very helpful for exploring dynamic protein complexes, many other factors, such as living conditions and tissue specificity, need to be considered for deeper research on living systems. Models that integrate more dimensions of biological data are very important for uncovering the mystery of life.