Construction of Ontology Augmented Networks for Protein Complex Prediction

Protein complexes are of great importance in understanding the principles of cellular organization and function. The increase in available protein-protein interaction data, gene ontology and other resources makes it possible to develop computational methods for protein complex prediction. Most existing methods focus mainly on the topological structure of protein-protein interaction networks, and largely ignore the gene ontology annotation information. In this article, we constructed ontology augmented networks from protein-protein interaction data and gene ontology, which effectively combined the topological structure of protein-protein interaction networks and the similarity of gene ontology annotations into a unified distance measure. Based on the ontology augmented networks, a novel method (clustering based on ontology augmented networks) was proposed to predict protein complexes, which takes into account both the topological structure of the protein-protein interaction network and the similarity of gene ontology annotations. Our method was applied to two different yeast protein-protein interaction datasets and predicted many well-known complexes. The experimental results showed that (i) ontology augmented networks and the unified distance measure can effectively combine structural closeness and gene ontology annotation similarity; (ii) our method is valuable in predicting protein complexes and achieves higher F1 and accuracy than competing methods.


Introduction
Protein complexes are groups of two or more associated polypeptide chains, which play a critical role in many biological processes. Many proteins are functional only after they are assembled into a protein complex and interact with other proteins in this complex. Even in the relatively simple model organism Saccharomyces cerevisiae, these complexes comprise many subunits that work in a coherent fashion. Therefore, protein complexes are important molecular entities, and studying them is of great importance in unveiling the secrets of cellular organization and function.
As protein complexes are groups of proteins that interact with each other, they are generally dense subgraphs in protein-protein interaction (PPI) networks [1,2]. The increase in available PPI data makes it possible to predict protein complexes in PPI networks. Several computational methods for protein complex prediction typically focus on the extraction of dense regions in the PPI networks based on graph theory, including MCL [3], MCODE [4], LCMA [5], CFinder [6] and PCP [7]. However, these methods ignore the biological properties of protein complexes. In general, the proteins in a complex have similar biological properties, but PPI networks cannot provide such vital information. In addition, PPI data produced by high-throughput experiments are often associated with high false positive and false negative rates [8,9].
To address these problems, other valuable resources are gradually being used for protein complex prediction. For example, several recent studies [10,11] have investigated gene expression data to improve protein complex prediction. These studies mainly defined specific scoring methods based on gene expression data, and constructed more reliable weighted PPI networks. The intuition behind them is that the weighted PPI networks should better represent the actual interaction network than the initial binary PPI networks.
Gene Ontology (GO) is another useful resource, and is currently one of the most comprehensive ontology databases in the bioinformatics community [12]. GO aims to standardize the annotation of genes and gene products across species and provides a controlled vocabulary of terms for describing gene product biological properties, which is a significant addition to PPI data for protein complex prediction. Due to the inherent biological properties of protein complexes, the ideal method for protein complex prediction should generate clusters in PPI networks which have a cohesive topological structure with similar GO annotations, by balancing the topological structure and GO annotation similarities. Figure 1 shows an example of protein complex prediction. Figure 1 (a) is a simple PPI network where a vertex represents a protein and an edge represents the interaction between two proteins. Figure 1 (b) is the PPI network annotated by GO slims. As we can see, due to the presence of noise and the complex connectivity of PPI data, it is hard to predict protein complexes from the PPI network in Figure 1 (a). However, if we consider the GO annotation information of each protein in Figure 1 (b), we can predict two complexes reasonably well in Figure 1 (c).
In this study we determined how to predict protein complexes based on both the topological structure of PPI networks and GO annotation similarities. We proposed a novel method for protein complex prediction, called COAN, based on attribute graph clustering theory [13]. The key to our method was to integrate the PPI data and GO into a unified framework by constructing ontology augmented networks. In the ontology augmented networks, we used a unified distance measure to estimate the pairwise vertex closeness. Based on the ontology augmented graph and unified distance measure, COAN generated seed cliques from the maximal cliques in the PPI networks, and expanded clusters starting from the seed cliques. In the experimental section, we showed that COAN was competitive or superior in performance, compared with the state-of-the-art methods used for protein complex prediction.

Ontology augmented networks
Some resources, such as gene expression data, have been used to assess the reliability of protein interactions. These methods usually assign a score to each protein pair. Unlike these methods, we integrated the PPI data and GO into a unified framework by constructing ontology augmented networks, based on attribute graph clustering theory [13].
The GO database is currently one of the most comprehensive and well-curated ontology databases in the bioinformatics community. GO provides GO terms to describe gene product characteristics in three different aspects: (I) biological process, referring to a biological objective to which the gene or gene product contributes; (II) molecular function, defined as the biochemical activity of a gene product; (III) cellular component, referring to the place in the cell where a gene product is active. GO slims are cut-down versions of the GO ontologies containing a subset of GO terms. Compared with GO terms, GO slims give a broad overview of the ontology content without the detail of the specific fine-grained terms. GO slims are particularly useful for summarizing the results of GO annotation of a genome, microarray, or cDNA collection when a broad classification of gene product function is required. Previous studies [14,15] have shown that the proteins in a protein complex generally share one or more GO term annotations. Since GO slims give a broader overview of ontology content than GO terms, we used GO slims to annotate PPI data in this study. Next, we introduce how to construct ontology augmented networks.
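Given a PPI network $G = (V, E)$ and the GO slim annotation set $A = (g_1, \ldots, g_m)$, each protein can be annotated by one or more GO slims in $A$. The ontology augmented network $G_o$ extends the PPI network with a set $V'$ of GO "dummy" vertices, one for each GO slim in $A$, and connects each protein to the dummy vertices of the GO slims that annotate it. Such an edge between a protein and a dummy vertex is called a GO annotation edge. Figure 2 is the ontology augmented network for the example in Figure 1. Two GO "dummy" vertices, "A1. GO slim 1" and "A2. GO slim 2", are added. Proteins with the corresponding GO annotations are connected to the two "dummy" vertices, respectively, by dashed lines.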
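The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `build_augmented_network`, the `("GO", slim)` key used for dummy vertices, and the toy proteins and slims are all assumptions made for the example.

```python
from collections import defaultdict

def build_augmented_network(ppi_edges, go_slims):
    """Return an adjacency map over proteins and GO dummy vertices.

    ppi_edges: iterable of (protein, protein) pairs (PPI edges).
    go_slims:  dict mapping a protein to the GO slims that annotate it.
    Dummy vertices are keyed as ("GO", slim) so that they cannot
    collide with protein names (a choice made for this sketch).
    """
    adj = defaultdict(set)
    for u, v in ppi_edges:                 # undirected PPI edges
        adj[u].add(v)
        adj[v].add(u)
    for protein, slims in go_slims.items():
        for slim in slims:                 # GO annotation edges
            dummy = ("GO", slim)
            adj[protein].add(dummy)
            adj[dummy].add(protein)
    return dict(adj)

# Toy data (hypothetical): a four-protein path annotated with two GO slims.
ppi_edges = [("P1", "P2"), ("P2", "P3"), ("P3", "P4")]
go_slims = {"P1": ["slim1"], "P2": ["slim1"], "P3": ["slim2"], "P4": ["slim2"]}
network = build_augmented_network(ppi_edges, go_slims)
```

In this toy network, P1 and P2 gain a common neighbor (the dummy vertex for slim1) in addition to their direct PPI edge, which is the effect the augmented network is meant to capture.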

Unified distance measure
The transition matrix $P_o$ of the ontology augmented network is a $|V \cup V'|$ by $|V \cup V'|$ matrix. The transition probabilities are defined as follows. The transition probability from protein $v_i$ to its neighbor $v_j$ through a PPI edge or a GO annotation edge is

$$p_{v_i,v_j} = \frac{1}{|N(v_i)| + |N'(v_i)|} \tag{1}$$

where $N(v_i)$ represents the set of proteins directly connected with protein $v_i$ in the ontology augmented network, and $N'(v_i)$ represents the set of dummy vertices, namely GO slim annotations, directly connected with protein $v_i$. The transition probability from GO annotation $v_i$ to protein $v_j$ through a GO annotation edge is

$$p_{v_i,v_j} = \frac{1}{|N(v_i)|} \tag{2}$$

Since there is no edge between two GO annotations, the transition probability between two GO annotations $v_i$ and $v_j$ is 0:

$$p_{v_i,v_j} = 0 \tag{3}$$

Combining Equations (1)-(3), the transition probability matrix $P_o$ of an ontology augmented network $G_o$ can be calculated. Figure 3 is a transition probability matrix for our example in Figure 2.
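As a rough illustration of how such a transition matrix could be assembled, the sketch below assumes that each step of the random walk picks uniformly among all edges (PPI or GO annotation) leaving the current vertex. That uniform choice, the function name `transition_matrix`, and the toy adjacency map are assumptions of this example, not details confirmed by the text.

```python
def transition_matrix(adj):
    """Build a row-stochastic matrix from an adjacency map.

    adj maps every vertex (protein or GO dummy vertex) to the set of its
    neighbors. Assumption: a walker at u moves to each neighbor with
    probability 1 / (number of neighbors of u).
    """
    vertices = sorted(adj, key=str)
    index = {v: k for k, v in enumerate(vertices)}
    n = len(vertices)
    P = [[0.0] * n for _ in range(n)]
    for u, neighbors in adj.items():
        for v in neighbors:
            P[index[u]][index[v]] = 1.0 / len(neighbors)
    return vertices, P

# Toy network (hypothetical): P1-P2 and P2-P3 interact; A1 annotates
# P1 and P2, A2 annotates P3. A1 and A2 are GO dummy vertices.
adj = {
    "P1": {"P2", "A1"},
    "P2": {"P1", "P3", "A1"},
    "P3": {"P2", "A2"},
    "A1": {"P1", "P2"},
    "A2": {"P3"},
}
vertices, P = transition_matrix(adj)
```

Because dummy vertices are only ever linked to proteins, every entry between two GO annotations stays at zero without any special handling.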
The entry $[P_o^2]_{ij}$ contains the summed transition probability of paths from protein $v_i$ to protein $v_j$ through one intervening vertex, that is, paths of length two. Similarly, for any length $n$, the summed transition probability from protein $v_i$ to protein $v_j$ can be determined by calculating $[P_o^n]_{ij}$. The unified distance on the ontology augmented network is defined as follows:

$$d(v_i, v_j) = \sum_{n=1}^{\infty} \lambda^n [P_o^n]_{ij} \tag{4}$$

where $\lambda$ is the decay parameter. The matrix form of the unified distance is

$$R_o = \sum_{n=1}^{\infty} \lambda^n P_o^n \tag{5}$$

$$R_o = (I - \lambda P_o)^{-1} - I \tag{6}$$

Because $\lambda \in (0,1)$, the series converges and the unified distance matrix $R_o$ can be calculated efficiently by Equation (6). Matrix inversion is roughly of cubic time complexity.
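The matrix-inversion step can be illustrated with NumPy. The sketch assumes the distance is the decayed sum of n-step transition probabilities, sum over n >= 1 of lambda^n P^n, whose closed form is (I - lambda P)^-1 - I. The function name `unified_distance`, the value lambda = 0.5, and the three-vertex matrix are illustrative assumptions.

```python
import numpy as np

def unified_distance(P, lam=0.5):
    """Closed form of sum_{n>=1} lam^n P^n for a row-stochastic P.

    Assumes 0 < lam < 1, so that I - lam * P is invertible and the
    series converges. lam = 0.5 is an arbitrary illustrative value.
    """
    P = np.asarray(P, dtype=float)
    identity = np.eye(P.shape[0])
    return np.linalg.inv(identity - lam * P) - identity

# Toy transition matrix (hypothetical): two interacting proteins that
# share one GO dummy vertex, so each vertex has two neighbors.
P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
R = unified_distance(P, lam=0.5)

# Sanity check: the closed form should agree with a long truncated series.
approx = sum((0.5 ** n) * np.linalg.matrix_power(P, n) for n in range(1, 60))
```

For large networks, solving a linear system or truncating the series after a few terms may be preferable to forming an explicit inverse; that is a design choice outside what the text specifies.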
We use the unified distance to measure the closeness of protein pairs. One important difference between the unified distance on the ontology augmented network $G_o$ and that on the original PPI network $G$ is that, if two proteins $v_i$ and $v_j$ have the same GO annotation $A_k$, they have a new common neighbor, and thus there is a random walk path between $v_i$ and $v_j$ through $A_k$. The more GO annotations two proteins share, the more random walk paths exist between them. These additional paths increase the value of $d(v_i, v_j)$, so a larger value indicates a closer pair of proteins. In this way, ontology augmented networks effectively combine the topological structure of PPI networks and the similarity of GO annotations into a unified distance measure.

The COAN algorithm
The COAN algorithm broadly consists of two phases. In the first phase, COAN generates seed cliques from all the maximal cliques. First, COAN ranks the cliques based on the unified distance measure. Then, COAN chooses the top-ranked clique as a seed clique, and removes or prunes the cliques that overlap with it. This process is repeated until the candidate clique set is empty. In the second phase, COAN expands clusters starting from the seed cliques by adding close neighbor proteins.
As existing PPI networks are usually sparse, enumerating all maximal cliques does not pose a problem [10]. COAN uses the clique enumeration algorithm proposed by Tomita et al. [16] to enumerate all maximal cliques of size no less than 3 in the initial PPI network. These maximal cliques make up the candidate clique set $C$. COAN uses a density function to measure the closeness of each clique. The density of a clique is computed from the unified distances $d(v_i, v_j)$ between pairs of proteins $v_i$ and $v_j$ of the clique on the ontology augmented network. If a clique has a large density value, it generally has strong connectivity and its proteins share more common GO annotations. Therefore, the density function takes into account both the structural connectivity of PPI networks and GO annotation similarity. In order to choose high-density cliques as seed cliques, COAN ranks all the maximal cliques in descending order of their density value.

In general, the maximal cliques overlap with each other. In COAN the seed cliques do not overlap, because overlapping cliques are removed or pruned. Given the candidate clique set ranked in descending order of density value, denoted as $\{C_1, C_2, \ldots, C_n\}$, the COAN algorithm deletes the top-ranked clique $C_1$ from $C$ and inserts $C_1$ into the seed clique set $S$. Then, the COAN algorithm removes or prunes the overlapping cliques as follows. For any other clique $C_i \in C$, COAN checks whether $C_1 \cap C_i \neq \emptyset$. If so, COAN further checks whether $|C_i - C_1| \geq 3$. If $|C_i - C_1| \geq 3$, $C_i$ is replaced by $C_i - C_1$; otherwise $C_i$ is removed directly. These steps are repeated until the candidate clique set $C$ is empty. Consequently, COAN generates the seed clique set $S$, and the seed cliques do not overlap.
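The rank, select and prune loop of the first phase might look roughly like the following. Two points are assumptions of this sketch rather than statements from the text: the density is taken to be the mean unified distance over all protein pairs in a clique, and the candidates are re-ranked in every round because pruning changes their membership. The names `mean_pairwise_density` and `select_seed_cliques` and all toy values are hypothetical.

```python
from itertools import combinations

def mean_pairwise_density(clique, d, default=0.0):
    """Assumed density: mean unified distance over all pairs in the clique."""
    pairs = list(combinations(sorted(clique), 2))
    return sum(d.get(frozenset(p), default) for p in pairs) / len(pairs)

def select_seed_cliques(cliques, density, min_size=3):
    """Greedily pick non-overlapping seed cliques by descending density."""
    candidates = [set(c) for c in cliques if len(c) >= min_size]
    seeds = []
    while candidates:
        # Re-rank every round (assumption): pruned cliques may score differently.
        candidates.sort(key=density, reverse=True)
        top = candidates.pop(0)
        seeds.append(top)
        remaining = []
        for clique in candidates:
            if top & clique:                  # overlaps the new seed
                clique = clique - top         # prune the shared proteins
                if len(clique) < min_size:
                    continue                  # too small after pruning: remove
            remaining.append(clique)
        candidates = remaining
    return seeds

# Toy unified distances (hypothetical): pairs inside {a, b, c} are close,
# every other pair falls back to a low background value of 0.2.
d = {frozenset(p): 0.9 for p in [("a", "b"), ("a", "c"), ("b", "c")]}
cliques = [{"a", "b", "c"}, {"c", "d", "e", "f"}, {"a", "b", "g"}]
seeds = select_seed_cliques(cliques, lambda c: mean_pairwise_density(c, d, 0.2))
```

Here {a, b, g} is removed because only g survives pruning, while {c, d, e, f} is pruned to {d, e, f} and becomes the second seed.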
In the second phase, COAN expands the seed cliques by adding close neighbor proteins. We use a connectivity score to measure how strongly a protein $v_i$ is connected to a seed clique $S_j$, where $v_i \notin S_j$. If $\text{connectivity-score}(v_i, S_j) \geq \text{extend\_thres}$, then $v_i$ is added to $S_j$. Here, extend_thres is a predefined threshold for extending. The final predicted complexes are thus generated by adding the close proteins to the seed cliques. Figure 4 shows the pseudocode of the COAN algorithm.
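A minimal sketch of the expansion step is given below. The connectivity score used here, the mean unified distance between a candidate protein and the members of the seed clique, is an assumption chosen only to make the example concrete, as is the restriction of candidates to PPI neighbors of the seed. The names `connectivity_score` and `expand_seed` and the toy distance values are hypothetical.

```python
def connectivity_score(v, seed, d):
    """Assumed score: mean unified distance between v and the seed's proteins."""
    return sum(d.get(frozenset((v, u)), 0.0) for u in seed) / len(seed)

def expand_seed(seed, ppi_adj, d, extend_thres=0.6):
    """Add every PPI neighbor of the seed whose score reaches extend_thres."""
    neighbors = set().union(*(ppi_adj[u] for u in seed)) - seed
    added = {v for v in neighbors
             if connectivity_score(v, seed, d) >= extend_thres}
    return seed | added

# Toy data (hypothetical): x is close to the whole seed, y to one member only.
seed = {"a", "b", "c"}
ppi_adj = {
    "a": {"b", "c", "x"},
    "b": {"a", "c", "x", "y"},
    "c": {"a", "b"},
    "x": {"a", "b"},
    "y": {"b"},
}
d = {
    frozenset(("x", "a")): 0.9,
    frozenset(("x", "b")): 0.9,
    frozenset(("x", "c")): 0.6,
    frozenset(("y", "b")): 0.9,
}
complex_ = expand_seed(seed, ppi_adj, d, extend_thres=0.6)
```

Raising `extend_thres` in this sketch makes the expansion stricter, and lowering it admits more neighbors.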

Results and Discussion
In this section, we first describe the datasets and evaluation metrics used in our experiments, and then study the impact of extend_thres on COAN. We compared COAN with state-of-the-art methods including CMC [10], COACH [15] and HUNTER [11]. Finally, we present some protein complexes predicted by COAN with detailed information. Source Code S1 in the Supporting Information is the source code of COAN.

Datasets
The two PPI datasets used were the DIP dataset [17] and the Krogan dataset [18]. The DIP dataset contains 4928 proteins and 17208 interactions, and the Krogan dataset contains 2675 proteins and 7080 interactions.
The reference complex dataset was CYC2008 [19], a comprehensive catalogue of 408 manually curated heteromeric protein complexes that are reliably backed by small-scale experiments; it is used as the benchmark complex set by most methods.

Evaluation metrics
Overall, there are two types of evaluation metrics used to evaluate the quality of predicted complexes and compute the overall precision of the prediction methods.
One type of evaluation metric comprises Precision, Recall and F1, which are commonly used in bioinformatics and machine learning. Let $p(V_p, E_p)$ be a predicted complex and $b(V_b, E_b)$ be a reference complex. The neighborhood affinity score $NA(p, b)$ between $p(V_p, E_p)$ and $b(V_b, E_b)$ is defined as follows:

$$NA(p, b) = \frac{|V_p \cap V_b|^2}{|V_p| \times |V_b|}$$

If $NA(p, b) \geq \omega$, then we consider $p(V_p, E_p)$ and $b(V_b, E_b)$ to match each other. We set $\omega = 0.2$ in our experiments, which is the same as in most methods for protein complex prediction [4,5,19-21]. Let $P$ and $B$ denote the sets of predicted complexes and reference complexes, respectively. Let $N_{cp}$ be the number of predicted complexes which match at least one reference complex and $N_{cb}$ be the number of reference complexes that match at least one predicted complex. Precision, Recall and the F1 measure are defined as follows:

$$Precision = \frac{N_{cp}}{|P|}, \qquad Recall = \frac{N_{cb}}{|B|}, \qquad F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

Precision measures the fidelity of the predicted complex set. Recall quantifies the extent to which a predicted complex set captures the known complexes in the reference set. The F1 measure provides a reasonable combination of both precision and recall, which can be used to evaluate the overall performance.
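The matching-based metrics can be computed directly from sets of protein identifiers. The sketch below assumes the usual neighborhood affinity score, the squared overlap divided by the product of the two complex sizes, and the matching threshold of 0.2 stated above. The function names and the toy complexes are illustrative.

```python
def neighborhood_affinity(p, b):
    """NA(p, b) = |Vp & Vb|^2 / (|Vp| * |Vb|) for two sets of proteins."""
    p, b = set(p), set(b)
    return len(p & b) ** 2 / (len(p) * len(b))

def precision_recall_f1(predicted, reference, omega=0.2):
    """Precision, Recall and F1 based on NA matching with threshold omega."""
    # Predicted complexes that match at least one reference complex.
    n_cp = sum(any(neighborhood_affinity(p, b) >= omega for b in reference)
               for p in predicted)
    # Reference complexes that match at least one predicted complex.
    n_cb = sum(any(neighborhood_affinity(p, b) >= omega for p in predicted)
               for b in reference)
    precision = n_cp / len(predicted)
    recall = n_cb / len(reference)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy complexes (hypothetical protein names).
predicted = [{"a", "b", "c"}, {"x", "y"}]
reference = [{"a", "b", "c", "d"}, {"e", "f", "g"}]
scores = precision_recall_f1(predicted, reference)
```

In this toy case the first predicted complex matches the first reference complex with an affinity of 9/12 = 0.75, and nothing else matches, so precision, recall and F1 are all 0.5.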
The other type of evaluation metric comprises sensitivity, positive predictive value (PPV) and accuracy, which were recently proposed to evaluate the performance of protein complex prediction methods [22]. The definitions of these metrics are described in detail by Xiao et al. [23].
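For readers without access to the cited definitions, the sketch below uses the contingency-table formulation that is common in this literature; treating it as the definition intended by [22,23] is an assumption. Entry T[i][j] counts the proteins shared by reference complex i and predicted complex j, and accuracy is the geometric mean of sensitivity and PPV. The function name `sn_ppv_acc` and the toy complexes are hypothetical.

```python
from math import sqrt

def sn_ppv_acc(predicted, reference):
    """Sensitivity, PPV and accuracy from the overlap table T (assumed form)."""
    # T[i][j]: proteins shared by reference complex i and predicted complex j.
    T = [[len(set(b) & set(p)) for p in predicted] for b in reference]
    # Sensitivity: best coverage of each reference complex, size-weighted.
    sn = sum(max(row) for row in T) / sum(len(set(b)) for b in reference)
    # PPV: best reference match of each predicted complex, overlap-weighted.
    # Assumes at least one predicted complex overlaps the reference set.
    columns = list(zip(*T))
    ppv = sum(max(col) for col in columns) / sum(sum(col) for col in columns)
    return sn, ppv, sqrt(sn * ppv)

# Toy complexes (hypothetical protein names); here T = [[3, 1], [0, 2]].
reference = [{"a", "b", "c", "d"}, {"e", "f"}]
predicted = [{"a", "b", "c"}, {"d", "e", "f"}]
sn, ppv, acc = sn_ppv_acc(predicted, reference)
```

With this table both sensitivity and PPV equal 5/6, so the accuracy is also 5/6.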
As shown in Table 1, the COAN algorithm is sensitive to extend_thres. When extend_thres = 0.1, the precision and recall were only 0.274 and 0.174, respectively. This indicates that too many proteins were added to the seed cliques in the expanding step, because the value of extend_thres was too small. In particular, the size of the largest predicted complex with extend_thres = 0.1 was 118, which is too large for a protein complex. With an increase in extend_thres, the precision and recall improved. When extend_thres = 0.6, the precision and recall were highest, and F1, which is generally used to evaluate overall performance, reached its highest value of 0.461. When extend_thres was increased from 0.6 to 0.9, the precision, recall and F1 all decreased. When extend_thres = 0.9, the size of the largest predicted complex was only 14. This indicates that only the closest proteins were added to the seed cliques in the expanding step, so proteins closely connected to only part of a seed clique may well have been missed.
In general, high sensitivity values indicate that the prediction has good coverage of the proteins in the reference complexes, while high PPV values indicate that the predicted complexes are likely to be true positives [23]. When extend_thres was changed from 0.1 to 0.9, the PPV always increased but sensitivity dropped sharply. This is mainly because, with an increase in extend_thres, the size of the predicted complexes gradually decreases and only the closest proteins can be added to the seed cliques. Therefore, the predicted complexes are more likely to be true positives, or parts of the reference complexes, when extend_thres is larger. Accuracy is defined as the geometric mean of sensitivity and PPV. Similar to F1, accuracy increased when extend_thres was changed from 0.1 to 0.6. However, when extend_thres ranged from 0.6 to 0.9, accuracy did not change appreciably, and was about 0.49.

[Table 3 note: '#Complexes' refers to the number of predicted complexes, and 'Size' refers to the size of the largest predicted complex. extend_thres was set at 0.6 for COAN. The highest score is in bold. doi:10.1371/journal.pone.0062077.t003]

Comparison of COAN with other methods
In this experiment, we compared COAN with the state-of-the-art methods CMC [10], COACH [15], HUNTER [11], MCODE [4] and MCL [3]. The results using the DIP dataset and the Krogan dataset, evaluated with the CYC2008 dataset, are listed in Table 2 and Table 3, respectively.
As shown in Table 2, COAN outperformed the other methods using the DIP dataset. In particular, COAN achieved an F1 of 0.461, which was markedly higher than that of the other methods. Compared to COAN, COACH predicted more complexes, which was beneficial in achieving high recall and sensitivity. In contrast, MCODE only predicted 77 complexes, which resulted in the worst recall of 0.098 and F1 of 0.162. HUNTER predicted 92 complexes and achieved the highest precision of 0.685. MCL and CMC achieved the highest sensitivity of 0.555 and PPV of 0.566, respectively. In addition, we noticed that the size of the largest predicted complex differed greatly between methods. The largest complex predicted by MCL consisted of 498 proteins, which is far beyond the normal size of a protein complex.
Next, we compared the methods using the Krogan dataset. From Table 3, it can be seen that the results using the Krogan dataset were similar to the results using the DIP dataset. COAN predicted 237 complexes, and achieved the best performance in the overall evaluation metrics F1, PPV and accuracy. COACH predicted 345 complexes, and achieved the highest recall of 0.343. HUNTER and MCL achieved the best precision of 0.865 and the best sensitivity of 0.57, respectively. MCODE only predicted 72 complexes, and achieved the worst recall of 0.159.
Overall, COAN predicted many protein complexes using the DIP and Krogan datasets, and outperformed other methods in the major evaluation metrics, F1 and accuracy.
In addition, Figure 5 gives an example of two complexes predicted by COAN on the Krogan dataset. Due to the complex connectivity of PPI networks, it is difficult to predict complexes accurately based only on the topological structure of PPI networks. Once the PPI network is annotated with GO slims, however, it can be seen that some proteins share common GO slim annotations.

Examples of predicted complexes
Examples of predicted complexes using the DIP dataset are presented in Table 4 with the p-values of the three GO domains. In general, a predicted complex is considered to be statistically significant if the p-value is less than 0.01, and a smaller p-value indicates greater biological significance. We used SGD's GO::TermFinder tool [24] to calculate the p-values. From Table 4, it can be seen that some predicted complexes (ID1-ID6) matched the reference complex dataset well and had low p-values. Other predicted complexes (ID7-ID9) did not match the reference dataset. However, they also had high functional homogeneity and local density. Therefore, they may be real protein complexes that are still undiscovered by biologists. These results provide clues for biologists to verify and identify new protein complexes.

Conclusion
In order to exploit GO to predict protein complexes in a PPI network, we have proposed a novel method which constructs an ontology augmented network based on a PPI network and GO annotation information. Ontology augmented networks can efficiently integrate the PPI data and GO into a unified framework through a unified distance measure. Using the ontology augmented network, we developed a clustering algorithm, COAN, to predict protein complexes, which was capable of taking into account the topological structure of the PPI network, as well as the similarity of GO annotations. Experimental comparisons on two yeast PPI datasets showed that our approach was better than or competitive with the state-of-the-art approaches. In particular, our approach provided a framework to integrate other valuable resources, such as gene expression data.
In a complex, the GO annotations may have different importance. Therefore, they may have a different degree of contribution in the unified distance measure. In future work, we plan to explore a self-adjustment mechanism to determine the degree of contribution of different GO annotations. In addition, we will exploit other resources to improve the performance of COAN in protein complex prediction.

Supporting Information
Source Code S1. The source code of COAN. (ZIP)
Author Contributions