Exploring Function Prediction in Protein Interaction Networks via Clustering Methods

Complex networks have recently become the focus of research in many fields. Their structure reveals crucial information about the nodes and about how they connect and share information. In our work we analyze protein interaction networks as complex networks for their functional modular structure and later use that information in the functional annotation of proteins within the network. We propose several graph representations for the protein interaction network, each having a different level of complexity and a different inclusion of the annotation information within the graph. We aim to explore the benefits and the drawbacks of these proposed graphs when they are used in the function prediction process via clustering methods. For this cluster-based prediction, we adopt well-established approaches for cluster detection in complex networks, using the most recent representative algorithms that have proven efficient in the task at hand. The experiments are performed using a purified and reliable Saccharomyces cerevisiae protein interaction network, which is then used to generate the different graph representations. Each of the graph representations is later analysed in combination with each of the clustering algorithms, which have been modified where necessary and implemented to fit the specific graph. We evaluate the results with regard to biological validity and function prediction performance. Our results indicate that the novel ways of presenting the complex graph improve the prediction process, although the computational complexity should be taken into account when deciding on a particular approach.


Introduction
A protein within a cell is rarely the single constituent of the mechanism that performs its function. It has been observed that proteins involved in the same cellular processes often interact with each other [1], making protein-protein interactions (PPI) fundamental to almost all biological processes [2]. A significant amount of data is produced with the advancement of high-throughput technologies. Yeast-two-hybrid, mass spectrometry, and protein chip technologies have allowed the construction of large interaction networks [3], and are now scaled up to produce extensive genome-wide data sets that provide a first glimpse of global interaction networks. However, these rapid improvements come at a price: the vast majority of known proteins have not been experimentally characterized, and their function is still unknown [4]. As has been commonly realized, the acquisition of data is but a preliminary step, and the true challenge lies in developing effective means to analyze such data and endow them with physical and/or functional meaning [5]. This has made computational function prediction one of the most challenging problems of the postgenomic era.
PPI data has the nature of networks. This provides a global view of the context of each protein. There is more information in a protein interaction network (PIN) compared to sequence or structure alone. A protein in a PIN is annotated with one or more functional terms. Multiple and sometimes unrelated annotations can occur due to multiple active binding sites or possibly multiple stable tertiary conformations of a protein. The annotation terms are commonly based on an ontology. A major effort in this direction is the Gene Ontology (GO) project [6]. GO characterizes proteins in three major aspects: molecular function, biological process and cellular localization.
We can now characterize the computational function prediction as the process of understanding the relationship between the protein's interaction context and its functions. Grouping proteins of the PIN into sets (clusters) which show greater similarity among proteins in the same cluster than in different clusters has been shown as an effective approach to accomplish this goal [7]. Since biological functions can be carried out by particular groups of proteins, dividing networks into naturally grouped parts (clusters) is an essential way to investigate some relationships between the function and topology of networks or to reveal hidden knowledge behind them. Typical graph clustering methods often result in a poor clustering arrangement [8] so PINs have been weighted based on topological properties such as shortest path length [9,10] and clustering coefficients [11] in order to achieve an improvement in the clustering results. In [12][13][14][15] the edge-betweenness and its modified version, using weights generated from micro array expression profiles, have been used as a method to find functional modules in the PIN. A method that combines the results of multiple, independent clustering arrangements into a single consensus cluster structure is presented in [16].
PINs have also been analyzed by extracting protein complexes, i.e. finding densely connected subgraphs within the network. Many methods have been proposed to infer such complexes. The Markov Cluster algorithm (MCL) [17] simulates a flow on the graph by calculating successive powers of the associated adjacency matrix. Restricted Neighborhood Search Clustering (RNSC) [18] is a cost-based local search algorithm that explores the solution space to minimize a cost function, calculated according to the numbers of intra-cluster and inter-cluster edges. Super Paramagnetic Clustering (SPC) [19] is a hierarchical clustering algorithm inspired by an analogy with the physical properties of a ferromagnetic model subject to fluctuation at nonzero temperature. Molecular Complex Detection (MCODE) [20] is based on node weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate densely connected regions. Detection of highly connected subgraphs (cliques) combined with Monte Carlo optimization is considered in [21], where the authors distinguish two types of clusters: protein complexes and dynamic functional modules. The highly connected subgraphs algorithm is used in [22] for the discovery of protein complexes, while the authors of [23] use spectral clustering to generate modules, and possible functional relationships among the members of a cluster to predict new protein-protein connections. More recent approaches exploit semantic similarity measures based on GO between pairs of proteins within the PIN. PROCOMOSS [24] uses a multi-objective evolutionary approach in which graphical properties as well as biological properties based on a GO semantic similarity measure are considered as objective functions for detecting protein complexes in a PIN. CSO [25] performs clustering based on network structure and ontology attribute similarity on GO-attributed PINs. Both of these algorithms achieve state-of-the-art performance.
These results are another proof that topological features of the PIN alone are insufficient for proper partitioning of the PIN and the network needs to be augmented.
In this paper we address the problem of function prediction in a twofold manner. First, we propose novel graph representations of the PIN, each having a different level of complexity and a different inclusion of the annotation information within the graph. Second, we select state-of-the-art algorithms for cluster detection that have not yet been used on PINs and we examine their efficiency in detecting clusters within the different graph representations of the PIN as previously defined. Since we are interested in function prediction, the exploration of these methods goes one step further: we establish how efficient the clustering is in terms of accurate cluster-based function prediction, and we establish the benefits and the drawbacks of combining the methods with the different graph representations of the PIN in the functional annotation process. We conclude the paper with a discussion of the recommended approach for predicting a function in the PIN depending on the priorities of the outcome, i.e. what the best experimental setup is if the prediction is done network-wide versus for a single protein (or a small group of proteins), and if the prediction accuracy is of higher importance than its coverage, or vice versa.

Protein-Protein Interaction Data
High-throughput techniques are prone to detecting many false positive interactions, leading to a lot of noise and non-existing interactions in the databases. Furthermore, some of the databases are supplemented with interactions computationally derived with a method for protein interaction prediction, adding additional noise to the databases. Therefore, none of the available databases are perfectly reliable and the choice of a suitable database should be made very carefully.
We conduct our experiments on Saccharomyces cerevisiae PPI data which are compiled from a number of established datasets used in previous research on PPI. Namely, we first merge the PPI datasets of Uetz [26], Ito [27], Ho [28], Krogan [29], and Gavin [30]. We then filter out interactions from the merged dataset based on the amount of supporting evidence found in DIP [31], MIPS [32], MINT [33], BIND [34] and BioGRID [35]. The resulting dataset contains only protein-protein interactions which are supported by more than one piece of experimental evidence. The functional terms for each protein are taken from the SGD database [36], and are unified with the GO terminology. This data is further purified as proposed in [37]. First, the trivial functional terms, like 'unknown molecular function', are erased. Then, additional terms are calculated for each protein by the policy of transitive closure derived from the GO. The extremely frequent terms (appearing as annotations to more than 300 proteins) are also excluded, because they are very general and do not carry significant information. The final dataset is highly reliable and consists of 2502 proteins with 6354 interactions between them, and has a total of 888 functional terms and 31515 protein-term pairs. The average node degree of the resulting protein interaction network is 5.08 and the clustering coefficient is 0.18. Figure 1 shows the degree distribution of the network on a log-log scale.

Protein Interaction Network Representation
As previously stated, PPI data has the properties of a network and therefore can be represented as a graph. We introduce several different graph representations of the PIN, each of which represents the information within the data at a different level. Our first goal is to explore the level of detail that is sufficient for effective clustering of the PIN and function prediction, and to show that the novel augmented representations significantly improve performance. We point out here that all graphs resulting from a PIN are undirected since an interaction itself is undirected. The different representations with ascending level of complexity are defined as follows.
Simple Graphs. The most basic graph representation of a PIN is a simple graph $G_1 = (V, E)$, where nodes $i, j \in V$ correspond to proteins, and edges $(i, j) \in E$ correspond to interactions between proteins $i$ and $j$. The simple graph is unweighted. With this graph we use only the topology of the PIN to determine clusters. For our data we have $|V| = 2502$ and $|E| = 6354$.
Weighted Graphs. The simplest way to enrich the previous representation is to add weights to the edges in $E$ and thus define a weighted graph $G_2 = (V, E, W)$ for the PIN, where $W$ is a matrix whose elements $w_{ij}$ are the weights of the edges $(i, j) \in E$. Weights can be calculated in three different ways [38].
a) Content-based weights: a content-based weight calculation assigns the weight $w^1_{ij}$ to the edge $(i, j)$ by looking at the terms ("content") associated with nodes $i$ and $j$, without taking their environment (the graph structure) into account. If $t_i$ is the set of terms associated with node $i$ and $t_j$ is the set of terms associated with node $j$, $w^1_{ij}$ can be computed using the normalized Jaccard index as follows:
$$w^1_{ij} = \frac{|t_i \cap t_j|}{|t_i \cup t_j|} \qquad (1)$$
b) Structure-based weights: a structure-based weight calculation takes the context of the nodes $i$ and $j$ into account, but not the content of the nodes themselves, when calculating the weight $w^2_{ij}$ for the edge $(i, j)$. In order to calculate $w^2_{ij}$ we need a way to map the context of $i$ and $j$ so that the result contains all the structural information about these nodes. The structural information of the graph $G_2$ is naturally encoded in its adjacency matrix $A = [a_{ij}]$, so we can define the weight matrix $W^2 = [w^2_{ij}]$ as follows:
$$W^2 = W^1 A + A W^1 \qquad (2)$$
where $W^1 = [w^1_{ij}]$ is the content-based weight matrix. Since $a_{ij} = 0$ for all $(i, j) \notin E$, the first part of Eq. 2 gives the sum of all content-based weights between node $i$ and all neighbours of $j$, while the second part is the sum of all content-based weights between node $j$ and all neighbours of $i$. PINs are known to contain proteins that interact with many others, which gives rise to hubs within the graph representing the PIN. Eq. 2 gives high scores to nodes with high degree and low scores to nodes with low degree, so we average the values to overcome this unwanted effect and obtain Eq. 3:
$$W^2 = W^1 A_1 + A_2 W^1 \qquad (3)$$
where $A_1 = \left[ a_{ij} / \sum_{n=1}^{N} a_{nj} \right]$, $A_2 = \left[ a_{ij} / \sum_{n=1}^{N} a_{in} \right]$, and $N = |V|$. Additionally, the weights $w^2_{ij}$ are normalized to be in the same range as $w^1_{ij}$.
c) Hybrid weights: a hybrid weight combines the content-based and the structure-based weights; a natural way of combining them is to take the average of the two:
$$W^3 = \frac{W^1 + W^2}{2} \qquad (4)$$
We note that many other ways of defining $W^1$ and $W^2$ are possible. Multiple definitions of weighting may make sense, and, depending on the task, one may be more suitable than another. We will show how the different weighting schemes influence the results of clustering and function prediction.
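The three weighting schemes can be illustrated on a toy network. The following is a minimal sketch, not the authors' implementation: it assumes the averaged structure-based weights take the form $W^1 A_1 + A_2 W^1$ with $A_1$ and $A_2$ the column- and row-normalized adjacency matrices, it assumes the final normalization is a simple division by the maximum, and the function names and the three-node example data are hypothetical.

```python
import numpy as np

def content_weights(terms):
    """Jaccard index between the term sets of every pair of distinct nodes."""
    n = len(terms)
    w1 = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            union = terms[i] | terms[j]
            if union:
                w1[i, j] = w1[j, i] = len(terms[i] & terms[j]) / len(union)
    return w1

def structure_weights(w1, adj):
    """Averaged structure-based weights, read here as W1 @ A1 + A2 @ W1."""
    adj = np.asarray(adj, dtype=float)
    col = adj.sum(axis=0, keepdims=True)  # column sums, shape (1, n)
    row = adj.sum(axis=1, keepdims=True)  # row sums, shape (n, 1)
    a1 = np.divide(adj, col, out=np.zeros_like(adj), where=col > 0)
    a2 = np.divide(adj, row, out=np.zeros_like(adj), where=row > 0)
    return w1 @ a1 + a2 @ w1

# Toy PIN (hypothetical data): a path 0 - 1 - 2 with small term sets.
terms = [{"a", "b"}, {"b", "c"}, {"a", "b"}]
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])

w1 = content_weights(terms)
w2_raw = structure_weights(w1, adj)
w2 = w2_raw / w2_raw.max()      # assumed rescaling into the range of w1
hybrid = (w1 + w2) / 2 * adj    # keep weights only on the edges in E
```

In this example nodes 0 and 2 have identical term sets, so their content-based similarity is 1, but they do not interact and therefore receive no edge weight in the weighted graph.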
Protein-Term Graphs. We define $G_3 = (V \cup T, E \cup E_t)$ as a protein-term graph in which the terms associated with proteins in the PIN become part of its representation. More specifically, $T$ is the set of all terms present within the PIN and each term $t_i$ is represented as a node in the graph. $E_t$ is the set of edges $(i, t_j)$ where $i \in V$, $t_j \in T$ and term $t_j$ is associated with protein $i$ in the PIN. The additional edges in $E_t$ exist only between protein nodes ($V$) and the new term nodes ($T$); no edges exist between two term nodes, as shown in Figure 2. $V$ and $E$ have the same definition as in the previous representations. The graph is unweighted.
In this way functional relationships between the proteins in the PIN are directly included in the graph representation and therefore in the process of clustering and function prediction. When we create the protein-term graph for our data we have a total of 3390 nodes ($|V| = 2502$, $|T| = 888$) and 37869 edges ($|E| = 6354$, $|E_t| = 31515$).
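As an illustration, the construction of the protein-term graph can be sketched in a few lines. This is a minimal sketch on hypothetical toy data; the function name `protein_term_graph` and the tagging of term nodes as `("term", name)` tuples are choices made here for the example, not part of the original method description.

```python
def protein_term_graph(interactions, annotations):
    """Return (nodes, edges) of the unweighted protein-term graph."""
    proteins = set(annotations)
    for u, v in interactions:
        proteins.update((u, v))
    # Term nodes are tagged so they can never collide with protein names.
    term_nodes = {("term", t) for ts in annotations.values() for t in ts}
    protein_edges = {frozenset((u, v)) for u, v in interactions}
    # Edges only join a protein to one of its terms, never two terms.
    term_edges = {frozenset((p, ("term", t)))
                  for p, ts in annotations.items() for t in ts}
    return proteins | term_nodes, protein_edges | term_edges

# Hypothetical toy data: three proteins, two interactions, three terms.
interactions = [("P1", "P2"), ("P2", "P3")]
annotations = {"P1": {"a", "b"}, "P2": {"b"}, "P3": {"b", "c"}}

nodes, edges = protein_term_graph(interactions, annotations)
print(len(nodes), len(edges))  # |V| + |T| nodes and |E| + |E_t| edges
```

The same bookkeeping applied to the sizes reported for the yeast network gives 2502 + 888 = 3390 nodes and 6354 + 31515 = 37869 edges.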
Full Functional Connected Graphs. The full functional connected (FFC) graphs are defined as $G_4 = (V, E \cup E_f, W_f)$. Let $t_i$ and $t_j$ be the sets of terms associated with nodes $i$ and $j$, respectively; then for an edge $(i, j)$ we have $(i, j) \in E_f$ if and only if $(i, j) \notin E$ and $t_i \cap t_j \neq \emptyset$. $W_f = [w^f_{ij}]$ is the weight matrix. In other words, if two proteins in the PIN share a term, an edge is added between them in the graph even if they do not interact, thus creating "false" interactions. However, the information about the "true" interactions is preserved through the weight matrix. Namely, each edge is assigned a content-based weight, with an additional constant being added to edges representing real interactions. Formally we have:
$$w^f_{ij} = w^1_{ij} + c_{ij} \qquad (5)$$
where
$$c_{ij} = \begin{cases} c, & (i, j) \in E \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
for every $(i, j) \in E \cup E_f$. We take the constant $c$ to be 1 since that is the maximum value of the content-based weight, reached when the two connected nodes have identical terms. This way we ensure that each true interaction weight is larger than (or, in the worst case, equal to) any false interaction weight, while at the same time allowing the content similarity to have at most the same effect as a true interaction. The FFC graph for our PIN has a total of 1086948 edges ($|E| = 6354$, $|E_f| = 1080594$).
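The weighting of the FFC graph can be sketched as follows. This is an illustrative sketch on hypothetical toy data, assuming the content-based weight is the Jaccard index of the two term sets and that the constant added to true interactions is 1; the names `jaccard`, `ffc_weights` and `bonus` are introduced here only for the example.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard index of two term sets (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ffc_weights(interactions, annotations, bonus=1.0):
    """Edge weights of the FFC graph as a dict keyed by frozenset pairs."""
    true_edges = {frozenset(e) for e in interactions}
    weights = {}
    for u, v in combinations(sorted(annotations), 2):
        pair = frozenset((u, v))
        similarity = jaccard(annotations[u], annotations[v])
        if pair in true_edges:
            weights[pair] = similarity + bonus   # real interaction
        elif annotations[u] & annotations[v]:
            weights[pair] = similarity           # "false" interaction
    return weights

# Hypothetical toy data; every protein appears in `annotations`.
annotations = {"P1": {"a", "b"}, "P2": {"b", "c"},
               "P3": {"a", "b"}, "P4": {"d"}}
interactions = [("P1", "P2"), ("P3", "P4")]

weights = ffc_weights(interactions, annotations)
true_w = [weights[frozenset(e)] for e in interactions]
false_w = [w for pair, w in weights.items()
           if pair not in {frozenset(e) for e in interactions}]
```

Here P3 and P4 interact but share no term, so their weight is exactly 1, which equals the weight of the non-interacting pair P1 and P3 with identical terms. This is the "equal in the worst case" situation described above.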

Clustering Algorithms
The modern science of networks has brought significant advances to our understanding of complex systems, with the organization of the vertices in clusters (also referred to as communities) being one of the most relevant features of the graphs representing such systems. The problem of detecting clusters is very hard and not yet satisfactorily solved, and is the focus of a large interdisciplinary scientific community [39]. PINs are complex networks, and as such communities (corresponding to functional modules and complexes) emerge in their graph representations [10]. In our work we focus on the most recently developed methods for cluster detection in graphs, which have been classified as the most efficient [40]. These algorithms were initially employed in detecting community structure in different real-life networks and to our knowledge have not yet been used in clustering PINs. Taking this into account, our motivation and goal is to explore how these state-of-the-art algorithms perform when used on a PIN, and further to explore how the combination with the different PIN representations affects the function prediction performance.
Modularity Function Algorithms. One of the biggest breakthroughs in cluster detection was the Girvan and Newman modularity function [41]. They propose an equation that measures the quality of a given clustering compared to a corresponding random graph, in which the edges are randomized while preserving the degree of each node. The modularity function is defined as:
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (7)$$
The term $A_{ij}$ has a different meaning for different graph representations. When we work with unweighted graphs ($G_1$, $G_3$) the term is the corresponding element of the adjacency matrix ($A_{ij} = a_{ij}$), while in weighted graphs ($G_2$, $G_4$) the term is the corresponding element of the weight matrix ($A_{ij} = w_{ij}$), since these graphs are a simple generalization [42]. The terms $k_i$ and $m$ are defined as $k_i = \sum_j A_{ij}$ and $m = \frac{1}{2} \sum_{ij} A_{ij}$, and in the case of unweighted graphs correspond to the node degree and the total number of edges, respectively. The probability of an edge existing between nodes $i$ and $j$ if connections are made at random but respecting node degrees is $k_i k_j / 2m$, $c_i$ denotes the cluster to which node $i$ is assigned, and $\delta(c_i, c_j)$ is 1 if $c_i = c_j$ and 0 otherwise. This function gives the difference between the fraction of edges that fall within the clusters and the fraction expected if edges were distributed at random. A positive value (the maximum is 1) means that the number of edges within the clusters is greater than the number expected at random, i.e. the clusters are well defined, whereas values between 0 and -1 mean that the analysed edges do not form good clusters.
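To make the quantities concrete, the modularity of a given partition can be computed directly from the matrix $A$. The sketch below assumes the standard form $Q = \frac{1}{2m}\sum_{ij}[A_{ij} - k_i k_j / 2m]\,\delta(c_i, c_j)$; the function name and the toy graph (two triangles joined by one edge) are hypothetical and serve only as an illustration.

```python
import numpy as np

def modularity(a, labels):
    """Modularity of the partition `labels` for adjacency/weight matrix `a`."""
    a = np.asarray(a, dtype=float)
    k = a.sum(axis=1)                       # degrees (or strengths)
    two_m = a.sum()                         # 2m: every edge is counted twice
    same = np.equal.outer(labels, labels)   # delta(c_i, c_j)
    return float(((a - np.outer(k, k) / two_m) * same).sum() / two_m)

# Toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge 2 - 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
a = np.zeros((6, 6))
for u, v in edges:
    a[u, v] = a[v, u] = 1

q_two_clusters = modularity(a, [0, 0, 0, 1, 1, 1])
q_one_cluster = modularity(a, [0, 0, 0, 0, 0, 0])
```

Splitting the graph into its two triangles gives $Q = 5/14 \approx 0.357$, while placing all nodes in a single cluster gives $Q = 0$. Passing a weight matrix instead of a 0/1 adjacency matrix yields the weighted generalization without changing the code.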
The "Fast Community" (FC) [43] community structure inference algorithm is based on a greedy technique that maximizes the Girvan and Newman modularity function. The algorithm uses a hierarchical agglomerative method in which, at the beginning, each node represents one cluster. Nodes, and later clusters, are merged so as to maximize the modularity, exploring the full topology of the graph. The novelty of this algorithm is its use of efficient data structures for sparse matrices, such as max-heaps, which make it much faster and suitable for the analysis of large graphs. The BGLL algorithm [44] uses a different greedy technique, in which supervertices represent the communities and are used to calculate the modularity. At the start all nodes are in different clusters, but as each node chooses a new cluster the clusters are replaced with supervertices. Two supervertices are connected if there exists an edge between any two nodes from the two supervertices. Again, at each step the modularity is calculated from the initial topology. This algorithm is better at finding the maximum modularity than the algorithm of Clauset et al. [43], but it is limited by its storage demands.
Multi-Resolution Algorithms. Recently it has been shown that modularity optimization may fail to identify clusters smaller than a scale which depends on the total number N of links of the network and on the degree of interconnectedness of the clusters, even in cases where clusters are unambiguously defined. This characterizes these methods with a so-called resolution limit [45]. A new class of methods that deals with this problem is based on multiscale quality functions. These quality functions incorporate a resolution parameter that allows tuning of the characteristic size of the clusters in the optimal partition, and aim at uncovering modules at the true scale of organization of a network, i.e. not at a scale imposed by modularity optimization. The publication of Lambiotte [46] gives a good overview of the existing multi-resolution quality functions and also presents a new method that tries to unify them by looking into the dynamics of the partitioning problem.
The key idea is to measure the quality in terms of the stability of the modules with respect to a stationary Markov process, modeled as a random walk. The resulting quality function for detecting modules on multiple scales is defined as follows:
$$Q_t = (1 - t) + \frac{1}{2m} \sum_{ij} \left[ t A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) \qquad (8)$$
where $t$ represents the time parameter of the random walk. This quality function is equivalent to the Hamiltonian introduced by Reichardt and Bornholdt [47], and is the same as the modularity function (7) when the time parameter $t$ is equal to 1. The algorithm implementation suggested in [46] and [48] uses the same greedy technique for modularity maximization as in [44]. We performed experiments for the time parameter ranging from 1 to 10 (as suggested in [48]) and we obtained the best results when the parameter equals 5. We refer to this algorithm with the time parameter set to 5 as TimeBGLL.
Edge Clustering Algorithms. Partitioning the nodes of a graph has the disadvantage of being incompatible with the existence of overlapping clusters, i.e. situations where nodes belong to several clusters. This overlap is known to be present at the interface between clusters, but can also be pervasive in the whole graph [49]. In these situations a partition of the nodes is questionable, as it imposes undesired constraints on the cluster detection problem. Since edges in the graphs representing the PINs often correspond to one particular type of interaction in the PIN, they typically belong to one single cluster. Therefore we define clusters as partitions of edges rather than of nodes. The edges incident on a single node may belong to several partitions, and in this sense nodes can be members of several clusters.
We adopt the method proposed in [50] since it naturally fits the problem at hand, and can also deal with weighted graphs as described in [51]. Without loss of generality we can assume the definition $G_1 = (V, E)$ for an unweighted node graph. The method first transforms $G_1$ into an unweighted line graph $L_1(G_1)$ and then uses random walk dynamics to measure the quality function. In principle, any node clustering algorithm can be used. However, since optimisation of modularity is related to the behaviour of random walkers on a graph and the construction of $L_1(G_1)$ preserves the dynamics of random walkers, it makes sense to apply the modularity optimisation approach to find the partitions of the line graph $L_1(G_1)$. We use the modularity maximization algorithm proposed in [44].
The conversion of the graph from node to line form is done as follows. First, the node graph is represented using the incidence matrix $B$ of size $|V| \times |E|$, where $B_{ia}$ is equal to 1 if edge $a$ is incident on node $i$ and 0 otherwise. The matrix $B$ can be seen as the adjacency matrix of a bipartite network. The line graph is constructed by projecting the bipartite graph, taking all nodes of one type as the nodes of the projected graph. A link is added between two nodes in the projected graph if they have at least one node of the other type in common in the original bipartite graph, resulting in the adjacency matrix $C$ of size $|E| \times |E|$ of the line graph $L_1(G_1)$, with elements defined by:
$$C_{ab} = \sum_{i} B_{ia} B_{ib} (1 - \delta_{ab}) \qquad (9)$$
where $\delta_{ab}$ is the Kronecker delta symbol. When the adjacency matrix is calculated as in Eq. 9, nodes with high degree (hubs) are given too much prominence in the line graph, so the normalized form of $C_{ab}$ from [50] is used to avoid this effect; it scales the contribution of each node $i$ according to its degree $k_i$. When we work with weighted node graphs, $G_2 = (V, E, W)$, a second, weighted incidence matrix $\tilde{B}$ is introduced, where $\tilde{B}_{ja} = w_a$ if edge $a$ is incident on vertex $j$ and has weight $w_a$. Each node $i$ has a strength $s_i$, defined as the sum of the weights of all its incident edges. As in the unweighted case, a normalized adjacency matrix is computed for the weighted line graph $L_2(G_2)$, following [51]. The node-to-line graph transformation is illustrated in Figure 3.
Random Walks and Maps Algorithms. The ability of random walks to generate dynamics and represent information flow in the network makes them suitable for use in the clustering problem. Rosvall and Bergstrom use the probability flow of random walks on a graph to build an efficient and accurate clustering method (Infomap) [52]. This algorithm additionally uses Huffman coding to describe the path of the walker on the network, which also allows the maps to be compressed and speeds up module detection. This coding retains unique names for the important structures formed during the random walks. The random walk equation used for undirected graphs is as follows:
$$X(t + 1) = (1 - r) A X(t) + r S$$
where, in the case of unweighted graphs ($G_1$, $G_3$), $A$ is the normalized adjacency matrix, while in the case of weighted graphs ($G_2$, $G_4$), $A$ is the weight matrix $W$; $r$ is the teleportation or restart probability, $X(t)$ is the probability vector for the random walker visiting a node at time $t$, and $S$ is the starting probability vector (usually $S$ is all zeros except for the start node, whose value is 1). Initially, $X(0) = S$.
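The random walk with restart that underlies this family of methods can be illustrated with a short iteration. This is only a sketch of the walk itself, not of Infomap: it assumes the update $X(t+1) = (1 - r) A X(t) + r S$, it assumes a column normalization of the matrix so that $X$ stays a probability vector, and the function name, the default restart probability and the toy path graph are hypothetical.

```python
import numpy as np

def random_walk_with_restart(adj, start, r=0.15, tol=1e-10, max_iter=1000):
    """Iterate X(t+1) = (1 - r) * P @ X(t) + r * S until convergence."""
    adj = np.asarray(adj, dtype=float)
    col = adj.sum(axis=0, keepdims=True)
    # Column-stochastic transition matrix (assumed normalization).
    p = np.divide(adj, col, out=np.zeros_like(adj), where=col > 0)
    s = np.zeros(adj.shape[0])
    s[start] = 1.0                 # all probability starts on one node
    x = s.copy()                   # X(0) = S
    for _ in range(max_iter):
        x_next = (1 - r) * (p @ x) + r * s
        if np.abs(x_next - x).sum() < tol:
            return x_next
        x = x_next
    return x

# Toy path graph 0 - 1 - 2, walker restarted at node 0.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
x = random_walk_with_restart(adj, start=0)
```

On this symmetric path the restart breaks the symmetry: the stationary probability of node 0 exceeds that of node 2 by exactly the restart probability $r$.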

Functional Annotation
There are a few different methods in the literature for assigning terms to a query protein after the clusters are determined. Each of the methods is based on calculating a score for each term associated with a node that belongs to the same cluster as the query node, and assigning to the query protein those terms whose score is above or below a predefined threshold, depending on the type of score being used. In our work we tested the hypergeometric enrichment P-value, the chi-square statistic and the term frequency within the cluster as scores for predicting terms.
The hypergeometric enrichment P-value for term $t$ is calculated with:
$$P = \sum_{i = n_t}^{\min(C, T)} \frac{\binom{T}{i} \binom{N - T}{C - i}}{\binom{N}{C}}$$
where $N$ is the number of nodes in the graph representing the PIN, $T$ is the number of nodes in the graph that have term $t$ assigned to them, $C$ is the cluster size and $n_t$ is the number of nodes in the cluster that have term $t$ assigned to them. The terms enriched within the cluster (i.e. obtaining a P-value below some threshold) are then predicted for the query node.
The chi-square statistic score for term $t$ is defined with:
$$\chi^2_t = \frac{(n_t - e_t)^2}{e_t}$$
where $n_t$ has the same meaning as in the previous score and $e_t$ is the expected number of nodes in the cluster that have term $t$ assigned to them. The expected number is calculated using the simple proportion $e_t = (T / N) C$, with $T$, $N$, and $C$ having the same meaning as in the previous score.
The simplest and most intuitive score is to rank each term by its frequency of appearance as a term assigned to nodes within the cluster. This approach is derived from the well-known Majority Algorithm used in [53], where a node is assigned the most frequent terms occurring in its neighbours. Our definition expands the node neighbourhood from the direct neighbours to all nodes in the cluster $K$ to which the node belongs:
$$f(t_j) = \sum_{i = 1}^{|K|} Z_{ij}, \quad t_j \in T_K$$
where $T_K$ is the set of terms present in the cluster $K$, and
$$Z_{ij} = \begin{cases} 1, & \text{if the } i\text{-th protein from } K \text{ is annotated with the } j\text{-th term from } T_K \\ 0, & \text{otherwise} \end{cases}$$
We need to note here that when we work with the graph representation $G_3$, i.e. the protein-term graph, the definitions of some quantities used in the score calculations need to be altered. Namely, we say that a term $t$ is present in a cluster if the corresponding term node $t$ belongs to the cluster. The total number of nodes in the graph corresponds to the total number of protein nodes, the size of the cluster corresponds to the number of protein nodes in the cluster, the number of nodes in the graph with term $t$ assigned to them corresponds to the degree of term node $t$, and the number of nodes in a cluster with term $t$ assigned to them corresponds to the number of edges between term node $t$ and protein nodes belonging to the cluster. For the frequency score, $T_K$ is now a set of term nodes and $Z_{ij}$ is defined with:
$$Z_{ij} = \begin{cases} 1, & \text{if the } i\text{-th protein node from } K \text{ has an edge to the } j\text{-th term node from } T_K \\ 0, & \text{otherwise} \end{cases}$$
Our experiments showed that the frequency-based score for function prediction outperforms the other two scores for every combination of graph representation and clustering algorithm, so for simplicity all the results presented are based on this approach.
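The three scores can be compared on a small example. The following sketch assumes the hypergeometric P-value is the upper tail probability of observing at least $n_t$ annotated nodes in a cluster of size $C$, and the chi-square score is $(n_t - e_t)^2 / e_t$ with $e_t = (T/N)C$; the function names and the toy numbers are hypothetical.

```python
from collections import Counter
from math import comb

def hypergeom_pvalue(n_nodes, n_with_term, cluster_size, n_in_cluster):
    """P(at least n_in_cluster annotated nodes in a random cluster)."""
    upper = min(cluster_size, n_with_term)
    tail = sum(comb(n_with_term, i)
               * comb(n_nodes - n_with_term, cluster_size - i)
               for i in range(n_in_cluster, upper + 1))
    return tail / comb(n_nodes, cluster_size)

def chi_square(n_nodes, n_with_term, cluster_size, n_in_cluster):
    """Chi-square score with expected count e_t = (T / N) * C."""
    expected = n_with_term / n_nodes * cluster_size
    return (n_in_cluster - expected) ** 2 / expected

def term_frequencies(cluster, annotations):
    """Number of proteins in the cluster annotated with each term."""
    return Counter(t for p in cluster for t in annotations.get(p, ()))

# Toy numbers: N = 10 nodes, T = 4 carry the term, cluster of C = 3,
# and all n_t = 3 cluster members carry the term.
p_value = hypergeom_pvalue(10, 4, 3, 3)
chi2 = chi_square(10, 4, 3, 3)

annotations = {"P1": {"a", "b"}, "P2": {"b"}, "P3": {"b", "c"}}
freq = term_frequencies(["P1", "P2", "P3"], annotations)
ranked_terms = [t for t, _ in freq.most_common()]
```

A term is predicted when its P-value is below a threshold, or when its chi-square or frequency score is above one, which is why the direction of the threshold depends on the score type.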

Results and Discussion
We tested representative algorithms from each of the previously described classes of clustering algorithms: FC [43], BGLL [44], TimeBGLL [48], EdgeCluster [50,51], and Infomap [52]. We first evaluated the clustering validity of the different algorithms. Each of these algorithms was then used to determine clusters in each of the different graph representations of our Saccharomyces cerevisiae PIN. We evaluated the clustering results in terms of functional validity and also in terms of accuracy when used in function prediction.
Before we proceed to the results and the discussion of the main focus of this paper, i.e. function prediction via clustering methods, we give a summary of the computational complexity of our experiments. Although resources are vast nowadays, complexity should not be ignored when deciding upon an experimental setup. Table 1 summarizes the sizes of the proposed graph representations of our PIN, which determine the expected runtime of the clustering algorithms; the computational complexity of the algorithms is given in Table 2. As can be seen, BGLL, TimeBGLL, EdgeCluster and Infomap have an essentially linear runtime, proportional to the number of edges within the graph, while FC runs in quasilinear time, proportional to the number of nodes within the graph, and thus still runs faster than any polynomial with exponent strictly greater than 1.

Clustering Validation
Clustering validation was performed using a synthetic benchmark graph as given in [54] in order to compare the different clustering methods used in our work. The synthetic graph was modeled with the parameters of the simple graph representation of our PIN. Since the aim of this experiment is to determine the clustering power of our chosen algorithms and to compare them among themselves and with other algorithms used in previous research, the graph representation is of no significance and any one can be used. The resulting clusters were compared with the a priori known clusters using the Normalized Mutual Information (NMI) method proposed in [55]. It is based on defining a confusion matrix $\mathbf{M}$, where the rows correspond to the "real" clusters, and the columns correspond to the "found" clusters. The element $M_{ij}$ of $\mathbf{M}$ is the number of nodes in the real cluster $i$ that appear in the found cluster $j$. A measure of similarity between the clusters, based on information theory, is then:
$$I(A, B) = \frac{-2 \sum_{i=1}^{C_A} \sum_{j=1}^{C_B} M_{ij} \log \left( \frac{M_{ij} M}{M_{i \cdot} M_{\cdot j}} \right)}{\sum_{i=1}^{C_A} M_{i \cdot} \log \left( \frac{M_{i \cdot}}{M} \right) + \sum_{j=1}^{C_B} M_{\cdot j} \log \left( \frac{M_{\cdot j}}{M} \right)}$$
where the number of real clusters is denoted $C_A$ and the number of found clusters is denoted $C_B$, the sum over row $i$ of matrix $\mathbf{M}$ is denoted $M_{i \cdot}$, the sum over column $j$ is denoted $M_{\cdot j}$ and the total number of nodes is $M$. The normalized mutual information equals 1 if the clusters are identical and 0 if they are totally independent. The definition of the measure when the clusters are overlapping (EdgeCluster) is given in detail in the appendix of [56]. Table 3 shows the resulting values of the NMI score calculated as previously explained. These results justify the selection of the representative clustering algorithms in this paper, as they outperform the algorithms previously used for clustering PINs based on the topological features of the network that were cited in the introduction, i.e. MCL, RNSC, SPC, and MCODE. Later experiments show that the performance "ranking" on function prediction more or less follows the one given in Table 3.
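For non-overlapping clusterings the NMI score can be computed directly from the confusion matrix. The sketch below assumes the standard confusion-matrix form of the measure, with the numerator $-2\sum_{ij} M_{ij}\log(M_{ij}M / (M_{i\cdot}M_{\cdot j}))$ and the row and column entropy terms in the denominator; the function name and the toy labelings are hypothetical, and the overlapping variant used for EdgeCluster is not covered.

```python
import numpy as np

def normalized_mutual_information(real, found):
    """NMI between two hard clusterings given as label sequences."""
    _, r = np.unique(np.asarray(real), return_inverse=True)
    _, f = np.unique(np.asarray(found), return_inverse=True)
    m = np.zeros((r.max() + 1, f.max() + 1))
    np.add.at(m, (r, f), 1)                 # confusion matrix M
    total = m.sum()
    rows, cols = m.sum(axis=1), m.sum(axis=0)
    expected = np.outer(rows, cols)
    nz = m > 0                              # 0 * log(0) is taken as 0
    numerator = -2.0 * (m[nz] * np.log(m[nz] * total / expected[nz])).sum()
    denominator = ((rows * np.log(rows / total)).sum()
                   + (cols * np.log(cols / total)).sum())
    return float(numerator / denominator)

# Same partition under different label names, and an independent one.
nmi_identical = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
nmi_independent = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

The score depends only on how the nodes are grouped, not on the cluster labels, so relabeled identical partitions give 1. The denominator vanishes when both clusterings consist of a single cluster, a case this sketch does not handle.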

Biological Validity of the Clusters
We use many different clustering algorithms that produce clusters differing in size and structure, and we evaluate the biological relevance of these clusters; in other words, we test whether the cluster structure could have arisen by chance. If a cluster is biologically relevant, the genes belonging to the same cluster should have similar biological functions [8]. Therefore the functional homogeneity of a cluster is an indicator of its biological validity. Most of the methods for calculating a cluster's functional homogeneity include some form of the P-value measure. In [21] a modified P-value, which combines computationally derived clusters with "real" complexes derived from the protein databases, is used. It is computed from N, the total number of nodes in the network, $n_1$ and $n_2$, the sizes of the two complexes (the derived and the real one), and k, the number of nodes they have in common. This measure is effective when evaluating a single clustering algorithm, but for two or more algorithms the evaluation is time consuming, as it requires extraction of the corresponding real complexes for each computed cluster. A more efficient way of testing functional homogeneity is through functional entropy. The entropy is calculated as the negative sum, over all function terms in the cluster, of the appearance frequency of each term multiplied by the logarithm of that frequency [57]:
$$F_i = \frac{T_i}{\sum_{j=1}^{n} T_j}, \qquad \mathrm{Entropy} = -\sum_{i=1}^{n} F_i \log F_i$$
where $F_i$ is the appearance frequency of term i, $T_i$ is the number of times that term appears in the cluster, and n is the number of distinct terms present in the cluster. If the nodes in the same cluster have consistent terms, the value of the functional entropy will be low, being zero when all nodes share a single term. We performed the biological validation of our clustering algorithms using entropy. We retained only clusters with more than 2 nodes, and for each combination of graph representation and clustering algorithm we calculated the average entropy over all clusters.
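Both homogeneity measures can be illustrated with a short sketch. The exact form of the modified P-value of [21] is not reproduced here; the function `overlap_p_value` assumes a hypergeometric tail probability over N, n1, n2 and k, which is a common choice for this kind of overlap test and should be read as an assumption. The entropy functions assume that the frequency of a term is its count divided by the total count of all terms in the cluster and that the natural logarithm is used; all function names and input formats are hypothetical.

```python
import math
from collections import Counter


def overlap_p_value(N, n1, n2, k):
    """Probability of an overlap of at least k nodes between two complexes of
    sizes n1 and n2 drawn from N nodes (hypergeometric tail, assumed form)."""
    upper = min(n1, n2)
    favourable = sum(
        math.comb(n1, i) * math.comb(N - n1, n2 - i) for i in range(k, upper + 1)
    )
    return favourable / math.comb(N, n2)


def functional_entropy(cluster_annotations):
    """Entropy of the term frequencies in one cluster.

    cluster_annotations: iterable with one collection of terms per protein.
    """
    counts = Counter(t for terms in cluster_annotations for t in terms)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no annotated protein in the cluster
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def average_cluster_entropy(clusters, annotations, min_size=3):
    """Average entropy over clusters with more than 2 nodes (min_size=3)."""
    kept = [c for c in clusters if len(c) >= min_size]
    if not kept:
        return None  # nothing to average
    return sum(
        functional_entropy(annotations.get(p, ()) for p in c) for c in kept
    ) / len(kept)
```

A cluster in which every protein carries the same single term has entropy zero, while a cluster whose terms are spread evenly over many functions has the maximum entropy, the logarithm of the number of distinct terms.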
The calculated entropy values are shown in Table 4. By the definition of the entropy measure, lower values indicate an algorithm that is more stringent at identifying functionally coherent clusters. A second and more interesting aspect of the entropy in relation to our research is the correlation between the entropy values and the results of the functional annotation of proteins using the clustering algorithms. Namely, the lower the entropy of an algorithm, the smaller the coverage of its average cluster. The coverage of a cluster is defined here as the ratio between the number of terms present in the cluster and the number of terms present in the whole network. Lower coverage clusters lead to fewer mistakes being made during the term assignment process, but on the downside these clusters may lack the terms needed for a correct and complete annotation of a query protein. In terms of the definitions used for the annotation validation, this means that lower entropy values yield fewer False Positives (FPs) but more False Negatives (FNs). The inverse holds for higher entropy values.
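The coverage of a cluster as defined above reduces to a ratio of two set sizes. The following sketch assumes that annotations are available as a dictionary from protein to a set of terms; the function name and the input format are hypothetical.

```python
def cluster_coverage(cluster, annotations):
    """Ratio of distinct terms in the cluster to distinct terms in the network.

    cluster: iterable of protein identifiers.
    annotations: dict mapping every protein in the network to a set of terms.
    """
    network_terms = set().union(*annotations.values())
    if not network_terms:
        return 0.0  # network without annotations
    cluster_terms = set().union(*(annotations.get(p, set()) for p in cluster))
    return len(cluster_terms) / len(network_terms)
```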

Annotation Validation
The effective evaluation of protein functional annotation is challenging. The lack of agreed measures and benchmarks for assessing the performance of the methods makes this task difficult. In our work we used the leave-one-out method, in which only one protein at a time plays the role of the query protein. In the leave-one-out method a random annotated protein is selected and is treated as unannotated. The assumption that the query protein has no terms affects the different representations in different ways. For the unweighted representations no additional changes have to be made, while weighted graphs have to be altered, since the weights can no longer be computed as defined by the corresponding equations. Specifically, if the representation uses the content based weight, its value is substituted with the structure based weight and everything else remains the same. For the Protein-Term representation ($G_3$), the unannotated query protein assumption means that all edges from the query protein to term nodes are deleted.
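For the Protein-Term representation, hiding the annotations of the query protein amounts to removing its edges to term nodes. A minimal sketch is given below, assuming the graph is stored as a symmetric adjacency dictionary of sets and that the term nodes are known as a set; both the data layout and the function name are assumptions made for illustration.

```python
def mask_query_protein(adjacency, term_nodes, query):
    """Return a copy of the graph in which `query` has no edges to term nodes.

    adjacency: dict mapping each node to the set of its neighbours (undirected,
               so every edge is stored in both directions).
    term_nodes: set of nodes that represent annotation terms.
    """
    masked = {node: set(neighbours) for node, neighbours in adjacency.items()}
    for term in adjacency[query] & term_nodes:
        masked[query].discard(term)
        masked[term].discard(query)
    return masked
```

The original graph is left untouched, so the same structure can be reused for the next query protein.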
Once the clustering algorithm has been applied, for each term present in the query cluster (i.e. the cluster of the query protein) we calculate its rank according to Eq. (15), and all ranks are then normalized to the range between 0 and 1. Note that when the unannotated query protein assumption causes changes within the graph representation, the clustering algorithm has to be rerun for each query protein. The query protein is annotated with all functions that have a rank above a previously determined threshold v. For example, for v = 0, the query protein is assigned all the functions present in its cluster. We vary the threshold over the [0,1] range and count the annotations in each of the four classes that can occur during the assignment process: true positives (TP), terms assigned to the query protein that it is actually annotated with; false positives (FP), terms assigned to the query protein that it is not annotated with; true negatives (TN), terms not assigned to the query protein that it is not annotated with; and false negatives (FN), terms not assigned to the query protein that it is actually annotated with. Each annotation is assigned to one of the four classes. Using the number of annotations in each class (given in brackets above) we can calculate the following statistical measures:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$
$$\mathrm{FalsePositiveRate} = \frac{FP}{FP + TN} \qquad (22)$$
Graphed as coordinate pairs, the Sensitivity and the FalsePositiveRate form the Receiver Operating Characteristic curve (or ROC curve). The ROC curve describes the performance of a model across the entire range of classification thresholds. The Area Under Curve (AUC) of a classifier is equivalent to the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance [58]. We performed functional annotation for each combination of a clustering algorithm and a graph representation of our Saccharomyces cerevisiae PIN. Figures 4-8 show the ROC curves and the AUC values for each graph representation for Infomap, timeBGLL, edgeCluster, BGLL and FC, respectively. Tables 5-9 show the sensitivity and false positive rate at threshold values from v = 0 to v = 0.9 with a step of 0.1.
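The thresholding step and the derived measures can be sketched as follows for a single query protein. The sketch uses the standard definitions sensitivity = TP/(TP+FN) and false positive rate = FP/(FP+TN), and it assumes that the ranks are already normalized, that a term is assigned when its rank is greater than or equal to the threshold (so that v = 0 assigns every term of the cluster), and that the AUC is approximated with the trapezoidal rule. All function names are hypothetical, and how the counts are pooled over query proteins is not shown.

```python
def confusion_counts(ranks, true_terms, all_terms, threshold):
    """Count TP, FP, TN, FN for one query protein at one threshold.

    ranks: dict term -> normalized rank in [0, 1] for terms in the query cluster.
    true_terms: set of terms the query protein is actually annotated with.
    all_terms: set of all terms present in the network.
    """
    predicted = {t for t, r in ranks.items() if r >= threshold}
    tp = len(predicted & true_terms)
    fp = len(predicted - true_terms)
    fn = len(true_terms - predicted)
    tn = len(all_terms - predicted - true_terms)
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (sensitivity, false positive rate); 0.0 when undefined."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return sensitivity, fpr


def roc_auc(points):
    """Trapezoidal area under a ROC curve given (fpr, sensitivity) points.

    The end points (0, 0) and (1, 1) are added automatically.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0 for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


# Thresholds from 0 to 0.9 with a step of 0.1, as in the tables.
thresholds = [i / 10 for i in range(10)]
```

Evaluating `confusion_counts` and `rates` at each value in `thresholds` gives the points of the ROC curve, which `roc_auc` then integrates.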
The results shown in Figures 4-8 and Tables 5-9 confirm what we previously stated about the influence of the entropy value. As expected, the more complex representations ($G_3$, the Protein-Term graph, and $G_4$, the FFC graph) have higher entropy values, which implicitly increases the sensitivity and fpr values (by increasing the FPs and decreasing the FNs). The opposite holds for the simpler representations ($G_1$, the Simple graph, and $G_2$, the Weighted graph).
If we average the AUC values for a single algorithm over all graph representations (Table 10), the top ranking algorithm is edge clustering with $\mathrm{AvgAUC}_{\text{edgeCluster}} = 0.9065$, followed by $\mathrm{AvgAUC}_{\text{Infomap}} = 0.8963$, $\mathrm{AvgAUC}_{\text{timeBGLL}} = 0.8913$, $\mathrm{AvgAUC}_{\text{BGLL}} = 0.8864$, and $\mathrm{AvgAUC}_{\text{FC}} = 0.8831$. Since edge clustering produces overlapping clusters, this result is in line with the well known fact that protein interaction networks contain many multifunctional proteins, which perform several functions and are expected to interact specifically with distinct sets of partners, simultaneously or not, depending on the function performed. Looking at Tables 5-9 in more detail gives a better perspective on the quality of the annotation process based on each of the clustering algorithms. Table 11 shows the corresponding sensitivity and false positive rate values for each of the algorithms combined with each of the representations at a fixed threshold v = 0. These values are a general indicator of the behaviour of the corresponding annotation process. The EdgeCluster algorithm shows a much greater false positive rate than the next in line (according to AvgAUC), Infomap. In fact, Infomap has the overall lowest false positive rates with any graph representation. This means that Infomap performs a very stringent clustering of the PIN, which results in clusters that are poor in terms of function (term) diversity and therefore miss part of the functions (terms) that should be associated with a query protein. This leads to a very precise but incomplete view of the annotation set of the query protein. On the other hand, EdgeCluster, timeBGLL, BGLL, and FC achieve much higher sensitivity at the price of a high false positive rate, which means that their view of the annotation set is much richer but noisier than that of Infomap. These results can be attributed to the fact that Infomap generates approximately 2.5 times as many clusters as each of the other algorithms (which all produce similar numbers of clusters).
Table 5. Values for the sensitivity (sens.) and the false positive rate (fpr), for the functional annotation for each graph representation using Infomap, at different threshold values (v).
Table 6. Values for the sensitivity (sens.) and the false positive rate (fpr), for the functional annotation for each graph representation using timeBGLL, at different threshold values (v).
Table 7. Values for the sensitivity (sens.) and the false positive rate (fpr), for the functional annotation for each graph representation using edgeCluster, at different threshold values (v).
Table 8. Values for the sensitivity (sens.) and the false positive rate (fpr), for the functional annotation for each graph representation using BGLL, at different threshold values (v).
The performance of the algorithms on the different graph representations proposed in this research is consistent across all the experiments, as can be seen in Table 10. As expected, the simple graph representation ($G_1$) has the lowest AUC values for all clustering approaches. The hybrid weighting scheme ($G_2$) outperforms both the separate content weighting and the separate structure weighting, with structure being more informative than content. The rise in performance noted when using the FFC graph representation ($G_4$) suggests that the actual PIN is lacking part of the real interactions that occur between pairs of proteins. Finally, the Protein-Term representation ($G_3$) yields the best results in terms of AUC, but both $G_3$ and $G_4$ have the noisy annotation problem stated before (even for the usually low noise Infomap algorithm). In terms of complexity, it is clear from Tables 1 and 2 that the $G_3$ and $G_4$ representations are more complex, and this computational complexity should be taken into account when deciding on the appropriate representation for a PIN. In addition, a network wide annotation would be very impractical with $G_2$, $G_3$, or $G_4$, since the clustering algorithm needs to be run for every query protein.
On the other hand a scenario in which a wider set of possible annotations needs to be determined for a single (or a few) protein(s) would greatly benefit from these augmented PIN graph representations.
In summary, and considering the goals we defined, our results show that all of the proposed novel representations yield a significant improvement in function prediction performance over the simple unweighted graph representation. The Protein-Term graph representation is the most informative one, and if computational resources are not scarce it is the representation that should be used for the prediction. The next in line is the FFC graph representation, followed by the hybrid weighted graph representation. The added value of these two representations is the ease with which they can be further augmented (for example with similarity metrics based on GO instead of a simple Jaccard index) to maximize the annotation prediction performance. All of the clustering algorithms used in this paper perform very well on the PIN, as was shown in the clustering validation section, with Infomap being the best in that context. In terms of using these clustering algorithms for function prediction, the most accurate one is the Infomap algorithm, while edgeCluster and timeBGLL have the highest coverage.
As a final note we point out another potential problem in the process of function prediction using clustering, namely completeness. It has been estimated that the complete S. cerevisiae network has between 37800 and 75500 protein interactions [59]. Currently there are between 55000 and 60000 interactions contained in publicly available repositories for S. cerevisiae, which means there are potentially unknown regions of the network, which can explain the high false positive rates and low sensitivity stated before.
Table 9. Values for the sensitivity (sens.) and the false positive rate (fpr), for the functional annotation for each graph representation using FC, at different threshold values (v).

Conclusions
Complex protein interaction networks reveal graph properties that can be analysed in terms of functional modules associated with the biological function they perform. In our work we investigated the power of novel algorithms for complex network clustering combined with novel graph representations of the protein interaction networks, and assessed their potential for protein function prediction via clustering. We show that using these algorithms we can gain significant knowledge about the modular structure of the network. As these networks carry not only interaction information but also annotations, the different representations we propose augment the prediction process by including this information in the clustering of the network.
The results from our experiments validate the augmented graph representation approach. Even the simplest augmentation, i.e. the different weighted graph representations of the PIN, significantly improves the results of the function prediction. Our experiments were performed using the simple normalized Jaccard Index as a weighting factor, and we are confident that the results can be improved even further using a more sophisticated weighting scheme. We used the same weighting when we further augmented the graph representation by adding artificial edges to take into account the well known fact that protein interaction networks to date are still not completely captured by the experimental methods used for their construction. This representation is very complex and computationally expensive, but the potential for uncovering new knowledge is significantly increased. Our experiments showed that the most informative representation is the one in which every single term associated with a protein becomes a node and the association between a protein and a term is represented by an edge between the two. Of all the proposed representations, this one has the greatest power to unravel the functions of a query protein, but also the greatest computational complexity.
In general, if one would like to perform a network wide annotation, the weighted graph representations are recommended, while the exploration of a single protein, or a small group of proteins, should be performed using either the full functional connected graph or the protein-term graph. In terms of selecting a clustering algorithm, our results showed that Infomap has the best performance in determining the modular structure of a PIN and is also the most accurate of all tested algorithms. However, the high accuracy comes at the price of low coverage (i.e. the inability to discover a larger set of functions associated with a query protein). The opposite holds for the timeBGLL and EdgeCluster algorithms. Depending on the required results, one can choose a random walk and map algorithm (Infomap) if the priority is to get a narrow set of accurate protein functions, or an edge clustering/overlapping clusters algorithm (EdgeCluster) or a multi-resolution algorithm (timeBGLL) if coverage of the possible functions is of greater importance.
Table 10. Values for the AUC for the functional annotation with each clustering algorithm and graph representations for the PIN and the average AUC values per algorithm and per representation.

Supporting Information
File S1 Matlab code for generation of the graph representations.