Information Content-Based Gene Ontology Functional Similarity Measures: Which One to Use for a Given Biological Data Type?


Abstract
The current increase in Gene Ontology (GO) annotations of proteins in the existing genome databases and their use in different analyses have fostered the improvement of several biomedical and biological applications. To integrate this functional data into different analyses, several protein functional similarity measures based on GO term information content (IC) have been proposed and evaluated, especially in the context of annotation-based measures. In the case of topology-based measures, each approach was set with a specific functional similarity measure depending on its conception and applications for which it was designed. However, it is not clear whether a specific functional similarity measure associated with a given approach is the most appropriate, given a biological data set or an application, i.e., achieving the best performance compared to other functional similarity measures for the biological application under consideration. We show that, in general, a specific functional similarity measure often used with a given term IC or term semantic similarity approach is not always the best for different biological data and applications. We have conducted a performance evaluation of a number of different functional similarity measures using different types of biological data in order to infer the best functional similarity measure for each different term IC and semantic similarity approach. The comparisons of different protein functional similarity measures should help researchers choose the most appropriate measure for the biological application under consideration.

Introduction
The advancement of high-throughput biology technologies has resulted in a large increase in functional data, eliciting the need for relevant tools that help analyze and extract information from these data. The Gene Ontology (GO) [1] is an established standard for the functional annotation of proteins that successfully provides structured and controlled, organism-independent vocabularies to describe gene functions and a well-adapted platform to computationally process data at the functional level [2]. Currently, several proteins are already annotated with GO terms in the existing biological databases [3][4][5][6], thus enabling protein comparisons on the basis of their GO annotations. Even though a high proportion (more than 98%) of these annotations are inferred electronically (mostly based on transitive mappings from InterPro2GO, SPKW2GO, EC2GO, SPSL2GO, HAMAP2GO and UniPathway2GO), with IEA (Inferred from Electronic Annotation) as the GO evidence code (http://www.geneontology.org/GO.evidence.shtml), these annotations are becoming more and more accurate with an increased level of confidence as the different mappings are manually curated [7].
Several functional similarity measures that quantify similarity between proteins based on their GO annotations have been introduced and successfully applied in many biomedical and biological applications [2,8]. These measures allow the integration of the biological knowledge contained in the GO structure [9], and have contributed to the improvement of biological analyses [2]. These measures are derived either directly from the GO term information content (IC), a numerical value scoring the description and specificity of a GO term using its position in the GO directed acyclic graph (DAG), or from GO term semantic similarity scores conveying information shared by two GO terms in the GO DAG [8]. It is worth mentioning that several term semantic similarity models have been introduced and a detailed review can be found in [10,11]. In this study, we focus only on term semantic similarity models that are based on term information content, known as node-based models [8,11]. In order to quantify the IC value of a given term, several approaches have also been proposed, each depending on how the concept of 'specificity' is conceived in the context of the GO DAG structure. These approaches are partitioned into two main families, namely the annotation-based and topology-based families, and have been largely used to compare GO terms in the GO DAG and proteins at the functional level using their GO annotations.
The annotation family uses GO term statistics in the corpus under consideration. Despite the issue of protein annotation dependence (scores are based on annotation, which may be unbalanced, biased and incomplete), which leads to the shallow annotation problem [10] that affects the semantic similarity scores produced [12], this family has been used in several applications. Several approaches for comparing GO terms have been tested in the context of the GO DAG. The most popular node-based semantic similarity approaches include the Resnik [13], Lin [14] and Jiang & Conrath [15] approaches, which were initially suggested in the context of WordNet and adapted to the GO DAG [16]. Recently, the Nunivers approach [8] has been introduced and different enhancements, such as the Disjunct Common Ancestor (DCA) [17], relevance similarity [18], information coefficient similarity [19] and eXtended GraSM (XGraSM) [8] models, were proposed to improve the existing approaches for GO term comparison. Note that a random walks enhancement [20] was proposed to improve any of the existing similarity measures by modeling inherent uncertainty from the incomplete knowledge of gene annotations and ontology structure. Functional similarity measures induced by GO term semantic similarity approaches include average (Avg) [16], maximum (Max) [21], average of the best matches (ABM) [2], and best match average (BMA) [9], and those using the GO term information content directly, namely SimGIC [22], SimUI [23], SimUIC and SimDIC [2,9].
The topology-based family, which only uses the structure of the GO DAG in the computation of the IC values, has been proposed to correct for the effect of annotation dependence and provide an effective way of measuring functional similarity between proteins based on their GO annotations. The earliest type of topology-based family, namely edge- or path-based semantic similarity measures, suffers from a serious drawback of producing uniform scores for terms at the same level of the hierarchy under consideration, as these scores are obtained using path lengths between terms [8]. These measures ignore the position characteristics of terms in the hierarchy, and a solution based on weighting edges differently was suggested but failed to completely resolve the problem [9,11]. In this study, we consider only the node-based approaches, as pointed out previously, which use the concept of IC score to compare the properties of the terms themselves and relations to their ancestors or descendants, taking into account term position characteristics [9]. These measures are referred to as IC-based approaches and overcome the main issue of edge- or path-based approaches, producing a fixed and well-defined IC score for a given GO term, independent of the corpus or source under consideration. Each topology-based approach provides its specific semantic similarity measure for comparing GO terms, and functional similarity measure for scoring protein closeness. However, none of the existing studies has attempted to evaluate the effectiveness of functional similarity measures proposed in the context of the annotation-based approaches when applied to the topology-based approaches. Such a study is important to determine the most appropriate functional similarity measure for each approach given the biological application.
Here, we investigate the behaviour of several different IC-based functional similarity measures suggested in the context of annotation-based and topology-based approaches, using different biological data, including protein-protein interaction networks, protein domain and other functional data. Each measure performs differently for different applications [2] and interprets the DAG structure of the GO differently [8,9]. Thus, one needs to understand these differences in order to choose an effective measure for analysis of a dataset, which can be cumbersome and tedious for someone who just needs a quick GO semantic similarity measure for their biological question. This suggests that a quantitative comparative study of all existing GO semantic similarity measures and approaches is necessary to enable one to quickly identify the most effective measure, among the several semantic similarity tools available, for their application. This study provides a mapping between a term IC or term semantic similarity approach and its corresponding most 'appropriate' functional similarity measure, given a particular biological application.

Materials and Methods
To evaluate the existing IC-based functional similarity measures which have been used in the context of biomedical and bioinformatics applications, we use different functional data, including protein sequence, Pfam domain and enzyme commission (EC) similarity data, human gene expression (microarray) and protein-protein interaction (PPI) datasets. All these data represent some form of 'grouping' of proteins that should be functionally related and thus provide useful tests for GO similarity measures. The complete set of GO data and protein-GO term associations were extracted from the GO and GOA databases, respectively, released on the 15th April, 2014. We have considered three topology-based approaches, namely the GO-universal metric proposed by Mazandu and Mulder [9], and the methods of Wang et al. [24] and Zhang et al. [25]. In general, the information content (IC) or semantic value of a given term t is computed as follows:

$$IC(t) = -\log p(t) \qquad (1)$$

where p(t) is the relative frequency of occurrence of the term t in the protein annotation dataset under consideration [16] for the annotation family, the D-value [25] for the Zhang et al. approach, and the topological position characteristic of t for the GO-universal approach. Note that the Zhang et al. model for computing the IC score follows the Seco et al. approach [26] in its conception and it is adapted to the context of the GO DAG. For the Wang et al. method, the IC score of a given term t is the sum of the S-value of the term t and those of all its ancestors [24]. The term semantic similarity score $S_{GO}(s,t)$ between GO terms s and t can be retrieved from the following formula [8]:

$$S_{GO}(s,t) = \frac{\mu(A_s \cap A_t)}{\mu(A_s \cup A_t)} \qquad (2)$$

where $A_x = \mathcal{A}_x \cup \{x\}$ and $\mathcal{A}_x$ denotes the set of ancestors of the term x, and $\mu(A_s \cap A_t) \geq 0$ and $\mu(A_s \cup A_t) > 0$ are measures of the commonality between, and of the description of, $A_s$ and $A_t$, respectively. Formula 2 is a unified formula for all term semantic similarity models based on IC or semantic values (SV) of terms.
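As an illustration of the annotation-based IC and of the unified ratio of a commonality measure to a description measure, the following minimal Python sketch may help. The toy DAG, the protein annotations and all function names are hypothetical. Taking both measures to be the sum of IC values over a term set is only one possible instantiation, chosen here for simplicity, and is not claimed to be the choice made by any specific approach evaluated in this study.

```python
import math

# Hypothetical toy DAG (child -> set of parents); not real GO terms.
PARENTS = {"root": set(), "a": {"root"}, "b": {"root"}, "c": {"a", "b"}}

# Hypothetical direct annotations (protein -> set of terms).
ANNOTATIONS = {"P1": {"c"}, "P2": {"a"}, "P3": {"b"}, "P4": {"root"}}


def ancestors_inclusive(term):
    """Return the term together with all of its ancestors in the toy DAG."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def annotation_ic():
    """IC(t) = -log p(t), where p(t) is the fraction of proteins annotated
    to t directly or through one of its descendants."""
    counts = {t: 0 for t in PARENTS}
    for terms in ANNOTATIONS.values():
        covered = set()
        for t in terms:
            covered |= ancestors_inclusive(t)
        for t in covered:
            counts[t] += 1
    n = len(ANNOTATIONS)
    # Terms never used in the corpus have no defined IC and are left out.
    return {t: -math.log(c / n) for t, c in counts.items() if c > 0}


def term_similarity(s, t, ic):
    """Ratio of a commonality measure to a description measure.
    Assumption: both measures are sums of IC over the term set."""
    a_s, a_t = ancestors_inclusive(s), ancestors_inclusive(t)
    description = sum(ic[x] for x in a_s | a_t)
    if description == 0:
        return 0.0  # only the root is involved; the ratio is undefined
    return sum(ic[x] for x in a_s & a_t) / description


ic = annotation_ic()
print(ic["c"], term_similarity("c", "a", ic))
```

In this toy corpus the root annotates every protein, so its IC is zero, while the deepest term has the largest IC, which is the behaviour the notion of 'specificity' is meant to capture.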
Note that other term semantic similarity models that do not use only or directly IC values were proposed. These include the Hybrid Relative Specificity Similarity (HRSS) method [27], which adapts both node-and edge-based concepts, and the Shortest Semantic Differentiation Distance (SSDD), which assesses the distance between terms in the GO DAG in order to measure their semantic similarity score [28], and these methods are beyond the scope of this study.

Measuring protein similarity at the functional level
Several measures have been proposed for estimating functional similarity scores in the context of annotation-based IC approaches to facilitate protein comparisons at the functional level. These functional similarity scores are obtained using statistical measures of closeness, such as average (Avg), maximum (Max), best-match average (BMA) and averaging all the best matches (ABM). The average and maximum measures are computed as follows:

$$Avg(p,q) = \frac{1}{n \times m} \sum_{s \in T^X_p,\, t \in T^X_q} S_{GO}(s,t)$$

and

$$Max(p,q) = \max\{S_{GO}(s,t) : (s,t) \in T^X_p \times T^X_q\}$$

where $T^X_r$ is a set of GO terms in X representing the molecular function (MF), biological process (BP) or cellular component (CC) ontology annotating a given protein r, $n = |T^X_p|$ and $m = |T^X_q|$ are the number of GO terms in these sets, and $S_{GO}(s,t)$ is the semantic similarity score. The ABM [2] for two annotated proteins is the mean of best matches of GO terms of each protein against the other, given by the following formula:

$$ABM(p,q) = \frac{1}{n + m} \left[ \sum_{s \in T^X_p} S_{GO}(s, T^X_q) + \sum_{t \in T^X_q} S_{GO}(t, T^X_p) \right]$$

with $S_{GO}(s, T^X_r) = \max\{S_{GO}(s,t) : t \in T^X_r\}$. The Best Match Average (BMA) [2,9] for two annotated proteins p and q is the mean of the following two values: average of best matches of GO terms annotated to protein p against those annotated to protein q, and average of best matches of GO terms annotated to protein q against those annotated to protein p, given by the following formula:

$$BMA(p,q) = \frac{1}{2} \left[ \frac{1}{n} \sum_{s \in T^X_p} S_{GO}(s, T^X_q) + \frac{1}{m} \sum_{t \in T^X_q} S_{GO}(t, T^X_p) \right]$$

Note that the four functional similarity measures above require GO term semantic similarity scores, and are referred to as IC-based non-direct term, term semantic similarity-based or pair-wise term-based measures [2]. For the topology-based family, each approach has been suggested with its functional similarity measure. The GO-universal metric [9] uses BMA, and ABM was used in the Wang et al. approach [24]. The Zhang et al. measure [25] is a context dependent approach and the authors initially suggested using the approach proposed by Lord et al. [16], which is the Avg scheme for measuring functional similarity scores between proteins.
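The four pair-wise term-based measures described above can be sketched in a few lines of Python. This is a minimal sketch that follows the verbal definitions in the text; the term names, the toy similarity table and the function names are hypothetical, and any term semantic similarity function could be passed in place of `toy_sim`.

```python
# Hypothetical symmetric term similarity table for two toy terms.
TOY_SCORES = {frozenset(("t1", "t2")): 0.2}


def toy_sim(s, t):
    """Toy term semantic similarity: 1 for identical terms, table lookup otherwise."""
    return 1.0 if s == t else TOY_SCORES[frozenset((s, t))]


def best_match(s, terms, sim):
    """Best score of term s against every term of the other protein."""
    return max(sim(s, t) for t in terms)


def avg_measure(tp, tq, sim):
    """Average of all pair-wise term similarity scores."""
    return sum(sim(s, t) for s in tp for t in tq) / (len(tp) * len(tq))


def max_measure(tp, tq, sim):
    """Maximum of all pair-wise term similarity scores."""
    return max(sim(s, t) for s in tp for t in tq)


def abm_measure(tp, tq, sim):
    """Mean of the best matches of the terms of each protein against the other."""
    total = sum(best_match(s, tq, sim) for s in tp)
    total += sum(best_match(t, tp, sim) for t in tq)
    return total / (len(tp) + len(tq))


def bma_measure(tp, tq, sim):
    """Mean of the two directional averages of best matches."""
    p_to_q = sum(best_match(s, tq, sim) for s in tp) / len(tp)
    q_to_p = sum(best_match(t, tp, sim) for t in tq) / len(tq)
    return 0.5 * (p_to_q + q_to_p)


# Protein p carries two terms, protein q only one of them.
terms_p, terms_q = {"t1", "t2"}, {"t1"}
for measure in (avg_measure, max_measure, abm_measure, bma_measure):
    print(measure.__name__, measure(terms_p, terms_q, toy_sim))
```

In this toy case the two proteins share one term and differ on another, so Max returns 1 while the other three measures return lower scores that reflect the unmatched term.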
In the context of the annotation-based family, it has been observed that measuring the semantic similarity of two GO terms based only on the most informative common ancestor terms cannot discern the semantic contributions of the ancestor terms to these two specific terms and thus may negatively impact functional similarity scores. The GraSM and XGraSM approaches have been proposed and shown to perform better than those using only the most informative common ancestors (MICA) strategy [8]. This argument has been confirmed through the performance evaluation of the SimGIC measure suggested by Pesquita et al. [22], which uses a Jaccard index weighted by IC of terms, thus incorporating the features of all ancestors of the terms. The SimGIC measure computes the functional similarity score between two proteins p and q as follows:

$$SimGIC(p,q) = \frac{\sum_{x \in A^X_p \cap A^X_q} IC(x)}{\sum_{x \in A^X_p \cup A^X_q} IC(x)}$$

where IC(x) is the information content value of the term x [8] and $A^X_r$ is a set of GO terms together with their ancestors in X representing the ontology (MF, BP or CC) annotating a given protein r.
Using the observation above, we proposed two other possible functional similarity schemes [2,9], using Dice (Czekanowski or Lin like measure) and universal indexes, referred to as SimDIC and SimUIC, respectively, and given by the following formulae:

$$SimDIC(p,q) = \frac{2 \sum_{x \in A^X_p \cap A^X_q} IC(x)}{\sum_{x \in A^X_p} IC(x) + \sum_{x \in A^X_q} IC(x)}$$

$$SimUIC(p,q) = \frac{\sum_{x \in A^X_p \cap A^X_q} IC(x)}{\max\left\{\sum_{x \in A^X_p} IC(x),\ \sum_{x \in A^X_q} IC(x)\right\}}$$

Note that this study provides the first evaluation of these SimDIC and SimUIC measures and their comparison to other functional similarity measures. Unlike the Avg, Max, ABM and BMA measures, in which semantic similarity between GO terms is required in the computation of functional similarity scores, the SimGIC, SimDIC and SimUIC measures use the IC of terms directly and they are referred to as IC-based direct term measures. Note that there exist other functional similarity models, such as the shortest-path graph kernel (spgk) [29], using the intrinsic topology of the GO DAG for directly estimating protein functional similarity scores without computing the IC scores of GO terms or semantic similarity scores between terms. Here, we focus only on protein functional similarity models that use the IC of terms.
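The three IC-based direct term measures can be sketched as set operations over ancestor-closed annotation sets. The sketch below assumes the usual forms of the three indexes: Jaccard for SimGIC (shared IC over IC of the union), Dice for SimDIC (twice the shared IC over the sum of the two totals) and the universal index for SimUIC (shared IC over the larger of the two totals). The IC values, term names and function names are hypothetical.

```python
# Hypothetical IC values for four toy terms; the root carries no information.
TOY_IC = {"root": 0.0, "a": 1.0, "b": 2.0, "c": 3.0}


def ic_sum(terms, ic):
    """Total information content of a set of terms."""
    return sum(ic[x] for x in terms)


def sim_gic(ap, aq, ic):
    """Jaccard index weighted by IC: shared IC over IC of the union."""
    return ic_sum(ap & aq, ic) / ic_sum(ap | aq, ic)


def sim_dic(ap, aq, ic):
    """Dice index weighted by IC: twice the shared IC over the sum of totals."""
    return 2.0 * ic_sum(ap & aq, ic) / (ic_sum(ap, ic) + ic_sum(aq, ic))


def sim_uic(ap, aq, ic):
    """Universal index weighted by IC: shared IC over the larger total."""
    return ic_sum(ap & aq, ic) / max(ic_sum(ap, ic), ic_sum(aq, ic))


# Annotation sets already closed under ancestors (terms plus their ancestors).
anc_p = {"root", "a", "b"}
anc_q = {"root", "a", "c"}
print(sim_gic(anc_p, anc_q, TOY_IC),
      sim_dic(anc_p, anc_q, TOY_IC),
      sim_uic(anc_p, anc_q, TOY_IC))
```

Setting every IC value to 1 in `sim_gic` gives a plain Jaccard index over the term sets, which corresponds to the SimUI special case. A production version would also need to guard against proteins whose only annotation is the root, where the denominators are zero.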

Assessing different functional similarity measures
We systematically assess different functional similarity measures on different types of functional data, including sequence similarity, Pfam domain and Enzyme Commission (EC) similarity data on a selected set of proteins, and human protein-protein interaction (PPI) and co-expression networks. These datasets represent different types of biological data used to evaluate GO semantic similarity measures [10]. Depending on these biological data, different performance measures are used to elucidate the 'best' semantic similarity measure or approach.

Correlation with EC, Pfam and sequence similarity
Generally, the comparison of different semantic similarity measures is performed using Pearson's correlation measures with sequence, Pfam domain and Enzyme Commission (EC) similarity data. This correlation provides an indication of how effective the functional similarity measure is in capturing sequence, Pfam, and EC similarity. This means that a measure with a higher correlation is better, since it captures these similarities well and it is likely to be an unbiased measure. To compare different measures, we ran the Collaborative Evaluation of Semantic Similarity Measures (CESSM) online tool [30] at http://xldb.di.fc.ul.pt/tools/cessm/ for BP and MF using a dataset of selected proteins with known relationships downloaded from the CESSM website.
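For readers who want to reproduce this kind of comparison outside the CESSM tool, Pearson's correlation between functional similarity scores and a reference similarity can be computed as in the minimal sketch below. The score lists are hypothetical placeholders and not CESSM data, and the function is a plain textbook implementation rather than the code used by CESSM.

```python
import math


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for the same protein pairs under two similarity notions.
functional_scores = [0.10, 0.35, 0.60, 0.90]
sequence_scores = [0.05, 0.40, 0.55, 0.80]
print(pearson(functional_scores, sequence_scores))
```

A coefficient close to 1 indicates that the functional similarity measure ranks and scales protein pairs much like the reference similarity does.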

Performance evaluation using a PPI network
Different measures were assessed in terms of their ability to capture functional coherence in a human PPI network based on how interacting proteins are functionally related to each other. Human PPI datasets were downloaded from several different PPI databases, including the IntAct, DIP, BIND, MIPS, MINT and BioGRID databases, and integrated into a single network in which only interactions predicted by at least two different approaches and found in the STRING dataset are considered, to reduce the impact of false positives. This produced a human PPI network with 6031 interactions, of which 5366 and 5580 interactions had both interacting partners among the 29844 and 31683 proteins annotated with respect to the GO BP and CC ontologies, respectively. These interaction datasets are available in the supplementary data (see Tables S1, S2 and S3 in File S1) and can also be downloaded from the CBIO website at http://web.cbio.uct.ac.za/ITGOM/funcsimdata. The sets of 5366 and 5580 interactions are considered as positive sets, while each negative set consists of the same number of interactions randomly selected among annotated human protein pairs. This is consistent as the chance of randomly selecting a detected PPI is very small (less than 0.0012%). We only considered proteins annotated with BP and CC terms in the network produced since two proteins that interact physically are more likely to be involved in similar biological processes or localized in the same cellular component, but there is no guarantee that they share molecular functions [9]. The classification power of different functional similarity measures was tested using Receiver Operator Characteristic (ROC) curve analysis, which assesses the Area Under the Curve (AUC), plotting the true positive rate or sensitivity vs the false positive rate or 1 - specificity. This AUC value is used as a measure of discriminative power and a realistic classifier must have an AUC larger than 0.5.
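The AUC used here can be illustrated with a small Python sketch. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive pair receives a higher score than a randomly chosen negative pair, with ties counted as one half. The scores below are hypothetical, and the quadratic loop is meant for illustration only, not as the procedure used to produce the results of this study.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties contribute one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical functional similarity scores for interacting (positive)
# and randomly paired (negative) proteins.
interacting = [0.9, 0.8, 0.4]
random_pairs = [0.5, 0.3, 0.2]
print(roc_auc(interacting, random_pairs))
```

A measure that scores interacting and random pairs indistinguishably yields a value near 0.5, which is why 0.5 serves as the baseline for a realistic classifier.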

Clustering power on a gene expression dataset
We use the human co-expression network retrieved from Bossi et al. [31] and the STRING human network. We retrieved 7228 co-expressed protein pairs, of which 6995 pairs have both proteins among the 29844 human proteins annotated with BP terms (see Tables S4 and S5 in File S1, or go to http://web.cbio.uct.ac.za/ITGOM/funcsimdata). We consider only the BP ontology as co-expressed genes are more likely to share common processes and may at least belong to the same pathway or contribute to a similar biological process [32]. We partitioned these co-expressed proteins into different clusters using the Blondel et al. method [33] and the corresponding partition is considered to be a ground truth, i.e., the true partition of the actual co-expressed network. Thereafter, the interactions from the co-expressed network are weighted using functional similarity scores and proteins are clustered using the same clustering method. We assessed the clustering power of a given functional similarity measure by comparing this clustering result to the ground truth using Normalized Mutual Information and Rank Index of pairwise cluster memberships [34].
Let n be the number of proteins in the network, with the ground truth (g) having p partitions, each with $n^g_i$ proteins, $i = 1, \ldots, p$, and the clustering result (c) having q partitions, each with $n^c_j$ proteins, $j = 1, \ldots, q$. The entropy $H(r_d)$ of a given clustering (d) having r partitions, each with $n^d_\ell$ proteins, $\ell = 1, \ldots, r$, is given by:

$$H(r_d) = -\sum_{\ell=1}^{r} \frac{n^d_\ell}{n} \log\left(\frac{n^d_\ell}{n}\right)$$

and the mutual information $I(p_g, q_c)$ between the two partitions is computed as follows:

$$I(p_g, q_c) = \sum_{i=1}^{p} \sum_{j=1}^{q} \frac{n_{ij}}{n} \log\left(\frac{n \, n_{ij}}{n^g_i \, n^c_j}\right)$$

where $n_{ij}$ is the number of common proteins between the ith cluster in the ground truth and the jth cluster in the clustering result. The normalized mutual information $NI(p_g, q_c)$ is then obtained by normalizing $I(p_g, q_c)$ by the entropies $H(p_g)$ and $H(q_c)$ of the two partitions. Finally, the Rank Index $RI(p_g, q_c)$ of pairwise cluster memberships is computed as follows:

$$RI(p_g, q_c) = \frac{a + b}{\binom{n}{2}}$$

where a is the number of pairs of proteins belonging to the same cluster in both the ground truth and the clustering result, and b is the number of protein pairs belonging to different clusters in both the ground truth and the clustering result. The functional similarity measure providing higher normalized mutual information and Rank Index (accuracy) scores is considered to be the 'best' one.
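The clustering scores described above can be sketched in Python as follows. Two points are assumptions rather than statements of the authors' method: the mutual information is normalized by the mean of the two entropies, NI = 2I / (H_g + H_c), which is one common convention, and the pairwise index is computed as the fraction of protein pairs on which the two partitions agree (same cluster in both, or different clusters in both). The label vectors and function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def entropy(labels):
    """Entropy of a partition given as one cluster label per protein."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(truth, result):
    """Mutual information between two partitions of the same proteins."""
    n = len(truth)
    size_t, size_r = Counter(truth), Counter(result)
    joint = Counter(zip(truth, result))  # n_ij for non-empty overlaps only
    return sum((nij / n) * math.log(n * nij / (size_t[i] * size_r[j]))
               for (i, j), nij in joint.items())


def normalized_mi(truth, result):
    """Assumed normalization: 2 * I / (H_truth + H_result)."""
    denom = entropy(truth) + entropy(result)
    if denom == 0:
        return 1.0  # both partitions consist of a single cluster
    return 2.0 * mutual_information(truth, result) / denom


def pairwise_index(truth, result):
    """Fraction of protein pairs treated consistently by both partitions."""
    agree = pairs = 0
    for u, v in combinations(range(len(truth)), 2):
        pairs += 1
        if (truth[u] == truth[v]) == (result[u] == result[v]):
            agree += 1
    return agree / pairs


# Hypothetical cluster labels for four proteins.
ground_truth = [0, 0, 1, 1]
relabelled = [1, 1, 0, 0]   # same partition, different label names
crossed = [0, 1, 0, 1]      # splits every ground-truth cluster
print(normalized_mi(ground_truth, relabelled), pairwise_index(ground_truth, relabelled))
print(normalized_mi(ground_truth, crossed), pairwise_index(ground_truth, crossed))
```

Both scores depend only on how proteins are grouped and not on the label names, so the relabelled partition is scored as a perfect match.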

Results and Discussion
Previous work on semantic similarity measures has suggested that the appropriate use of functional similarity measures depends on the biological applications and different measures perform differently for different applications [2]. Each semantic similarity approach or functional measure was defined for a specific purpose with a specific application in mind, especially in the context of topology-based approaches, where each approach was set with its specific functional similarity measure, depending on its conception and the applications for which it was designed. These applications include protein-protein interaction assessments, protein function prediction and protein clustering, and results were often tested against the expectations of the performance scores. Here, we assess the performance of different measures on different biological applications or data, including EC, Pfam domain and sequence similarity on a selected set of protein pairs, and human PPI and co-expression network or expression data, in order to elucidate the most 'appropriate' measures for different approaches and biological applications. The summary of different approaches that are combined to construct 57 different IC-based functional similarity measures used is provided in Table 1. Note that the Jiang and Conrath approach is not used explicitly since it has been shown to be a particular case of the Lin approach [8].

Using EC, Pfam and Sequence Similarity data
We used a dataset of proteins with known relationships downloaded from the CESSM online tool. The GO annotations of different proteins in the dataset were retrieved from the GOA-UniProtKB dataset. The CESSM tool has made the comparison of different functional similarity measures using Pearson's correlation measures with sequence, Pfam domain and EC similarity possible. We ran the CESSM online tool and results are shown in Figure 1 for the BP, MF and CC ontologies. Except for the Resnik approach, these results show that in general there is a good correlation between EC, Pfam domain, sequence similarity and functional similarity measures for BP, MF and CC, especially when using measures other than Max and Avg. For EC in particular, the MF ontology tends to display higher levels of correlation. This is unsurprising as EC numbers are very specific for a particular function, so there should be good correlation in MF terms.
Recently, it was shown that the normalization model and correction factors have an impact on the performance of functional similarity measures [8]. It is likely that the effect of the normalization factor is a serious drawback of the Resnik approach as this has an impact on its performance and makes it inconsistent with the hierarchy under consideration. This is confirmed by looking at the performance of the Nunivers [8] and Lin [14] approaches (see Table 2), which follow the general pattern, whereas the Resnik approach suggests the Max measure for the MF ontology. In general, BMA and ABM measures provide the best performance and they perform equally in most cases. On the other hand, the use of an efficient correction factor may improve a given approach or measure. If the information coefficient and relevance introduced by Li et al. [19] and Schlicker et al. [18], respectively, which use the IC value of the most informative common ancestor between terms, do not significantly improve the performance of the Lin approach, then one can consider all common informative ancestors in the correction factor to enhance the performance of the approach [8].

Table 1. Summary of different IC-based functional similarity and term semantic similarity measures.

Measure | Model | Approach | Reference
Functional similarity | IC-based direct term | SimGIC | [22]
Functional similarity | IC-based direct term | SimDIC | [2]
Functional similarity | IC-based direct term | SimUIC | [2]
Functional similarity | IC-based direct term | SimUI | [23]
Functional similarity | Pair-wise term or IC-based non-direct term | Avg | [16]
Functional similarity | Pair-wise term or IC-based non-direct term | Max | [21]
Functional similarity | Pair-wise term or IC-based non-direct term | BMA | [22]
Functional similarity | Pair-wise term or IC-based non-direct term | ABM | [24]
Term semantic similarity | Annotation-based | Resnik | [13]
Term semantic similarity | Annotation-based | XGraSM-Resnik | [8]
Term semantic similarity | Annotation-based | Nunivers | [8]
Term semantic similarity | Annotation-based | XGraSM-Nunivers | [8]
Term semantic similarity | Annotation-based | Lin | [14]
Term semantic similarity | Annotation-based | XGraSM-Lin | [8]
Term semantic similarity | Annotation-based | Li et al. | [19]
Term semantic similarity | Annotation-based | Relevance | [18]
Term semantic similarity | Topology-based | GO-universal | [9]
Term semantic similarity | Topology-based | Wang et al. | [24]
Term semantic similarity | Topology-based | Zhang et al. | [25]

These measures were used to build 57 different functional similarity measures that are assessed using different types of biological data, including Enzyme ...
As displayed in Figure 1 and Table 2, applying the XGraSM correction factor to the Resnik, Lin and Nunivers approaches significantly improved their performance. Thus, including common informative ancestors in the conception of a semantic similarity measure improves its performance, especially for approaches that include only the feature of child terms in the computation of IC. This is the case for the annotation-based, Zhang et al. and Wang et al. approaches, where the SimGIC measure shows an overall best performance. Note that this is not the case for the GO-universal metric, in which the BMA measure performs better than other measures, and it also provides better performance for the Wang et al. approach when applied to EC data, even though the Wang et al. approach initially used the ABM measure. It follows that in the context of the annotation-based family, if one chooses to use the IC-based non-direct measures, it is advantageous to use the XGraSM enhancement model, in which case Resnik-BMA shows overall best performance. The SimUI approach [23] refers to the union-intersection protein similarity measure and it is a particular case of SimGIC assigning equal IC value to all terms in the GO DAG [9]. Even though this assumption is not realistic in the context of the GO DAG, the SimUI measure can still be used as an alternative measure in practice as it shows relatively good performance when applied to these different data.

Using protein-protein interaction and expression data
We used human PPI and co-expressed networks to assess the performance of different functional similarity measures. In the case of the PPI network, we use the AUC values computed using the ROCR package under the R programming language as a measure of classification power. The larger the AUC value, the more efficient the functional similarity measure is. For the co-expression network, we computed the NI and RI values as measures of clustering power: the higher these values, the more powerful the functional similarity measure is. Different values found for different measures are shown in Figure 2 and Table 3. These results indicate that independently of the approaches, the Avg measure, which is the earliest proposal suggested by Lord et al. [16] in the context of the IC-based functional similarity, performs better than any other functional similarity measure. It was unexpected to find that the Wang et al. approach performs poorly in terms of AUC values when using the BMA and ABM measures for BP, whereas these measures have shown good performance when used in EC, Pfam domain and sequence similarity data and the authors of this approach initially suggested using the ABM measure. Other approaches show good performance when used with their initial measures even though the Avg measure achieves the best performance. On the other hand, the Max approach performs poorly compared to other approaches, independently of the network (PPI or co-expression) and performance measure. This may be due to the fact that the Max approach tends to over-estimate functional similarity scores between proteins, for example by assigning the similarity score of 1 to two proteins sharing at least one GO term independently of the number of unrelated terms between these proteins. Table 4 lists functional similarity measures achieving overall 'best' performance for different ontologies (MF, CC and BP) given a biological data type.
These results indicate that for the CC ontology, the topology-based approaches, namely ... A summary of the best performing measures for different approaches and different biological data or applications is provided in Table 5. Finally, note that the good performance of the annotation-based family is related to the corpus under consideration because of its dependence on the frequencies of GO term occurrences in the corpus. These annotations may be unbalanced in their distribution across the DAG. This constitutes a serious drawback to these approaches, specifically for organisms with sparse GO annotations, and may negatively affect their performance [9]. The use of the whole set of annotations as done in this study may solve this problem but only at the cost of an increase in the running time and the complexity of these annotation-based approaches. This is expected to worsen as the number of protein annotations increases daily, which would potentially hamper the performance of these approaches in their running time, since processing the annotation file would take a lot of time before being able to compute the IC values. This implies that it may be better to make use of topology-based approaches if one has to choose between the two families.

Conclusion
Several IC-based GO functional similarity measures have been proposed over recent years and have enabled comparison of proteins at the functional level on the basis of their GO annotations. These measures are being used in different biological and biomedical applications and have largely contributed to the efficient exploitation of the biological knowledge embedded in the GO structure. While annotation-based functional similarity measures have been intensively studied and topology-based measures very often deployed to specific applications, none of the previous studies has attempted to quantitatively perform all-against-all semantic similarity measure comparisons. As a result, there were still gaps in our knowledge on the performance of these measures when applied to different biological data or applications, making the choice of the most 'appropriate' measure difficult, especially for someone who just needs a quick GO semantic similarity measure for their biological question. Thus, a comparative study was necessary in order to provide a global assessment of these different semantic similarity measures.

Table 5. Summary of the best performing measures for different applications. Columns: Model, Approach, EC, Pfam, Seq. Sim., PPI, CN.
Here, we have carried out a quantitative performance evaluation of several different semantic similarity measures between GO terms for different term IC families or semantic similarity approaches and different biological data. Results indicate that a measure used for a given biological data type was not always the most appropriate even for the 'well' studied family of measures, namely annotation-based measures. In fact, though the SimGIC or the BMA or ABM measure was confirmed to be the best measure, in general, when using EC, Pfam domain and sequence similarity data, this measure was not the best for applications related to PPI and co-expression data (e.g., assessing protein-protein interaction or clustering co-expressed proteins), where the Avg measure showed overall best performance. This is also the case for the topology-based approaches where, in general, the initial measure suggested for use does not provide the overall best performance. This study bridges the gap between the large variety of GO semantic similarity measures and their performance in different biological and biomedical applications by comparing different protein functional similarity measures using different biological data. This should help researchers choose the most appropriate measure for their biological application.

Supporting Information
File S1. Combined file of supporting tables. Table S1: A human protein-protein interaction dataset used to assess the classification power of different functional similarity measures using Receiver Operator Characteristic (ROC) curve analysis. Table S2: A set of human protein-protein interactions with both interacting partners annotated with respect to the GO BP ontology. Table S3: A set of human protein-protein interactions with both interacting partners annotated with respect to the GO CC ontology. Table S4: A human co-expression network used to assess the clustering power of different functional similarity measures using Normalized Mutual Information and Rank Index scores.