Local Network Topology in Human Protein Interaction Data Predicts Functional Association

The use of high-throughput techniques to generate large volumes of protein-protein interaction (PPI) data has increased the need for methods that systematically and automatically suggest functional relationships among proteins. In a yeast PPI network, previous work has shown that the local connection topology, particularly for two proteins sharing an unusually large number of neighbors, can predict functional association. In this study we improved the prediction scheme by developing a new algorithm and applied it on a human PPI network to make a genome-wide functional inference. We used the new algorithm to measure and reduce the influence of hub proteins on detecting function-associated protein pairs. We used the annotations of the Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) as benchmarks to compare and evaluate the function relevance. The application of our algorithms to human PPI data yielded 4,233 significant functional associations among 1,754 proteins. Further functional comparisons between them allowed us to assign 466 KEGG pathway annotations to 274 proteins and 123 GO annotations to 114 proteins with estimated false discovery rates of <21% for KEGG and <30% for GO. We clustered 1,729 proteins by their functional associations and made functional inferences from detailed analysis on one subcluster highly enriched in the TGF-β signaling pathway (P<10−50). Analysis of another four subclusters also suggested potential new players in six signaling pathways worthy of further experimental investigations. Our study gives clear insight into the common neighbor-based prediction scheme and provides a reliable method for large-scale functional annotation in this post-genomic era.


Introduction
Due to advance in DNA sequencing, genes are being discovered at unprecedented speed, creating a need for annotating their functions. High-throughput mapping of protein-protein interaction (PPI) data is an example of functional genomics that enables rapid assignment of functional annotations by links between proteins which imply functional associations. However, due to noises inherent in the process of data generation [1], for example, by a yeast two-hybrid method [2], it becomes important to develop algorithms that reduce the influence of such noises and improve the quality of declared functional associations. So far, partial PPI networks for several organisms have been mapped [3][4][5][6][7][8][9][10][11], and different methods have been formulated to investigate these networks, and hence protein functions [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]. One method to suggest biological function is to compare the PPI network with similar random networks to identify unusual topological connectivity between proteins, which we call common-neighbor statistics. Such statistics has been used to assess the functional relationship between proteins in a yeast PPI network, and functional inferences that are statistically significant have been made from those relationships [28]. In this study, we improved upon the commonneighbor statistics, thereby enhancing the quality of functional association predictions, and applied our methods to a compre-hensive human PPI dataset [29] to suggest potential functions of human proteins.
PPIs can be visualized as a graph with proteins composing the nodes, and interactions composing the edges (the graphical interactions). Ample evidence exists that such a graph is nonrandom in the topologies of its connectivity [30][31][32]. We assumed that most of the nonrandomness is necessary for the protein-interaction network to perform proper biological function. We further hypothesize, that two proteins share a number of interacting neighbors which is significantly larger than that occurred on average in truncated power-law preserving random networks can significantly enhance the likelihood of the two proteins sharing a common or related biological function. In prior work on yeast PPI network, we developed a formula for ranking the degree of rareness of such occurrences [28]. In this study, we developed an additional formula to overcome a deficiency in the previous work and make the ranking more accurate. We found that the combination of these two formulas leads to better results. We applied the method of detecting nonrandomness to the publicly available PPI dataset for humans [29]. With our clustering method, we built a 1729-protein cluster where we found most function-related proteins were clustered together and many subclusters were highly enriched in different signaling pathways. In particular, we made an in-depth analysis of the transforming growth factor b (TGF-b) pathway which is important in cell proliferation and tumorigenesis, and suggested a list of proteins presumably involved in several signaling pathways.

Algorithms
Suppose that in a PPI network of size N, the degree (i.e., the number of interactions) for each protein node is fixed, but the interacting partners are randomly selected. This specifies the random network which we compare the real PPI data with. We randomly pick proteins X and Y (X with n X interactions and Y with n Y interactions) and find that X and Y share m interacting partners (nodes) in this network. We denote the set of common partners as A~Z 1 ,Z 2 ,::::::Z m f g , the set of all proteins as V~1,2,::::::,N f g , and the number of interacting partners for each protein in V as k~n 1 ,n 2 ,::::::n N f g . The total number of graphs in which proteins X and Y have m common partners is a product of three factors: (i) m proteins can be . By multiplying these three factors and dividing by the total number of unrestricted ways for protein X to have n X and protein Y to have n Y interacting partners-N n X N n Y -we can arrive at the following formula (Algorithm I) by Samanta and Liang [28]: In this calculation, we have relaxed the constraint that the degree of each node remains the same. For such totally randomized networks for which only the average number of interactions per protein is fixed, our simulation showed that the probability computed by P 1 is accurate. However, a more realistic random control is to also keep the degree distribution the same as the real PPI network (i.e., to preserve the truncated power-law distribution [32]). This is much broader than a totally random network, for which the degree distribution, for a large number of interactions, decays exponentially. For such a truncated power-law random network, our simulations showed that P 1 becomes inaccurate. To determine the reason behind this and to devise a compensation, we note that in any set A of m common partners, proteins with more interactions will appear at a higher frequency. An extreme case is that if one protein interacts with most proteins in the network (i.e., a hub protein), it is hardly a surprise to find any two proteins sharing it as a common partner. Because it is easier to observe hub proteins as common partners and because P 1 only takes into account the degree of nodes on average, the significance of P 1 should be downweighted when hub proteins are involved as common partners. Therefore, we came up with another algorithm (Algorithm II) to reduce the influence of hub proteins: under the condition that all proteins are randomly connected, we used the degree kof V (except the degree of X and Y) to compute the probability that only A~Z 1 ,Z 2 ,::::::Z m f gconnects to X and Y, and we derived the probability as follows: In supporting information (Text S1), we show that the second product is bounded from both above and below; and hence, we Þ . Therefore, each protein pair with common neighbor(s) was assigned with both P 1 and P 2 . In previous work, Samanta and Liang [28] used only P 1 to rank the relationship of protein pairs. For our method, we added P 2 as a complementary algorithm to improve the biological inference. We showed that by reducing the influence of hub proteins in the network, the use of both P 1 and P 2 allowed us to identify a more reliable functional relationship than that identified by P 1 alone.

Comparing Network Topology between Real and Randomized PPI Networks
We computed the probabilities (P 1 and P 2 ) according to Algorithms I and II for 311,023 protein pairs that had at least one common neighbor, and plotted the distribution of the probabilities ( Fig. 1a and  1b). In this paper, all the probabilities have been natural [base e] logarithm transformed. To assess the statistical significance of the topological connections in the human PPI network, we computed and compared the distributions of probabilities calculated from Algorithms I and II in suitably randomized networks. There are two ways to randomize the PPI network: (i) randomly connect nodes (proteins) but keep the total number of edges (interactions) the same (i.e., simple random network); and (ii) in addition to (i), keep the number of interacting partners of each protein the same as in our real PPI network (i.e., a truncated power law-preserving random network). Compared to simple randomization, for both Algorithms I and II, the truncated power law-preserving randomization produced a probability distribution more similar to that of the real PPI network (Fig. 1). As a biological network is a network with a truncated power-law distribution [32], it is more realistic to use a truncated power law-preserving random network as the background for comparisons. We use ''random network'' hereafter to refer to a truncated power law-preserving random network, unless otherwise specified. As expected, the human PPI network has much more highly improbable topological connections that happen by chance only with a very low probability ( Fig. 1c and 1d).

Ranking Protein Pairs and Suggesting Functionally Associated Protein Pairs
Ideally, given that P 1 assesses the degree of nonrandomness in the network, which indicates the functional association, we anticipated that P 1 should rank our protein pairs in a way that reflected their functional relevance. Therefore, we hypothesized that a higher ranking (i.e., a better P 1 ) corresponds to a closer biological relationship. With the Gene Ontology (GO) annotations [33] and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotations [34] as benchmarks, we used annotation overlap rates (see Methods and Materials) to validate the reliability of the protein pair ranking from P 1 , and preliminarily determined functionally associated protein pairs (i.e., significant protein pairs). We noted that in the top 5,000 protein pairs, each 1,000 pairs always had a higher overlap rate than those beyond the top 5,000 pairs, and that the region of high overlap will give us a high level of confidence in presenting reliable predictions (Fig. 2). Thus, we chose the 5,000th value of P 1 (217.11) as the cutoff from Algorithm I. It was interesting that the probability perfectly matched the Bonferroni correction ln 2 N N{1 ð Þ in which N = 7,362 is the size of the whole protein network. The false discovery rate (FDR) [35], which was used to assess the effectiveness of our method, is 0.40 for the top 5,000 functional associations selected by Algorithm I, with the cutoff at 217.11 (for our definition of FDR, see Methods and Materials).
In a real PPI network, it is common to have many hub proteins with large numbers of interacting neighbors. P 2 is designed to reduce the influence of these hub proteins within the top 5,000 protein pairs selected by P 1 as we believe that P 2 can identify protein pairs whose lower P 1 is caused by common neighbors that are hub proteins and remove them from the list of significant protein pairs. With GO and KEGG as the benchmarks, the utility of P 2 is then confirmed by the following assertions: (i) the protein pairs with a good P 2 (Group I) always have a lower FDR (here a lower FDR means a closer functional relationship) than those without a good P 2 (Group II; Fig. 3a); and (ii) the protein pairs with a good P 2 (Group I) always have a lower FDR than the same number of top protein pairs ranked by P 1 only (Group III; Fig. 3b). We also noted that because P 1 and P 2 have a  , an additional cutoff from Algorithm II makes difference from merely tightening the cutoff from Algorithm I. As the cutoff for P2 changes, the difference in FDR between Groups I and II varies; the difference maximizes when the cutoff goes to 230.03, which is the value we used for the second cutoff from Algorithm II (Fig. 3a). Therefore, 4,233 significant protein pairs (P,0.001; see Table S1 in supporting information) were considered to have a close functional association in terms of the cutoffs from Algorithm I (217.11) and Algorithm II (230.03). In addition, the 4,233 significant pairs had a FDR of 0.35, compared with 0.39 for the top 4,233 pairs ranked by P 1 only (Fig. 3b), 0.83 for the top 4233 pairs from the truncated power law-preserving random network [cutoffs: 28.90 for ln(P 1 ) and 211.33 for ln(P 2 )] and 0.92 from the totally randomized network [cutoffs: 26.42 (P 1 ) and 213.10 (P 2 )].

Estimate the Lower Bound of FDR for the 4233 significant protein pairs
Because the functional annotations for human proteins are far from complete, the proportion of true positive functional associations must be higher and thus the FDR should be lower than 0.35. To estimate the lower bound of the FDR, we took into consideration the behavior of the random network by computing what percentage of the 4,233 protein pairs were generated by chance. As biological networks are networks with a truncated power-law distribution [32], we used only a truncated power law-preserving random network as the background. Cut by the same cutoff [217.11 for ln(P 1 ) and 230.03 for ln(P 2 )], the power law-preserving random networks have on average 86 protein pairs as significant associations (Fig. S1). The lower bound of FDR is the false discovery number generated in random network (86) divided by the number of predicted significant associations (4233), which is approximately 2%.

Significant Protein Pairs Are Informative in Functional Inference
We observed strong functional relationships among the top 4,233 protein pairs. After manual inspection, we found that at least 96 of the top 100 annotated protein pairs (excluding pairs with unannotated proteins) have close functional relationships and we listed the top 10 pairs in Table 1.
The GO and KEGG-based FDR for 23,782 direct interactions is 0.57, which is significantly higher than our FDR of 0.35 (P,10 216 , two-sample proportion test). This comparison supports the notion that our method offers more reliable functional associations than the human PPI data itself does. Because only 21.6% of the 4,233 protein pairs interact directly in the PPI data, we believe that the rest of them provide additional functional information that is not revealed in the PPI data.
We used GO and KEGG annotations to compare functions and compute annotation overlaps. Among the 1,754 proteins in the top 4,233 protein pairs, 1,220 have qualified GO terms (i.e., GO terms at the highest level without direct or indirect GO ''offspring'' terms in each ontology), and 834 have KEGG pathway annotations. If a protein has at least one annotated significant partner (i.e., two proteins are significant partners to each other if they are a significant protein pair), a list of annotation(s) from its partner(s) can be sorted Figure 3. Algorithm II decreased the false discovery rate (FDR) of our predictions. (a) For the top 5,000 protein pairs ranked by P 1 , each cutoff value from P 2 (on the x axis is the quantile of P 2 we used as the cutoffs) divided them into two groups: Group I (red line), whose P 2 was better than the cutoff, and Group II (blue line), whose P 2 was worse than the cutoff. In this plot, the maximal difference between the two groups is at 0.039  by frequency and annotations occurring at the highest frequency are assigned to this protein (frequency must be at least twice for KEGG and four times for GO; otherwise discarded. For more details, see Text S1 and Fig. S3 in supporting information). For an annotated protein (based on GO and KEGG annotations), if an assigned annotation occurs among its known functions, we consider this to be a correct prediction. By this method, we found that 79% (for KEGG) and 70% (for GO) of assigned annotations were correct predictions. (Randomly picking 4233 pairs from 1729 proteins will only yield a 7% correct prediction rate for KEGG and 12% for GO on average from 100 trials.) In the same way, we predicted 466 KEGG pathways for 274 proteins and 123 GO terms for 114 proteins. We estimated that the FDRs of our predictions are much less than 21% (for KEGG) and 30% (for GO) because of the percentage of correct predictions for annotated proteins and the incompleteness of GO and KEGG annotations. We arbitrarily selected 40 predicted annotations (20 for KEGG and 20 for GO) and listed them in Table 2. For complete predictions, see Table S2 in supporting information.

Clustering from the Significant Protein Pairs
Because clustering can significantly improve the quality of functional inference [28], we built a cluster consisting of 1,729 proteins (excluding 25 non-human proteins) based on the P 1 of 4,233 significant protein pairs. We constructed the empirical cumulative distribution from these P 1 values; thus, each significant protein pair had a score between 0 and 1 according to its ranking order in the distribution of P 1 . Then we built a 172961729 dissimilarity matrix in which each matrix element was assigned either a score (if applicable) or a ''10'' for pairs with no significant P 1 . The purpose of using such a large value was to minimize background noise. Then the dissimilarity matrix was subjected to agglomerative hierarchical clustering with an unweighted pair-group average. The whole cluster is given in Fig. S2 in supporting information.

Analysis of Functional Modules with Significant P Values
In the cluster of 1,729 proteins, most of the functionally related proteins were correctly clustered into their corresponding functional modules, in which they are characterized by similar functions or the same pathway (Fig. 4). The largest subcluster derives directly from the root of the whole cluster and consists of 959 proteins; the second-largest subcluster has only 51 members (Fig. S2). We cut the 959-member subcluster with different cutoff values and analyzed the corresponding subclusters by using both manual inspection and Ingenuity Pathway Analysis (IPA). We conducted a detailed analysis for one prominent subcluster (the subcluster related to the TGF-b signaling pathway) as a reference.
The TGF-b signaling pathway-related subcluster (Fig. 5a) has a total of 45 protein members, 35 of which are known to participate in the TGF-b signaling pathway, according to the Ingenuity database. The probability of observing this by chance is ,10 254 , according to the calculation from Ingenuity software (right-tailed Fisher's exact test). With respect to this extreme P value, we reasoned that probably all the cluster members cooperate to mediate signal transduction. To investigate the role of the other 10 proteins in the TGF-b signaling pathway, we generated a functional relationship network using Osprey software (http:// biodata.mshri.on.ca/osprey) [36] to explicitly elucidate the relationships between the 45 proteins (Fig. 5b): the 10 proteins not related to TGF-b according to the Ingenuity database are located inside a circle, whereas the other 35 TGF-b member proteins lie on the circle; common neighbors which do not belong to the 45-member subcluster stay outside the circle.
The cluster and the association network ( Fig. 5a and 5b) intuitively suggest possible roles that the inner proteins play in the TGF-b signaling pathway, which have not yet been incorporated into the Ingenuity pathway. Take Fig. 5b for instance: SKI functions as both the significant partner and the direct interacting neighbor of SMAD2 and SMAD3, and the three proteins' common neighbors (five violet nodes) all share the function of transcriptional regulation. From this we infer that SKI may regulate the TGF-b signaling pathway on a transcription level, which is in accordance with findings in the literature (but has not been incorporated into the Ingenuity database) that SKI regulates downstream DNA transcription by forming a protein complex with SMAD2 and SMAD3 [37], [38]. With respect to IGSF1's significant partners, direct-interaction partners, and the previous work identifying IGSF1 as a potential receptor that could affect cellular response through its cytoplasmic region [39], we suspect that IGSF1 could function as a coreceptor for inhibin and/or activin. SOSTDC1 and NOG may regulate TGF-b by interacting with BMP receptors, which is in accordance with the findings that both of them function as BMP antagonists [40], [41]. In addition to positive regulatory functions [42], DAB2 may serve as an antagonist of STRAP, which has a negative regulation on TGF-bmediated transcriptional activation [43], [44]. FMOD, CTGF, and SLITL2 may be involved in regulating receptor binding of TGF-bs, in accordance with published findings [45][46][47], and they may interact with each other. Thus, through integration with information from known networks, our method (probability, probability-derived clusters and networks) suggests new features which we can further investigate in experiments.
To facilitate analysis of this type, we proposed eight signaling pathways with extreme P values (,10 240 , from IPA 5.0) that are worthy of further investigations (Fig. 6). The proteins within the same signaling pathway tend to stay together in the same subclusters. This is shown for the largest 959-member subcluster ( Fig. 6a; cluster members are indexed from 1 to 959). From IPAbased classification of the proteins into each of the eight pathways, we calculated a density distribution for all eight signaling pathways along the cluster (Fig. 6b-e). Each pathway is expected to have a distinct distribution (its own peaks). The peaks in Fig. 6b-e map to some areas (i.e., subclusters) that are probably highly related to their corresponding pathways. Functionally intercrossed pathways, like death-receptor/NF-kB signaling, may have close peaks. The distribution patterns are useful in identifying pathway-specific regions in the cluster. We selected another 4 subclusters that are presumably involved in six signaling pathways (excluding TGF-b) with respect to pathway member distributions, and listed the potential pathway members in Fig. 7. We expect that the clusters and distributions will help biologists to find their subcluster of interest and discover new pathway members.

Discussion
An advantage of our prediction scheme, inherited from Samanta and Liang (2003), is the insensitivity to the high false positive rate of high-throughput PPI data. After adding 6086 randomly generated interactions (30.4% of the real data, assuming at least 50% false positive rate for high-throughput data), we were still able to recover on average 93.4% of significant protein pairs; furthermore, .90% of falsely generated ''significant protein pairs'' will become significant if we loosen the cutoffs of P1 and P2 a little to double the number of significant protein pairs. This will certainly offer more flexibility when selecting which PPI data to use.
We compared the performance of our prediction scheme with that of the direct prediction scheme used by Schwikowki et al. (2000) which infers the function of a protein from it direct interacting neighbors in the PPI network [48,49]. Under the same criteria (i.e., the minimum frequency of shared functions required to assign annotations), the FDRs of our predictions (30% for GO and 21% for KEGG) have been significantly improved over the FDRs (60% for GO and 49% for KEGG) from the direct prediction scheme [48]. This result is reasonable because our algorithms identified significant protein pairs that are more functionally associated than the direct-interacting pairs in the human PPI data, and we made functional inferences from these significant pairs, not from direct protein interactions which may suffer large amounts of false positives generated in high throughput assays. Human proteins may have multiple functions and belong to different functional modules, so different signaling pathways may also have some pathway members in common. It is thus reasonable to assume that the overlap of distribution (Fig. 7b-e), especially of peaks, may reveal the functional relevance of different pathways. For example, the death-receptor and NF-kB signaling pathways overlap in the peak area, and the T-and B-cell receptor signaling pathways have a similar distribution. Therefore, the cluster and its pathway distributions will be useful in multipathway analysis and accurate function prediction.
We also developed a new algorithm for computing the probabilities that three proteins share m interacting partners (see Text S1 in supporting information). However, we found that if three proteins have a very low probability of sharing m interacting partners, in most cases two of them will have a very low P 1 . Because this algorithm is highly dependent on Algorithm I (P 1 ), we do not think it provides more information worthy of further investigation.
In conclusion, we proposed an improved method to predict protein functional association and make reliable functional annotations; we derived a cluster to investigate signaling pathways and suggest potential novel pathway members. We believe that with the explosion of available human PPI data, our method will contribute greatly to the functional research of human proteins.

Protein-Protein Interaction Data
From the BioGRID (www.thebiogrid.org), we downloaded the human PPI data (version 2.20), which derived from both conventional focused studies (,69.6%) and high-throughput studies (yeast two-hybrid; ,30.4%) [29]. There are 20,019 total non-redundant interactions (excluding self-interactions) and 7,362 protein entries in this dataset, including 42 nonhuman proteins that interact with human proteins.

Benchmarks for evaluating the functional association
We used GO and KEGG as independent benchmarks to assess the functional association of each protein pair. GO and KEGG databases provide specific pathways, functions and cellular components for proteins in our PPI data: we classified the 7,362 proteins into 237 KEGG pathways and 1956 qualified GO terms (including biological process, molecular function and cellular component). These databases are good references for evaluating functional association because of its reasonable coverage of the genome and its large number of categories, which makes it improbable to have random matching of pathways.

Annotation overlap rate
With GO annotation (R package: GO, 08-Aug-2006), we defined the GO overlap rate as follows: overlap rate~T Q TA , where T Q is the number of protein pairs of which both proteins share at least one qualified GO term; T A is the number of protein pairs of which both proteins are annotated with qualified GO terms. Here ''qualified GO terms'' means GO terms at the highest level without direct or indirect GO ''offspring'' terms in each ontology (the level is defined as the number of nestings from the root node (level 1) in the Gene Ontology DAG file [33]).
We defined the KEGG overlap rate in the same way as above (R package: KEGG, Release 41.1). We used the GO and KEGG overlap rates to assess the functional association of protein pairs: a higher overlap rate corresponds to a closer functional relationship. Definition of FDR for the declared significant functional associations Suppose the GO and KEGG pathways are complete: if both proteins in each pair have KEGG pathway identifiers and qualified GO terms, we call them declared positive protein pairs. If they share at least one identifier (either GO or KEGG identifier), we consider this declared association true positive; otherwise we consider it false positive. Therefore, the FDR can be written as follows: FDR~1{ number of true positive protein pairs number of declared positive protein pairs This false discovery rate is used to assess the performance of our algorithm as we expect an improved annotation scheme will lower the proportion of wrong predictions among declared significant functional associations.

Pathway analysis tool
We used IngenuityH Pathway Analysis (IPA) 5.0 software (Ingenuity Systems, Inc., Redwood City, CA) to identify existing pathway members and calculate P values for signaling pathways identified in our cluster.

Supporting Information
Text S1 Found at: doi:10.1371/journal.pone.0006410.s001 (0.12 MB DOC) Figure S1 Density plot of the distributions of P 1 and P 2 (two dimensions) from the human protein-protein interaction (PPI) network (a) and the randomized but truncated power-law preserving PPI network (b). The vertical and horizontal lines stand for the thresholds from Algorithms I and II, respectively. In a random PPI network (with truncated power-law), the expectation of significant protein associations is 86 (lower left in b) compared with 4,233 significant associations in the real PPI network (lower left in a). Found at: doi:10.1371/journal.pone.0006410.s002 (0.15 MB TIF) Figure S2 A cluster that consists of 1729 human proteins. Indices above protein names are their coordinates in this cluster. Found at: doi:10.1371/journal.pone.0006410.s003 (0.08 MB PDF) Figure S3 Estimation of prediction precise rates and the number of predictions we can make given different n (n is the minimal frequency of annotation occurrence required for functional prediction). (a) Estimated precise rate of predicted KEGG pathways given n. (b) The number of predictions for KEGG pathway we can make given n. (c) Estimated precise rate of predicted GO terms given n. (d) The number of predictions for GO terms we can make given n.