Revisiting Date and Party Hubs: Novel Approaches to Role Assignment in Protein Interaction Networks

The idea of “date” and “party” hubs has been influential in the study of protein–protein interaction networks. Date hubs display low co-expression with their partners, whilst party hubs have high co-expression. It was proposed that party hubs are local coordinators whereas date hubs are global connectors. Here, we show that the reported importance of date hubs to network connectivity can in fact be attributed to a tiny subset of them. Crucially, these few, extremely central, hubs do not display particularly low expression correlation, undermining the idea of a link between this quantity and hub function. The date/party distinction was originally motivated by an approximately bimodal distribution of hub co-expression; we show that this feature is not always robust to methodological changes. Additionally, topological properties of hubs do not in general correlate with co-expression. However, we find significant correlations between interaction centrality and the functional similarity of the interacting proteins. We suggest that thinking in terms of a date/party dichotomy for hubs in protein interaction networks is not meaningful, and it might be more useful to conceive of roles for protein-protein interactions rather than for individual proteins.


Introduction
Protein interaction networks, constructed from data obtained via techniques such as yeast two-hybrid (Y2H) screening, do not capture the fact that the actual interactions that occur in vivo depend on prevailing physiological conditions. For instance, actively expressed proteins vary amongst the tissues in an organism and also change over time. Thus, the specific parts of the interactome that are active, as well as their organisational form, might depend a great deal on where and when one examines the network [1,2]. One way to incorporate such information is to use mRNA expression data from microarray experiments. Han et al. [1] examined the extent to which hubs in the yeast interactome are co-expressed with their interaction partners. They defined hubs as proteins with degree at least 5, where ''degree'' refers to the number of links emanating from a node. Based on the averaged Pearson correlation coefficient (avPCC) of expression over all partners, they concluded that hubs fall into two distinct classes: those with a low avPCC (which they called date hubs) and those with a high avPCC (so-called party hubs). They inferred that these two types of hubs play different roles in the modular organisation of the network: Party hubs are thought to coordinate single functions performed by a group of proteins that are all expressed at the same time, whereas date hubs are described as higher-level connectors between groups that perform varying functions and are active at different times or under different conditions. The validity of the date/party hub distinction has since been debated in a sequence of papers [3][4][5][6], and there appears to be no consensus on the issue. Two established points of contention are: (1) Is the distribution of hubs truly bimodal (as opposed to exhibiting a continual variation without clear-cut groupings) and (2) is the date/party distinction that was originally observed a general property of the interactome or an artefact of the employed data set? Different statistical tests have suggested seemingly different answers. However, despite (or in some cases due to) this ongoing debate, the hypothesis has been highly prominent in the literature [2,[7][8][9][10][11][12][13][14][15]. Here, following up on the work of Batada et al. [3,5], we revisit the initial data and suggest additional problems with the statistical methodology that was employed. In accordance with their results, we find that the differing behaviour observed on the deletion of date and party hubs [1], which seemed to suggest that date hubs were more essential to global connectivity, was largely due to a very small number of key hubs rather than being a generic property of the entire set of date hubs. More generally, we use a complementary perspective to Batada et al. to define structural roles for hubs in the context of the modular organisation of protein interaction networks. Our results indicate that there is little correlation between expression avPCC and structural roles. In light of this, the more refined categorisation of date, party, and 'family' hubs, which was based on taking into account differences in expression variance in addition to avPCC [8], also appears inappropriate. A recent study by Taylor et al. [2] argued for the the existence of 'intermodular' and 'intramodular' hubs-a categorisation along the same lines as date and party hubs-in the human interactome. We show that their observation of a binary hub classification is susceptible to changes in the algorithm used to normalise microarray expression data or in the kernel function used to smooth the histogram of the avPCC distribution. The data does not in fact display any statistically significant deviation from unimodality as per the DIP test [16,17], as has already been observed by Batada et al. [3,5] for yeast data. We revisited the bimodality question because it was a key part of the original paper [1], and in particular because it made a reappearance in Taylor et al. [2] for human data. However, it is possible that a date-party like continuum may exist even in the absence of a bimodal distribution, and this is why we also attempt to examine the more general question of whether the network roles of hub proteins really are related to their co-expression properties with interaction partners.
Many real-world networks display some sort of modular organisation, as they can be partitioned into cohesive groups of nodes that have a relatively high ratio of internal to external connection densities. Such sub-networks, known as communities, often correspond to distinct functional units [18][19][20]. Several studies in recent years have considered the existence of community structure in protein-protein interaction networks [21][22][23][24][25][26][27][28][29]. A myriad of algorithms have been developed for detecting communities in networks [19,20]. For example, the concept of graph 'modularity' can be used to quantify the extent to which the number of links falling within groups exceeds the number that would be expected in an appropriate random network (e.g., one in which each node has the same number of links as in the network of interest, but which are randomly placed) [30]. One of the standard techniques to detect communities is to partition a network into sub-networks such that graph modularity is maximised [19,20].
We use the idea of community structure to take a new approach to the problem of hub classification by attempting to assign roles to hubs purely on the basis of network topology rather than on the basis of expression data. Our rationale is that the biological roles of date and party hubs are essentially topological in nature and should thus be identifiable from the network alone (rather than having to be inferred from additional information). Once we have partitioned the network into a set of meaningful communities, it is possible to compute statistics to measure the connectivity of each hub both within its own community and to other communities. One method for assigning relevant roles to nodes in a metabolic network was formulated by Guimerà and Amaral [31], and we follow an analogous procedure for the hubs in our protein interaction networks. We then examine the extent to which these roles match up with the date/party hypothesis, finding little evidence to support it.
One might also wonder about the extent to which observed interactome properties are dependent on the particular instantiation of the network being analysed. Several papers have discussed at length concerns about the completeness and reliability (or lack thereof) of existing protein interaction data sets, e.g. [32][33][34][35][36][37][38]. Such data have been gathered using multiple methods, the most prominent of which are Y2H and affinity purification followed by mass spectrometry (AP/MS). (See the discussion in Materials and Methods.) In a recent paper, Yu et al. examined the properties of interaction networks that were derived from different sources, suggesting that experimental bias might play a key role in determining which properties are observed in a given data set [11]. In particular, their findings suggest that Y2H tends to detect key interactions between protein complexes-so that Y2H data sets may contain a high proportion of date hubs (i.e., hubs with low partner co-expression)-whereas AP/MS tends to detect interactions within complexes, so that hubs in AP/MS-derived networks are predominantly highly co-expressed with their partners (i.e., these networks will contain party hubs). This indicates that a possible reason for observing the bimodal hub avPCC distribution [1] is that the interaction data sets used information that was combined from both of these sources. Here we compare several yeast interaction data sets and find both widely differing structural properties and a surprisingly low level of overlap.
Finally, as an alternative to the node-based date/party categorisation, we suggest thinking about topological roles in networks by defining measures on links rather than on nodes. In other words, one can attempt to categorise interactions between proteins rather than the proteins themselves. We use a well-known measure of link significance known as betweenness centrality [18,39] and examine its relation to phenomena such as protein co-expression and functional overlap. Here as well we find little evidence of a significant correlation with expression PCC of the interactors. However, there seems to be a reasonably strong relation between link betweenness and functional similarity of the interacting proteins, so that link-centric role definitions might have some utility.
In summary, we have examined the proposed division of hubs in the protein interaction network into the date and party categories from several different angles, demonstrating that prior arguments in favour of a date/party dichotomy appear to be susceptible to various kinds of changes in the data and methods used. Observed differences in network vulnerability to attacks on the two hub types seem to arise from only a small number of particularly important hubs. These results strengthen the existing evidence against the existence of date and party hubs. Furthermore, a detailed analysis of network topology, employing the novel perspective of community structure and the roles of hubs within this context, suggests that the picture is more complicated than a simple dichotomy. Proteins in the interactome show a variety of topological characteristics that appear to lie along a continuum-and there does not exist a clear correlation between their location on this continuum and the avPCC of expression of their interaction partners. On the other hand, investigating link

Author Summary
Proteins are key components of cellular machinery, and most cellular functions are executed by groups of proteins acting in concert. The study of networks formed by protein interactions can help reveal how the complex functionality of cells emerges from simple biochemistry. Certain proteins have a particularly large number of interaction partners; some have argued that these ''hubs'' are essential to biological function. Previous work has suggested that such hubs can be classified into just two varieties: party hubs, which coordinate a specific cellular process or protein complex; and date hubs, which link together and convey information between different function-specific modules or complexes. In this study, we re-examine the ideas of date and party hubs from multiple perspectives. By computationally partitioning protein interaction networks into functionally coherent subnetworks, we show that the roles of hubs are more diverse than a binary classification allows. We also show that the position of an interaction in the network is related to the functional similarity of the two interacting proteins: the most important interactions holding the network together appear to be between the most dissimilar proteins. Thus, examining interaction roles may be relevant to understanding the organisation of protein interaction networks.
(interaction) betweenness centralities reveals an interesting relation to the functional linkage of proteins, suggesting that a framework incorporating a more nuanced notion of roles for both nodes and links might provide a better framework for understanding the organisation of the interactome.

Revisiting Date and Party Hubs
The definitions of date and party hubs are based on the expression correlations of hubs with their interactors in the protein interaction network . Specifically, the avPCC has been computed for each hub and its distribution was observed by Han et al. [1] to be bimodal in some cases. A date/party threshold value of avPCC (for a given expression data set) was defined in order to optimally separate the two types of hubs [1].
We have re-examined the data sets and analyses that were used to propose the existence and dichotomy of date versus party hubs. In the original studies on yeast data [1,4], any hub that exhibited a sufficiently high avPCC (i.e., any hub lying above the date/party threshold) on any one expression data set was identified as a party hub. Batada et al. [5] noted that this definition causes the date/ party assignment to be overly conservative, in that a hub's status is unlikely to change as a result of additional expression data. In fact, some of the original expression data sets were quite small, containing fewer than 10 data points per gene. This suggests that classification of proteins as 'party' hubs was based on high coexpression with partners for just a small number of conditions in a single microarray experiment, even though such co-expression need not have been observed in other conditions and experiments. For instance, Han et al. found 108 party hubs in their initial study [1]. However, calculating avPCC across their entire expression compendium (rather than separately for the five constituent microarray data sets) and using the date/party threshold specified by the authors for this compendium avPCC distribution yields just 59 party hubs. Using only the ''stress response'' data set [40], which comprises over half of the data points in their compendium and is substantially larger than the other 4 sets, yields 74 party hubs. Thus, the results of applying this method to categorise hubs depend heavily on the expression data sets that one employs and is vulnerable to variability in smaller microarray experiments.
Recent support for the idea of date and party hubs appeared in a paper that considered data relating to the human interactome; the authors found multimodal distributions of avPCC values, seemingly supporting a binary hub classification [2]. We used an interaction data set provided by Taylor et al. [2] (an updated version of the one used in their paper, sourced from the Online Predicted Human Interaction Database (OPHID) [41]; see Materials and Methods), and found that the form of the distribution of hub avPCC that they observed is not robust to methodological changes. For instance, raw intensity data from microarray probes has to be processed and normalised in order to obtain comparable expression values for each gene [42]. The expression data used by Taylor et al. [2] (taken from the human GeneAtlas [43]) was normalised using the Affymetrix MAS5 algorithm [44]; when we repeated the analysis using the same data normalised by the GCRMA algorithm [45] (which is the preferred method to control for probe affinity) instead of by MAS5, we obtained significantly different results. Figure 1 depicts the avPCC distributions for hubs (defined as the top 15% of nodes by degree [2], corresponding in this case to degree 15 or greater) in the two cases. We obtained density plots for varying smoothing kernel widths. The GCRMA-processed data does not appear to lead to a substantially bimodal distribution at any kernel width, whereas the MAS5-processed data appears to give bimodality for only a relatively narrow range of widths and could just as easily be regarded as trimodal. We also used Hartigan's DIP test [16,17,46] to check whether either of the two versions of the expression data gives a distribution of avPCC values showing significant evidence of bimodality. The DIP value is a measure of how far an observed distribution deviates from the best-fit unimodal distribution, with a value of 0 corresponding to no deviation. We used a bootstrap sample of 10,000 to obtain p-values for the DIP statistic. We found no significant deviation from unimodality: for MAS5, the DIP value is 0:0087 (p-value &0:821) and for GCRMA the DIP value is 0:0062 (p-value &0:998). This suggests that the apparent bimodal or trimodal nature of some of the curves in Figure 1 is illusory and not statistically robust.  [2]). Gene expression data from GeneAtlas [43], normalised using (a) MAS5 and (b) GCRMA [42]. Curves obtained using a normal smoothing kernel function at varying window widths. Hartigan's DIP test for unimodality [16,17] returns values of 0.0087 (p-value &0:821) for (a) and 0.0062 (p-value &0:998) for (b), indicating no significant deviation from unimodality in either case. doi:10.1371/journal.pcbi.1000817.g001 We also find variability across different interaction data sets: For instance, we analysed the recent protein-fragment complementation assay (PCA) data set [47] and found no clear evidence of a bimodal distribution of hubs along date/party lines (data not shown). Even in cases where multimodality is observed, it might be arising as a consequence or artefact of combining different types of interaction data; there are believed to be significant and systematic biases in which types of interactions each data-gathering method is able to obtain [11,29,47]. For instance, analysing avPCC values on the stress response expression data set [40] for hubs in networks obtained from Y2H or AP/MS alone [11], we find that 100% (259/259) are date hubs in the former but that only about 30% (56/186) are date hubs in the latter. At the moment it is reasonable to entertain the possibility that new kinds of interaction tests might smear out the observed bimodality; this appears to be the case with the PCA data set.
One of the key pieces of evidence used to argue that date and party hubs have distinct topological properties was the apparent observation of different effects upon their deletion from the network. Removing date hubs seemed to lead to very rapid disintegration into multiple components, whereas removal of party hubs had much less effect on global connectivity [1,4]. However, it has been observed that removing just the top 2% of hubs by degree from the comparison of deletion effects obviates this difference, suggesting that the observation is actually due to just a few extreme date hubs [5]. In order to study this in greater detail, and to isolate the extreme hubs, we used node betweenness centrality [39] (see Materials and Methods), a standard metric of a node's importance to network connectivity (this need not be strongly correlated with degree). We found that in the original 'filtered yeast interactome' (FYI) data set [1], date hubs have on average somewhat higher betweenness centralities (1:79|10 4 for 91 date hubs versus 1:07|10 4 for 108 party hubs, two-sample t-test p-value &0:08). However, there happens to be one date hub (SPC24/UniProtKB:Q04477, a highly connected protein involved in chromosome segregation [48]) that has an exceptionally high betweenness (2:45|10 5 ) in this network. When the set of date hubs minus this one hub is targeted for deletion, we find that the observed difference between date and party hubs is greatly reduced (Figure 2(a)).
It was subsequently shown that the FYI network was particularly sparse; as more data became available, the updated filtered high-confidence (FHC) data set was used to perform the same analysis [4] (we also looked at the Y2H-only and AP/MSonly networks [11]; see Figure S1). In the case of FHC, the network did not break down on removing date hubs but nevertheless displayed a substantially greater increase in characteristic path length (CPL) than seen for party hub deletion. For FHC too, date hubs have, on average, higher betweenness values than party hubs (3:7|10 4 for 306 date hubs versus 2:15|10 4 for 240 party hubs, p-value &0:06). However, the larger average is due almost entirely to a small number of hubs with unusually high betweennesses, as removing the top 10 date hubs by betweenness (which all had values higher than any party hub) greatly reduced the difference between the distributions (p-value &0:29). Furthermore, the removal of just these 10 hubs from the set of targeted date hubs is sufficient to virtually obviate the difference with party hubs, as shown in Figure 2(b). Notably, the set of 10 highbetweenness hubs includes prominent proteins such as Actin (ACT1/UniProtKB:P60010), Calmodulin (CMD1/UniProtKB: P06787), and the TATA binding protein (SPT15/Uni-ProtKB:P13393), which are known to be key to important cellular processes (Table 1). Thus, we can account for the critical nodes for network connectivity using just a few major hubs, and most of the proteins that are classified as date hubs appear to be no more central than the party hubs. High betweenness nodes have previously been referred to as bottlenecks [7] and it has been suggested that these are in general highly central and tend to correspond to date hubs. However, the same sort of analysis on the Yu et al. data set [7] once again revealed that only the top 0.5% or so of nodes by betweenness are truly critical for connectivity (data not shown). Additionally, the 10 key hubs in the FHC network  show a wide range of avPCC values (Table 1): high betweenness does not necessitate low avPCC. Similarly, we found no strong correspondence between bottleneck/non-bottleneck and date/ party distinctions across multiple data sets. These observations further weaken the claim that there is an inverse relation between a hub's avPCC and its central role in the network.

Topological Properties and Node Roles
In principle, one should be able to view a categorisation of hubs according to the date/party dichotomy directly in the network structure, as the two kinds of hubs are posited to have different neighbourhood topologies. We thus leave gene expression data to one side for the moment and focus on what can be inferred about node roles purely from network topology. Guimerà and Amaral [31] have proposed a scheme for classifying nodes into topological roles in a modular network according to their pattern of intramodule and intermodule connections. Their classification uses two statistics for each node-within-community degree and participation coefficient (a measure of how well spread out a node's links are amongst all communities, including its own)-and divides the plane that they define into regions encompassing seven possible roles (see Materials and Methods for details). We depict these regions in Figure 3, which shows the node roles for yeast (FHC [4]) and human (Center for Cancer Systems Biology Human Interactome version 1 (CCSB-HI1) [49]) data sets, which we computed based on communities detected by optimising modularity via the Potts method [50] (see Text S1 for details, and Figure S4 and Table S1 for indications of the structural and functional coherence of the communities, respectively). Also, when partitioning the network using this method, one can adjust the resolution to get more or fewer communities. In Figure S2, we show the results of this computation repeated for two other values of the resolution parameter. In each case, we obtain a similar pattern to the results shown here, and the conclusions below are valid across the multiple resolutions examined.
Some of the topological roles defined by this method correspond at least to some extent to those ascribed to date/party hubs. For instance, one might argue that party hubs ought to be 'provincial hubs', which have many links within their community but few or none outside. Date hubs might be construed as 'non-hub connectors' or 'connector hubs', both of which have links to several different modules; they could also fall into the 'kinless' roles (though very few nodes are actually classified as such). We thus sought to examine the relationship between the date/party classification and this topological role classification. In Figure 3, we colour proteins according to their avPCC. In Figure 4, we present the same data in a more compact form, as we only show the hubs (defined as the top 20% of nodes ranked by degree [4]) in the two interaction networks, plotting them according to node role and avPCC. The horizontal lines correspond to an avPCC of 0.5, which was the threshold used to distinguish date and party hubs in the yeast interactome [4].
One immediate observation from these results is that the avPCC threshold clearly does not carry over to the human data. In fact, all of the hubs in the latter have an avPCC of well below 0.5. Even if we utilise a different threshold in the human network, we find that there is little difference in the avPCC distribution across the topological roles, suggesting that no meaningful date/party categorisation can be made (at least for this data set). This might be the case because the human data set represents only a small fraction of the actual interactome. Additionally, it is derived from only one technique (Y2H) and is thus not multiply-verified like the yeast data set.
For yeast, we see that hubs below the threshold line (i.e., the supposed date hubs) include not only virtually all of those that fall   into the 'connector' roles but also many of the 'provincial hubs'. On the other hand, those that lie above the line (i.e., the supposed party hubs) include mainly the provincial hub and peripheral categories. Although one can discern a difference in role distributions above and below the threshold, it is not very clearcut and the so-called date hubs fall into all 7 roles. It would thus appear that even for yeast, the distribution of hubs does not clearly fall into two types (the original statistical analysis has already been disputed by Batada et al. [3,5]), and the properties attributed to date and party hubs [1] do not seem to correspond very well with the actual topological roles that we estimate here. Indeed, these roles are more diverse than what can be explained using a simple dichotomy.

Data Incompleteness and Experimental Limitations
It has been proposed that date and party hubs play different roles with respect to the modular structure of protein interaction data. As there are diverse examples of such data, one might ask to what extent entities like date and party hubs can be consistently defined across these. In order to investigate the extent of network overlap and the preservation of the interactome's structural properties (such as community structure and node roles) for  [31], we designate the roles as follows: R1 -Ultra-peripheral; R2 -Peripheral; R3 -Non-hub connector; R4 -Non-hub kinless; R5 -Provincial hub; R6 -Connector hub; and R7 -Kinless hub. We colour proteins according to the avPCC of expression with their interaction partners. We computed expression avPCC using the stress response data set [40] (which was the largest, by a considerable margin, of the expression data sets used in the original study [1]) for FHC and COXPRESdb [67] for CCSB-HI1. No partner expression data was available for a few proteins (25 in FHC, 1 in CCSB-HI1)these are not shown on the plots. doi:10.1371/journal.pcbi.1000817.g003 -326 hubs with a minimum degree of 4) networks. Larger circles represent means over all nodes in a given role. Note that 'hub' as used in the role names refers only to within-community hubs, but all of the depicted nodes are hubs in the sense that they have high degree. In each case, we determined the degree threshold so that approximately the top 20% highestdegree nodes are considered to be hubs. We also fixed the date/party avPCC threshold at 0.5, in accordance with Bertin et al. [4]. doi:10.1371/journal.pcbi.1000817.g004 different data sets and data-gathering techniques, we compared statistics and results for four different yeast interaction data sets: FYI, FHC, Database of Interacting Proteins core (DIPc), and PCA (see Table 2 and Materials and Methods for details of these). Our motivation for these choices of data sets (aside from PCA) was that they all encompass multiply-verified or high-confidence interactions. We also used PCA data because it is from the first large-scale screen with a new technique that records interactions in their natural cellular environment [47]. For each data set, we counted the number of nodes and links in common using pairwise comparisons in the largest connected component of the network. For the overlapping portions, we then computed the extent of overlap in node roles and communities. For the latter, we employed the Jaccard distance [51], which ranges from 0 for identical partitions to 1 for entirely distinct ones (see Materials and Methods). In Table 3, we present the results of our binary comparisons of the yeast data sets. Table 3 reveals that there are large variations amongst the different networks reported in the literature. FYI, FHC, and DIPc are all regarded as high-quality data sets, yet they contain numerous disparate interactions. PCA has a very low overlap with both FYI and DIPc (considered separately), suggesting that it provides data that is not captured by either Y2H or AP/MS screens. Such differences unsurprisingly lead to nodes having variable community structure between data sets. The Jaccard distance for each pairwise comparison amongst the 4 networks is around 0.8, so on average the intersection of communities for the same node covers only about a fifth of their union (for comparison purposes, communities are computed over the complete network in each case, and then each community is pruned to retain only those nodes also present in the other network). Because we compute topological node roles relative to community structure, it is not surprising that the role overlap is also not very high in any of the cases.
Given the above, it is difficult to make any general inferences regarding proteome organisation from results on existing protein interaction networks. They depend a great deal on the explored data set, which in each case represents only part of the total interactome and may also contain substantial noise.

The Roles of Interactions
Most research on interactome properties has focused on nodecentric metrics, which draws on the perspective of individual proteins (e.g., [1,8,52,53]). Here we try an alternative approach that instead uses link-centric metrics in order to examine how the topological properties of interactions in the network relate to their function. In order to quantify the importance of a given link to global network connectivity, we use link betweenness centrality [18,39] (see Materials and Methods). We investigate the relationship between link betweenness and the expression correlation for a given interaction. If date and party hubs genuinely exist, one might expect a similar sort of dichotomy for interactions, with more central interactions having lower expression correlations and vice versa. That is, given the hypothesised functional roles of date and party hubs, most intermodular interactions would connect to a date hub, whereas most intramodular interactions would connect to a party hub. In Figure 5, we depict all of the interactions in two yeast data sets, which we position on a plane based on the values of their link betweenness and interactor expression PCC (calculated using the stress response data set as before). Additionally, we colour each point according to the level of functional similarity between the interacting proteins, as determined by overlap in GO (Cellular Component) annotations (see Materials and Methods). We also obtain similar results using the other two GO ontologies, which are shown in Figure S3.
For the FHC data set, we find no substantial relation between expression PCC and the logarithm of link betweenness (linear Pearson correlation &{0:04, z-score &{3:1, p-value &0:0022).  Methods, this is a finite-size effect.) This is somewhat reminiscent of the weak/strong tie distinction in social networks [54,55], as the 'weak' (high betweenness) interactions serve to connect and transmit information between distinct cellular modules, which are composed predominantly of 'strong' (low betweenness) interactions. For instance, we found that interactions involving kinases fall largely into the 'weak' category. Additionally, GO terms such as intracellular protein transport, GTP binding, and nucleotide binding were enriched significantly in proteins involved in high-betweenness interactions.

Discussion
In this paper, we have analysed modular organisation and the roles of hubs in protein interaction networks. We revisited the possibility of a date/party hub dichotomy and found points of concern. In particular, claims of bimodality in hub avPCC distributions do not appear to be robust across available interaction and expression data sets, and tests for the differences observed on deletion of the two hub types have not considered important outlier effects. Moreover, there is considerable evidence to suggest that the observed date/party distinction is at least partly an artefact, or consequence, of the different properties of the Y2H and AP/MS data sets.
In order to study the topological properties of hub nodes in greater detail, we partitioned protein interaction networks into communities and examined the statistics of the distributions of hub links. Our results show that hubs can exhibit an entire spectrum of structural roles and that, from this perspective, there is little evidence to suggest a definitive date/party classification. We find, moreover, that expression avPCC of a hub with its partners is not a strong predictor of its topological role, and that the extent of interacting protein co-expression varies considerably across the data sets that we examined.
Additionally, a key issue with existing interaction networks is that they are incomplete. We have compared some of the available 'high-quality' yeast data sets and shown that they have very little overlap with each other. One can obtain protein interaction data Pairwise comparisons of the largest connected components of different yeast protein interaction data sets. Notes: 1 Proteins occurring in both networks. 2 Links amongst the common nodes as counted in the previous column: individually in either network and common to both networks. 3 Communities and node roles computed over entire data sets; for pairwise comparison, we then narrow down communities in each case to only those nodes also present in the data set being compared to.  using several experimental techniques, and each method appears to preferentially pick up different types of interactions [11,29]. The only published interactome map of which we are aware that examines proteins in their natural cellular environment [47] is largely disjoint with other data sets and shows little evidence of a date/party dichotomy. We find similar issues in human interaction data sets. A general conclusion about interactome properties is thus difficult to reach, as it would require robust results for a number of different species, which are unattainable at present due to the limited quantity and questionable quality of protein interaction and expression data.
As an alternative way of defining roles in the interactome, we have also investigated a link-centric approach, in which we study the topological properties of links (interactions) as opposed to nodes (proteins). In particular, we examined link betweenness centrality as an indicator of a link's importance to network connectivity. We found that this too does not correlate significantly with expression PCC of the interacting proteins. For certain data sets, however, it does appear to correlate quite strongly with the functional similarity of the proteins. Additionally, there appears to be a threshold value of betweenness centrality beyond which one observes a sudden drop in functional similarity. We also found that the high-betweenness interactions are enriched for kinase bindings and other kinds of interactions involved in signalling and transportation functions. This suggests that a notion of intramodular versus intermodular interactions, somewhat analogous to the weak/strong tie dichotomy in social networks, might be more useful. However, further work would be required to establish such a framework of elementary biological roles in protein interaction networks. As the quantity, quality, and diversity of protein interaction and expression data sets increases, we hope that this perspective will enhance understanding of the organisational principles of the interactome.

Protein Interaction Data Sets
Several experimental methods can be used to gather protein interaction data. These include high-throughput yeast two-hybrid (Y2H) screening [56][57][58][59]; affinity purification of tagged proteins followed by mass spectrometry (AP/MS) to identify associated proteins [60,61]; curation of individual protein complexes reported in the literature [62]; and in silico predictions based on multiple kinds of gene data [63]. There is also a more recent technique, known as the protein-fragment complementation assay (PCA) [47], which is able to detect protein-protein interactions in their natural environment within the cell. However, only one large-scale study has used this technique thus far [47]. Each of these methods gives an incomplete picture of the interactome; for instance, a recent aggregation of high-quality Y2H data sets for Saccharomyces cerevisiae (the best-studied organism) was estimated to represent only about 20% of the whole yeast binary protein interaction network [11].
Each technique also suffers from particular biases. It has been suggested that Y2H is likely to report binary interactions more accurately, and (due to the multiple washing steps involved in affinity purification) it is also expected to be better at detecting weak or transient interactions [11]. Converting protein complex data into interaction data is also an issue with AP/MS. This method entails using a 'bait' protein to 'capture' other proteins that subsequently bind to it to form complexes. Once one has obtained these complexes and identified their proteins using mass spectrometry, one can assign protein-protein interactions using either the spoke or the matrix model [64]. The spoke model only counts interactions between the bait and each of the proteins captured by it, whereas the matrix model counts all possible pairwise interactions in the complex. Unsurprisingly, the actual topology of the complex is generally different from either of these representations. On the other hand, AP/MS is expected to be more reliable at finding permanent associations. Two-hybrid approaches also do not seem to be particularly suitable for characterising protein complexes, giving rise to the view that complex formation is not merely the superposition of binary interactions [61]. Thus, the two major techniques appear to be disjoint and to cover different aspects of the interactome, and the differences between data sets from these sources perhaps correspond mostly to false negatives rather than false positives [11].
Given these factors, choosing which data sets to use for building and analysing the network is itself a significant issue (see the discussion in the main text). For our analysis, we chose to work predominantly with networks consisting of multiply-verified interactions, which are constructed from evidence attained using at least two distinct sources. Such data sets are unlikely to contain many false positives, but might include many false negatives (i.e., missing interactions). In Table 2, we summarise the data sets that we employed. Here are additional details about how they were compiled:

N Online Predicted Human Interaction Database (OPHID):
This data was sent to us by Taylor et al. [2]; it is an updated version of the interaction data used in their paper. It is based on their curation of the online OPHID repository [41]; they have mapped proteins to their corresponding NCBI (National Center for Biotechnology Information) gene IDs. Additionally, we removed genes that did not have expression data in GeneAtlas [43] (avPCC cannot be calculated for these, as GeneAtlas is the only expression data set used by Taylor et al.  N Filtered High-Confidence (FHC): This data set was generated by Bertin et al. [4] by filtering a data set called high-confidence (HC), which was compiled by Batada et al. [3]. To conduct the filtration Bertin et al. applied criteria similar to those used for FYI and obtained 5991 independently-verified interactions amongst 2559 proteins. HC consists of 9258 interactions amongst 2998 proteins, taken from (published) literaturecurated and high-throughput data sets, and they were also supposed to be multi-validated. However, Bertin et al. [4] claimed that many interactions in HC had in fact been derived from a single experiment that was reported in multiple publications and thus removed such instances from it to generate FHC. N Database of Interacting Proteins core (DIPc): We obtained this data set from the DIP website (http://dip.doe-mbi.ucla.edu/). DIP is a large database of protein interactions compiled from a number of sources. The 'core' subset of DIP consists of only the most reliable interactions, as judged manually by expert curators and also automatically using computational approaches [65]. We used the version dated 7 October 2007, which contains 2808 proteins and 6212 interactions. interactions. An attractive feature of this data set is that it measures interactions between proteins in their natural cellular context, in contrast to other prominent methods, such as Y2H (which requires transportation to the cell nucleus) and AP/MS (which requires multiple rounds of in vitro purification). To our knowledge, this is the only published large-scale interaction study of this sort.
N Center for Cancer Systems Biology Human Interactome version 1 (CCSB-HI1): This data set was constructed by Rual et al. [49] using a high-throughput yeast two-hybrid system, which they employed to test pairwise interactions amongst the products of about 8100 human open reading frames. The data set, which contains 2611 interactions amongst 1549 proteins, achieved a verification rate of 78% in an independent coaffinity purification assay (that is, from a representative sample of interactions in the data set, 78% could be detected in the independent experiment).

Betweenness Centrality
Betweenness centrality is a way of quantifying the importance of individual nodes or links to the connectivity of a network. It is based on the notion of information flow in the network. The (geodesic) betweenness centrality of a node/link is defined as the number of pairwise shortest paths in the network that pass through that object [18,39]. If there are multiple shortest paths between a pair of nodes, each one is given equal weight so that all of their weights sum to unity. Thus, the weighted count of all pairwise shortest paths passing through a given node/link equals its betweenness centrality.
For finite, sparse, unweighted networks such as the ones we study, one observes an interesting effect in the distribution of link betweenness centrality values. The distribution is almost normal, with the exception of a large spike at a value well above the mean (see the long vertical bar of points in the plots in Figure 5). This results from the large number of nodes with degree 1. The link that connects such a node to the rest of the network must have a betweenness of N{1, where N is the total number of nodes in the network. Simply, this link must lie on the N{1 shortest paths that connect the degree 1 node to all of the other nodes, and it cannot lie on any other shortest paths. Thus, for our networks, the link betweenness centrality distribution shows a strong spike at a value of precisely N{1.

Topological Metrics and Node Roles
The within-community degree refers to the number of connections a node has within its own community. It is normalised here to a z-score, which for the i th node is given by the formula where s i denotes the community label of node i, k i is the number of links of node i to other nodes in the same community s i , the quantity k k s i is the average of k i for all nodes in community s i , and s ks i is the standard deviation of k i in community s i . The participation coefficient of node i measures how its links are distributed amongst different communities. It is defined as [31] where N is the number of communities, k is is the number of links of node i to nodes in community s, and k i is the total degree of node i. The participation coefficient approaches 1 if the links of node i are uniformly distributed amongst all communities (including its own) and is 0 if they are all within its own community.
In the main text, we plot all nodes in the network in a twodimensional space using coordinates determined by withincommunity degree and participation coefficient, and we divide the space into regions that correspond to different node roles. The boundaries between regions are of course arbitrary, so for simplicity we have used the demarcations employed by Guimerà and Amaral [31]. First, it is important to distinguish between 'community hubs' and 'non-hubs'; the former are defined as those nodes with within-community degree z §2:5. In this context, the term 'hub' is applied to nodes with high within-community degree [31], so 'non-hubs' might have high overall degree. One can further partition both 'community hubs' and 'non-hubs' on the basis of the participation coefficient P as follows [31]: N Non-hubs can be divided into ultra-peripheral nodes (Pƒ0:05-virtually all links within their own community), peripheral nodes (0:05vPƒ0:62-most links within their own community), non-hub connector nodes (0:62vPƒ 0:80-links to many other communities), and non-hub kinless nodes (Pw0:80-links distributed roughly homogeneously amongst all communities).
N Community hubs can be divided into provincial hubs (Pƒ0:30-vast majority of links within own community), connector hubs (0:30vPƒ0:75-many links to most other communities), and kinless hubs (Pw0:75-links distributed roughly homogeneously amongst all communities).
We depict these 7 roles as demarcated regions in the plots in Figure 3.

Jaccard Distance
If one has two partitions of a given set of nodes, and a node i is part of subset (or community) C 1 i of nodes in one partition and part of subset C 2 i in the other partition, then the Jaccard distance [51] for node i across the two partitions is defined as The symbols \ and | correspond, respectively, to set intersection and union, and DCD denotes the number of elements in set C. A Jaccard distance of 0 corresponds to identical communities, whereas the distance approaches 1 for very different communities. By averaging J i ð Þ over all nodes in the set, we can get an estimate of the similarity of the two partitions.

Functional Similarity
In order to compute the functional similarity of two interacting proteins, we first define the set information content (SIC) [66] of each term in our ontology for a given data set. Suppose the complete set of proteins is denoted by S, and the subset annotated by term i is denoted by S i . The SIC of the term i is then defined as SIC i ð Þ~{ log 10 DS i D DSD : Now suppose that we have two interacting proteins called A and B. Let S A and S B , respectively, denote their complete sets of annotations (consisting of not only their leaf terms but also all of their ancestors) from the ontology. Then the functional similarity of the proteins is given by