Functional Integrative Levels in the Human Interactome Recapitulate Organ Organization

Interactome networks represent sets of possible physical interactions between proteins. They lack spatio-temporal information by construction. However, the specialized functions of the differentiated cell types which are assembled into tissues or organs depend on the combinatorial arrangements of proteins and their physical interactions. Is tissue-specificity, therefore, encoded within the interactome? In order to address this question, we combined protein-protein interactions, expression data, functional annotations and interactome topology. We first identified a subnetwork formed exclusively of proteins whose interactions were observed in all tested tissues. These are mainly involved in housekeeping functions and are located at the topological center of the interactome. This ‘Largest Common Interactome Network’ represents a ‘functional interactome core’. Interestingly, two types of tissue-specific interactions are distinguished when considering function and network topology: tissue-specific interactions involved in regulatory and developmental functions are central whereas tissue-specific interactions involved in organ physiological functions are peripheral. Overall, the functional organization of the human interactome reflects several integrative levels of functions with housekeeping and regulatory tissue-specific functions at the center and physiological tissue-specific functions at the periphery. This gradient of functions recapitulates the organization of organs, from cells to organs. Given that several gradients have already been identified across interactomes, we propose that gradients may represent a general principle of protein-protein interaction network organization.


Introduction
In metazoans, differentiation and ontogeny processes lead to the formation of differentiated tissues. Ultimately, these tissues, alone or in combination, constitute organs and ensure specific physiological functions. Deciphering the combinatorial arrangements of transcription factors that lead to tissue-specific gene expression can help explain the diversity of the differentiated tissues and the fundaments of their differences at a global scale [1,2]. However, the combinatorial arrangements of proteins, other than transcription factors, also contribute to the diversity of the hundreds of different differentiated cell types. Therefore, studying the organization of the protein-protein interaction networks could reveal novel information insight on tissue diversity. Furthermore, tissue-specific interactions can occur between proteins which are not strictly tissue-specifically expressed [3], therefore reinforcing the relevance of considering tissue specificity from an interactome perspective.
Protein-protein interaction maps (or interactomes) are sets of interactions that have been experimentally identified using either high throughput technologies (such as large-scale two hybrid screens and affinity purifications mass spectrometry [4][5][6][7][8][9][10][11][12][13][14]) or regular low-scale experiments. They are assembled into large networks representing sets of possible biophysical interactions between the tested proteins. However, because of the experimental methods used to identify interactions, interactomes lack spatiotemporal information, thereby hindering any studies on specific biological contexts or conditions. This has been overcome by the integration of secondary data types such as (i) gene expression to identify different types of hubs and sub-network markers in cancer [15][16][17][18] and (ii) functional annotations to highlight context-specific interactions [19].
In the work described here (Figure 1), we aim at understanding the influence of function on interactome topology and answering questions such as 'Does interactome topology reflect functional issues? Does function 'shape' the interactome? Is there an organizational 'functional logic' in the interactome?'. For this, we classified the interactions according to gene expression and the proteins according to network topology. By interpreting the results functionally using GO annotations, we address tissue diversity from an interactome point of view. We first defined the Largest Common Interactome Network (LCIN) which consists of those interactions possible in all the tissues tested, and whose proteins are mainly involved in housekeeping functions. We show that this LCIN corresponds to a 'functional interactome core', lying at the topological center of the interactome, and that tissue-specific interactions by definition excluded from the LCIN, are interestingly (i) centrally located when involved in regulatory and developmental functions and (ii) located at the periphery when involved in organ physiological functions. Combining interactions, expression, interactome network topology with cellular function annotations, we show that the organization of the interactome follows a functional gradient recapitulating the organization of organs.

Inferring contextualized tissular interactomes
A human interactome composed of 27286 high confidence binary interactions between 9596 proteins was built by joining manually curated interactions derived from the literature, to those reported in APID [20]. First, using the EST clusters from the UniGene database [21] for gene expression data, we inferred several possible tissue-specific proteomes (Table S1). EST clusters were chosen as gene expression data rather than microarray data for coverage and function representation concerns. Indeed, in accordance with Zhu et al. [22] who have shown that microarray data exhibit a high rate of false negative, leading to an underestimation of the number of housekeeping genes, we confirmed the depletion of housekeeping gene detected in microarray data compared to ESTs (File S1).
Second, we postulated that if genes encoding two interactors are co-expressed in a given tissue, then that interaction is possible in that tissue. If either of the genes is not expressed in that tissue, it is assumed that the gene product is also absent, and therefore, that the interaction is impossible. In this way, we recovered 45 inferred proteomes and interactomes corresponding to the 'body sites' proposed by UniGene. We eliminated small and relatively incomplete interactomes due to poor EST coverage of certain tissue transcriptomes [22], and only considered the 22 largest contextualized interactomes, which contain more than 10 000 interactions, for further studies (Table S1).

Interaction usage and Functions
Distribution of the 'interaction usage'. A tissue-specific interaction may exist between proteins that are not necessarily tissue-specifically expressed. This is why it is different to investigate tissue-specific interactions rather than tissue-specific genes/ proteins. Indeed, although the genes may be widely expressed, their shared tissular expression may be restricted to only few tissues, therefore leading to a tissue-specific interaction between gene products not tissue-specifically expressed. This accounts for 20% of the most tissue-specific interactions in our dataset (data not shown).
To reflect this difference, we defined the notion of 'interaction usage' (IU) as the number of tissues in which an interaction is possible. This corresponds to the number of tissues in which both interactors are co-expressed (Table S1). The distribution of the IU values ( Figure 2) shows first, the scarcity of strictly tissue-specific interactions: only 5% of the interactions are possible in less than 3 tissues and 11% in less than 6 tissues; second, that 77% of the interactions are possible in more than half of the tissues; and third, that 21% of the interactions are common to all the considered tissues.
The relevance of the IU distribution was assessed by showing that (i) it significantly differs from that obtained when randomly assigning proteins to tissue proteomes (Mood's median test, pval = 0,03; Table S2); (ii) a similar distribution is obtained on a larger contextualized interactomes built from an interaction dataset of lower confidence (58189 interactions, 12531 proteins) (Table S3); (iii) it is in accordance with the human protein expression profiles recently revealed by immunological detection across a large number of cell types [23], in which only 2% of the tested proteins show a strict tissue-specificity and 20% are ubiquitously expressed (Table 1); (iv) it significantly differs from the distribution of gene/protein usage ( = number of tissues in which the genes are expressed; Wilcoxon test, p-value ,2.2e-16; (data not shown)). The validity of using EST information for proteome and interactome inferences is further validated by such agreement with experimentally derived protein localization data.
The Largest Common Interactome Network: a functional interactome core devoted to housekeeping functions. The peculiarity of the distribution of the IU suggests a possible relationship between the usage of the interactions and the cellular functions they contribute to. We therefore investigated the interactions possible in all the studied tissues. They form the Largest Common Interactome Network (LCIN), which contains 4200 interactions between 1996 expressed proteins in all the tested tissues. This organization is not fortuitous, because when interactomes are inferred from randomly generated proteomes, none of the interactions are possible in all tested tissues, rendering the delineation of a LCIN impossible (Materials and Methods, Table S2).
Interactions of the LCIN are expected to participate in housekeeping cellular functions which occur in all cell types. Indeed, GO term analysis shows that the LCIN is particularly enriched (p-val,10 25 ) in proteins involved in nucleic acid and protein metabolic processes, intracellular transport, and cellular processes linked to cell cycle and nuclear organization.  Table S4). We believe, therefore, that the LCIN represents a functional interactome core devoted to housekeeping functions.
Interaction usage profiles according to cellular functions. As shown above, the more common interactions are mainly involved in housekeeping functions. To investigate the remaining interactions, which should account for the functional and morphological differences between cells and tissues, we extended the previous LCIN analysis to the other IU categories (interactions binned according to the number of tissues in which they are possible; Figure 2). A heatmap representing the enrichment/depletion of GO Biological Process terms for the proteins in each IU category, is shown in Figure 4A (see Material and Methods). The observed enrichment/depletion profiles of the 1023 GO terms common to all IU categories were then grouped in 15 clusters using the k-means algorithm (6 clusters are detailed in Figure 4, the 9 others in Figure S1, S2, S3). The GO terms grouped in each cluster were represented using SimCT, a webbased tool that provides a simplified subgraph of the ontology, facilitating the interpretation of functional annotations [28]. Two main annotation profiles are distinguishable ( Figure 4D): GO terms that are over-represented among proteins whose interactions were detected in many tissues and under-represented in the rest, and conversely, terms that are overrepresented among proteins whose interactions are possible in only few tissues and depleted in the rest. These annotation profiles are discretized by the classification process, leading to clusters into which the enrichment or the depletion status of terms progressively extends from the categories containing the more common interactions to the others.
Firstly, the clusters 6, 5 and 4 ( Figure 4B) and 9, 10, 14 ( Figure  S1) group terms over-represented among the more common interactions (possible in 7 to 22 tissues according to the cluster) and depleted among interactions possible in only a few tissues. As expected and detailed in the previous paragraph, GO terms enriched solely in the more common interactions correspond to housekeeping functions (such as 'DNA replication', 'DNA repair', 'mRNA processing', 'translation' and 'transport', Figure 4D). Interestingly, in addition to terms related to housekeeping functions, cluster 4 also contains terms referring to the regulation of housekeeping functions (as 'regulation of cell cycle'). Unlike interactions mediating the housekeeping functions themselves, interactions regulating these processes are not expected to be shared by all tissue types but to be more tissue-specific. Indeed, cluster 4 contains terms over-expressed in all tissues as expected for housekeeping interactions, as well as in a more restrained number of tissues, as expected from regulatory interactions. This cluster may illustrate the specificity of the regulation of common processes in particular tissues.
Second, clusters 7, 2, and 15 ( Figure 4C) and 1, 3, 8, 11, and 12 ( Figure S2, S3) group terms depleted among the more common interactions and over-represented elsewhere. These terms ( Figure 4D) are related to regulation and signal transduction (such as 'regulation of MAP kinase activity', 'Wnt receptor signaling pathway' or 'EGFR signaling pathway' in cluster 7),  organ development (as 'central nervous system development', 'heart development' or 'hemopoiesis' in cluster 2), and the physiological functions of the organs (for instance 'muscle contraction', 'blood circulation' or 'visual perception' in cluster 15). Therefore, regulatory and physiological processes appear to be excluded from the more common interactions, therefore mediated by tissue-specific interactions.

Functions, Interaction Usage and Network Topology
Intuitively, as the LCIN forms the core cellular machinery common to all tissues, it is tempting to speculate that it should be buried in the innermost part of the interactome, leaving the topological periphery of the interactome to more tissue-specific functions. To verify this hypothesis, we used the k-core decomposition of the graph to define the topological layers of the interactome [29]. Essentially, this means progressively pruning the graph vertices (proteins) according to the number of edges (interactions) linking them to the connected component [30]. Proteins of high k-core are topologically central in the network whereas proteins of low k-core are peripheral. In the studied interactome, proteins of the highest k-core (k-core 9) are almost double what would be expected by chance in the LCIN (pval = 3,78610 214 ), indicating a correlation between the centrality of a protein and its involvement in common interactions. By extension, proteins of the highest k-core should be involved in housekeeping functions. Building on this idea, we addressed the possible relationship between the IU, network topology and function. As before, we calculated over-and under-representation of Gene Ontology terms annotating the proteins in each k-core category. The resulting data, for each of the clusters previously defined according to the IU categories, are shown as heatmaps and graphs in Figure 5 and Figures S4, S5, S6.
Housekeeping functions are common to all tissues and topologically central. Functions that are over-represented in the LCIN and therefore mediated by the more common interactions are also enriched among the high k-core proteins and depleted in the low k-core ones (cluster 4, 5, 6 on Figure 5, and 9, 10 on Figure S4). In this case, the enrichment/depletion profiles of the GO terms show the same tendency with respect to the IU and the topology. Housekeeping functions and their regulation are therefore mediated by interactions present in a large number of tissues (from 7 to 22) and are centrally located in the interactome. Thus, we show that all these interactions, including those of the functional core previously defined, map to the topological center of the interactome.
Regulatory and developmental functions are tissuespecific and topologically central. Interestingly, other functions, although depleted among the more common interactions, are found enriched among the more central proteins of the interactome (clusters 1, 2, 3, 7 and 11, Figure 5, Figure S5), as shown by the inversion of their enrichment/ depletion profiles in the more central and common parts of the interactome. This suggests that these functions, although almost excluded from the functional core, are mediated by topologically central interactions. More particularly, they correspond to the regulation of biological and cellular processes (such as 'regulation of apoptosis', 'positive regulation of T cell activation') and the molecular regulatory processes themselves (for instance, 'signal transduction' or 'protein amino acid phosphorylation') ( Figure 5). They are also related to the developmental processes of the organs  and organisms (such as 'regionalization', 'embryonic development', 'organ morphogenesis').
Physiological functions are tissue-specific and topologically peripheral. Finally, a third type of functions mediated by tissue-specific interactions is found relatively enriched among proteins of the lowest k-core (clusters 8, 12 and 15) ( Figure 5, Figure S6). These tissue-specific interactions are related to physiological processes and their underlying molecular processes (such as 'cognition', 'spermatogenesis', 'synaptic transmission' or 'cation transport') and lie at the periphery of the interactome.
It appears that the tissue-specific interactions excluded from the functional core are located at the center as well as at the periphery of the interactome, each location corresponding primarily to a particular type of tissue-specific function. The topological criterion, therefore, distinguishes between a tissue-specificity related to the regulatory and the developmental processes on one hand and the physiological processes on the other.

Discussion
Tissue-specific interactions can occur between proteins that are not necessarily tissue-specifically expressed [3]. Indeed, a tissuespecific function can be performed by proteins that, while not exclusive to the tissue of interest, can only interact in that tissue. This explains why considering tissue-specific interactions rather than proteins should, therefore, bring a deeper functional insight to the understanding of cell and tissue diversity. We adopted this point of view to investigate whether the topology of the interactome reflects functional issues. To do so, we classified interactions according to gene expression on one hand, proteins according to their network topological features on the other hand, aiming to finally interpret the results in the light of functional annotations. The novelty of the approach relies on the common functional interpretation of two independent classification schemes. This allowed distinguishing two types of tissue-specific interactions which would not have been detected otherwise, by a global interactome study.
Classifying interactions by integrating gene expression allows estimating their level of tissue-specificity. Although proteomic data would have been the data of choice for protein tissue expression, such comprehensive data are not yet available to our knowledge. We then chose EST as tissular gene expression source to contextualize protein interaction rather than microarray data mainly for two reasons: interactome coverage and protein function representation. Indeed, the coverage of the used expression data is a concern because the studied interactome is only a subset of limited size. As shown in File S1 and in accordance with Zhu et al. [22], the number of housekeeping genes detected with microarrays (taken from Gene Expression Atlas [31] in this study and from [32] in [22]) is underestimated when compared to EST and immuno-detected proteins. This fact is probably due to the stringent thresholds chosen for microarray analyses. Moreover, the use of microarray data for contextualization and the consequent high rate of false negative among housekeeping genes would (i) trivially lead to very small tissular interactomes and more importantly, (ii) introduce a bias in the functional analyses of the inferred interactomes, ie. a depletion of interactions occurring between housekeeping gene products and tissue-specifically expressed proteins.
Centrally located tissue-specific interactions mediate regulatory and developmental functions while peripheral ones carry out physiological functions. This appears illustrating the differences between, for instance, the tissue-specificity of (i) the regulatory interactions that can occur between regulators and housekeeping proteins and lead to tissue-specific transcriptional activation of gene expression, and (ii) the interactions responsible of the assembly of tissue-specific protein complexes involved in physiological functions.
With regards to housekeeping interactions, this work defines an interactome functional core present in all the investigated tissues and essentially formed by interactions devoted to ubiquitous functions. This centrality of the housekeeping functions is reminiscent of the organization evoked for unicellular organisms by Vinogradov [33], while studying the modularity of the yeast interactome. Noticeably, this is, to the best of our knowledge, the first time that such evidence emerges from the interactome analysis of a metazoan organism, suggesting that such organization may be a trend of the functional interactome organization across species.
Bossi and Lehner [3] demonstrated that housekeeping proteins interact with tissue-specifically expressed proteins, without providing functional insights on these particular interactions. Our analysis interestingly extends their results by showing that some of those interactions are regulatory. Specifically, the annotation 'intracellular signaling cascade' is the most over-represented (pval = 2.36610 28 , data not shown) among those proteins of the functional core that interact with tissue-specifically expressed proteins. Finally, regulatory interactions appear as topologically central, irrespective of the biological process considered (housekeeping or tissue-specific).
Together, these results led to the proposal that the functional core subnetwork formed by the common interactions combined with the tissue-specific regulatory interactions corresponds to the topological center of the interactome. Interestingly, this particular organization may be a hallmark of the metazoan interactome since noticeably, the topologically central part of the yeast interactome has been previously demonstrated to be rather enriched in evolutionary conserved and essential proteins [29] mainly dedicated to housekeeping functions [33].
Tissue-specific interactions involved in physiological functions are found at the periphery of the interactome. Noticeably, this links the observations that tissue-specific proteins are more likely to be recent evolutionary innovations [34], that rapidly evolving genes are expressed in a narrow range of tissues [35] and that proteins having the higher potential to evolve are located at the interactome periphery [36].
The functional organization of the human interactome deciphered in this work recapitulates the integrative organization of organs, from cells to organ, by following a gradient of functions from center to periphery ( Figure 5C). This reflects the fact that tissues and organs first acquire their specificity from the developmental and regulatory programs which build on common molecular mechanisms. Tissues and organs then become fully functional when physiological functions are established. Our results suggest that the pace of these events may be encoded in the organization of the interactome. Interestingly, this gradient is reminiscent of two other gradients recently described in interactomes: a gradient of evolvability suggested by the observation of the preferential peripheral location of human proteins under positive selection, and a gradient of cellular localization reflecting that proteins located at the cellular periphery are also peripheral in the network ( [36] and data not shown). Gradients (such as of functions, evolvability or cellular location) may therefore represent a trend of protein-protein interaction network organization.

Protein-protein interaction datasets
A high confidence dataset of 27286 binary interactions involving 9596 proteins was built by joining (i) 2325 human interactions manually extracted from the literature and (ii) 24961 binary interactions compiled in APID [20] (Table S1). Interactions identified by at least one experimental method leading to the detection of binary interactions (Table S5) were selected. A larger dataset of lower confidence (comprising 58189 interactions between 12531 proteins) was built by adding the 2325 manually curated interactions to the complete set of human interactions present in APID.

Expression data, proteome and interactome inferences
Clusters of expressed sequence tags (ESTs) from the UniGene database (Homo sapiens release 214 -June 23rd , 2008) [21] were used as source of gene expression data. For each cluster, a list of tissues in which ESTs were expressed was created, independently of their level of expression. After mapping their corresponding gene names to protein names by parsing UniProt files, the composition of 45 proteomes was inferred. Assuming that an interaction present in the interaction dataset is possible in a given tissue if both protein partners are expressed in this tissue, 22 tissular interactomes containing more than 10000 interactions each were inferred (Table S1)

Functional Analyses, Heatmaps and Clusters
The functional enrichment/depletion of the LCIN was determined by comparing the Gene Ontology [37] annotations of the proteins involved in the LCIN to those involved in the LPIN using the GOToolBox application [38], which uses an hypergeometric law corrected for multiple testing. All three ontologies were tested (Table S4).
This functional analysis was extended to the other IU bins (containing interactions with respect to the number of tissues in which they are possible) to reveal enriched or depleted GO Biological Process (BP) annotations in each bin. On average, 88% of the proteins per bin are annotated. Only GO terms annotating more than 10 proteins of the LPIN are considered in the analysis (Table S6). As a result, a p-value(go,b) is computed for each annotation go in each bin b. A global matrix M in which the lines are the GO BP terms and the columns are the IU bins is built. The matrix contains 1023 GO BP terms annotating the LPIN proteins. For each GO annotation go and for each bin b, the matrix M[go,b] contains 2log(p-value(go,b)) if annotation go is enriched in bin b and log(p-value(go,b)) if it is depleted.
Heatmap visualization and further analyses were performed using MeV [39]. A k-means clustering algorithm was applied and the matrix was split into 15 clusters. The number of clusters was chosen following a Figure Of Merit (FOM) analysis which estimates the predictive power of a clustering algorithm [40].

Centrality analysis and k-core decomposition
The interactome can be formally represented by a graph G = (V,E), where V is the set of vertices and E the set of edges. The k-core decomposition of the graph [30], can be intuitively seen as a peeling process. A subgraph H(C, E|C) induced by the set C(V is a k-core or a core of order k if and only if Vv [ C : degree H (v) §k, and H is the maximum subgraph with this category. A vertex v has a coreness c if it belongs to the c-core but not to the (c+1)-core. The algorithm used to compute the coreness was extracted from [41].
To explore whether the 15 functional clusters tend to have a clear behavior in terms of centrality, the enrichment or depletion of these 1023 annotations was computed for proteins of coreness 1 to 9. As previously explained, the p-values were considered negative if the annotation is under-and positive if the annotation is over-represented. These centrality values were not used for the clustering, but added subsequently.  Table S4 Functional Analyses of the LCIN using the three ontologies of GO. Terms of the Biological Process subontology that are found over-represented (Sheet Enriched BP annotations) and depleted (Sheet Depleted BP annotations) among the protein annotations composing the LCIN relatively to the proteins composing the LPIN. Results are obtained with GOTool-Box, using an hypergeometric law corrected for multiple testing with the Benjamini & Hochberg correction. The terms whose over/under-representation p-value is less than 1E-05 (in red/blue) have been used to build the SimCT trees ( Figure 3). The same analyses have been performed for the two other sub-ontologies. (XLS)