Conceived and designed the experiments: LAR MS. Performed the experiments: LAR. Analyzed the data: LAR MS AFS. Contributed reagents/materials/analysis tools: MR. Wrote the paper: LAR MS AFS.
Current address: JFRC, HHMI, Ashburn, Virginia, United States of America
The authors have declared that no competing interests exist.
With the advent of large-scale protein interaction studies, there is much debate about data quality. Can different noise levels in the measurements be assessed by analyzing network structure? Because proteomic regulation is inherently co-operative, modular and redundant, it is inherently compressible when represented as a network. Here we propose that network compression can be used to compare false positive and false negative noise levels in protein interaction networks. We validate this hypothesis by first confirming the detrimental effect of false positives and false negatives. Second, we show that gold standard networks are more compressible. Third, we show that compressibility correlates with co-expression, co-localization, and shared function. Fourth, we also observe correlation with better protein tagging methods, physiological expression in contrast to over-expression of tagged proteins, and smart pooling approaches for yeast two-hybrid screens. Overall, this new measure is a proxy for both sensitivity and specificity and gives complementary information to standard measures such as average degree and clustering coefficients.
Over the last ten years, several experimental methods such as yeast two-hybrid (Y2H), affinity purification followed by mass spectrometry (AP/MS), and protein complementation assay (PCA) have been used for large-scale protein interaction mapping. Other approaches for reconstituting protein interaction networks range from computational and structural methods to manual curation and automated text-mining of large corpora of literature. Considerable obstacles have been encountered and the ways to assess data quality remain controversial. Despite many efforts, the interaction space for most species is still sparsely explored and reliable gold standards are difficult to define.
Comparison of the first genome-wide Yeast Y2H networks by Uetz et al. and Ito et al.
Several methods have been proposed for assessing the quality of protein interaction datasets. A first approach is to compare error-prone high-throughput data with interactions curated from literature on small-scale interaction studies
The question of
Molecular systems in the cell are inherently modular, cooperative, and redundant
modules (e.g. protein sub-complexes which are re-used),
redundant interactions (e.g. multiple inhibitors for the same enzyme),
protein domain and motif mediated interactions
Modularity is a hallmark of protein interaction networks
In computing, compression algorithms identify patterns in data and use these patterns to obtain compact representations, thus reducing data size. Lossless compression algorithms are reversible: the compressed representation is sufficient to recover the original data. In 1948, Shannon established a fundamental limit to lossless data compression based on the notion of entropy.
There exist a variety of approaches for measuring the information content of graphs, going back to Rashevsky et al. and Mowshowitz et al., who proposed calculating the information content of graphs using Shannon's entropy formula.
Instead of measuring the network's information content using information theory and Shannon's entropy, we rely on the notion of graph compression. Other approaches for graph compression exploit neighborhood similarity, non-uniform network motif statistics, and scale-free properties of complex networks
Because we aim at comparing different networks, it is necessary to normalize against the effects of different sizes and topologies on a network's compressibility (see methods section for details and in depth discussion). Instead of measuring the entropy which varies according to the network's data size, we consider the
We understand network quality as encompassing both sensitivity and specificity. To illustrate this, consider two alterations of a perfect and complete interactome: i) remove many interactions at random, or ii) add many interactions at random. As we will show, both alterations result in networks whose global properties are closer to those of random networks, and yet the truthfulness of individual positive interactions differs: individual interactions are more reliable in i) than in ii). The situation is reversed for the network's complement, that is, for the negative interactions: the absence of an interaction is less reliable in i) than in ii). Importantly, network quality is not solely determined by the quality of individual interactions.
In the following we give a four-point validation of network compressibility as a measure of a network's richness in structure, our proxy for the notion of overall network quality.
First we validate the link between relative compression rate and network quality as previously defined. We investigate to what extent it correlates with other measures proposed in the literature. We then compare the relative compressibility of all large-scale interactomes and discuss how assay parameters such as protein expression level, tagging, and pooling strategies influence the networks' relative compressibility. Importantly, we show that relative compressibility is independent of network properties such as the number of proteins, the number of interactions, the average number of interaction partners, or the average clustering coefficient. Finally, we verify that networks derived from completely and accurately known complex systems are compressible at levels similar to the best interactomes.
If relative compressibility measures the fidelity of the networks to the systems they represent, then the relative compression rate should deteriorate with the addition of noise to networks. Noise can be applied by randomly adding interactions – introducing false positives (FP) – or by randomly removing interactions – introducing false negatives (FN). We consider two models for adding or removing interactions in protein interaction networks. In the Erdös–Rényi model (ER), the choice of interactions is independent of the network topology and all possible interactions are equally likely to be selected for addition or removal
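The Erdős–Rényi noise model described above can be illustrated with a minimal Python sketch. The function names, the toy network, and the use of a seeded `random.Random` are assumptions made for illustration; they are not taken from the original analysis code.

```python
import random
from itertools import combinations

def add_false_positives(nodes, edges, k, rng):
    """Add k interactions drawn uniformly from all currently absent pairs."""
    present = {frozenset(e) for e in edges}
    absent = [frozenset(p) for p in combinations(sorted(nodes), 2)
              if frozenset(p) not in present]
    return present | set(rng.sample(absent, k))

def add_false_negatives(edges, k, rng):
    """Remove k interactions drawn uniformly from the existing ones."""
    present = sorted({tuple(sorted(e)) for e in edges})
    removed = set(rng.sample(present, k))
    return {frozenset(e) for e in present if e not in removed}

# Hypothetical toy network, used only to exercise the two functions.
rng = random.Random(0)
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("C", "D")]

noisy_fp = add_false_positives(nodes, edges, 2, rng)  # two spurious interactions
noisy_fn = add_false_negatives(edges, 1, rng)         # one missing interaction
```

In both functions the choice of interactions ignores the network topology, which is the defining property of this noise model; a degree-dependent model would weight the candidate pairs instead of sampling them uniformly.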
Networks that have a low relative compressibility (below
With one exception (
In addition, we consider false positives and false negatives caused by missing or added proteins.
Published interactomes are often reported as binary interactions, i.e. two proteins either interact or they do not. Underlying these data are confidence scores: authors define a threshold and only report interactions above that threshold. Defining such a threshold is a difficult compromise, since a conservative threshold may improve precision but lowers coverage, while a generous threshold has the opposite effect. Thus, the threshold controls the number of false positives and false negatives in the network, and the question arises of how this is reflected in the compression rates. To answer this question we systematically analysed the networks of Gavin (TAP/MS), Tarassov (PCA), Parrish (Y2H), and Kiemer (WI-PHI integrated network) and computed the compression rates for networks defined by interactions above a minimum and below a maximum confidence score (see
Remarkably, for Gavin's network, the highest relative compression rate is found for a minimum confidence score (socio-affinity index) of
For Tarassov's network we find that the highest relative compressibility is found for a minimum score of
For Parrish's network we observe that interactions with confidence scores below
We also tested a high quality merged dataset: the WI-PHI network
The network by Collins et al. is a merge and re-analysis of the raw data from the Gavin and Krogan datasets aimed at improving coverage and reducing false positives
Yu et al. compared their novel experimental dataset (CCSB-YI1) and their own merge of several datasets (Y2H-Union) to a gold standard of binary interactions derived from literature (CCSB-binaryGS)
Ito et al. discouraged the use of the Ito full dataset and instead recommended the use of a subset: Ito core. We observe that the Ito core network has a slightly higher relative compression rate (of
Similarly, false positive estimates by Lemmens et al.
Finally, the WI-PHI core network by Kiemer et al.
Assortativity in protein interaction networks refers to the preference of proteins to interact with other proteins that are similar or share certain properties
For all interacting pairs of proteins for which we have information about both, we compute the proportion – or assortativity ratio – of interacting proteins that are significantly co-expressed, share a cellular function, are found in at least one common cellular compartment, and have similar phylogenetic profiles. We normalize these ratios by subtracting the average proportion found for equivalent randomized networks, analogously to the relative compression rate (see Methods for details).
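The normalized proportion described above can be sketched in Python for a single set-valued property such as cellular compartment. The function names, the annotation dictionary, and the toy data are hypothetical, and the sketch assumes that two proteins "share" the property when their annotation sets intersect.

```python
def shared_fraction(edges, annotations):
    """Fraction of interacting pairs, annotated on both sides, that share an annotation."""
    informative = [(a, b) for a, b in edges
                   if a in annotations and b in annotations]
    if not informative:
        return 0.0
    sharing = sum(1 for a, b in informative if annotations[a] & annotations[b])
    return sharing / len(informative)

def normalized_ratio(edges, randomized_edge_lists, annotations):
    """Observed fraction minus the average fraction over randomized networks."""
    observed = shared_fraction(edges, annotations)
    baseline = sum(shared_fraction(r, annotations)
                   for r in randomized_edge_lists) / len(randomized_edge_lists)
    return observed - baseline

# Hypothetical compartment annotations; protein "E" has no annotation
# and is therefore skipped when counting informative pairs.
annotations = {
    "A": {"nucleus"},
    "B": {"nucleus", "cytosol"},
    "C": {"cytosol"},
    "D": {"membrane"},
}
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
randomized = [[("A", "C"), ("B", "D"), ("A", "D")]]

observed = shared_fraction(edges, annotations)
ratio = normalized_ratio(edges, randomized, annotations)
```

Pairs with missing information are excluded from both the numerator and the denominator, which mirrors the restriction to pairs "for which we have information about both".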
To summarize, the above four validation points substantiate our claim that higher network compressibility is a good proxy for overall network quality. Next, we will discuss in detail how the different experimental methods influence the relative compressibility of available large-scale interactomes.
Relative compression rates plotted against compression rates for several types of large-scale networks: Y2H, AP/MS, PCA, and other derived networks. More details are given in
Progress has been made with higher relative compression rates achieved in recent years.
author | species | system | year | compression rate | relative compression rate | number of proteins | number of interactions | average degree | PubMed Id |
Collins et al. | Yeast | AP/MS | 2007 | 0.71 | 0.48 | 1622 | 9070 | 11.18 | 17200106 |
Gavin et al. (socio-affinity) | Yeast | AP/MS | 2006 | 0.64 | 0.42 | 1462 | 6942 | 9.50 | 16429126 |
Gavin et al. ( | Yeast | AP/MS | 2006 | 0.56 | 0.22 | 1386 | 3244 | 4.68 | 16429126 |
Krogan et al. | Yeast | AP/MS | 2006 | 0.50 | 0.18 | 2708 | 7123 | 5.26 | 16554755 |
Ewing et al. | Human | AP/MS | 2007 | 0.54 | 0.12 | 2294 | 6449 | 5.62 | 17353931 |
Butland et al. | E. coli | AP/MS | 2005 | 0.44 | 0.11 | 1277 | 5324 | 8.34 | 15690043 |
Ho et al. | Yeast | AP/MS | 2002 | 0.37 | 0.10 | 1693 | 8038 | 9.50 | 11805837 |
Arifuzzaman et al. | E. coli | AP/MS | 2006 | 0.40 | 0.02 | 2457 | 8663 | 7.05 | 16606699 |
Tarassov et al. | Yeast | PCA | 2008 | 0.53 | 0.14 | 1507 | 3030 | 4.02 | 18467557 |
Parrish et al. | C. jejuni | Y2H | 2007 | 0.41 | 0.20 | 1326 | 11659 | 17.59 | 17615063 |
Stelzl et al. | Human | Y2H | 2005 | 0.52 | 0.10 | 1664 | 3083 | 3.71 | 16169070 |
Yu et al. (CCSB-YI1) | Yeast | Y2H | 2008 | 0.55 | 0.06 | 1278 | 1641 | 2.57 | 18719252 |
Titz et al. | T. pallidum | Y2H | 2008 | 0.47 | 0.05 | 724 | 3627 | 10.02 | 18509523 |
Yu et al. (Y2H-Union) | Yeast | Y2H | 2008 | 0.52 | 0.05 | 2018 | 2705 | 2.68 | 18719252 |
Simonis et al. | C. elegans | Y2H | 2009 | 0.56 | 0.05 | 1515 | 1748 | 2.31 | 19123269 |
Uetz et al. | Yeast | Y2H | 2000 | 0.48 | 0.05 | 806 | 644 | 1.60 | 10688190 |
Ito et al. (core) | Yeast | Y2H | 2001 | 0.53 | 0.05 | 813 | 761 | 1.87 | 11283351 |
Rual et al. | Human | Y2H | 2005 | 0.51 | 0.04 | 1527 | 2529 | 3.31 | 16189514 |
Ito et al. (full) | Yeast | Y2H | 2001 | 0.54 | 0.03 | 3243 | 4367 | 2.69 | 11283351 |
Giot et al. | D. melanogaster | Y2H | 2003 | 0.31 | 0.03 | 6988 | 20240 | 5.79 | 14605208 |
Sato et al. | Synechocystis | Y2H | 2007 | 0.43 | 0.02 | 1915 | 3100 | 3.24 | 18000013 |
LaCount et al. | P. falciparum | Y2H | 2005 | 0.38 | 0.01 | 1272 | 2643 | 4.16 | 16267556 |
Aranda et al. (IntAct) | 292 species | Database | 2010 | 0.40 | 0.15 | 46011 | 162082 | 7.05 | 4,247 publ. |
Ceol et al. (MINT) | 332 species | Database | 2010 | 0.40 | 0.11 | 29407 | 77954 | 5.30 | 2,942 publ. |
Prasad et al. (HPRD) | Human | Database | 2010 | 0.34 | 0.09 | 9463 | 35021 | 7.40 | 453,521 publ. |
Breitkreutz et al. (BioGRID) | 15 species | Database | 2010 | 0.24 | 0.09 | 29499 | 229471 | 15.56 | 22,645 publ. |
Salwinski et al. (DIP) | 230 species | Database | 2010 | 0.37 | 0.09 | 20685 | 58596 | 5.67 | 3,609 publ. |
Reguly et al. | Yeast | Literature | 2006 | 0.60 | 0.22 | 1536 | 2844 | 3.70 | 16762047 |
Yu et al. (CCSB-binaryGS) | Yeast | Literature | 2008 | 0.58 | 0.13 | 1090 | 1263 | 2.32 | 18719252 |
Kiemer et al. (WI-PHI-core) | Yeast | Mixed | 2007 | 0.56 | 0.23 | 2443 | 5244 | 4.29 | 17285561 |
Kim et al. (SIN) | Yeast | Structure | 2006 | 0.68 | 0.22 | 1178 | 2195 | 3.72 | 17185604 |
The complete list of protein interaction networks analyzed is given together with the species, system, publication year, compression rate, relative compression rate, number of nodes and edges, average number of interaction partners (avg. num. of int. partners), and PubMed identifier of publication for referencing. Networks by
To investigate the “average signal” of all available interactome data we computed the relative compression rate of all protein interaction data available in the multi-species databases: IntAct, MINT, BioGRID, and DIP. These database averages cluster around a relative compressibility of
First introduced by
dataset | species | strategy | num. of prot. coding genes | screening completeness | avg. num. of int. partners | rel. comp. rate |
Stelzl et al. | Human | two-phase pooling (8) | 22,286 | | 3.7 | 10% |
Parrish et al. | C. jejuni | two-phase pooling (96) | 1,685 | | 17.5 | 20% |
Titz et al. | T. pallidum | matrix | 1,028 | | 10.0 | 5% |
Rual et al. | Human | library | 22,286 | | 3.3 | 4% |
Simonis et al. | C. elegans | library | 20,185 | | 2.3 | 5% |
Giot et al. | D. melanogaster | library | 14,144 | | 5.7 | 3% |
Yu et al. (CCSB-YI1) | Yeast | library | 5,797 | | 2.5 | 6% |
Ito et al. (core) | Yeast | library | 5,797 | | 1.8 | 5% |
Uetz et al. | Yeast | library | 5,797 | | 1.6 | 5% |
LaCount et al. | P. falciparum | library | 5,268 | | 4.1 | 1% |
Sato et al. | Synechocystis | library | 3,569 | | 3.2 | 2% |
There are three main strategies for large-scale Y2H screens, briefly: i) matrix – all bait-prey pairs are tested, ii) library – preys are pooled and growing colonies are picked and then sequenced, and iii) two-phase pooling – preys are pooled in a first phase and in a second phase baits that reported interactions are pooled and screened against individual preys (see
The lower sensitivity of library-based Y2H screens is apparent if one examines the average number of interaction partners. Depending on the database – IntAct, BioGRID, MINT, HPRD, or DIP – the average number of interaction partners per protein can be roughly estimated to be between
Overall,
As
dataset | species | expression modes | purification method | num. of prot. coding genes | completeness | rel. comp. rate |
Collins et al. | Yeast | physiological expression (knock-in) | TAP | 5,797 | 80% | |
Gavin et al. (socio-affinity) | Yeast | physiological expression (knock-in) | TAP | 5,797 | 78% | |
Gavin et al. | Yeast | physiological expression (knock-in) | TAP | 5,797 | 78% | |
Krogan et al. | Yeast | physiological expression (knock-in) | TAP | 5,797 | 76% | |
Butland et al. | E. coli | physiological expression (knock-in) | TAP/SPA | 4,263 | 23% | |
Ewing et al. | Human | over-expression (cDNA) | FLAG-tag | 22,286 | 1% | |
Ho et al. | Yeast | over-expression (cDNA) | FLAG-tag | 5,797 | 10% | |
Arifuzzaman et al. | E. coli | over-expression (cDNA) | His-tag | 4,263 | 61% | |
The Arifuzzaman dataset is an outlier when compared with other AP/MS datasets. A possible explanation is that it is the only screen that combined both non-physiological protein expression and His-tagging instead of the superior tandem purification procedure. Note: by default AP/MS datasets are interpreted using the
What about the compressibility of a network of non-interacting proteins – a negatome? Take a perfectly accurate interactome network and consider its complement or negatome. This network is not random since it contains the same information as the original network: it states exactly which protein pairs are not interacting. From this the original network can be recovered. Our approach detects this non-randomness in the same way that it detects it for positive interactions. Noise – in the form of both false positives and false negatives – affects both a network and its complement in similar ways by destroying patterns. Our approach does not
As shown in
Since the individual interactions between proteins can be extracted directly and unambiguously from 3D structures, why is the SIN network, which is derived from the structures of protein complexes, ranked below AP/MS networks? This is an artifact of the structural network (SIN), which is derived from structural templates. While the reliability of each interaction is arguably high, the coverage is very sparse and biased towards proteins and complexes found in solved structures. Coverage by 3D structures is still orders of magnitude lower than the coverage achieved by state-of-the-art protein tagging, purification, and identification in genome-wide AP/MS screens. As shown in validation 1, under-sampling of proteins or interactions leads to decreasing relative compression rates. Therefore, if coverage in SIN were unbiased and genome-wide, it would probably have a higher relative compression rate. Together with PCA and two-phase Y2H screens, AP/MS screens produce the networks with the best balance between coverage and accuracy, revealing more non-random structures and patterns than other experimental or compilation approaches.
In the absence of at least one complete and accurate interactome map it is difficult to estimate the range of true relative compression rates. In particular, an important question is whether some of the high relative compression rates – above
In order to estimate the relative compression rate of true and complete interactome maps we computed the relative compression rates of a wide range of networks derived from complex systems from ecology, neuroanatomy, software engineering, and the Internet.
Our results show that experimental methods (AP/MS versus Y2H, pooling strategy, expression level, tagging) strongly influence relative compressibility. Together with the validation steps, this suggests that relative network compressibility is a suitable quality measure for interactomes. However, before drawing this conclusion, there are some more points to consider:
As argued in the introduction, complex and random networks have different topologies. Relative network compressibility can quantify this difference. Can it go further and quantify the
A reason why network compressibility may not be indicative solely of quality is that it might also reflect changes in other network measures such as network size, degree distribution, or clustering coefficient – which could differ between experimental methods. First we show in
To better understand the influence of clique content on the relative compressibility we remove from the networks the cliques identified by our algorithm and recompute the relative compression rates and clustering coefficients. Removing the networks' cliques diminishes the proportion of indirect interactions in the network – in a sense it de-blurs the network (see Methods for details).
These results confirm that the clustering coefficient is strongly influenced by the amount of cliques while the relative network compressibility is not. As a consequence, network compressibility offers new insights into network topology.
Overall, we observe that Y2H networks are on average 6 times less compressible than all other networks. AP/MS networks have on average a relative compression rate of 21%, whereas it is 7% for Y2H networks. Wilcoxon-Mann-Whitney tests confirm that the relative compression rate of Y2H is significantly different from PCA, SIN, and literature curated networks (
While network compression shows clear differences between types of networks, it should be noted that the method does not rate individual interactions; it only measures the structure of the global network. In this sense, the individual interactions of low-compression networks remain a valuable source of information. Moreover, different experiments (AP/MS, Y2H, PCA), even if performed in a perfect world without any noise or artifacts, would probably not produce exactly the same networks, because each experiment defines an accessible interactome with different properties (e.g. co-complex versus binary, stable versus transient). It is therefore conceivable that a perfect and complete Y2H interactome would not have exactly the same relative compressibility as a perfect and complete AP/MS or PCA interactome. Comparisons within experimental classes as done in
As argued by
In particular, panels E and F show a visualization of the low-compression Y2H sub-network and the high-compression AP/MS sub-network. The panels clearly show that the low-compression network has only a few scattered and isolated edges and hence no structure, while the high-compression sub-network comprises non-trivial nested structures.
Over the past years numerous genome-wide protein interaction datasets have been published. They have been obtained by different experimental methodologies sparking a discussion on data quality and coverage. Since proteomic interactions are inherently co-operative, modular, and redundant, interactomes are expected to be rich in structure and patterns. We propose the relative compression rate as a measure of this richness in patterns and structure and show that it correlates with data quality – understood as encompassing both sensitivity and specificity. We underpin this relationship as follows:
First, adding noise (both false positives and false negatives) adversely affects relative compressibility independently of the noise model and type of network.
Second, gold standard datasets and community-recognized higher-quality datasets (low false positive rates) exhibit higher relative compressibility.
Third, an assessment of confidence thresholds based solely on the relative compressibility agrees with the authors' own benchmarks and analyses aimed at minimizing false positives and false negatives.
Fourth, relative compressibility correlates with co-expression, co-localization, and shared function.
We also show that well-characterized complex systems from other domains exhibit relative compressibility levels similar to those of many protein interaction networks, suggesting that accurate and complete interactomes are also significantly compressible.
We screened all 22 large interactome datasets available, 5 complete interaction databases, as well as four other networks. First, we observe that within an experimental method (Y2H or AP/MS) there is a strong effect of the experimental details on the relative compressibility. Networks derived from state-of-the-art purification procedures (tandem affinity purification, TAP) and detecting interactions of baits expressed at physiological levels (knock-in versus cDNA over-expression) exhibit higher relative compressibility.
Second, we observe that networks derived from Y2H library screens are less compressible than networks derived from two-phase pooling Y2H screens and other experimental methods (AP/MS and PCA). The consistently low average number of interaction partners of networks derived from library Y2H screens suggests that the high selection stringency employed to achieve high specificity leads to networks that are too sparse.
Based on the results presented in this paper, we make the following recommendations:
The relative compression rate of new large-scale protein interaction networks can be compared to that of other assays
Large-scale interactome screens should employ state-of-the-art methods such as Y2H two-phase pooling and AP/MS with TAP tagging to obtain networks richer in patterns and structure.
Networks with less than 15% relative compression rate might suffer from poor sensitivity and/or poor specificity.
Overall, relative compressibility is a new measure for comparing networks by their information content defined as the richness in patterns and structure distinguishable from pure noise. This new measure is a good proxy for both sensitivity and specificity and gives complementary information to classic measures such as average degree and clustering coefficient, thus helping to assess the structure of interactomes.
We collected all (21) large-scale protein interaction networks derived from experimental data published between 2000 and 2009. The data files were obtained directly from the supplementary material of the publications. In the cases where the interaction data was not provided in the supplementary material or on the companion website, we obtained the data from one of the interactome databases – BioGRID, IntAct, MINT, or DIP. Moreover, we did an automatic scan of these four databases and verified that we had collected all experimental datasets satisfying our strict inclusion criteria: we only consider experimental protein interaction networks that are large-scale and symmetric. We exclude datasets focused on proteins of a specific biological function.
In symmetric networks the sets of baits and preys are largely overlapping. We exclude highly asymmetric datasets, for example those in which the number of baits is small in comparison to the number of potential preys, because they are not comparable to symmetric ones. Networks from Formstecher et al. and Rain et al.
In the case of species with large proteomes such as D. melanogaster, C. elegans, and Human, the screening completeness of individual datasets may be low. However, if the experiment has largely overlapping and symmetric sets of baits and preys – and is unbiased – we included it (for example the Rual
In the case of AP/MS datasets we interpreted the data using the
In addition to these experimental networks we added two literature curated datasets
Some of the datasets overlap: the Ito full dataset contains the same interactions as the Ito core dataset with the addition of lower confidence interactions. The network by Collins et al.
Graphs, power graphs, and compressibility.
A graph
The notion of clustering or edge-transitivity in networks was first introduced by
Where
Given a graph
A power graph
Similarly, if and only if a power node is self-adjacent, then in
There is one exception, we ignore self-adjacent nodes:
Any two power nodes are either disjoint, or one is included in the other. Therefore, power nodes form a hierarchy. This guarantees that the power node hierarchy can be represented in the plane which facilitates visualization.
Each edge of the original graph is represented by one and only one power edge. In other terms, the power edges form a partition of the set of edges.
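The two structural conditions stated above, a hierarchy of power nodes and a partition of the edges by power edges, can be checked with a short Python sketch. The representation chosen here (power nodes as `frozenset`s of node names, power edges as pairs of power nodes) and all function names are assumptions made for illustration, not the data structures of the power graph implementation.

```python
from itertools import combinations

def expand_power_edge(u_set, v_set):
    """Edges represented by a power edge between U and V (self-loops are ignored)."""
    return {frozenset((u, v)) for u in u_set for v in v_set if u != v}

def is_hierarchy(power_nodes):
    """True if any two power nodes are disjoint or one is included in the other."""
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a, b in combinations(power_nodes, 2))

def is_edge_partition(power_edges, edges):
    """True if each original edge is represented by one and only one power edge."""
    target = {frozenset(e) for e in edges}
    covered = set()
    for u_set, v_set in power_edges:
        represented = expand_power_edge(u_set, v_set)
        if represented & covered:  # some edge would be represented twice
            return False
        covered |= represented
    return covered == target

# Hypothetical example: a biclique {A,B} x {C,D,E} plus the single edge C-D,
# where the power node {C,D} is nested inside {C,D,E} and is self-adjacent.
p1, p2, p3 = frozenset("AB"), frozenset("CDE"), frozenset("CD")
edges = [("A", "C"), ("A", "D"), ("A", "E"),
         ("B", "C"), ("B", "D"), ("B", "E"), ("C", "D")]
power_edges = [(p1, p2), (p3, p3)]

hierarchy_ok = is_hierarchy([p1, p2, p3])
partition_ok = is_edge_partition(power_edges, edges)
```

In this toy case seven edges are represented by two power edges; the self-adjacent power node {C, D} stands for the clique on its members.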
The power graph algorithm is described in
The algorithm proceeds in two phases: the first phase collects candidate power nodes, and the second phase uses them to search for power edges. In the first phase, candidate power nodes are identified by hierarchical clustering based on neighborhood similarity: a set of nodes is a candidate power node if its nodes have neighbors in common. The similarity of two neighborhoods is the Jaccard index of the two sets. It always ranges between
In the second phase power edges are searched. The minimal power graph problem is to be seen as an optimization problem in which the power graph achieving the highest edge reduction is searched. The greedy power edge search follows the heuristic of making the locally optimum decision at each step with the aim of finding the global optimum. Among the candidate power nodes found in phase one each pair that corresponds to a power edge is a candidate power edge. The candidates abstracting the most edges are added successively to the power graph.
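The neighborhood similarity used in the first phase can be sketched as follows. This is a minimal illustration of the Jaccard index on neighbor sets only; the function names and the toy edge list are assumptions, and neither the hierarchical clustering nor the greedy power edge search is reproduced here.

```python
def neighborhoods(edges):
    """Map each node to the set of its neighbors in an undirected network."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    return nbrs

def jaccard(s, t):
    """Jaccard index: size of the intersection divided by size of the union."""
    union = s | t
    return len(s & t) / len(union) if union else 0.0

# Hypothetical toy network: A and B share the neighbors C and D,
# while C and D have identical neighborhoods {A, B}.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("B", "E")]
nbrs = neighborhoods(edges)

sim_ab = jaccard(nbrs["A"], nbrs["B"])  # N(A) = {C, D}, N(B) = {C, D, E}
sim_cd = jaccard(nbrs["C"], nbrs["D"])  # identical neighborhoods
```

Nodes with highly similar neighborhoods, such as C and D here, are the ones a clustering step would group first into a candidate power node.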
As explained above, power graphs exploit the shared neighbors of two proteins as a pattern for compression. In principle, any algorithm exploiting shared neighbors should perform similarly to power graphs
For
Compression rates for protein interaction networks and rewired networks were calculated with the power graph algorithm. The compression rate of a network is calculated from a power graph by computing the edge reduction. If the original network has
The compression rate is between
Clique or biclique membership is not covered in the measure of compression rate because it only assesses the number of edges before and after compression. There are two reasons for our choice: First, simplicity – our goal is to keep the measure as simple as possible. Combining reduction of nodes and edges into one measure leads directly to a number of subsequent questions: Are they of equal importance? Should they be weighted? How should they be combined?
Second, compression rates computed with and without nodes correlate strongly.
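As a worked illustration of an edge-based measure, the following Python sketch computes an edge reduction from the number of edges before and after compression. It assumes that edge reduction is the fraction of original edges saved by the power graph; the function name and the example counts are hypothetical.

```python
def compression_rate(n_edges, n_power_edges):
    """Edge reduction, assumed here to be (edges - power edges) / edges."""
    if n_edges <= 0:
        raise ValueError("the original network must contain at least one edge")
    if not 0 < n_power_edges <= n_edges:
        raise ValueError("power edges must number between 1 and the edge count")
    return (n_edges - n_power_edges) / n_edges

# A complete biclique between 3 and 4 nodes has 12 edges and can be
# represented by a single power edge between the two node sets.
biclique_rate = compression_rate(12, 1)

# A network in which no edges can be grouped keeps one power edge per edge.
incompressible_rate = compression_rate(5, 5)
```

Only edge counts enter the measure, so the number and nesting of power nodes do not influence the result, in line with the choice of simplicity argued above.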
An important point is that compressibility as measured by power graphs captures network motifs based not only on cliques but also on bicliques. A bipartite network that does not contain a single clique can therefore still exhibit the whole range of compression rates, and networks with a clustering coefficient of zero may still have high compression rates – see
network | source | year | number of nodes | number of edges | average degree | clustering coefficient | relative compression rate |
South Florida Ecosystem | | 2000 | 381 | 2,137 | 11.2 | zero | 0.48 |
Cytoscape class dependencies | Cytoscape | 2009 | 615 | 3,463 | 11.2 | 0.26 | 0.47 |
Bible co-appearance network | | 1993 | 130 | 743 | 11.4 | 0.77 | 0.33 |
US Airports | | 2007 | 500 | 2,980 | 11.9 | 0.61 | 0.21 |
Corporate Ownership | | 2002 | 7,253 | 6,711 | 1.8 | 0.01 | 0.20 |
Java library class dependencies | Java | 2006 | 1,538 | 7,817 | 10.1 | 0.39 | 0.17 |
Internet (autonomous systems) | | 2006 | 22,963 | 48,436 | 4.2 | 0.23 | 0.17 |
C. elegans neural network | | 1986 | 297 | 2,148 | 14.4 | 0.29 | 0.15 |
Power Grid (USA) | | 1998 | 4,941 | 6,594 | 2.6 | 0.08 | 0.04 |
Network relative compressibility in the range
The relative compression rate measures an original network's compression rate in relation to an average random network of same topology. To compute the relative compression rate one generates 1000 random networks following the null model (see below) and computes the average compression rate. The relative compression rate measures by how much the original network's compression rate differs from the average random compression rate:
Where
Given a protein interaction network, we generate a large (1000) population of randomly rewired networks. These random networks have the same number of nodes and edges, as well as the same number of interaction partners per node and hence the same degree distribution as the original network. These networks are generated by randomly re-wiring the original network
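A degree-preserving randomization of this kind can be sketched in Python with repeated double-edge swaps. This is one common way to rewire a network while keeping every node's number of interaction partners fixed; the function names, the number of swaps, and the toy network are assumptions, and the sketch is not the original rewiring code.

```python
import random

def degrees(edges):
    """Number of interaction partners per node."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

def rewire(edges, n_swaps, rng):
    """Randomize a simple undirected network by double-edge swaps.

    Two edges a-b and c-d are replaced by a-d and c-b whenever this creates
    neither a self-loop nor a duplicate edge, so all degrees are preserved.
    """
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:  # consider both orientations of the second edge
            c, d = d, c
        if a == d or c == b:    # the swap would create a self-loop
            continue
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:  # would duplicate an edge
            continue
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
        done += 1
    return edge_list

# Hypothetical toy network; the text uses a population of 1000 rewired
# networks, a smaller population is generated here to keep the example light.
edges = [("A", "B"), ("C", "D"), ("E", "F"),
         ("A", "C"), ("B", "E"), ("D", "F")]
population = [rewire(edges, 10 * len(edges), random.Random(seed))
              for seed in range(5)]
```

Each rewired network has the same nodes, the same number of edges, and the same degree sequence as the original, so averaging a compression rate over such a population controls for size and degree distribution.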
For all noise models we perturb the original networks and recompute their compressibility. For the results in
We generated networks with simulated false positives and false negatives for
We obtained the raw interaction confidence scores for the three datasets by Gavin et al., Parrish et al., and Tarassov et al. (provided in the supplementary material of the publications). As illustrated on
We correlate interactions with gene co-expression, cellular function, cellular co-localization, and phylogenetic profile similarity for 12 Yeast networks and for all interacting pairs of proteins for which we have complete information. We use the following assortativity ratio:
Where
Where
We collected nine networks from the network science literature derived from complex systems of interacting entities (
Work by LR, MR, MS was supported by EU, DFG, BMWi (Ponte, Hybris, GeneCloud). Work by AFS was supported by the BMBF NGFN Program, DiGtoP (Disease genes to pathways). Thanks to Christof Winter for discussions and detailed comments on the figures. Thanks to Assen Roguev for discussions.