Finding the “Dark Matter” in Human and Yeast Protein Network Prediction and Modelling

Accurate modelling of biological systems requires a deeper and more complete knowledge about the molecular components and their functional associations than we currently have. Traditionally, new knowledge on protein associations generated by experiments has played a central role in systems modelling, in contrast to generally less trusted bio-computational predictions. However, we will not achieve realistic modelling of complex molecular systems if the current experimental designs lead to biased screenings of real protein networks and leave large, functionally important areas poorly characterised. To assess the likelihood of this, we have built comprehensive network models of the yeast and human proteomes by using a meta-statistical integration of diverse computationally predicted protein association datasets. We have compared these predicted networks against combined experimental datasets from seven biological resources at different levels of statistical significance. These eukaryotic predicted networks resemble all the topological and noise features of the experimentally inferred networks in both species, and we also show that this observation is not due to random behaviour. In addition, the topology of the predicted networks contains information on true protein associations, beyond the constitutive first order binary predictions. We also observe that most of the reliable predicted protein associations are experimentally uncharacterised in our models, constituting the hidden or "dark matter" of networks by analogy to astronomical systems. Some of this dark matter shows enrichment of particular functions and contains key functional elements of protein networks, such as hubs associated with important functional areas like the regulation of Ras protein signal transduction in human cells. Thus, characterising this large and functionally important dark matter, elusive to established experimental designs, may be crucial for modelling biological systems.
In any case, these predictions provide a valuable guide to these experimentally elusive regions.

genomes other than g which contain such fused proteins. For a domain pair j = (A, B) in genome g, the fusion score C_j is taken as a maximum over all genomes in T:

C_j = max_{t ∈ T} 1 / (f_g(A) · f_g(B) · f_t(A) · f_t(B))   (1)

where |T| is the number of elements of set T (i.e. the number of target genomes), f_g(A) and f_g(B) are the frequencies of domain A and domain B respectively in genome g, and f_t(A) and f_t(B) are the frequencies of domains A and B respectively in genome t. For a particular protein pair i in query genome g, the maximum C_j is taken over all possible domain pairs j:

C_i = max_{j ∈ J} C_j   (2)

where |J| is the number of elements in set J (i.e. distinct domain pairs). Thus C_i is the CODA score for proteins p, q (pair i): the best (highest) score over all domain pairs between the proteins and over potential fusion proteins in all genomes other than the query genome. The important novel aspect of this score is that it takes the maximum score over all the genomes, whereas other methods do not consider target genomes individually. The score was chosen to reflect the uncertainty that fused domains and their unfused relatives are orthologues. The highest (best) possible score is 1, which is returned when there is only one example of each domain family in the query genome and one fused protein in a target genome, with no other domain homologues. In this case it is highly likely that the query protein domains are orthologous to the target protein.
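As an illustration, the scoring logic can be sketched in Python. The reciprocal-product form of the score is an assumption consistent with the description above (it returns the maximal score of 1 when each domain family occurs exactly once in the query and target genomes); all names are illustrative:

```python
def coda_domain_score(freq_query, freq_target, pair):
    """Score one domain pair (A, B) against one target genome.

    freq_query / freq_target map a domain family to the number of
    proteins containing that domain in the query / target genome.
    """
    a, b = pair
    denom = (freq_query[a] * freq_query[b] *
             freq_target[a] * freq_target[b])
    return 1.0 / denom


def coda_score(freq_query, target_freqs, domain_pairs):
    """C_i: the best score over all domain pairs and over all target
    genomes containing a fusion of that pair (target_freqs maps a
    domain pair to the frequency tables of those genomes)."""
    return max(coda_domain_score(freq_query, ft, j)
               for j in domain_pairs
               for ft in target_freqs[j])


# One copy of each domain everywhere yields the maximal score of 1.
fq = {"A": 1, "B": 1}
targets = {("A", "B"): [{"A": 1, "B": 1}]}
print(coda_score(fq, targets, [("A", "B")]))  # 1.0
```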

hiPPI (homology inherited Protein-Protein Interaction) Method:
The hiPPI method takes advantage of the Gene3D families of structurally conserved proteins as well as multiple sources of protein-protein physical interaction (PPI) data to reliably infer ('inherit') novel protein-protein interactions from homologues. hiPPI exploits the Gene3D protein families (G3D_families) datasets [59]. These are families of proteins with similar multi-domain architectures generated using an automated, but conservative, clustering procedure. Interactions are only inherited between proteins belonging to the same Gene3D family, even though there may be recognisable sequence similarity with a protein in another cluster. This step helps to reduce the amount of noise produced by attempting to inherit from overly-distantly related sequences.
The interaction dataset is formed from a merger of the PPI resources from the MIPS, IntAct, HPRD and MINT protein-protein interaction datasets, obtained from the Gene3D database [59]. From each dataset the interacting proteins, their family, species, and experimental method were retrieved. Gene3D family codes consist of 11 elements. The first is the root family code; the family is then further sub-clustered at 10 levels of sequence identity, termed S-levels (S30, S35, S40, S50, S60, S70, S80, S90, S95, S100).
hiPPI score (for a graphical description see Figure 1): For every test protein ('the inheritee'), all relatives ('the inheritors') that interact with proteins ('the complementors') having relatives in the inheritee's species ('the complementees') are identified. The inherited interaction ('inheritance') is that between the inheritee and the potential complementees. Known direct interactions were discarded. For each inheritance, the similarity between the inheritee and the inheritor is measured by which S-level they belong to, on a scale of 1 to 11 (1 is the family code and 11 is 100% identical); this is termed the 'iLevel'. The 'cLevel' is calculated identically for the complementor-complementee pair. The two values are then averaged to create the 'icLevel'.
At this stage two alternative steps can be taken, and both are useful in different situations. The first assumes that if a protein interacts with one member of a family then it is likely to show at least some affinity for another member in the same species. This can be considered biologically realistic, as the effect is seen in many genetic experiments (i.e. complementation tests). In this case all inheritances are counted. The second disregards this assumption in order to identify the most biologically important interaction. In this case, those inheritances with either a cLevel or an iLevel of 10 are disregarded. For the current study the former approach was used.
Since each protein-protein interaction can be inherited from more than one species, experimental method or iLevel, a summed score (the 'iScore') is created for each distinct pair of icLevels (NB iLevel = 10, cLevel = 8 is not the same as iLevel = 8, cLevel = 10). The first entry (non-redundant experiment) at that level contributes the full score of the icLevel. For subsequent entries at that icLevel, if the experiment type or species is not new the score is halved; if neither is new but the entry is a recombination of previously observed ones, the score is quartered. The final iScore is the sum of all the intermediate icLevel scores.

Figure 1. An example of identifying two potential interaction partners (labelled (d)) for the query protein (a). The interaction is inferred from the known interaction between (b) and (c), which are homologous to (a) and (d) respectively. Small example trees are shown for each family; each branch in the trees occurs at a particular family S-level (e.g. 80% sequence identity). The S-level in common between (a) and (b) is the iLevel, while the S-levels in common between (c) and the (d)s are the cLevels.
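The iScore accumulation rules can be sketched as follows; the down-weighting scheme implemented here is one reading of the description above, and the entry format is an illustrative assumption:

```python
def iscore(entries):
    """Sum icLevel contributions over inheritance entries.

    Each entry is (ilevel, clevel, species, method); the contribution
    of an entry is its icLevel = (ilevel + clevel) / 2, keyed by the
    ordered (iLevel, cLevel) pair. Down-weighting rules (one reading
    of the text): the first entry at a key counts in full; later
    entries count half if the species or method was seen before, and
    a quarter if both were (a recombination).
    """
    seen_keys, seen_species, seen_methods = set(), set(), set()
    total = 0.0
    for ilevel, clevel, species, method in entries:
        key = (ilevel, clevel)
        ic = (ilevel + clevel) / 2.0
        if key not in seen_keys:
            weight = 1.0
        elif species in seen_species and method in seen_methods:
            weight = 0.25  # recombination of previously seen evidence
        else:
            weight = 0.5   # species or method already seen
        total += weight * ic
        seen_keys.add(key)
        seen_species.add(species)
        seen_methods.add(method)
    return total


# Three supporting observations for one inherited interaction:
print(iscore([(10, 8, "S.cerevisiae", "Y2H"),
              (10, 8, "S.cerevisiae", "TAP"),
              (10, 8, "S.cerevisiae", "Y2H")]))  # 9.0 + 4.5 + 2.25 = 15.75
```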

Benchmarking studies for the integrated prediction methods using KG datasets.
These studies explored the variation of Precision vs. Recall with progressive reduction of noise, due to increasing evidence (yeast ≥ 2 and ≥ 1 evidences; human ≥ 3, ≥ 2 and ≥ 1 evidences), in the gold standard KG datasets used to estimate statistical power. These plots (Figures 2A and B) reinforce the results shown in Figure 1 of the main manuscript by showing that reducing noise in the gold standard KG datasets increases the area under the curve in the precision vs. recall plots for both species.
There are two related effects that could distort the estimation of FP rates in our validation model: the fact that the gold standard datasets do not contain all the true PPIs in nature; and the fact that the random model used for benchmarking can randomly select true positives. Therefore, we have performed an experiment to estimate the consequence of considering TPs as FPs in our validation protocol.
We have validated performance using as random datasets the combination of 2/3 randomly obtained PPIs plus a random selection of 1/3 known TPs from the Goss, Int and Reactome_int datasets (these TP-enriched random datasets are called "noisier"). We then compared the results for these noisier datasets with validations performed using the remaining 2/3 of PPIs from the same gold standards (Figure 2C and D). When TPs are counted as FPs, the precision decreases in all cases, giving an underestimate of the performance of the prediction methods.

Panels present precision versus recall for yeast and human, as discussed in the main manuscript. In panel A, validation for yeast was performed using the KG ≥2 evidences and KG ≥1 evidences gold standards. Panel B shows the validation in human using the KG ≥3, 2 and 1 evidence gold standard datasets. The enlargement highlights the improvements obtained. In panel C, validation was performed in yeast using the Goss and Int noisier datasets as random models and the remaining 2/3 of the respective gold standard datasets as TPs. In panel D, validation was performed as in panel C but in human, using the Reactome_int extra datasets.

Table 1. Study of dependencies between the prediction methods. The mutual information remains low, suggesting a minimal overlap of features. Tables A and C correspond to the normalised measure or universal metric, whilst B and D show the mutual information scores between the predictors.

In order to meet the statistical requirements for integrating PG datasets, it is important to ascertain whether integration improvements could be an artefact caused by correlation and dependencies between the prediction methods. Mutual information is a more general measure of correlation than Pearson's correlation coefficient, capturing both linear and non-linear correlations.

Metric
Many applications require a metric, that is, a distance measure between points. The quantity d(X,Y) = H(X,Y) − I(X;Y) satisfies the basic properties of a metric: most importantly the triangle inequality, but also non-negativity and symmetry. Here H(X,Y) is the joint entropy of samples X and Y and I(X;Y) the corresponding mutual information.
In addition, one also has 0 ≤ d(X,Y) ≤ H(X,Y), and so obtains the normalised distance D(X,Y) = d(X,Y)/H(X,Y) = 1 − I(X;Y)/H(X,Y). The metric D is a universal metric, in that if any other distance measure places X and Y close, then D will also judge them close.
In this way, mutual information and the metric D measure opposite behaviours. The latter is adopted with the aim of normalising for database size.
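A minimal computation of d and of its normalised form D = d/H from paired samples, assuming discrete variables and entropies measured in bits:

```python
import math
from collections import Counter


def entropy(samples):
    """Entropy in bits of a list of (hashable) outcomes."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n)
                for c in Counter(samples).values())


def metric(xs, ys):
    """Return d(X,Y) = H(X,Y) - I(X;Y) and D = d / H(X,Y)."""
    pairs = list(zip(xs, ys))
    hxy = entropy(pairs)                      # joint entropy H(X,Y)
    mi = entropy(xs) + entropy(ys) - hxy      # I(X;Y)
    d = hxy - mi
    return d, (d / hxy if hxy else 0.0)


# Identical variables are at distance 0; independent ones are far apart.
print(metric([0, 1, 0, 1], [0, 1, 0, 1]))  # (0.0, 0.0)
print(metric([0, 0, 1, 1], [0, 1, 0, 1]))  # (2.0, 1.0)
```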

Validation of the FisherW p-values using physical interaction datasets.
We have performed the validation of the Fisher predictions using the "Int" datasets for yeast and human, and with an additional Reactome_int dataset for human (see Figure 3). The Int dataset combines the interaction data from the HPRD, MINT and IntAct databases (see Methods). In human we have used an extra Reactome_int dataset, which contains the physical interactions annotated in the Reactome database. No similar Reactome_int dataset was available for yeast, since Reactome focuses only on human molecular interactions and reactions. The methodology employed for the analysis of precision vs. coverage was the same as that described in Methods.
We found that Fisher p-values correlate inversely with the precision scores in the yeast and human validations (see Figure 3a and b respectively), as expected if physical interaction information is linked to the Fisher prediction score. Fisher integration retrieves 240,840 predictions with precision ≥ 80% from the Int validation dataset in yeast (Figure 3a). This figure is more than twice the number of hits obtained in the Gossr validation at the same precision level of ≥ 80% (see Figure 1 in the main manuscript). For the functional analysis performed in our work, physical interaction validation with the Int dataset in yeast assigns a precision ≥ 90% to the PG 0.01 dataset (pairs with p-values ≤ 0.01), higher than the Gossr validation, which assigns a lower precision (≥ 80%) to the same PG 0.01 dataset (see Figure 1a in the main manuscript).
For the human Int validation dataset, Fisher predicts 455,410 pairs with precision ≥ 80%, whilst in the validation using Reactome_int the number of predictions rises to 3,823,840 at the same precision level of ≥ 80% (Figure 3b). These figures are about half of, and almost four times, the size of the PG 0.014 human dataset used for functional analysis in our work. In the Gossr validation of the PG 0.014 dataset, the selected Fisher predictions achieve precisions ≥ 80%. In the Int and Reactome_int validations, the PG 0.014 selected Fisher predictions reach precisions of ≥ 76% and ≥ 82% respectively, very close to the ≥ 80% precision assigned by the Gossr validation of the same PG 0.014 dataset.

Fisher integration of the STRING individual ab-initio prediction datasets and benchmarking of the integrated STRING predictions.
We performed a weighted Fisher integration of the neighbourhood, fusion, co-occurrence and co-expression prediction datasets obtained from the STRING site (http://string-db.org/newstring_cgi/) [14]. The yeast and human STRING predictions were extracted from the downloaded file: protein.links.detailed.v8.2.txt.gz. Duplicated (i.e. redundant) protein-protein pairwise predictions were identified in the STRING yeast and human datasets and removed. The scores of the four STRING prediction datasets showed Gaussian distributions, and these scores were translated into p-values. FisherW integration was then performed for the STRING prediction datasets following the same protocol implemented in this work for obtaining the PG predictions (see "Calculating p-values for the predictions and data integration" section in Methods).
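For illustration, the core of Fisher's method for combining a set of p-values can be sketched as follows. This is the unweighted statistic; the weighted variant used in this work additionally requires the Monte Carlo/ESA weight estimation described later in this text, which is beyond a short sketch:

```python
import math


def fisher_combine(pvals):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2n degrees of freedom under the null.
    For even degrees of freedom the survival function has the closed
    form exp(-x/2) * sum_{k<n} (x/2)**k / k!, used here to avoid
    external dependencies."""
    n = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(n))


# Two modestly significant p-values combine into a stronger one.
print(round(fisher_combine([0.05, 0.05]), 4))  # 0.0175
```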
We compared our PG Fisher prediction performance against the STRING Fisher integrated scored predictions. STRING is an up-to-date and well-known resource for predicting protein interactions, and represents a good standard to compare against. Amongst other methods, STRING provides four ab-initio prediction methods, based on genome and gene expression comparison, that are similar to the methods integrated in the PG models: gene neighbourhood, gene fusion (comparable to CODA), gene co-occurrence, and gene co-expression (comparable to GECO). STRING does not include a prediction algorithm similar to hiPPI; instead, it implements a phylogenetic-profiling-like method (gene co-occurrence) and a gene co-localisation method (gene neighbourhood). In any case, hiPPI predictions represent a small percentage (around 0.1%) of the total predictions integrated by Fisher in our work, and therefore have little influence on the overall increase in prediction power observed.
The number of STRING predictions is significantly smaller than the number of PG predictions, indicating a much lower coverage for all the methods compared. Since a given method can produce large numbers of predictions without any functional information being associated with them, the total number of predictions is not in itself a good indicator of a method's performance; the number of accurate predictions above a significant precision level (e.g. 80% precision) is a more useful measure.
Therefore, the STRING predictions were validated with the same Gossr and Int gold standard datasets used to validate the PG predictions in our work ( Figure 4).
The STRING Fisher prediction scores show correlation with precision in yeast and human (see Figures 4a and b, respectively). However, the STRING Fisher prediction power is much lower than that of the PG predictions. In yeast, the STRING predictions with precision ≥ 80% number 14,293 in the Gossr validation and 19,849 in the Int validation, respectively (see Figure 4a).

Assessment studies for the Bayes integration of CODA, hiPPI and GECO datasets.
In order to contrast the results of the weighted Fisher method with methods exploiting prior knowledge, we compared it to the Naïve Bayes classifier, which employs a semi-learning approach. We used Gossr as the True Positive (TP) training dataset and a randomisation of this dataset as the True Negative (TN) training dataset (with 1,000 iterations). These TP and TN datasets were used to learn the corresponding parameters of our model and subsequently to calculate, by Bayesian inference, the posterior maximum likelihood for every pair involved in the integration. A Likelihood Ratio (LR) was calculated from the p-values of every TP and TN integrated dataset as follows:

LR(p_1, p_2, …, p_n) = P(p_1, p_2, …, p_n | I) / P(p_1, p_2, …, p_n | ~I) = ∏_{i=1}^{n} [P(p_i | I) / P(p_i | ~I)]

(the product form assumes conditional independence of the integrated datasets).
Therefore, by Bayes' rule, we are interested in determining the posterior odds ratio of interaction between two proteins in our integration:

O_post = P(I | p_1, p_2, …, p_n) / P(~I | p_1, p_2, …, p_n) = [P(I) / P(~I)] × [P(p_1, p_2, …, p_n | I) / P(p_1, p_2, …, p_n | ~I)] = O_prior × LR(p_1, p_2, …, p_n)

where O_post and O_prior are the posterior and prior odds ratios respectively, I is a binary variable indicating interaction (and ~I its absence), and p_1, p_2, …, p_n are the scores of the datasets being integrated.
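The posterior odds computation can be sketched directly from the formulas above. The likelihood models here are illustrative placeholders; in the actual integration they are learned from the TP/TN training datasets:

```python
def posterior_odds(o_prior, pvals, lik_tp, lik_tn):
    """O_post = O_prior * prod_i P(p_i | I) / P(p_i | ~I), assuming
    conditional independence of the integrated datasets.

    lik_tp / lik_tn are callables returning the likelihood of a
    method's p-value under the interacting / non-interacting model.
    """
    lr = 1.0
    for p in pvals:
        lr *= lik_tp(p) / lik_tn(p)
    return o_prior * lr


# Toy likelihoods: TPs concentrate at small p-values, TNs are flat.
tp = lambda p: 2.0 * (1.0 - p)   # triangular density on [0, 1]
tn = lambda p: 1.0               # uniform density on [0, 1]
print(posterior_odds(0.1, [0.01, 0.05], tp, tn))  # ≈ 0.376
```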
Bayes integration of the datasets from the four individual methods (GECO, CODAcath, CODApfam and hiPPI) was implemented as described above for yeast and human, and the Bayes integration was compared against the Fisher integration. The yeast and human Bayes integrations were validated and compared using the Int and Reactome_int datasets, since the Gossr datasets could not be used for both training and validation purposes.
Bayes integration produced uneven results in yeast and human compared to Fisher (Figure 5). Whilst Fisher outperforms Bayes at the highest levels of precision in yeast (see the left side of Figure 5a), in human Bayes performs better (see Figure 5b). These results show that Fisher integration performs well compared to a trained method, despite the latter's advantage of learning from the experimental (KG) information to predict PPIs. Therefore, despite the increase in precision achieved by using Bayes with the human data, Bayes, like any trained and/or supervised integration method, could not be used in our analysis, since we would then be comparing Bayes PG models that were not completely independent of the KG training datasets against the KG models.

Calculating topology features for the PG and KG protein networks.
The following topological parameters were calculated for the PG and KG networks: Degree Statistic or Vertex Degree Connection, the most commonly calculated parameter. Measurements in classical undirected graphs indicate that power-law growth, described as c·k^(−γ) with γ > 0 and c > 0, and its associated exponents (γ ≈ 2-3 typically for stable models) are significant statistics for the classification of graphs and can be used to determine the similarity of the predicted network model to the real network model [36-40].
Degree Correlation or Assortativity [39,40]: this parameter measures the preferential attachment of a new node, i.e. the likelihood that a new node will be associated with others that are highly connected in the graph. Together with power-law growth, this is a necessary and sufficient condition for Scale Free (SF) network models [37,38].
Clustering Distribution: this is a density function based on the probability of each node belonging to a completely connected triplet, and is a necessary condition for proving a hierarchical or modular organisation [41,52]. For disconnected sub-graphs we set the characteristic path length ℓ = ∞.
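As a small illustration of these statistics, the following sketch computes per-node degrees and local clustering coefficients for an undirected edge list (standard definitions, not the authors' code):

```python
from collections import defaultdict


def degree_and_clustering(edges):
    """Per-node degree and local clustering coefficient
    C_i = 2 * (links among neighbours) / (k_i * (k_i - 1))
    for an undirected graph given as a list of node pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {n: len(nb) for n, nb in adj.items()}
    clust = {}
    for n, nb in adj.items():
        k = len(nb)
        if k < 2:
            clust[n] = 0.0
            continue
        # Count edges between distinct pairs of neighbours of n.
        links = sum(1 for a in nb for b in nb if a < b and b in adj[a])
        clust[n] = 2.0 * links / (k * (k - 1))
    return deg, clust


# A triangle plus a pendant node: nodes inside the triangle are fully
# clustered, while the hub's coefficient drops below 1.
deg, clust = degree_and_clustering([(1, 2), (2, 3), (1, 3), (3, 4)])
print(deg[3], clust[1], clust[3])  # 3 1.0 0.3333333333333333
```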

Radius, Diameter and Eccentricity Statistics
By fixing one argument of the distance, the analysis can be usefully reduced to the eccentricity ecc(n_i), the maximal distance from node n_i to any other node in the graph [42]. In our case this can be calculated from the distance matrix, obtained by solving the single-source shortest-paths problem (SSSP) for every node in O(n^2) time. The radius of the graph, Rad(G), is the minimal eccentricity over all vertices. The diameter of a graph G, diam(G), is the maximal distance between two arbitrary (connected) vertices (used in testing the network modularity [52]).
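These quantities can be computed by breadth-first search from every node, which solves the SSSP problem for unweighted graphs; a minimal sketch:

```python
from collections import deque, defaultdict


def eccentricities(edges):
    """BFS from every node gives ecc(n) = max shortest-path distance
    to any reachable node; Rad(G) = min ecc, diam(G) = max ecc.
    Assumes the edge list describes one connected component."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    ecc = {}
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            n = q.popleft()
            for m in adj[n]:
                if m not in dist:
                    dist[m] = dist[n] + 1
                    q.append(m)
        ecc[src] = max(dist.values())
    return ecc, min(ecc.values()), max(ecc.values())


# A path graph 1-2-3-4: radius 2 (from a middle node), diameter 3.
ecc, rad, diam = eccentricities([(1, 2), (2, 3), (3, 4)])
print(rad, diam)  # 2 3
```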

Other Parameters Involved in the Analysis of the Networks
In this section we present the parameters measured in the analysis of the networks. Tables 3 and 4 show several measures which support the highly modular nature of our networks: as the significance level of the PG and KG models increases, the density decreases, with a concomitant decrease in the cluster average. This correlation between density, cluster average and significance level, together with the fact that the radius remains almost constant while both the diameter and ℓ increase appreciably, is consistent with the other topological signals of high modularity found in our modelled networks. These include relatively low assortativity (lower in human than in yeast), scale-free exponents below 3, and no hierarchical structure (detailed support for these statements can be found in references [37,39,42]). Note that the density parameter was defined as the probability of obtaining triangles in any network, and that the radius, ℓ and diameter measurements refer to the largest component of a network, in accordance with the definition of path algebra introduced in section 7.
Some of the parameters are not explored in human because of the high computational cost in calculating these values. Nevertheless, the tendency for high modularity is confirmed in the other parameters calculated for human.
To further illustrate the modular behaviour of our models, we present two examples (A for human and B for yeast in Figure 12) highlighting the difference in modularity between the yeast and human networks. Notice the higher assortativity in the yeast network (as in most real-life networks) compared to the human example. These examples support the proposed possible configuration of our networks, although they do not necessarily depict an exact model of them.

The GO semantic similarity refined dataset (Gossr) used for validating the prediction methods.
In order to increase the quality and reliability of the human and yeast validation datasets, those annotations with term evidence type Inferred from Electronic Annotation (IEA), No biological Data available (ND) and those inferred from Computational Analysis were removed. We expected the source data of CODA, GECO and hiPPI to have overlap with IGC (Inferred from Genomic Context), IEP (Inferred from Expression Pattern) and IPI (Inferred from Physical Interaction) annotation sets respectively. To minimise this overlap and prevent circularity, these evidence codes (IGC, IEP and IPI) were removed.

Quantitative analysis (in terms of the edges and real nodes) of the union and intersection of the source sets (Yeast (Y) and Human (H)).
The total number of proteins (nodes) in the yeast and human proteomes is also indicated at the bottom of the Table.

16. Functional enrichment analysis of the yeast and human dark (hidden) hubs.
We have performed a comparative enrichment analysis of the top 10% and bottom 10% of the PGk_i_er ranked lists, against the human proteome background datasets, using the DAVID algorithm [43]. DAVID includes functional annotation from an extensive number of public resources and identifies enriched biological themes, particularly GO terms, computing a p-value for the observed enrichment. Results indicate that the dark hubs in our dataset (top 10% of the PGk_i_er ranked lists) are significantly enriched in proteins integral to the membrane in yeast (Table 6), compared to the bottom 10%, and in unknown proteins (i.e. with no functional annotation) in yeast and human (Table 7).

Table 6. Gene-annotation enrichment analysis of the yeast and human dark (hidden) hubs. Gene-annotation enrichments in GO of the top 10% of hubs and the bottom 10% of nodes in the yeast and human protein lists, ranked by their PGk_i_er values, performed using the DAVID algorithm with the human proteome as the background (Dennis et al., 2003; http://david.abcc.ncifcrf.gov/; see in Results the section "Analysing the 'dark matter' in the PG models" and in Methods the section "Calculating the PGk_i enrichment ratio and the PG functional enrichment"). PGk_i_er values rank the dark (hidden) hubs at the top of the list: these are proteins with high connectivity in the PG model and low connectivity in the KG models, while hubs with few connections in the PG models compared to the KG models are ranked at the bottom.
The Table shows, from left column to right: species (yeast or human); name of the GO term; GO category, cc (cellular component) or bp (biological process); Top 10%, results of the enrichment analysis of the top 10% of the ranked lists; Bottom 10%, results for the bottom 10% of the ranked lists; Tot. %, protein enrichment percentage over the analysed datasets; p-value, the statistical significance provided by DAVID (note that all the enrichment data shown are statistically highly significant); GO identification number; and the functional definition in GO. Only Tot. % enrichment values equal to or greater than 20% are reported. Enrichment results show that 22% of the top 10% dataset proteins are "integral to membrane" in yeast, while 25% of the bottom 10% dataset proteins are annotated as "intracellular non-membrane-bounded organelle". In human there is no significant enrichment equal to or above 20% in the top 10% dataset, while in the bottom 10% dataset 21% of proteins are annotated with the "extracellular region" GO localisation term. The results in yeast statistically support the observation that the hidden hubs are significantly related to membrane proteins, which are difficult to isolate by conventional purification protocols.

Context analysis and validation
The following example represents the typical context analysis framework. In Figure 13, for example, we observe that in the comparison between Prot. 1 and Prot. 2 only three matching bits (1-1) are found, whilst between Prot. 1 and Prot. 3 there are no matching bits, and between Prot. 3 and Prot. 4 two matching bits (1-1) and one non-matching bit (1-0) are found. It can also be seen that there is no direct relationship between Prot. 1 and Prot. 2, or between Prot. 3 and Prot. 4; consequently, the predictions retrieved using context could not be detected from their primary interactions.
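The bit-matching comparison can be sketched as follows; the profiles below are illustrative stand-ins for the Figure 13 example, not the actual data:

```python
def matching_bits(profile_a, profile_b):
    """Count 1-1 matches and 1-0 mismatches between two binary
    association profiles (one bit per candidate partner protein)."""
    match = sum(1 for a, b in zip(profile_a, profile_b) if a == b == 1)
    mismatch = sum(1 for a, b in zip(profile_a, profile_b) if a != b)
    return match, mismatch


# Profiles mirroring the worked example: two proteins that share two
# partners (two 1-1 bits) and differ in one (a 1-0 bit).
p3 = [1, 1, 0, 1]
p4 = [1, 1, 0, 0]
print(matching_bits(p3, p4))  # (2, 1)
```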

Functional association predictions based on context information in the PG networks
The protein association profiles (i.e. vectors of interacting proteins for each protein in the PG networks; see Methods) were compared between all protein pairs in order to retrieve additional association signals not explicitly present in the first order predictions. Similarities between protein association profiles were calculated using three different measurements: bits and specific bits for human and yeast (see methods), and congruence only for yeast [53] (congruence measure involves a combinatorial calculation which is not feasible given the large size of the human PG network; see Methods).
Bits and specific bits scores show very similar behaviour on all the KG datasets. In yeast, specific bits slightly outperforms bits for all the datasets, while in human the opposite is observed and bits slightly outperforms specific bits, except for Reactome_Int, where the bits calculation shows a remarkable improvement over the specific bits predictions (Figures 16e, 17e and 18e). However, the congruence measure in the yeast PG 0.001 network shows poor performance, close to random behaviour, for practically all KG datasets in yeast (Figure 4a; and Figures 14 and 15 in Text S1). Therefore, we decided not to use congruence further as a measure.
The large sizes of both PG matrices (4,374 x 4,374 nodes in yeast and 19,618 x 19,618 nodes in human; see Table 5), make it very unlikely that two proteins would share a significant number of interactions in their respective association profiles by chance, which would explain the good performance of the bits score.
Similarity score performances vary according to the nature of the gold standard datasets used to validate them (Figure 20). If we consider the specific bits score as the most stable measure, we observe that in yeast the second order approach is better at predicting protein relationships in KEGG than at predicting physical interactions (Int dataset) or ontological associations in the GO or FunCat databases (Goss and Foss datasets) (see Figure 20a and b). In human, second order predictions work much better at predicting associations in biological pathways (Reactome and KEGG) than at predicting physical interactions (Int and Reactome_Int), and are even worse at predicting ontological relationships (Goss and Foss; see Figure 20c and d). These differences can probably be explained by the differing noise (error) levels in the biological sources used as gold standards to validate performance.

Weights in Fisher's integration statistic
As described in the main manuscript (Methods), Z-tests were performed using Matlab to ensure that the probability density function (PDF) distributions fit random Gaussian distributions (null hypothesis) at a 5% significance level. Calculation of the weights in the Fisher integration is based on running the Monte Carlo and ESA algorithms simultaneously. Both approaches are well defined and well known in the field of data integration. More specifically, Monte Carlo simulation methods are particularly useful for studying systems with a large number of coupled degrees of freedom, such as the data analysed in the main manuscript. We apply them in a predictor-corrector fashion, in which we predict a weight for our datasets and correct it within the same iteration in order to avoid becoming trapped in local optima. We summarise these methods in the following paragraphs. The Monte Carlo method (Matlab package) provides approximate solutions to a variety of mathematical problems by performing statistical sampling experiments on a computer: generating suitable random numbers and observing the fraction of them obeying some property or properties. The method is useful for obtaining numerical solutions to problems which are too complicated to solve analytically.
The method applies to problems with no probabilistic content as well as to those with inherent probabilistic structure. Among all numerical methods that rely on N-point evaluations in M-dimensional space to produce an approximate solution, the Monte Carlo method has an absolute error of estimate that decreases as N^(−1/2), whereas, in the absence of exploitable special structure, all others have errors that decrease at best as N^(−1/M). The method may produce incorrect results, but with bounded error probability.

For the annealing step, an algorithm simulating the behaviour of a system at a given temperature is used. At each iteration, a neighbouring solution sol′ of the current solution sol is randomly generated. If the neighbour is better than the current solution, it is always accepted and becomes the current solution for the next iteration. Otherwise, the neighbour is accepted with a probability P_accept depending on the energy difference ∆ between the two solutions (i.e. the energy of the neighbour minus the energy of the current solution) and a parameter t called the temperature. This probability increases when the temperature increases and decreases when the energy difference increases. This selection scheme is called the Metropolis criterion. Its advantage over stochastic hill climbing (i.e. only accepting better solutions) is the possibility of escaping local optima. The pseudo-code below shows the framework of a Metropolis chain of length L at temperature t.

Metropolis chain
Algorithm Metropolis_chain(initial, L, t) {
    sol = initial
    repeat L times {
        sol′ = new neighbour of sol
        ∆ = ENERGY(sol′) − ENERGY(sol)
        if ∆ ≤ 0 then Paccept = 1
        else Paccept = exp(−∆/t)
        if random(0,1) < Paccept then sol = sol′
    }
    return sol
}

Simulated annealing consists of a series of Metropolis chains at different decreasing temperatures. The aim of each Metropolis chain is to permit the system to reach thermal equilibrium. A slow cooling eventually leads to a frozen system, yielding a good final solution.
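The Metropolis chain described above can be sketched as a runnable routine. This is a hypothetical illustration: the energy and neighbour functions below are toy placeholders, not those used in our integration.

```python
import math
import random

def metropolis_chain(initial, energy, neighbour, chain_length, t, rng=None):
    """One Metropolis chain at temperature t: better solutions are
    always accepted, worse ones with probability exp(-delta/t)
    (the Metropolis criterion)."""
    rng = rng or random.Random()
    sol = initial
    for _ in range(chain_length):
        cand = neighbour(sol, rng)
        delta = energy(cand) - energy(sol)
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            sol = cand
    return sol

# Toy 1-D example: minimise (x - 3)^2 with small random steps.
energy = lambda x: (x - 3.0) ** 2
neighbour = lambda x, rng: x + rng.uniform(-0.5, 0.5)
```

At a low temperature the chain behaves almost like stochastic hill climbing, while a higher temperature lets it occasionally accept worse solutions and so escape local optima.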
In order to use the simulated annealing algorithm for a specific optimization problem, one must select an appropriate state space corresponding to the possible feasible solutions, a neighbourhood relation between the states, a cost function over the states, and an appropriate cooling schedule. The role of the neighbourhood relation is to express the similarity between the elements of the state space. The neighbourhood of a state is typically defined as the set of states that can be obtained by making some kind of local modification to the current state.
Given the source node w, the state space of the simulated annealing is the set of all possible spanning trees rooted at the source node w, and the cost of a state is its power cost. During the execution, the temperature decreases exponentially between successive Metropolis chains, i.e. the temperature t_i after the ith Metropolis chain is given by t_i = α^i · t_0, where α is the so-called cooling factor, a number close to 1, and t_0 is the initial temperature.
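Putting the pieces together, a minimal simulated annealing sketch with the exponential schedule t_i = α^i · t_0 might look like the following. This is illustrative only; the parameter values and default arguments are assumptions for the toy example, not those used in our weight optimization.

```python
import math
import random

def simulated_annealing(initial, energy, neighbour, t0=1.0, alpha=0.95,
                        chain_length=100, n_chains=100, seed=0):
    """Simulated annealing: a series of Metropolis chains run at
    exponentially decreasing temperatures t_i = alpha**i * t0."""
    rng = random.Random(seed)
    sol = initial
    t = t0
    for _ in range(n_chains):
        # One Metropolis chain at the current temperature t
        for _ in range(chain_length):
            cand = neighbour(sol, rng)
            delta = energy(cand) - energy(sol)
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                sol = cand
        t *= alpha  # exponential cooling between successive chains
    return sol
```

With α close to 1, the system cools slowly: early chains explore broadly, while late chains at near-zero temperature freeze the solution near a good optimum.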

Prediction validation, network topology and context analysis of the PG models without the hiPPI predictions.
Whilst CODA and GECO do not use any publicly available PPI information, hiPPI uses available experimental data, exploiting sequence similarity between known interacting partners and their homologues to predict new interacting pairs (see section 1 of this Text S1 document).
Therefore, there is a reasonable concern that the sequences we are predicting in human and yeast could be very close homologues of the sequences in the KG datasets, and that the weight accorded to the hiPPI predictions in the integrated prediction set (CODAcath, CODApfam, GECO and hiPPI) could be high enough to significantly bias the Fisher's integrated PG model, so that its characteristics resemble those of the experimental KG network. To address this possibility, we have repeated the main analyses of this work excluding the hiPPI predictions.
When hiPPI predictions were excluded, we observed that the Fisher integration of the remaining predictions gives similar performance, predicting almost the same number of true PPIs in yeast and in human above 80% precision (see Figure 21). PG models without hiPPI built at different significance p-values show the same behaviour observed when all predictors are integrated, and the scale-free trend of the ki distribution is also observed as the p-value significance level of the PG (without hiPPI) increases (see Figure 22). Significant PG models without hiPPI (precision ≥ 80%) also show context information of PPIs in both species (see Figure 23). All these results indicate that the similarity of the PG and KG models is not due to any bias or circular information, and they support all the observations, discussion and conclusions of our work.

Figure 21. Prediction power of the integrated methods in yeast and human without the hiPPI predictions.
Panels present the results for the yeast and human validations in terms of precision versus recall for the Fisher integration of all methods except hiPPI. In panel A, validation was performed using the Goss and Int datasets as gold standards in yeast. Panel B shows the validation in human using the Goss, Int and Reactome_int gold standard datasets. Numbers of predicted hits above 80% precision are also indicated in the panels' upper legends.

Figure 22. Degree distribution for the various PG networks built without the hiPPI predictions.
Panels A and B correspond to the ki (degree) frequency distributions of the PG networks at different p-value significance levels (without the hiPPI predictions) in yeast and human, respectively. The legends for these panels show the exponents corresponding to the linear regression fit of the data.