Simulated Evolution of Protein-Protein Interaction Networks with Realistic Topology

We model the evolution of eukaryotic protein-protein interaction (PPI) networks. In our model, PPI networks evolve by two known biological mechanisms: (1) Gene duplication, which is followed by rapid diversification of duplicate interactions. (2) Neofunctionalization, in which a mutation leads to a new interaction with some other protein. Since many interactions are due to simple surface compatibility, we hypothesize there is an increased likelihood of interacting with other proteins in the target protein’s neighborhood. We find good agreement of the model on 10 different network properties compared to high-confidence experimental PPI networks in yeast, fruit flies, and humans. Key findings are: (1) PPI networks evolve modular structures, with no need to invoke particular selection pressures. (2) Proteins in cells have on average about 6 degrees of separation, similar to some social networks, such as human-communication and actor networks. (3) Unlike social networks, which have a shrinking diameter (degree of maximum separation) over time, PPI networks are predicted to grow in diameter. (4) The model indicates that evolutionarily old proteins should have higher connectivities and be more centrally embedded in their networks. This suggests a way in which present-day proteomics data could provide insights into biological evolution.


Introduction
We are interested in the evolution of protein-protein interaction (PPI) networks. PPI network evolution accompanies cellular evolution, and may be important for processes such as the emergence of antibiotic resistance in bacteria [1,2], the growth of cancer cells [3], and biological speciation [4][5][6]. In recent years, increasingly large volumes of experimental PPI data have become available [7][8][9][10], and a variety of computational techniques have been created to process and analyze these data [11][12][13][14][15][16][17][18]. Although these techniques are diverse, and the experimental data are noisy [19], a general picture emerging from these studies is that the evolutionary pressures shaping protein networks are deeply interlinked with the networks' topology [20]. Our aim here is to construct a minimal model of PPI network evolution which accurately captures a broad panel of topological properties.
In this work, we describe an evolutionary model for eukaryotic PPI networks. In our model, protein networks evolve by two known biological mechanisms: (1) a gene can duplicate, putting one copy under new selective pressures that allow it to establish new relationships to other proteins in the cell, and (2) a protein undergoes a mutation that causes it to develop new binding or new functional relationships with existing proteins. In addition, we allow for the possibility that once a mutated protein develops a new relationship with another protein (called the target), the mutant protein can also more readily establish relationships with other proteins in the target's neighborhood. One goal is to see if random changes based on these mechanisms could generate networks with the properties of present-day PPI networks. Another goal is then to draw inferences about the evolutionary histories of PPI networks.

Results
We represent a PPI network as a graph. Each node on the graph represents one protein. A link (edge) between two nodes represents a physical interaction between the two corresponding proteins. The links are undirected and unweighted. To model the evolution of the PPI graph, we simulate a series of steps in time. At time t, one protein in the network is subjected to either a gene duplication or a neofunctionalizing mutation, leading to an altered network by time tzDt. We refer to this model as the DUNE (DUplication & NEofunctionalization) model.

Gene Duplication
One mechanism by which PPI networks change is gene duplication (DU) [21][22][23]. In DU, an existing gene is copied, creating a new, identical gene. In our model, duplications occur at a rate d, which is assumed to be constant for each organism. All genes are accessible to duplication, with equal likelihood. For simplicity, we assume that one gene codes for one protein. One of the copies continues to perform the same biological function and remains under the same selective pressures as before. The other copy is superfluous, since it is no longer essential for the functioning of the cell [24].
The superfluous copy of a protein/gene is under less selective pressure; it is free to lose its previous function and to develop some other function within the cell. Due to this reduced selective pressure, further mutations to the superfluous protein are more readily accepted, including those that would otherwise have been harmful to the organism [25,26]. Hence, a superfluous protein diverges rapidly after its DU event [27,28]. This well-known process is referred to as the post-duplication divergence. Following [29], we assume that the link of each such superfluous protein/gene to its former neighbors is deleted with probability w. The postduplication divergence tends to be fast; for simplicity, we assume the divergence occurs within the same time step as the DU. The divergence is asymmetric [30,31]: one of the proteins diversifies rapidly, while the other protein retains its prior activity. We delete links from the original or the duplicate with equal probability because the proteins are identical. As discussed in the supporting information (SI), this is closely related to the idea of subfunctionalization, where divergence freely occurs until redundancy is eliminated (see SI text in File S1). In our model, w is an adjustable parameter.
In many cases, the post-duplication divergence results in a protein which has lost all its links. These 'orphan' proteins correspond to silenced or deleted genes in our model. As discussed below, our model predicts that the gene loss rate should be slightly higher than the duplication rate in yeast, and slightly lower in flies and humans.
We simulate a gene duplication event at time t as follows: 1a. Duplicate a randomly-chosen gene with probability dDt. 2a. Choose either the original (50%) or duplicate (50%), and delete each of its links with probability w. 3a. Move on to the next time interval, time tzDt.

Neofunctionalization
Our model also takes into account that DNA can be changed by random mutations. Most such mutations do not lead to changes in the PPI network structure. However, some protein mutations lead to new interactions with some other protein (which we call the target protein). The formation of a novel interaction is called a neofunctionalization (NE) event. NE refers to the creation of new interactions, not to the disappearance of old ones. Functional deletions tend to be deleterious to organisms [32]. We do not account for loss-of-function mutations (link deletions) except during post-duplication divergence because damaged alleles will, in general, be eliminated by purifying selection. In our model, NE mutations occur at a rate m, which is assumed to be constant. All proteins are equally likely to be mutated.
How does the mutated protein choose a target protein to which it links? We define a probability q that any protein in the network is selected for receiving the new link from the mutant protein. To account for the possibility of homodimerization, the mutated protein may also link to itself [24,33]. Random choice dictates that q~1=N (see SI).
Many PPI's are driven by a simple geometric compatibility between the surfaces of the proteins [34]. The simplest example is the case of PPI's between flat, hydrophobic surfaces [35], a type of interaction which is very common [36]. These PPI's have a simple planar interface, and the binding sites on the individual proteins are geometrically quite similar to one another. One consequence of these similar-surface interactions is that if protein A can bind to proteins B and C, then there is a greater-than-random chance that B and C will interact with each other. We refer to this property as transitivity: if A binds B, and A binds C, then B binds C. The number of triangles in the PPI network should correlate roughly with transitivity. As discussed below, the number of triangles (as quantified by the global clustering coefficient) is about 45 times higher in real PPI networks than in an equally-dense random graph. This suggests that transitivity is quite common in PPI networks. Another source of transitivity is gene duplication. If A binds B, then A is copied to create a duplicate protein A', then A' will (initially) also bind B. If A interacts with A', then a triangle exists. However, duplication is unlikely to be the primary source of transitivity; recent evidence shows that, due to the post-duplication divergence, duplicates tend to participate in fewer triangles than other proteins [37].
A concrete example of transitivity is provided by the evolution of the retinoic acid receptor (RAR), an example of neofunctionalization which has been characterized in detail [38]. Three paralogs of RAR exist in vertebrates (RARa, b, and c), as a result of an ancient duplication. The interaction profiles of these proteins are quite different. Previous work indicates that RARb retained the role of the ancestral RAR [38], while RARa and c evolved new functionality. RARa has several interactions not found in RARb. RARa has novel interactions with a histone deacetylase (HDAC3) as well as seven of HDAC3's nearest-neighbors (HDAC4, MBD1, Q15959, NRIP1, Q59FP9, NR2E3, GATA2). None of these interactions are found in RARb. The probability that all of these novel interactions were created independently is very low. RARa has 65 known PPI's and HDAC3 has 83, and the present-day size of the human PPI network is a little over 3000 proteins. Therefore, the chance of RARa randomly evolving novel interactions with 7 of HDAC3's neighbors is less than 1 in a billion. This strongly suggests that when a protein evolves an interaction to a target, it has a greater-than-random chance of also linking to other, neighboring proteins.
How do similar-surface interactions affect the evolution of PPI networks? First, consider how an interaction triangle would form. Suppose proteins A and B bind due to physically similar binding sites. Protein X mutates and evolves the capacity to bind A. There is a reasonable chance that X has a surface which is similar to both A and B. If so, protein X is likely to also bind to B, forming a triangle. Denote the probability that two proteins interact due to a simple binding site similarity by a. The probability that A binds B (and X binds A) in this manner is a. Assuming these probabilities are identical and independent, the probability that X binds B is a 2 .
So far, we have discussed transitivity as it affects the PPI's in which protein A is directly involved (A's first-neighbors). We now introduce a third protein to the above example, resulting in a chain of interactions: protein A binds B, B binds C, but C does not bind A. Protein X mutates and gains an interaction with A (with probability a 2 ). What is the probability that X will also bind C? The probability that B binds C due to surface similarity is a. Thus, X will bind C (A's second-neighbor) with probability a 3 . In general, the probability that X will bind one of A's j th neighbors is a jz1 . We refer to this process as assimilation, and the 'assimilation parameter' a is a constant which varies between species. As discussed in SI, it is primarily mutliple-partner proteins which bind to their partners at different times and/or locations which are affected by this process; consequently, at most one link is created by assimilation at the first-neighbor level, second-neighbor level, etc. Assimilation is assumed to act on a much shorter time scale than duplication and neofunctionalization; in our model, it is instantaneous.
Our hypothesized assimilation mechanism makes several predictions that could be tested experimentally: (1) the probability of a protein assimilating into a new pathway should be a 2 (at the first-neighbor level), a 3 (at the second-neighbor level), and so on, where a is a constant which varies between species; (2) weak, nonspecific binding and planar interfaces should be overrepresented in interaction triangles (and longer cycles) between nonduplicate proteins; (3) competitive inhibitors should be overrepresented in interaction triangles; and (4) domain shuffling should be associated with assimilation. (See SI for discussion of (3) and (4).).
We simulate a neofunctionalization event at time t as follows: 1b. Mutate a randomly-chosen gene with probability mDt. 2b. Link to a randomly-chosen target protein.
3b. Add a second link to one of the target's first-neighbor proteins, chosen randomly, with probability a 2 . 4b. Add a link to one of the target's second-neighbor proteins, with probability a 3 , etc. 5b. Move on to the next time interval, time tzDt.

Model Simulation and Parameters
A flowchart of how PPI networks evolve in our model is shown in Figure 1. To simulate the network's evolution, one of the two mechanisms above is used at each time step, using [39]. We call each possible time series a trajectory. We begin each trajectory starting from two proteins sharing a link (the simplest configuration that is still technically a network). Each simulated trajectory ends when the model network has grown to have the same total number of links, K, as found in the experimental data, K data . Here, we perform sets of simulations for three different organisms: yeast (Saccharomyces cerevisiae), fruit flies (Drosophila melanogaster), and humans (Homo sapiens). Because evolution is stochastic, there are different possible trajectories, even for identical starting conditions and parameters. We simulated 50 trajectories for each organism. Our figures show the median values of each feature as a heavy line, and individual trajectories as light lines.
For a given data set, the number of links (K data ) is known. We estimate the duplication rate d from literature values. There have been several empirical estimates of duplication rates, mostly falling within an order of magnitude of each other [27,[40][41][42][42][43][44][45]. We averaged together the literature values to estimate d for each species (Table 1).
The quantity m is not as well known. Its value relative to d has been the topic of considerable debate [24,[46][47][48]. Although, in principle, m is a measurable quantity, it has proven difficult to obtain an accurate value, in part because the fixation rate of neofunctionalized alleles varies with population size [49,50]. In the absence of a consensus order-of-magnitude estimate, in our model, we treat m as a fitting parameter. Consistent with the findings of [51] and [46], our best-fit values of m are within an order of magnitude of each other for yeast, fruit fly, and human networks. Best-fit parameter values are given in Table 1.

Present-day Network Topology
One test of an evolutionary model is its predictions for presentday PPI network topologies. Current large-scale PPI data sets have a high level of noise, resulting in significant problems with false positives and negatives [19,52]. To mitigate this, we compare only to 'high-confidence' experimental PPI network data gathered in small-scale experiments (see Methods). We computed 10 topolog-ical features, quantifying various static and dynamic aspects of the networks' global and local structures: degree, closeness, eigenvalues, betweenness, modularity, diameter, error tolerance, largest component size, clustering coefficients, and assortativity. 8 of these properties are described below (see SI for others).
The degree k of a node is the number of links connected to it. For protein networks, a protein's degree is the number of proteins with which it has direct interactions. Some proteins interact with few other proteins, while other proteins (called 'hubs') interact with many other proteins. Previous work indicates that hubs have structural and functional characteristics that distinguish them from non-hubs, such as increased proportion of disordered surface residues and repetitive domain structures [53]. The high degree of a protein hub could indicate that protein has unusual biological significance [54]. The network's overall link density is described by its mean degree, SkT ( Table 2). The degree distribution p(k) is the probability that a protein will have k links. PPI networks have a few hub proteins and many relatively isolated proteins. The heavy tail of the degree distribution shows that PPI networks have significantly more hubs than random networks have. Simulated and experimental degree distributions are compared in Figure 2.
Component refers to a set of reachable proteins. If any protein is reachable from any other protein (by hopping from neighbor to neighbor), then the network only has one component. If there is no path leading from protein A to B, then A and B are in different components. The fraction of nodes in the largest component (f 1 ) is a measure of network fragmentation (Table 2 and Figure S3). Note that, although silent genes (proteins with no links) exist in real systems, these genes do not appear in data sets consisting only of PPI's. Therefore, calculations of f 1 for all models exclude orphan proteins (proteins with k~0).
Gene loss, the silencing or deletion of genes, is known to play an important role in evolution. The loss of a functioning gene will damage an organism, making the gene loss unlikely to be passed on. The exception is if the gene is redundant. Consistent with this reasoning, evidence suggests that many gene loss events are losses of one copy of a duplicated gene [30,55]. Although empirical estimates of the gene loss rate varied considerably, a consistent finding across several studies is that the rates of gene duplication and loss are of the same order-of-magnitude [27,41,44]. This broad picture is in good agreement with our model. In our model, a gene is considered lost when it has degree zero. Our model predicts that the ratio of orphan to non-orphan proteins is 1:6+0:4 in yeast, 0:58+0:06 in flies, and 0:67+0:09 in humans. The gene loss rate has been previously estimated to be about half the duplication rate in both flies and humans [27,44], consistent with our model's prediction.
The distance between nodes i and j is defined as the number of node-to-node steps that it takes along the shortest path to get from node i to j. The closeness centrality of a node i, ' i , is the inverse of the average distance from node i to all other nodes in the same component. The diameter, D, of a network is the longest distance in the network. Simulated closeness distributions are compared to experiments in Figure 3. Interestingly, proteins have about 'six degrees of separation', similar to social networks [56,57]. The closeness distributions p(') have peaks around 1='&5{7.
Another property of a network is its modularity [58]. Networks are modular if they have high densities of links (defining regions called modules), connected by lower densities of links (between modules). One way to quantify the extent of modular organization in a network is to compute the modularity index, Q [59,60]: where k i and k j are the degrees of nodes i and j, u i and u j denote the modules to which nodes i and j belong, d(u i ,u j )~1 if u i~uj and d(u i ,u j )~0 otherwise, and A ij~1 if nodes i and j share a link, and A ij~0 otherwise. Q quantifies the difference between the actual within-module link density to the expected link density in a randomly connected network. Q ranges between {1 and 1; positive values of Q indicate that the number of links within modules is greater than random. The numerical value of Q required for a network to be considered 'modular' depends on the number of nodes and links and method of computation. To calibrate baseline Q values given our particular network data, we used the null model described in [61]. Our non-modular baseline values are Q~0:603 for the human PPI net, Q~0:590 for yeast, and Q~0:722 for flies (see SI). As shown in Table 2, PPI networks are highly modular, and our simulated Q values are in good agreement with those of experimental data. The clustering coefficient, C i , for a protein i, is a measure of mutual connectivity of the neighbors of protein i. C i is defined as the ratio of the actual number of links between neighbors of protein i to the maximum possible number of links between them, In a PPI network, clustering is thought to reflect the high likelihood that proteins of similar function are mutually connected [62]. The average (or global) clustering coefficient, SCT, quantifies the extent of clustering in the network as a whole. As shown in Table 2, PPI networks have large global clustering coefficient values; the yeast PPI network, for example, has a value of SCT which is 45 times higher than that of a random graph of equivalent link density. In flies and humans, our simulated networks have SCT values in excellent agreement with the data; in yeast, our predicted value is slightly low.
A network is said to be 'hierarchically clustered' if the clustering coefficient and degree obey a power-law relation, C~k {j [63] ( Figure S1), indicating that nodes are organized into small-scale modules, and the small-scale modules are in turn organized into larger-scale modules following the same pattern [64]. By plotting each node's clustering coefficient against its degree, we observed a trend consistent with hierarchical clustering, although data in the tail is very limited.
The betweenness of a node measures the extent to which it 'bridges' between different modules. Betweenness centrality, b, is defined as: # shortest paths passing through node i # total shortest paths : Betweenness has been proposed as a uniquely functionallyrelevant metric for PPI networks because it relates local and global topology. It has been argued that knocking out a protein that has high betweenness may be more lethal to an organism than knocking out a protein of high degree [65]. Betweenness distributions are shown in Figure 4.
If a network's well-connected nodes are mostly attached to poorly-connected nodes, the network is called disassortative. A simple way to quantify disassortativity is by determining the median degree of a protein's neighbors (n) as a function of its degree (k). Previous work has found that yeast networks are disassortative [61]. It has been argued that disassortativity is an essential feature of PPI network evolution, and recent modeling efforts have heavily emphasized this feature [66,67]. However, it was noted by [68] that disassortativity may simply be an artifact of the yeast two-hybrid technique, and [69] pointed out that this trend is quite different among different yeast datasets, and in some cases is completely reversed, resulting in assortative mixing, where high degree proteins prefer to link to other highdegree proteins. As shown in Figure 5 and Table S1, the empirical data shows no evidence of disassortativity in flies or humans, and even the trend in yeast is quite weak. This conclusion is based solely on analysis of the empirical data, and casts further doubt on the role of disassortative mixing in PPI network evolution.
Comparisons of simulated and experimental eigenvalue spectra and error tolerance curves are shown in SI (Figures S7 and S8). As discussed in SI, the various per-node network properties we have analyzed are largely uncorrelated ( Figure S9).

Evolutionary Trajectories
We now consider the question of how PPI networks evolve in time. The present-day networks show a rich-get-richer structure: PPI networks tend to have both more well-connected nodes and more poorly connected nodes than random networks have. In our model, the rich-get-richer property has two bases: duplication and assimilation. The equal duplication chance per protein means the probability for a protein with k links to acquire a new link via duplication of one of its interaction partners is proportional to k. Likewise, the probability of a protein to receive a link from the first-neighbor assimilation probability a is proportional to its degree k. 'Rich' proteins get richer because the probability of acquiring new links rises with the number of existing links.
First, we discuss two dynamical quantities for which experimental evidence exists: the rate of gene loss, and the relation between a protein's age and its centrality. Gene losses in our model correspond to 'orphan' proteins which have no interactions with other proteins. As shown in Figure S3, the fraction of orphan proteins grows quickly at first, then levels off. This is consistent with the findings of [44]: in humans, while the overall duplication rate is higher than the loss rate, when only data from the past 200 Myr are considered, the loss rate is slightly higher than the   [29], 'Berg' is the link dynamics model [85], 'RG' is random geometric [89], 'MpK' is the physical desolvation model presented in [52], and 'ER' is an Erdö s-Rényi random graph [90]. duplication rate. In our model, after the initial rapid expansion, the rate of gene loss stabilizes relative to the duplication rate. We define the 'age' of a protein in our simulation according to the order in which proteins were added to the network. Our model shows that a protein's age correlates with certain network properties. Consistent with earlier work [70][71][72][73], we find that older proteins tend to be more highly connected. We plotted the 'age index' of a protein (the time step at which the protein was introduced) versus its centrality scores. As shown in Figure S2, the age index negatively correlates with degree, betweenness, and closeness centralities: older proteins tend to be more central than younger proteins. Figure S2 shows our model's prediction that a protein's age correlates with degree, betweenness, and closeness centrality. We confirmed this prediction by following the evolutionary trajectories of individual proteins ( Figure S4). These results are consistent with the eigenvalue-based aging method described in [73] (Figure S5). Phylogenetic protein age estimates indicate that older proteins tend to have a higher degree [70,73], which our model correctly predicts. Interestingly, the eigenvaluebased scores are only modestly correlated with other centrality scores (0.36 degree, 0.47 betweenness, and 0.10 closeness correlations). Using the eigenvalue method in tandem with our centrality-based method could provide stronger age-discriminating power for PPI networks than either method alone.
The correlation between centrality and age suggests that static properties of present-day networks may be used to estimate relative protein ages. Suppose each normalized centrality score (k':k= max (k), '':'= max ('), b':b= max (b)) represents a coordinate in a 3-D 'centrality space'. We can then define a composite centrality score (S) as S 2 :(k') 2 z('') 2 z(b') 2 .
Do older proteins typically have different functions than newer proteins? We classified S. cerevisiae proteins using the GO-slim gene ontology system in the Saccharomyces Genome Database. As shown in Figure S6, GO-slim enrichment profiles were somewhat different between the oldest and youngest proteins (as measured by their S values). Several categories which were more enriched for the oldest proteins were the cell cycle, stress response, cytoskeletal and cell membrane organization, whereas younger proteins were overrepresented in several metabolic processes. Overall, the differences were not dramatic, suggesting that cellular processes generally require both central and non-central proteins to function. Consistent with this, ancient proteins tend to be centrally located with modules, as their betweenness values gradually decline over time ( Figure S4). The roughly linear relation between degree and betweenness also suggests that ancient proteins do not occupy structurally 'special' positions within the network, such as stitching together separate modules (Table S1 and Figure S10). This may indicate that modules tend to accumulate around the most ancient proteins, which act as a sort of nucleus. Thus, ancient proteins are involved in all kinds of pathways, because they have each nucleated their own pathway.
In contrast to the two dynamical quantities discussed so far, most structural properties of PPI networks have only been measured for the present-day network. Although our model accurately reproduces the present-day values of these quantities, there is no direct evidence that the simulated trajectories are correct; rather, these are predictions of our model. Figure 6 shows that both modularity Q and diameter D increase with time. These are not predictions that can be tested yet for biological systems, since there is no time-resolved data yet available for PPI evolution. Time-resolved data is only currently available for various social networks (links to websites, co-authorship networks, etc.). Interestingly, the diameters of social networks are found to shrink over time [74]. Our model predicts that PPI networks differ from these social networks in that their diameters grow over time. In addition to Q and D, we tracked the evolutionary trajectories of several other quantities: the evolution of the global clustering coefficient, the rate of signal propagation, the size of the largest connected component ( Figure S3), as well as betweenness and degree values for individual nodes ( Figure S4). See SI for details.

Discussion
The relevance of selection to PPI network evolution has been a topic of considerable debate [75], particularly in the context of higher-order network features, such as modularity. A number of authors have argued that specific selection programs are required to generate modular networks, such as oscillation between different evolutionary goals [76][77][78][79][80][81]. However, previous work has shown that gene duplication by itself, in the absence of both natural selection and neofunctionalization, can generate modular networks [82,83]. Consistent with the findings of [82,83], modularity in our model is primarily generated by gene duplications (Figure S11; see SI for sensitivity analysis). Unfortunately, duplication-only models err in their predictions of other network properties (Tables 2 and S2; Figure S12). A well-known problem with duplication models is that they generate excessively fragmented networks, with only about 20% of the proteins in the largest component. This is in sharp contrast to real PPI networks, which have 73% to 89% of their proteins in the largest component. Neofunctionalization-only models have most of their proteins in the largest component, but are significantly less modular than real networks. As shown in Table 2, by modeling duplication and neofunctionalization simultaneously, the DUNE model generates networks which have the modularity found in duplication-only models, while retaining most proteins in the largest component. This lends support to the idea that gene duplication contributes to the modularity found in real biological networks, and that protein modules can arise under neutral evolution, without requiring complicated assumptions about selective pressures. This is consistent with recent experimental work characterizing a real-world fitness landscape, showing that it is primarily shaped by neutral evolution [84].
Previous estimates of NE rates in eukaryotes have varied widely, generally falling in the range of 100 to 1000 changes/genome/ Myr [24,46,85], or on the order of 0.1 change/gene/Myr. However, more recent empirical work has identified several problems with the methods used to obtain these estimates, suggesting that de novo link creation is much less common than previously thought [48]. This is consistent with our model. The best-fit values of our NE rate m are in the range of 10 {5 to 10 {4 / gene/Myr (Table 1), which in all three organisms are considerably slower than the duplication rates d.
Biologically, many of the interactions created by our neofunctionalization mechanism are expected to initially be weak, nonfunctional interactions. The results of [86] suggest that strong functional interactions are correlated with hydrophobicity, which in turn is correlated with promiscuity. We posit that initially weak, non-functional interactions are an essential feature of PPI evolution, as they provide the 'raw material' for the subsequent evolution of functional interactions. If this reasoning is correct, one consequence should be that hub proteins are, on average, more important to the cell than non-hub proteins. This has been found to be true: both degree [54] and betweenness centrality [65] have positive correlations with essentiality, indicating that hub proteins are often critical to the cell's survival.
We have described here a model for how eukaryotic protein networks evolve. The model, called DUNE, implements two biological mechanisms: (1) gene duplications, leading to a superfluous copy of a protein that can change rapidly under new selective pressures, giving new relationships with other proteins and (2) a protein can undergo random mutations, leading to neofunctionalization, the de novo creation of new relationships with other proteins. Neofunctionalization can lead to assimilation, the formation of extra novel interactions with the other proteins in the target's neighborhood. Biological evidence suggests that this type of mechanism exists. Our specific implementation is based on a simple geometric surface-compatibility argument for the observed transitivity in PPI networks. This is, of course, a heavily simplified model of PPI network evolution, and there are many biological factors which have not been included. However, our relatively simple model shows good agreement with 10 topological properties in yeast, fruit flies, and humans. One finding is that PPI networks can evolve modular structures, just from these random forces, in the absence of specific selection pressures. We also find that the most central proteins also tend to be the oldest. This suggests that looking at the structures of present-day protein networks can give insight into their evolutionary history.

Methods
Genome-wide PPI screens have a high level of noise [19], and specific interactions correlate poorly between data sets [52]. We found that several large-scale features differed substantially between types of high-throughput experiments (see SI). Due to concerns about the accuracy andprecision of data obtained through high-throughput screens, we chose to work with 'high-confidence' data sets consisting only of pairwise interactions confirmed in small-scale experiments, which we downloaded from the public HitPredict database [87]. We found sufficient high-confidence data in yeast (S. cerevisiae), fruit flies (D. melanogaster), and humans (H. sapiens).
All simulations and network feature calculations were carried out in Matlab. Our scripts are freely available for download at http://ppi.tinybike.net. We computed betweenness centralities, clustering coefficients, shortest paths, and component sizes using the MatlabBGL package. Modularity values were calculated with the algorithm of [88]. All comparisons (except the degree distribution) are between the largest connected components of the simulated and experimental data.
Due to the human network's somewhat larger size, most dynamical features were calculated once per 50 time steps for the   human network, but were updated at every time step in the yeast and fly networks. For dynamical plots, the y coordinates of the trend line are medians-of-medians. The amount of time elapsed per time step (the x coordinate) varies between simulations. We binned the time coordinates to the nearest 10 million years for yeast and fly, and 25 million years for human. When multiple values from the same simulation fell within the same bin, we used the median value. We then calculated the median value between simulations. Scatter plot trend lines are calculated in a similar way. The trend line represents the median response variable (C, b, or ') value over all nodes within a single simulation with degree k. The y coordinate of the trend line is therefore the median (across 50 simulations) of these median response variables. This median-ofmedians includes all simulations that have nodes of a given degree.

Supporting Information
File S1 Supporting information text. Comparison of five other models to the yeast PPI network: Vázquez [29] (green), Berg [85] (red), random geometric [89] (dark blue), MpK desolvation [52] (purple), and ER random graph [90] (brown). For reference, DUNE model results are shown as a black line. Dots represent high-confidence experimental yeast data, and solid lines are median values over 50 simulations. (TIF) Table S1 Scaling exponents. Distributional exponents (p(k)*k {c , p(b)*b {b ) were estimated using the maximum likelihood method of [91]. Other exponents (C*k {j , b*k a , n*k {d ) were estimated using nonlinear regression. Due to the relatively small sizes of the data sets, there is considerable uncertainty in these estimates. (PDF) Table S2 SMAPE values. Symmetric mean absolute percentage error (SMAPE) of simulation versus experiment in yeast (Eq. ??). 'E.T.' is the error tolerance curve with random protein removal, and 'E.T. (k)' is the error tolerance curve with highestdegree proteins removed first. 'DUNE' is the model described here, 'Vázquez' is the DU-only model of [29], 'Berg' is the link dynamics model [85], 'RG' is random geometric [89], 'MpK' is the physical desolvation model presented in [52], and 'ER' is an Erdös-Rényi random graph [90]. For each comparison, the lowest value is shown in bold. (PDF)