Non-Uniform Survival Rate of Heterodimerization Links in the Evolution of the Yeast Protein-Protein Interaction Network

Protein-protein interaction networks (PINs) are scale-free networks with a small-world property. In a small-world network, the average cluster coefficient () is much higher than in a random network, but the average shortest path length () is similar between the two networks. To understand the evolutionary mechanisms shaping the structure of PINs, simulation studies using various network growth models have been performed. It has been reported that the heterodimerization (HD) model, in which a new link is added between duplicated nodes with a uniform probability, could reproduce scale-freeness and a high . In this paper, however, we show that the HD model is unsatisfactory, because (i) to reproduce the high  in the yeast PIN, a much larger number (n HI) of HD links (links between duplicated nodes) are required than the estimated number of n HI in the yeast PIN and (ii) the spatial distribution of triangles in the yeast PIN is highly skewed but the HD model cannot reproduce the skewed distribution. To resolve these discrepancies, we here propose a new model named the non-uniform heterodimerization (NHD) model. In this model, an HD link is preferentially attached between duplicated nodes when they share many common neighbors. Simulation studies demonstrated that the NHD model can successfully reproduce the high , the low n HI, and the skewed distribution of triangles in the yeast PIN. These results suggest that the survival rate of HD links is not uniform in the evolution of PINs, and that an HD link between high-degree nodes tends to be evolutionarily conservative. The non-uniform survival rate of HD links can be explained by assuming a low mutation rate for a high-degree node, and thus this model appears to be biologically plausible.


Introduction
The information of protein-protein interaction networks (PINs) at the whole-genome level is now available from several organisms, including Saccharomyces cerevisiae [1][2][3], Caenorhabditis elegans [4], and Drosophila melanogaster [5]. These data were provided by using highthroughput experimental techniques such as yeast two-hybrid screens [1,2]. The structure of PINs is represented as nodes (proteins) and links (interactions between proteins). Studies of PIN structures have revealed that PINs exhibit the following interesting properties [6].
First, PINs are scale-free networks [7,8]. The number of links connected to a node is called a degree. The degree distribution P(k) gives the probability that a node has k links (i.e., degree k). In a scale-free network, P(k) decays as a power law, following P(k),k 2c [9]. (In the case of PINs, it is known that P(k) better fits a power law with an exponential cut-off, i.e., P(k),(k 0 +k) 2c e 2k/kc [7,10].) Therefore, a scale-free network is highly heterogeneous and is characterized by the presence of a large number of nodes having only a few links and a small number of nodes (hubs) that have numerous links. A scale-free network is known to be tolerant to random removal of nodes, but it is very fragile against selective removal of hubs [7,11]. Second, PINs are small-world networks [4,5,8,10]. A small-world network is highly clustered like regular lattices, but it has small path lengths like a random network [12]. A small-world property is quantified by two statistics of a network, the average cluster coefficient ,C. and the average shortest path length ,L.. The cluster coefficient of node i is defined as C i = 2e i / k i (k i 21), where k i is the degree of node i and e i is the number of links connecting k i neighbors of node i to one another [12]. (When k i is zero or one, C i is defined to be zero.) In other words, e i is the number of triangles that pass through node i. C i is equal to one when all neighbors of node i are fully connected to one another, while C i is zero when none of the neighbors are connected to one another. A small-world network is characterized by a ,C. that is larger, and an ,L. that is similar, to those of a random network [12]. (In a random network, ,C. = ,k./N and ,L.,logN/ log,k. [13], where ,k. is the average degree and N is the number of nodes.) Scale-free and small-world properties are commonly observed in various complex networks such as the Internet [9], coauthorship of scientific papers [14], metabolic pathways [15], and functional connections in the human brain [16]. Third, PINs show a hierarchical structure. In a network showing a hierarchical structure, ,C(k)., the average cluster coefficient of k-degree nodes, decays as a power law ,C(k).,k 2m [17,18]. This indicates that a node with a small number of links has a high C and belongs to a small subnetwork in which all nodes are densely connected, while a hub has a low C and links different subnetworks. Fourth, PINs show a disassortative structure, in which ,K nn (k). (''nn'' represents ''nearest neighbors''), the average degree among the neighbors of all k-degree nodes, follows ,K nn (k).,k 2n [19][20][21]. Therefore, the connections between a hub and a low-degree node are favored, while those between hubs and those between low-degree nodes are suppressed [19][20][21][22].
It has been reported that the emergence of scale-free networks can be explained by the mechanisms of network growth and preferential attachment, in which a new node is preferentially attached to a node that already has many links [23]. In the PIN evolution, gene duplication is thought to be responsible for preferential attachment, because gene duplication creates a new node having the same interacting patterns as the original node, and a high-degree node is more likely to gain a new link by the duplication of a randomly selected node than a low-degree node [6]. To account for the properties of PINs mentioned above, several network growth models have been proposed. These models are generally based on gene duplication and divergence. In a divergence process, some of the links created by duplication are removed and some new links are added to a network. Sole et al. [10] proposed a model in which a divergence process includes two mechanisms, random removal of links from one of the duplicated nodes and random attachment of new links between a duplicated node and another node. Both simulation and analytical studies have shown that this model can generate scale-free and smallworld properties [10,[24][25][26][27]. However, studies have reported that a network generated by this model having the same number of nodes and links as those in the yeast and fly PINs showed a much smaller ,C. than these PINs [10,28,29]. To overcome this difficulty, Vazquez et al. [30] and Ispolatov et al. [29] proposed the heterodimerization (HD) model. In this model, gene duplication is followed by divergence and HD; in the divergence process links are removed from duplicated nodes with a uniform probability a, and in the HD process a new link is established between two duplicated nodes with another probability b, forming a heterodimer [29,30]. When a self-interacting protein is duplicated, the duplicated proteins will interact to each other. Therefore, b in the HD model represents the probability that a randomly selected protein is self-interacting and the link between two duplicated proteins survives after divergence. Simulation and analytical studies have shown that the HD model could reproduce a similar ,C. to the yeast and fly PINs as well as a scale-free property [28][29][30]. The reason for the successful reproduction of a large ,C. is that an HD process creates a triangle, and a network containing a large number of triangles shows a large ,C.. Middendorf et al. [28] reported that the HD model could best reproduce the fly PIN among seven network growth models using a technique from machine learning.
In this paper, we examine the yeast PIN, since it constitutes the most reliable PIN data currently available at the whole genome level [31]. We first show that the HD model is unsatisfactory as an evolutionary model of the yeast PIN. We then propose a new model named the non-uniform heterodimerization (NHD) model, in which an HD link is preferentially attached between two duplicated nodes that share many common neighbors. The NHD model can successfully reproduce various features of the yeast PIN that cannot be explained by the HD model.

Results
In this study, we examined two models, the heterodimerization (HD) model and the non-uniform heterodimerization (NHD) model (see Materials and Methods for details). In the HD model, there are two parameters, the probability that a link is removed from one of the duplicated nodes (a) and the probability that a new link is attached between two duplicated nodes (b) ( Figure 1A), which represents the probability that a duplicated protein is selfinteracting and the interaction between two duplicated proteins survives after the divergence process. These parameters were determined to let ,k. and ,C. in a generated network be the same as those in the yeast PIN. To compare the number of HD links in a generated network with that in the yeast PIN, we defined an evolutionary distance ( Figure 1B). Two nodes are defined to be homologous when the evolutionary distance between these nodes is lower than or equal to a given threshold value d T . The statistics of the networks generated by the HD model are shown in Table 1. The number of homologous pairs in the yeast PIN (n H = 6,544) is between n H for d T = 3 (5,309) and that for d T = 4 (8,337). However, the number of interactions between homologous nodes (n HI = 395 and 514 for d T = 3 and 4, respectively) is much larger than that in the yeast PIN (175). This observation is consistent with the investigation of the fly PIN by Ispolatov et al. [29], in which it was reported that the HD model requires a much larger number of HD links (270) than the actual number in the fly PIN (142) [32] to generate the 1,405 triangles present in the fly PIN.
As was mentioned in the Introduction, the HD model can generate a network with a high ,C., because an HD link produces triangles. When two duplicated nodes share n N common neighbors, n N new triangles are created by an HD link between them ( Figure 1C). Therefore, if a new link is attached between duplicated nodes more preferentially when a larger number of neighbors are shared between them, it is expected that HD links fewer than those required by the HD model can reproduce the high ,C. in the PIN. For this reason, we examined the NHD model, in which the probability that a new link is added between duplicated nodes is proportional to the number of neighbors shared by these nodes. The probability of removing a link (a) and the proportionality constant to add a new link (b) were adjusted to let ,k. and ,C. in a generated network be the same as those in the yeast PIN. The results of simulations by the NHD model are shown in Table 1. Both n H (6,544) and n HI (175) in the yeast PIN are between the values for the NHD model with d T = 3 and 4 (n H = 5,315 and 8,351, respectively, and n HI = 157 and 208, respectively). Moreover, the values of n HI /n H for the NHD model with d T = 3 and 4 are very close to that in the yeast PIN. Therefore, both a high ,C. and a low n HI were well reproduced by the NHD model. Table 1 also shows that ,L. in both HD and NHD networks are similar to that in a random network, indicating that they are small-world networks. However, interestingly, ,L. in the yeast PIN is much lower than that in a random network (see Discussion). Figure 2A shows the degree distribution of the networks generated by the HD and NHD models and that of the yeast PIN. Although there is a discrepancy between the model networks and the yeast PIN for a large k (see Discussion), the results showed that both models can reproduce the degree distribution of the yeast PIN that follows a power law with an exponential cut-off. Figure 2B shows that ,C(k). in the networks generated by the models follows a power law, indicating that these networks exhibit a hierarchical structure. In the case of the yeast PIN, however, ,C(k). decreases following a power law as k increases only for a non-small k (k.10). This relationship was also observed in the previous studies [18,19]. As shown in Figure 2C, both the HD and NHD networks display a disassortative structure ,K nn (k).,k 2n , but the values of n are smaller than that in the yeast PIN (see Discussion). Figure 2D shows the probability P T (n T ) that a given link is contained in n T triangles in a network. For example, n T = 2 for the link between nodes A and A9 (dashed line) in the middle of Figure 1C. The probability distribution P T (n T ) is a statistic describing a spatial distribution of triangles in a network. In a network generated by the HD model, the spatial distribution of triangles can be regarded to be random, because addition of a new HD link occurs randomly. As shown in this figure, the distribution of P T (n T ) in the yeast PIN is quite different from that in the network by the HD model, suggesting that the spatial distribution of triangles in the yeast PIN is highly skewed. In other words, in the yeast PIN, the extent of overlapping of triangles is larger than the expectation from a random distribution. On the other hand, the P T (n T ) distribution for the NHD model is close to that in the yeast PIN. Therefore, the structure of a network generated by the NHD model is more similar to the PIN than that by the HD model.
In the HD and NHD models, self-interactions were not explicitly considered, though it was assumed that an HD link is created only when a self-interacting protein is duplicated. However, in the yeast PIN, the fraction of self-interacting proteins is only 0.049, and ,k. increases slightly (3.84) when self-interactions are considered. The effect of self-interactions to other statistical properties of the yeast PIN is negligible ( Figure S1). Therefore, it is expected that explicit consideration of selfinteractions in a model does not essentially alter the results described above. We should also note that the fraction (0.049) of self-interactions in the yeast PIN is consistent with the value of b (0.028) in the NHD model. The fraction in the yeast PIN is much smaller than that (0.18) in the human transcription factor network, in which the statistical properties are considerably different between the networks with and without self-interactions [33].
We also examined the effect of gene deletions that are caused by mutations. For this purpose, we modified the NHD model by adding the process of random elimination of nodes (NHD+E model). However, the elimination of nodes did not essentially change the results (Table S1 and Figure S2).

Discussion
In this study, we showed that the NHD model can successfully reproduce both a high ,C. and a low n HI in the yeast PIN, whereas the HD model cannot regenerate the value of n HI . We also demonstrated that the distribution of triangles in the yeast (A) HD model. Node A is duplicated to generate node A9. Each of the links to node A9 is removed with a uniform probability a (left). Note that this method is based on completely asymmetric divergence [44], in which only one (A9) of the duplicated nodes is the target of removal of links. An HD link between node A and node A9 is attached with a uniform probability b (middle). (B) Evolutionary distance. When a node is duplicated, the evolutionary distance between each of the duplicated nodes and each of the other nodes in a network is assumed to increase by one due to mutations occurring in the duplicated nodes during the divergence process. Suppose that the evolutionary distance between node A and node B is d (left). After the duplication of node A to generate node A9 and the divergence of them, the evolutionary distance between nodes A and B, and that between nodes A9 and B become d+1 whether a link between nodes A and B and that between A9 and B are present or not (middle). (A dashed line indicates absence of a link.) The evolutionary distance between nodes A and A9 is defined to be 1 regardless of the presence of a link between them. After that, if node A9 is duplicated to create node A'', the evolutionary distance between nodes A and B continues to be d+1, while the evolutionary distances between nodes A and A9, A and A0, B and A9, and B and A0 become 2, 2, d+2, and d+2, respectively (right). (C) NHD model. In this model, the probability that a link is added between A and A9 is proportional to the number (n N ) of common neighbors shared by these nodes. doi:10.1371/journal.pone.0001667.g001 PIN is highly skewed and the skewed distribution can be reproduced by the NHD model but not by the HD model. These results suggest that the NHD model would reflect the actual evolutionary mechanism of PINs. Is the NHD model biologically realistic? In the PIN evolution, when a self-interacting protein is duplicated, an HD link between duplicated proteins is added to the PIN. Some HD links survive in evolution, but other links disappear because of mutations occurring at interacting sites in one or both of the duplicated proteins. Therefore, in the HD model, an HD link is assumed to survive at a uniform rate. On the other hand, in the NHD model, it is assumed that the survival rate of an HD link is proportional to the number of common neighbors shared by the duplicated nodes. Figure 3A shows the probability P HD (n N ) that two homologous nodes have an HD link when they share n N common neighbors. This figure indicates that P HD (n N ) is nearly constant regardless of n N in the networks by the HD model. On the other hand, in the yeast PIN, P HD (n N ) increases in proportion with n N , which is consistent with the NHD model. These observations suggest that, in the evolution of PINs, the survival rate of HD links is not uniform in terms of n N . Therefore, the NHD model appears to be realistic. The value of P HD (n N ) in the NHD network is smaller than that in the yeast PIN for n N ,15. This appears to happen because, in the NHD network, several protein pairs have very large values of n N , which is not the case in the yeast PIN. That there are no protein pairs with large n N in the yeast PIN may be due to the high duplicability of low-degree nodes [34], which was not considered in the NHD model (see below).
Why, then, is the survival rate of HD links not uniform, but rather proportional to the number of common neighbors? One possible explanation is as follows. It has been reported that the degree of proteins in the yeast PIN is negatively correlated with their evolutionary rates [35][36][37], though this assertion is controversial [38]. Not surprisingly, proteins connected by an HD link that has a large n N tend to have a high degree ( Figure 3B). Therefore, the evolutionary rates of proteins in an HD link with a large n N are expected to be low. If this is the case, the possibility of the occurrence of mutations at the binding sites would also be low, and thus the survival rates of HD links having a large n N are thought to be higher than those of HD links having a small n N .
Although the degree distribution P(k) of the NHD network is generally in good agreement with that of the yeast PIN, the number of nodes with k.50 in the former is much smaller than that in the latter (Figure 2A). The average of the maximum degrees among the NHD networks is 75.2, while the maximum degree in the yeast PIN is 286. Moreover, though the NHD network exhibits a disassortative structure ,K nn (k).,k 2n , the value of n is considerably smaller than that in the yeast PIN ( Figure 2C). These discrepancies might be resolved by introducing a mechanism wherein low-degree nodes duplicate more frequently than high-degree nodes. Prachumwat and Li [34] reported a negative correlation between the degree of proteins and their duplicability. Due to the disassortative structure ( Figure 2C), lowdegree nodes have more links to high-degree nodes than to lowdegree nodes. Therefore, as a result of frequent duplication of lowdegree nodes, links between a high-degree node and a low-degree node are preferentially generated, and a high-degree node tends to gain new links. For this reason, with the mechanism of high duplicability of low-degree nodes, the degrees of hubs and the value of n in Figure 2C are expected to become larger than the current values. Moreover, the lack of HD links with high n N in the yeast PIN ( Figure 3A) would also be explained with this mechanism, because HD links with high n N should be rare if the duplicability of high-degree nodes is low.
Although PINs are generally considered to be small-world networks [4,8,10,24], the ,L. in the yeast PIN is much lower The number in parentheses represents the standard deviation calculated from 100 networks generated by simulations. 2, the same as above. a.
Parameters used in the simulations. See Materials and Methods. b. The number of homologous pairs. Two nodes are defined to be homologous when the evolutionary distance between the two nodes is d T or less. A random network that has the same ,k. and N as those in the yeast PIN, where N is the number of nodes (3,891) in the yeast PIN. The values of ,C. and ,L. were calculated using the formulae ,K./N and logN/log,k., respectively. doi:10.1371/journal.pone.0001667.t001  than that in a random network (Table 1). In fact, this observation is consistent with the previous result by Pastor-Satorras et al. [24], in which it was reported that ,L. in the random network and that in their model network were 8.0 and 6.8, respectively. (However, they mentioned that these two values are ''comparable''.) Therefore, the yeast PIN is an ''ultrasmall'' network, in which ,L. is lower than that in a random network. It is known that a scale-free random network is ultrasmall [39,40]. The yeast PIN can be randomized without changing the distribution of P(k) by using the random rewiring method [21]. In this method, two links in a network were chosen randomly, and these links were rewired by exchanging their connecting partners. After randomization of the yeast PIN, ,L. (4.49) is similar, but ,C. (0.010) becomes much lower than that in the yeast PIN. Therefore, the yeast PIN is far from a scale-free random network. Nevertheless, interestingly, the yeast PIN is an ultra-small network.
The difference in the value of ,L. between the yeast PIN and the NHD network may be explained in the following way. It was reported that the removal of hubs drastically increases the value of ,L. in the yeast PIN [41]. Therefore, the low ,L. in the yeast PIN might be due to the fact that the number of hubs in the yeast PIN is larger than that in the NHD network ( Figure 2A). (There are 17 nodes with k.50 in the yeast PIN, while the average number of nodes with k.50 among the NHD networks is 5.8.) In fact, if we eliminate all nodes with k.50 and all links connected to them from the yeast PIN and the NHD network, both ,L. and ,C. become similar between the two networks (,L. = 6.13 and 6.51, and ,C. = 0.063 and 0.060 for the yeast PIN and the NHD network, respectively). It therefore appears that the presence of a large number of hubs in the yeast PIN would be the reason for a very low ,L..
The above discussion would indicate that the NHD model is merely a rough approximation of the actual mechanism of the PIN evolution. However, we should note that although our new model contains only two free parameters, it could well capture various aspects of the structure of the yeast PIN. The availability of highquality interaction data from other species will thus help to clarify the architecture and evolution of PINs in greater detail.

Materials and Methods Data
Human-curated interaction data of the yeast PIN were downloaded from the MIPS (Munich Information Center for Protein Sequences) database (http://mips.gsf.de) (18 May 2006) [3]. The interaction data are separated into several components that are not connected to each other; we used the largest component containing 3,891 proteins and 7,270 non-redundant interactions. Among these proteins, 191 proteins are selfinteracting. The amino acid sequences of 6,736 yeast proteins were also obtained from the MIPS database. In order to estimate the number of interactions between homologous proteins in the yeast PIN, we identified homologous gene pairs. Self-against-self homologous searches were conducted for the 6,736 sequences by using the BLASTP program [42] with the cut-off E-value of 1e-5. We identified 6,544 homologous pairs (n H ) and 175 interactions between these pairs of proteins (n HI ) in the yeast PIN (see Table 1). The value of n HI /n H did not essentially change when a more stringent cut-off E-value was used (n HI /n H was 0.027 and 0.032 for the E-values of 1e-5 and 1e-10, respectively).

Simulation
In this study, we used the ''minimal genome'' containing 113 proteins as the initial network, because the first living organism is assumed to have had at least 113 proteins [43]. We generated a single component random network containing 113 nodes with ,k. = 3.74, which is the average degree of the yeast PIN. The evolutionary distance between two nodes present in the initial network was assumed to be infinity. We obtained very similar results when we started a simulation from the initial network containing only two nodes.
At each time step of simulation in the HD model, a new node is added to the network according to the following rules ( Figure 1A).
(1) A node is randomly selected (A) and is duplicated to generate a new node (A9), having the same interacting pattern as node A. (2) Each of the links to node A9 is removed with a probability a (completely asymmetric divergence [44]). (3) A link between node A and node A9 is created with a probability b. If node A9 does not have any links after these processes (all links to node A9 were removed and no links were created), node A9 is not added to the network. These processes were repeated until the number of nodes became 3,891, which is the number of nodes contained in the yeast PIN. In the NHD model, the probability that a new link is added between two duplicated nodes (A9 and A) is defined to be bn N (when bn N #1), where n N is the number of common neighbors shared by these two nodes ( Figure 1C). The probability is defined to be one when bn N .1. (However, there were no such cases in the simulations.) We performed simulations using various values of a and b. For a given a and b, we conducted simulations 100 times and computed the average of ,k. and the average of ,C. from the 100 networks. The values of a and b that could reproduce ,k. (3.74) and ,C. (0.066) in the yeast PIN were used (Table 1).
In the NHD+E model, the following process was added after the addition of new links at each step of the NHD model. A node in a network is randomly selected, and the selected node is eliminated from the network with a probability d together with all interactions connecting to the selected node. If the selected node is connected to one-degree nodes, all of these one-degree nodes are also removed. We changed the value d from 0.001 to 0.1 (see Table  S1). The values of a and b were determined in the same way as in the NHD model.