A New Method for the Discovery of Essential Proteins

Background Experimental methods for the identification of essential proteins are costly, time-consuming, and laborious, so it is challenging to determine protein essentiality through experiments alone. With the development of high-throughput technologies, a vast number of protein-protein interactions have become available, which enables the identification of essential proteins at the network level. Many computational methods for this task have been proposed based on the topological properties of protein-protein interaction (PPI) networks. However, the currently available PPI networks for each species are incomplete (false negatives) and very noisy (many false positives), and network topology-based centrality measures are often very sensitive to such noise. Therefore, exploring robust methods for identifying essential proteins would be of great value.
Method In this paper, a new essential protein discovery method, named CoEWC (Co-Expression Weighted by Clustering coefficient), is proposed. CoEWC is based on the integration of the topological properties of the PPI network and the co-expression of interacting proteins. The aim of CoEWC is to capture the common features of essential proteins in both date hubs and party hubs. The performance of CoEWC is validated on the PPI network of Saccharomyces cerevisiae. Experimental results show that CoEWC significantly outperforms the classical centrality measures, and that it also outperforms PeC, a recently proposed essential protein discovery method which outperforms 15 other centrality measures on the PPI network of Saccharomyces cerevisiae. In particular, when no more than 500 proteins are predicted, CoEWC achieves an improvement of more than 50% over degree centrality (DC), a comparatively good centrality measure for identifying protein essentiality.
Conclusions We demonstrate that a more robust essential protein discovery method can be developed by integrating the topological properties of the PPI network and the co-expression of interacting proteins. The proposed centrality measure, CoEWC, is effective for the discovery of essential proteins.


Introduction
Genome-wide gene deletion studies show that a small fraction of the genes in a genome are indispensable to the survival or reproduction of an organism [1,2]. These genes are referred to as essential genes, and essential proteins are the products of essential genes. The deletion of such essential proteins results in lethality or infertility. The identification of essential proteins is very important not only for understanding the minimal requirements for the survival of an organism, but also for finding human disease genes [3] and new drug targets. The genome-wide identification of essential genes is valuable for rational drug design [4]. Essential proteins in pathogenic organisms can be taken as potential targets for new antibiotics [5].
Several experimental methods for the discovery of essential proteins have been developed, such as single gene knockouts [6], RNA interference [7] and conditional knockouts [8]. However, these experimental methods are very time-consuming and laborious, and they often require large amounts of resources. essential proteins. Many centrality measures have been proposed to capture the correlation between network topological properties and protein essentiality. Centrality measures based on local network features include degree centrality (DC) [11], sum of edge clustering coefficient (SoECC) [18], local average connectivity (LAC) [19], and density of maximum neighborhood component (DMNC) [20]. Centrality measures based on global network characteristics include betweenness centrality (BC) [21] and closeness centrality (CC) [22]. Other previously proposed centrality measures include subgraph centrality [23], eigenvector centrality [24], information centrality [25], bottleneck [26,27], and a method that integrates network topology and gene expression data (PeC) [28]. Comparative studies on the two kinds of measures show that measures based on local features are more effective for identifying essential proteins [28,29].
Since the currently available PPI networks for each species are incomplete (false negatives) and very noisy (many false positives), especially those obtained by high-throughput technologies, the identification of essential proteins based on network topology is still very challenging. Most centrality measures are sensitive to such noise in the PPI network. In addition, it is well known that both false negatives and false positives in PPI networks are hard to remove. Therefore, robust centrality measures for the discovery of essential proteins would be of great value. Biological information has been integrated with network topology to improve the precision of essential protein discovery methods [28,30]. In [28], the authors proposed the PeC method by integrating the edge clustering coefficient and gene co-expression. In [30], essential proteins were explored based on the integration of network topological features and two types of GO annotations: cellular localization and biological process.
As reported in [13], essential proteins tend to form highly connected clusters rather than function independently. Some researchers have begun to pay attention to the relationship between protein essentiality and cluster properties [18,31]. According to [32], hubs in the yeast interactome network can be classified into date and party hubs on the basis of their partners' expression profiles. This distinction suggests a model of organized modularity for the yeast proteome. Modules are connected through the date hubs, which act as regulators, mediators or adaptors, while party hubs represent integral elements within the modules and tend to function at a lower level of the organization of the proteome. That is, party hubs are well co-clustered with their neighbors in the PPI network while date hubs are not. In addition, party hubs and date hubs have a similar probability of being essential [32]. Cluster-based centrality measures, such as the clustering coefficient and the sum of edge clustering coefficient, would therefore not be effective for identifying essential proteins among date hubs.
In view of these difficulties and advances, we propose a new centrality measure, named CoEWC, which integrates PPI data and gene expression data. CoEWC determines a protein's essentiality based on whether it has a high probability of being co-expressed with its neighbors and whether each of its neighbors takes part in densely connected clusters. Unlike SoECC and PeC, which both emphasize the co-clustering relationship between a protein and its neighbors, CoEWC pays more attention to the clustering property of the protein's neighbors than to that of the protein itself. As we know, a protein within a cluster tends to share similar biological functions with its neighbors, and proteins with similar functions tend to be co-expressed. Therefore, we think that the co-expression of a protein with its interacting neighbors in the PPI network can capture, to some extent, the co-clustering relationship between the protein and its neighbors. Moreover, CoEWC takes the clustering properties of a protein's neighbors into consideration. As a result, CoEWC is expected to identify essential proteins well among both date hubs and party hubs. The performance of CoEWC was tested on the well-studied species Saccharomyces cerevisiae. Compared with several previous centrality measures that have relatively high prediction precision, CoEWC achieves higher prediction precision for the identification of essential proteins. The experimental results demonstrate that centrality measures based on an appropriate integration of network topological properties and gene expression are more robust and effective for the discovery of essential proteins than those based only on network topological features, and that CoEWC is a good example of such integration.

Motivations
As reported in [32], hubs in the yeast interactome network can be classified into date and party hubs on the basis of their partners' expression profiles, and party hubs and date hubs have a similar probability of being essential. Therefore, it is reasonable to explore the co-expression between a protein and its interacting neighbors in the PPI network in order to identify the protein's essentiality.
Using the Pearson correlation coefficient (PCC) to capture co-expression, we found in the yeast interactome that some non-essential hubs are co-expressed with their neighbors with PCC values spanning a very wide range, from negative to positive. We take the protein YJR091C as an example to illustrate this phenomenon.
YJR091C is a non-essential hub protein in the yeast proteome. It has the maximal degree, 280, in the yeast PPI network. YJR091C ranks first according to DC and SoECC, mainly because of its large degree. Now consider its co-expression with its neighbors. Figure 1 shows the Pearson correlation coefficients of YJR091C with its 280 neighbors. The PCC values range from -0.846 to 0.802. The sum of the PCC values is about 3.37, and YJR091C ranks 451st according to the sum of PCC. This suggests that PCC is better suited than DC and SoECC to discriminating non-essential proteins such as YJR091C.
Another motivation for the proposed centrality measure, CoEWC, can be demonstrated with the toy network in figure 2. Since the edge clustering coefficient (ECC) measures whether two interacting nodes have a high probability of being co-clustered, according to the definition of SoECC [18,28], edges AC and BC carry more weight in determining node C's essentiality than edges CD1, CE1 and CF1. However, this goes against our intuition, which is that edges CD1, CE1 and CF1 should carry more weight in determining node C's essentiality. That is, on the basis of co-expression, it would be reasonable to take the clustering properties of a node's neighbors into consideration rather than the clustering property of the node itself.
By further observing the topological properties of date hubs and party hubs, we find that essential proteins in these two kinds of hubs have very different clustering properties themselves, but their neighbors tend to share a common feature, namely their clustering property. Moreover, it is encouraging that this clustering property can also discriminate non-essential hubs to some extent. Centrality measures based on this idea should be more effective at finding essential proteins among date hubs than those based on ECC. The clustering coefficient (CC) measures how well a node's neighbors are connected with each other, and thus it can be used to capture a node's clustering property. According to CC, edges CD1, CE1 and CF1 carry more weight in determining node C's essentiality than edges AC and BC in the toy network.

New centrality measure: CoEWC
In this paper, a new centrality measure, CoEWC, is proposed based on the integration of the PPI network and gene expression data. The basic ideas behind CoEWC are as follows: (1) A highly connected protein is more likely to be essential than a sparsely connected one; (2) Essential proteins tend to form densely connected clusters; (3) Essential proteins in the same cluster are more likely to be co-expressed; (4) Party hubs and date hubs have a similar probability of being essential, although they have very different clustering properties. In CoEWC, a protein's essentiality is determined by the number of the protein's neighbors, the probability that the protein is co-expressed with its neighbors, and its neighbors' clustering properties.
To describe the method simply and clearly, we give the following definitions and descriptions. The PPI network is represented by an undirected graph G(V, E), where a node v ∈ V represents a protein and an edge e(u,v) ∈ E denotes an interaction between two proteins u and v. Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. We only consider gene expression for proteins, although some functional RNAs from non-protein-coding genes may exist. For a protein u, its gene expression levels at s different time points are denoted as Ge(u) = {g(u,1), g(u,2), …, g(u,s)}. The probability that two proteins are co-expressed is evaluated based on the Pearson correlation coefficient (PCC). The clustering property of a protein is evaluated based on the clustering coefficient (CC).
Pearson correlation coefficient. The Pearson correlation coefficient (PCC) is a measure of the correlation between two variables, giving a value between -1 and +1 inclusive. It is widely used in the sciences as a measure of the strength of linear dependence between two variables. The PCC of a pair of genes (X and Y), which encode the corresponding paired proteins (u and v) interacting in the PPI network, is defined as:
$$PCC(X,Y) = \frac{1}{s-1} \sum_{i=1}^{s} \left( \frac{g(X,i) - \bar{g}(X)}{\sigma(X)} \right) \left( \frac{g(Y,i) - \bar{g}(Y)}{\sigma(Y)} \right) \qquad (1)$$
where s is the number of samples of the gene expression data; g(X,i) (or g(Y,i)) is the expression level of gene X (or Y) in sample i under a specific condition; $\bar{g}(X)$ (or $\bar{g}(Y)$) represents the mean expression level of gene X (or Y); and $\sigma(X)$ (or $\sigma(Y)$) represents the standard deviation of the expression level of gene X (or Y).
The Pearson correlation coefficient of a pair of proteins (u and v) is defined to be the PCC of their corresponding genes (X and Y), that is, PCC(u,v) = PCC(X,Y). If PCC(u,v) has a positive value, there is a positive linear correlation between u and v.
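As an illustration of the PCC definition above, the following is a minimal Python sketch. The function name `pcc` and the list-based profile format are assumptions made for this example, as is the choice to use the sample standard deviation (divisor s - 1) and to return 0 for a constant profile, which the text does not address.

```python
import math


def pcc(x, y):
    """Pearson correlation of two equal-length expression profiles.

    x and y are sequences of expression levels g(X,1..s) and g(Y,1..s).
    """
    s = len(x)
    if s != len(y) or s < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / s
    mean_y = sum(y) / s
    # Sample standard deviations (divisor s - 1); this is an assumption.
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (s - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (s - 1))
    if sd_x == 0 or sd_y == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 0.0
    # Average product of the standardized expression levels.
    total = sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
                for a, b in zip(x, y))
    return total / (s - 1)
```

With this normalization the result always lies between -1 and +1, so PCC(u,v) for an interacting pair can be obtained by passing the two proteins' profiles Ge(u) and Ge(v).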
As we know, co-clustered proteins tend to share similar functions, and proteins with similar functions tend to be co-expressed. That is, two proteins u and v with a larger value of PCC(u,v) are more likely to be in the same cluster and to function similarly.
Clustering coefficient. In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties [33,34]. The yeast PPI network is also a small-world network. Two versions of this measure exist: the global and the local. The global version was designed to give an overall indication of the clustering in the network, whereas the local version gives an indication of the embeddedness of single nodes. Here, we refer to the local clustering coefficient.
The local clustering coefficient of a node in a graph quantifies how close its neighbors are to being a clique (complete graph). Watts and Strogatz introduced the measure in 1998 to determine whether a graph is a small-world network [34]. The local clustering coefficient for a protein u in the PPI network can be defined as
$$CC(u) = \frac{2\,\left|\{ e(v,w) \in E : v, w \in N_u \}\right|}{k_u (k_u - 1)} \qquad (2)$$
where $N_u$ is the set of neighbors of protein u and $k_u$ denotes the number of immediately connected neighbors of u. CC(u) is a local variable which characterizes the clustering property of a protein u. A protein u with a larger value of CC(u) is expected to have more impact on its neighbors, as illustrated in the Motivations section.
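The local clustering coefficient can be sketched in a few lines of Python. The adjacency representation (a dictionary mapping each protein to the set of its neighbors) and the names `build_adjacency` and `clustering_coefficient` are assumptions for this example. Returning 0 for proteins with fewer than two neighbors is also an assumption, since the ratio is undefined in that case.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from edge pairs."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-interactions
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj, u):
    """Fraction of pairs of u's neighbors that are themselves connected."""
    neighbors = adj[u]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined case mapped to 0
    # Count edges among the neighbors of u, each unordered pair once.
    links = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: a triangle A-B-C with one pendant node D on A.
toy = build_adjacency([("A", "B"), ("A", "C"), ("B", "C"), ("A", "D")])
cc_a = clustering_coefficient(toy, "A")  # one link (B-C) among three neighbors
```

In the toy graph, node A has three neighbors with a single link among them, so its coefficient is 2·1/(3·2) = 1/3, while B and C each have a fully connected neighborhood.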
CoEWC method. It has been shown that there exist a number of protein complexes which play a key role in carrying out biological functions [35], and that essential proteins tend to form protein complexes [36]. In addition, essential proteins in the same cluster tend to be co-expressed. It therefore seems that centrality measures which explore the co-clustering and co-expression properties of a protein will work well for the task of identifying essential proteins, as PeC does. PeC indeed outperforms many previous centrality measures. However, as reported in [32], hubs can be divided into date hubs and party hubs, and these two kinds of hubs tend to be essential with similar probability. PeC mainly emphasizes the co-clustering and co-expression properties of a protein with its neighbors, so it would not be effective at identifying essential proteins among date hubs, which are not well co-clustered with their neighbors. According to the analysis of the toy network in the Motivations section, SoECC may capture the wrong features for date hubs, which makes it ineffective for identifying essential proteins among date hubs.
Although date hubs and party hubs have very different co-clustering properties, each of their neighbors may have a similar co-clustering property. For a party hub, each of its neighbors is generally also a member of the same densely connected module in which the hub is involved. So by exploring the clustering property of each of the hub's neighbors, we can capture the hub's high-degree property. Date hubs often mediate between different densely connected modules. Generally, each neighbor of a date hub is involved in a densely connected cluster, though the clusters in which its neighbors are involved are often different. By exploring each neighbor's own clustering property, we can also capture the high-degree property of a date hub, and filter out hubs whose neighbors are seldom connected with other proteins. Hubs with a large number of disconnected neighbors tend to be non-essential.
In order to capture the characteristics of essential proteins based on the above considerations, we propose a new centrality measure named CoEWC. We use PCC to capture the co-clustering and co-expression properties of a protein with its neighbors, and use the local clustering coefficient to capture the high connectivity of a protein as well as the clustering property of each of its neighbors.
For a protein u, CoEWC(u) is defined as the sum of the PCC between u and each of its neighbors, weighted by the corresponding neighbor's clustering coefficient. The definition is given in equation (3):
$$CoEWC(u) = \sum_{v \in N_u} PCC(u,v) \times CC(v) \qquad (3)$$
where $N_u$ denotes the set of all immediately connected neighbors of node u in the PPI network.
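To make the definition concrete, here is a minimal self-contained Python sketch that computes a CoEWC score for every protein as the sum, over its neighbors, of PCC times the neighbor's local clustering coefficient. The function names, the dictionary-based inputs, and the toy data are all hypothetical, and the sketch assumes that every protein in the network has an expression profile; how proteins without expression data are handled is not specified here.

```python
import math
from itertools import combinations


def pcc(x, y):
    """Pearson correlation of two equal-length profiles (0 if one is constant)."""
    s = len(x)
    mx, my = sum(x) / s, sum(y) / s
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (s - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (s - 1))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / ((s - 1) * sx * sy)


def local_cc(adj, u):
    """Local clustering coefficient of u (0 if u has fewer than two neighbors)."""
    k = len(adj[u])
    if k < 2:
        return 0.0
    links = sum(1 for v, w in combinations(adj[u], 2) if w in adj[v])
    return 2.0 * links / (k * (k - 1))


def coewc(adj, expr):
    """Score each protein by the CC-weighted sum of PCC with its neighbors."""
    cc = {u: local_cc(adj, u) for u in adj}  # computed once per protein
    return {u: sum(pcc(expr[u], expr[v]) * cc[v] for v in adj[u]) for u in adj}


# Hypothetical toy data: triangle A-B-C plus a pendant node D attached to C.
adj = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
expr = {"A": [1, 2, 3], "B": [2, 4, 6], "C": [3, 2, 1], "D": [1, 2, 3]}
scores = coewc(adj, expr)
```

In this toy example, A is perfectly co-expressed with B (whose clustering coefficient is 1) and anti-correlated with C (clustering coefficient 1/3), giving A a score of 1 - 1/3 = 2/3. Node C, which is anti-correlated with both of its well-clustered neighbors, receives a negative score.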
From the above analysis and the definition of CoEWC, CoEWC can identify essential proteins from both party hubs and date hubs, and can discriminate those non-essential hubs whose neighbors are mainly disconnected single proteins.

Test data
To evaluate the performance of the proposed new centrality measure, CoEWC, the PPI network and gene expression data of Saccharomyces cerevisiae were used, as this species has been well characterized by knockout experiments and is widely used in the evaluation of methods for essential protein discovery. The test data used in this paper come from [28]. We describe them briefly as follows.
The PPI data of Saccharomyces cerevisiae was downloaded from DIP database [37].

Comparison of CoEWC with other centrality measures
In order to validate the performance of the proposed new centrality measure, CoEWC, we carry out a comparison between it and several state-of-the-art centrality measures: Degree Centrality (DC) [11], Sum of Edge Clustering Coefficient (SoECC) [18], PeC [28] and Clustering Coefficient (CC) [34].
The reasons for choosing these four centrality measures for comparison are as follows. DC has been shown to be a good indicator of protein essentiality by many researchers [11,28], and by comparing with it, we want to show the ability of CoEWC to identify essential proteins among hub proteins. SoECC is a better . Therefore, we only compare CoEWC with PeC, and do not compare it with the many mainstream centrality measures that are outperformed by PeC in the yeast PPI network for identifying essential proteins. PeC aims to capture the co-clustering property of a protein with its neighbors from both a topological view and a biological view. In contrast, CoEWC aims to capture the properties of both date hubs and party hubs, although the two kinds of hubs have very different clustering properties. CC is used to show how much improvement can be obtained by properly integrating it with gene expression data, as CoEWC does. Figure 3 gives the comparison of the number of essential proteins detected by CoEWC and the four other previously proposed centrality measures. Proteins are ranked according to their values calculated by each centrality measure. For each centrality measure, a certain number of top-ranked proteins are selected as candidates for essential proteins, out of which the number of true essential proteins is determined.
From figure 3 we can see that CoEWC significantly outperforms the centrality measures based only on network topological features (DC, CC and SoECC) for predicting essential proteins from the yeast PPI network. CoEWC also outperforms PeC, predicting more essential proteins than PeC does. In particular, CoEWC obtains an improvement of more than 50% over DC and CC when predicting 500 proteins, and an improvement of about 20% over SoECC. The improvement of CoEWC over PeC is more than 4% when predicting 600 proteins.
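The ranking-based evaluation described above can be sketched as follows. This is a minimal illustration, not the evaluation code used in the paper: the function name `essential_hits`, the input format, and the tie-breaking rule (ties broken by protein name) are assumptions.

```python
def essential_hits(scores, essential, cutoffs):
    """Count true essential proteins among the top-k ranked candidates.

    scores:    {protein: centrality value}
    essential: set of proteins known to be essential
    cutoffs:   iterable of k values (numbers of top-ranked proteins)
    """
    # Rank by descending score; ties are broken by name (an assumption).
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return {k: sum(1 for p in ranked[:k] if p in essential) for k in cutoffs}


# Hypothetical scores and essentiality labels for four proteins.
example_scores = {"P1": 0.9, "P2": 0.5, "P3": 0.7, "P4": 0.1}
example_essential = {"P1", "P4"}
hits = essential_hits(example_scores, example_essential, [1, 2, 4])
```

Applying the same function to the scores produced by each centrality measure, with the same set of cutoffs, gives directly comparable counts.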

Validated by jackknife methodology
We now use the jackknife methodology [39] to compare the proposed centrality measure CoEWC with the four other previously proposed centrality measures (DC, CC, SoECC and PeC). The comparison results are shown in figure 4. In figure 4, proteins are ordered from the highest value to the lowest value for each centrality measure and the cumulative counts of essential proteins are plotted. The areas under the curve (AUC) for CoEWC and the other centrality measures are compared. In addition, ten random assortments are also plotted for comparison.
As shown in figure 4, the sorted curve of CoEWC is clearly much better than those of three centrality measures: DC, CC and SoECC. For the top 180 ranked proteins, CoEWC ties with PeC. Beyond that point, the sorted curve of CoEWC becomes increasingly better than that of PeC as the number of top-ranked proteins increases. The results of all five centrality measures are better than those of the randomized sortings. In figure 4, the AUC of CoEWC is 4.3174 × 10^5, and the AUC of PeC is 4.0450 × 10^5. This indicates that CoEWC is more effective than PeC for the task of identifying essential proteins. It also supports our idea that capturing the properties of both date hubs and party hubs, by using the co-expression of a protein with its neighbors weighted by each neighbor's clustering coefficient, is better than capturing only the co-clustering property of a protein with its neighbors.
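A jackknife-style curve of this kind can be sketched as the running count of essential proteins along the ranked list. The names below are hypothetical, and the area is computed here as the plain sum of the cumulative counts over all ranks. That summation rule is an assumption, since the exact way the reported AUC values were computed is not stated.

```python
from itertools import accumulate


def jackknife_curve(scores, essential):
    """Cumulative count of essential proteins down the ranked list."""
    # Highest centrality value first; ties broken by name (an assumption).
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return list(accumulate(1 if p in essential else 0 for p in ranked))


def area_under_curve(curve):
    """Area under the cumulative curve using unit-width rectangles (assumed)."""
    return sum(curve)


# Hypothetical scores and essentiality labels for four proteins.
demo_scores = {"P1": 0.9, "P2": 0.5, "P3": 0.7, "P4": 0.1}
demo_essential = {"P1", "P4"}
curve = jackknife_curve(demo_scores, demo_essential)
area = area_under_curve(curve)
```

A measure that places essential proteins earlier in the ranking produces a curve that rises sooner and therefore encloses a larger area. A random baseline can be obtained by shuffling the protein order before accumulating.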

Analysis of the differences between CoEWC and the compared centrality measures
To further analyze why and how CoEWC performs well on the identification of essential proteins, we study the relationship and differences between it and the other compared centrality measures. From table 1, we can see that the proteins identified in common by CoEWC and DC or CC amount to no more than 30%, that those predicted in common by CoEWC and SoECC amount to less than 40%, and that those predicted in common by CoEWC and PeC amount to less than 80%. The small overlap between the proteins predicted by CoEWC and those predicted by DC or CC shows that CoEWC is a distinctive centrality measure which differs considerably from the classical centrality measures. In addition, we investigated the non-essential proteins predicted by the other centrality measures, and found that about 50% of the non-essential proteins predicted by the three network topology-based centrality measures (DC, CC and SoECC) have very low values of CoEWC (less than 0.128), as do 21.1% of the non-essential proteins predicted by PeC.
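The overlap between the top-ranked proteins of two centrality measures can be computed with a few set operations. The sketch below is illustrative only: the function names and the tie-breaking rule are assumptions, and the two score dictionaries are hypothetical.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring proteins."""
    # Ties are broken by protein name (an assumption).
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return set(ranked[:k])


def overlap_fraction(scores_a, scores_b, k):
    """Fraction of the top-k proteins shared by two centrality measures."""
    return len(top_k(scores_a, k) & top_k(scores_b, k)) / k


# Hypothetical scores from two measures over the same four proteins.
measure_a = {"P1": 4, "P2": 3, "P3": 2, "P4": 1}
measure_b = {"P1": 1, "P2": 4, "P3": 3, "P4": 2}
shared = overlap_fraction(measure_a, measure_b, 2)
```

The set difference `top_k(scores_a, k) - top_k(scores_b, k)` gives the proteins identified by one measure but not the other, which can then be checked against the known essential proteins.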
Secondly, we evaluate the proteins identified differently by CoEWC and by the other centrality measures. Figure 5 gives the number of essential proteins among the proteins identified differently by CoEWC and by DC, CC, SoECC and PeC. As shown in figure 5, among the differently identified proteins, the percentage of essential proteins identified by CoEWC is consistently higher than that identified by each of the other centrality measures. Take CC as an example, which has the largest number of proteins differing from CoEWC. Out of all the top 200 proteins, 181 proteins are differently identified by CC and   Take YOL142W as an example. YOL142W is an essential protein whose degree is only 6. The interactions between YOL142W and its neighbors are shown in figure 6. To further study the characteristics of YOL142W and its neighbors, we show the following information about its neighbors in table 3: PCC value, CC value, and essentiality. From table 3, we can see that 5 of its 6 neighbors are also essential proteins, and that YOL142W is well co-expressed with these 5 essential neighbors. All the CC values of its neighbors are significantly larger than the average CC value of the whole PPI network, which is 0.097. Table 3 also suggests that co-clustered essential proteins tend to be co-expressed and that CoEWC can capture this property well.
Take another non-essential protein, YLR295C, as an example. YLR295C has 125 neighbors, out of which only 24 are essential. YLR295C gets its rank of 16, 2388, and 8 according to DC, CC, and SoECC, respectively. According to the definition of DC, CC and SoECC and the corresponding ranks of YLR295C according to these three centrality measures, we can conclude that YLR295C is a hub protein and is well co-clustered with some of its neighbors, and that there are very few connections between its neighbors (its CC value is only 0.0017). It is obvious that YLR295C cannot be discriminated by DC and SoECC.
In addition, in order to further compare CoEWC with PeC, we also compute the sum of PCC (SoPCC) between a protein and all its neighbors in PPI network, and rank all proteins according to SoPCC. YLR295C gets its rank of 121 according to SoPCC and gets the rank of 123 according to PeC. Figure 7 gives the properties of YLR295C and its 125 neighbors captured by PCC, CC and ECC. In figure 7, the neighbor proteins with first 24 neighbor IDs are essential, and the other proteins are nonessential.
From the distributions of PCC, CC and ECC in figure 7, we can see that about a quarter of its neighbors are well co-clustered with YLR295C (with ECC values equal to 1), but only very few neighbors have large CC values. Almost all ECC values of the neighbors are larger than zero, but about half of the neighbors' CC values are zero. Among YLR295C's interacting proteins, essential proteins tend to have non-zero CC values, which accords with the assumption that essential proteins tend to be co-clustered with some of their neighbors. According to the definition of SoECC, which is the sum of ECC, YLR295C's SoECC value is large due to its high degree. According to the definition of PeC, which is the sum of the product of ECC and PCC, the PeC value of YLR295C is considerably smaller than its SoECC value due to the negative values of PCC. Moreover, according to the definition of CoEWC, which is the sum of the product of PCC and CC, the CoEWC value of YLR295C is smaller than its SoECC and PeC values, due to both negative PCC values and smaller CC values. From figure 7, we can further understand why CoEWC can discriminate YLR295C as non-essential while DC, SoECC and PeC cannot. Table 4 shows a list of non-essential proteins which have a high degree but a low value of CoEWC. In order to compare with the other centrality measures, we also give their values of DC, CC, SoECC and PeC. Since the values predicted by different centrality measures are not directly comparable, we take as reference values the values of the 1110th protein after sorting in descending order according to each centrality measure. The reference values for DC, CoEWC, CC, SoECC and PeC are 12, 0.127, 0.1428, 3.057 and 0.55, respectively. From table 4, we can see that these non-essential proteins cannot be discriminated by DC and SoECC.
By considering the co-expression properties between a protein and its interacting proteins, both CoEWC and PeC have an improved discrimination ability over these non-essential proteins. As shown in table 4, all these non-essential proteins with

Conclusions
With the large amount of PPI data available for some species, the discovery of essential proteins at the network level is becoming a hot topic. Many network topology-based centrality measures for the discovery of essential proteins have been proposed. However, the currently available PPI networks for each species are incomplete (false negatives) and very noisy (many false positives). At the same time, most network topology-based methods depend on the reliability of the available protein-protein interactions and are therefore very sensitive to the quality of the network. Moreover, essential proteins may have distinct clustering properties (as with date hubs and party hubs), while essential and non-essential proteins often share common features, such as the high degree of hub proteins. It is very challenging to capture the truly distinctive features of essential proteins in order to distinguish them from non-essential proteins.
To tackle the above difficulties, we propose a new centrality measure, named CoEWC, based on the integration of PPI data and gene expression data. CoEWC aims to capture the common features of essential proteins in both date hubs and party hubs by integrating PCC with CC. CoEWC is applied to the PPI network of Saccharomyces cerevisiae. The experimental results show that CoEWC significantly outperforms the network topology-based centrality measures DC, CC and SoECC, and that CoEWC also outperforms PeC, a recently proposed centrality measure which is also based on the integration of PPI data and gene expression data.
Although CoEWC performs well on the discovery of essential proteins, there is still room to improve the prediction precision. First, the integration of PCC and CC in this paper is very simple, and there may exist a more complex relationship between PCC and CC. Second, there may exist better methods for capturing the properties that distinguish essential proteins from non-essential proteins. Finally, besides gene expression data, other protein-related data, such as biological process, domain information, and localization, should also be valuable for the task of identifying essential proteins.