The authors have declared that no competing interests exist.
Participated in revising the draft: JX. Conceived and designed the experiments: XZ WXX. Performed the experiments: XZ WXX. Analyzed the data: XZ WXX. Contributed reagents/materials/analysis tools: XZ WXX. Wrote the paper: XZ.
Experimental methods for identifying essential proteins are costly, time-consuming, and laborious, so it is difficult to establish protein essentiality through experiments alone. With the development of high-throughput technologies, a vast number of protein-protein interactions have become available, which enables the identification of essential proteins at the network level. Many computational methods for this task have been proposed based on the topological properties of protein-protein interaction (PPI) networks. However, the currently available PPI networks for each species are incomplete (false negatives) and very noisy (many false positives), and network topology-based centrality measures are often very sensitive to such noise. Therefore, exploring robust methods for identifying essential proteins would be of great value.
In this paper, a new essential protein discovery method, named CoEWC (Co-Expression Weighted by Clustering coefficient), has been proposed. CoEWC is based on the integration of the topological properties of PPI network and the co-expression of interacting proteins. The aim of CoEWC is to capture the common features of essential proteins in both date hubs and party hubs. The performance of CoEWC is validated based on the PPI network of
We demonstrate that a more robust essential protein discovery method can be developed by integrating the topological properties of the PPI network with the co-expression of interacting proteins. The proposed centrality measure, CoEWC, is effective for the discovery of essential proteins.
Genome-wide gene deletion studies show that a small fraction of genes in a genome are indispensable to the survival or reproduction of an organism
Several experimental methods for the discovery of essential proteins have been conducted, such as single gene knockouts
With advances in high-throughput experimental technologies, such as Y2H and mass spectrometry, large amounts of protein-protein interaction (PPI) data have been produced, which makes it possible to study proteins at the network level. To overcome experimental constraints, researchers have recently paid more attention to computational methods based on network topological characteristics. The correlations between network topological features and protein essentiality have been explored by many researchers. It has been observed in several species, such as Saccharomyces cerevisiae, Caenorhabditis elegans and Drosophila melanogaster
Computational methods could be seen as useful preprocessing techniques which could help experimental methods to quickly find essential proteins. Many centrality measures have been proposed to capture the correlation between network topological properties and protein essentiality. Local network features based centrality measures include degree centrality (DC)
Since the currently available PPI networks for each species are incomplete (false negatives) and very noisy (many false positives), especially those obtained by high-throughput technologies, the identification of essential proteins based on network topology is still very challenging. Most centrality measures are sensitive to such noise in PPI networks. In addition, it is well known that both false negatives and false positives in PPI networks are hard to remove. Therefore, robust centrality measures for the discovery of essential proteins would be of great value. Biological information has been integrated with network topology to improve the precision of essential protein discovery methods
As reported in
In view of these difficulties and advances, we propose a new centrality measure, named CoEWC, that integrates PPI data and gene expression data. CoEWC determines a protein's essentiality based on whether it has a high probability of being co-expressed with its neighbors and whether each of its neighbors takes part in densely connected clusters. Unlike SoECC and PeC, which both emphasize the co-clustering relationship between a protein and its neighbors, CoEWC pays more attention to the clustering property of the protein's neighbors than to that of the protein itself. Proteins within a cluster tend to share similar biological functions with their neighbors, and proteins with similar functions tend to be co-expressed. Therefore, we expect that the co-expression of a protein with its interacting neighbors in the PPI network can capture, to some extent, the co-clustering relationship between the protein and its neighbors. Moreover, CoEWC takes the clustering properties of a protein's neighbors into consideration. As a result, CoEWC is expected to identify essential proteins among both date hubs and party hubs. The performance of CoEWC was tested on the well studied species of
As reported in
When the Pearson correlation coefficient (PCC) is used to capture co-expression, we find in the yeast interactome that some non-essential hubs are co-expressed with their neighbors with PCC values spanning a very wide range, from negative to positive. We take the protein YJR091C as an example to illustrate this phenomenon.
YJR091C is a non-essential hub protein in the yeast proteome. It has the maximal degree, 280, in the yeast PPI network. YJR091C ranks first according to DC and SoECC, mainly due to its large degree. Now let us examine its co-expression with its neighbors.
Another motivation of the proposed centrality measure, CoEWC, can be demonstrated from the toy network in
A, A1, B, B1 and C are nodes of the toy network. D, E, and F are three complete sub-networks with size 20, 30, and 40. Node C connects with one node of D, E, and F respectively, say D1, E1 and F1.
Further observation of the topological properties of date hubs and party hubs shows that essential proteins in these two kinds of hubs have very different clustering properties themselves, but their neighbors tend to share a common feature, namely their clustering property. Encouragingly, this clustering property can also discriminate non-essential hubs to some extent. Centrality measures based on this idea should be more effective at finding essential proteins among date hubs than those based on ECC. The clustering coefficient (CC) measures how well a node's neighbors are connected with each other, so it can be used to capture a node's clustering property. According to CC, edges CD1, CE1 and CF1 carry more weight in determining node C's essentiality than edges AC and BC in the toy network.
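The weighting argument for the toy network can be checked with a minimal sketch. The cliques D, E and F, node C's links to D1, E1 and F1, and the edges AC and BC come from the text. The edges A-A1 and B-B1 are an assumption made here, because the figure is not available, and the function names are illustrative only.

```python
from itertools import combinations

def local_cc(adj, node):
    """Local clustering coefficient: fraction of neighbor pairs that are linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def build_toy_network():
    adj = {}

    def add_edge(u, v):
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Complete sub-networks D, E and F with 20, 30 and 40 nodes (from the text).
    for name, size in (("D", 20), ("E", 30), ("F", 40)):
        members = [f"{name}{i}" for i in range(1, size + 1)]
        for u, v in combinations(members, 2):
            add_edge(u, v)
    # Node C connects with one node of each complete sub-network (from the text).
    for v in ("D1", "E1", "F1"):
        add_edge("C", v)
    # Edges AC and BC are named in the text; A-A1 and B-B1 are ASSUMED here.
    for u, v in (("A", "C"), ("B", "C"), ("A", "A1"), ("B", "B1")):
        add_edge(u, v)
    return adj

toy = build_toy_network()
# Clustering coefficient of each neighbor of C, i.e. the weight its edge to C would carry.
cc_of_c_neighbors = {v: local_cc(toy, v) for v in sorted(toy["C"])}
for name, value in cc_of_c_neighbors.items():
    print(name, round(value, 4))
```

Under these assumptions D1, E1 and F1 have clustering coefficients of 0.9 or more, while A and B have 0, which matches the statement that edges CD1, CE1 and CF1 outweigh AC and BC.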
In this paper, a new centrality measure, CoEWC, is proposed based on the integration of the PPI network and gene expression data. The basic ideas behind CoEWC are as follows: (1) a highly connected protein is more likely to be essential than a sparsely connected one; (2) essential proteins tend to form densely connected clusters; (3) essential proteins in the same cluster are more likely to be co-expressed; (4) party hubs and date hubs have a similar probability of being essential, although they have very different clustering properties. In CoEWC, a protein's essentiality is determined by the number of its neighbors, the probability that it is co-expressed with its neighbors, and the clustering properties of its neighbors.
To describe the method simply and clearly, we give the following definitions and descriptions. The PPI network is represented by an undirected graph
Pearson correlation coefficient (PCC) is a measure of the correlation between two variables, giving a value between +1 and -1 inclusive. It is widely used in the sciences as a measure of the strength of linear dependence between two variables. The PCC of a pair of genes (
Where
The pearson correlation coefficient of a pair of proteins (
As we know, co-clustered proteins tend to share some similar functions and proteins with similar functions tend to be co-expressed. That is, two proteins
In graph theory, a clustering coefficient is a measure of the degree to which nodes in a graph tend to cluster together. Evidence suggests that in most real-world networks, and in particular social networks, nodes tend to create tightly knit groups characterized by a relatively high density of ties
The local clustering coefficient of a node in a graph quantifies how close its neighbors are to being a clique (complete graph). Watts and Strogatz introduced the measure in 1998 to determine whether a graph is a small world network
Where
It has been proved that there exist a number of protein complexes which play a key role in carrying out biological functionality
Although date hubs and party hubs have very different co-clustering properties, their neighbors may have similar co-clustering properties. For a party hub, each neighbor is generally also a member of the same densely connected module in which the hub is involved. So by exploring the clustering property of each of the hub's neighbors, we can capture the hub's high-degree property. Date hubs often mediate between different densely connected modules. Generally, each neighbor of a date hub is involved in a densely connected cluster, though these clusters often differ from neighbor to neighbor. By exploring each neighbor's own clustering property, we can also capture the high-degree property of a date hub, and filter out hubs whose neighbors are seldom connected with other proteins. Hubs with a large number of disconnected neighbors tend to be non-essential.
To capture the characteristics of essential proteins based on the above standpoints, we propose a new centrality measure named CoEWC. We use PCC to capture the co-clustering and co-expression properties of a protein with its neighbors, and use the local clustering coefficient to capture the high connectivity of a protein as well as the clustering property of each of its neighbors.
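As an illustration of the co-expression ingredient, the following minimal sketch computes the PCC of two expression profiles using the standard sample formula. The function name `pcc`, the example profiles, and the choice to return 0.0 for a constant profile are assumptions made for this sketch, not details taken from the paper.

```python
import math

def pcc(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # ASSUMPTION: a constant profile is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

# Hypothetical expression profiles over four time points.
gene_u = [1.0, 2.0, 3.0, 4.0]
gene_v = [2.0, 4.0, 6.0, 8.0]   # rises together with gene_u
gene_w = [8.0, 6.0, 4.0, 2.0]   # falls while gene_u rises

pcc_uv = pcc(gene_u, gene_v)
pcc_uw = pcc(gene_u, gene_w)
print(pcc_uv, pcc_uw)
```

The two hypothetical pairs sit at the extremes of the range: the first is perfectly co-expressed and the second perfectly anti-correlated.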
For a protein
Where
From the above analysis and the definition of CoEWC, CoEWC can identify essential proteins from both party hubs and date hubs, and can discriminate those non-essential hubs whose neighbors are mainly disconnected single proteins.
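The filtering behaviour described above can be illustrated with a minimal sketch. It assumes that the score of a protein is the sum, over its neighbors, of the PCC with the neighbor multiplied by the neighbor's local clustering coefficient. This form is an assumption here, although it is consistent with the values tabulated for YOL142W later in the paper. The graph, the protein names and the uniform PCC of 0.8 are hypothetical.

```python
from itertools import combinations

def local_cc(adj, node):
    """Local clustering coefficient: fraction of neighbor pairs that are linked."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def coewc(adj, pcc, node):
    """ASSUMED form: sum over neighbors v of PCC(node, v) * CC(v)."""
    # A missing PCC value is treated as 0.0 (an assumption of this sketch).
    return sum(pcc.get(frozenset((node, v)), 0.0) * local_cc(adj, v)
               for v in adj[node])

# Hypothetical network: H1 bridges three separate triangles (date-hub-like),
# while H2 is attached to three proteins that have no other partners.
edges = [
    ("H1", "a"), ("H1", "b"), ("H1", "c"),
    ("a", "a1"), ("a", "a2"), ("a1", "a2"),
    ("b", "b1"), ("b", "b2"), ("b1", "b2"),
    ("c", "c1"), ("c", "c2"), ("c1", "c2"),
    ("H2", "x"), ("H2", "y"), ("H2", "z"),
]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Hypothetical co-expression: every interacting pair has PCC 0.8.
pcc = {frozenset(e): 0.8 for e in edges}

score_h1 = coewc(adj, pcc, "H1")
score_h2 = coewc(adj, pcc, "H2")
print(round(score_h1, 4), round(score_h2, 4))
```

Both hubs have degree 3 and identical co-expression, so degree alone cannot separate them. Under the assumed formula, the hub whose neighbors are disconnected single proteins scores zero.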
To evaluate the performance of the proposed new centrality measure, CoEWC, the PPI network and gene expression data of
The PPI data of
Essential proteins of
The gene expression data of
In order to validate the performance of the proposed new centrality measure, CoEWC, we carry out a comparison between it and several state-of-the-art centrality measures: Degree Centrality (DC)
The reasons that we choose these four centrality measures to compare are as follows. DC has been proved to be a good indicator for protein essentiality by many researchers
From
Now we use jackknife methodology
As shown in
To further analyze why and how CoEWC performs well on the identification of essential proteins, we study the relationship and difference between it and other compared centrality measures (DC, CC, SoECC, and PeC) by predicting a small fraction of proteins. The top 200 proteins are selected for each centrality measure.
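Selecting the top-ranked proteins and counting the known essential ones among them can be sketched as follows. The cumulative count is one common way to draw a jackknife-style curve; whether it matches the authors' exact procedure is an assumption. The protein names, scores and tie-breaking rule are hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring proteins, best first."""
    # Ties are broken alphabetically for determinism (an assumption of this sketch).
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [protein for protein, _ in ranked[:k]]

def cumulative_essential(ranked, essential):
    """For each prefix of the ranking, count how many proteins are essential."""
    counts = []
    hits = 0
    for protein in ranked:
        if protein in essential:
            hits += 1
        counts.append(hits)
    return counts

# Hypothetical centrality scores and known essential proteins.
scores = {"P1": 1.7, "P2": 0.2, "P3": 1.5, "P4": -0.3, "P5": 0.9}
essential = {"P1", "P5"}

ranked = top_k(scores, 3)
curve = cumulative_essential(ranked, essential)
print(ranked, curve)
```

In the paper's setting `k` would be 200 and `scores` would hold one value per protein for each centrality measure under comparison.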
First, we compare CoEWC with the other four centrality measures (DC, CC, SoECC and PeC) by investigating how many proteins are predicted both by CoEWC and by any one of the other four centrality measures. The number of overlaps between CoEWC and one of the other centrality measures is shown in
Centrality measures ( |
| |
| |
Non-essential proteins in { |
Non-essential proteins in { |
Percentage of non-essential proteins in { |
Degree Centrality (DC) | 60 | 140 | 95 | 33 | 49.5% |
Clustering Coefficient (CC) | 19 | 181 | 123 | 50 | 69.9% |
Sum of ECC (SoECC) | 78 | 122 | 70 | 27 | 52.9% |
PeC | 155 | 45 | 19 | 12 | 21.1% |
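The kind of comparison summarized in this table can be sketched with set operations on two top-ranked lists. The function name, the dictionary keys and the small example lists are hypothetical. The last lines only check one arithmetic property of the table: its first two numeric columns sum to 200 in every row, which is consistent with an overlap and a difference of two top-200 lists.

```python
def compare_top_lists(coewc_top, other_top, essential):
    """Overlap and differences between two top-ranked protein lists."""
    a, b = set(coewc_top), set(other_top)
    only_other = b - a   # predicted by the other measure only
    only_coewc = a - b   # predicted by CoEWC only
    return {
        "overlap": len(a & b),
        "only_other": len(only_other),
        "only_coewc": len(only_coewc),
        "nonessential_only_other": len(only_other - essential),
        "nonessential_only_coewc": len(only_coewc - essential),
    }

# Hypothetical top-4 lists and essential set.
summary = compare_top_lists(
    coewc_top=["P1", "P2", "P3", "P4"],
    other_top=["P3", "P4", "P5", "P6"],
    essential={"P1", "P2", "P3", "P5"},
)
print(summary)

# First two numeric columns of each table row (values copied from the table).
table_counts = {"DC": (60, 140), "CC": (19, 181), "SoECC": (78, 122), "PeC": (155, 45)}
row_totals = {name: shared + different for name, (shared, different) in table_counts.items()}
print(row_totals)
```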
From
Secondly, we evaluate the different proteins identified by CoEWC and those by other centrality measures.
There are 26 proteins which are predicted by CoEWC but not included in any of the top 200 proteins of the other four centrality measures. These 26 proteins are shown in
Rank | Protein Name | Degree | CoEWC | Essentiality |
104 | YDR365C | 23 | 1.698583 | essential |
124 | YDL232W | 18 | 1.503577 | essential |
127 | YJL033W | 19 | 1.49437 | essential |
128 | YGL099W | 14 | 1.492625 | essential |
130 | YBR234C | 23 | 1.473234 | essential |
131 | YIL075C | 32 | 1.472131 | essential |
139 | YLR200W | 10 | 1.435959 | non-essential |
145 | YDL087C | 23 | 1.401966 | essential |
147 | YKL095W | 38 | 1.390057 | essential |
151 | YHR081W | 5 | 1.358778 | non-essential |
152 | YPR088C | 14 | 1.351337 | essential |
154 | YOL094C | 37 | 1.327599 | essential |
156 | YHL030W | 21 | 1.310346 | non-essential |
158 | YOR259C | 28 | 1.282556 | essential |
161 | YBL041W | 9 | 1.275247 | essential |
163 | YNL182C | 21 | 1.268987 | essential |
170 | YMR314W | 16 | 1.246819 | essential |
178 | YBR126C | 29 | 1.219773 | non-essential |
179 | YOL142W | 6 | 1.219508 | essential |
181 | YBL023C | 14 | 1.211368 | essential |
187 | YNL290W | 34 | 1.176549 | essential |
190 | YFL008W | 24 | 1.156859 | essential |
191 | YPL012W | 20 | 1.153688 | essential |
193 | YER025W | 26 | 1.140787 | essential |
194 | YOR210W | 16 | 1.138906 | essential |
199 | YKL068W | 35 | 1.119936 | non-essential |
Take YOL142W as an example. YOL142W is an essential protein whose degree is only 6. The interactions between YOL142W and its neighbors are shown in
Proteins | PCC | CC | Essentiality |
YDR280W | 0.8046 | 0.3399 | essential |
YER025W | 0.639 | 0.1354 | essential |
YNL265C | −0.357 | 0.1648 | non-essential |
YGR195W | 0.771 | 0.4083 | essential |
YGR095C | 0.7414 | 0.3897 | essential |
YOL021C | 0.8391 | 0.375 | essential |
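The tabulated values allow a quick arithmetic check. The sketch below assumes that the CoEWC score of YOL142W is the sum over its six neighbors of PCC multiplied by the neighbor's CC. The formula is an assumption here, while the numbers are copied from the table above and from the CoEWC column of the earlier table of 26 proteins.

```python
# (PCC with YOL142W, CC of the neighbor), copied from the table above.
neighbors = {
    "YDR280W": (0.8046, 0.3399),
    "YER025W": (0.639, 0.1354),
    "YNL265C": (-0.357, 0.1648),
    "YGR195W": (0.771, 0.4083),
    "YGR095C": (0.7414, 0.3897),
    "YOL021C": (0.8391, 0.375),
}

# ASSUMED form of the score: sum over neighbors of PCC * CC.
score = sum(pcc_value * cc_value for pcc_value, cc_value in neighbors.values())

reported = 1.219508  # CoEWC value listed for YOL142W
print(round(score, 4), reported)
```

The sum comes to about 1.2196, which agrees with the reported 1.219508 up to the rounding of the tabulated PCC and CC values. The only negative contribution comes from the non-essential neighbor YNL265C.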
Take a non-essential protein, YLR295C, as another example. YLR295C has 125 neighbors, of which only 24 are essential. YLR295C ranks 16th, 2388th, and 8th according to DC, CC, and SoECC, respectively. From the definitions of DC, CC and SoECC and these ranks, we can conclude that YLR295C is a hub protein that is well co-clustered with some of its neighbors, and that there are very few connections among its neighbors (its CC value is only 0.0017). Clearly, DC and SoECC cannot discriminate YLR295C as non-essential.
In addition, to further compare CoEWC with PeC, we also compute the sum of PCC (SoPCC) between a protein and all its neighbors in the PPI network, and rank all proteins according to SoPCC. YLR295C ranks 121st according to SoPCC and 123rd according to PeC.
From the distributions of PCC, CC and ECC in
Protein Name | Essentiality | DC | CoEWC | CC | SoECC | PeC |
YCL018W | non-essential | 156 | 0.0723 | 0.0244 | 21.2481 | 0.1766 |
YBR127C | non-essential | 113 | 0.0502 | 0.0207 | 11.7541 | −0.2884 |
YMR106C | non-essential | 110 | −0.2571 | 0.0283 | 18.9553 | −0.543 |
YLR288C | non-essential | 99 | −0.1364 | 0.0033 | 31.4151 | 3.3742 |
YLR191W | non-essential | 97 | −1.5164 | 0.0215 | 24.9253 | −1.6514 |
YLR447C | non-essential | 95 | −0.0429 | 0.0035 | 26.2248 | 2.4992 |
YOL055C | non-essential | 93 | 0.0942 | 0.0140 | 8.1786 | −0.3599 |
YHR135C | non-essential | 84 | −0.2160 | 0.0203 | 13.849 | −0.3505 |
YGR040W | non-essential | 79 | 0.1058 | 0.0207 | 8.6171 | 0.3374 |
YLR453C | non-essential | 78 | 0.0411 | 0.003 | 23.4248 | 0.3608 |
YER118C | non-essential | 72 | −0.2675 | 0.0274 | 14.6054 | −0.2627 |
YDL059C | non-essential | 67 | 0.1163 | 0.0407 | 7.3506 | −0.0661 |
YCL027W | non-essential | 67 | 0.0730 | 0.009 | 18.7385 | 4.1962 |
YBL085W | non-essential | 67 | −0.2364 | 0.0212 | 13.4245 | −0.8018 |
YGR254W | non-essential | 67 | −0.0167 | 0.0298 | 7.08 | −0.0244 |
YAR014C | non-essential | 65 | −0.20841 | 0.016346 | 13.29514 | −2.5911 |
YDR171W | non-essential | 61 | −0.28158 | 0.023497 | 4.637564 | 0.0205 |
YHR140W | non-essential | 60 | −0.55414 | 0.249718 | 30.04586 | −2.13 |
YML048W | non-essential | 60 | −0.44267 | 0.136723 | 18.25437 | −0.4479 |
YGR262C | non-essential | 60 | −0.09723 | 0.029379 | 7.759587 | 0.4257 |
YJL098W | non-essential | 59 | −0.26835 | 0.04851 | 10.58723 | −0.5065 |
YLR096W | non-essential | 58 | −0.27887 | 0.047792 | 11.02443 | −0.6211 |
YJL095W | non-essential | 58 | −0.03866 | 0.050817 | 13.54987 | 0.7098 |
YGL237C | non-essential | 57 | −0.39525 | 0.025063 | 10.00474 | 0.8943 |
YNL135C | non-essential | 57 | 0.046016 | 0.022556 | 8.115793 | 0.1006 |
YDR386W | non-essential | 57 | −0.65712 | 0.030702 | 6.746308 | −1.4038 |
YCL040W | non-essential | 55 | −0.11312 | 0.020875 | 5.221536 | 0.0757 |
YDL101C | non-essential | 55 | 0.069512 | 0.041077 | 9.862956 | −0.4066 |
YGL173C | non-essential | 52 | −0.80869 | 0.032428 | 5.137779 | −0.8367 |
YER179W | non-essential | 50 | −0.14783 | 0.04 | 6.687021 | −0.357 |
YKL065C | non-essential | 50 | −0.3737 | 0.173061 | 15.15515 | −0.7295 |
With the large amount of PPI data available for some species, the discovery of essential proteins at the network level is becoming a hot topic. Many network topology-based centrality measures for the discovery of essential proteins have been proposed. However, the currently available PPI networks for each species are incomplete (false negatives) and very noisy (many false positives). At the same time, most network topology-based methods depend on the reliability of the available protein-protein interactions and are thus very sensitive to noise in the network. Moreover, essential proteins may have distinct clustering properties (date hubs versus party hubs), while essential and non-essential proteins often share common features, such as the high degree of hub proteins. It is therefore very challenging to capture the features that truly distinguish essential proteins from non-essential ones.
To tackle the above difficulties, we propose a new centrality measure, named CoEWC, based on the integration of PPI data and gene expression data. CoEWC aims to capture the common features of essential proteins in both date hubs and party hubs by integrating PCC with CC. CoEWC is applied to the PPI network of
Although CoEWC performs well in the discovery of essential proteins, there is still room to improve the prediction precision. First, the integration of PCC and CC in this paper is very simple, and a more sophisticated relationship between PCC and CC may exist. Second, better methods may exist for capturing the properties that distinguish essential proteins from non-essential ones. Finally, besides gene expression data, other protein-related data, such as biological process, domain information, and localization, should also be valuable for identifying essential proteins.