A New Method for Identifying Essential Proteins Based on Network Topology Properties and Protein Complexes

Essential proteins are indispensable to the viability and reproduction of an organism. The identification of essential proteins is necessary not only for understanding the molecular mechanisms of cellular life but also for disease diagnosis, medical treatments and drug design. Many computational methods have been proposed for discovering essential proteins, but the precision of the prediction of essential proteins remains to be improved. In this paper, we propose a new method, LBCC, which is based on the combination of local density, betweenness centrality (BC) and in-degree centrality of complex (IDC). First, we introduce the common centrality measures; second, we propose the densities Den1(v) and Den2(v) of a node v to describe its local properties in the network; and finally, the combined strategy of Den1, Den2, BC and IDC is developed to improve the prediction precision. The experimental results demonstrate that LBCC outperforms traditional topological measures for predicting essential proteins, including degree centrality (DC), BC, subgraph centrality (SC), eigenvector centrality (EC), network centrality (NC), and the local average connectivity-based method (LAC). LBCC also improves the prediction precision by approximately 10 percent on the YMIPS and YMBD datasets compared to the most recently developed method, LIDC.


Introduction
Essential proteins are indispensable to the viability or reproduction of an organism and play a decisive role in cellular life [1]. Deletion of a single essential protein is sufficient for causing lethality or infertility [2]. Compared to non-essential proteins, essential proteins are more likely to be conserved in biological evolution [3]. Essential proteins provide insights into the molecular mechanisms of an organism at the system level, with significant implications for drug design and disease study [4]. For example, in drug development, essential proteins are excellent targets for new drugs and vaccines to treat and prevent diseases and for improved diagnostic tools that detect infections more reliably [5].
There are two types of methods for predicting essential proteins. One is experimental procedures, such as RNA interference [6], single gene knockouts [7], and conditional knockouts [8]. However, these experimental procedures require considerable time and resources, even for well-studied organisms, and they are not always practical. The other type of method is bioinformatics computational approaches that take advantage of the abundance of experimental data available for protein interaction networks, such as degree centrality (DC) [9], betweenness centrality (BC) [10], subgraph centrality (SC) [11], eigenvector centrality (EC) [12], network centrality (NC) [13], and the local average connectivity-based method (LAC) [14]. Obviously, the latter is faster and less expensive than the former.
In 2015, Luo and Qi [15] proposed a method named LIDC for discovering essential proteins based on the local interaction density and protein complexes. The experimental results obtained with the YMIPS dataset demonstrated that the performance of LIDC was superior to that of nine reference methods (i.e., DC, BC, NC, LID [15], PeC [16], CoEWC [17], WDC [18], ION [19], and UC [20]).
However, methods based on bioinformatics computational approaches are sensitive to the local or global topological properties of the network, and the prediction precision for identifying essential proteins requires further improvement. In this paper, we first introduce the densities Den1(v) and Den2(v) of a node v to describe its local properties in the network. Then, a novel method called LBCC is proposed, which combines Den1, Den2, BC, and IDC: the local properties of a node are measured by Den1 and Den2, its global properties are measured by BC, and its protein complex information is measured by IDC, which was first introduced in [15]. This combination of features has not previously been considered for this problem.
We performed several experiments on four different PPI (protein-protein interaction) networks of Saccharomyces cerevisiae (YMIPS, YMBD, YHQ and YDIP), which are described in the Experimental data section. The experimental results demonstrate that our LBCC method provides superior prediction performance compared to centrality measures, including DC, BC, SC, EC, NC, and LAC. In particular, compared to the most recent method, LIDC, which is a more effective method for predicting essential proteins, LBCC improves the prediction precision by approximately 10 percent on the YMIPS and YMBD datasets.

Notation
For an undirected simple graph G(V, E) with a set of nodes V and a set of edges E, a node v ∈ V denotes a protein and an edge e(u, v) ∈ E denotes an interaction between two proteins u and v. N_v denotes the set of all neighbors of node v, and |N_v| denotes the number of nodes in N_v. Let G[S] denote the subgraph of G induced by the node set S.

Centrality measures
Many researchers have found centrality measures to be useful for predicting essential proteins [21,22]. A PPI network is typically represented as an undirected simple graph G(V, E).
Here, we will introduce six classical centrality measures based on network topological properties.
Degree centrality (DC). The degree centrality of a node v is the number of its neighbor nodes,
$$DC(v) = \deg(v),$$
where deg(v) = |N_v| is the degree of v.
Betweenness centrality (BC). The betweenness centrality of a node v accumulates, over all pairs of other nodes, the fraction of the shortest paths passing through the node v,
$$BC(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}},$$
where σ_st is the number of shortest paths between s and t, and σ_st(v) is the number of such paths passing through v.
Subgraph centrality (SC). The subgraph centrality of a node v accounts for the participation of v in all subgraphs of the network,
$$SC(v) = \sum_{k=0}^{\infty} \frac{\mu_k(v)}{k!},$$
where μ_k(v) is the number of closed walks of length k that start and end at node v.
Eigenvector centrality (EC). The eigenvector centrality of a node v is the value of the vth component of the principal eigenvector of A,
$$EC(v) = \alpha_{\max}(v),$$
where α_max represents the eigenvector that corresponds to the largest eigenvalue of the adjacency matrix A and α_max(v) is the vth component of α_max.
Local average connectivity centrality (LAC). The local average connectivity centrality of a node v is the average local connectivity of its neighbors,
$$LAC(v) = \frac{\sum_{u \in N_v} \deg^{C_v}(u)}{|N_v|},$$
where C_v is the subgraph G[N_v] and deg^{C_v}(u) is the number of neighbors of u in C_v for a node u ∈ N_v.
In-degree centrality of complex (IDC). The in-degree centrality of complex of a node v is defined as
$$IDC(v) = \sum_{i \in ComplexSet(v)} \text{IN-Degree}(v)_i,$$
where ComplexSet(v) represents the set of protein complexes including protein v and IN-Degree(v)_i is the value of DC(v) within the ith protein complex belonging to ComplexSet(v).
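The measures used later by LBCC (DC, BC, LAC and IDC) can be illustrated on a toy graph. The following is a minimal sketch in plain Python, not the CytoNCA implementation used in this paper. The function names, the four-node graph and the two complexes are hypothetical, BC is computed with Brandes' algorithm counting each unordered pair {s, t} once, and IN-Degree(v)_i is assumed to be the degree of v inside the subgraph induced by complex i.

```python
from collections import deque


def build_adjacency(edges):
    """Return an undirected simple graph as {node: set_of_neighbors}."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # a simple graph has no self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """DC(v) = deg(v) = |N_v|."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness_centrality(adj):
    """BC(v) via Brandes' algorithm; each unordered pair {s, t} counts once."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        order = []                    # nodes by non-decreasing distance from s
        preds = {v: [] for v in adj}  # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}   # number of shortest s-v paths
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):     # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # every unordered pair was visited from both endpoints, hence the halving
    return {v: b / 2 for v, b in bc.items()}


def local_average_connectivity(adj):
    """LAC(v): average degree of the neighbors of v inside C_v = G[N_v]."""
    lac = {}
    for v, nbrs in adj.items():
        total = sum(len(adj[u] & nbrs) for u in nbrs)
        lac[v] = total / len(nbrs) if nbrs else 0.0
    return lac


def in_degree_centrality_of_complex(adj, complexes):
    """IDC(v): sum, over complexes containing v, of v's degree in the complex."""
    idc = {v: 0 for v in adj}
    for members in complexes:
        members = set(members)
        for v in members:
            if v in adj:
                idc[v] += len(adj[v] & members)
    return idc


# Hypothetical toy data: a triangle A-B-C with a pendant node D attached to C.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
complexes = [{"A", "B", "C"}, {"C", "D"}]
adj = build_adjacency(edges)
dc = degree_centrality(adj)
bc = betweenness_centrality(adj)
lac = local_average_connectivity(adj)
idc = in_degree_centrality_of_complex(adj, complexes)
```

In this toy graph only node C lies on shortest paths between other nodes (A to D and B to D), so it is the only node with nonzero BC.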

Local properties of nodes in a PPI network
There are many local properties of nodes in a PPI network, such as the degree centrality (DC) and the local clustering coefficient [23], which is defined as
$$C(v) = \frac{2\,|E(G[N_v])|}{|N_v|\,(|N_v| - 1)},$$
where E(G[N_v]) is the edge set of the subgraph induced by the neighbors of v. In this section, we propose two types of local properties of nodes in a PPI network, Den1(v) and Den2(v), which are defined as follows.
Den1(v). For a node v, let H = G[N_v ∪ {v}]; then, we define
$$Den_1(v) = \frac{2\,|E(H)|}{|V(H)|\,(|V(H)| - 1)},$$
which is the proportion of the number of the edges to the number of all possible edges of H.
Den2(v). For a node v, let M be the set of nodes for which the distance to v is 2, and let H' = G[N_v ∪ M ∪ {v}]; then, we define
$$Den_2(v) = \frac{2\,|E(H')|}{|V(H')|\,(|V(H')| - 1)}.$$
Hence, Den2(v) is the density of the subgraph induced by v and the set of nodes for which the distance to v is 1 or 2. Considering the graph G shown in Fig 1 as an example, except for the leaf nodes, the values of Den1(v) and Den2(v) of the other nodes are presented in Table 1.
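The two densities can be sketched in a few lines of Python. This is an illustrative reading of the definitions, assuming that Den1(v) is the edge density of the subgraph induced by v and its neighbors and that Den2(v) is the edge density of the subgraph induced by v and all nodes at distance 1 or 2. The helper names and the five-node graph are hypothetical and are not the graph of Fig 1.

```python
def density(adj, nodes):
    """Edge density of the subgraph induced by `nodes`: 2m / (n(n-1))."""
    nodes = set(nodes)
    n = len(nodes)
    if n < 2:
        return 0.0
    # each edge inside `nodes` is seen from both endpoints, hence the // 2
    m = sum(len(adj[u] & nodes) for u in nodes) // 2
    return 2 * m / (n * (n - 1))


def den1(adj, v):
    """Density of the subgraph induced by v and its direct neighbors."""
    return density(adj, adj[v] | {v})


def den2(adj, v):
    """Density of the subgraph induced by v and all nodes within distance 2."""
    one_hop = adj[v]
    two_hop = set().union(*(adj[u] for u in one_hop)) - one_hop - {v}
    return density(adj, one_hop | two_hop | {v})


# Hypothetical toy graph: triangle A-B-C followed by the path C-D-E.
adj = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C", "E"},
    "E": {"D"},
}
den1_values = {v: den1(adj, v) for v in adj}
den2_values = {v: den2(adj, v) for v in adj}
```

For node A the one-hop subgraph is the triangle {A, B, C}, which is complete, whereas the two-hop subgraph adds D and one extra edge, so Den2(A) is lower than Den1(A).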
To evaluate the effects of Den1 and Den2 on the prediction of essential proteins, we performed experiments on the YMIPS and YMBD datasets, which are described in the Experimental data section. Because BC values represent the global properties of nodes, we first computed BC(v) for each node v in YMIPS and YMBD and then computed the local properties Den1(v) and Den2(v). For YMIPS, we found 33 pairs of nodes in which both nodes have the same value of BC(v), and Den1(v) and Den2(v) facilitate identifying the essential proteins in 6 of these pairs. For YMBD, we found 39 such pairs, and Den1(v) and Den2(v) facilitate locating the essential proteins in 8 of them. In Table 2, we list the values of Den1(v), Den2(v) and BC(v) of these pairs of nodes for YMIPS and YMBD. Hence, we believe that the local properties Den1(v) and Den2(v) are important for aiding in locating essential proteins.

New centrality measure: LBCC
In this section, we propose a new method, LBCC, by combining Den1, Den2, BC and IDC. The following basic concepts underlie LBCC: (1) essential proteins tend to form highly connected clusters [24]; (2) essential proteins gather in protein complexes [20]; and (3) both local and global properties are important for aiding in locating essential proteins.
Therefore, for a node v of the network, we use IDC(v) to represent its information on protein complexes and BC(v) to represent its global properties. For the contribution of local properties and highly connected clusters, we use Den1(v) and Den2(v). Because the value ranges of these measures differ, we apply a log transformation to normalize the data. Now, we can describe our new measurement LBCC for evaluating the essentiality of a node v,
$$LBCC(v) = a \log Den_1(v) + b \log Den_2(v) + c \log BC(v) + d \log IDC(v),$$
where a, b, c, and d are scaling parameters that range from 0 to 10 and represent the importance of the corresponding item used in the LBCC calculation. We set IDC(v) = 0.001 if a protein v does not appear in any protein complex.
We perform a large number of experiments to identify essential proteins in the YMIPS dataset, and we find that the measurement LBCC has the best performance when a, b, c and d are set to 1, 4, 3 and 1, respectively. To improve the values of these parameters, we also conduct some experiments using a logistic regression classifier; however, the results are extremely poor due to the imbalanced datasets, in which the number of nonessential proteins is approximately three times greater than the number of essential proteins for the four PPI networks.
As shown in Table 2, the values of BC are far greater than those of Den1 and Den2. For IDC, the majority of its values are between 10 and 100 on the YMIPS dataset. Hence, IDC and BC are more important than Den1 and Den2 when calculating LBCC.
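A small sketch can make the scoring and ranking step concrete. It assumes that LBCC is a weighted sum of log-transformed measures, a log Den1(v) + b log Den2(v) + c log BC(v) + d log IDC(v), with a, b, c, d paired with Den1, Den2, BC and IDC in that order. The natural logarithm is used, although the ranking does not depend on the base. The floor of 0.001 is stated in the text only for IDC; applying the same floor to the other measures (for example to nodes with BC(v) = 0) is an assumption of this sketch, and the three proteins and their feature values are hypothetical.

```python
import math


def lbcc(den1, den2, bc, idc, a=1, b=4, c=3, d=1, floor=0.001):
    """Weighted sum of log-transformed measures (assumed form of LBCC).

    Values below `floor` are raised to `floor` so the logarithm is defined.
    The text specifies this only for IDC; using it for Den1, Den2 and BC
    as well is an assumption made for this sketch.
    """
    terms = ((a, den1), (b, den2), (c, bc), (d, idc))
    return sum(w * math.log(max(x, floor)) for w, x in terms)


def rank_by_lbcc(features):
    """Return protein names sorted by descending LBCC score."""
    scores = {p: lbcc(*vals) for p, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical feature tuples: (Den1, Den2, BC, IDC) per protein.
features = {
    "P1": (0.6, 0.3, 120.0, 25),
    "P2": (0.6, 0.3, 120.0, 0),   # not in any complex, so IDC falls to the floor
    "P3": (0.2, 0.1, 5.0, 3),
}
ranking = rank_by_lbcc(features)
```

With the default weights 1, 4, 3 and 1 reported for YMIPS, P1 and P2 differ only in their complex information, so P1 ranks above P2.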

Experimental data
To evaluate the performance of the LBCC method, we used Saccharomyces cerevisiae as the experimental material because relatively reliable and complete PPI data are available for this organism. The PPI network data are from the MIPS database (Mammalian Protein-Protein Interaction Database) [25], the DIP database [26], and other datasets from the website of the Mark Gerstein Lab (gersteinlab.org).
We selected four different datasets. The first dataset, a MIPS dataset, is denoted YMIPS (S1 Text); the second and third datasets, from the Mark Gerstein Lab, are denoted YMBD (S2 Text) and YHQ (S3 Text), respectively; and the fourth dataset, a DIP dataset, is denoted YDIP (S4 Text). YMIPS includes 4546 proteins and 12319 interactions, and its average degree is approximately 5.42. YMBD, which was compiled from MIPS, BIND and DIP, includes 2559 proteins and 11835 interactions, and its average degree is approximately 9.25. YHQ, a comprehensive and reliable network constructed by Yu et al. [27], includes 4743 proteins and 23294 interactions, and its average degree is approximately 9.82. YDIP includes 5093 proteins and 24743 interactions, and its average degree is approximately 9.72.
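The average degrees quoted above follow from the identity that the average degree of an undirected graph with n nodes and m edges is 2m/n. The short check below reproduces them from the protein and interaction counts given in the text; only the variable and function names are introduced here.

```python
# (number of proteins, number of interactions) as reported for each network
datasets = {
    "YMIPS": (4546, 12319),
    "YMBD": (2559, 11835),
    "YHQ": (4743, 23294),
    "YDIP": (5093, 24743),
}


def average_degree(n_nodes, n_edges):
    """Each undirected edge contributes to the degree of two nodes."""
    return 2 * n_edges / n_nodes


avg = {name: round(average_degree(n, m), 2) for name, (n, m) in datasets.items()}
for name, value in avg.items():
    print(f"{name}: {value}")
```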

Evaluation methods
In general, several statistical measures, such as sensitivity (SN), specificity (SP), positive predictive value (PPV), negative predictive value (NPV), F-measure (F), and accuracy (ACC), are used to determine how effectively the essential proteins are identified by different methods (see the references [13,15]). We introduce them in this section to evaluate the effectiveness of the proposed method LBCC. First, we provide four statistical terms:
• True positives (TP). The essential proteins that are correctly selected as essential.
• False positives (FP). The nonessential proteins that are incorrectly selected as essential.
• True negatives (TN). The nonessential proteins that are correctly selected as nonessential.
• False negatives (FN). The essential proteins that are incorrectly selected as nonessential.
Next, we provide the definitions of six statistical measures:
Sensitivity. Sensitivity is the ratio of the proteins that are correctly selected as essential to the total number of essential proteins,
$$SN = \frac{TP}{TP + FN}.$$
Specificity. Specificity is the ratio of the nonessential proteins that are correctly selected as nonessential to the total number of nonessential proteins,
$$SP = \frac{TN}{TN + FP}.$$
Positive predictive value. Positive predictive value is the ratio of the proteins that are correctly selected as essential to all proteins selected as essential,
$$PPV = \frac{TP}{TP + FP}.$$
Negative predictive value. Negative predictive value is the ratio of the proteins that are correctly selected as nonessential to all proteins selected as nonessential,
$$NPV = \frac{TN}{TN + FN}.$$
F-measure. F-measure is the harmonic mean of SN and PPV,
$$F = \frac{2 \times SN \times PPV}{SN + PPV}.$$
Accuracy. Accuracy is the ratio of the proteins that are correctly selected as essential or nonessential to all proteins,
$$ACC = \frac{TP + TN}{P + N},$$
in which P represents the number of essential proteins and N represents the number of nonessential proteins.
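The six measures can be computed directly from the four counts. The sketch below uses the standard definitions (SN = TP/(TP+FN), SP = TN/(TN+FP), PPV = TP/(TP+FP), NPV = TN/(TN+FN), F as the harmonic mean of SN and PPV, and ACC = (TP+TN)/(P+N) with P = TP+FN and N = TN+FP). The function name and the example counts are hypothetical and are not results from this paper.

```python
def ratio(numerator, denominator):
    """Safe division: return 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, tn, fn):
    """Return the six statistical measures for one confusion matrix."""
    sn = ratio(tp, tp + fn)            # sensitivity
    sp = ratio(tn, tn + fp)            # specificity
    ppv = ratio(tp, tp + fp)           # positive predictive value
    npv = ratio(tn, tn + fn)           # negative predictive value
    f = ratio(2 * sn * ppv, sn + ppv)  # harmonic mean of SN and PPV
    acc = ratio(tp + tn, tp + fp + tn + fn)  # P + N = TP + FN + TN + FP
    return {"SN": sn, "SP": sp, "PPV": ppv, "NPV": npv, "F": f, "ACC": acc}


# Hypothetical counts, for illustration only.
measures = evaluate(tp=40, fp=10, tn=140, fn=10)
```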

Comparison with other prediction measures
To evaluate the performance of LBCC, we compared LBCC and other prediction measures using the four datasets described in the Experimental data section. The compared prediction measures included LIDC, DC, BC, SC, EC, NC, and LAC. The algorithm for LIDC was implemented according to [15], and the other algorithms were implemented using CytoNCA [33], a Cytoscape plugin for centrality analysis of PPI networks. First, we ranked proteins in descending order based on their LBCC values and on the values of the other prediction measures; second, we selected the top 100, 200, 300, 400, 500, and 600 proteins as essential proteins; and finally, the number of true essential proteins was determined. The prediction results of the eight methods for the four different networks are shown in Figs 2-5.
For the YMIPS dataset shown in Fig 2, LIDC, the most recent method, had the best performance among the seven previously proposed methods, with 66, 124, 177, 224, 265, and 314 true essential proteins identified at six levels from the top 100 to top 600. By comparison, the numbers of true essential proteins predicted by LBCC were 75, 145, 199, 248, 305, and 343, respectively. Relative to LIDC, LBCC exhibited superior performance and increased the number of correctly predicted essential proteins by more than 13, 16, 12, 10, 15 and 9 percent at six levels from the top 100 to top 600.
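The percentages reported for YMIPS are relative gains over LIDC, which can be checked from the counts given in the text. The short calculation below uses only those counts; the variable names are introduced for illustration.

```python
levels = [100, 200, 300, 400, 500, 600]
lidc_hits = [66, 124, 177, 224, 265, 314]   # true essential proteins found by LIDC
lbcc_hits = [75, 145, 199, 248, 305, 343]   # true essential proteins found by LBCC

# Relative gain of LBCC over LIDC at each level, in percent.
relative_gain = [100 * (b - a) / a for a, b in zip(lidc_hits, lbcc_hits)]

# Precision at each level: fraction of the top-n proteins that are essential.
lbcc_precision = [hits / n for hits, n in zip(lbcc_hits, levels)]

for n, gain, prec in zip(levels, relative_gain, lbcc_precision):
    print(f"top {n}: relative gain {gain:.1f}%, LBCC precision {prec:.3f}")
```

Truncating each relative gain to a whole number gives 13, 16, 12, 10, 15 and 9 percent, which matches the "more than" figures quoted for the six levels.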
For the YDIP dataset shown in Fig 5, LIDC achieved the best results at the top 100, 200, 300 and 500 levels, and LBCC attained the best results at the top 400 and 600 levels. At six levels, the numbers of true essential proteins identified by LIDC were 76, 152, 209, 260, 313, and 354. By comparison, the numbers of true essential proteins identified by LBCC were 74, 135, 205, 262, 308, and 361, respectively. The results predicted by LBCC were similar to those obtained using LIDC at the top 100, 300 and 500 levels.
Thus, our experiments indicate that LBCC can identify more essential proteins than the other methods in most cases.

Validation using six statistical methods and precision-recall curves
In this section, we compared LBCC and the other seven prediction measures using the six statistical methods described in the Evaluation methods section. We ranked the proteins in descending order based on the values of eight measures and selected the top 20 percent as essential proteins; the remaining proteins were considered nonessential proteins. The results are presented in Table 4, and the values of the six statistical methods for LBCC were consistently higher than those for the other methods on the first two networks, indicating that LBCC can predict essential proteins more accurately. For the YHQ dataset, the results predicted by LBCC were identical to those obtained using LIDC. For the YDIP dataset, the results predicted by LBCC were similar to those obtained using LIDC.
The precision-recall curve is a statistical method used for assessing the stability of the eight prediction measures. This curve is obtained by plotting precision against recall as the number n of top-ranked proteins varies,
$$Precision(n) = \frac{TP(n)}{TP(n) + FP(n)}, \qquad Recall(n) = \frac{TP(n)}{P},$$
where TP(n) is the total number of essential proteins correctly identified as essential proteins and FP(n) is the total number of nonessential proteins incorrectly identified as essential proteins among the top n proteins. P is the total number of essential proteins under consideration. The analysis of the six statistical measures and precision-recall curves indicated that LBCC not only has better prediction precision than the other seven methods but also delivers more stable performance for the first three networks.
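The points of a precision-recall curve can be generated in a single pass over a ranked list. The sketch below assumes that precision at cutoff n is TP(n)/n (since TP(n) + FP(n) = n), that recall is TP(n)/P, and that P counts the essential proteins present in the ranked list. The function name and the five-protein example are hypothetical.

```python
def precision_recall_points(ranked, essential):
    """Return [(recall, precision)] for cutoffs n = 1 .. len(ranked)."""
    essential = set(essential)
    p_total = sum(1 for protein in ranked if protein in essential)
    points = []
    tp = 0
    for n, protein in enumerate(ranked, start=1):
        if protein in essential:
            tp += 1
        recall = tp / p_total if p_total else 0.0
        precision = tp / n          # TP(n) + FP(n) equals the cutoff n
        points.append((recall, precision))
    return points


# Hypothetical ranking (best score first) and essential set.
ranked = ["p1", "p2", "p3", "p4", "p5"]
essential = {"p1", "p3", "p4"}
points = precision_recall_points(ranked, essential)
```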

Validation using jackknife methodology
We used the jackknife methodology developed by Holman et al. [34] to assess the generality of the prediction methods. First, we ranked the proteins in descending order based on their values obtained using the eight prediction methods. Then, the jackknife curve was plotted according to the cumulative number of the true essential proteins. In Figs 10-13, the x-axis represents the proteins ranked in descending order from left to right according to the values computed using the corresponding methods, and the y-axis represents the number of true essential proteins among the top n proteins, where n is the number along the x-axis.
As shown in Figs 10-12, the sorted curve of LBCC is significantly better than those of the other prediction measures for the YMIPS, YMBD and YHQ data. For the YDIP network, as shown in Fig 13, LBCC exhibited a performance similar to that of LIDC and superior to those of all the other methods. Hence, the LBCC method is feasible and effective for predicting essential proteins for the first three networks.
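The jackknife curve described here is a cumulative count along the ranked list. A minimal sketch, with a hypothetical function name and toy data, could look as follows.

```python
from itertools import accumulate


def jackknife_curve(ranked, essential):
    """y[n-1] = number of true essential proteins among the top n proteins."""
    essential = set(essential)
    return list(accumulate(1 if p in essential else 0 for p in ranked))


# Hypothetical ranking (best score first) and essential set.
ranked = ["p1", "p2", "p3", "p4", "p5"]
essential = {"p1", "p3", "p4"}
curve = jackknife_curve(ranked, essential)

# Number of true essential proteins among the top 3 proteins.
top3_hits = curve[3 - 1]
```

A method whose curve lies above another's at every n identifies at least as many true essential proteins at every cutoff.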

Analysis of the differences between LBCC and other measures
To further determine why LBCC performs well on the four datasets for predicting essential proteins, we studied the difference between LBCC and the other prediction measures by predicting a small number of proteins. Let A ∩ B denote the set of proteins predicted by both methods A and B, A − B denote the set of proteins predicted by method A but not by method B, and A ∪ B denote the set of proteins predicted by method A or B.
We compared the performances of LBCC and the other seven methods in predicting the top 100 proteins ranked by the corresponding methods. The comparison results are presented in Table 5.
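The quantities compared in this analysis are plain set operations on the two top-ranked lists. The sketch below shows one way to compute the overlap and the numbers of true essential proteins in A − B and B − A; the function name, the dictionary keys and the toy data are hypothetical and do not reproduce Table 5.

```python
def compare_top(ranked_a, ranked_b, essential, top=100):
    """Compare the top-`top` predictions of two methods A and B."""
    a, b = set(ranked_a[:top]), set(ranked_b[:top])
    essential = set(essential)
    return {
        "overlap": len(a & b),                          # |A ∩ B|
        "a_only_essential": len((a - b) & essential),   # essential in A − B
        "b_only_essential": len((b - a) & essential),   # essential in B − A
        "union_size": len(a | b),                       # |A ∪ B|
    }


# Hypothetical rankings (best score first) and essential set.
ranked_a = ["p1", "p2", "p3", "p4"]
ranked_b = ["p3", "p4", "p5", "p6"]
essential = {"p1", "p3", "p5", "p6"}
result = compare_top(ranked_a, ranked_b, essential, top=4)
```

The union size is smaller than twice the cutoff whenever the two methods share predictions, which is why a combined subgraph of two top-100 lists can have fewer than 200 nodes.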
For the YMIPS dataset, as indicated in column |LBCC ∩ M|, the rates of overlap of the proteins predicted by LBCC and the other six methods (DC, LAC, SC, EC, BC, and NC) were less than 20 percent, and no protein was predicted by LBCC, SC, and EC. The rate of overlap of proteins predicted by LBCC and LIDC was 35 percent. The fifth column is the number of true essential proteins in the set LBCC − M, and the sixth column is the number of true essential proteins in the set M − LBCC. The number of true essential proteins identified by LBCC was the highest among the prediction methods. In particular, LBCC yielded 50 more true essential proteins than DC, LAC, SC, EC, BC and NC. We also plotted the subgraph of the top 100 proteins predicted by DC and the top 100 proteins predicted by LBCC in Fig 14 and the subgraph of the top 100 proteins predicted by SC and the top 100 proteins predicted by LBCC in Fig 15. The number of nodes in a subgraph is less than 200 if LBCC ∩ DC ≠ ∅ (or LBCC ∩ SC ≠ ∅). In the two subgraphs, the blue nodes and green nodes form a dense network, whereas the red nodes and yellow nodes form sparse networks in which there are even several isolated nodes. Hence, the essential proteins identified by LBCC exhibit significant modularity.
For the YMBD dataset, the column |LBCC ∩ M| demonstrates that the rate of overlap of proteins predicted by LBCC and the other seven methods was not greater than 20 percent. The fifth and sixth columns show that LBCC predicted 50 more true essential proteins than the other prediction methods, including LIDC. Similarly, we plotted two subgraphs of LAC ∪ LBCC and LIDC ∪ LBCC, shown in Figs 16 and 17, respectively. The blue nodes and green nodes form two dense networks, whereas the red nodes and yellow nodes form sparse networks. Hence, the essential proteins identified by LBCC also exhibit stronger modularity.
For the YHQ dataset, as indicated by column |LBCC ∩ M|, the rates of overlap of the proteins are less than 40 percent, except for LIDC, for which the rate of overlap is 48 percent. The fifth and sixth columns show that the number of true essential proteins predicted by LBCC is less than those predicted by the other methods due to the less desirable results at the top 100 level (see Fig 4). We also plotted the two subgraphs for BC ∪ LBCC and NC ∪ LBCC, shown in Figs 18 and 19, respectively. The blue nodes and green nodes form some dense networks, whereas the red nodes and yellow nodes form four sparse networks. Thus, the essential proteins predicted by LBCC show stronger modularity.
For the YDIP dataset, the column |LBCC ∩ M| shows that the rate of overlap of the proteins predicted by LBCC and the other six methods (DC, LAC, SC, EC, BC, and NC) is less than 30 percent. As indicated by the fifth and sixth columns, the number of true essential proteins predicted by LBCC is greater than those predicted by the other methods (DC, LAC, SC, EC, BC, and NC). Compared with LIDC, the rate of overlap is 66 percent, and 6 fewer true essential proteins are predicted by LBCC compared to LIDC. Similarly, we plotted the two subgraphs for EC ∪ LBCC and DC ∪ LBCC, shown in Figs 20 and 21, respectively. The blue nodes and green nodes form dense networks, whereas the red nodes and yellow nodes form some sparse networks. Thus, the essential proteins predicted by LBCC exhibit stronger modularity.
The analysis of the differences between these measures demonstrates that LBCC is significantly different from the other measures and is more accurate in terms of the discovery of essential proteins in most cases.

Results on human PPI network
To further evaluate the performance of the proposed method LBCC, we also applied it to identify essential proteins on a human PPI network. The human PPI network data marked HDIP were from the DIP database [26], the essential proteins were collected from DEG [29], and the protein complex set marked HCOM was from CORUM (Comprehensive Resource of Mammalian protein complexes) [35]. HDIP consisted of 4647 interactions and 2914 proteins, including 1887 essential proteins, and HCOM contained 1283 protein complexes.
First, we compared the performances of LBCC and the other seven methods at six levels from the top 100 to top 600. As shown in Fig 22, almost every method achieved more than 70 percent precision due to the large proportion of essential proteins, and LBCC achieved the best results at the top 100-400 levels. However, LBCC tended to provide less desirable results compared with LIDC at the top 500 and 600 levels.
Second, we used six statistical methods and precision-recall curves to evaluate the performance of LBCC and the other methods. As shown in Table 6, the values of the six statistical methods for LBCC were slightly lower than for LIDC. From the precision-recall curves shown in Fig 23, LBCC performed better than the other methods between the recall levels of 0 and 0.22.
Finally, we used the jackknife methodology to assess the generality of LBCC and the other seven methods. The results are presented in Fig 24, in which LBCC exhibited a performance similar to that of LIDC before the top 500 and superior to LAC, SC, EC and NC. Hence, the LBCC method is also effective for predicting essential proteins for the human PPI network HDIP.
Fig 20 (legend). The green nodes and blue nodes are proteins identified by EC; the former are true essential proteins, and the latter are nonessential proteins. The red nodes and yellow nodes are proteins identified by LBCC; the former are true essential proteins, and the latter are nonessential proteins. doi:10.1371/journal.pone.0161042.g020

Conclusion
The identification of essential proteins is helpful for comprehending the minimal requirements for cellular life, and many approaches based on topological properties have been proposed for discovering essential proteins in PPI networks. Most of the topology-based methods only concentrate on either local or global characteristics and are also sensitive to the network structure.
In 2015, Luo and Qi [15] proposed the method LIDC based on information on protein complexes. LIDC outperformed classical topological centrality measures. In this paper, we propose a new method, LBCC, based on the combination of three types of characteristics of the protein-protein interaction network, measured by Den1(v), Den2(v), BC(v) and IDC(v), which represent both local and global characteristics and information on protein complexes.
Figure legend (subgraph of DC and LBCC). The green nodes and blue nodes are proteins identified by DC; the former are true essential proteins, and the latter are nonessential proteins. The red nodes and yellow nodes are proteins identified by LBCC; the former are true essential proteins, and the latter are nonessential proteins. The black nodes are the overlapping proteins.
We applied LBCC to four PPI networks of Saccharomyces cerevisiae: YMIPS, YMBD, YHQ and YDIP. We then conducted comprehensive comparisons of LBCC and seven previously proposed methods (DC, BC, SC, EC, NC, LAC and LIDC) in terms of the number of true essential proteins identified. At the six levels from the top 100 to top 600, LBCC outperformed recent prediction methods on the YMIPS and YMBD datasets. In particular, LBCC improved the prediction precision by approximately 10 percent compared to LIDC. Based on the analysis of the six statistical measures, precision-recall curves and jackknife methodology for the four datasets, the experimental results demonstrate that LBCC is more stable and general than the recently developed prediction methods in most cases. Moreover, we also applied LBCC to a human PPI network, HDIP. The experimental results show that LBCC is also effective for predicting essential proteins for the HDIP network. Hence, we conclude that LBCC is more effective than the compared methods for predicting essential proteins in most cases, and in some cases the improvement is substantial. In future studies, we will integrate additional information, such as domain information, gene ontology and gene expression data, to predict essential proteins more effectively and accurately.

Supporting Information
S1 Excel. Essential protein and nonessential protein data.