A Multi-Label Predictor for Identifying the Subcellular Locations of Singleplex and Multiplex Eukaryotic Proteins

  • Xiao Wang,
  • Guo-Zheng Li

    gzli@tongji.edu.cn

    Affiliation The MOE Key Laboratory of Embedded System and Service Computing, Department of Control Science and Engineering, Tongji University, Shanghai, China

Abstract

Subcellular locations of proteins are important functional attributes. An effective and efficient subcellular localization predictor is necessary for rapidly and reliably annotating the subcellular locations of proteins. Most existing subcellular localization methods deal only with single-location proteins. In fact, proteins may simultaneously exist at, or move between, two or more different subcellular locations. To better reflect the characteristics of multiplex proteins, new methods for dealing with them are highly desirable. In this paper, a new predictor called Euk-ECC-mPLoc has been developed for systems containing both singleplex and multiplex eukaryotic proteins. It introduces a powerful multi-label learning approach that exploits correlations between subcellular locations, and it hybridizes gene ontology information with dipeptide composition information. It can be used to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centrosome, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole. Experimental results of the jackknife cross-validation test on a stringent benchmark dataset of eukaryotic proteins show that the average success rate and overall success rate obtained by Euk-ECC-mPLoc were 69.70% and 81.54%, respectively, indicating that our approach is quite promising. In particular, the success rates achieved by Euk-ECC-mPLoc for small subsets were remarkably improved, indicating that it holds a high potential for stimulating the development of the area. As a user-friendly web-server, Euk-ECC-mPLoc is freely accessible to the public at the website http://levis.tongji.edu.cn:8080/bioinfo/Euk-ECC-mPLoc/. We believe that Euk-ECC-mPLoc may become a useful high-throughput tool, or at least play a complementary role to the existing predictors in identifying subcellular locations of eukaryotic proteins.

Introduction

Proteins perform their appropriate functions only when they are located in the correct subcellular locations. Therefore, one of the fundamental goals in cell biology and proteomics is to identify the subcellular locations of these proteins. Although the subcellular localization of a protein may be determined by carrying out various biochemical experiments, relying purely on experiments is both time-consuming and costly. In the post-genomic age, the gap between newly found protein sequences and the information on their subcellular localization is becoming increasingly wide. To bridge this gap, it is highly desirable to develop computational methods that predict protein subcellular localization automatically and accurately. During the past decade, many efforts have been devoted to this challenge, and a large number of computational methods have been developed in an attempt to predict the subcellular localization of proteins (see, e.g., [1]–[16] as well as a long list of references cited in two review papers [17], [18]).

Unfortunately, the aforementioned methods do not take multiple-location, or multiplex, proteins into account when predicting protein subcellular localization. In general, they were established under the assumption that a protein resides at one, and only one, subcellular location. However, proteins may simultaneously reside at, or move between, two or more different subcellular locations. Proteins with multiple location sites or dynamic features of this kind are particularly interesting, because they may have unique biological functions worthy of special notice [19], [20]. In particular, recent evidence indicates that an increasing number of proteins have multiple locations in the cell, as reported by Millar et al. [21].

In this paper, we focus on predicting the subcellular locations of eukaryotic proteins with both singleplex and multiplex sites. So far, only three existing predictors, i.e., Euk-mPLoc [22], Euk-mPLoc 2.0 [23] and iLoc-Euk [24], can be used to predict the subcellular locations of both singleplex and multiplex eukaryotic proteins. To the best of our knowledge, iLoc-Euk is at present the best predictor with the capacity to deal with multiplex proteins when predicting eukaryotic protein subcellular localization. However, the ML-KNN prediction engine used by iLoc-Euk is not optimal because it does not take correlations among subcellular locations into account.

In this paper, to better reflect the characteristics of multiplex proteins, a new predictor called Euk-ECC-mPLoc has been developed for systems containing both singleplex and multiplex eukaryotic proteins. It introduces a powerful multi-label learning algorithm that exploits correlations between subcellular locations, and it hybridizes the gene ontology information with the dipeptide composition information. Our experimental results of the jackknife cross-validation test on a benchmark dataset consisting of 7,766 eukaryotic protein sequences show that the overall success rate obtained by the proposed predictor Euk-ECC-mPLoc is higher than that obtained by the iLoc-Euk predictor. Moreover, for some subcellular locations with very few training proteins, the success rates achieved by Euk-ECC-mPLoc are higher than those achieved by iLoc-Euk. Therefore, Euk-ECC-mPLoc significantly improves the predictive performance on those “difficult” subcellular locations.

According to a recent comprehensive review [25], to establish a practically useful statistical predictor for a protein system, we need to consider the following procedures: (i) construct or select a valid benchmark dataset to train and test the predictor; (ii) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target concerned; (iii) introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe in detail how to deal with these steps one-by-one.

Materials and Methods

Dataset

In this paper, the dataset from iLoc-Euk [24] is used as the benchmark dataset. It can be obtained from the Online Supporting Information S1 of [24]. The dataset was constructed specifically for eukaryotic proteins, and none of the proteins included has greater than or equal to 25% pairwise sequence identity to any other in the same subcellular location, which makes it stringent compared with most of the other benchmark datasets in this area. Using this dataset makes it easier and more reliable to compare our new predictor with the existing ones.

The dataset contains 7,766 different eukaryotic protein sequences, of which 6,687 belong to one subcellular location, 1,029 to two locations, 48 to three locations, and 2 to four locations. The dataset covers 22 different subcellular locations as shown in Fig. 1, and hence can be represented as

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \cdots \cup \mathbb{S}_{22} \tag{1}$$

where $\mathbb{S}_1$ represents the subset for the subcellular location of “acrosome”, $\mathbb{S}_2$ for “cell membrane”, $\mathbb{S}_3$ for “cell wall”, and so forth. A breakdown of the 7,766 eukaryotic proteins in the benchmark dataset according to their 22 location sites is given in Table 1. To avoid redundancy and homology bias, none of the proteins in $\mathbb{S}$ has greater than or equal to 25% pairwise sequence identity to any other in the same subset. For convenience, hereafter we use the subscripts of Eq.(1) as the codes of the 22 location sites; i.e., “1” for “acrosome”, “2” for “cell membrane”, “3” for “cell wall”, and so forth (Table 1).

Figure 1. Schematic illustration to show the 22 subcellular locations of eukaryotic proteins.

They are: (1) acrosome, (2) cell membrane, (3) cell wall, (4) centrosome, (5) chloroplast, (6) cyanelle, (7) cytoplasm, (8) cytoskeleton, (9) endoplasmic reticulum, (10) endosome, (11) extracellular, (12) Golgi apparatus, (13) hydrogenosome, (14) lysosome, (15) melanosome, (16) microsome, (17) mitochondrion, (18) nucleus, (19) peroxisome, (20) spindle pole body, (21) synapse, and (22) vacuole. Adopted from [24] with permission.

https://doi.org/10.1371/journal.pone.0036317.g001

Table 1. Breakdown of the eukaryotic protein benchmark dataset taken from [24].

https://doi.org/10.1371/journal.pone.0036317.t001

Note that because some proteins may occur in two or more different locations, the 7,766 different proteins actually correspond to 8,897 “locative proteins” (Table 1): a protein that occurs in k locations is counted as k locative proteins. For the concept of locative proteins, readers are referred to [22], [26], [27], where the difference between “protein” and “locative protein” and their relationship are elaborated.
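The relationship between the two counts can be checked directly from the breakdown quoted in the Dataset subsection. The short sketch below only redoes that arithmetic; the variable and function names are illustrative and not part of the original work.

```python
# Number of proteins in the benchmark dataset, keyed by how many
# subcellular locations each protein occupies (figures quoted in the text).
proteins_by_location_count = {1: 6687, 2: 1029, 3: 48, 4: 2}


def count_proteins(breakdown):
    """Each protein is counted once, whatever its number of locations."""
    return sum(breakdown.values())


def count_locative_proteins(breakdown):
    """A protein occurring in k locations contributes k locative proteins."""
    return sum(k * n for k, n in breakdown.items())


n_proteins = count_proteins(proteins_by_location_count)
n_locative = count_locative_proteins(proteins_by_location_count)
print(n_proteins, n_locative)  # 7766 8897
```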

Feature Extraction

To develop a powerful method for statistically predicting protein subcellular localization, one of the most important steps is to extract core and essential features of protein samples that are closely correlated with their subcellular locations. To avoid losing important information hidden in protein sequences, the pseudo amino acid composition (PseAAC) was proposed [28], [29] to replace the simple amino acid composition (AAC) for representing the sample of a protein. For a brief introduction to Chou's PseAAC, please visit the Wikipedia web-page at http://en.wikipedia.org/wiki/Pseudo_amino_acid_composition. For a summary of its recent developments and applications, see a comprehensive review [30]. Ever since the concept of PseAAC was proposed by Chou [28] in 2001, it has rapidly penetrated into almost all the fields of protein attribute prediction, such as identifying bacterial virulent proteins [31], predicting homo-oligomeric proteins [32], predicting protein secondary structure content [33], predicting supersecondary structure [34], predicting protein structural classes [35], [36], predicting protein quaternary structure [37], predicting enzyme family and sub-family classes [38]–[40], predicting protein subcellular location [41]–[44], predicting subcellular localization of apoptosis proteins [45]–[48], predicting protein subnuclear location [49], predicting protein submitochondria locations [50]–[52], identifying cell wall lytic enzymes [53], identifying risk type of human papillomaviruses [54], identifying DNA-binding proteins [55], predicting G-Protein-Coupled Receptor Classes [56], [57], predicting protein folding rates [58], predicting outer membrane proteins [59], predicting cyclin proteins [60], predicting GABA(A) receptor proteins [61], identifying bacterial secreted proteins [62], identifying the cofactors of oxidoreductases [63], identifying lipase types [64], identifying protease family [65], predicting Golgi protein types [66], and classifying amino acids [67], among many others. According to a recent comprehensive review [25], Chou's PseAAC is generally formulated as

$$\mathbf{P} = [\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_u \;\; \cdots \;\; \psi_\Omega]^{\mathbf{T}} \tag{2}$$

where $\mathbf{T}$ denotes transposition, the subscript $\Omega$ is an integer, and its value as well as the components $\psi_u$ ($u = 1, 2, \ldots, \Omega$) depend on how the desired features are extracted from the amino acid sequence of P.

In the present study, we adopt the Gene Ontology and Dipeptide Composition feature extraction methods to generate features of protein samples; both are widely used in many existing protein subcellular localization systems [22]–[24], [26], [27], [68]–[74]. For the reader's convenience, a brief introduction to Gene Ontology and Dipeptide Composition is given below.

Gene Ontology.

The GO database [75] was established according to three aspects: molecular function, biological process, and cellular component. The following questions might be raised by those who are not familiar with GO (Gene Ontology). One of the three aspects of GO is ‘cellular component’ [75], which is just another name for subcellular location. If a protein already has GO annotation, why does one need to predict its subcellular location? Is it merely a procedure of converting the annotation into another format? Is it true that the high success rate obtained via the GO approach was due to a trivial utilization of the subcellular component annotations in the GO database? To really understand these questions, readers should carefully read the paper [14], particularly the analysis in the left column of page 155 of that paper [14]. For readers' convenience, it can be briefly summarized as follows: (i) Although the GO database is constructed based on protein function and cellular component, for those proteins with the ‘subcellular location unknown’ annotation in the Swiss-Prot database, most (more than 99%) of their corresponding GO numbers in the GO database are also annotated with ‘cellular component unknown’. (ii) Even for those proteins whose subcellular locations are clearly annotated in the Swiss-Prot database, their corresponding GO numbers in the GO database do not always directly indicate their subcellular locations. In some cases they are actually annotated with ‘cellular component unknown’. (iii) More importantly, it should be emphasized that during the course of prediction, only the GO numbers of a query protein, and not its GO annotations, were used, just as other predictors of protein subcellular location use only the sequence of a query protein and not its Swiss-Prot annotation.
(iv) Finally, as shown by the statistical analysis given in Table 6 of the paper [14], the percentage (45.02%) of proteins with GO annotations that indicate their subcellular components is even less than the percentage (51.76%) of proteins with known subcellular location annotation in the Swiss-Prot database. Accordingly, the high success rate obtained via the GO approach was by no means due to a trivial procedure of converting the annotation from one format into another, as is often misinterpreted. Furthermore, it can be seen from Table 6 of the paper [14] that there are a huge number of proteins with given accession numbers and corresponding GO numbers whose subcellular locations are still unknown. The essential reason why using the GO approach to represent protein samples can significantly improve the prediction quality is that proteins mapped into the GO database space are clustered in a way that better reflects their subcellular locations, which significantly enhances the success rate of prediction for those proteins that do not have significant sequence homology to proteins with known locations, as elaborated in [18], [76]. So far, there are two main approaches to extracting features from the GO database space. To incorporate more information, instead of using only 0 and 1 elements as done in [23], here we use the approach of [24], as described below.

Step 1.

Compression and reorganization of the existing GO numbers. The GO database (version 94 released on 08 April 2011) contains many GO numbers. However, these numbers are not consecutive. For easier handling, they are reorganized and renumbered. The GO database obtained through such a treatment is called the GO_compress database, which contains 18,844 numbers increasing successively from 1 to the last one.
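The renumbering in Step 1 can be sketched as follows. This is a minimal illustration, assuming that the GO identifiers are simply sorted and numbered consecutively from 1; the actual ordering used to build GO_compress is not stated in the text, and the function name `build_go_compress` is hypothetical.

```python
def build_go_compress(go_ids):
    """Map each distinct GO identifier to a consecutive integer starting at 1.

    Assumption: identifiers are ordered by sorting their string form.
    """
    ordered = sorted(set(go_ids))
    return {go_id: index for index, go_id in enumerate(ordered, start=1)}


# Toy input: a handful of GO identifiers, with one repeated.
example_ids = ["GO:0005622", "GO:0004866", "GO:0004869", "GO:0004866"]
go_compress = build_go_compress(example_ids)
print(go_compress)
```

Applied to the full GO release, such a mapping would have 18,844 entries, matching the size of the GO_compress database quoted above.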

Step 2.

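Using Eq.(2) with the dimension $\Omega = 18{,}844$ (the size of the GO_compress database), the protein P is represented as

$$\mathbf{P}_{\mathrm{GO}} = [\psi_1^{G} \;\; \psi_2^{G} \;\; \cdots \;\; \psi_u^{G} \;\; \cdots \;\; \psi_{18844}^{G}]^{\mathbf{T}} \tag{3}$$

where $\psi_u^{G}$ ($u = 1, 2, \ldots, 18{,}844$) are defined via the following steps.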

Step 3.

Use BLAST [77] to search the homologous proteins of the protein P from the Swiss-Prot database (version 55.3), with the expect value as the BLAST parameter.

Step 4.

Those proteins which have at least 60% pairwise sequence identity with the protein P are collected into a set, $\mathbb{S}_P^{\mathrm{homo}}$, called the “homology set” of P. All the elements in $\mathbb{S}_P^{\mathrm{homo}}$ are deemed the “representative proteins” of P, sharing some similar attributes such as structural conformations and biological functions [78]–[80]. Because they were retrieved from the Swiss-Prot database, these representative proteins must have their own accession numbers.

Step 5.

Search the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO number(s) [81] for each of the accession numbers collected in Step 4, and then convert the GO numbers thus obtained to their GO_compress numbers as described in Step 1. (Note that the relationships between the UniProtKB/Swiss-Prot protein entries and the GO numbers may be one-to-many, “reflecting the biological reality that a particular protein may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell” [75]. For example, the UniProtKB/Swiss-Prot protein entry “P01040” corresponds to three GO numbers, i.e., “GO:0004866”, “GO:0004869”, and “GO:0005622”.)

Step 6.

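The elements $\psi_u^{G}$ in Eq.(3) are given by

$$\psi_u^{G} = \frac{1}{N_P^{\mathrm{homo}}} \sum_{k=1}^{N_P^{\mathrm{homo}}} \delta(u, k), \qquad u = 1, 2, \ldots, 18{,}844 \tag{4}$$

where $N_P^{\mathrm{homo}}$ is the number of representative proteins in the homology set of P, and

$$\delta(u, k) = \begin{cases} 1, & \text{if the } k\text{-th representative protein hits the } u\text{-th GO\_compress number} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$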
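Steps 4 to 6 amount to computing, for every GO_compress number, the fraction of representative proteins that are annotated with it. The sketch below illustrates this under stated assumptions: the homology set is passed in as a list of sets of GO_compress numbers (one set per representative protein), the BLAST search and GO lookup are assumed to have been done elsewhere, and the function name and the toy dimension of 4 are hypothetical.

```python
def go_feature_vector(homolog_go_numbers, dimension):
    """Fraction of representative proteins hitting each GO_compress number.

    homolog_go_numbers: list of sets, one per representative protein,
        each holding that protein's GO_compress numbers (1-based).
    dimension: size of the GO_compress space (18,844 in the text).
    """
    n_homo = len(homolog_go_numbers)
    if n_homo == 0:
        # Empty homology set: the GO representation is a null vector.
        return [0.0] * dimension
    hits = [0] * dimension
    for go_numbers in homolog_go_numbers:
        for u in go_numbers:
            hits[u - 1] += 1  # one hit per protein per GO_compress number
    return [count / n_homo for count in hits]


# Toy homology set with four representative proteins in a 4-D GO space.
toy_homologs = [{1, 3}, {3}, {2, 3}, {3}]
go_features = go_feature_vector(toy_homologs, dimension=4)
print(go_features)  # [0.25, 0.25, 1.0, 0.0]
```

Unlike a 0/1 encoding, each component here carries how consistently the homologs share a GO number, which is the extra information the text refers to.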

Note that the GO feature vector may become a null vector or meaningless under either of the following situations: (1) the protein P does not have significant homology to any protein in the Swiss-Prot database, i.e., its homology set is empty; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.

Under such a situation, let us consider using the dipeptide composition method as backup to extract features for the protein P, as described below.

Dipeptide Composition.

Dipeptide composition (abbreviated as DC) represents the occurrence frequency of each pair of adjacent amino acid residues. It is used to describe the global information about each protein sequence in the form of a 420-dimensional (420-D) feature vector. An advantage of DC over amino acid composition is that it uses some sequence-order information. DC generates 420 components for each protein sequence: the first 20 components are the conventional amino acid composition (AAC), and the following 400 components are the fractions of the 400 dipeptides, i.e., AA, AC, AD, …, YV, YW, YY. The 400 dipeptide components are calculated using the following equation

$$f(\mathrm{dip}(i)) = \frac{\text{total number of } \mathrm{dip}(i)}{\text{total number of dipeptides in the sequence}} \tag{6}$$

where dip(i) is the i-th dipeptide of the 400 dipeptides, i = 1, 2, …, 400.
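A minimal sketch of the 420-D representation described above is given below. It assumes that the denominator of each dipeptide fraction is the number of adjacent residue pairs made of standard residues, and that non-standard residue codes are ignored; neither detail is specified in the text, and the function name is illustrative.

```python
from itertools import product

# The 20 standard residues, alphabetical by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# The 400 dipeptides in the order AA, AC, AD, ..., YV, YW, YY.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 420-D vector: 20 AAC components followed by 400 dipeptide fractions."""
    sequence = sequence.upper()
    residues = [r for r in sequence if r in AMINO_ACIDS]
    aac = [residues.count(a) / len(residues) if residues else 0.0
           for a in AMINO_ACIDS]

    pair_counts = dict.fromkeys(DIPEPTIDES, 0)
    for first, second in zip(sequence, sequence[1:]):
        pair = first + second
        if pair in pair_counts:  # skip pairs containing non-standard codes
            pair_counts[pair] += 1
    total_pairs = sum(pair_counts.values())
    dc = [pair_counts[d] / total_pairs if total_pairs else 0.0
          for d in DIPEPTIDES]
    return aac + dc


dc_vector = dipeptide_composition("ACAC")
print(len(dc_vector))                 # 420
print(dc_vector[0], dc_vector[1])     # AAC of A and C: 0.5 0.5
print(dc_vector[21], dc_vector[40])   # fractions of AC and CA
```

For the toy sequence "ACAC" the adjacent pairs are AC, CA and AC, so the AC and CA components are 2/3 and 1/3.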

Prediction Algorithm: Ensemble of Classifier Chains

To enhance the success rate, the powerful ECC (Ensemble of Classifier Chains) classifier [82] is adopted to perform prediction. Below, let us introduce the Ensemble of Classifier Chains classifier.

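Without loss of generality, let us consider a system or dataset that contains $N$ eukaryotic proteins classified into $M$ subcellular location sites. The dataset can be represented by the following matrix:

$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1M} \\ y_{21} & y_{22} & \cdots & y_{2M} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{NM} \end{bmatrix} \tag{7}$$

where $y_{ij} = 1$ ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, M$) if the $i$-th eukaryotic protein belongs to the $j$-th subcellular location site, and $y_{ij} = 0$ otherwise.

According to Eq.(7), if $\sum_{j=1}^{M} y_{ij} > 1$, the $i$-th eukaryotic protein is a multiplex protein, while if $\sum_{j=1}^{M} y_{ij} = 1$, it is a single-location protein. In this study, we deal with the case in which there is at least one eukaryotic protein with $\sum_{j=1}^{M} y_{ij} > 1$, that is to say, systems that contain both single-location and multiple-location eukaryotic proteins.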

Before introducing the Ensemble of Classifier Chains, we first present a simple method called binary relevance (BR) [83], which converts a multi-label learning problem into a number of independent binary classification problems. Taking the above system of $N$ proteins and $M$ subcellular location sites as an example, $M$ independent binary classifiers are constructed separately, one for each site, i.e.,

$$\mathbb{H} = \{h_1, h_2, \ldots, h_M\} \tag{8}$$

where $h_1$ is the prediction model for the 1st subcellular location site, $h_2$ for the 2nd, and so on. The positive ($D_j^{+}$) and negative ($D_j^{-}$) training samples for $h_j$ are collected according to the following formula:

$$D_j^{+} = \bigcup_{i:\, y_{ij} = 1} \{\mathbf{P}_i\}, \qquad D_j^{-} = \bigcup_{i:\, y_{ij} = 0} \{\mathbf{P}_i\} \tag{9}$$

where $y_{ij}$ represents the label information as shown in Eq.(7), $\mathbf{P}_i$ represents the $i$-th protein, so that $D_j^{+}$ collects the proteins that belong to the $j$-th subcellular location site, and $\bigcup$ is the union symbol in set theory.

For the prediction of a query protein $\mathbf{P}$, BR outputs the union of the class labels that are predicted by the classifiers, one per subcellular location site ($j = 1, 2, \ldots, M$):

$$\hat{Y} = \bigcup_{j=1}^{M} \{\, j : h_j(\mathbf{P}) = 1 \,\} \tag{10}$$

where $h_j(\mathbf{P})$ is the result predicted by the $j$-th classifier, with $h_j(\mathbf{P}) = 1$ meaning that the query protein belongs to the $j$-th subcellular location site and $h_j(\mathbf{P}) = 0$ meaning that it does not. Fig. 2 illustrates the complete process of the BR method.

BR is conceptually simple and easy to implement, but it may be less effective because it does not take label correlations into account. In the experiments below, we compare the proposed ECC method with the BR method in order to demonstrate the effectiveness of considering label correlations.
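The BR scheme can be sketched in a few lines. The example below is only an illustration: it swaps in a toy nearest-centroid classifier (the hypothetical `CentroidClassifier`) for the SVM base learner used in the paper, and the two-feature, three-label dataset is invented for the demonstration.

```python
class CentroidClassifier:
    """Toy binary classifier (stand-in for an SVM): picks the nearer class centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, target in zip(X, y) if target == label]
            if rows:
                self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))


def train_binary_relevance(X, Y):
    """Train one independent binary classifier per label (column of Y)."""
    n_labels = len(Y[0])
    return [CentroidClassifier().fit(X, [row[j] for row in Y]) for j in range(n_labels)]


def predict_binary_relevance(models, x):
    """Union of the labels whose classifier answers 1."""
    return {j for j, model in enumerate(models) if model.predict_one(x) == 1}


# Invented toy data: rows of Y are label vectors; the last two samples are multiplex.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [5.0, 0.0], [4.9, 0.2]]
Y = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 1], [1, 0, 1]]

br_models = train_binary_relevance(X, Y)
print(predict_binary_relevance(br_models, [4.8, 0.1]))  # {0, 2}
print(predict_binary_relevance(br_models, [5.0, 5.0]))  # {1}
```

Each classifier sees only the original features, so nothing learned about one label can inform the decision about another.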

We now introduce the ECC algorithm, proposed by Read et al. [82], which aggregates multiple classifier chains (CCs). CC is the core of the ECC algorithm; it is based on the framework of BR and, as in BR, consists of $M$ binary classifiers, one per subcellular location site. However, in contrast to BR, the classifiers are linked along a chain in which each classifier is responsible for predicting the presence or absence of one class label. The feature space of each classifier in the chain is extended with the 0/1 class label associations of all previous classifiers. In other words, assuming that the classifier chain $h_{\pi(1)} \to h_{\pi(2)} \to \cdots \to h_{\pi(M)}$ is constructed, where $\pi$ is a random permutation of $\{1, 2, \ldots, M\}$, each classifier $h_{\pi(j)}$ in the chain is responsible for predicting the binary association of class label $\pi(j)$ given the feature space augmented by all prior binary relevance associations in the chain, $y_{\pi(1)}, y_{\pi(2)}, \ldots, y_{\pi(j-1)}$. An intuitive illustration is provided in Fig. 3.

The chaining method passes label information between classifiers, allowing CC to take into account label correlations and thus overcoming the label independence problem of BR method. However, the order of the chain itself clearly has an effect on accuracy. In [82], the issue is solved by using an ensemble framework with a random chain ordering for each iteration.
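The chaining idea can be illustrated with the sketch below. As assumptions for the sake of a short, runnable example, a toy nearest-centroid classifier (the hypothetical `CentroidClassifier`) replaces the SVM base learner, the chain order is fixed rather than drawn at random, and the dataset is invented. During training each classifier is given the true labels of the preceding positions in the chain; at prediction time it is given the labels predicted so far.

```python
class CentroidClassifier:
    """Toy binary classifier (stand-in for an SVM): picks the nearer class centroid."""

    def fit(self, X, y):
        self.centroids = {}
        for label in (0, 1):
            rows = [x for x, target in zip(X, y) if target == label]
            if rows:
                self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict_one(self, x):
        def squared_distance(centroid):
            return sum((a - b) ** 2 for a, b in zip(x, centroid))
        return min(self.centroids, key=lambda label: squared_distance(self.centroids[label]))


def train_chain(X, Y, order):
    """Train one classifier per label, following the given chain order."""
    models = []
    for position, label in enumerate(order):
        prior = order[:position]
        # Augment the features with the true labels of all earlier chain positions.
        X_aug = [list(x) + [row[p] for p in prior] for x, row in zip(X, Y)]
        models.append(CentroidClassifier().fit(X_aug, [row[label] for row in Y]))
    return models


def predict_chain(models, order, x):
    """Predict labels one by one, feeding earlier predictions to later classifiers."""
    prior_bits, labels = [], set()
    for model, label in zip(models, order):
        bit = model.predict_one(list(x) + prior_bits)
        prior_bits.append(bit)
        if bit == 1:
            labels.add(label)
    return labels


# Invented toy data: the last two samples carry labels 0 and 2 together.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [5.0, 0.0], [4.9, 0.2]]
Y = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [1, 0, 1], [1, 0, 1]]
chain_order = [2, 0, 1]  # fixed here; ECC would draw a random order per chain

cc_models = train_chain(X, Y, chain_order)
print(predict_chain(cc_models, chain_order, [4.8, 0.1]))  # {0, 2}
print(predict_chain(cc_models, chain_order, [5.0, 5.0]))  # {1}
```

In this toy chain the classifier for label 0 receives the decision about label 2 as an extra feature, which is how a correlation such as "label 2 implies label 0" can be exploited.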

In contrast to traditional single-label ensemble learning, ECC is an ensemble of multi-label methods, i.e., of CC classifiers. Following the typical strategy of ensemble learning, ECC has two steps: the first is to train $m$ CC classifiers $C_1, C_2, \ldots, C_m$, and the second is to combine their predictions. In the first step, each $C_k$ is trained with both a random chain ordering and a random subset of the original training dataset. In the second step, the multi-label predictions of the classifiers are summed by label so that each label receives a number of votes, and a threshold is then used to select the most likely labels, which form the final multi-label prediction. Specifically, each classifier $C_k$ predicts a vector $\mathbf{y}_k = (l_{k1}, l_{k2}, \ldots, l_{kM}) \in \{0, 1\}^M$, where $M$ is the number of subcellular location sites. The sums are stored in a vector $\mathbf{W} = (\lambda_1, \lambda_2, \ldots, \lambda_M)$ such that $\lambda_j = \sum_{k=1}^{m} l_{kj}$. Hence each $\lambda_j$ represents the sum of the votes for the $j$-th label. We then normalize $\mathbf{W}$ to $\mathbf{W}^{\mathrm{norm}}$, which represents a distribution of scores for each label in [0, 1]. A threshold is used to choose the final multi-label set $Y$ such that class label $j \in Y$ if $\lambda_j^{\mathrm{norm}} \ge t$ for threshold $t$. Here the threshold $t$ is simply set to a fixed value. Hence the relevant labels in $Y$ represent the final multi-label prediction.
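The second step, combining the chains' outputs, can be sketched independently of how the chains were trained. In the example below the 0/1 vectors predicted by four chains are made up, the votes are normalized by dividing by the number of chains, and the threshold is set to 0.5; the normalization rule and the threshold value are illustrative assumptions, not values taken from the text.

```python
def combine_chain_votes(chain_predictions, threshold):
    """Sum 0/1 predictions per label, normalize to [0, 1], and apply a threshold.

    chain_predictions: list of 0/1 vectors, one per classifier chain.
    Returns the vote sums, the normalized scores, and the selected label indices.
    """
    n_chains = len(chain_predictions)
    n_labels = len(chain_predictions[0])
    votes = [sum(pred[j] for pred in chain_predictions) for j in range(n_labels)]
    # Assumed normalization: fraction of chains voting for each label.
    scores = [v / n_chains for v in votes]
    selected = {j for j, score in enumerate(scores) if score >= threshold}
    return votes, scores, selected


# Made-up outputs of four chains over four labels.
toy_predictions = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
]
votes, scores, selected = combine_chain_votes(toy_predictions, threshold=0.5)
print(votes)     # [3, 1, 3, 0]
print(scores)    # [0.75, 0.25, 0.75, 0.0]
print(selected)  # {0, 2}
```

Because the final set is obtained by thresholding per-label scores, the ensemble can return one label or several, which is what allows singleplex and multiplex proteins to be handled by the same predictor.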

The support vector machine (SVM) [84] is a powerful binary classifier in the field of machine learning and pattern recognition. The basic idea behind SVM is to map the input vectors into a high-dimensional feature space and then find an Optimal Separating Hyperplane (OSH) that maximizes the margin, i.e., the distance between the hyperplane and the nearest data points of each class in the mapped feature space. SVM classifiers have been widely and successfully used in the prediction of protein subcellular localization [3]–[5], [8]–[11]. In this study, we also use SVM as the base classifier in both BR and ECC. The software package used to train the SVMs, with default parameters, is the very efficient LIBLINEAR library [85], which is specially designed for large-scale and high-dimensional datasets such as the benchmark eukaryotic protein dataset of the current study.

The entire predictor thus established is called Euk-ECC-mPLoc, which can predict the subcellular localization of both singleplex and multiplex eukaryotic proteins. To provide an intuitive picture, a flowchart is provided in Fig. 4 to illustrate the prediction process of Euk-ECC-mPLoc.

Results and Discussion

In statistical prediction, it is necessary to evaluate the quality of different prediction methods. Three methods are often used for evaluating the power of a statistical prediction method: the independent dataset test, the K-fold cross-validation test, and the jackknife test. Of the three, the jackknife test is deemed the most objective because it always generates a unique result for a given benchmark dataset, as elucidated in a comprehensive review [18]. Therefore, the jackknife test has been increasingly and widely employed by researchers to examine the accuracy of various prediction methods (see, e.g., [23], [24], [26], [86]–[88]). Accordingly, in the present study, we use the jackknife test to evaluate the power of Euk-ECC-mPLoc.
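The jackknife (leave-one-out) procedure can be sketched generically: each sample is held out once, the predictor is trained on all the others, and the held-out sample is predicted. In the sketch below the predictor is a toy nearest-neighbour rule chosen only to keep the example self-contained; the function names, the one-dimensional samples and the location codes are all illustrative assumptions.

```python
def jackknife_predictions(samples, labels, train, predict):
    """Hold out each sample in turn, train on the rest, predict the held-out one."""
    outputs = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        outputs.append(predict(model, samples[i]))
    return outputs


def train_nearest_neighbour(train_x, train_y):
    """Toy 'training': simply memorize the (sample, label set) pairs."""
    return list(zip(train_x, train_y))


def predict_nearest_neighbour(model, x):
    """Return the label set of the closest memorized sample."""
    nearest = min(model, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
    return nearest[1]


# Invented one-feature samples with invented location codes.
samples = [[0.0], [0.1], [5.0], [5.1], [9.0]]
labels = [{1}, {1}, {2, 3}, {2, 3}, {1}]

jackknife_outputs = jackknife_predictions(
    samples, labels, train_nearest_neighbour, predict_nearest_neighbour
)
print(jackknife_outputs)  # the last sample is mispredicted as {2, 3}
```

No random partition of the data is involved, which is why a deterministic predictor yields a unique jackknife result for a given benchmark dataset.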

For such a stringent and complicated dataset, containing both single-location and multiple-location eukaryotic proteins distributed among 22 subcellular location sites, so far only three existing predictors, i.e., Euk-mPLoc [22], Euk-mPLoc 2.0 [23] and iLoc-Euk [24], have been able to deal with it. It has been reported in [23] that Euk-mPLoc 2.0, which is an updated version of Euk-mPLoc, significantly outperforms Euk-mPLoc. Moreover, as can be seen from [24], the overall jackknife success rate achieved by iLoc-Euk was about 15% higher than that achieved by Euk-mPLoc 2.0 when tested on the same benchmark dataset. As a result, iLoc-Euk is currently the best of the three. Therefore, to demonstrate the power of the proposed predictor, it suffices to compare Euk-ECC-mPLoc with iLoc-Euk.

Table 2 reports the detailed results on the 22 eukaryotic subcellular locations obtained with iLoc-Euk and Euk-ECC-mPLoc on the aforementioned benchmark dataset by the jackknife test. For a fair algorithmic comparison between Euk-ECC-mPLoc and iLoc-Euk, the same GOA database version described in this study is used to extract GO features for both predictors. As can be seen from Table 2, for such a stringent and complicated dataset, the average jackknife success rate achieved by Euk-ECC-mPLoc is 69.70%, which is about 19% higher than that achieved by iLoc-Euk [24]. Euk-ECC-mPLoc achieves very satisfactory performance on most subcellular locations, whereas iLoc-Euk performs very poorly on some subcellular locations, e.g., “acrosome”, “endosome”, “hydrogenosome”, “melanosome” and “microsome”. This indicates that Euk-ECC-mPLoc is more balanced than iLoc-Euk. Meanwhile, Euk-ECC-mPLoc obtains an overall jackknife success rate of 81.54%, an improvement of about 3% over iLoc-Euk. For the benchmark dataset containing both singleplex and multiplex eukaryotic proteins, the prediction accuracy for a location is mainly influenced by the multiplex characteristics of the proteins in that location. Roughly speaking, the larger the ratio of multiplex proteins in a location, the lower the success rate obtained. For example, about 32% and 60% of the proteins in the “melanosome” and “synapse” locations, respectively, belong to two or more locations, and iLoc-Euk obtains success rates of only 2.13% and 38.30% there. Euk-ECC-mPLoc, however, achieves success rates of 53.19% and 46.81% in the two locations, respectively, an improvement of about 51 percentage points in the “melanosome” location and over 8 percentage points in the “synapse” location.
The main reason is that correlations between different subcellular location sites are taken into account in the proposed Euk-ECC-mPLoc, whereas iLoc-Euk only transforms the problem of predicting multiplex eukaryotic protein subcellular locations into a number of problems of predicting singleplex eukaryotic protein subcellular localization. iLoc-Euk thus loses much important information related to multi-label learning, e.g., the correlations between different subcellular locations that are utilized in Euk-ECC-mPLoc. Therefore, Euk-ECC-mPLoc performs better than iLoc-Euk in predicting multiplex proteins. Moreover, for some subcellular locations with a smaller number of training proteins, the success rates achieved by Euk-ECC-mPLoc are higher than those achieved by iLoc-Euk. For example, the success rate of Euk-ECC-mPLoc in “hydrogenosome” is 90% higher than that of iLoc-Euk, and the success rate of Euk-ECC-mPLoc in “acrosome” is about 64% higher than that of iLoc-Euk. This may be due to the inherent advantage of the SVM base classifier used in Euk-ECC-mPLoc.

Table 2. A comparison of the jackknife success rates by iLoc-Euk [24] and the proposed Euk-ECC-mPLoc on the benchmark dataset that covers 22 location sites of eukaryotic proteins, in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same location.

https://doi.org/10.1371/journal.pone.0036317.t002

Table 3 reports the “exact match” success rate between predicted outputs and real annotations on the same benchmark dataset by the jackknife test. “Exact match” means that both the number and the identities of the predicted subcellular locations of a query protein are the same as the real observations. For a protein belonging to three subcellular locations, if only two of the three are correctly predicted, or if the predicted result contains a location not belonging to the three, the prediction is scored as 0. In other words, a prediction is scored as 1 if and only if all the subcellular locations of a query protein are predicted exactly, without any underprediction or overprediction. The success rates of a random predictor are also shown. Because iLoc-Euk did not provide the accuracy specific to each subset in terms of the number of subcellular locations, the corresponding values are shown as “-”. As can be seen from Table 3, the overall “exact match” success rate achieved by Euk-ECC-mPLoc is 72.59%, which is slightly higher than 71.27%, the corresponding “exact match” success rate achieved by iLoc-Euk [24]. The “exact match” accuracy of Euk-ECC-mPLoc is significantly superior to that of the random predictor. Therefore, our approach is quite promising for handling multiplex proteins, or at least plays a complementary role to the existing predictors in identifying the subcellular locations of eukaryotic proteins.
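The “exact match” scoring rule described above can be written down in a few lines. The sketch below uses invented predictions and annotations, expressed with the location codes of Table 1, purely to illustrate how underprediction and overprediction are both scored as 0; the function name is illustrative.

```python
def exact_match_rate(predicted, observed):
    """Fraction of proteins whose predicted location set equals the observed set."""
    hits = sum(1 for p, o in zip(predicted, observed) if set(p) == set(o))
    return hits / len(observed)


# Invented examples (codes: 2 cell membrane, 7 cytoplasm, 11 extracellular,
# 12 Golgi apparatus, 17 mitochondrion, 18 nucleus).
observed_sets = [{7, 18}, {2}, {7, 17, 18}, {11}]
predicted_sets = [
    {7, 18},   # exact match, scored 1
    {2},       # exact match, scored 1
    {7, 18},   # underprediction (17 missed), scored 0
    {11, 12},  # overprediction (12 added), scored 0
]

toy_exact_match = exact_match_rate(predicted_sets, observed_sets)
print(toy_exact_match)  # 0.5
```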

Table 3. A comparison of the jackknife “exact match” success rates by iLoc-Euk [24] and the proposed Euk-ECC-mPLoc on the benchmark dataset that covers 22 location sites of eukaryotic proteins, in which none of the proteins included has ≥25% pairwise sequence identity to any other in the same location.

https://doi.org/10.1371/journal.pone.0036317.t003

To help readers grasp the advantage of our approach over other existing predictors more easily and intuitively, several typical proteins that are localized in multiple subcellular locations were selected from DBMLoc [89], a database of proteins with multiple subcellular localizations, and submitted to the Euk-ECC-mPLoc and iLoc-Euk online web servers, respectively. The results are listed in Table 4, which gives the outputs predicted by the two predictors together with the corresponding experimental annotations. As can be seen from Table 4, the subcellular locations predicted by our approach are all identical to the corresponding true annotations, whereas iLoc-Euk fails to give fully accurate results.

Table 4. The predicted outputs by iLoc-Euk and Euk-ECC-mPLoc as well as the corresponding experimental annotations from DBMLoc [89].

https://doi.org/10.1371/journal.pone.0036317.t004

Conclusion

Prediction of protein subcellular localization is a challenging problem, particularly when the system concerned contains both singleplex and multiplex proteins. In this paper, we have proposed a novel multi-label predictor, called Euk-ECC-mPLoc, for predicting eukaryotic protein subcellular locations. It is based on the ECC algorithm and a hybrid of GO and DC feature extraction methods, and has been shown to be effective for dealing with both singleplex and multiplex proteins. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors [90], we have provided a web-server for the method presented in this paper at http://levis.tongji.edu.cn:8080/bioinfo/Euk-ECC-mPLoc/. The current approach represents a new strategy for dealing with multi-label biological problems, and hence may become a useful tool in the areas of bioinformatics and proteomics.

Acknowledgments

The authors wish to thank the anonymous reviewers for their constructive comments, which were very helpful for strengthening the presentation of this paper.

Author Contributions

Conceived and designed the experiments: G-ZL XW. Performed the experiments: XW. Analyzed the data: XW G-ZL. Contributed reagents/materials/analysis tools: XW. Wrote the paper: XW.

References

  1. Reinhardt A, Hubbard T (1998) Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research 26: 2230–2236.
  2. Chou KC, Elrod DW (1999) Protein subcellular location prediction. Protein Engineering 12: 107–118.
  3. Hua S, Sun Z (2001) Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17: 721–728.
  4. Chou KC, Cai YD (2002) Using functional domain composition and support vector machines for prediction of protein subcellular location. Journal of Biological Chemistry 277: 45765–45769.
  5. Park KJ, Kanehisa M (2003) Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs. Bioinformatics 19: 1656–1663.
  6. Huang Y, Li Y (2004) Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics 20: 21–28.
  7. Lu Z, Szafron D, Greiner R, Lu P, Wishart D, et al. (2004) Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 20: 547–556.
  8. Yu CS, Lin CJ, Hwang JK (2004) Predicting subcellular localization of proteins for gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Science 13: 1402–1406.
  9. Bhasin M, Raghava GPS (2004) ESLpred: SVM-based method for subcellular localization of eukaryotic proteins using dipeptide composition and PSI-BLAST. Nucleic Acids Research 32: W414–W419.
  10. Wang J, Sung WK, Krishnan A, Li KB (2005) Protein subcellular localization prediction for gram-negative bacteria using amino acid subalphabets and a combination of multiple support vector machines. BMC Bioinformatics 6: 174.
  11. Garg A, Bhasin M, Raghava GPS (2005) Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. Journal of Biological Chemistry 280: 14427–14432.
  12. Chou KC, Shen HB (2006) Predicting eukaryotic protein subcellular location by fusing optimized Evidence-Theoretic K-Nearest neighbor classifiers. Journal of Proteome Research 5: 1888–1897.
  13. Pierleoni A, Martelli PL, Fariselli P, Casadio R (2006) BaCelLo: a balanced subcellular localization predictor. Bioinformatics 22: e408–e416.
  14. Chou KC, Shen HB (2006) Hum-PLoc: a novel ensemble classifier for predicting human protein subcellular localization. Biochemical and Biophysical Research Communications 347: 150–157.
  15. Shen HB, Yang J, Chou KC (2007) Euk-PLoc: an ensemble classifier for large-scale eukaryotic protein subcellular location prediction. Amino Acids 33: 57–67.
  16. Niu B, Jin YH, Feng KY, Lu WC, Cai YD, et al. (2008) Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins. Molecular Diversity 12: 41–45.
  17. Nakai K (2000) Protein sorting signals and prediction of subcellular localization. Advances in Protein Chemistry 54: 277–344.
  18. Chou KC, Shen HB (2007) Recent progress in protein subcellular location prediction. Analytical Biochemistry 370: 1–16.
  19. Glory E, Murphy RF (2007) Automated subcellular location determination and High-Throughput microscopy. Developmental Cell 12: 7–16.
  20. Smith C (2008) Subcellular targeting of proteins and drugs. URL http://www.biocompare.com/Articles/TechnologySpotlight/976/Subcellular-Target-ing-Of-Proteins-An.
  21. Millar AH, Carrie C, Pogson B, Whelan J (2009) Exploring the Function-Location nexus: Using multiple lines of evidence in defining the subcellular location of plant proteins. The Plant Cell Online 21: 1625–1631.
  22. Chou KC, Shen HB (2007) Euk-mPLoc: a fusion classifier for Large-Scale eukaryotic protein subcellular location prediction by incorporating multiple sites. Journal of Proteome Research 6: 1728–1734.
  23. Chou KC, Shen HB (2010) A new method for predicting the subcellular localization of eukaryotic proteins with both single and multiple sites: Euk-mPLoc 2.0. PLoS ONE 5: e9931.
  24. Chou KC, Wu ZC, Xiao X (2011) iLoc-Euk: a Multi-Label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS ONE 6: e18258.
  25. Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology 273: 236–247.
  26. Shen HB, Chou KC (2007) Hum-mPLoc: an ensemble classifier for large-scale human protein subcellular location prediction by incorporating samples with multiple sites. Biochemical and Biophysical Research Communications 355: 1006–1011.
  27. Shen HB, Chou KC (2010) Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites. Journal of Biomolecular Structure & Dynamics 28: 175–186.
  28. Chou KC (2001) Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics 43: 246–255.
  29. Chou KC (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21: 10–19.
  30. Chou KC (2009) Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics 6: 262–274.
  31. Nanni L, Lumini A, Gupta D, Garg A (2011) Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou's pseudo amino acid composition and on evolutionary information. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
  32. Qiu JD, Suo SB, Sun XY, Shi SP, Liang RP (2011) OligoPred: a web-server for predicting homo-oligomeric proteins by incorporating discrete wavelet transform into Chou's pseudo amino acid composition. Journal of Molecular Graphics and Modelling 30: 129–134.
  33. Chen C, Chen L, Zou X, Cai P (2009) Prediction of protein secondary structure content by using the concept of Chou's pseudo amino acid composition and support vector machine. Protein and Peptide Letters 16: 27–31.
  34. Zou D, He Z, He J, Xia Y (2011) Supersecondary structure prediction using Chou's pseudo amino acid composition. Journal of Computational Chemistry 32: 271–278.
  35. Li ZC, Zhou XB, Dai Z, Zou XY (2009) Prediction of protein structural classes by Chou's pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis. Amino Acids 37: 415–425.
  36. Sahu SS, Panda G (2010) A novel feature representation method based on Chou's pseudo amino acid composition for protein structural class prediction. Computational Biology and Chemistry 34: 320–327.
  37. Zhang SW, Chen W, Yang F, Pan Q (2008) Using Chou's pseudo amino acid composition to predict protein quaternary structure: a sequence-segmented PseAAC approach. Amino Acids 35: 591–598.
  38. Qiu JD, Huang JH, Shi SP, Liang RP (2010) Using the concept of Chou's pseudo amino acid composition to predict enzyme family classes: An approach with support vector machine based on discrete wavelet transform. Protein and Peptide Letters 17: 715–722.
  39. Zhou XB, Chen C, Li ZC, Zou XY (2007) Using Chou's amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology 248: 546–551.
  40. Wang YC, Wang XB, Yang ZX, Deng NY (2010) Prediction of enzyme subfamily class via pseudo amino acid composition by incorporating the conjoint triad feature. Protein and Peptide Letters 17: 1441–1449.
  41. Li FM, Li QZ (2008) Predicting protein subcellular location using Chou's pseudo amino acid composition and improved hybrid approach. Protein and Peptide Letters 15: 612–616.
  42. Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q (2008) Using the concept of Chou's pseudo amino acid composition to predict protein subcellular localization: an approach by incorporating evolutionary information and von Neumann entropies. Amino Acids 34: 565–572.
  43. Lin J, Wang Y (2011) Using a novel AdaBoost algorithm and Chou's pseudo amino acid composition for predicting protein subcellular localization. Protein and Peptide Letters 18: 1219–1225.
  44. Lin J, Wang Y, Xu X (2011) A novel ensemble and composite approach for classifying proteins based on Chou's pseudo amino acid composition. African Journal of Biotechnology 10: 16963–16968.
  45. Ding YS, Zhang TL (2008) Using Chou's pseudo amino acid composition to predict subcellular localization of apoptosis proteins: An approach with immune genetic algorithm-based ensemble classifier. Pattern Recognition Letters 29: 1887–1892.
  46. Lin H, Wang H, Ding H, Chen YL, Li QZ (2009) Prediction of subcellular localization of apoptosis protein using Chou's pseudo amino acid composition. Acta Biotheoretica 57: 321–330.
  47. Jian X, Wei R, Zhan T, Gu Q (2008) Using the concept of Chou's pseudo amino acid composition to predict apoptosis proteins subcellular location: An approach by approximate entropy. Protein and Peptide Letters 15: 392–396.
  48. Kandaswamy KK, Pugalenthi G, Moller S, Hartmann E, Kalies KU, et al. (2010) Prediction of apoptosis protein locations with genetic algorithms and support vector machines through a new mode of pseudo amino acid composition. Protein and Peptide Letters 17: 1473–1479.
  49. Jiang X, Wei R, Zhao Y, Zhang T (2008) Using Chou's pseudo amino acid composition based on approximate entropy and an ensemble of AdaBoost classifiers to predict protein subnuclear location. Amino Acids 34: 669–675.
  50. Lin H, Ding H, Guo FB, Zhang AY, Huang J (2008) Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition. Protein and Peptide Letters 15: 739–744.
  51. Zeng Yh, Guo Yz, Xiao Rq, Yang L, Yu Lz, et al. (2009) Using the augmented Chou's pseudo amino acid composition for predicting protein submitochondria locations based on auto covariance approach. Journal of Theoretical Biology 259: 366–372.
  52. Nanni L, Lumini A (2008) Genetic programming for creating Chou's pseudo amino acid based features for submitochondria localization. Amino Acids 34: 653–660.
  53. Ding H, Luo L, Lin H (2009) Prediction of cell wall lytic enzymes using Chou's amphiphilic pseudo amino acid composition. Protein and Peptide Letters 16: 351–355.
  54. Esmaeili M, Mohabatkar H, Mohsenzadeh S (2010) Using the concept of Chou's pseudo amino acid composition for risk type prediction of human papillomaviruses. Journal of Theoretical Biology 263: 203–209.
  55. Fang Y, Guo Y, Feng Y, Li M (2008) Predicting DNA-binding proteins: approached from Chou's pseudo amino acid composition and other specific sequence features. Amino Acids 34: 103–109.
  56. Gu Q, Ding YS, Zhang TL (2010) Prediction of G-Protein-Coupled receptor classes in low homology using Chou's pseudo amino acid composition with approximate entropy and hydrophobicity patterns. Protein and Peptide Letters 17: 559–567.
  57. Qiu JD, Huang JH, Liang RP, Lu XQ (2009) Prediction of G-protein-coupled receptor classes based on the concept of Chou's pseudo amino acid composition: An approach from discrete wavelet transform. Analytical Biochemistry 390: 68–73.
  58. Guo J, Rao N, Liu G, Yang Y, Wang G (2011) Predicting protein folding rates using the concept of Chou's pseudo amino acid composition. Journal of Computational Chemistry 32: 1612–1617.
  59. Hao L (2008) The modified Mahalanobis discriminant for predicting outer membrane proteins by using Chou's pseudo amino acid composition. Journal of Theoretical Biology 252: 350–356.
  60. Mohabatkar H (2010) Prediction of cyclin proteins using Chou's pseudo amino acid composition. Protein and Peptide Letters 17: 1207–1214.
  61. Mohabatkar H, Beigi MM, Esmaeili A (2011) Prediction of GABAA receptor proteins using the concept of Chou's pseudo-amino acid composition and support vector machine. Journal of Theoretical Biology 281: 18–23.
  62. Yu L, Guo Y, Li Y, Li G, Li M, et al. (2010) SecretP: identifying bacterial secreted proteins by fusing new features into Chou's pseudo-amino acid composition. Journal of Theoretical Biology 267: 1–6.
  63. Zhang GY, Fang BS (2008) Predicting the cofactors of oxidoreductases based on amino acid composition distribution and Chou's amphiphilic pseudo-amino acid composition. Journal of Theoretical Biology 253: 310–315.
  64. Zhang GY, Li HC, Gao JQ, Fang BS (2008) Predicting lipase types by improved Chou's pseudo-amino acid composition. Protein and Peptide Letters 15: 1132–1137.
  65. Hu L, Zheng L, Wang Z, Li B, Liu L (2011) Using pseudo amino acid composition to predict protease families by incorporating a series of protein biological features. Protein and Peptide Letters 18: 552–558.
  66. Ding H, Liu L, Guo FB, Huang J, Lin H (2011) Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition. Protein and Peptide Letters 18: 58–63.
  67. Georgiou D, Karakasidis T, Nieto J, Torres A (2009) Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou's pseudo amino acid composition. Journal of Theoretical Biology 257: 17–26.
  68. Shen HB, Chou KC (2009) Gpos-mPLoc: a top-down approach to improve the quality of predicting subcellular localization of gram-positive bacterial proteins. Protein and Peptide Letters 16: 1478–1484.
  69. Shen HB, Chou KC (2009) A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0. Analytical Biochemistry 394: 269–274.
  70. Shen HB, Chou KC (2010) Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of gram-negative bacterial proteins. Journal of Theoretical Biology 264: 326–333.
  71. Chou KC, Shen HB (2010) Plant-mPLoc: a Top-Down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE 5: e11335.
  72. Khan A, Majid A, Hayat M (2011) CE-PLoc: an ensemble classifier for predicting protein subcellular locations by fusing different modes of pseudo amino acid composition. Computational Biology and Chemistry 35: 218–229.
  73. Xiao X, Wu ZC, Chou KC (2011) A Multi-Label classifier for predicting the subcellular localization of Gram-Negative bacterial proteins with both single and multiple sites. PLoS ONE 6: e20592.
  74. Xiao X, Wu ZC, Chou KC (2011) iLoc-Virus: a multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. Journal of Theoretical Biology 284: 42–51.
  75. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. Nature Genetics 25: 25–29.
  76. Chou KC, Shen HB (2008) Cell-PLoc: a package of web servers for predicting subcellular localization of proteins in various organisms. Nature Protocols 3: 153–162.
  77. Schäffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, et al. (2001) Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Research 29: 2994–3005.
  78. Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, et al. (2009) Protein function annotation by homology-based inference. Genome Biology 10: 207.
  79. Gerstein M, Honig B (2001) Sequences and topology. Current Opinion in Structural Biology 11: 327–329.
  80. Chou KC (2004) Structural bioinformatics and its impact to biomedical science. Current Medicinal Chemistry 11: 2105–2134.
  81. Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, et al. (2003) The gene ontology annotation (GOA) project: Implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Research 13: 662–672.
  82. Read J, Pfahringer B, Holmes G, Frank E (2009) Classifier chains for multi-label classification. pp. 254–269.
  83. Tsoumakas G, Katakis I, Vlahavas I (2010) Mining multi-label data. Data Mining and Knowledge Discovery Handbook. Boston, MA: Springer US. pp. 667–685.
  84. Cortes C, Vapnik V (1995) Support-vector networks. Machine Learning 20: 273–297.
  85. Fan R, Chang K, Hsieh C, Wang X, Lin C (2008) LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research 9: 1871–1874.
  86. Lin WZ, Fang JA, Xiao X, Chou KC (2011) iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PLoS ONE 6: e24756.
  87. Wang P, Xiao X, Chou KC (2011) NR-2L: a Two-Level predictor for identifying nuclear receptor subfamilies based on Sequence-Derived features. PLoS ONE 6: e23505.
  88. Xiao X, Wang P, Chou KC (2011) GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. Molecular BioSystems 7: 911–919.
  89. Zhang S, Xia X, Shen J, Zhou Y, Sun Z (2008) DBMLoc: a database of proteins with multiple subcellular localizations. BMC Bioinformatics 9: 127.
  90. Chou KC, Shen HB (2009) Recent advances in developing web-servers for predicting protein attributes. Natural Science 1: 63–92.