A New Method for Predicting the Subcellular Localization of Eukaryotic Proteins with Both Single and Multiple Sites: Euk-mPLoc 2.0

Kuo-Chen Chou; Hong-Bin Shen

doi:10.1371/journal.pone.0009931

Abstract

Information on the subcellular locations of proteins is important for in-depth studies of cell biology. It is also very useful for proteomics, systems biology and drug development. However, most existing methods for predicting protein subcellular location can cover only 5 to 12 location sites. They are also limited to single-location proteins and hence fail to work for multiplex proteins, which can simultaneously exist at, or move between, two or more location sites. Multiplex proteins of this kind usually possess important biological functions worthy of special notice. A new predictor called “Euk-mPLoc 2.0” was developed by hybridizing gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify eukaryotic proteins among the following 22 locations: (1) acrosome, (2) cell wall, (3) centriole, (4) chloroplast, (5) cyanelle, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracell, (11) Golgi apparatus, (12) hydrogenosome, (13) lysosome, (14) melanosome, (15) microsome, (16) mitochondria, (17) nucleus, (18) peroxisome, (19) plasma membrane, (20) plastid, (21) spindle pole body, and (22) vacuole. Compared with the existing methods for predicting eukaryotic protein subcellular localization, the new predictor is much more powerful and flexible, particularly in dealing with proteins with multiple locations and proteins without available accession numbers. For a newly constructed stringent benchmark dataset that contains both single- and multiple-location proteins, and in which no protein reaches the pairwise sequence identity cutoff with any other protein in the same location, the overall jackknife success rate achieved by Euk-mPLoc 2.0 is more than 24% higher than that of any existing predictor.
As a user-friendly web-server, Euk-mPLoc 2.0 is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/euk-multi-2/. For a query protein sequence of 400 amino acids, the web-server takes about 15 seconds to yield the predicted result; longer sequences usually take more time. It is anticipated that the novel approach and the powerful predictor presented in this paper will have a significant impact on Molecular Cell Biology, System Biology, Proteomics, Bioinformatics, and Drug Development.

Citation: Chou K-C, Shen H-B (2010) A New Method for Predicting the Subcellular Localization of Eukaryotic Proteins with Both Single and Multiple Sites: Euk-mPLoc 2.0. PLoS ONE 5(4): e9931. https://doi.org/10.1371/journal.pone.0009931

Editor: Darren P. Martin, Institute of Infectious Disease and Molecular Medicine, South Africa

Received: February 1, 2010; Accepted: March 8, 2010; Published: April 1, 2010

Copyright: © 2010 Chou, Shen. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was supported by the National Natural Science Foundation of China (Grant No. 60704047), Science and Technology Commission of Shanghai Municipality (Grant No. 08ZR1410600, 08JC1410600), sponsored by Shanghai Pujiang Program and Innovation Program of Shanghai Municipal Education Commission (10ZZ17). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

With the avalanche of protein sequences generated in the post-genomic era, numerous efforts have been made to develop methods for predicting protein subcellular localization based on sequence information (see, e.g., [1], [2], [3], [4], [5], [6], [7], [8] as well as a long list of references cited in two comprehensive review articles [9], [10]). However, far fewer efforts have been made to address those proteins which may simultaneously exist at, or move between, two or more different subcellular locations. Proteins with multiple locations or dynamic features of this kind are particularly interesting because they may have special biological functions worthy of notice [11], [12]. In particular, as pointed out by Millar et al. [13], recent evidence indicates that an increasing number of proteins have multiple locations in the cell.

About two years ago, a web-server predictor [14] was developed for dealing with eukaryotic systems that contain both single-location and multiple-location proteins. The predictor is called Euk-mPLoc, where “m” stands for “multiple”, meaning it can also be used to deal with multiplex proteins. The Euk-mPLoc predictor was established by hybridizing the “higher-level” GO (gene ontology [15]) approach and the PseAAC (pseudo amino acid composition [16], [17]) approach. Its power came mainly from the GO approach because proteins formulated in the GO database space are clustered in a manner that much better reflects the distribution of their subcellular locations, as elucidated in [18].

However, the existing version of Euk-mPLoc has the following shortcomings. (1) For the prediction engine to take advantage of the GO approach, the accession number of a query protein is required as part of the input; many proteins, such as synthetic and hypothetical proteins, or newly discovered sequences not yet deposited in databanks, do not have accession numbers and hence cannot be treated with the GO approach. (2) Even when accession numbers are available, the proteins cannot always be meaningfully formulated in a GO space because the current GO database is far from complete. (3) Although the PseAAC approach, a complement to the GO approach in Euk-mPLoc, can take into account some partial sequence order effects, the original PseAAC [16], [19] missed the functional domain and sequential evolution information that may considerably affect the prediction quality.

The present study was devoted to developing a new and more powerful predictor of eukaryotic protein subcellular localization by addressing the above three problems.

Materials and Methods

Protein sequences were collected from the Swiss-Prot database at http://www.ebi.ac.uk/swissprot/. The detailed procedures are basically the same as described in [14]; the only difference is that, in order to establish a more updated benchmark dataset, version 55.3 of the Swiss-Prot database released on 29-Apr-2008 was adopted instead of version 50.7 released on 9-Sept-2006. After strictly following the procedures described in [14], we finally obtained a benchmark dataset $\mathbb{S}$ containing 7,766 different protein sequences that are distributed among 22 subcellular locations (Fig. 1); i.e.,

$$\mathbb{S} = \mathbb{S}_1 \cup \mathbb{S}_2 \cup \mathbb{S}_3 \cup \cdots \cup \mathbb{S}_{22} \tag{1}$$

where $\mathbb{S}_1$ represents the subset for the subcellular location of “acrosome”, $\mathbb{S}_2$ for “cell membrane”, $\mathbb{S}_3$ for “cell wall”, and so forth; while $\cup$ represents the symbol for “union” in set theory. A breakdown of the 7,766 eukaryotic proteins in the benchmark dataset according to their 22 location sites is given in Table 1. To avoid redundancy and homology bias, none of the proteins in $\mathbb{S}$ reaches the pairwise sequence identity cutoff with any other protein in the same subset. The corresponding accession numbers and protein sequences are given in Online Supporting Information S1.

Figure 1. Illustration to show the 22 subcellular locations of eukaryotic proteins.

The 22 location sites are: (1) acrosome, (2) cell wall, (3) centriole, (4) chloroplast, (5) cyanelle, (6) cytoplasm, (7) cytoskeleton, (8) endoplasmic reticulum, (9) endosome, (10) extracell, (11) Golgi apparatus, (12) hydrogenosome, (13) lysosome, (14) melanosome, (15) microsome, (16) mitochondria, (17) nucleus, (18) peroxisome, (19) plasma membrane, (20) plastid, (21) spindle pole body, and (22) vacuole. Reprinted from [14] with permission.

https://doi.org/10.1371/journal.pone.0009931.g001

Table 1. Breakdown of the eukaryotic protein benchmark dataset derived from Swiss-Prot database (release 55.3) according to the procedures described in the Materials section.

https://doi.org/10.1371/journal.pone.0009931.t001

Because the system investigated now contains both single-location and multiple-location proteins, some of the proteins in the benchmark dataset $\mathbb{S}$ may occur in two or more location sites. Therefore, it is instructive to introduce the concept of the “virtual sample”, as illustrated below. A protein sample coexisting at two different location sites is counted as 2 virtual samples even though they have an identical sequence; if coexisting at three different sites, as 3 virtual samples; and so forth. Accordingly, the total number of different virtual protein samples is generally greater than the total number of different sequence samples. Their relationship can be formulated as follows

$$\tilde{N}(\text{vir}) = N(\text{seq}) + \sum_{m=1}^{M} (m-1)\, N(m) \tag{2}$$

where $\tilde{N}(\text{vir})$ is the number of total different virtual protein samples in $\mathbb{S}$, $N(\text{seq})$ the number of total different protein sequences, $N(1)$ the number of proteins with one location, $N(2)$ the number of proteins with two locations, and so forth; while $M$ is the number of total subcellular location sites (for the current case, $M = 22$ as shown in Fig. 1 and Table 1).

For the current 7,766 different protein sequences, 6,687 occur in one subcellular location, 1,029 in two locations, 48 in three locations, 2 in four locations, and none in five or more locations. Substituting these data into Eq. 2, we have

$$\tilde{N}(\text{vir}) = 7766 + (1-1) \times 6687 + (2-1) \times 1029 + (3-1) \times 48 + (4-1) \times 2 = 8897 \tag{3}$$

which is fully consistent with the figures in Table 1 and the data in Online Supporting Information S1.
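
The virtual-sample count described above can be checked with a few lines of Python. This is only a minimal sketch: the per-location-count figures are the ones quoted in the text, while the function and variable names are illustrative and do not come from the original work.

```python
# Number of proteins annotated with exactly m locations, as quoted in the text.
proteins_by_location_count = {1: 6687, 2: 1029, 3: 48, 4: 2}


def virtual_sample_count(counts):
    """Return N(seq) + sum over m of (m - 1) * N(m), i.e. the virtual-sample total."""
    n_seq = sum(counts.values())
    extra = sum((m - 1) * n for m, n in counts.items())
    return n_seq + extra


n_seq = sum(proteins_by_location_count.values())
n_vir = virtual_sample_count(proteins_by_location_count)
print(n_seq, n_vir)  # 7766 8897
```

Equivalently, the total equals the sum of m × N(m) over all m, because each protein contributes one virtual sample per location at which it occurs.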

As stated in a recent comprehensive review [20], one of the most important steps in developing a powerful method for statistically predicting protein subcellular localization is to formulate protein samples with core features that are intrinsically correlated with their localization in a cell. Since the concept of pseudo amino acid composition (PseAAC) was proposed [16], it has provided a very flexible mathematical framework for investigators to incorporate their desired information into the representation of protein samples. According to its original definition [16], a PseAAC is a set of discrete numbers that differs from the classical amino acid composition (AAC) and is derived from a protein sequence so as to harbor some of its sequence-order and pattern information, or to reflect some physicochemical and biochemical properties of the constituent amino acids. PseAAC has since been widely used to deal with many protein-related problems and sequence-related systems (see, e.g., [21], [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40], [41], [42] and a long list of PseAAC-related references cited in a recent review [20]). As summarized in [20], 16 different PseAAC modes have so far been used to represent protein samples for predicting their attributes. Each of these modes has its own advantages and disadvantages. In this study, we formulate the protein samples by hybridizing the following three modes of PseAAC.

1. GO (Gene Ontology) Representation Mode

The GO database [15] was established according to molecular function, biological process, and cellular component. Accordingly, protein samples defined in a GO database space would be clustered in a way that better reflects their subcellular locations [10], [18]. However, in the original Euk-mPLoc predictor [14], the GO representation of a protein sample was derived from the GO database [43] through its accession number. Thus, when using Euk-mPLoc to perform prediction, the accession number of a query protein was indispensable. To avoid this requirement, the following procedure is proposed to derive the GO representation mode.

Step 1.

Use BLAST [44] to search for homologous proteins of the query protein $\mathbf{P}$ in the Swiss-Prot database (version 55.3), using the expect value $E$ as the BLAST cutoff parameter.

Step 2.

Those proteins whose pairwise sequence identity with the query protein $\mathbf{P}$ meets the identity cutoff are collected into a set, $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$, called the “homology set” of $\mathbf{P}$. All the elements in $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ can be deemed the “representative proteins” of $\mathbf{P}$. Because they were retrieved from the Swiss-Prot database, each of these representative proteins must have its own accession number.

Step 3.

Search each of these accession numbers collected in Step 2 against the GO database at http://www.ebi.ac.uk/GOA/ to find the corresponding GO numbers [43].

Step 4.

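The current GO database (version 70.0 released 10 March 2008) contains 60,020 GO numbers; thus the query protein $\mathbf{P}$ can be expressed via its representative proteins in the homology set $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ by the following formulation

$$\mathbf{P}_{\text{GO}} = \left[\psi_1 \;\; \psi_2 \;\; \cdots \;\; \psi_i \;\; \cdots \;\; \psi_{60020}\right]^{\mathbf{T}} \tag{4}$$

where $\mathbf{T}$ is the transposing operator, and

$$\psi_i = \begin{cases} 1, & \text{if a hit is found against the } i\text{-th GO number for any of the representative proteins in } \mathbb{S}_{\mathbf{P}}^{\text{homo}} \\ 0, & \text{otherwise} \end{cases} \tag{5}$$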

Through the above steps, we can use the GO information derived from the representative proteins in the homology set $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ to formulate the query protein $\mathbf{P}$. The rationale is that homologous proteins generally share similar attributes, such as structural conformations and biological functions [45], [46], [47]. Thus, the accession number of the query protein is no longer an indispensable input, as it was in Euk-mPLoc [14], even when the high-level GO approach is used to predict its subcellular localization.
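
Steps 2 to 4 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the BLAST search and the GO database lookup are replaced by small in-memory dictionaries, each vector element is assumed to be a binary hit indicator, the accession names (`ACC_A`, `ACC_B`) are hypothetical, and the three-term GO index stands in for the full 60,020-dimensional space.

```python
from typing import Dict, Iterable, List, Optional, Set


def go_vector(
    homologs: Iterable[str],
    go_annotations: Dict[str, Set[str]],
    go_index: Dict[str, int],
) -> Optional[List[int]]:
    """Build a binary GO vector from the homology set of a query protein.

    Returns None when no representative protein yields a usable GO hit,
    in which case another representation mode has to be used instead.
    """
    vector = [0] * len(go_index)
    hit_found = False
    for accession in homologs:
        for go_id in go_annotations.get(accession, set()):
            position = go_index.get(go_id)
            if position is not None:
                vector[position] = 1  # assumed binary indicator for this GO number
                hit_found = True
    return vector if hit_found else None


# Toy stand-ins for the GO number index and the accession-to-GO lookup.
toy_index = {"GO:0005634": 0, "GO:0005737": 1, "GO:0005739": 2}
toy_annotations = {
    "ACC_A": {"GO:0005634"},
    "ACC_B": {"GO:0005634", "GO:0005737"},
}

print(go_vector(["ACC_A", "ACC_B"], toy_annotations, toy_index))  # [1, 1, 0]
print(go_vector([], toy_annotations, toy_index))  # None
```

Returning `None` for an empty homology set, or for homologs without any indexed GO number, makes the cases where the GO mode is undefined explicit to the caller.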

The above homology-based GO extraction method is particularly useful for studying proteins that do not have UniProt accession numbers. However, it would still fail to work in either of the following situations: (1) the query protein $\mathbf{P}$ has no significant homology to any protein in the Swiss-Prot database, i.e., the homology set $\mathbb{S}_{\mathbf{P}}^{\text{homo}}$ is empty; (2) its representative proteins do not contain any useful GO information for statistical prediction based on a given training dataset.

Therefore, it is necessary to consider the following representation modes for those proteins which fail to be meaningfully defined in the GO space.

2. FunD (Functional Domain) Representation Mode

The FunD is the core of a protein and plays the major role in its function. That is why, in determining the 3-D (three-dimensional) structure of a protein by experiments (see, e.g., [48], [49]) or by computational modeling (see, e.g., [47], [50]), the first priority has always been its FunD. Using FunD information to formulate protein samples for statistical prediction was originally proposed in [51], [52], and quite encouraging results were achieved. At that time, the 2,005 FunDs in the SBASE-A database [53] were used as bases to formulate the protein samples. Since then, a series of follow-up protein FunD databases have been established, such as COG [54], KOG [54], SMART [55], Pfam [56], and CDD [57]. Of these databases, CDD contains the domains imported from COG, Pfam and SMART, and hence is considerably more complete [57]. Version 2.11 of CDD contains 17,402 characteristic domains. Using each of these domains as a base vector, we can define a FunD space with 17,402 dimensions. Thus, by following procedures similar to those in [51], a protein sample can be uniquely defined through the steps described below: