Subcellular localization of a protein is important for understanding its functions and interactions. Many computational techniques exist for predicting protein subcellular locations, but many prediction tasks have been shown to suffer from a shortage of training data. This paper introduces a new method to mine proteins with non-experimental annotations, which are labeled on the basis of non-experimental evidence in protein databases, to overcome this shortage. A novel active sample selection strategy, which takes advantage of active learning technology, is designed to find useful samples in the entire pool of candidate proteins with non-experimental annotations. This approach estimates the “value” of each sample, automatically selects the most valuable samples, and adds them to the original training set to help retrain the classifiers. Numerical experiments with four popular multi-label classifiers on three benchmark datasets show that the proposed method can effectively select valuable samples to supplement the original training set and significantly improve the performance of the classifiers.
Citation: Cao J, Liu W, He J, Gu H (2013) Mining Proteins with Non-Experimental Annotations Based on an Active Sample Selection Strategy for Predicting Protein Subcellular Localization. PLoS ONE 8(6): e67343. https://doi.org/10.1371/journal.pone.0067343
Editor: Marinus F.W. te Pas, Wageningen UR Livestock Research, The Netherlands
Received: January 23, 2013; Accepted: May 16, 2013; Published: June 26, 2013
Copyright: © 2013 Cao et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This work was supported by a Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20120041110008), National Science and Technology Mega-Project Program of China (Grant No. 2011ZX09101-008-09), National Natural Science Foundation of China (Grant No. 61174027), and the Program for Liaoning Excellent Talents in University (Grant No. LJQ2012005). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
A good understanding of protein subcellular location is key to deducing protein functions, revealing disease pathogenesis, and identifying drug targets. In the last ten years, the rapid growth of protein data has made it faster and more economical to predict subcellular localization via computational methods. Since the first protein location prediction system emerged, many prediction approaches and predictors have been proposed. These methods are mostly based on classification algorithms, e.g. k-nearest neighbor (KNN), support vector machine (SVM), Bayesian methods, and neural networks. A comprehensive review describes the process of establishing a robust predictor of protein subcellular localization, which involves the following aspects: (a) selecting or constructing an effective benchmark dataset to train and test the predictor; (b) formulating the protein samples with a valid mathematical expression; (c) proposing a powerful algorithm (classifier) for prediction tasks; and (d) performing proper tests to objectively evaluate the performance of the predictor. Among these aspects, one key factor in building a high-accuracy prediction method is obtaining a valid dataset with sufficient useful information to train a powerful classifier.
Normally, the training data of a subcellular localization predictor are acquired from the “proteins with experimental annotations (referred to as PEAs hereafter)” in protein databases, which are labeled on the basis of sufficient experimental evidence. However, experimental methods require a long time to obtain conclusive evidence for assigning an annotation. Therefore, these experimentally annotated sequences are just a small part of the overall sequences. According to the record (version 2012_05) of the central protein databank UniProtKB/Swiss-Prot, PEAs account for only 13.22% of the reviewed proteins it contains. In this study, we also counted the number of protein sequences over the past ten years in UniProtKB/Swiss-Prot and summarized the statistics in Table 1. Over the last decade, the number of all protein sequences increased tenfold, whereas the number of experimental sequences did not even double. While more PEAs of all types are needed to provide useful information about the growing number of undetermined proteins, the gap between the number of PEAs and the entire protein data is becoming larger and larger. In addition, for computational prediction methods, an excess of homologous or similar protein data causes over-fitting and is redundant for training; consequently, most of these PEAs have to be abandoned in practice. Besides, some special subcellular locations are associated with very few PEAs, which further restricts the amount of data available for learning. Therefore, there are often insufficient PEAs when constructing a proper dataset for a prediction task. For instance, the virus benchmark dataset consists of merely 207 proteins, and only eight of them are located in the “viral capsid”.
The lack of high-quality training data occurs in nearly every species, and it has been a major problem in many bioinformatics studies because prediction with sparse data mostly yields disappointing results.
To overcome this shortage of training data, seeking extra protein training data is a very natural idea. Besides the PEAs, we found that we can take advantage of the huge number of “proteins with non-experimental annotations (referred to as PNEAs hereafter)” in the central protein database UniProtKB/Swiss-Prot. Since the observations are not derived from direct experiments, non-experimental annotations are assigned on the basis of non-experimentally proven findings such as logical or conclusive evidence, sequence analysis results, biological events and characteristics. A PNEA has at least one non-experimental label in its “Subcellular location” item, and a non-experimental label corresponds to one of the following three types: “Probable” - from non-direct experimental evidence; “Potential” - from computer prediction, logical or conclusive evidence; “By similarity” - from experimental evidence in a close member of the family. The details of the three non-experimental labels can be found in the UniProtKB/Swiss-Prot manual at http://www.uniprot.org/manual. For protein subcellular location prediction based on computational methods, the PNEAs, which are currently ignored, are important and valuable. Unlike unknown protein data, the PNEAs provide a large amount of highly reliable reference location information. Additionally, as shown in Table 1, PNEAs are far more numerous and grow much faster than PEAs. If such abundant PNEAs can be effectively exploited, they would provide a huge supplement to PEAs for training more powerful predictors. Despite this advantage, PNEAs cannot be used indiscriminately as supplementary training data. The reason is that non-experimental evidence is still weaker than experimental proof, so some portion of PNEAs may have inaccurate non-experimental labels. Therefore, a feasible rule is needed to select the useful, low-risk and high-quality members of the PNEAs for training a classifier.
In order to develop a proper rule for the active selection process, a machine learning technique named “active learning” is adopted in our study. Active learning is a paradigm for using unlabeled data to complement labeled data, as it can actively select and learn from the most informative unlabeled data. The idea of actively selecting new samples is suitable for our work. However, some issues with the active learning process need to be resolved before it can be properly used in this study. An active learner asks the user to label the unlabeled data so that it can learn a good classifier with as few manually labeled samples as possible, whereas in our study the candidate PNEA samples are not unlabeled but rather carry special non-experimental labels, and the proposed algorithm should automatically pick out enough, but not redundant, samples from the whole PNEA dataset. Therefore, inspired by an active learning algorithm, this paper proposes a novel active sample selection strategy for PNEAs to increase the amount of training data available. For the weak basic classifiers learned from only the original data, this strategy measures the usefulness of all candidate PNEAs and picks out the most useful ones as supplementary training data. The weak classifiers are then retrained on the new training set to obtain improved prediction performance.
The effectiveness of the proposed approach is tested on three protein benchmark datasets from virus, plant and Gram-negative bacteria cells, using four popular multi-label classification algorithms based on KNN, SVM, a Bayesian method and a neural network. The results show that the proposed method can effectively pick out the useful PNEAs and that the prediction performance of each basic classifier is clearly enhanced.
Materials and Methods
Three existing benchmark experimental datasets of different species are used for cross-validation tests: a virus dataset consisting of 207 proteins and 6 different subcellular location classifications, a plant dataset consisting of 978 proteins and 12 different subcellular location classifications, and a Gram-negative bacteria (referred to as Gneg hereafter) dataset consisting of 1392 proteins and 8 different subcellular location classifications. In order to obtain effective candidates for supplementary training data, we extracted numerous PNEAs of the three species by parsing the “Subcellular location” section of the “Comments” field in the UniProtKB/Swiss-Prot database (release 2012_05). Protein fragments and sequences containing fewer than 50 amino acid residues were discarded. Similarly, we also collected several new PEAs that were not included in the above-mentioned benchmark datasets for an independent test. In order to reduce redundancy and avoid homology bias, we used the public server PISCES, which is based on PSI-BLAST alignments, to cull the extracted sequences so that no protein has ≥25% sequence similarity to any other extracted protein or to any sequence in the benchmark dataset for the same species.
After culling, we created three supplementary training sample pools as candidates for active selection, which consist of 238 virus PNEAs, 758 plant PNEAs and 248 Gneg PNEAs. We also constructed three additional independent test sets, consisting of 69 virus PEAs, 261 plant PEAs and 207 Gneg PEAs. Note that, because some proteins occur in more than one location, the concept of “locative protein” from the literature is employed to compute the performance indexes of the classifiers. Under this concept, a protein that coexists at N different location sites is counted as N locative proteins even though they have an identical sequence. The amounts of active/locative proteins in the three groups of datasets are shown in Table 2. More details about the datasets can be found in Tables S1–S3 in Material S1. The new PNEAs and new PEAs used in our research are all listed in Material S2 (Supplementary Datasets S1–S6).
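The annotation parsing and length filtering described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: it assumes the subcellular location comment is available as a single string in which each location may be followed by a parenthesized qualifier, and the function names `parse_locations`, `is_pnea` and `keep_sequence` are hypothetical.

```python
import re

# The three non-experimental qualifiers named in the text.
NON_EXPERIMENTAL = ("Probable", "Potential", "By similarity")


def parse_locations(comment):
    """Split a location comment into (location, evidence) pairs.

    Assumed input format: locations separated by '.' or ';', each
    optionally followed by a qualifier in parentheses.
    """
    pairs = []
    for part in re.split(r"[.;]", comment):
        part = part.strip()
        if not part:
            continue
        match = re.match(r"^(.*?)\s*\(([^)]*)\)$", part)
        if match and match.group(2) in NON_EXPERIMENTAL:
            pairs.append((match.group(1), match.group(2)))
        else:
            # No recognized qualifier: treated here as experimental.
            pairs.append((part, "Experimental"))
    return pairs


def is_pnea(comment):
    """A PNEA has at least one non-experimental location label."""
    return any(ev != "Experimental" for _, ev in parse_locations(comment))


def keep_sequence(sequence, is_fragment):
    """Discard fragments and sequences with fewer than 50 residues."""
    return (not is_fragment) and len(sequence) >= 50
```

Redundancy culling at the 25% similarity threshold is not shown, since the paper delegates it to the PISCES server.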
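The “locative protein” count can be illustrated with a short sketch. The dictionary layout (protein identifier mapped to a set of location names) and the identifiers are hypothetical and chosen only for the example.

```python
def count_proteins(label_sets):
    """Return (number of distinct proteins, number of locative proteins).

    label_sets maps a protein identifier to the set of its locations;
    a protein with N locations contributes N locative proteins.
    """
    distinct = len(label_sets)
    locative = sum(len(locations) for locations in label_sets.values())
    return distinct, locative


# Toy example with made-up identifiers.
example = {
    "P1": {"nucleus"},
    "P2": {"cytoplasm", "nucleus"},
}
print(count_proteins(example))
```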
Active Sample Selection Strategy
In this study, because some proteins have multiple subcellular localization sites, the final prediction task is also a multi-label learning problem. Accordingly, the active sample selection strategy should have the ability to deal with the multi-label cases. Let denote the original training data set consisting of PEAs classified in different subcellular locations, where each protein can be represented by a feature vector of dimensions as , and the label set denotes the protein subcellular locations of . For each protein , if it inhabits the subcellular location, mark , otherwise . The basic classifier is trained by to output a set of labels for each unseen protein. Let denote the supplementary training sample pool containing PNEAs, where a protein has the label set . For each protein , if it has the subcellular location labeled by experimental/non-experimental evidences, we mark that , otherwise . Note that, both and mean , and the subscripts are merely used for recognizing that this positive label is obtained by corresponding experimental or non-experimental annotations.
In order to actively pick out the useful samples from the supplementary training sample pool, the key is to create a feasible evaluation function to measure the usefulness of a non-experimental sample and decide which samples should be added into the original training set. In this paper, the classification risk of a sample is used for reflecting the sample’s usefulness, where a lower risk means a higher usefulness. For a sample in , let be the classification risk which is brought by adding into the original training set, and the evaluation function of is defined by its maximum risk. Our motivation is to evaluate the risks, and pick out the optimal by minimizing the maximum risk, that leads to the following min-max combinatorial optimization problem:(1)(2)where, represents the unknown actual label set for , where, for each label , if , if , but if then may be 1 or −1. is the regularization item which measures the model complexity of the classifier, here is a reproducing kernel Hilbert space endowed with kernel function . is a quadratic loss function and is a weighted quadratic loss function, i.e.,(3)(4)
where, is the weight function. For a PNEA, its associated label set is uncertain because its non-experimental label may not be the active label, and it is hard to directly calculate its loss. Therefore, the weight function is added to reflect the probability that a non-experimental label is the active label, which can be written as:(5)here is the posterior probability of the event that just equals when has a non-experimental label . According to the previous description of , it can be deduced that when or . Therefore, it only needs to estimate the posterior probability for a non-experimental label . We use the Parzen-window estimation with the Gaussian kernel  to estimate the posterior probability of as:(6)where, the prior probability is the confidence of the event that if then , and it is set as the parameter related to the type of the corresponding non-experimental label , is the complementary set of , and are short for and respectively, which are defined as:(7)(8)
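The role of the prior and of the Gaussian-kernel Parzen window in such a posterior estimate can be sketched generically. The code below applies Bayes' rule to two Parzen density estimates; it is an assumed, textbook form and not necessarily identical to Eqs. (6)–(8). The function names, the bandwidth `sigma`, and the choice of positive and negative sample sets are all assumptions; the prior values in `PRIORS` are the ones reported in the experimental settings of this paper.

```python
import math

# Priors per non-experimental label type, as reported in the experiments.
PRIORS = {"Probable": 0.85, "Potential": 0.8, "By similarity": 0.75}


def gaussian_kernel(x, y, sigma=1.0):
    """Unnormalized Gaussian kernel; the constant cancels in the ratio below."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * sigma ** 2))


def parzen_density(x, samples, sigma=1.0):
    """Parzen-window density estimate of x from a list of feature vectors."""
    if not samples:
        return 0.0
    return sum(gaussian_kernel(x, s, sigma) for s in samples) / len(samples)


def posterior_positive(x, positives, negatives, prior, sigma=1.0):
    """Posterior that a non-experimental positive label is truly positive.

    positives / negatives: feature vectors of samples known to carry or
    not carry the location (assumed to come from certain labels).
    """
    weight_pos = prior * parzen_density(x, positives, sigma)
    weight_neg = (1.0 - prior) * parzen_density(x, negatives, sigma)
    total = weight_pos + weight_neg
    # Fall back to the prior when neither density gives any support.
    return prior if total == 0.0 else weight_pos / total
```

A sample lying close to known positives receives a posterior above its prior, while one lying close to known negatives is down-weighted, which is the behavior the weight function is meant to capture.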
where, consisting of samples is the set of all samples with certain labels; consisting of samples defined as the set of all samples with non-experimental labels because a PNEA sample will get the maximum loss when all the actual label are opposite to the corresponding non-experimental positive label, i.e. and .
Except for , all other parts in Eq.(14) can be determined and the min-max optimization problem described as Eq.(8) can be solved through using all feasible values of to find the optimal with the smallest . Similarly, we can pick out other PNEA samples one by one.
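The enumeration over feasible label sets can be sketched as below. This is a simplified illustration of the min-max idea only: the risk used here is a plain quadratic loss between hypothetical classifier scores and a candidate label vector, standing in for the paper's regularized, weighted risk, and the data layout (`"ne"` marking a non-experimental positive label) is an assumption.

```python
from itertools import product


def feasible_label_sets(labels):
    """Enumerate all actual label vectors consistent with the annotations.

    labels holds 1 (experimental positive), -1 (negative), or "ne"
    (non-experimental positive, whose actual value may be 1 or -1).
    """
    options = [(1, -1) if label == "ne" else (label,) for label in labels]
    return [list(choice) for choice in product(*options)]


def quadratic_risk(candidate, actual_labels):
    """Stand-in risk: squared error between scores and assumed labels."""
    return sum((s - y) ** 2 for s, y in zip(candidate["scores"], actual_labels))


def max_risk(candidate, risk=quadratic_risk):
    """Worst-case risk of a candidate over all feasible label vectors."""
    return max(risk(candidate, y) for y in feasible_label_sets(candidate["labels"]))


def select_min_max(pool, risk=quadratic_risk):
    """Pick the candidate whose worst-case risk is smallest."""
    return min(pool, key=lambda candidate: max_risk(candidate, risk))
```

Repeating `select_min_max` on the remaining pool yields the samples one by one; note that the number of feasible label vectors grows as 2 to the power of the number of non-experimental labels, which stays small when each protein has only a few locations.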
Since the usefulness of all the PNEAs are being measured, the algorithm needs to decide how many samples in should be added to to help to retrain the classifier. We observe that there is a high correlation between the usefulness of PNEA samples in and the change rates of the evaluation values. If the change becomes stable, it means the latest added supplementary training samples have little or no effect. Based on this point, this paper presents a simple algorithm, which can output a proper proportion of all samples in the supplementary training sample pool. First, rank all the samples within in ascending order according to their evaluation to compose a new ordered set . Next, denote the evaluation value of a sample in by . Then the change rate of its evaluation value can be written as:(15)
For a given step of proportion and the corresponding number of intervals , the algorithm needs to decide which proportion is preferred for helping to retrain the basic classifier (e.g. , then , where the preferred proportion is one of following percentages: 10%, 20%, 30%, …, and 100%). Let be the number of the samples in the t-th interval, and the preferred proportion can be calculated as:(16)(17)
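The idea of stopping once the change rate stabilizes can be illustrated with a rough sketch. It does not reproduce Eqs. (15)–(17): the relative-change formula, the per-interval averaging, the `tolerance` threshold and the function name are all assumptions made for this example.

```python
def preferred_proportion(values, n_intervals=10, tolerance=0.01):
    """Return a proportion of the ranked pool to keep (hypothetical rule).

    values: evaluation values of all candidate samples.
    The samples are ranked in ascending order, the relative change
    between consecutive values is computed, and the pool is cut at the
    start of the first interval whose mean change rate is below
    tolerance (but never below one interval).
    """
    ranked = sorted(values)
    rates = [
        abs(b - a) / abs(a) if a else 0.0
        for a, b in zip(ranked, ranked[1:])
    ]
    size = len(rates) / n_intervals
    for t in range(n_intervals):
        chunk = rates[int(t * size):int((t + 1) * size)]
        if chunk and sum(chunk) / len(chunk) < tolerance:
            return max(t, 1) / n_intervals
    # The change rate never stabilizes: keep the whole pool.
    return 1.0
```

With a step of 10% (`n_intervals=10`), the returned value is one of 10%, 20%, …, 100%, matching the set of candidate proportions described above.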
Note that, it is hard to theoretically prove whether the output proportion is the global optimum or not, but it can be seen that can indeed provide excellent results in subsequent simulation experiments.
After selecting the top of the samples in and adding them into the original training set, the initial classifier is updated according to the new training set and its performance is improved. An illustration of the work process of the proposed active example selection strategy is shown in Fig. 1.
In order to comprehensively evaluate the active sample selection method and compare the classifier performances with/without the proposed approach, some common evaluation metrics are used. Here, denotes a test set, returns a set of proper labels of ; returns a probability indicating the confidence for to be a proper label of ; is the rank of derived from . Let and represent the complementary sets of and , respectively. Therefore, we have:
- True Positives:
- False Positives:
- True Negatives:
- False Negatives:
Based on the above, three global indices: accuracy (Accu), Matthews correlation coefficient (MCC) and F1-score, and three multi-label evaluation metrics: average precision (Avgprec), ranking loss (Rloss) and coverage are computed as follows:(18)(19)(20)(21)(22)(23)
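For reference, the three global indices can be computed from the four counts using their standard textbook definitions. The sketch below assumes those standard forms, which may differ in detail from the paper's own equations, and returns 0 when a denominator vanishes (a convention chosen for this example).

```python
import math


def global_indices(tp, fp, tn, fn):
    """Return (accuracy, MCC, F1) from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    # Matthews correlation coefficient.
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0

    # F1 is the harmonic mean of precision and recall.
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return accuracy, mcc, f1
```

Accuracy and F1 range from 0 to 1, while MCC ranges from −1 to 1, so it also penalizes classifiers that do well on only one of the positive or negative classes.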
Results and Discussion
We performed several simulation experiments to evaluate the performance of the proposed approach through both sub-sampling (10-fold cross-validation) and independent dataset tests, using the three groups of datasets described in the “Materials and Methods” section. In the sub-sampling tests, we performed multiple rounds of randomization of the original training and testing data on each benchmark dataset. In the independent dataset tests, the benchmark datasets were directly used as the original training sets, and the new independent test sets were adopted for testing. The amphiphilic pseudo amino acid composition was employed as the feature extraction technique to represent a protein sequence. The protein sequences were formulated with a valid mathematical expression by this method through a public online server named PseAAC at: http://www.csbio.sjtu.edu.cn/bioinf/PseAA/. The details of PseAAC can be found in the corresponding reference. In this study, the amino acid characters were empirically chosen to be Hydrophobicity, Hydrophilicity and Mass; the weight factor was 0.4, and the lambda parameter was 5. Four different types of multi-label classification models, namely IMKNN, SVM, Gaussian process and ML-RBF, were used as basic classifiers to test our algorithm. The parameters of these classifiers were assigned the same values as in the original papers, and all of them were kept fixed throughout the experiments for an objective comparison.
The overall performance of the above classification algorithms was compared under three conditions: not using the proposed active sample selection (using no supplementary training samples), using the proposed active sample selection with a preferred proportion of the supplementary training samples, and directly using all PNEA samples in the supplementary training sample pool. In the experiments, the kernel function of the reproducing kernel Hilbert space was the same Gaussian kernel used for estimating the posterior probability in Eq.(6). The prior probabilities of the three levels of non-experimental labels were set according to the strength of the evidence behind the three non-experimental label types: the prior probability of the “Probable” label was set to be the largest, that of “Potential” was medium and that of “By similarity” was the smallest. We tested several values for the prior probabilities and finally chose the group of values with the best results: 0.85 for “Probable”, 0.8 for “Potential” and 0.75 for “By similarity”. The step of proportion was set to 10% and the number of intervals was 10.
Through the numerical experiments, we observe that the preferred proportions of active sample selection differ across datasets. The preferred proportion of virus PNEA samples is , for plant, and for Gram-negative bacteria. The performance of these classification models when using none, the preferred proportion and all of the samples in the supplementary training sample pool is compared in Tables 3–6. Tables 3, 4 and 5 show the average values and standard deviations of the 10-fold cross-validation measures over 10 randomizations, and Table 6 shows the results of the independent dataset test. For each evaluation metric, “↑” means that a larger value indicates better performance, and “↓” means that a smaller value indicates better performance. In each case, the classifier using the supplementary training data selected by the proposed approach performs better than the basic classifier using no supplementary training samples. Additionally, the results under the proposed approach are superior to those obtained by indiscriminately using all the data in the supplementary training data pool. From the simulation results, it can be concluded that, on the one hand, the improvement over the original prediction indicates that the selected PNEA samples are useful and indeed provide helpful information for prediction; on the other hand, the better performance of active sample selection over directly using all the samples in the supplementary training sample pool indicates that some of the PNEA samples disrupt the prediction, possibly because they carry inaccurate information. Therefore, an effective active sample selection is important for selecting a proper amount of valuable PNEA samples and reducing the risk of prediction disturbance brought by redundant supplementary training data.
We also observed that the performance improvements of all the classification models are related to the size of the original training set. For the virus cases, which have the least original training data, each classifier’s performance improvement is greater than on the other two datasets. Conversely, for the Gneg cases, which have the most original training data, the improvement is the smallest. We attribute this to the fact that a smaller original training set suffers from a greater data shortage, so the basic classifiers benefit more from incrementally added useful supplementary training data. Because the improvement does not depend on the underlying classification model, the experimental results suggest that the proposed active sample selection strategy provides a generic approach for existing prediction algorithms.
It is worth noting that the inherent problem with PNEAs is that their annotations can only be confirmed experimentally. To validate that the proposed strategy is more useful than conventional analysis based simply on PEAs, it would be better to test it via additional biological experiments. If we could show on PNEA data that the strategy finds true positives and rejects true negatives validated against biological observation of the characteristics of these proteins, the effectiveness of this approach would be further verified. However, in our work, it is difficult to directly conduct biological experiments to validate PNEAs. As an alternative, we tried to find true positive and true negative PNEA samples that can be validated against a biological observation in the Swiss-Prot databank. Unfortunately, we found few true positives (e.g. the non-experimental annotation “Golgi apparatus” of the plant protein with entry number “Q9M2T1” has been verified experimentally) and no true negatives. Although the true positives can be successfully found using this strategy, we think the number of samples identified is too small to provide enough support for this study. Therefore, the related results for these few protein samples are not included in this paper. Moreover, the objective of this study is not to identify true positive proteins, but to build protein subcellular localization prediction tools with better accuracy with the help of non-experimental proteins. According to the results in Tables 3–6, the increase in accuracy over the conventional algorithms after training with these PNEAs indicates that the proposed strategy works. Therefore, the proposed method could be regarded as potentially significant, even without experimental biological validation. However, a biological validation of our algorithm is still worth performing, and we hope to cooperate with biochemists to improve this method in the future.
In summary, in order to overcome the shortage of experimental training data in the prediction of protein subcellular location, we mined the proteins with non-experimental annotations and designed a novel active sample selection strategy to find useful PNEA samples. As supplementary training data, these selected samples helped retrain and improve the original basic classifiers. This approach, based on the min-max view, provides a systematic way of measuring the usefulness of a sample with multiple labels. The results show that the proposed algorithm improves the prediction performance of all four types of classifiers. We believe that active sample selection techniques in machine learning can be a powerful and useful tool for alleviating the data shortage problem and could be extended to other real-world data mining applications. We also expect that the information carried by the huge number of proteins with non-experimental annotations can be applied to other biological problems. Furthermore, so that other interested users can compare the presented method with other predictors, we will make efforts to provide an online prediction web-server in our future work.
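The sub-sampling protocol (repeated, randomized 10-fold cross-validation) can be sketched in plain Python. The function names and the use of the repetition index as the random seed are assumptions for illustration; the classifiers and feature extraction are not shown.

```python
import random


def kfold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists for one randomized k-fold split."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [order[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def repeated_kfold(n_samples, k=10, repeats=10):
    """Yield (repeat, train, test) over several independent randomizations."""
    for repeat in range(repeats):
        for train, test in kfold_indices(n_samples, k, seed=repeat):
            yield repeat, train, test


# Example: the virus benchmark dataset has 207 proteins.
n_splits = sum(1 for _ in repeated_kfold(207))
print(n_splits)
```

Averaging each metric over all splits, together with its standard deviation, gives summary values of the kind reported for the cross-validation tests.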
This includes Tables S1–S3.
This includes Datasets S1-S6.
The authors wish to thank the editor and anonymous reviewers for their helpful comments and suggestions.
Conceived and designed the experiments: JC WL. Performed the experiments: JC JH. Analyzed the data: JC WL. Contributed reagents/materials/analysis tools: HG. Wrote the paper: JC. Designed the program used in experiments: JC JH.
- 1. Nakai K, Kanehisa M (1991) Expert system for predicting protein localization sites in gram-negative bacteria. Proteins: Struct Funct Bioinf 11: 95–110.
- 2. Chou KC, Shen HB (2008) Cell-PLoc: a package of web servers for predicting subcellular localization of proteins in various organisms. Nat Protocols 2: 153–162.
- 3. Chou KC, Shen HB (2010) Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Natural Science 2: 1090–1103.
- 4. Horton P, Park KJ, Obayashi T, Fujita N, Harada H, et al. (2007) WoLF PSORT: protein localization predictor. Nucleic Acids Res 35: W585–W587.
- 5. Cao JZ, Liu WQ, Gu H (2012) Predicting viral protein subcellular localization with Chou’s pseudo amino acid composition and imbalance-weighted multi-label k-nearest neighbor algorithm. Protein Pept Lett 19: 1163–1169.
- 6. Garg A, Bhasin M, Raghava GPS (2005) Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. J Biol Chem 280: 14427–14432.
- 7. Shatkay H, Höglund A, Brady S, Blum T, Dönnes P, et al. (2007) SherLoc: high-accuracy prediction of protein subcellular localization by integrating text and protein sequence data. Bioinformatics 23: 1410–1417.
- 8. Briesemeister S, Blum T, Brady S, Lam Y, Kohlbacher O, et al. (2009) SherLoc2: a high-accuracy hybrid method for predicting subcellular localization of proteins. J Proteome Res 8: 5363–5366.
- 9. Bulashevska A, Eils R (2006) Predicting protein subcellular locations using hierarchical ensemble of Bayesian classifiers based on Markov chains. BMC Bioinformatics 7: 298.
- 10. He JJ, Gu H, Liu WQ (2012) Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites. PLoS ONE 7: e37155.
- 11. Emanuelsson O, Nielsen H, Brunak S, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300: 1005–1016.
- 12. Ma JW, Liu WQ, Gu H (2010) Using elman networks ensemble for protein subnuclear location prediction. Int J Innov Comput I 6: 5093–5103.
- 13. Chou KC (2011) Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol 273: 236–247.
- 14. Shen HB, Chou KC (2010) Virus-mPLoc: a fusion classifier for viral protein subcellular location prediction by incorporating multiple sites. J Biomol Struct Dyn 28: 175–186.
- 15. Xu Q, Pan SJ, Xue HH, Yang Q (2011) Multitask learning for protein subcellular location prediction. IEEE/ACM Trans Comput Biol Bioinform 8: 748–759.
- 16. Junker VL, Apweiler R, Bairoch A (1999) Representation of functional information in the SWISS-PROT Data Bank. Bioinformatics 15: 1066–1067.
- 17. Boutet E, Lieberherr D, Tognolli M, Schneider M, Bairoch A (2007) UniProtKB/Swiss-Prot. Methods Mol Biol 406: 89–112.
- 18. Settles B (2009) Active Learning Literature Survey. Computer Sciences Technical Report 2009: 1648.
- 19. Hoi SCH, Jin R, Zhu J, Lyu MR (2007) Semi-supervised SVM batch mode active learning with applications to image retrieval. ACM T Inform Syst 27: 1–29.
- 20. Chou KC, Shen HB (2010) Plant-mPLoc: A top-down strategy to augment the power for predicting plant protein subcellular localization. PLoS ONE 5: e11335.
- 21. Shen HB, Chou KC (2010) Gneg-mPLoc: a top-down strategy to enhance the quality of predicting subcellular localization of Gram-negative bacterial proteins. J Theor Biol 264: 326–333.
- 22. Wang GL, Dunbrack RL (2003) PISCES: a protein sequence culling server. Bioinformatics 19: 1589–1591.
- 23. Li B, Chen YW, Chen YQ (2008) The nearest neighbor algorithm of local probability centers. IEEE T Syst Man Cy B 38: 141–154.
- 24. Chou KC (2005) Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21: 10–19.
- 25. Shen HB, Chou KC (2008) PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. Anal Biochem 373: 386–388.
- 26. Huang J, Shi F (2005) Support vector machines for predicting apoptosis proteins types. Acta Biotheor 53: 39–47.
- 27. Zhang ML (2009) ML-RBF: RBF neural networks for multi-label learning. Neural Process Lett 29: 61–74.