Mining Proteins with Non-Experimental Annotations Based on an Active Sample Selection Strategy for Predicting Protein Subcellular Localization

Subcellular localization of a protein is important for understanding the protein's functions and interactions. Many computational techniques exist to predict protein subcellular locations, but it has been shown that many prediction tasks suffer from a shortage of training data. This paper introduces a new method to mine proteins with non-experimental annotations, which are labeled by the non-experimental evidence recorded in protein databases, to overcome this training data shortage. A novel active sample selection strategy is designed, taking advantage of active learning technology, to actively find useful samples from the entire pool of candidate proteins with non-experimental annotations. The approach adequately estimates the ''value'' of each sample, automatically selects the most valuable samples, and adds them to the original training set to help retrain the classifiers. Numerical experiments with four popular multi-label classifiers on three benchmark datasets show that the proposed method can effectively select valuable samples to supplement the original training set and significantly improve the performance of the predicting classifiers.


Introduction
A good understanding of protein subcellular location is key to deducing protein functions, revealing disease pathogenesis, and identifying drug targets. In the last ten years, the rapid growth of protein data has made it faster and more economical to predict subcellular localization via computational methods. Since the first protein location prediction system emerged [1], many prediction approaches and predictors have been proposed. These methods are mostly based on classification algorithms, e.g. k-nearest neighbor (KNN) [2][3][4][5], support vector machine (SVM) [6][7][8], Bayesian methods [9,10], and neural networks [11,12]. A comprehensive review [13] describes the process of establishing a robust predictor of protein subcellular localization, with the following aspects: (a) selecting or constructing an effective benchmark dataset to train and test the predictor; (b) formulating the protein samples with a valid mathematical expression; (c) proposing a powerful algorithm (classifier) for the prediction task; and (d) performing proper tests to objectively evaluate the performance of the predictor. Among these aspects, one key factor in building a high-accuracy prediction method is obtaining a valid dataset with sufficient useful information to train a powerful classifier.
Normally, the training data of a subcellular localization predictor are acquired from the ''proteins with experimental annotations'' (referred to as PEAs hereafter) in protein databases, which are labeled by sufficient experimental evidence. However, experimental methods require a long time to obtain conclusive evidence for assigning an annotation, so these experimentally annotated sequences are only a small part of the overall sequences. According to the record (version 2012_05) of the central protein databank UniProtKB/Swiss-Prot, PEAs occupy only 13.22% of the total reviewed proteins contained therein. In this study, we also counted the number of protein sequences over the past ten years in UniProtKB/Swiss-Prot and summarized the statistics in Table 1. Over the last decade, the total number of protein sequences increased tenfold, but the number of experimentally annotated sequences less than doubled. While more PEAs of all types are needed to provide useful information for the increasing number of undetermined proteins, the gap between the amount of PEAs and the entire protein data is becoming larger and larger. In addition, for computational prediction methods, an excess of homologous or similar protein data causes over-fitting and is redundant for training; consequently, many of these PEAs have to be abandoned in practice. Besides, some special subcellular locations are associated with very few PEAs, which also restricts the number of data available for learning. Therefore, there are often insufficient PEAs for constructing a proper dataset for a prediction task. For instance, the virus benchmark dataset in [14] consists of merely 207 proteins, and only eight proteins are located in the ''viral capsid''.
The problem of lacking high-quality training data occurs in nearly every species, and it has been a major problem in many areas of bioinformatics research because prediction with sparse data mostly yields disappointing results [15].
To overcome this shortage of training data, seeking extra protein training data is a natural idea. Besides the PEAs, we found that we can take advantage of the huge number of ''proteins with non-experimental annotations'' (referred to as PNEAs hereafter) in the central protein database UniProtKB/Swiss-Prot. Since the observations do not come from direct experiments, non-experimental annotations are assigned based on non-experimentally proven findings such as logical or conclusive evidence, sequence analysis results, and biological events and characteristics [16]. A PNEA has at least one non-experimental label in its ''Subcellular location'' item, and a non-experimental label corresponds to one of the following three types [17]: ''Probable'' - from non-direct experimental evidence; ''Potential'' - from computer prediction or logical/conclusive evidence; ''By similarity'' - from experimental evidence in a close member of the family. Details of the three non-experimental labels can be found in the UniProtKB/Swiss-Prot manual at http://www.uniprot.org/manual. For protein subcellular location prediction based on computational methods, the PNEAs, which have so far been ignored, are important and valuable. Unlike unknown protein data, PNEAs provide a large amount of highly reliable reference location information. Additionally, as shown in Table 1, PNEAs are far more numerous and grow much faster than PEAs. If such abundant PNEAs can be effectively exploited, they would provide a huge supplement to the PEAs for training more powerful predictors. Despite this big advantage, not all PNEAs can be used indiscriminately as supplementary training data: non-experimental evidence is still weaker than experimental proof, so some portion of the PNEAs may carry inaccurate non-experimental labels. Therefore, a feasible rule is needed to select, at low risk and with high quality, the useful members of the PNEAs for training a classifier.
In order to develop a proper rule for the selection process, a machine learning technique named ''active learning'' is adopted in our study. Active learning is a paradigm for using unlabeled data to complement labeled data, as it can actively select and learn from the most informative unlabeled samples. The idea of actively selecting new samples is well suited to our work. However, some issues with the standard active learning process need to be resolved before it can be used properly in this study. An active learner typically asks the user to label unlabeled data so that it can learn a good classifier with as few manually labeled samples as possible [18]; in our study, by contrast, the candidate PNEA samples are not unlabeled but carry special non-experimental labels, and the proposed algorithm should automatically pick out enough, but not redundant, samples from the whole PNEA dataset. Therefore, inspired by an active learning algorithm [19], this paper proposes a novel active sample selection strategy for PNEAs to increase the amount of available training data. Starting from the weak basic classifiers learned on only the original data, this strategy measures the usefulness of all candidate PNEAs and picks out the most useful ones as supplementary training data. The weak classifiers are then retrained on the new training set to obtain improved prediction performance.
The effectiveness of the proposed approach is tested on three protein benchmark datasets from virus, plant and Gram-negative bacteria cells, using four popular multi-label classification algorithms based on KNN, SVM, a Bayesian method and a neural network. The results show that the proposed method can effectively pick out useful PNEAs and clearly enhances the prediction performance of each basic classifier.

The Datasets
Three existing benchmark experimental datasets of different species are used for cross-validation tests: a virus dataset [14] consisting of 207 proteins in 6 different subcellular locations, a plant dataset [20] consisting of 978 proteins in 12 different subcellular locations, and a Gram-negative bacteria (referred to as Gneg hereafter) dataset [21] consisting of 1392 proteins in 8 different subcellular locations. In order to obtain effective candidates for supplementary training data, we extracted numerous PNEAs of the three species by parsing the ''Subcellular location'' section of the ''Comments'' field in the UniProtKB/Swiss-Prot database (release 2012_05). Protein fragments and sequences containing fewer than 50 amino acid residues were discarded. Similarly, we also collected several new PEAs not included in the abovementioned benchmark datasets for an independent test. In order to reduce redundancy and avoid homology bias, we used the public server PISCES [22], based on PSI-BLAST alignments, to identify and cull protein sequences from all extracted sequence data, ensuring that none of these proteins has ≥25% sequence similarity to one another or to any sequence in the benchmark dataset of the same species.
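The extraction step above hinges on recognizing which location labels in a ''Subcellular location'' comment carry a non-experimental qualifier. A minimal sketch of such a parser is shown below; the comment strings and the helper names (`extract_location_labels`, `is_pnea`) are illustrative, not taken from a real Swiss-Prot parser, and real flat-file records would need more careful tokenization.

```python
import re

# The three non-experimental qualifiers named in the text.
NON_EXPERIMENTAL = ("Probable", "Potential", "By similarity")

def extract_location_labels(cc_text):
    """Split a SUBCELLULAR LOCATION comment into (location, qualifier) pairs.

    A qualifier such as 'By similarity' marks the annotation as
    non-experimental; a missing qualifier is treated as experimental here.
    """
    labels = []
    for part in cc_text.split(";"):
        part = part.strip().rstrip(".")
        if not part:
            continue
        m = re.match(r"(.+?)\s*\((Probable|Potential|By similarity)\)$", part)
        if m:
            labels.append((m.group(1), m.group(2)))   # non-experimental label
        else:
            labels.append((part, None))               # experimental label
    return labels

def is_pnea(cc_text):
    """A protein is a PNEA if at least one location label is non-experimental."""
    return any(q in NON_EXPERIMENTAL for _, q in extract_location_labels(cc_text))
```

A protein whose comment reads `"Host nucleus (By similarity); Host cytoplasm."` would thus yield one non-experimental and one experimental label, and count as a PNEA.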
After culling, we created three supplementary training sample pools as candidates for active selection, consisting of 238 virus PNEAs, 758 plant PNEAs and 248 Gneg PNEAs. We also constructed three additional independent test sets, consisting of 69 virus PEAs, 261 plant PEAs and 207 Gneg PEAs. Note that, because some proteins occur in more than one location, the concept of ''locative protein'' from the literature [21] is employed to compute the performance indexes of the classifiers. Under this concept, a protein coexisting at N (N > 1) different location sites is counted as N locative proteins even though they have an identical sequence. The numbers of actual/locative proteins in the three groups of datasets are shown in Table 2. More details about the datasets can be found in Tables S1-S3 in Material S1.
Table 1. Number of protein sequences over the past ten years (2003-2012) in the UniProtKB/Swiss-Prot protein knowledgebase.

Active Sample Selection Strategy
In this study, because some proteins have multiple subcellular localization sites, the final prediction task is a multi-label learning problem, and accordingly the active sample selection strategy must be able to deal with multi-label cases. Let $D_l = \{(x_k, y_k) \mid 1 \le k \le n_l\}$ denote the original training set consisting of $n_l$ PEAs classified into $m$ different subcellular locations, where each protein $x_k$ is represented by a $d$-dimensional feature vector $[x_{k1}, x_{k2}, \cdots, x_{kd}]^T$, and the label set $y_k = [y_{k1}, y_{k2}, \cdots, y_{km}]^T$ denotes the subcellular locations of $x_k$. For each protein $x_k$, if it inhabits the $i$-th subcellular location, we mark $y_{ki} = 1$; otherwise $y_{ki} = -1$. The basic classifier $f: \mathbb{R}^d \to 2^m$ is trained on $D_l$ to output a set of labels for each unseen protein.
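The data layout just described can be sketched concretely: a feature matrix of $n_l$ proteins with $d$ features each, and a $\pm 1$ label matrix over $m$ locations. The numbers and the random features below are purely illustrative.

```python
import numpy as np

# Minimal sketch of the layout described above: n_l proteins, d features,
# m candidate subcellular locations, labels coded +1 (present) / -1 (absent).
n_l, d, m = 5, 8, 3
rng = np.random.default_rng(0)

X = rng.normal(size=(n_l, d))         # each row is a feature vector x_k
Y = -np.ones((n_l, m), dtype=int)     # start with y_ki = -1 everywhere

# Suppose protein 0 resides in locations 0 and 2 (a multi-label case),
# and protein 1 resides only in location 1.
Y[0, [0, 2]] = 1
Y[1, 1] = 1

# Count how many proteins are genuinely multi-label.
multi_label_proteins = np.sum(np.sum(Y == 1, axis=1) > 1)
```

Under this encoding, a classifier output for one protein is an $m$-vector that is thresholded at zero to recover the predicted label set.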
For each candidate protein $x'_s$, if its $j$-th subcellular location is labeled by experimental/non-experimental evidence, we mark $y'_{sj} = 1_e$ / $y'_{sj} = 1_{ne}$, respectively; otherwise $y'_{sj} = -1$. Note that both $y'_{sj} = 1_e$ and $y'_{sj} = 1_{ne}$ mean $y'_{sj} = 1$; the subscripts merely record whether the positive label is obtained from an experimental or a non-experimental annotation.
In order to actively pick out useful samples from the supplementary training sample pool $D_u$, the key is to create a feasible evaluation function that measures the usefulness of a non-experimental sample and decides which samples should be added to the original training set. In this paper, the classification risk of a sample is used to reflect its usefulness, where a lower risk means a higher usefulness. For a sample $(x'_s, y'_s) \in D_u$, let $E(f, D_l, x'_s, y'_s)$ be the classification risk brought by adding $x'_s$ into the original training set; this serves as the evaluation function. Our motivation is to evaluate the risks and pick out the optimal $(x'_s, y'_s)^*$ by minimizing the maximum risk, which leads to the following min-max combinatorial optimization problem:

$(x'_s, y'_s)^* = \arg\min_{(x'_s, y'_s) \in D_u} \max_{\tilde{y}_s} E(f, D_l, x'_s, y'_s)$

where $\tilde{y}_s = [\tilde{y}_{s1}, \tilde{y}_{s2}, \cdots, \tilde{y}_{sm}]^T \in \{\pm 1\}^m$ represents the unknown actual label set of $x'_s$: for each label $\tilde{y}_{sj}$, $\tilde{y}_{sj} = 1$ if $y'_{sj} = 1_e$, $\tilde{y}_{sj} = -1$ if $y'_{sj} = -1$, but if $y'_{sj} = 1_{ne}$ then $\tilde{y}_{sj}$ may be $1$ or $-1$. $\|f\|_H^2$ is the regularization term measuring the model complexity of the classifier, where $H$ is a reproducing kernel Hilbert space endowed with kernel function $K$; $L(\cdot, \cdot)$ is a quadratic loss function and $L_w(\cdot, \cdot)$ is a weighted quadratic loss function with weight function $w(x'_s, \tilde{y}_{sj}, y'_{sj})$. For a PNEA, the associated label set is uncertain because its non-experimental label may not be the actual label, so it is hard to directly calculate its loss. Therefore, the weight function is added to reflect the probability that a non-experimental label is the actual label:

$w(x'_s, \tilde{y}_{sj}, y'_{sj}) = \begin{cases} p(\tilde{y}_{sj} = y'_{sj} \mid x'_s, y'_{sj}), & \tilde{y}_{sj} = y'_{sj} \\ 1 - p(\tilde{y}_{sj} = y'_{sj} \mid x'_s, y'_{sj}), & \tilde{y}_{sj} \ne y'_{sj} \end{cases}$

Here $p(\tilde{y}_{sj} = y'_{sj} \mid x'_s, y'_{sj})$ $(1 \le j \le m)$ is the posterior probability of the event that $\tilde{y}_{sj}$ equals $y'_{sj}$ when $x'_s$ has the non-experimental label $y'_{sj}$.
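The min-max rule above can be sketched in code. This is a hedged illustration, not the paper's closed-form solution: `risk` stands in for the full evaluation function $E(f, D_l, x'_s, y'_s)$, and the weighted loss shows only how the weight $w$ multiplies a quadratic label loss.

```python
from itertools import product

def weighted_quadratic_loss(f_out, y_tilde, y_prime, posterior):
    """Weighted quadratic loss for one label of one PNEA (a sketch):
    w * (f(x) - y~)^2, with w the posterior when the guessed label agrees
    with the annotation and its complement otherwise."""
    w = posterior if y_tilde == y_prime else 1.0 - posterior
    return w * (f_out - y_tilde) ** 2

def minmax_select(candidates, risk):
    """Pick the candidate minimizing the worst-case risk over all feasible
    actual label vectors y~_s.

    `candidates` maps a sample id to its annotation vector y'_s, whose entries
    are '1e' (experimental positive), '1ne' (non-experimental positive) or -1;
    `risk(sample_id, y_tilde)` is assumed to return the evaluation value for
    one completed label vector. Both arguments are placeholders.
    """
    def feasible(y_prime):
        # 1e -> +1, -1 -> -1; a 1ne entry may turn out to be +1 or -1.
        choices = [(1,) if v == '1e' else (-1,) if v == -1 else (1, -1)
                   for v in y_prime]
        return product(*choices)

    best_id, best_risk = None, float('inf')
    for sid, y_prime in candidates.items():
        worst = max(risk(sid, yt) for yt in feasible(y_prime))
        if worst < best_risk:
            best_id, best_risk = sid, worst
    return best_id, best_risk
```

Note that the enumeration of feasible $\tilde{y}_s$ grows exponentially in the number of non-experimental labels per protein, which stays small in practice since most proteins carry only a few location labels.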
According to the previous description of $\tilde{y}_{sj}$, it can be deduced that $p(\tilde{y}_{sj} = y'_{sj} \mid x'_s, y'_{sj}) = 1$ when $y'_{sj} = 1_e$ or $y'_{sj} = -1$.
Therefore, the posterior probability only needs to be estimated for a non-experimental label $y'_{sj} = 1_{ne}$. We use Parzen-window estimation with the Gaussian kernel [23] to estimate the posterior probability of $(x'_s, y'_{sj} = 1_{ne})$. Writing $p(\tilde{y}_{sj} \mid x'_s, y'_{sj})$ and $p(\bar{\tilde{y}}_{sj} \mid x'_s, y'_{sj})$ as shorthand for $p(\tilde{y}_{sj} = y'_{sj} \mid x'_s, y'_{sj})$ and $p(\tilde{y}_{sj} \ne y'_{sj} \mid x'_s, y'_{sj})$ respectively, Bayes' rule gives

$p(\tilde{y}_{sj} \mid x'_s, y'_{sj}) = \frac{\hat{p}(x'_s \mid \tilde{y}_{sj}) \, p(\tilde{y}_{sj} \mid y'_{sj})}{\hat{p}(x'_s \mid \tilde{y}_{sj}) \, p(\tilde{y}_{sj} \mid y'_{sj}) + \hat{p}(x'_s \mid \bar{\tilde{y}}_{sj}) \, p(\bar{\tilde{y}}_{sj} \mid y'_{sj})}$

where the class-conditional densities $\hat{p}(\cdot)$ are Parzen-window estimates with the Gaussian kernel over the training samples that do and do not carry the $j$-th location label. The prior probability $p(\tilde{y}_{sj} \mid y'_{sj})$ is the confidence of the event that $\tilde{y}_{sj} = y'_{sj}$ given $y'_{sj} = 1_{ne}$, and it is set as a parameter related to the type of the corresponding non-experimental label; $p(\bar{\tilde{y}}_{sj} \mid y'_{sj}) = 1 - p(\tilde{y}_{sj} \mid y'_{sj})$ is its complement. Substituting the weighted loss into Eq. (2), the active sample selection becomes the min-max optimization problem of Eq. (8). Expanding $\|f\|_H^2$ in terms of the kernel matrix, the evaluation function $E(f, D_l, x'_s, y'_s)$ can be simplified. Let

$L = \begin{pmatrix} L_{ll} & L_{ls} \\ L_{sl} & L_{ss} \end{pmatrix}$

be the matrix partitioned into blocks over the original training samples (subscript $l$) and the candidate sample (subscript $s$). Then, except for $\tilde{y}_s$, all other parts of Eq. (14) can be determined, and the min-max optimization problem of Eq. (8) can be solved by enumerating all feasible values of $\tilde{y}_s$ to find the optimal $(x'_s, y'_s)^*$ with the smallest $E(f, D_l, x'_s, y'_s)$. In the same way, further PNEA samples can be picked out one by one.
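The Bayes estimate above can be sketched as follows. The function names and the bandwidth `h` are illustrative; the kernel's normalization constant is dropped because it cancels between numerator and denominator.

```python
import numpy as np

def gaussian_parzen(x, samples, h=1.0):
    """Unnormalized Parzen-window density estimate with a Gaussian kernel.

    The constant factor of the Gaussian is omitted; it cancels in the
    posterior ratio below as long as the same bandwidth h is used for
    both classes.
    """
    if len(samples) == 0:
        return 0.0
    d2 = np.sum((np.asarray(samples) - np.asarray(x)) ** 2, axis=1)
    return float(np.mean(np.exp(-d2 / (2.0 * h ** 2))))

def posterior_agree(x, pos_samples, neg_samples, prior, h=1.0):
    """Sketch of p(y~_sj = y'_sj | x'_s, y'_sj = 1ne) via Bayes' rule.

    Assumptions: `pos_samples` are training proteins carrying location j,
    `neg_samples` those that do not, and `prior` is the confidence assigned
    to the non-experimental label type (e.g. 0.85 for 'Probable'). The exact
    likelihood terms of the paper's formulation may differ.
    """
    like_pos = gaussian_parzen(x, pos_samples, h)
    like_neg = gaussian_parzen(x, neg_samples, h)
    num = prior * like_pos
    den = num + (1.0 - prior) * like_neg
    return num / den if den > 0 else prior
```

A candidate lying inside the cluster of proteins that carry location $j$ gets a posterior above its prior, while one lying among proteins without that location is pulled toward zero.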
Since the usefulness of all the PNEAs is measured, the algorithm needs to decide how many samples in $D_u$ should be added to $D_l$ to help retrain the classifier. We observe a high correlation between the usefulness of the PNEA samples in the ranked set $D_r$ and the change rates of their evaluation values: once the change becomes stable, the latest added supplementary training samples have little or no effect. Based on this observation, this paper presents a simple algorithm that outputs a proper proportion of all samples in the supplementary training sample pool. First, rank all the samples in $D_u$ in ascending order of their evaluation values to compose a new ordered set $D_r$. Next, denote the evaluation value of a sample $(x'_i, y'_i)$ in $D_r$ by $E_i$ $(1 \le i \le n_u)$. The change rate of its evaluation value is then the successive difference

$R_i = E_{i+1} - E_i, \quad 1 \le i \le n_u - 1.$

For a given step of proportion $\alpha$ and the corresponding number of intervals $T = 1/\alpha$, the algorithm decides which proportion is preferred for retraining the basic classifier (e.g. for $\alpha = 10\%$, $T = 10$, and the preferred proportion is one of the percentages 10%, 20%, 30%, ..., 100%). Let $\mathrm{Num}_t$ $(1 \le t \le T)$ be the number of samples in the $t$-th interval and $I_t$ their index set; the preferred proportion $\theta$ is calculated as

$\theta = \alpha \cdot \arg\min_{t = 1, 2, \cdots, T} \frac{1}{\mathrm{Num}_t} \sum_{i \in I_t} R_i.$

Note that it is hard to theoretically prove whether the output proportion $\theta$ is the global optimum, but the subsequent simulation experiments show that $\theta$ indeed provides excellent results.
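The interval rule can be sketched as below. This is a minimal illustration under stated assumptions: the change rates are taken as successive differences of the sorted evaluation values, and interval boundaries are computed by rounding.

```python
def preferred_proportion(E_sorted, alpha=0.1):
    """Choose the proportion theta of ranked PNEAs to keep.

    Assumes `E_sorted` holds evaluation values already sorted ascending.
    Change rates R_i are successive differences; each of the T = 1/alpha
    intervals holds roughly a fraction alpha of the samples, and theta ends
    at the interval whose mean change rate is smallest, i.e. where adding
    further samples stops changing the evaluation.
    """
    n = len(E_sorted)
    T = round(1.0 / alpha)
    R = [abs(E_sorted[i + 1] - E_sorted[i]) for i in range(n - 1)]
    best_t, best_mean = 1, float("inf")
    for t in range(1, T + 1):
        lo, hi = round((t - 1) * alpha * n), round(t * alpha * n)
        chunk = R[lo:min(hi, len(R))]
        if not chunk:
            continue
        mean = sum(chunk) / len(chunk)
        if mean < best_mean:
            best_t, best_mean = t, mean
    return alpha * best_t
```

For example, with $\alpha = 20\%$ and evaluation values that plateau in the second interval, the rule returns $\theta = 40\%$.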
After selecting the top $\theta$ of the samples in $D_r$ and adding them to the original training set, the initial classifier is retrained on the new training set and its performance improves. An illustration of the work process of the proposed active sample selection strategy is shown in Fig. 1.
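The overall workflow of Fig. 1 can be summarized in a few lines. All four callables below are placeholders: `evaluate` scores one pool sample (lower means more useful), `select_top` returns the preferred proportion $\theta$, and `fit` trains a multi-label classifier; none of them mirrors a specific library API.

```python
def retrain_with_pnea(train_X, train_Y, pool, evaluate, select_top, fit):
    """End-to-end sketch: rank the PNEA pool by usefulness, keep the top
    theta, augment the original training set, and retrain the classifier.

    `pool` is a list of (features, labels) PNEA candidates.
    """
    scored = sorted(pool, key=evaluate)               # ranked set D_r
    theta = select_top([evaluate(s) for s in scored])
    keep = scored[: round(theta * len(scored))]       # top theta of D_r
    new_X = train_X + [s[0] for s in keep]            # augment D_l
    new_Y = train_Y + [s[1] for s in keep]
    return fit(new_X, new_Y), keep
```

In the real method, `evaluate` would embed the min-max risk computation, so ranking the pool is the expensive step; the final retraining happens once on the augmented set.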

Evaluation Metrics
In order to comprehensively evaluate the active sample selection method and compare classifier performance with and without the proposed approach, several common evaluation metrics are used. Here, $D = \{(x_i, y_i) \mid 1 \le i \le n\}$ denotes a test set; $g(x_i)$ returns a set of proper labels of $x_i$; $h(x_i, y)$ returns a probability indicating the confidence for $y$ to be a proper label of $x_i$; and $\mathrm{rank}_h(x_i, y)$ is the rank of $y$ derived from $h(x_i, y)$. Let $\bar{y}_i$ and $\bar{g}(x_i)$ represent the complementary sets of $y_i$ and $g(x_i)$, respectively. From the counts of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) obtained by comparing $g(x_i)$ with $y_i$, three global indices, accuracy (Accu), Matthews correlation coefficient (MCC) and F1-score, are computed as

$\mathrm{Accu} = \frac{TP + TN}{TP + FP + TN + FN}, \quad \mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}, \quad \mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN},$

and three multi-label evaluation metrics, average precision (Avgprec), ranking loss (Rloss) and coverage, are computed as

$\mathrm{Avgprec} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{|y_i|} \sum_{y \in y_i} \frac{|\{y' \in y_i \mid \mathrm{rank}_h(x_i, y') \le \mathrm{rank}_h(x_i, y)\}|}{\mathrm{rank}_h(x_i, y)},$

$\mathrm{Rloss} = \frac{1}{n} \sum_{i=1}^{n} \frac{|\{(y, y') \in y_i \times \bar{y}_i \mid h(x_i, y) \le h(x_i, y')\}|}{|y_i| \, |\bar{y}_i|},$

$\mathrm{Coverage} = \frac{1}{n} \sum_{i=1}^{n} \max_{y \in y_i} \mathrm{rank}_h(x_i, y) - 1.$
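Two of the multi-label metrics above can be sketched directly from their definitions; `scores` plays the role of $h(x_i, \cdot)$ and labels are coded $\pm 1$ as elsewhere in the paper. Using 0-based ranks below absorbs the "$-1$" in the coverage formula.

```python
import numpy as np

def coverage(scores, Y):
    """Coverage: average (0-based) depth down the ranked label list needed
    to cover all true labels of each test protein."""
    total = 0
    for s, y in zip(scores, Y):
        order = np.argsort(-np.asarray(s))              # best label first
        ranks = {lab: r for r, lab in enumerate(order)} # 0-based ranks
        total += max(ranks[j] for j in range(len(y)) if y[j] == 1)
    return total / len(scores)

def ranking_loss(scores, Y):
    """Rloss: fraction of (relevant, irrelevant) label pairs mis-ordered
    by the confidence scores, averaged over test proteins."""
    total = 0.0
    for s, y in zip(scores, Y):
        pos = [j for j in range(len(y)) if y[j] == 1]
        neg = [j for j in range(len(y)) if y[j] != 1]
        if not pos or not neg:
            continue
        bad = sum(1 for p in pos for q in neg if s[p] <= s[q])
        total += bad / (len(pos) * len(neg))
    return total / len(scores)
```

For instance, a protein with true labels at positions 0 and 2 and scores `[0.9, 0.1, 0.5]` is ranked perfectly: both true labels come before the false one, so its ranking loss is 0 and its coverage depth is 1.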

Results and Discussion
We performed several simulation experiments to evaluate the performance of the proposed approach through both sub-sampling (10-fold cross-validation) and independent dataset tests, using the three groups of datasets mentioned in the section ''Material and Methods''. In the sub-sampling tests, we performed multiple rounds of randomization of the original training and testing data on each benchmark dataset. In the independent dataset tests, the benchmark datasets were used directly as the original training sets, and the new independent test sets were adopted for testing. The amphiphilic pseudo amino acid composition [24] was employed as the feature extraction technique to represent a protein sequence; the protein sequences were formulated with a valid mathematical expression through the public online server PseAAC at http://www.csbio.sjtu.edu.cn/bioinf/PseAA/. Details of PseAAC can be found in reference [25]. In this study, the amino acid characters were empirically chosen to be Hydrophobicity, Hydrophilicity and Mass; the weight factor was 0.4, and the lambda parameter was 5. Four different types of multi-label classification models, including IMKNN [5], SVM [26], Gaussian process [10] and ML-RBF [27], were used as basic classifiers to test our algorithm. The parameters of these classifiers were assigned the same values as in the original papers, and all parameters were fixed throughout the experiments for an objective comparison.
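To make the feature dimensionality concrete, a simplified sketch of an amphiphilic pseudo amino acid composition vector with weight $w = 0.4$ and $\lambda = 5$ is given below. The two property scales are random placeholders (the actual predictor used the PseAAC web server with Hydrophobicity, Hydrophilicity and Mass), and the normalization uses absolute sums to keep the denominator positive, which simplifies the original formulation.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(1)
H1 = dict(zip(AA, rng.normal(size=20)))   # placeholder hydrophobicity scale
H2 = dict(zip(AA, rng.normal(size=20)))   # placeholder hydrophilicity scale

def am_pseaac(seq, lam=5, w=0.4):
    """Sketch of an amphiphilic PseAAC vector: 20 amino acid frequencies
    followed by 2*lam sequence-order correlation factors (one per property
    scale per gap k), all jointly normalized."""
    L = len(seq)
    freq = np.array([seq.count(a) for a in AA], dtype=float) / L
    taus = []
    for k in range(1, lam + 1):            # sequence-order factors
        for H in (H1, H2):
            taus.append(sum(H[seq[i]] * H[seq[i + k]]
                            for i in range(L - k)) / (L - k))
    taus = np.array(taus)
    denom = freq.sum() + w * np.abs(taus).sum()   # simplified normalization
    return np.concatenate([freq, w * taus]) / denom
```

With $\lambda = 5$ and two property scales, each protein maps to a $20 + 2\lambda = 30$ dimensional feature vector.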
The overall performances of the above classification algorithms were compared under three conditions: not using the proposed active sample selection (i.e., using no supplementary training samples), using the proposed active sample selection with the preferred proportion (top $\theta$) of the supplementary training samples, and directly using all PNEA samples in the supplementary training sample pool. In the experiments, the kernel function $K$ was the same Gaussian kernel used for estimating the posterior probability in Eq. (6). The prior probabilities of the three levels of non-experimental labels were set according to the strength of evidence of the three non-experimental label types: the prior probability for the ''Probable'' label was set to be the largest, the value for ''Potential'' was medium, and that for ''By similarity'' was the smallest. We tested several values for the prior probabilities and finally chose the group of values with the best results: 0.85 for ''Probable'', 0.8 for ''Potential'' and 0.75 for ''By similarity''. The step of proportion $\alpha$ was set to 10% and the number of intervals $T$ was 10.
Through the numerical experiments, we observe that the preferred proportions of active sample selection differ across datasets: $\theta = 40\%$ for the virus PNEA samples, $\theta = 70\%$ for plant, and $\theta = 50\%$ for Gram-negative bacteria. The comparisons of the performances of these classification models using none, the preferred proportion, and all of the samples in the supplementary training sample pool are shown in Tables 3-6.
Table 5. Results for different basic classifiers (mean±SD) by using varied numbers of supplementary training data, trained and tested in 10-fold cross-validation on the Gram-negative bacteria dataset.
Tables 3, 4 and 5 show the average values of 10 randomizations of the 10-fold cross-validation measures and their standard deviations, and Table 6 shows the results of the independent dataset test. For each evaluation metric, ''↑'' means the bigger the metric value, the better the performance, and ''↓'' means the smaller the metric value, the better the performance. It can be seen that, in every case, the classifier using the supplementary training data selected by the proposed approach performs better than the basic classifier using no supplementary training samples. Additionally, the results under the proposed approach are superior to those of indiscriminately using all the data in the supplementary training data pool. From the simulation results, it can be concluded that, on one hand, the improvement over the original prediction indicates that the selected PNEA samples are useful and indeed provide helpful information for prediction; on the other hand, the better performance of the active sample selection over directly using all samples in the supplementary training sample pool indicates that a portion of the PNEA samples disrupts the prediction because they may carry inaccurate information.
Therefore, an effective active sample selection is important for choosing a proper amount of valuable PNEA samples and reducing the possibility of prediction disturbance brought by redundant supplementary training data. We also observed that the performance improvements of all the classification models are related to the size of the original training set. For the virus cases, with the least original training data, each classifier's performance improvement is superior to that on the other two datasets. On the contrary, for the Gneg cases, with the most original training data, the improvement is the smallest. We attribute this to the fact that an original training set with less data suffers a greater data shortage, so the basic classifiers benefit more from incrementally adding useful supplementary training data. Since it does not depend on the underlying classification model, the proposed active sample selection strategy provides a generic approach for existing prediction algorithms.
It is worth noting that the labels of PNEAs can ultimately be confirmed only by experiments. To validate that the proposed strategy is more useful than conventional analysis based simply on PEAs, it would be better to test it via additional biological experiments: if we could show on PNEA data that the strategy finds true positives and rejects true negatives validated against biological observation of the characteristics of these proteins, the effectiveness of this approach would be further verified. However, in our work, it is difficult to directly conduct such biological experiments. Instead, we tried to find true positive and true negative PNEA samples that can be validated against a biological observation in the Swiss-Prot databank. Unfortunately, we found few true positives (e.g. the non-experimental annotation ''Golgi apparatus'' of the plant protein with entry number ''Q9M2T1'' has been verified experimentally) and no true negatives. Although true positives can be found in this way, the number of samples identified is too small to provide sufficient support for this study, so the related results of these few proteins are not included in this paper. Moreover, the objective of this study is not to identify true positive proteins, but to build protein subcellular localization prediction tools with better accuracy with the help of non-experimental proteins. According to the results in Tables 3-6, the increase in accuracy over the conventional algorithms after training with these PNEAs indicates that the proposed strategy works. Therefore, the proposed method can be considered potentially significant even without experimental biological validation. Nevertheless, it is still worthwhile to perform a biological validation of our algorithm, and we hope to cooperate with biochemists to improve this method in the future.
In summary, in order to overcome the shortage of experimental training data in the prediction of protein subcellular location, we mined the proteins with non-experimental annotations and designed a novel active sample selection strategy to find useful PNEA samples. As supplementary training data, these selected samples helped retrain and improve the original basic classifiers. This approach, based on the min-max view, provides a systematic way to measure the usefulness of a sample with multiple labels. From the results, it can be clearly seen that the proposed algorithm is valid and significantly increases the prediction performance of all four types of classifiers.
We believe that active sample selection techniques from machine learning can serve as a powerful tool to alleviate the data shortage problem and could be extended to other real-world data mining applications. We also expect that the information carried by the huge number of proteins with non-experimental annotations can be applied to other biological problems. Furthermore, to make the presented method available for comparison with other predictors by interested users, we will work to provide an online prediction web server with practical value in future work.

Supporting Information
Material S1 This includes Tables S1-S3.