HybridGO-Loc: Mining Hybrid Features on Gene Ontology for Predicting Subcellular Localization of Multi-Location Proteins

Protein subcellular localization prediction, as an essential step to elucidate the in vivo functions of proteins and to identify drug targets, has been extensively studied over the past decades. Instead of only determining the subcellular localization of single-label proteins, recent studies have focused on predicting both single- and multi-location proteins. Computational methods based on Gene Ontology (GO) have been demonstrated to be superior to methods based on other features. However, existing GO-based methods focus on the occurrences of GO terms and disregard their relationships. This paper proposes a multi-label subcellular-localization predictor, namely HybridGO-Loc, which leverages not only the GO term occurrences but also the inter-term relationships. This is achieved by hybridizing the GO frequencies of occurrences and the semantic similarity between GO terms. Given a protein, a set of GO terms is retrieved by searching against the gene ontology database, using the accession numbers of homologous proteins obtained via BLAST search as the keys. The frequency of GO occurrences and the semantic similarity (SS) between GO terms are used to formulate frequency vectors and semantic similarity vectors, respectively, which are subsequently hybridized to construct fusion vectors. An adaptive-decision based multi-label support vector machine (SVM) classifier is proposed to classify the fusion vectors. Experimental results based on recent benchmark datasets and a new dataset containing novel proteins show that the proposed hybrid-feature predictor significantly outperforms predictors based on individual GO features as well as other state-of-the-art predictors. For readers' convenience, the HybridGO-Loc server, which is for predicting virus or plant proteins, is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.


Introduction
Localization to the appropriate physiological context within a cell is of paramount importance for proteins to exert their biological functions. Subcellular localization is essential to the functions of proteins and has been suggested as a means to maximize functional diversity and economize on protein design and synthesis [1]. Aberrant protein subcellular localization is closely correlated with a broad range of human diseases, such as Alzheimer's disease [2], kidney stones [3], primary human liver tumors [4], breast cancer [5], pre-eclampsia [6] and Bartter syndrome [7]. Knowing where a protein resides within a cell can provide insights into drug-target identification and drug design [8,9]. Wet-lab experiments such as fluorescent microscopy imaging, cell fractionation and electron microscopy are the gold standard for validating subcellular localization and are essential for the design of high-quality localization databases such as The Human Protein Atlas (http://www.proteinatlas.org/). However, wet-lab experiments are time-consuming and laborious. With the avalanche of newly discovered protein sequences in the post-genomic era, computational methods are required to help biologists handle large-scale proteomic data and determine the subcellular localization of proteins.
Conventionally, subcellular-localization predictors can be roughly divided into sequence-based and annotation-based. Sequence-based methods use (1) amino-acid compositions [10,11], (2) sequence homology [12,13], and (3) sorting signals [14,15] as features. Annotation-based methods use information beyond the protein sequences, such as Gene Ontology (GO) terms [16][17][18][19][20][21], Swiss-Prot keywords [22], and PubMed abstracts [23,24]. A number of studies have demonstrated that methods based on GO information are superior to methods based on sequence-based features [25][26][27][28]. Note that the GO database contains not only experimental data but also predicted data (http://www.geneontology.org/GO.evidence.shtml), which may be determined by sequence-based methods. From this point of view, GO-based prediction, which uses the GO annotation database to retrieve GO terms, is a filtering method for sequence-based predictions.
The GO comprises three orthogonal taxonomies whose terms describe the cellular components, biological processes, and molecular functions of gene products. The GO terms in each taxonomy are organized within a directed acyclic graph. These terms are placed within structural relationships, the most important of which are the 'is-a' relationship (parent and child) and the 'part-of' relationship (part and whole) [29,30]. Recently, the GO has been enriched with more structural relationships, such as 'positively-regulates', 'negatively-regulates' and 'has-part' [31,32]. These relationships imply that the GO hierarchy for each taxonomy contains redundant information, from which semantic similarity between GO terms can be derived.
Instead of only determining the subcellular localization of single-label proteins, recent studies have been focusing on predicting both single- and multi-location proteins. Since there exist multi-location proteins that can simultaneously reside at, or move between, two or more subcellular locations, it is important to include these proteins in the predictors. Indeed, multi-location proteins play important roles in some metabolic processes that take place in more than one cellular compartment, e.g., fatty acid β-oxidation in the peroxisome and mitochondria, and antioxidant defense in the cytosol, mitochondria and peroxisome [33].
Since the relationship between GO terms reflects the association between different gene products, protein sequences annotated with GO terms can be compared on the basis of semantic similarity measures. Semantic similarity over the GO has been extensively studied and applied to many biological problems, including protein function prediction [42,43], subnuclear localization prediction [44], protein-protein interaction inference [45][46][47] and microarray clustering [48]. The performance of these predictors depends on whether the similarity measure is relevant to the biological problem at hand. Over the years, a number of semantic similarity measures have been proposed, some of which originated in natural language processing. Semantic similarity measures can be applied at the GO-term level or at the gene-product level. At the GO-term level, methods are roughly categorized as node-based and edge-based. Node-based measures rely on the concept of the information content of terms, which was proposed by Resnik [49] for natural language processing. Later, Lord et al. [50] applied this idea to measure the semantic similarity among GO terms. Lin et al. [51] proposed a method based on information theory and structural information. Subsequently, more node-based measures [52][53][54] were proposed. Edge-based measures use the length or the depth of different paths between terms and/or their common ancestors [55][56][57][58]. At the gene-product level, the two most common approaches are pairwise [59][60][61][62][63] and groupwise [64][65][66][67]. Pairwise approaches measure the similarity between two gene products by combining the semantic similarities between their terms. Groupwise approaches, on the other hand, directly treat the GO terms of a gene product as a set, a graph or a vector, and then calculate the similarity by set-similarity, graph-matching or vector-similarity techniques. More recently, Pesquita et al. [68] reviewed the semantic similarity measures applied to biomedical ontologies, and Guzzi et al. [69] provided a comprehensive review of the relationship between semantic similarity measures and biological features.
This paper proposes a multi-label predictor, based on hybridizing the frequency of occurrences of GO terms and the semantic similarity between the terms, for protein subcellular localization prediction. Compared to existing multi-label subcellular-localization predictors, our proposed predictor has the following advantages: (1) it formulates the feature vectors by hybridizing GO frequency-of-occurrence features and GO semantic-similarity features, which together contain richer information than GO term frequencies alone; (2) it adopts a new strategy to incorporate richer and more useful homologous information from more distant homologs rather than using only the top homologs; (3) it adopts an adaptive decision strategy for multi-label SVM classifiers so that it can effectively deal with datasets containing both single-label and multi-label proteins. Results on two recent benchmark datasets and a new dataset containing novel proteins demonstrate that these three properties enable the proposed predictor to accurately predict multi-location proteins and to outperform several state-of-the-art predictors.

Legitimacy of Using GO Information
Despite their good performance, GO-based methods have received some criticisms from the research community. The main argument is that cellular-component GO terms directly correspond to the cellular-component categories, i.e., if the GO terms are known, the subcellular locations will also be known. The prediction problem could therefore be trivially solved by creating a lookup table with the cellular-component GO terms as the keys and the cellular-component categories as the hashed values. Such a naive solution, however, leads to very poor prediction performance, as demonstrated and explained in our previous studies [28,39]. A number of studies [70][71][72] by other groups also strongly support the legitimacy of using GO information for subcellular localization. For example, as suggested by [72], the good performance of GO-based methods is due to the high representation power of the GO space as compared to the Euclidean feature spaces used by conventional sequence-based methods.

Retrieval of GO Terms
The proposed predictor can use either the accession numbers (AC) or amino acid (AA) sequences of query proteins as input. Specifically, for proteins with known ACs, their respective GO terms are retrieved from the Gene Ontology annotation (GOA) database (http://www.ebi.ac.uk/GOA) using the ACs as the searching keys. For proteins without ACs, their AA sequences are presented to BLAST [73] to find their homologs, whose ACs are then used as keys to search against the GOA database.
While the GOA database allows us to associate the AC of a protein with a set of GO terms, for some novel proteins, neither their ACs nor the ACs of their top homologs have any entries in the GOA database; in other words, no GO terms can be retrieved by using their ACs or the ACs of their top homologs. In such cases, the ACs of the homologous proteins, as returned from the BLAST search, are successively used to search against the GOA database until a match is found. Given the rapid growth of the GOA database, it is reasonable to assume that at least one homolog of each query protein has at least one GO term [17]. Thus, it is not necessary to use back-up methods to handle the situation where no GO terms can be found. The procedure is outlined in Fig. 1.
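The fallback procedure above can be sketched as follows. The `goa_lookup` dictionary stands in for a query to the GOA database, and all accession numbers are made up for illustration:

```python
def retrieve_go_terms(query_ac, homolog_acs, goa_lookup):
    """Retrieve GO terms for a protein, falling back on successively
    more distant BLAST homologs when the GOA database has no entry.

    `goa_lookup` maps an accession number (AC) to a list of GO terms;
    it is a hypothetical stand-in for searching the GOA database.
    `homolog_acs` is assumed to be ordered by descending BLAST similarity.
    """
    for ac in [query_ac] + list(homolog_acs):
        terms = goa_lookup.get(ac, [])
        if terms:            # stop at the first accession with annotations
            return terms
    return []                # assumed never to happen in practice (see text)

# Toy GOA table: the query and its top homolog have no annotations,
# but a more distant homolog does.
goa = {"P11111": [], "P22222": ["GO:0005634", "GO:0005737"]}
print(retrieve_go_terms("Q00001", ["P11111", "P22222"], goa))
```

Here the query "Q00001" and top homolog "P11111" yield nothing, so the terms of the second homolog "P22222" are used.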

GO Frequency Features
Let W denote a set of distinct GO terms corresponding to a data set. W is constructed in two steps: (1) identifying all of the GO terms in the dataset and (2) removing the repetitive GO terms.
Suppose W distinct GO terms are found, i.e., |W| = W; these GO terms form a GO Euclidean space with W dimensions. For each sequence in the dataset, a GO vector is constructed by matching its GO terms against W, using the number of occurrences of individual GO terms in W as the coordinates. Specifically, the GO vector p_i of the i-th protein is defined as:

p_i = [f_{i,1}, f_{i,2}, ..., f_{i,W}]^T,   (1)

where f_{i,j} is the number of occurrences of the j-th GO term (term frequency) in the i-th protein sequence. The rationale is that the term frequencies contain important information for classification. Note that the f_{i,j}'s are analogous to the term frequencies commonly used in document retrieval. Similarly, for the t-th query protein Q_t, the GO frequency vector is defined as:

q_t^F = [f_{t,1}, f_{t,2}, ..., f_{t,W}]^T.   (2)

In the following sections, we use the superscript F to denote the GO frequency features in Eq. 2.
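A minimal sketch of the frequency-vector construction, assuming GO terms are given as plain strings (the toy dataset below is made up):

```python
def build_go_frequency_vector(protein_terms, distinct_terms):
    """Map a protein's (possibly repeated) GO terms onto the W-dimensional
    Euclidean space spanned by the dataset's distinct GO terms, using
    term frequencies as coordinates."""
    return [protein_terms.count(t) for t in distinct_terms]

# W is built from the whole dataset: collect all GO terms, drop repeats.
dataset = [["GO:0005634", "GO:0005737", "GO:0005634"],
           ["GO:0005737", "GO:0016020"]]
W = sorted({t for terms in dataset for t in terms})   # distinct GO terms
vectors = [build_go_frequency_vector(p, W) for p in dataset]
print(W)        # ['GO:0005634', 'GO:0005737', 'GO:0016020']
print(vectors)  # [[2, 1, 0], [0, 1, 1]]
```

The first protein carries "GO:0005634" twice, so its coordinate on that axis is 2, mirroring the term-frequency idea from document retrieval.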

Semantic-Similarity Features
Semantic similarity (SS) is a measure for quantifying the similarity between categorical data (e.g., words in documents), where the notion of similarity is based on the likeness of meanings in the data. It was originally developed by Resnik [49] for natural language processing. The idea is to evaluate semantic similarity in an 'is-a' taxonomy using the shared information content of categorical data. In the context of gene ontology, the semantic similarity between two GO terms is based on their most specific common ancestor in the GO hierarchy. The relationships between GO terms in the GO hierarchy, such as the 'is-a' (ancestor-child) or 'part-of' (part-whole) relationships, can be obtained from the SQL database through the link: http://archive.geneontology.org/latest-termdb/go_daily-termdb-tables.tar.gz. Note that only the 'is-a' relationship is considered here for semantic similarity analysis [51]. Specifically, the semantic similarity between two GO terms x and y is defined as [49]:

sim_Resnik(x, y) = max_{c ∈ A(x,y)} [ -log p(c) ],   (3)

where A(x, y) is the set of common ancestor GO terms of x and y, and p(c) is the number of gene products annotated to the GO term c divided by the total number of gene products annotated in the GO taxonomy. While Resnik's measure is effective in quantifying the shared information between two GO terms, it ignores the distance between the terms and their common ancestors in the GO hierarchy. To further incorporate structural information from the GO hierarchy into the similarity measure, we have explored three extensions of Resnik's measure, namely Lin's measure [51], Jiang's measure [74], and relevance similarity (RS) [52].
Given two GO terms x and y, the similarity by Lin's measure is:

sim_Lin(x, y) = [ 2 max_{c ∈ A(x,y)} log p(c) ] / [ log p(x) + log p(y) ].   (4)

The similarity by Jiang's measure is:

sim_Jiang(x, y) = 1 - min{ 1, -log p(x) - log p(y) + 2 max_{c ∈ A(x,y)} log p(c) }.   (5)

The similarity by RS is calculated as:

sim_RS(x, y) = max_{c ∈ A(x,y)} { [ 2 log p(c) ] / [ log p(x) + log p(y) ] · (1 - p(c)) }.   (6)

Among the three measures, sim_Lin(x,y) and sim_Jiang(x,y) are relative measures that are proportional to the difference in information content between the terms and their common ancestors, which is independent of the absolute information content of the ancestors. On the other hand, sim_RS(x,y) incorporates the probability of annotating the common ancestors as a weighting factor to Lin's measure. To simplify notation, we refer to sim_Lin(x,y), sim_Jiang(x,y) and sim_RS(x,y) as sim_1(x,y), sim_2(x,y) and sim_3(x,y), respectively.
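The four term-level measures can be sketched as follows, assuming the annotation probabilities p(x), p(y) and the probabilities of the common ancestors in A(x, y) have already been computed from the annotation database (the function names and interfaces here are ours, not the paper's):

```python
import math

def ic(p):
    """Information content of a term with annotation probability p."""
    return -math.log(p)

def mica_ic(ancestor_probs):
    # IC of the most informative (least probable) common ancestor.
    return max(ic(pc) for pc in ancestor_probs)

def sim_resnik(px, py, ancestor_probs):
    # px, py unused: Resnik depends only on the shared ancestor.
    return mica_ic(ancestor_probs)

def sim_lin(px, py, ancestor_probs):
    return 2 * mica_ic(ancestor_probs) / (ic(px) + ic(py))

def sim_jiang(px, py, ancestor_probs):
    # 1 minus Jiang's semantic distance, clipped at 1.
    return 1 - min(1.0, ic(px) + ic(py) - 2 * mica_ic(ancestor_probs))

def sim_rs(px, py, ancestor_probs):
    # Lin's measure weighted by 1 - p(c*) of the least probable ancestor
    # (a simplification: the max over ancestors is taken at the MICA).
    return sim_lin(px, py, ancestor_probs) * (1 - min(ancestor_probs))

# A term compared with itself (its own MICA) scores 1 under Lin and Jiang:
print(sim_lin(0.1, 0.1, [0.1]), sim_jiang(0.1, 0.1, [0.1]))
```

Note how a rarer (more informative) common ancestor raises all four scores, which is the structural information the text describes.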
Based on the semantic similarity between two GO terms, we adopted a continuous measure proposed in [48] to calculate the similarity between two proteins. Specifically, given two proteins, we retrieved their corresponding GO term sets P_i and P_j as described in the subsection ''Retrieval of GO Terms''. (Note that, strictly speaking, P_i should be P_{i,k_i}, where k_i indexes the homolog used to retrieve the GO terms for the i-th protein. To simplify notation, we write it as P_i.) Then, we computed the semantic similarity between the two sets of GO terms {P_i, P_j} as follows:

S_l(P_i, P_j) = (1 / |P_i|) Σ_{x ∈ P_i} max_{y ∈ P_j} sim_l(x, y),   (7)

where l ∈ {1,2,3} and sim_l(x,y) is defined in Eq. 4 to Eq. 6. S_l(P_j, P_i) is computed in the same way by swapping P_i and P_j. Finally, the overall similarity between the two proteins is given by:

Sim_l(P_i, P_j) = (1/2) [ S_l(P_i, P_j) + S_l(P_j, P_i) ],   (8)

where l ∈ {1,2,3}. In the sequel, we refer to the SS measures by Lin, Jiang and RS as SS1, SS2 and SS3, respectively. Thus, for a testing protein Q_t with GO term set P_t, a GO semantic similarity (SS) vector q_t^{S_l} can be obtained by computing the semantic similarity between Q_t and each of the N training proteins. Thus, Q_t can be represented by an N-dimensional vector:

q_t^{S_l} = [ Sim_l(P_t, P_1), Sim_l(P_t, P_2), ..., Sim_l(P_t, P_N) ]^T,   (9)

where l ∈ {1,2,3}. In other words, q_t^{S_l} represents the SS vector using the l-th SS measure.
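A sketch of the protein-level measure: each direction averages the best term match in the other protein, and the two directional scores are averaged. The term-level similarity used below is a toy function, purely illustrative:

```python
def protein_similarity(terms_i, terms_j, term_sim):
    """Symmetric protein-level semantic similarity (continuous measure
    adapted from [48]): average each term's best match in the other
    protein, in both directions, then average the two directions."""
    def s(a, b):  # directional score S_l(a, b)
        return sum(max(term_sim(x, y) for y in b) for x in a) / len(a)
    return 0.5 * (s(terms_i, terms_j) + s(terms_j, terms_i))

# Toy term-level similarity, purely illustrative:
# 1 for identical terms, 0.5 for any two distinct terms.
def toy_sim(x, y):
    return 1.0 if x == y else 0.5

P1 = ["GO:0005634", "GO:0005737"]
P2 = ["GO:0005634"]
print(protein_similarity(P1, P2, toy_sim))   # 0.875
```

An SS vector for a test protein is then simply `[protein_similarity(P_t, P_i, term_sim) for P_i in training_term_sets]`, giving one coordinate per training protein.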

Hybridization of Two GO Features
As can be seen from the subsections ''GO Frequency Features'' and ''Semantic-Similarity Features'', the GO frequency features (Eq. 2) use the frequency of occurrences of GO terms, while the GO SS features (Eq. 4 to Eq. 6) use the semantic similarity between GO terms. These two kinds of features are developed from two different perspectives. It is therefore reasonable to believe that the two kinds of information complement each other. Based on this assumption, we combine the two GO features to form a hybridized vector:

q_t^{H_l} = [ (q_t^F)^T, (q_t^{S_l})^T ]^T,   (10)

where l ∈ {1,2,3}. In other words, q_t^{H_l} represents the hybrid feature vector combining the GO frequency features and the SS features derived from the l-th SS measure. We refer to them as Hybrid1, Hybrid2 and Hybrid3, respectively.
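Reading Eq. 10 as vector concatenation, the hybridization step is a one-liner; the dimensions below are illustrative only:

```python
def hybridize(freq_vec, ss_vec):
    """Fusion vector q_t^{H_l}: the W-dimensional GO frequency vector
    concatenated with the N-dimensional SS vector (our reading of
    Eq. 10 as simple concatenation)."""
    return list(freq_vec) + list(ss_vec)

q_F = [2, 1, 0]         # term frequencies over W = 3 distinct GO terms
q_S1 = [0.87, 0.31]     # SS1 similarities to N = 2 training proteins
q_H1 = hybridize(q_F, q_S1)
print(len(q_H1))        # W + N = 5
```

The fusion vector thus lives in a (W + N)-dimensional space, combining occurrence counts with protein-level similarities.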

Multi-label Multi-class SVM Classification
The hybridized-feature vectors obtained from the previous subsection are used for training multi-label one-vs-rest support vector machines (SVMs). Specifically, for an M-class problem (here M is the number of subcellular locations), M independent binary SVMs are trained, one for each class. Denote the hybrid GO vector of the t-th query protein using the l-th SS measure as q_t^{H_l}. Given the t-th query protein Q_t, the score of the m-th SVM using the l-th SS measure is:

s_{m,l}(Q_t) = Σ_{r ∈ S_{m,l}} α_{m,r} y_{m,r} K( p_r^{H_l}, q_t^{H_l} ) + b_{m,l},   (11)

where q_t^{H_l} is the hybrid GO vector derived from Eq. 10, p_r^{H_l} is the hybrid GO vector of the r-th training protein, S_{m,l} is the set of support-vector indexes corresponding to the m-th SVM, α_{m,r} are the Lagrange multipliers, b_{m,l} is the bias term, y_{m,r} ∈ {-1, +1} indicates whether the r-th training protein belongs to the m-th class or not, and K(·,·) is a kernel function. Here, the linear kernel was used.
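A sketch of the per-class SVM score with a linear kernel; the support vectors, multipliers and bias in the usage below are toy values, not trained parameters:

```python
def svm_score(q, support_vectors, alphas, labels, bias):
    """Decision score of one binary one-vs-rest SVM with a linear
    kernel K(a, b) = a . b (an Eq. 11 style score; inputs are toy
    values rather than the result of actual SVM training)."""
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    return sum(a * y * dot(sv, q)
               for sv, a, y in zip(support_vectors, alphas, labels)) + bias

# Two support vectors with equal weight and opposite class labels:
score = svm_score([1.0, 0.0],
                  [[1.0, 0.0], [0.0, 1.0]],   # support vectors
                  [0.5, 0.5],                 # Lagrange multipliers
                  [1, -1],                    # y in {-1, +1}
                  0.0)                        # bias
print(score)   # 0.5
```

In practice an off-the-shelf linear SVM would supply these parameters; the point here is only how the score in Eq. 11 is assembled from them.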
Unlike the single-label problem, where each protein has only one predicted label, a multi-label protein can have more than one predicted label. In this work, we compared two different decision schemes for this multi-label problem. In the first scheme, the predicted subcellular location(s) of the t-th query protein are given by:

M(Q_t) = { m : s_{m,l}(Q_t) > 0, m = 1, ..., M },  if such m exists;
M(Q_t) = { arg max_{m=1}^{M} s_{m,l}(Q_t) },       otherwise.   (12)

The second scheme is an improved version of the first one in that the decision threshold depends on the test protein.
Specifically, the predicted subcellular location(s) of the t-th query protein are given by:

M(Q_t) = { m : s_{m,l}(Q_t) > f( s_{max,l}(Q_t) ), m = 1, ..., M },  if any s_{m,l}(Q_t) > 0;
M(Q_t) = { arg max_{m=1}^{M} s_{m,l}(Q_t) },                          otherwise.   (13)

In Eq. 13, f(s_{max,l}(Q_t)) is a function of s_{max,l}(Q_t), where s_{max,l}(Q_t) = max_{m=1}^{M} s_{m,l}(Q_t). In this work, we used a linear function:

f( s_{max,l}(Q_t) ) = θ · s_{max,l}(Q_t),   (14)

where θ ∈ [0.0, 1.0] is a hyper-parameter that can be optimized through cross-validation.

In fact, besides SVMs, many other machine-learning models, such as hidden Markov models (HMMs) and neural networks (NNs) [75,76], have been used in protein subcellular-localization predictors. However, HMMs and NNs are not suitable for GO-based predictors because of the high dimensionality of GO vectors: under such conditions, HMMs and NNs can easily be overtrained, leading to poor performance. Linear SVMs, on the other hand, handle high-dimensional data well, because they can still find an optimal solution even when the number of training samples is smaller than the feature dimension.
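The adaptive decision scheme can be sketched as follows, under our reading of Eq. 13 that a class is predicted whenever its score exceeds θ times the maximum score, with the top-scoring class always retained:

```python
def adaptive_multilabel_decision(scores, theta):
    """Adaptive decision for multi-label one-vs-rest SVMs: predict every
    class whose score exceeds theta * max score, and fall back to the
    single top-scoring class when no score is positive."""
    s_max = max(scores)
    threshold = theta * s_max               # f(s_max) = theta * s_max
    labels = [m for m, s in enumerate(scores) if s > threshold]
    if not labels:                          # all scores non-positive
        labels = [max(range(len(scores)), key=lambda m: scores[m])]
    return labels

# Three classes clear the adaptive threshold 0.3 * 1.2 = 0.36:
print(adaptive_multilabel_decision([1.2, 0.5, -0.3, 0.9], 0.3))  # [0, 1, 3]
```

When the maximum score is positive, the threshold θ·s_max sits below s_max (since θ < 1), so the top class is always predicted; when all scores are non-positive, the fallback to arg max keeps the prediction non-empty, matching the second branch of Eq. 13.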

Datasets
In this paper, a virus dataset [35,37] and a plant dataset [36] were used to evaluate the performance of the proposed predictor. The virus and the plant datasets were created from Swiss-Prot 57.9 and 55.3, respectively. The virus dataset contains 207 viral proteins distributed in 6 locations. Of the 207 viral proteins, 165 belong to one subcellular location, 39 to two locations, 3 to three locations and none to four or more locations. This means that about 20% of the proteins in the dataset are located in more than one subcellular location. The plant dataset contains 978 plant proteins distributed in 12 locations. Of the 978 plant proteins, 904 belong to one subcellular location, 71 to two locations, 3 to three locations and none to four or more locations. The sequence identity of both datasets was cut off at 25%.
The breakdown of these two datasets is shown in Figs. 2(a) and 2(b). Fig. 2(a) shows that the majority (68%) of viral proteins in the virus dataset are located in the host cytoplasm and host nucleus, while proteins located in the remaining subcellular locations together account for only around one third. This means that this multi-label dataset is imbalanced across the six subcellular locations. Similar conclusions can be drawn from Fig. 2(b), where most of the plant proteins exist in the chloroplast, cytoplasm, nucleus and mitochondrion, while proteins in the other 8 subcellular locations together account for less than 30%. This imbalance makes prediction on these two multi-label datasets difficult. The two benchmark datasets are downloadable from the hyperlinks in the HybridGO-Loc server.

Performance Metrics
Compared to traditional single-label classification, multi-label classification requires more elaborate performance metrics to better reflect the multi-label capabilities of classifiers. Conventional single-label measures need to be modified for multi-label classification. These measures include Accuracy, Precision, Recall, F1-score (F1) and Hamming Loss (HL) [77,78]. Specifically, denote L(Q_i) and M(Q_i) as the true label set and the predicted label set for the i-th protein Q_i (i = 1, ..., N), respectively. Here, N = 207 for the virus dataset and N = 978 for the plant dataset. Then the five measures are defined as follows:

Accuracy = (1/N) Σ_{i=1}^{N} |M(Q_i) ∩ L(Q_i)| / |M(Q_i) ∪ L(Q_i)|,   (16)

Precision = (1/N) Σ_{i=1}^{N} |M(Q_i) ∩ L(Q_i)| / |M(Q_i)|,   (17)

Recall = (1/N) Σ_{i=1}^{N} |M(Q_i) ∩ L(Q_i)| / |L(Q_i)|,   (18)

F1 = (1/N) Σ_{i=1}^{N} 2 |M(Q_i) ∩ L(Q_i)| / ( |M(Q_i)| + |L(Q_i)| ),   (19)

HL = (1/N) Σ_{i=1}^{N} ( |M(Q_i) ∪ L(Q_i)| - |M(Q_i) ∩ L(Q_i)| ) / M,   (20)

where |·| counts the number of elements in the set therein, ∩ represents the intersection of sets, ∪ represents their union, and M is the number of subcellular locations. Accuracy, Precision, Recall and F1 indicate the classification performance; the higher these measures, the better the prediction. Among them, Accuracy is the most commonly used criterion. The F1-score is the harmonic mean of Precision and Recall, which allows us to compare classification systems while taking the trade-off between Precision and Recall into account. The Hamming Loss (HL) [77,78] differs from the other metrics. As can be seen from Eq. 20, when all of the proteins are correctly predicted, i.e., |M(Q_i) ∪ L(Q_i)| = |M(Q_i) ∩ L(Q_i)| for i = 1, ..., N, then HL = 0, whereas the other metrics equal 1. Conversely, when the predictions of all proteins are completely wrong, i.e., |M(Q_i) ∪ L(Q_i)| = M and |M(Q_i) ∩ L(Q_i)| = 0, then HL = 1, whereas the other metrics equal 0. Therefore, the lower the HL, the better the prediction performance.
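The five multi-label measures above can be sketched as follows; label sets are Python sets and `n_classes` plays the role of M:

```python
def multilabel_metrics(true_sets, pred_sets, n_classes):
    """Multi-label Accuracy, Precision, Recall, F1 and Hamming Loss,
    each averaged over the N proteins."""
    N = len(true_sets)
    acc = prec = rec = f1 = hl = 0.0
    for L, M in zip(true_sets, pred_sets):
        inter, union = len(L & M), len(L | M)
        acc  += inter / union
        prec += inter / len(M)
        rec  += inter / len(L)
        f1   += 2 * inter / (len(M) + len(L))
        hl   += (union - inter) / n_classes   # symmetric difference / M
    return {name: total / N for name, total in
            zip(("Accuracy", "Precision", "Recall", "F1", "HL"),
                (acc, prec, rec, f1, hl))}

# Protein 1 predicted perfectly; protein 2 misses one of its two labels.
true_sets = [{0}, {1, 2}]
pred_sets = [{0}, {1}]
print(multilabel_metrics(true_sets, pred_sets, n_classes=6))
```

On this toy pair, Accuracy is (1 + 0.5)/2 = 0.75, Precision is 1.0 (nothing over-predicted), and HL is small because only one of 6 possible labels is wrong for one protein.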
Two additional measures [37,39] are often used in multi-label subcellular localization prediction: the overall locative accuracy (OLA) and the overall actual accuracy (OAA). The former is given by:

OLA = ( 1 / Σ_{i=1}^{N} |L(Q_i)| ) Σ_{i=1}^{N} |M(Q_i) ∩ L(Q_i)|,   (21)

and the overall actual accuracy (OAA) is:

OAA = (1/N) Σ_{i=1}^{N} Δ[ M(Q_i), L(Q_i) ],   (22)

where Δ[M(Q_i), L(Q_i)] = 1 if M(Q_i) = L(Q_i), and 0 otherwise. According to Eq. 21, a locative protein is considered correctly predicted if any of the predicted labels matches a label in the true label set. On the other hand, Eq. 22 implies that an actual protein is considered correctly predicted only if all of the predicted labels match those in the true label set exactly. For example, for a protein that coexists in, say, three subcellular locations, if only two of the three are correctly predicted, or the predicted result contains a location not belonging to the three, the prediction is considered incorrect. In other words, only when all of the subcellular locations of a query protein are predicted exactly, without any over-prediction or under-prediction, is the prediction considered correct. Therefore, OAA is a more stringent measure than OLA. OAA is also more objective than OLA, because locative accuracy is liable to give biased performance measures when the predictor tends to over-predict, i.e., gives large |M(Q_i)| for many Q_i. In the extreme case, if every protein is predicted to have all of the M subcellular locations, then according to Eq. 21 the OLA is 100%; but obviously such predictions are wrong and meaningless. On the contrary, the OAA is 0% in this extreme case, which reflects the real performance.
Among all the metrics mentioned above, OAA is the most stringent and objective. This is because if only some (but not all) of the subcellular locations of a query protein are correctly predicted, the numerators of the other measures (Eqs. 16 to 21) are non-zero, whereas the numerator of OAA in Eq. 22 is 0 (thus contributing nothing to the frequency count).
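A sketch of OLA and OAA that reproduces the over-prediction argument above:

```python
def ola_oaa(true_sets, pred_sets):
    """Overall locative accuracy (OLA) and overall actual accuracy (OAA).
    OLA credits every correctly predicted location; OAA credits a protein
    only when the predicted label set matches the true set exactly."""
    locative_hits = sum(len(L & M) for L, M in zip(true_sets, pred_sets))
    locative_total = sum(len(L) for L in true_sets)
    actual_hits = sum(L == M for L, M in zip(true_sets, pred_sets))
    return locative_hits / locative_total, actual_hits / len(true_sets)

# An over-predicting classifier inflates OLA but not OAA:
true_sets = [{0}, {1, 2}]
all_labels = {0, 1, 2, 3, 4, 5}
pred_sets = [all_labels, all_labels]       # predict every location (M = 6)
ola, oaa = ola_oaa(true_sets, pred_sets)
print(ola, oaa)                            # 1.0 0.0
```

Predicting every location for every protein yields OLA = 100% yet OAA = 0%, exactly the extreme case discussed in the text.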
In statistical prediction, three methods are often used for testing the generalization capabilities of predictors: independent tests, sub-sampling tests (or K-fold cross-validation) and leave-one-out cross-validation (LOOCV). For independent tests, the selection of the independent dataset often bears some sort of arbitrariness [79]; for K-fold cross-validation, different partitionings of a dataset lead to different results, and are thus still liable to statistical arbitrariness; LOOCV, in contrast, yields a unique outcome and is considered the most rigorous and bias-free method [80]. Hence, LOOCV was used to examine the performance of all predictors in this work. A more detailed analysis of the statistical methods can be found in the supplementary materials. Note that the jackknife cross-validation in iLoc-Plant and its variants is the same as LOOCV, as mentioned in [36,79]. Because the term jackknife also refers to methods that estimate the bias and variance of an estimator [81], to avoid confusion we only use the term LOOCV in this paper.

Fig. 3(a) shows the performance of individual and hybridized GO features on the virus dataset based on leave-one-out cross-validation (LOOCV). In the figure, SS1, SS2 and SS3 represent Lin's, Jiang's and RS similarity measures, respectively. Hybrid1, Hybrid2 and Hybrid3 represent the hybridized features obtained from these measures. As can be seen, in terms of all six performance metrics, the performance of the hybrid features is remarkably better than that of the individual features, regardless of whether GO frequency features or any of the three GO SS features are used for comparison. Specifically, the OAAs (the most stringent and objective metric) of all three hybrid features are at least 3% (absolute) higher than those of the individual features, which suggests that hybridizing the two features can significantly boost the prediction performance.
Moreover, among the hybridized features, Hybrid2, which combines the GO frequency features with the GO SS features based on Jiang's measure, outperforms Hybrid1 and Hybrid3. Another interesting observation is that although all of the individual GO SS features perform much worse than the GO frequency features, the performance of the three hybridized features is still better than that of any of the individual features. This suggests that the GO frequency features and the SS features are complementary to each other. Similar conclusions can be drawn from the plant dataset shown in Fig. 3(b). However, comparison between Fig. 3(a) and Fig. 3(b) reveals that for the plant dataset, the hybridized features outperform all of the individual features in terms of all metrics except OLA and Recall, whereas for the virus dataset the former are superior in terms of all metrics. Nevertheless, the losses in these two metrics do not outweigh the significant improvement on the other metrics, especially OAA, which improves by around 3% (absolute) when hybridized features are used instead of individual features. Among the hybridized features, Hybrid2 also outperforms Hybrid1 and Hybrid3 in terms of OLA, Accuracy, Recall and F1-score, whereas Hybrid1 performs better than the others in terms of OAA and Precision. These results demonstrate that the GO SS features obtained by Lin's and Jiang's measures are better candidates than those obtained by the RS measure for combining with the GO frequency features; however, there is no evidence suggesting which of the two is better. It is also interesting that, for the plant dataset, the performance of the three individual GO SS features is better than that of the GO frequency features, contrary to the results shown in Fig. 3(a).

Table 1 and Table 2 compare the performance of the proposed predictor against several state-of-the-art multi-label predictors on the virus and plant datasets based on leave-one-out cross-validation.
Note that we used the best-performing hybridized features with the adaptive decision strategy. Specifically, for both the virus and plant datasets, the best performance was achieved when Hybrid2 and the adaptive decision strategy with θ = 0.3 were used; θ was determined by cross-validation, as stated previously. Unless stated otherwise, we use Hybrid2 to represent HybridGO-Loc in subsequent experiments. Our proposed predictor uses both the GO frequency features and the GO semantic-similarity features, whereas the other predictors use only the GO frequency of occurrences as features. From the classification perspective, Virus-mPLoc [35] uses an ensemble OET-KNN (optimized evidence-theoretic K-nearest neighbors) classifier; iLoc-Virus [37] uses a multi-label KNN classifier; KNN-SVM [38] uses an ensemble of classifiers combining KNN and SVM; mGOASVM [39] uses a multi-label SVM classifier; and the proposed predictor uses a multi-label SVM classifier incorporating the adaptive decision scheme.

Comparing with State-of-the-Art Predictors
As shown in Table 1, the proposed predictor performs significantly better than the other predictors. The OAA and OLA of the proposed predictor are more than 15% (absolute) higher than those of iLoc-Virus and Virus-mPLoc. It also performs significantly better than KNN-SVM in terms of OLA. Compared with mGOASVM, the proposed predictor performs remarkably better in all of the performance metrics, especially OAA (0.937 vs 0.889). These results demonstrate that hybridizing the GO frequency features and the GO SS features can significantly boost prediction performance, which again suggests that these two kinds of information are complementary for predicting subcellular localization. Similar conclusions can be drawn for the plant dataset from Table 2, except that the OLA of the proposed predictor is slightly worse than that of mGOASVM and the Recall is equal to that of mGOASVM. Nevertheless, these small losses do not outweigh the impressive improvement in the other metrics, especially OAA (0.936 vs 0.874).

Prediction of Novel Proteins
To further demonstrate the effectiveness of HybridGO-Loc, a newer plant dataset constructed for mGOASVM [39] was used to compare it with state-of-the-art multi-label predictors using independent tests. Specifically, this new plant dataset contains 175 plant proteins, of which 147 belong to one subcellular location, 27 belong to two locations, 1 belongs to three locations and none to four or more locations. These plant proteins were added to Swiss-Prot between 08-Mar-2011 and 18-Apr-2012. Because the plant dataset used for training the predictors was created on 29-Apr-2008, there is an almost 3-year time gap between the training data and the test data in our experiments.

Table 3 compares the performance of HybridGO-Loc against several state-of-the-art multi-label plant predictors on the new plant dataset. All of the predictors use the 978 proteins of the plant dataset (see Fig. 2(b)) for training the classifiers and are evaluated by independent tests on the 175 new proteins. As can be seen, HybridGO-Loc performs significantly better than all of the other predictors in terms of all of the performance metrics. Similar conclusions can also be drawn from the performance in individual subcellular locations.

Fig. 4 shows the distribution of the E-values of the test proteins, obtained by using the training proteins as the repository and the test proteins as the query proteins in the BLAST search. If we use the common criterion that homologous proteins should have an E-value less than 10^-4, then 74 out of 175 test proteins are homologs of the training proteins, which accounts for 42% of the test set. Note that this homologous relationship does not mean that BLAST homology transfer can predict all of the 74 test proteins correctly. In fact, BLAST homology transfer (based on the CC field of the homologous proteins) can only achieve a prediction accuracy of 26.9% (47/175).
As the prediction accuracy of HybridGO-Loc on this test set (see Table 3) is significantly higher than this percentage, the extra information available from the GOA database plays a very important role in the prediction.
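The homology criterion above amounts to thresholding the best E-value of each test protein against the training repository. A minimal sketch (the dictionary of E-values is a hypothetical stand-in for parsed BLAST output; the paper's BLAST settings are not reproduced here):

```python
def count_homologs(blast_hits, threshold=1e-4):
    """blast_hits maps each query protein to its best E-value against
    the training repository; return the queries that clear the
    homology threshold (E-value < 10^-4, as in the criterion above)."""
    return [q for q, evalue in blast_hits.items() if evalue < threshold]

# Hypothetical best-hit E-values for four test proteins.
hits = {"Q1": 3e-7, "Q2": 0.12, "Q3": 8e-5, "Q4": 5.0}
homologs = count_homologs(hits)
print(len(homologs), "/", len(hits))  # 2 / 4 clear the threshold
```

In the paper's experiment the same filter yields 74/175 (42%) homologs, yet homology transfer alone reaches only 26.9%, which is precisely the gap the GO-based features close.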

Semantic Similarity Measures
In this paper, we have compared three of the most common semantic similarity measures for subcellular localization: Lin's measure [51], Jiang's measure [74], and the relevance similarity measure [52]. We excluded Resnik's measure because it ignores the distance between the terms and their common ancestors in the GO hierarchy. In addition to these measures, many online tools are available for computing semantic similarity at the GO-term level and at the gene-product level [44,82-84]. However, these tools provide discrete measures, whereas the measures that we used are continuous. Research has shown that continuous measures are better than discrete measures in many applications [48].

[Table 1. Comparing the proposed predictor with state-of-the-art multi-label predictors based on leave-one-out cross validation (LOOCV) using the virus dataset. Columns: Label, Subcellular Location, and LOOCV Locative Accuracy (LA) for Virus-mPLoc [35], KNN-SVM [38], iLoc-Virus [37], mGOASVM [39], and HybridGO-Loc.]
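The information-content (IC) based measures named above can be sketched in a few lines. This is an illustration under standard definitions, not the paper's code: IC(t) = -log p(t), where p(t) is the annotation frequency of term t, and "MICA" denotes the most informative common ancestor of the two terms; the toy probabilities below are invented.

```python
import math

def ic(p):
    """Information content of a GO term with annotation frequency p."""
    return -math.log(p)

def lin_similarity(p_t1, p_t2, p_mica):
    """Lin's measure: 2*IC(MICA) / (IC(t1) + IC(t2))."""
    return 2 * ic(p_mica) / (ic(p_t1) + ic(p_t2))

def relevance_similarity(p_t1, p_t2, p_mica):
    """Relevance measure: Lin's measure weighted by (1 - p(MICA)),
    down-weighting pairs whose common ancestor is very frequent."""
    return lin_similarity(p_t1, p_t2, p_mica) * (1 - p_mica)

# Toy annotation frequencies for two terms and their MICA.
s = lin_similarity(0.01, 0.02, 0.05)
print(round(s, 3))  # 0.703
```

All three measures are continuous functions of the annotation frequencies, which is the property the comparison above relies on.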

GO-Frequency Features versus SS Features
Note that we do not replace the GO frequency vectors. Instead, we augment the GO frequency features with a more sophisticated feature, namely the GO SS vectors, which are combined with the GO frequency vectors. A GO frequency vector is found by counting the number of occurrences of every GO term in a set of distinct GO terms obtained from the training dataset, whereas an SS vector is constructed by computing the semantic similarity between a test protein and each of the training proteins at the gene-product level. That is, each element in an SS vector represents the semantic similarity of two GO-term groups. This can be easily seen from their definitions in Eq. 2 and Eqs. 4-9, respectively.
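The two constructions can be sketched as follows. This is a simplified illustration, not the paper's exact formulation: the best-match-average aggregation and the crude 0/1 term similarity below are placeholder choices (the paper's Eqs. 4-9 define the actual gene-product-level measures).

```python
from collections import Counter

def frequency_vector(protein_terms, distinct_terms):
    """Count occurrences of each distinct GO term from the training set."""
    counts = Counter(protein_terms)
    return [counts[t] for t in distinct_terms]

def group_similarity(terms_a, terms_b, term_sim):
    """Gene-product-level similarity between two GO-term groups,
    here a best-match average (one of several aggregation choices)."""
    def best(src, dst):
        return sum(max(term_sim(a, b) for b in dst) for a in src) / len(src)
    return 0.5 * (best(terms_a, terms_b) + best(terms_b, terms_a))

def hybrid_vector(test_terms, distinct_terms, train_groups, term_sim):
    """Concatenate the frequency vector with the SS vector."""
    freq = frequency_vector(test_terms, distinct_terms)
    ss = [group_similarity(set(test_terms), g, term_sim) for g in train_groups]
    return freq + ss

# Toy example with a crude term similarity (1 if equal, else 0).
term_sim = lambda a, b: 1.0 if a == b else 0.0
distinct = ["GO:1", "GO:2", "GO:3"]
train_groups = [{"GO:1", "GO:2"}, {"GO:3"}]
v = hybrid_vector(["GO:1", "GO:1", "GO:3"], distinct, train_groups, term_sim)
print(v)  # [2, 0, 1, 0.5, 0.75]
```

Note the difference in dimensionality: the frequency part is indexed by distinct GO terms, while the SS part is indexed by training proteins.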
The GO frequency vectors and the GO SS vectors are different in two fundamental ways.
A) GO frequency vectors are more primitive in the sense that their elements are based on individual GO terms without considering inter-term relationships, i.e., the elements in a GO frequency vector are independent of each other.
B) GO SS vectors are more sophisticated in the following two ways:
B1) Inter-term relationship. SS vectors are based on inter-term relationships. They are defined on a space in which each basis corresponds to one training protein, and the coordinate along that basis is defined by the semantic similarity between a test protein and the corresponding training protein.
B2) Inter-group relationship. The pairwise relationships between a test protein and the training proteins are hierarchically structured. This is because each basis of the SS space depends on a group of GO terms of the corresponding training protein, and the terms are arranged in a hierarchical structure (parent-child relationship). Because the GO terms in different groups are not mutually exclusive, the bases in the SS space are not independent of each other.

Bias Analysis
Except for the new plant dataset, we adopted leave-one-out cross validation (LOOCV), which is considered the most rigorous and bias-free evaluation procedure [80], to examine the performance of all predictors in this work. Nevertheless, determining the set of distinct GO terms W from a dataset is by no means without bias, which may favor the LOOCV performance. This is because the set of distinct GO terms W derived from a given dataset may not be representative for other datasets; in other words, the generalization capabilities of the predictors may be weakened when new GO terms outside W are found in the test proteins.
However, we adopted the following strategies to minimize the bias. First, the two benchmark datasets used in this paper were constructed from the whole Swiss-Prot database (although in different years), which, to some extent, incorporates all of the GO terms corresponding to the whole Swiss-Prot database, enabling W to be representative of all of the distinct GO terms. Second, these two benchmark datasets were collected according to strict criteria (details of the procedures can be found in the supplementary materials), and the sequence similarity of both datasets was cut off at 25%, which enables us to use a small set of representative proteins to represent all of the proteins of the corresponding species (i.e., virus or plant) in the whole database. In other words, W will vary from species to species, yet still be statistically representative of all of the useful GO terms for the corresponding species. Third, using W for statistical performance evaluation is equivalent, or at least approximately equivalent, to using all of the distinct GO terms in the GOA database. This is because GO terms that do not correspond to the training proteins will not participate in training the linear SVMs, nor will they play essential roles in contributing to the final predictions. In other words, the generalization capability of HybridGO-Loc will not be weakened even if new GO terms are found in the test proteins. A mathematical proof of this statement can be found in the supplementary materials available on the HybridGO-Loc server.
One may argue that a performance bias might arise because the whole W was used to construct the hybrid GO vectors for both training and testing during cross validation. In each fold of the LOOCV, the training proteins and the singled-out test protein use the same W to construct the GO vectors, meaning that the SVM training algorithm can indirectly see some information about the test protein through the GO vector space defined by W. It is also possible that, for a particular fold of the LOOCV, the GO terms of a test protein do not exist in any of the training proteins.
However, we have mathematically proved that this bias does not exist during LOOCV (see the accompanying supplementary materials for the proof). Furthermore, the results of the independent tests (see Table 3), for which no such bias occurs, also strongly suggest that HybridGO-Loc outperforms the other predictors by a large margin.
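The intuition behind the claim that unseen GO terms cannot weaken a linear SVM trained on W can be shown in a few lines (this is an illustration of the general principle, not the supplementary proof itself; the toy weights and terms are invented): a test protein's feature vector is projected onto W, so terms outside W contribute nothing to the linear score.

```python
def project_to_W(term_counts, W):
    """Project a protein's GO-term counts onto the training vocabulary W.
    Terms outside W are dropped, so they cannot influence a linear
    score w.x + b whose weights are indexed by W."""
    return [term_counts.get(t, 0) for t in W]

# Toy vocabulary and a toy trained linear model over it.
W = ["GO:1", "GO:2"]
weights, bias = [0.8, -0.4], 0.1

# The test protein carries GO:99, a term never seen in training.
x = project_to_W({"GO:1": 2, "GO:99": 5}, W)
score = sum(w * xi for w, xi in zip(weights, x)) + bias
print(x, round(score, 2))  # [2, 0] 1.7 -- GO:99 has no effect
```

The score is identical whether or not GO:99 is present, which is the sense in which evaluating over W approximates evaluating over all distinct GO terms in the GOA database.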

Conclusions
This paper proposes a new multi-label predictor that hybridizes GO frequency features and GO semantic similarity features to predict the subcellular locations of multi-label proteins. Three different semantic similarity measures were investigated and combined with the GO frequency features to formulate hybrid GO feature vectors. The feature vectors are subsequently classified by multi-label multi-class support vector machine (SVM) classifiers equipped with an adaptive decision strategy that can produce multiple class labels for a query protein. Compared to existing multi-label subcellular-localization predictors, the proposed predictor has the following advantages: (1) it formulates the feature vectors by hybridizing GO frequencies of occurrences and GO semantic similarity features, which contain richer information than GO term frequencies alone; (2) it adopts a new strategy that incorporates richer and more useful homologous information from distant homologs rather than using only the top homologs; and (3) it adopts an adaptive decision strategy for the multi-label SVM classifiers so that it can effectively deal with datasets containing both single-label and multi-label proteins. Experimental results demonstrate the superiority of the proposed hybrid features over each individual feature type. The proposed predictor was also found to perform remarkably better than existing state-of-the-art predictors. For readers' convenience, HybridGO-Loc is available online at http://bioinfo.eie.polyu.edu.hk/HybridGoServer/.
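The adaptive decision strategy can be illustrated with a deliberately simplified sketch. The thresholding rule below (keep every class whose score exceeds a fraction of the maximum, and always keep at least the top-scoring class) is one plausible form of adaptive multi-label decision; the paper's exact decision function may differ, and the scores are invented.

```python
def adaptive_decision(scores, theta=0.5):
    """Multi-label decision over per-class SVM scores: the acceptance
    threshold adapts to the top score rather than being fixed, so a
    query protein can receive one label or several."""
    top = max(scores)
    chosen = [i for i, s in enumerate(scores) if s >= theta * top]
    # Guarantee at least one label (handles the all-negative case).
    return chosen or [scores.index(top)]

# A dual-location protein: two classes score comparably high.
print(adaptive_decision([2.1, 1.4, -0.3, 0.2]))  # [0, 1]
# A single-location protein: one class dominates.
print(adaptive_decision([2.1, 0.3, -0.3, 0.2]))  # [0]
```

The key property is that the number of predicted labels is not fixed in advance, which is what lets one classifier handle both single- and multi-location proteins.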