
SVM-Prot 2016: A Web-Server for Machine Learning Prediction of Protein Functional Families from Sequence Irrespective of Similarity

  • Ying Hong Li ,

    Contributed equally to this work with: Ying Hong Li, Jing Yu Xu, Lin Tao

    Affiliation Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China

  • Jing Yu Xu ,

    Contributed equally to this work with: Ying Hong Li, Jing Yu Xu, Lin Tao

    Affiliations Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China, School of Mathematics and Statistics, Beijing Institute of Technology, Beijing, China

  • Lin Tao ,

    Contributed equally to this work with: Ying Hong Li, Jing Yu Xu, Lin Tao

    Affiliations Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China, Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

  • Xiao Feng Li,

    Affiliation Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China

  • Shuang Li,

    Affiliation Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China

  • Xian Zeng,

    Affiliation Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

  • Shang Ying Chen,

    Affiliation Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

  • Peng Zhang,

    Affiliation Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

  • Chu Qin,

    Affiliation Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

  • Cheng Zhang,

    Affiliation Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

  • Zhe Chen,

    Affiliation Zhejiang Key Laboratory of Gastro-intestinal Pathophysiology, Zhejiang Hospital of Traditional Chinese Medicine, Zhejiang Chinese Medical University, Hangzhou, P. R. China

  • Feng Zhu ,

    zhufeng@cqu.edu.cn

    Affiliation Innovative Drug Research and Bioinformatics Group, Innovative Drug Research Centre and School of Pharmaceutical Sciences, Chongqing University, Chongqing, 401331, China

  • Yu Zong Chen

    Affiliation Bioinformatics and Drug Discovery group, Department of Pharmacy, National University of Singapore, Singapore, 117543, Singapore

Abstract

Knowledge of protein function is important for biological, medical and therapeutic studies, but the functions of many proteins remain unknown. There is a need for improved functional prediction methods. Our SVM-Prot web-server employs a machine learning method for predicting protein functional families from protein sequences irrespective of similarity, complementing similarity-based and other methods in predicting diverse classes of proteins, including distantly related proteins and homologous proteins of different functions. Since its publication in 2003, we have made major improvements to SVM-Prot: (1) expanded coverage from 54 to 192 functional families, (2) more diverse protein descriptors for protein representation, (3) improved predictive performance owing to more enriched training datasets and a greater variety of protein descriptors, (4) a newly integrated BLAST analysis option for assessing proteins of the SVM-Prot predicted functional families that are similar in sequence to a query protein, and (5) a newly added batch submission option for classifying multiple proteins. Moreover, two more machine learning approaches, K nearest neighbor and probabilistic neural networks, were added to facilitate collective assessment of protein functions by multiple methods. SVM-Prot can be accessed at http://bidd2.nus.edu.sg/cgi-bin/svmprot/svmprot.cgi.

Introduction

The knowledge of protein function is essential for studying biological processes [1], understanding disease mechanisms [2], and exploring novel therapeutic targets [3,4]. Apart from experimental methods, a number of in-silico approaches have been developed and extensively used for protein function prediction. These methods include sequence similarity [5], sequence clustering [6], evolutionary analysis [7], gene fusion [8], protein interaction [9], protein remote homology detection [10,11], protein functional family classification based on sequence-derived [12,13] or domain [1] features, and integrated approaches that combine multiple methods, algorithms and/or data sources for enhanced functional prediction [5,14–16]. A protein functional family is a group of proteins with a specific type of molecular function (e.g. proteases [17]) or binding activity (e.g. RNA-binding [18]), or involved in a specific biological process defined by the Gene Ontology [19] (e.g. DNA repair [20]). Moreover, protein function prediction models have been constructed for more broadly defined functional families such as transmembrane [21], virulent [22] and secretory [23] proteins, and a large-scale community-based critical assessment of protein function annotation (CAFA) revealed that improvements to current protein function prediction tools are urgently needed [24]. Despite the development and extensive exploration of these methods, there is still a huge gap between proteins with and without functional characterization. Continuous efforts are therefore needed to develop new methods and improve existing ones. These efforts have been made possible by the rapidly expanding knowledge of protein sequence [25], structural [26], functional [19] and other [27–30] data.

Uncharacterized proteins comprise a substantial percentage of the predicted proteins in many genomes, and some of these proteins have no clear sequence or structural similarity to a protein of known function [31,32]. A particular challenge is to predict the function of these proteins from their sequence without knowledge of similarity, clustering or interaction relationships with a known protein. As part of the collective efforts to develop such prediction methods, we developed the web-based software SVM-Prot, which employs a machine learning method, support vector machines (SVM), for predicting protein functional families from protein sequences irrespective of sequence or structural similarity [12]. It has shown good predictive performance [33–40] in complementing other methods, or as part of integrated approaches, for predicting the function of diverse classes of proteins including distantly related proteins and homologous proteins of different functions.

The previous version of SVM-Prot covered 54 functional families. Its predictive accuracies for these families ranged from 53.03% to 99.26% in sensitivity and from 82.06% to 99.92% in specificity [12]. Since the early 2000s, the number of proteins with sequence information has dramatically expanded from 2 million to more than 48.7 million entries in the UniProt database, and the number of annotated functional families with more than 100 sequence entries has significantly increased from 54 to 192 [25]. Our analysis of all “reviewed” protein entries in the UniProt database revealed that the overwhelming majority (80.23%) of these entries belong to those 192 families. The enriched protein sequence data could be employed to expand the coverage and improve the predictive performance of SVM-Prot. Moreover, our earlier study suggested that the prediction performance of SVM could be substantially enhanced by using a more diverse set of protein descriptors to represent more comprehensive classes of proteins [41]. Thus, SVM-Prot was upgraded by using the enriched protein data and more diverse protein descriptors to train models for all 192 functional families and to improve its predictive performance. Prediction models for an additional set of Gene Ontology [19] functional families will be developed and added to SVM-Prot in the near future.

To facilitate the analysis of specific proteins of the SVM-Prot predicted functional families that might be relevant to a query protein, a new option was provided for conducting BLAST [42] sequence alignment to search for proteins of the SVM-Prot predicted functional families that are similar to the query protein. Moreover, a batch submission option for loading multiple protein sequences was also included. Given that functional prediction capacity can be enhanced by integrating multiple methods [5,14,15], two machine learning prediction tools, K nearest neighbor (kNN) and probabilistic neural networks (PNN), were integrated into this version of SVM-Prot to facilitate the collective assessment of protein functional families. These two tools have been explored for functional prediction of proteins [43–46] and other biomolecules [47]. Since these two tools have been extensively used for developing over 39 protein functional family prediction models (S1 Table), and because of their potential utility in complementing SVM from the nearest-neighbor and neural-network perspectives, SVM-Prot can serve the community by providing alternative protein functional family prediction tools based on these and other machine learning methods.

Results and Discussion

To evaluate the predictive performance of the models in SVM-Prot, the sensitivity (SE), precision (PR) and specificity (SP) on the independent evaluation datasets were calculated and are shown in Table 1 and S2 Table. SE, PR and SP of the SVM models were in the range of 50.00~99.99%, 5.31~99.99% and 82.06~99.99%, respectively. For the kNN models, the performances were 51.06~99.99% for SE, 17.86~94.49% for PR and 90.19~99.99% for SP. Moreover, SE, PR and SP of the PNN models were in the range of 60.49~99.99%, 25.00~99.75% and 97.34~99.99%, respectively. The SEs and PRs of the SVM classifier were generally lower, and showed larger variations, than the SPs. This was partly due to the imbalanced training sets, in which the numbers of non-members greatly surpassed those of the members. Imbalanced training sets are known to adversely affect machine learning prediction performance, particularly for the minority class [48,49]. Moreover, not all functional families were sufficiently covered by the known proteins, particularly those with < 100 known protein members; the inadequate coverage of the respective training sets likely affected the SEs to varying degrees.

Table 1. Partial list of the protein functional families covered by SVM-Prot and the prediction performance of the SVM, kNN and PNN models on the independent testing sets.

The complete list is provided in S2 Table. The prediction results are given as sensitivity SE = TP/(TP+FN), specificity SP = TN/(TN+FP) and precision PR = TP/(TP+FP), where TP, FN, TN and FP are the numbers of true positives, false negatives, true negatives and false positives, respectively.

https://doi.org/10.1371/journal.pone.0155290.t001

To further evaluate the capability of SVM-Prot in predicting the functional families of novel proteins, a comprehensive literature search for recently reported novel proteins was conducted using the keyword “novel” in combination with “protein”, “enzyme”, “transporter”, “DNA binding”, “RNA-binding”, “viral”, or “bacterial”. As a result, 42 novel proteins published in 2014 or 2015 that had been explicitly described as novel in the literature were identified. These proteins were not in the SVM-Prot training datasets but had sequences available in the literature or public databases.

S3 Table summarizes the prediction results for those 42 novel proteins by SVM-Prot, FFPred 3 [50] and NCBI BLAST [51], and the detailed prediction results are provided in S4 Table. The function of a novel protein was considered matched to a computationally identified functional family when the two exactly matched at a specifically defined class level. Take the formate-nitrite transporter as an example: it belongs to the formate transporter family, the major intrinsic protein superfamily and the transporter TC1.A class. These families or classes are considered specifically defined class levels, whereas the transporter family is too broadly defined. Overall, the number of functional families predicted or output by SVM-Prot for each novel protein was in the range of 3~18, and that by FFPred was in the range of 16~55 (45~101 if low-reliability predictions were included). Moreover, the functions of 13 of those 42 novel proteins were correctly assigned to one functional family predicted by SVM-Prot, and 7 (12 if low-reliability predictions were included) were correctly matched by FFPred (S3 Table). In particular, among those 13 proteins predicted by SVM-Prot, 7 were ranked as top-1 in the list of predicted functional families, 2 were ranked as top-2, and 4 were ranked as top-5. For FFPred, only one protein was ranked as top-1, another was ranked as top-2, and 2 more proteins were ranked as top-10; the majority (8 proteins) of the predicted proteins were ranked within the range of top-27 to top-70. Thus, SVM-Prot is capable of predicting the functional families of novel proteins at a comparable yield and with reduced false-hit rates with respect to FFPred. It should be strongly cautioned that these two protein function prediction servers were upgraded at different times with varying coverage of training datasets, so the difference in the prediction results may not reflect the true prediction capability of these servers.

As a further comparison, the performance of BLAST on those 42 novel proteins was also evaluated. The number of similar proteins with E-value < 0.05 for each novel protein was in the range of 0~112, and the functions of 30 of those 42 novel proteins were correctly matched to one of the BLAST-identified similar proteins (20, 2, 6 and 2 were ranked as top-1, top-2, top-4 and top-10, respectively) (S3 Table). However, caution is needed when directly comparing the BLAST results with those of SVM-Prot and FFPred. The BLAST-searched proteins may include previously or recently deposited similar proteins with the same or similar functions as our tested novel proteins, while some of these similar proteins may not be in the training sets of either SVM-Prot or FFPred. Nonetheless, the better prediction performance of BLAST on these novel proteins suggests a need for more frequent upgrades of SVM-Prot and FFPred with enriched, up-to-date training datasets.

One useful strategy for overcoming the imbalanced-dataset problem is to reconstruct the training sets into more balanced ones by either oversampling the minority class [48] or undersampling the majority one [49], which may, however, compromise the training datasets by introducing noise into the minority class or reducing the diversity of the majority one. In SVM-Prot, the training sets of the non-members were constructed from a minimal set of representative proteins from the Pfam domain families. Our study showed that further reduction of the training sets by one protein per Pfam family significantly reduced the SPs without much improvement of the SEs. Therefore, no further reduction of the training data was made. Another effective strategy for reducing the negative influence of imbalanced data is to separately optimize the pair of cost parameters of the SVM models [52], in particular the cost of errors on the positive samples relative to that on the negative ones. In the development of the SVM models, due to the very high diversity of each training dataset (containing 7,613~46,223 proteins), both the separate and the uniform cost-parameter optimization schemes led to very high cost parameters for both positive and negative samples and achieved similar levels of prediction performance.
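As an illustration of the cost-sensitive scheme, most SVM implementations expose per-class penalty weights (the -w options of LibSVM, or class_weight in scikit-learn). The snippet below is a minimal sketch under that assumption; the ten-fold weight on the positive class and the other parameter values are arbitrary examples, not the settings used in SVM-Prot.

from sklearn.svm import SVC

# Penalize errors on the positive (minority) class ten times more heavily
# than errors on the negative (majority) class.
clf = SVC(kernel="rbf", C=100, gamma=0.01, class_weight={1: 10, -1: 1})
# clf.fit(X_train, y_train)  # descriptor matrix and +1/-1 labels, as described in Methods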

The capability of protein function prediction can be affected by multiple factors, including insufficient diversity of proteins in some functional families, inadequate coverage or representation of certain important structural and/or physicochemical features by the current datasets and protein descriptors, and deficiencies in the computational algorithms and parameter optimization procedures. The capability of machine learning functional prediction tools has been enhanced by expanded protein data, improved computational algorithms and the exploration of integrated prediction strategies using multiple methods [53]. In addition to employing the continuously expanding protein data, SVM-Prot may be improved by exploring newly developed computational methods. In particular, there has been progress in the development and use of a new machine learning method, deep learning, for predicting protein secondary structure and other local structural properties [54–56], which may potentially be extended to protein function prediction. SVM-Prot can also be improved by integrating multiple methods and algorithms for enhanced functional prediction [5,14,15].

As an effective ensemble classifier, LibD3C [57] has been widely cited in recent publications aimed at identifying DNA-binding proteins [58,59], predicting cytokine-receptor interactions [60] and discovering immunoglobulins [61]. S5 Table summarizes the prediction performances of SVM, LibD3C, kNN and PNN on the independent testing sets of 10 randomly selected representative families covered by SVM-Prot. These 10 protein families include 4 enzyme families (EC1.5, EC2.9, EC4.4 and EC5.1), the actin capping family, the DNA recombination family, the DNA repair family, the elongation factor activity family, the GPCR family and the lipid-binding protein family. PR, SE and SP of the SVM models were in the range of 53.9~99.3%, 53.0~97.5% and 96.8~99.99%, respectively. For the LibD3C models, the corresponding performances were 52.39~90.51% for PR, 79.23~99.03% for SE and 96.86~99.89% for SP. The kNN method gave performances of 55.0~83.7% for PR, 67.5~96.6% for SE and 93.8~99.9% for SP. Moreover, PR, SE and SP of the PNN models were in the range of 71.0~94.5%, 64.2~94.1% and 98.7~99.9%, respectively. As shown in S5 Table, the prediction performances (PR, SE and SP) were comparable among SVM, LibD3C, kNN and PNN, indicating that each method is an effective complement to the others. It should be strongly cautioned that those 10 randomly tested families may not be sufficient to represent the prediction performance of all protein families covered by the current SVM-Prot. Therefore, a comprehensive analysis of all SVM-Prot families using the above classifiers is needed for the next update of SVM-Prot.

Methods

Instead of directly aligning or clustering sequences, the SVM-Prot classification models classify a protein into functional families based on the analysis of sequence-derived structural and physicochemical properties [33,34]. Proteins known to be in a functional family (e.g., proteases) and those outside the family (e.g., representatives of all non-protease proteins) are used to train a classification model, which recognizes specific sequence-derived features for classifying proteins either into or outside the functional family. Proteins of a specific functional family share common structural and physicochemical features [62,63], which can be recognized by a machine learning classification model given sufficiently diverse training datasets [64].

Data collection

Table 1 and S2 Table provide a partial and a complete list, respectively, of the protein functional families covered by the upgraded SVM-Prot and the predictive performances of the SVM, kNN and PNN models. These families include the G-protein coupled receptor family from GPCRDB [63], the nuclear receptor family from NucleaRDB [63], 50 enzyme families from BRENDA [62], 20 transporter families from TCDB [65], 1 channel family from LGICdb [66], 24 molecular binding families (e.g. DNA-binding, RNA-binding, iron-binding), 67 Gene Ontology families (30 molecular function and 37 biological process), and 28 broadly defined functional families from the UniProt database [25]. These broadly defined functional families were selected on the following basis: either prediction models for these families have been developed (e.g. allergen proteins [47]), or the relevant functions have common features exploitable for developing prediction models (e.g. cAMP binding). Protein functional families were derived from multiple sources partly because of their complementary coverage and different functional perspectives. For instance, 122 functional families predictable by SVM-Prot are not covered by FFPred [50], while 391 functional families provided by FFPred are not covered by SVM-Prot. Thus, SVM-Prot may complement other prediction servers by providing a different coverage of protein functional families.

Dataset construction

To prepare the datasets for constructing the model of each functional family, the training, testing and independent datasets were prepared by following a strict procedure. Firstly, the protein names of the members of each family were collected from UniProt [25], and protein members with the same name but different species origins were grouped together. Secondly, the protein members in each group were iteratively selected and put into the training, testing and independent datasets as positive samples. Thirdly, to generate negative samples, the protein members of each functional family were mapped to the Pfam [67] protein families. The Pfam families with at least one member of the functional family were labeled “positive families”, while the remaining Pfam families were labeled “negative families”. Fourthly, 3 representative proteins from each “negative family” were randomly selected and iteratively put into the training, testing and independent datasets as negative samples.
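For illustration, the four-step procedure above can be sketched in a few lines of Python. The data structures (family_members, pfam_assignments), the round-robin assignment and the function name are assumptions made for this sketch, not the actual SVM-Prot pipeline code.

import random

def build_datasets(family_members, pfam_assignments, n_neg_per_family=3, seed=0):
    """Split positives and Pfam-derived negatives into training/testing/independent sets.

    family_members: protein name -> list of sequence IDs (same name, different species)
    pfam_assignments: Pfam family ID -> list of sequence IDs
    """
    random.seed(seed)
    order = ("train", "test", "independent")
    splits = {name: [] for name in order}

    # Positive samples: group members by protein name, assign groups round-robin.
    positive_ids = set()
    for k, (name, members) in enumerate(sorted(family_members.items())):
        splits[order[k % 3]].extend((m, +1) for m in members)
        positive_ids.update(members)

    # Negative samples: representatives from each Pfam "negative family".
    bucket = 0
    for pfam_id, members in sorted(pfam_assignments.items()):
        if positive_ids.intersection(members):
            continue  # a "positive family" is skipped
        for rep in random.sample(members, min(n_neg_per_family, len(members))):
            splits[order[bucket % 3]].append((rep, -1))
            bucket += 1
    return splits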

During model construction, the parameters optimized on each training set were evaluated on the corresponding testing set. Once the optimized parameters were found, the training and testing sets were combined to form a new training set, and the optimized parameters were applied to train a new model. The independent dataset was then used to evaluate the performance of the newly constructed model and to detect overfitting. Once the optimized parameters passed this evaluation, they were used to train a final model on the combined training, testing and independent datasets. All duplicated proteins within or among the training, testing and independent evaluation datasets were removed before model construction.

Protein representation

Extensive efforts have been devoted to web-based or stand-alone tools for extracting features from protein sequences [68,69]. For example, Pse-in-One is a server for generating various modes of pseudo components of DNA, RNA and protein sequences [68]. In this work, each sequence is represented by various physicochemical properties, including the 9 properties of the early version of SVM-Prot (amino acid composition, polarity, hydrophobicity, surface tension, charge, normalized Van der Waals volume, polarizability, secondary structure and solvent accessibility) and 4 additional properties in this version of SVM-Prot (molecular weight, solubility, number of hydrogen bond donors in the side chain, and number of hydrogen bond acceptors in the side chain) [69]. All properties are encoded in 3 descriptors, named composition (C), transition (T) and distribution (D) [70]. C is the fraction of amino acids with a particular property. T characterizes the percent frequency of amino acids of a particular property neighbored by amino acids of another specific property. D measures the fractional chain length within which the first, 25%, 50%, 75% and 100% of the amino acids of a particular property are located.

Take a hypothetical protein (AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE) with 16 alanines (n1 = 16) and 14 glutamic acids (n2 = 14) as an example. The compositions (C) for these two amino acids are n1/(n1 + n2) = 0.53 and n2/(n1 + n2) = 0.47, respectively. Moreover, this protein contains 15 A-to-E and E-to-A transitions (T), with a percent frequency of 15/29 = 0.52. Furthermore, the first, 25%, 50%, 75% and 100% of the amino acid A residues are located within the first 1, 5, 12, 20 and 29 residues, respectively. Therefore, the distribution (D) for amino acid A is (1/30 = 0.03, 5/30 = 0.17, 12/30 = 0.40, 20/30 = 0.67, 29/30 = 0.97), and that for amino acid E can be calculated in the same way. Overall, the amino acid descriptors for this sequence are C = (0.53, 0.47), T = (0.52) and D = (0.03, 0.17, 0.40, 0.67, 0.97, 0.07, 0.27, 0.60, 0.77, 1.00). In most studies, amino acids are divided into three classes for each property, so the combined descriptors for each property consist of 21 elements (3 for C, 3 for T and 15 for D). Moreover, the Moreau-Broto autocorrelation [71] of amino acid indices and the pseudo-amino acid composition [72] are added to represent correlations of the structural and physicochemical properties within each protein sequence.
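The worked example can be reproduced with a short script. The sketch below handles only a two-class property (A versus E) for clarity; the function name and the quantile rounding convention are assumptions made for this illustration, and the full SVM-Prot descriptors were computed with PROFEAT [69] over three property classes.

def ctd_descriptors(seq, class1="A", class2="E"):
    """Composition (C), transition (T) and distribution (D) for a two-class property."""
    n1, n2, n = seq.count(class1), seq.count(class2), len(seq)

    # Composition: fraction of residues belonging to each class.
    C = (round(n1 / (n1 + n2), 2), round(n2 / (n1 + n2), 2))

    # Transition: frequency of class1<->class2 neighbours among the n-1 adjacent pairs.
    t = sum(1 for a, b in zip(seq, seq[1:]) if {a, b} == {class1, class2})
    T = (round(t / (n - 1), 2),)

    # Distribution: fractional chain length at which the first, 25%, 50%, 75% and 100%
    # of the residues of a class are located.
    def dist(cls):
        pos = [i + 1 for i, aa in enumerate(seq) if aa == cls]
        marks = [1] + [max(1, int(f * len(pos))) for f in (0.25, 0.5, 0.75)] + [len(pos)]
        return tuple(round(pos[m - 1] / n, 2) for m in marks)

    return C, T, dist(class1) + dist(class2)

C, T, D = ctd_descriptors("AEAAAEAEEAAAAAEAEEEAAEEAEEEAAE")
# C = (0.53, 0.47), T = (0.52,), D = (0.03, 0.17, 0.4, 0.67, 0.97, 0.07, 0.27, 0.6, 0.77, 1.0)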

Protein functional family prediction models

Three types of classification models were developed for predicting protein functional families in SVM-Prot. The first model is SVM, which is based on the structural risk minimization (SRM) principle from statistical learning theory [64]. In linearly separable cases, SVM constructs a hyperplane to separate two different classes of feature vectors with a maximum margin. A feature vector x_i is composed of the protein descriptors described in the previous section. The hyperplane is constructed by finding a vector w and a parameter b that minimize ‖w‖^2 and satisfy the following conditions:

w · x_i + b ≥ +1, for y_i = +1 (1)

w · x_i + b ≤ −1, for y_i = −1 (2)

where y_i is the class index, w is a vector normal to the hyperplane, |b|/‖w‖ is the perpendicular distance from the hyperplane to the origin and ‖w‖^2 is the squared Euclidean norm of w. After the determination of w and b, a feature vector x can be classified by:

f(x) = sign(w · x + b) (3)

In non-linearly separable cases, SVM maps the input vectors into a high-dimensional feature space using a kernel function K(x_i, x_j). In SVM-Prot, Libsvm-3.20 [73] was used for developing the SVM models with the Gaussian (RBF) kernel:

K(x_i, x_j) = exp(−γ‖x_i − x_j‖^2) (4)
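To make the training setup concrete, the sketch below fits an RBF-kernel SVM to a descriptor matrix and tunes C and γ by cross-validated grid search. It uses scikit-learn's SVC (which wraps LIBSVM) as a stand-in for the Libsvm-3.20 tools; the random data and the parameter grid are placeholders, not the actual SVM-Prot settings.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data: 200 "proteins", 21 C/T/D features each, imbalanced +1/-1 labels.
rng = np.random.default_rng(0)
X = rng.random((200, 21))
y = np.where(rng.random(200) > 0.8, 1, -1)

# Gaussian kernel K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), as in Eq (4).
param_grid = {"C": [1, 10, 100, 1000], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("optimized parameters:", search.best_params_)
print("predicted class of the first protein:", search.best_estimator_.predict(X[:1])[0])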

The second model is kNN [74], which computes the Euclidean distance between the vector x of a query protein and the vector x_i of every protein in the training set, selects the k vectors nearest to the query vector x, and predicts the class of the query vector x as the majority class of its k nearest neighbors.
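A minimal illustration of this step (Euclidean distances to all training vectors, then a majority vote among the k nearest) is given below as a plain NumPy sketch, not the exact SVM-Prot implementation.

import numpy as np

def knn_predict(x_query, X_train, y_train, k=5):
    """Classify x_query by majority vote among its k nearest training vectors."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                    # indices of the k closest proteins
    votes = y_train[nearest]                               # +1 = family member, -1 = non-member
    return 1 if np.sum(votes == 1) > np.sum(votes == -1) else -1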

The third model is PNN, a form of neural network that uses the Bayes optimal decision rule h_i c_i f_i(x) > h_j c_j f_j(x) for classification [75], where h_i and h_j are the prior probabilities, c_i and c_j are the costs of misclassification, and f_i(x) and f_j(x) are the probability density functions of class i and class j, respectively. A query vector x is classified into class i if the product of the three terms is greater for class i than for any other class j (j ≠ i). The probability density function of each class can be estimated by Parzen's nonparametric estimator:

f(x) = (1/n) Σ_{i=1..n} exp[ −Σ_{j=1..p} (x_j − x_ij)^2 / (2σ_j^2) ] (5)

where n is the number of proteins in the class, p is the number of features, x_j is the jth feature of the query protein, x_ij is the jth feature of the ith protein in the class, and σ_j is the smoothing factor for that feature. PNN uses a single type of adjustable parameter, the smoothing factor σ of the radial basis function in Parzen's nonparametric estimator, which makes its training orders of magnitude faster than that of traditional neural networks.
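The PNN decision can likewise be sketched in a few lines. The Gaussian Parzen estimator below follows Eq (5); equal priors h and misclassification costs c are assumed for this illustration, so the rule reduces to comparing the two class densities.

import numpy as np

def parzen_density(x, X_class, sigma):
    """Parzen estimate of the class density at x (Eq 5); X_class is an n x p matrix, sigma has length p."""
    diffs = (X_class - x) / sigma                      # (x_j - x_ij) / sigma_j
    return np.mean(np.exp(-0.5 * np.sum(diffs ** 2, axis=1)))

def pnn_predict(x, X_pos, X_neg, sigma):
    # With equal priors and costs, classify into the class with the larger estimated density.
    return 1 if parzen_density(x, X_pos, sigma) > parzen_density(x, X_neg, sigma) else -1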

After the prediction of the functional families of a query protein, an option is provided for the user to align the query protein sequence with the sequences of the seed proteins of the SVM-Prot predicted functional families by using the BLAST sequence alignment program obtained from NCBI [42]. The top-ranked proteins (up to 20 sequences) of each SVM-Prot predicted family with the highest sequence similarity (lowest E-values) to the query protein are provided on a separate output page. As knowledge of the protein functional family alone may not be specific enough for analyzing the function of a query protein, this option facilitates a convenient and quick assessment of the potential specific functions of a query protein. Fig 1 illustrates an example of SVM-Prot prediction for an EGFR protein sequence: the EC2.7 transferases (transferring phosphorus-containing groups) family was predicted as the top family for this protein, and a click on the BLAST search further indicated that this protein is a receptor protein-tyrosine kinase.

Fig 1. An example of SVM-Prot prediction of an EGFR protein sequence and its subsequent BLAST sequence alignment analysis of the similar proteins of the SVM-Prot predicted functional family.

https://doi.org/10.1371/journal.pone.0155290.g001
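The follow-up analysis shown in Fig 1 can also be reproduced locally with the NCBI BLAST+ programs [42]. The sketch below assumes a FASTA file of the seed proteins of a predicted family (seeds.fasta) and a query file (query.fasta); these file names and the E-value cutoff are placeholders, not the server's internal settings.

import subprocess

# Build a protein BLAST database from the seed proteins of the predicted family.
subprocess.run(["makeblastdb", "-in", "seeds.fasta", "-dbtype", "prot", "-out", "seed_db"], check=True)

# Align the query against the seeds; keep up to 20 hits in tabular format.
result = subprocess.run(
    ["blastp", "-query", "query.fasta", "-db", "seed_db",
     "-evalue", "1e-5", "-max_target_seqs", "20", "-outfmt", "6"],
    capture_output=True, text=True, check=True)

for line in result.stdout.splitlines():
    fields = line.split("\t")                 # qseqid, sseqid, %identity, ..., evalue, bitscore
    print("hit:", fields[1], "E-value:", fields[10])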

Performance measurement

The performance of the SVM, kNN and PNN models was assessed by three different measurements. The first is the use of sensitivity (SE, also known as recall), specificity (SP) and precision (PR) to evaluate the predictive performance on the independent validation datasets, defined as:

SE = TP/(TP + FN) (6)

SP = TN/(TN + FP) (7)

PR = TP/(TP + FP) (8)

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively. In the real world, the number of proteins outside a specific functional family greatly surpasses that within the family, so even a slight decline in specificity (SP) would produce a large number of false positive predictions; particular attention should therefore be paid to SP when evaluating a model's prediction performance.
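These three measures follow directly from the confusion-matrix counts, as in the small helper below (the example numbers are arbitrary).

def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity and precision from confusion-matrix counts (Eqs 6-8)."""
    se = tp / (tp + fn)   # sensitivity (recall)
    sp = tn / (tn + fp)   # specificity
    pr = tp / (tp + fp)   # precision
    return se, sp, pr

# Example: evaluate(90, 9500, 40, 10) -> (0.90, 0.9958..., 0.6923...)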

The second measurement is the use of Platt's posterior class probability [50,76] for scoring the predicted functional families of a query protein. This probability has been used for scoring the machine learning classification of protein functional families [50], fold classes [77], transmembrane topology [78], secondary structures [79], and the effect of missense mutations on protein function [80]. It is also built into popular machine learning software such as LibSVM [73], in which the posterior probability takes the form of a sigmoid function:

P(y = 1 | f) = 1 / (1 + exp(A·f + B)) (9)

where f = f(x) is the output of the SVM and the parameters A and B are optimized by cross-validation on the training sets.
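A minimal sketch of Eq (9) is given below; the values of A and B are illustrative placeholders, since in practice they are fitted by cross-validation (LibSVM does this automatically when probability estimates are requested).

import math

def platt_probability(f, A=-2.0, B=0.0):
    """Posterior P(y = 1 | f) from the SVM output f via Platt's sigmoid (Eq 9)."""
    return 1.0 / (1.0 + math.exp(A * f + B))

# A protein far on the positive side of the hyperplane (large f) gets a probability close to 1.
print(platt_probability(2.5))   # ~0.99 with the placeholder parameters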

The last measurement is the testing of these models on a set of newly published novel proteins (reported in 2014 and 2015) whose functions were reported in the respective publications, together with a comparative analysis of SVM-Prot against two popular protein function prediction tools.

Supporting Information

S1 Table. List of literature-reported protein functional family prediction models developed by using kNN and PNN methods.

https://doi.org/10.1371/journal.pone.0155290.s001

(DOCX)

S2 Table. Complete list of the protein functional families covered by SVMProt and the prediction performance of the SVM, kNN and PNN models on the independent testing sets.

https://doi.org/10.1371/journal.pone.0155290.s002

(DOCX)

S3 Table. List of the novel proteins published in 2015 and 2014 that are not in the SVMProt training sets and have available sequence in the literature or public databases.

https://doi.org/10.1371/journal.pone.0155290.s003

(DOCX)

S4 Table. The detailed results of the prediction of the functional families of the 42 novel proteins by SVMProt, FFPred and NCBI BLAST.

https://doi.org/10.1371/journal.pone.0155290.s004

(DOCX)

S5 Table. 10 representative protein functional families covered by SVM-Prot and the prediction performance of the LibD3C, SVM, kNN and PNN models on the independent testing sets.

https://doi.org/10.1371/journal.pone.0155290.s005

(DOCX)

Acknowledgments

This work was funded by the Fundamental Research Funds for the Central Universities (CDJZR14468801, CDJKXB14011, 2015CDJXY); Ministry of Science and Technology, 863 Hi-Tech Program (2007AA02Z160); Key Special Project Grant 2009ZX09501-004 China; and Singapore Academic Research Fund grant R-148-000-208-112.

Author Contributions

  1. Conceived and designed the experiments: FZ YZC.
  2. Performed the experiments: YHL JYX LT XFL.
  3. Analyzed the data: YHL JYX LT XFL.
  4. Contributed reagents/materials/analysis tools: YHL JYX LT XFL SL XZ SYC PZ CQ CZ ZC.
  5. Wrote the paper: FZ YZC.
6. Designed the web interface: YHL LT XFL.

References

  1. 1. Das S, Sillitoe I, Lee D, Lees JG, Dawson NL, Ward J, et al. CATH FunFHMMer web server: protein functional annotations using functional family assignments. Nucleic Acids Res. 2015; 43: W148–153. pmid:25964299
  2. 2. Jackson SP, Bartek J. The DNA-damage response in human biology and disease. Nature. 2009; 461: 1071–1078. pmid:19847258
  3. 3. Weinberg SE, Chandel NS. Targeting mitochondria metabolism for cancer therapy. Nat Chem Biol. 2015; 11: 9–15. pmid:25517383
  4. 4. Yang H, Qin C, Li YH, Tao L, Zhou J, Yu CY, et al. Therapeutic target database update 2016: enriched resource for bench to clinical drug target and targeted pathway information. Nucleic Acids Res. 2016; 44: D1069–1074. pmid:26578601
  5. 5. Piovesan D, Giollo M, Leonardi E, Ferrari C, Tosatto SC. INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity. Nucleic Acids Res. 2015; 43: W134–140. pmid:26019177
  6. 6. Rentzsch R, Orengo CA. Protein function prediction using domain families. BMC Bioinformatics. 2013; 14 Suppl 3: S5. pmid:23514456
  7. 7. Sahraeian SM, Luo KR, Brenner SE. SIFTER search: a web server for accurate phylogeny-based protein function prediction. Nucleic Acids Res. 2015; 43: W141–147. pmid:25979264
  8. 8. Date SV, Marcotte EM. Protein function prediction using the Protein Link EXplorer (PLEX). Bioinformatics. 2005; 21: 2558–2559. pmid:15701682
  9. 9. Kotlyar M, Pastrello C, Pivetta F, Lo Sardo A, Cumbaa C, Li H, et al. In silico prediction of physical protein interactions and characterization of interactome orphans. Nat Methods. 2015; 12: 79–84. pmid:25402006
  10. 10. Liu B, Chen J, Wang X. Application of learning to rank to protein remote homology detection. Bioinformatics. 2015; 31: 3492–3498. pmid:26163693
  11. 11. Liu B, Zhang D, Xu R, Xu J, Wang X, Chen Q, et al. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics. 2014; 30: 472–479. pmid:24318998
  12. 12. Cai CZ, Han LY, Ji ZL, Chen X, Chen YZ. SVM-Prot: Web-based support vector machine software for functional classification of a protein from its primary sequence. Nucleic Acids Res. 2003; 31: 3692–3697. pmid:12824396
  13. 13. Lobley AE, Nugent T, Orengo CA, Jones DT. FFPred: an integrated feature-based function prediction server for vertebrate proteomes. Nucleic Acids Res. 2008; 36: W297–302. pmid:18463141
  14. 14. Wass MN, Barton G, Sternberg MJ. CombFunc: predicting protein function using heterogeneous data sources. Nucleic Acids Res. 2012; 40: W466–470. pmid:22641853
  15. 15. Jones P, Binns D, Chang HY, Fraser M, Li W, McAnulla C, et al. InterProScan 5: genome-scale protein function classification. Bioinformatics. 2014; 30: 1236–1240. pmid:24451626
  16. 16. Xue W, Wang P, Li B, Li Y, Xu X, Yang F, et al. Identification of the inhibitory mechanism of FDA approved selective serotonin reuptake inhibitors: an insight from molecular dynamics simulation study. Phys Chem Chem Phys. 2016; 18: 3260–3271. pmid:26745505
  17. 17. Dobson PD, Doig AJ. Predicting enzyme class from protein structure without alignments. J Mol Biol. 2005; 345: 187–199. pmid:15567421
  18. 18. Han LY, Cai CZ, Lo SL, Chung MC, Chen YZ. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA. 2004; 10: 355–368. pmid:14970381
  19. 19. Gene Ontology C. Gene Ontology Consortium: going forward. Nucleic Acids Res. 2015; 43: D1049–1056. pmid:25428369
  20. 20. Guan Y, Myers CL, Hess DC, Barutcuoglu Z, Caudy AA, Troyanskaya OG. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol. 2008; 9 Suppl 1: S3. pmid:18613947
  21. 21. Wang M, Yang J, Liu GP, Xu ZJ, Chou KC. Weighted-support vector machines for predicting membrane protein types based on pseudo-amino acid composition. Protein Eng Des Sel. 2004; 17: 509–516. pmid:15314209
  22. 22. Garg A, Gupta D. VirulentPred: a SVM based prediction method for virulent proteins in bacterial pathogens. BMC Bioinformatics. 2008; 9: 62. pmid:18226234
  23. 23. Garg A, Raghava GP. A machine learning based method for the prediction of secretory proteins using amino acid composition, their order and similarity-search. In Silico Biol. 2008; 8: 129–140. pmid:18928201
  24. 24. Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013; 10: 221–227. pmid:23353650
  25. 25. UniProt C. UniProt: a hub for protein information. Nucleic Acids Res. 2015; 43: D204–212. pmid:25348405
  26. 26. Rose PW, Prlic A, Bi C, Bluhm WF, Christie CH, Dutta S, et al. The RCSB Protein Data Bank: views of structural biology for basic and applied research and education. Nucleic Acids Res. 2015; 43: D345–356. pmid:25428375
  27. 27. Mitchell A, Chang HY, Daugherty L, Fraser M, Hunter S, Lopez R, et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 2015; 43: D213–221. pmid:25428371
  28. 28. Finn RD, Bateman A, Clements J, Coggill P, Eberhardt RY, Eddy SR, et al. Pfam: the protein families database. Nucleic Acids Res. 2014; 42: D222–230. pmid:24288371
  29. 29. Zhu F, Shi Z, Qin C, Tao L, Liu X, Xu F, et al. Therapeutic target database update 2012: a resource for facilitating target-oriented drug discovery. Nucleic Acids Res. 2012; 40: D1128–1136. pmid:21948793
  30. 30. Zhu F, Han B, Kumar P, Liu X, Ma X, Wei X, et al. Update of TTD: Therapeutic Target Database. Nucleic Acids Res. 2010; 38: D787–791. pmid:19933260
  31. 31. Bork P. Powers and pitfalls in sequence analysis: the 70% hurdle. Genome Res. 2000; 10: 398–400. pmid:10779480
  32. 32. Hu P, Janga SC, Babu M, Diaz-Mejia JJ, Butland G, Yang W, et al. Global functional atlas of Escherichia coli encompassing previously uncharacterized proteins. PLoS Biol. 2009; 7: e96. pmid:19402753
  33. 33. Cai CZ, Han LY, Ji ZL, Chen YZ. Enzyme family classification by support vector machines. Proteins. 2004; 55: 66–76. pmid:14997540
  34. 34. Han LY, Cai CZ, Ji ZL, Cao ZW, Cui J, Chen YZ. Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach. Nucleic Acids Res. 2004; 32: 6437–6444. pmid:15585667
  35. 35. Song L, Li D, Zeng X, Wu Y, Guo L, Zou Q. nDNA-Prot: identification of DNA-binding proteins based on unbalanced classification. BMC Bioinformatics. 2014; 15: 298. pmid:25196432
  36. 36. Lin C, Zou Y, Qin J, Liu X, Jiang Y, Ke C, et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLoS One. 2013; 8: e56499. pmid:23437146
  37. 37. Cheng XY, Huang WJ, Hu SC, Zhang HL, Wang H, Zhang JX, et al. A global characterization and identification of multifunctional enzymes. PLoS One. 2012; 7: e38979. pmid:22723914
  38. 38. Zou Q, Wang Z, Guan X, Liu B, Wu Y, Lin Z. An approach for identifying cytokines based on a novel ensemble classifier. Biomed Res Int. 2013; 2013: 686090. pmid:24027761
  39. 39. Wei L, Liao M, Gao X, Zou Q. An Improved Protein Structural Prediction Method by Incorporating Both Sequence and Structure Information. IEEE Trans Nanobioscience. 2014; 14: 339–349.
  40. 40. Wei L, Liao M, Gao X, Zou Q. Enhanced Protein Fold Prediction Method Through a Novel Feature Extraction Technique. IEEE Trans Nanobioscience. 2015; 14: 649–659. pmid:26335556
  41. 41. Ong SA, Lin HH, Chen YZ, Li ZR, Cao Z. Efficacy of different protein descriptors in predicting protein functional families. BMC Bioinformatics. 2007; 8: 300. pmid:17705863
  42. 42. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009; 10: 421. pmid:20003500
  43. 43. Nath N, Mitchell JB. Is EC class predictable from reaction mechanism? BMC Bioinformatics. 2012; 13: 60. pmid:22530800
  44. 44. Naveed M, Khan A. GPCR-MPredictor: multi-level prediction of G protein-coupled receptors using genetic ensemble. Amino Acids. 2012; 42: 1809–1823. pmid:21505826
  45. 45. Khan ZU, Hayat M, Khan MA. Discrimination of acidic and alkaline enzyme using Chou's pseudo amino acid composition in conjunction with probabilistic neural network model. J Theor Biol. 2015; 365: 197–203. pmid:25452135
  46. 46. Wang P, Yang F, Yang H, Xu X, Liu D, Xue W, et al. Identification of dual active agents targeting 5-HT1A and SERT by combinatorial virtual screening methods. Biomed Mater Eng. 2015; 26 Suppl 1: S2233–2239. pmid:26406003
  47. 47. Cui J, Han LY, Li H, Ung CY, Tang ZQ, Zheng CJ, et al. Computer prediction of allergen proteins from sequence-derived protein structural and physicochemical properties. Mol Immunol. 2007; 44: 514–520. pmid:16563508
  48. 48. Majid A, Ali S, Iqbal M, Kausar N. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Comput Methods Programs Biomed. 2014; 113: 792–808. pmid:24472367
  49. 49. Dai HL. Imbalanced Protein Data Classification Using Ensemble FTM-SVM. IEEE Trans Nanobioscience. 2015; 14: 350–359.
  50. 50. Minneci F, Piovesan D, Cozzetto D, Jones DT. FFPred 2.0: improved homology-independent prediction of gene ontology terms for eukaryotic protein sequences. PLoS One. 2013; 8: e63754. pmid:23717476
  51. 51. Boratyn GM, Camacho C, Cooper PS, Coulouris G, Fong A, Ma N, et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013; 41: W29–33. pmid:23609542
  52. 52. Cao P, Zhao D, Zaiane O. Measure oriented cost-sensitive SVM for 3D nodule detection. Conf Proc IEEE Eng Med Biol Soc. 2013; 2013: 3981–3984. pmid:24110604
  53. 53. Bernardes JS, Pedreira CE. A review of protein function prediction under machine learning perspective. Recent Pat Biotechnol. 2013; 7: 122–141. pmid:23848274
  54. 54. Lyons J, Dehzangi A, Heffernan R, Sharma A, Paliwal K, Sattar A, et al. Predicting backbone Calpha angles and dihedrals from protein sequences by stacked sparse auto-encoder deep neural network. J Comput Chem. 2014; 35: 2040–2046. pmid:25212657
  55. 55. Spencer M, Eickholt J, Cheng J. A Deep Learning Network Approach to Protein Secondary Structure Prediction. IEEE/ACM Trans Comput Biol Bioinform. 2015; 12: 103–112. pmid:25750595
  56. 56. Heffernan R, Paliwal K, Lyons J, Dehzangi A, Sharma A, Wang J, et al. Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning. Sci Rep. 2015; 5: 11476. pmid:26098304
  57. 57. Lin C, Chen WQ, Qiu C, Wu YF, Krishnan S, Zou Q. LibD3C: Ensemble classifiers with a clustering and dynamic selection strategy. Neurocomputing. 2014; 123: 424–435.
  58. 58. Xu RF, Zhou JY, Wang HP, He YL, Wang XL, Liu B. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC Syst Biol. 2015; 9: S10. pmid:25708928
  59. 59. Liu B, Wang SY, Wang XL. DNA binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Sci Rep. 2015; 5: 15479. pmid:26482832
  60. 60. Wei LY, Zou Q, Liao MH, Lu HJ, Zhao YM. A Novel Machine Learning Method for Cytokine-Receptor Interaction Prediction. Comb Chem High Throughput Screen. 2016; 19: 144–152. pmid:26552440
  61. 61. Tang H, Chen W, Lin H. Identification of immunoglobulins using Chou's pseudo amino acid composition with feature selection technique. Mol Biosyst. 2016; 12: 1269–1275. pmid:26883492
  62. 62. Schomburg I, Chang A, Schomburg D. BRENDA, enzyme data and metabolic information. Nucleic Acids Res. 2002; 30: 47–49. pmid:11752250
  63. 63. Horn F, Vriend G, Cohen FE. Collecting and harvesting biological data: the GPCRDB and NucleaRDB information systems. Nucleic Acids Res. 2001; 29: 346–349. pmid:11125133
  64. 64. Karchin R, Karplus K, Haussler D. Classifying G-protein coupled receptors with support vector machines. Bioinformatics. 2002; 18: 147–159. pmid:11836223
  65. 65. Saier MH, Jr. A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol Mol Biol Rev. 2000; 64: 354–411. pmid:10839820
  66. 66. Le Novere N, Changeux JP. LGICdb: the ligand-gated ion channel database. Nucleic Acids Res. 2001; 29: 294–295. pmid:11125117
  67. 67. Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, et al. The Pfam Protein Families Database. Nucleic Acids Res. 2002; 30: 276–280. pmid:11752314
  68. 68. Liu B, Liu F, Wang X, Chen J, Fang L, Chou KC. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 2015; 43: W65–71. pmid:25958395
  69. 69. Rao HB, Zhu F, Yang GB, Li ZR, Chen YZ. Update of PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2011; 39: W385–390. pmid:21609959
  70. 70. Dubchak I, Muchnik I, Holbrook SR, Kim SH. Prediction of Protein-Folding Class Using Global Description of Amino-Acid-Sequence. Proc Natl Acad Sci U S A. 1995; 92: 8700–8704. pmid:7568000
  71. 71. Broto P, Moreau G, Vandycke C. Molecular-Structures—Perception, Auto-Correlation Descriptor and Sar Studies—Use of the Auto-Correlation Descriptor in the Qsar Study of 2 Non-Narcotic Analgesic Series. Eur J Med Chem. 1984; 19: 79–84.
  72. 72. Li ZR, Lin HH, Han LY, Jiang L, Chen X, Chen YZ. PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence. Nucleic Acids Res. 2006; 34: W32–37. pmid:16845018
  73. 73. Chang CC, Lin CJ. LIBSVM: A Library for Support Vector Machines. ACM Trans Intell Syst Technol. 2011; 2: 1–27.
  74. 74. Fix E, Hodges JL. Discriminatory analysis: Non-parametric discrimination: Consistency properties. Texas: USAF School of Aviation Medicine; 1951. pp. 261–279.
  75. 75. Specht DF. Probabilistic neural networks. Neural Networks. 1990; 3: 109–118.
  76. 76. Lin HT, Lin CJ, Weng RC. A note on Platt’s probabilistic outputs for support vector machines. Mach Learn. 2007; 68: 267–276.
  77. 77. Grassmann J, Reczko M, Suhai S, Edler L. Protein fold class prediction: new methods of statistical classification. Proc Int Conf Intell Syst Mol Biol. 1999; 1999: 106–112.
  78. 78. Reynolds SM, Kall L, Riffle ME, Bilmes JA, Noble WS. Transmembrane topology and signal peptide prediction using dynamic bayesian networks. PLoS Comput Biol. 2008; 4: e1000213. pmid:18989393
  79. 79. Guermeur Y, Geourjon C, Gallinari P, Deleage G. Improved performance in protein secondary structure prediction by inhomogeneous score combination. Bioinformatics. 1999; 15: 413–421. pmid:10366661
  80. 80. Needham CJ, Bradford JR, Bulpitt AJ, Care MA, Westhead DR. Predicting the effect of missense mutations on protein function: analysis with Bayesian networks. BMC Bioinformatics. 2006; 7: 405. pmid:16956412