We address the problem of assigning biological function to solved protein structures. Computational tools play a critical role in identifying potential active sites and informing screening decisions for further lab analysis. A critical parameter in the practical application of computational methods is the precision, or positive predictive value. Precision measures the level of confidence the user should have in a particular computed functional assignment. Low-precision annotations lead to futile laboratory investigations and waste scarce research resources. In this paper we describe an advanced version of the protein function annotation system FEATURE, which achieved 99% precision and average recall of 95% across 20 representative functional sites. The system uses a Support Vector Machine classifier operating on the microenvironment of physicochemical features around an amino acid. We also compared the performance of our method with that of the state-of-the-art sequence-level annotator Pfam in terms of precision, recall and localization. To our knowledge, no other functional site annotator has been rigorously evaluated against these key criteria. The software and predictive models are incorporated into the WebFEATURE service at http://feature.stanford.edu/wf4.0-beta.
Citation: Buturovic L, Wong M, Tang GW, Altman RB, Petkovic D (2014) High Precision Prediction of Functional Sites in Protein Structures. PLoS ONE 9(3): e91240. https://doi.org/10.1371/journal.pone.0091240
Editor: Iddo Friedberg, Miami University, United States of America
Received: September 26, 2013; Accepted: February 11, 2014; Published: March 14, 2014
Copyright: © 2014 Buturovic et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The computational resources of this research were supported through an Amazon Web Services in Education Research Grant to Center for Computing for Life Sciences at San Francisco State University, awarded June 18, 2012. RBA is supported by NIH grants LM05652, GM072970, GM102365 and the NSFBIOX2 award CNS-0619926. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: We received funding from a commercial source (Amazon Web Services in Education). This does not alter our adherence to all the PLOS ONE policies on sharing data and materials.
In the past decade, the amount of three-dimensional structural information for biological macromolecules has increased greatly, partly through technological advances and partly through the structural genomics initiatives that have prioritized the systematic determination of protein and nucleic acid structures [1] using X-ray crystallography, Nuclear Magnetic Resonance, electron microscopy, and other methods. As a result of this great acceleration of new information about the 3D structure of proteins, there is a shift in the amount of background biological information available for many of the newly solved structures. In particular, there are many solved structures with no reported biological function, and so computational methods are critical to identify active sites and understand their molecular function. Methods based on sequence analysis are very powerful in this regard, as they can recognize domains and 1D motifs associated with function. Sometimes, however, only an analysis of the 3D structure allows the recognition of spatial interactions that are not apparent in the sequence analysis. Several methods have been developed to seek functional sites using 3D information, including FFFs [2], TESS [3], GASPS [4], MarkUs [5] and FEATURE [6], [7].
An important protein function annotation strategy includes computational functional site prediction followed by experimental confirmation of the most promising results. In this context, the precision, or positive predictive value, of the predictor is of paramount importance. This parameter quantifies the proportion of positive predictions which are indeed functional. Low-precision models waste resources on the laborious pursuit of functional activity that is not present. We postulate that an annotator which delivers at least 99% precision should have considerable utility in many realistic applications, such as identification of therapeutic targets. At this level of precision, ninety-nine out of a hundred predicted functional sites would be expected to be confirmed in the lab, and the challenge becomes maximizing recall (the proportion of true functional sites found by the algorithm) among candidate computational models. Thus, the best method in the scenario we are considering maximizes recall at 99% precision. To our knowledge, none of the previously proposed sequence-based or structure-based methods had been developed for or rigorously evaluated against these specific goals, and this was a key motivation for the present work.
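For reference, the two quantities used throughout can be written in terms of confusion-matrix counts. These are the standard definitions; the symbols $TP$, $FP$ and $FN$ (true positives, false positives, false negatives) are notation introduced here for illustration:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}.$$

At 99% precision, $FP/(TP+FP) = 0.01$, which corresponds to roughly one false positive for every 99 true positives among the predicted sites.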
The basis for our approach was FEATURE, a function annotation method that uses 3D protein structure information. FEATURE regards functional sites as protein microenvironments represented by vectors of physicochemical properties (features). For developing machine learning predictors, these vectors are aggregated to build Naïve Bayes classification models for recognizing the location of binding and active sites by using examples of these sites of interest as a positive training set (e.g. calcium binding sites [8], [9], or thioredoxin active sites [10]) and using suitable non-sites as the negative training set. In this paper we utilized the FEATURE vectors and the Support Vector Machine (SVM) learning algorithm to construct a functional annotator which meets the stated precision and recall goals. The classifier choice was based on comprehensive evidence of SVM performance [11], [12], the availability of an industrial-strength software library [13] and the authors’ own experience [14], [15]. We also compared the new FEATURE with Pfam [16], a sequence-based annotator commonly used for functional annotation.
We built a 3D annotator, FEATURE, which assigns functional sites defined in PROSITE [17] to novel protein structures. We used the Protein Data Bank (PDB) [18] as the source of protein structures and PROSITE as the source of protein functional site definitions for supervised training of FEATURE machine learning models. PROSITE patterns are manually curated and are created according to previous observations from literature or from a sequence alignment of the protein sequences possessing the observed function. The patterns are derived from the alignment by taking the shortest common subsequence that matches known proteins with high specificity. Each pattern may result in multiple FEATURE predictive models, one for each functional atom in a conserved residue. Crucially, PROSITE entries identify true positive and false positive examples. It is this information which enabled us to conduct accurate learning and evaluation of the FEATURE functional predictive models.
Each FEATURE model requires positive and negative examples for training. We considered a structure to be a positive example if PROSITE indicated that it contained the functional site being modeled. Structures were considered negative examples if they were not positive. The positive and negative examples were chosen as follows:
- Positive examples
- (a) Identify true positive PROSITE examples and extract their structure data from the PDB. To avoid redundancy, cluster homologs sharing 100% sequence similarity and select a single representative structure with the highest X-ray crystallography resolution from each cluster for further processing.
- (b) For each of the PDB structures, map the PROSITE pattern to the amino acid sequence of the protein and find the residue number and residue name of the conserved residues in the PDB protein sequence.
- (c) Extract coordinates of functional atoms for all residues identified in step (b). The different conserved residues represent positive examples for the given predictive model; the extracted coordinates of the functional atoms are used to calculate feature vectors for training the FEATURE classifiers.
- Negative examples
- (a) From a snapshot of all PDB structures available at the time of the PROSITE 20.80 release, remove structures that are associated with the given functional class, as identified by PROSITE (i.e. we removed positive or potentially positive examples). We did not take negative examples from proteins containing positive sites, in order to avoid possible contamination of the negative set with sites that are close to positive sites and therefore contain residual signal.
- (b) For each functional atom in a PROSITE pattern, find 50,000 atom coordinates by randomly choosing atoms within the remaining PDB structures with the same residue name and atom name, sampling without replacement.
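The negative-example sampling step can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout of `candidates`, the function name and the fixed seed are all assumptions made for the example, and the candidate atoms are assumed to come only from structures already filtered to exclude the functional class.

```python
import random


def sample_negative_atoms(candidates, residue_name, atom_name, n=50_000, seed=0):
    """Pick up to n atoms with the given residue and atom name, without replacement.

    `candidates` is a hypothetical iterable of (pdb_id, residue_name, atom_name, xyz)
    tuples taken only from structures NOT associated with the functional class.
    """
    # Keep only atoms of the same chemical type as the functional atom being modeled.
    pool = [c for c in candidates if c[1] == residue_name and c[2] == atom_name]
    rng = random.Random(seed)
    # random.sample draws without replacement; cap at the pool size if it is small.
    return rng.sample(pool, min(n, len(pool)))
```

Matching on residue name and atom name keeps the negatives chemically comparable to the positives, so the classifier has to learn from the surrounding microenvironment rather than from the identity of the central atom.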
All positive and negative coordinates were converted to FEATURE vectors to generate the positive and negative samples for training the models. Specifically, we used Featurize, a function available in the public release of the FEATURE package (https://simtk.org/home/feature). Featurize extracts physicochemical properties from the three-dimensional structure of the spatial neighborhood surrounding the position associated with the function of interest. It represents functional sites as protein microenvironments that contain six spherical shells of 1.25 Ångstroms in thickness, oriented around a central point of interest. Featurize accumulates statistics about the abundance of atoms, residues, secondary structures, charge, polarity, hydrophobicity and other biophysical and biochemical properties (totaling 80 properties in each shell) in order to describe a microenvironment in a vector of 6 shells×80 properties = 480 features. The characteristic properties are represented as numeric vectors and are listed in Table 1.
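To make the shell layout concrete, the sketch below bins neighboring atoms into six 1.25 Å shells around a central point and accumulates per-shell property counts into one flat vector. It is an illustration only and does not reproduce Featurize: the function names, the atom representation and the boundary convention (an atom at exactly 7.5 Å is excluded) are assumptions.

```python
import math

SHELL_WIDTH = 1.25  # Angstroms, from the text
NUM_SHELLS = 6      # from the text; outer radius is 6 * 1.25 = 7.5 Angstroms


def shell_index(center, atom_xyz):
    """Return the 0-based shell containing atom_xyz, or None if it lies outside."""
    distance = math.dist(center, atom_xyz)
    index = int(distance // SHELL_WIDTH)
    return index if index < NUM_SHELLS else None


def count_features(center, atoms, properties):
    """Build a flat vector of per-shell property counts.

    `atoms` is a hypothetical list of (xyz, set_of_property_names) pairs;
    `properties` is the ordered list of property names (80 in the paper).
    """
    width = len(properties)
    column = {name: j for j, name in enumerate(properties)}
    vector = [0.0] * (NUM_SHELLS * width)
    for xyz, props in atoms:
        shell = shell_index(center, xyz)
        if shell is None:
            continue  # atom is beyond the outermost shell
        for name in props:
            if name in column:
                vector[shell * width + column[name]] += 1.0
    return vector
```

With 80 property names the returned vector has 6 × 80 = 480 entries, matching the dimensionality stated above.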
We chose 20 biologically distinct protein models based on adequate number of positive examples, biological relevance as judged by the authors and available resources for analyses. The choice was made prior to any downstream processing and never changed. The training samples for each protein model were converted to vectors of physicochemical properties using Featurize. The details of the protein models are given in Table 2.
We note that PROSITE also provides false negative designation for certain PDB proteins, which could in principle be used as positive examples. In practice, this is challenging because these proteins are known to have the function, yet do not conform to the PROSITE pattern and thus the exact atomic coordinates of the functional site are not available through PROSITE/PDB. This in turn prevents FEATURE modeling since it requires exact location of the functional site, and consequently we did not use false negatives in any analyses.
The FEATURE concept consists of multivariate representation of functional sites using the physicochemical microenvironment properties as feature vectors, followed by a classifier which assigns function (or lack thereof) to the resulting vector of properties. The original FEATURE system [6], [7] used the Naïve Bayes classifier, whereas the focus of this work is the Support Vector Machine classifier. To distinguish the two, we refer to them as FEATURE-SVM and FEATURE-NB.
Support vector machine.
The Support Vector Machine classifier refers to several variations of a two-class linear classifier described as having the maximum margin property. Intuitively, the property means that the linear classification hyperplane is as distant as possible from training data points in both classes.
In the standard formulation, SVM is a linear two-class classifier over a feature vector $x$,

$$f(x) = w^{T}x + b, \tag{1}$$

where the coefficients $w$ and $b$ are chosen to yield the maximum margin by solving the following constrained optimization problem:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{subject to} \quad y_i\,(w^{T}x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,N. \tag{2}$$

Here, $N$ is the total number of positive and negative examples in the training set, $x_i$ are the feature vectors, and $y_i \in \{-1, +1\}$ are their class labels. $C$ is a user-defined positive constant and $\xi_i$ measures the degree of misclassification of example $i$. Large values of $C$ improve training data accuracy, paid for by decreased generalization ability of the classifier.

This problem is equivalent to the standard linear regression problem [19], [20]

$$\min_{w,\,b}\ \sum_{i=1}^{N}\bigl[1 - y_i\,(w^{T}x_i + b)\bigr]_{+} + \frac{\lambda}{2}\lVert w\rVert^{2}, \tag{3}$$

where $[z]_{+} = \max(0, z)$, the first term is the hinge loss term, the second term is the regularization term, and $\lambda = 1/C$ is a user-given constant. The loss term measures accuracy of the classifier on the training data; the regularization term controls the generalization ability of the classifier. The constant $\lambda$ controls the trade-off between the two goals. The hinge loss distinguishes Support Vector Machine from other linear regression algorithms.
In this paper we used formulation (2).
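The link between formulations (2) and (3) can be checked numerically. The sketch below assumes the usual soft-margin objective, ½‖w‖² + C·Σξ_i, and the usual hinge-loss objective, Σ hinge + (λ/2)‖w‖². For any fixed w and b the smallest feasible slack ξ_i equals the hinge loss of example i, so the first objective is C times the second when λ = 1/C. All function names are illustrative.

```python
def decision(w, b, x):
    """Linear decision value w.x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def hinge(w, b, x, y):
    """Hinge loss max(0, 1 - y * f(x)); equals the optimal slack for this example."""
    return max(0.0, 1.0 - y * decision(w, b, x))


def soft_margin_objective(w, b, X, Y, C):
    """Constrained-form objective with each slack set to its smallest feasible value."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    return penalty + C * sum(hinge(w, b, x, y) for x, y in zip(X, Y))


def hinge_objective(w, b, X, Y, lam):
    """Loss-plus-regularization objective with trade-off constant lam."""
    loss = sum(hinge(w, b, x, y) for x, y in zip(X, Y))
    return loss + 0.5 * lam * sum(wj * wj for wj in w)
```

Because the two objectives differ only by the positive factor C, they share the same minimizer, which is why a large C (small λ) favors fitting the training data over regularization.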
For this application in particular, it is critically important to generate class-conditional posterior probabilities because they drive the decision of whether to invest scarce resources into experimental confirmation of putative functional sites. The Naïve Bayes algorithm used in FEATURE-NB natively produces the posterior probabilities. However, in its original formulation, the SVM algorithm does not produce probabilities, but scores on an arbitrary, non-intuitive scale. To overcome this issue, we used the probabilistic extension of the SVM algorithm [21] as implemented in the LIBSVM [13] software library.
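As a hedged illustration of what such a probabilistic extension looks like in the two-class case: LIBSVM's probability option fits a sigmoid to the SVM decision values $f(x)$, in the style of Platt scaling. The symbols $A$ and $B$ are notation introduced here, not taken from the source:

$$P(y = 1 \mid x) \approx \frac{1}{1 + \exp\bigl(A\,f(x) + B\bigr)},$$

where $A$ and $B$ are estimated from the training data, using decision values obtained by an internal cross-validation. The resulting values lie in $[0, 1]$ and can therefore be thresholded to trade precision against recall.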
The original FEATURE program (FEATURE-NB) used Naïve Bayes classifier models. The Naïve Bayes learning algorithm estimates class-conditional probability density functions for each class by assuming independence of individual features:

$$p(x \mid c_k) = \prod_{j=1}^{d} p(x_j \mid c_k), \qquad k = 1,\dots,K, \tag{4}$$

where $K$ is the number of classes and $d$ the number of features ($K = 2$ and $d = 480$ in this work). Class-conditional posterior probability estimates are derived by combining the density functions and class probabilities using Bayes' theorem:

$$P(c_k \mid x) = \frac{p(x \mid c_k)\,P(c_k)}{\sum_{l=1}^{K} p(x \mid c_l)\,P(c_l)}. \tag{5}$$

We treated the prior probability of the positive class, $P$, as a tunable parameter. We approximated $p(x_j \mid c_k)$ using the training data and dividing the observed values into a histogram of five bins.
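A minimal sketch of a histogram-based Naïve Bayes classifier with a tunable positive-class prior is given below. It is not the FEATURE-NB implementation: equal-width bins over the observed range, add-one (Laplace) smoothing via `alpha`, and the 0/1 label encoding are all assumptions made so the example is complete.

```python
import math

N_BINS = 5  # five histogram bins, as stated in the text


def _bin(value, lo, hi, n_bins):
    """Map a value to an equal-width bin index, clamping out-of-range values."""
    if hi == lo:
        return 0
    k = int((value - lo) / (hi - lo) * n_bins)
    return min(max(k, 0), n_bins - 1)


def fit_histogram_nb(X, y, n_bins=N_BINS, alpha=1.0):
    """Estimate per-class, per-feature bin probabilities from rows X and 0/1 labels y."""
    d = len(X[0])
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    counts = {c: [[alpha] * n_bins for _ in range(d)] for c in (0, 1)}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            counts[label][j][_bin(value, lo[j], hi[j], n_bins)] += 1.0
    probs = {c: [[n / sum(h) for n in h] for h in counts[c]] for c in (0, 1)}
    return {"lo": lo, "hi": hi, "probs": probs, "n_bins": n_bins}


def posterior_positive(model, x, prior_pos):
    """Posterior of the positive class via Bayes' theorem, with 0 < prior_pos < 1."""
    log_joint = {}
    for c, prior in ((1, prior_pos), (0, 1.0 - prior_pos)):
        total = math.log(prior)
        for j, value in enumerate(x):
            k = _bin(value, model["lo"][j], model["hi"][j], model["n_bins"])
            total += math.log(model["probs"][c][j][k])  # independence assumption
        log_joint[c] = total
    top = max(log_joint.values())  # subtract the max for numerical stability
    norm = sum(math.exp(v - top) for v in log_joint.values())
    return math.exp(log_joint[1] - top) / norm
```

Working in log space avoids underflow when the product runs over hundreds of features, and raising `prior_pos` shifts every posterior upward, which is what makes the prior usable as a tuning parameter.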
Predictive Model Selection and Performance Estimation
Performance evaluation of FEATURE included selection of the best classification model for each site. The different models were built by varying the top-level parameter (the prior probability $P$ of the positive class for Naïve Bayes, the cost $C$ for SVM). We performed model selection by comparing cross-validation estimates of performance for the different models, and selecting as the best model the one producing the minimum number of misclassifications. For each model corresponding to a different value of the top-level parameter, we also recorded the estimated class-conditional posterior probabilities for each sample.
Once the best model was chosen for each functional site, we calculated precision and recall using the recorded class-conditional probabilities. This required setting a decision threshold to achieve the stated goal of 99% precision. In a finite-sample scenario, it is not possible to achieve the exactly specified value of precision; we used the closest achievable value. The actual achieved precision values are reported in the Results section.
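A minimal sketch of how such a decision threshold could be picked from recorded probability estimates follows. It scans every distinct score as a candidate threshold and keeps the one whose precision is closest to the target. The tie-breaking rule (prefer higher recall when two thresholds are equally close) is an assumption, as is the function name.

```python
def pick_threshold(probs, labels, target_precision=0.99):
    """Return (threshold, precision, recall) with precision closest to the target.

    A sample is predicted positive when its probability is >= threshold.
    `labels` are 1 for true sites and 0 for non-sites; `probs` must be non-empty.
    """
    total_pos = sum(labels)
    ranked = sorted(zip(probs, labels), key=lambda pair: -pair[0])
    best_key, best = None, None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample sharing this score so the cut is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            tp += ranked[i][1]
            fp += 1 - ranked[i][1]
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos if total_pos else 0.0
        key = (abs(precision - target_precision), -recall)
        if best_key is None or key < best_key:
            best_key, best = key, (threshold, precision, recall)
    return best
```

With a finite sample the achievable precision values form a discrete set, so the returned precision will generally differ slightly from the requested 99%.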
The model selection process used the positive and negative feature vectors and performed a grid search of user-tunable parameters (cost C for FEATURE-SVM, prior probability of the positive class P for FEATURE-NB) yielding the best model. The search amounted to selecting the model parameters which produced the highest recall at the given precision, estimated using cross-validation as described below. The optimization of the parameters C and P was conducted over a pre-defined set of values. For each value, we performed five-fold cross-validation estimation of the performance of the classifier. Based on published guidelines [22] and the authors’ experience, we used a fixed grid of SVM cost values. Classifiers built using Naïve Bayes utilized a previously published grid of prior probability values.
By necessity, the number of positive examples was significantly smaller than the number of negative examples. To ameliorate the impact of the highly unbalanced classes, we used stratification by class label, whereby each cross-validation fold had approximately the same proportion of positive and negative examples as the overall training set.
The cross-validation algorithm for a given top-level parameter is defined in Algorithm 1 box. The number of folds was set to five.
Algorithm 1 The cross-validation approach for generating sample predictions for a given functional site and associated training dataset D
- Require: dataset D, subsets D_1, D_2, …, D_F, parameters π
- for i = 1 … F do
- Learning Set_i = D \ D_i
- Cross Validation Set_i = D_i
- Model_i = Train(Learning Set_i, π)
- Predictions_i = Predict(Model_i, Cross Validation Set_i)
- end for
- Predictions = Predictions_1 ∪ Predictions_2 ∪ … ∪ Predictions_F
This approach highlights the following question related to estimation of model performance in the cross-validation setting. Predictions_i is the set of probability estimates for examples in subset D_i. The union of all sets D_i contains the entire training set, so the above procedure generates probability estimates for all training set examples. In principle, this is the required input data for estimating classifier performance. However, the individual prediction sets were generated by different models, and are therefore not directly comparable. To the best of our knowledge, there is no consensus in the machine learning community on how to produce aggregate measures in this scenario [23]. We took the approach of treating all cross-validation iterations as a single continuous experiment, although other approaches may be sensible.
The probability estimates were used to calculate all statistics reported in the Results section.
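The stratified cross-validation of Algorithm 1, with predictions pooled across folds, can be sketched as follows. The `train` and `predict` callables are placeholders for any classifier (for example an SVM or Naïve Bayes wrapper); their signatures, the round-robin fold assignment and the fixed seed are assumptions made for this illustration.

```python
import random


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each sample to a fold so that class proportions are roughly preserved."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        members = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(members)
        # Deal the shuffled members of this class round-robin across the folds.
        for position, i in enumerate(members):
            fold_of[i] = position % n_folds
    return fold_of


def cross_val_predictions(X, y, train, predict, params, n_folds=5, seed=0):
    """Train on all folds but one, predict the held-out fold, and pool the results."""
    fold_of = stratified_folds(y, n_folds, seed)
    pooled = [None] * len(y)
    for fold in range(n_folds):
        learn = [i for i in range(len(y)) if fold_of[i] != fold]
        held_out = [i for i in range(len(y)) if fold_of[i] == fold]
        model = train([X[i] for i in learn], [y[i] for i in learn], params)
        predictions = predict(model, [X[i] for i in held_out])
        for i, p in zip(held_out, predictions):
            pooled[i] = p  # each sample is predicted exactly once
    return pooled
```

Every example receives exactly one out-of-fold probability, so the pooled list can be fed directly into precision and recall calculations, which corresponds to treating the folds as a single continuous experiment.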
Comparison of FEATURE with Pfam
The key challenge in comparing different annotators is matching their respective functional site assignments. In our case, FEATURE produces functional site predictions as defined by PROSITE, because that is where the “truth” labels for FEATURE models are derived from. Pfam has its own nomenclature of functional sites, creating the challenge of comparing predictions for the two methods. To resolve this and estimate Pfam predictive performance on a scale comparable to FEATURE, we developed a protocol utilizing InterPro [24], a resource which unites diverse protein annotation databases, including PROSITE and Pfam. The protocol consisted of the following steps for each of the 20 protein models we analyzed:
- Record PROSITE accession number for the functional site. FEATURE predictive models are functional site predictors based on PROSITE patterns, therefore by definition each has a PROSITE accession number.
- Record all Pfam annotations that are co-located with PROSITE annotations (identified by the PROSITE accession number) in InterPro. To increase confidence in the mapping, we only used InterPro mapping entries for which the corresponding protein exists in SWISSPROT [25].
As an example, PROSITE ASP_PROTEASE (PS00141) maps to two Pfam domains: Eukaryotic aspartyl protease (PF00026) and Retroviral aspartyl protease (PF00077). The mapping of all 20 protein models is listed in Table 3.
- Generate Pfam predictions (domains) using the amino-acid sequence data for the positive and negative examples as input.
- Calculate Pfam precision and recall using the PROSITE-to-Pfam mapping. The confusion matrix was generated using the following logic:
- – For positive examples, if any of the Pfam predictions matched PROSITE as per Table 3 mapping, we considered the prediction a True Positive; if none of the Pfam predictions matched PROSITE, we considered the prediction a False Negative.
As an example, consider an ASP_PROTEASE positive example. If Pfam prediction for the example contained Eukaryotic aspartyl protease (PF00026) or Retroviral aspartyl protease (PF00077), it was considered a True Positive.
- – For negative examples, if any of the Pfam predictions matched PROSITE, we considered the prediction a False Positive; if none of the Pfam predictions matched PROSITE, we considered the prediction a True Negative.
One of the functional sites (ZINC_PROTEASE) did not have a matching InterPro entry and therefore was not used in Pfam analyses because there was no pre-specified way to compare the FEATURE and Pfam predictions for that site.
This protocol does not provide an opportunity to control the precision/recall trade-off. Therefore the Pfam results were reported at whatever precision level was reached with Pfam.
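The confusion-matrix logic of this protocol can be sketched in a few lines. The mapping below contains only the ASP_PROTEASE example given in the text; the data layout (a flag plus a set of predicted Pfam accessions per example) and the function names are assumptions for illustration.

```python
# PROSITE accession -> mapped Pfam accessions (ASP_PROTEASE example from the text).
PROSITE_TO_PFAM = {"PS00141": {"PF00026", "PF00077"}}


def pfam_confusion(examples, prosite_id, mapping=PROSITE_TO_PFAM):
    """Count TP/FP/FN/TN for one functional site.

    `examples` is a hypothetical iterable of (is_positive, predicted_pfam_accessions).
    """
    targets = mapping[prosite_id]
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_positive, predicted in examples:
        hit = bool(targets & set(predicted))  # any mapped Pfam domain predicted?
        if is_positive:
            counts["TP" if hit else "FN"] += 1
        else:
            counts["FP" if hit else "TN"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); either is None when its denominator is zero."""
    predicted_pos = counts["TP"] + counts["FP"]
    actual_pos = counts["TP"] + counts["FN"]
    precision = counts["TP"] / predicted_pos if predicted_pos else None
    recall = counts["TP"] / actual_pos if actual_pos else None
    return precision, recall
```

Because a single matching domain among several predicted ones is enough to count as a hit, this scoring is lenient toward Pfam on positive examples.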
Training and evaluation of SVM machine learning on all PROSITE v20.80 functional classes demanded large-scale parallel computation. Feature extraction, parameter optimization and cross-validation take 4–8 hours on an Intel Xeon 3400-series processor for a typical SVM predictive model, the most computationally demanding of the three methods considered here. To meet this challenge, all computations were performed using Amazon Elastic Compute Cloud (EC2) services with MIT StarCluster software [26]. Amazon EC2 provides virtual machines (VMs) for scalable cost-efficient computation. MIT StarCluster organizes these VMs into a dynamically scalable Beowulf cluster with parallel computing tools such as MPI and Open Grid Scheduler.
We extracted positive and negative examples using the protocol described in the Materials section. The resulting numbers of examples, given in Table 4, provided for narrow 95% confidence intervals of the estimated performance parameters and robust conclusions regarding the methods’ performances.
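To illustrate how sample size drives the width of such intervals, the sketch below computes a 95% confidence interval for a proportion such as recall. The source does not state which interval construction was used; the Wilson score interval here is an assumption chosen for illustration.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p * (1.0 - p) / n + z * z / (4.0 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)
```

For a fixed observed proportion the interval narrows roughly with the square root of the number of examples, which is why large example counts support tighter performance estimates.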
Due to finite training set size, precision could not be set exactly at 99%. We used the closest achievable value for FEATURE-SVM and FEATURE-NB, as reported in Table 5. For Pfam, no precision tuning was possible, but with the exception of ADH_SHORT and KINASE_TYR it also provided precision exceeding 99% (for ADH_SHORT and KINASE_TYR the Pfam precision values were 98% and 96%, respectively).
Overall, FEATURE-SVM clearly surpassed Pfam and FEATURE-NB in terms of recall at approximately 99% precision (Fig. 1, Table 5 and Figures S1–S20 in File S1). For 18 out of the 20 functional sites, the difference between the FEATURE-SVM recall rate and that of Pfam was between 6% and 78%. All differences were statistically significant with 95% confidence. For one site (EGF_1), Pfam recall rate was slightly higher than FEATURE-SVM (75% vs. 72%), though the difference was not statistically significant. The Pfam result for ZINC_PROTEASE was not available because InterPro did not have a corresponding Pfam match.
Figure 1. The y-axis shows the recall value at approximately 99% precision. Vertical lines within bars indicate 95% confidence intervals. The Pfam result for ZINC_PROTEASE was not available because the InterPro database, which was used to map site names, does not have a mapping record for this functional site. The functional sites are sorted by increasing recall value of FEATURE-SVM.
FEATURE-SVM was superior to FEATURE-NB for 16 sites by between 1% and 60%. In ten out of the 16 comparisons the difference was statistically significant with 95% confidence. For LACTALBUMIN_LYSOZYME, ALPHA_CA_1, CYTOCHROME_P450 and CARBOXYLESTERASE_B_2, both FEATURE-SVM and FEATURE-NB achieved 100% recall. In summary, for the 19 sites for which all three methods yielded a result, the mean recall rates were 95% (FEATURE-SVM), 83% (FEATURE-NB) and 59% (Pfam).
We sought to develop a system for identifying functional sites in protein structures for an important use case scenario. Specifically, our goal was to develop an annotator that achieves acceptable levels of recall at 99% precision. We found that the combination of FEATURE and Support Vector Machine classifier delivered high recall (exceeding 70% in all of the cases studied, and averaging 95% over 20 functional sites) at the specified level of precision. This met our goals and thus we are able to provide a useful new tool (through the WebFEATURE service) for researchers in this domain, especially given the magnitude of the absolute and relative performance gain (95% recall vs. 83% for FEATURE-NB and 59% for Pfam).
We observed that the Support Vector Machine classifier delivered better classification accuracy than Naïve Bayes (95% recall vs. 83% for the FEATURE-NB averaged over all 20 functional sites). This is consistent with observations in many other application domains (for example cancer diagnostics [27]) and further confirms the power of this classification model.
The FEATURE-SVM annotator is purely predictive and does not explain to what extent individual microenvironment attributes contribute to the functionality of the predicted site. This behavior is a consequence of our focus on maximizing accuracy (i.e., precision/recall). It is consistent with recent findings in causal inference [28] that demonstrate that ranking of features for classification may have no explanatory utility.
When evaluating annotators for our use case scenario (i.e., prediction of function in a solved structure followed by experimental confirmation), it is important to note that the FEATURE-based tools point to exact atomic location of the functional site, unlike Pfam, which reports a (sometimes long) sequence segment corresponding to a functional domain.
We performed exhaustive analysis of 20 functional sites, which is a small fraction of the potentially useful sites (the number offered through the WebFEATURE service is over 600). Nevertheless, we argue that our main conclusion of high utility of the FEATURE-SVM annotator is likely to apply to the general population of sites for the following reasons:
- The 20 sites were chosen a priori, before any analyses, and then frozen, which makes for an unbiased sample.
- Given the magnitude of the estimated recall (95%), even if the estimate is biased, the large-sample estimate is still likely to be in the very useful range.
We developed a protocol for measuring Pfam performance in a way that is comparable to FEATURE. There is no single best way to do this since the mapping of functional sites from Pfam to FEATURE involves a degree of expert judgment. We argue that our protocol does not favor FEATURE for the following reason. Pfam may predict multiple domains for a given input sequence. If any of the predicted domains matches PROSITE per the established mapping, we consider the prediction to be a True Positive. Therefore we believe that the FEATURE performance relative to Pfam observed in practice is likely to be as good as reported here or better.
The choice of Pfam as the primary 1D function prediction method for the comparison was somewhat arbitrary. It is based on the fact that Pfam is a well-recognized tool, and that it represents a class of sequence-based methods with similar performance. Thus our comparative results should be representative of the expected performance gap between FEATURE-SVM and 1D methods.
We performed extensive and rigorous evaluation of the methods we used, with over 50,000 training examples for each functional class and extensive grid-search of the user-tunable parameters using cross-validation. To the best of our knowledge, no other annotator has been evaluated in a comparable manner.
End users of a functional annotation system would benefit from a rigorous performance comparison of competing state-of-the-art structural methods. However, we are not aware of another predictive algorithm which has been evaluated in the way performed in this paper; therefore, direct comparison with our work is not possible. Furthermore, a key requirement for the comparison of different predictor outputs is translation to a common “language” of functional sites. As illustrated in our FEATURE-Pfam comparison, this requires extensive automation and human judgment, and is beyond the scope of the present report. We leave a comparison of FEATURE to other structural methods for future research.
The combination of FEATURE properties and Support Vector Machine classifier predicts precise location of functional sites in unannotated protein structures with 99% precision and high recall rates (exceeding 70% in all of the cases studied, and averaging 95%). As a result, the WebFEATURE service which implements the FEATURE predictive models allows users to confidently pursue laboratory confirmation of the predicted protein function. Additionally, our findings suggest that bioinformaticians interested in predictive modeling of protein activity should consider Support Vector Machine classifiers for the most accurate results.
This file contains Figures S1–S20, which are recall vs. precision graphs for the 20 models analyzed in the paper. Figure S1, Recall vs. Precision: ADH_SHORT. Figure S2, Recall vs. Precision: ALPHA_CA_1. Figure S3, Recall vs. Precision: ASP_PROTEASE. Figure S4, Recall vs. Precision: ATPASE_ALPHA_BETA. Figure S5, Recall vs. Precision: CARBOXYLESTERASE_B_2. Figure S6, Recall vs. Precision: CYTOCHROME_P450. Figure S7, Recall vs. Precision: EF_HAND. Figure S8, Recall vs. Precision: EGF_1. Figure S9, Recall vs. Precision: IG_MHC. Figure S10, Recall vs. Precision: INSULIN. Figure S11, Recall vs. Precision: LACTALBUMIN_LYSOZYME. Figure S12, Recall vs. Precision: LECTIN_LEGUME_BETA. Figure S13, Recall vs. Precision: PA2_HIS. Figure S14, Recall vs. Precision: PROTEIN_KINASE_ST. Figure S15, Recall vs. Precision: PROTEIN_KINASE_TYR. Figure S16, Recall vs. Precision: RNASE_PANCREATIC. Figure S17, Recall vs. Precision: SOD_CU_ZN_1. Figure S18, Recall vs. Precision: TRYPSIN_HIS. Figure S19, Recall vs. Precision: TRYPSIN_SER. Figure S20, Recall vs. Precision: ZINC_PROTEASE.
We thank SFSU graduate students Teague Sterling and Nick Mew for their contributions to WebFEATURE service.
Conceived and designed the experiments: LB MW RA DP. Performed the experiments: MW. Analyzed the data: MW LB. Contributed reagents/materials/analysis tools: GT MW LB. Wrote the paper: LB.
- 1. Thornton J (2001) Structural genomics takes off. Trends Biochem Sci 26: 88–89.
- 2. Fetrow JS, Skolnick J (1998) Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases. J Mol Biol 281: 949–968.
- 3. Wallace AC, Borkakoti N, Thornton JM (1997) TESS: a geometric hashing algorithm for deriving 3D coordinate templates for searching structural databases. Application to enzyme active sites. Protein Sci 6: 2308–2323.
- 4. Polacco BJ, Babbitt PC (2006) Automated discovery of 3D motifs for protein function annotation. Bioinformatics 22: 723–730.
- 5. Petrey D, Honig B (2009) Is protein classification necessary? Toward alternative approaches to function annotation. Curr Opin Struct Biol 19: 363–368.
- 6. Wei L, Altman RB (2003) Recognizing complex, asymmetric functional sites in protein structures using a Bayesian scoring function. J Bioinform Comput Biol 1: 119–138.
- 7. Halperin I, Glazer DS, Wu S, Altman RB (2008) The FEATURE framework for protein function annotation: modeling new functions, improving performance, and extending to novel applications. BMC Genomics (Suppl 2): S2.
- 8. Wei L, Altman RB (1998) Recognizing protein binding sites using statistical descriptions of their 3D environments. In: Pac Symp Biocomput. pp. 497–508.
- 9. Glazer DS, Radmer RJ, Altman RB (2009) Improving structure-based function prediction using molecular dynamics. Structure 17: 919–929.
- 10. Tang GW, Altman RB (2011) Remote thioredoxin recognition using evolutionary conservation and structural dynamics. Structure 19: 461–470.
- 11. Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S (2005) A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21: 631–643.
- 12. Statnikov A, Wang L, Aliferis CF (2008) A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics doi: 10.1186/1471-2105-9-319.
- 13. Chang CC, Lin CJ (2011) LIBSVM: A library for support vector machines. ACM Trans Intell Syst Technol 2: 27:1–27:27.
- 14. Monzon FA, Lyons-Weiler M, Buturovic LJ, Rigl CT, Henner WD, et al. (2009) Multicenter validation of a 1,550-gene expression profile for identification of tumor tissue of origin. J Clin Oncol 27: 2503–2508.
- 15. Pillai R, Deeter R, Rigl CT, Nystrom JS, Miller MH, et al. (2011) Validation and reproducibility of a microarray-based gene expression test for tumor identification in formalin-fixed, paraffin-embedded specimens. J Mol Diagn 13: 48–56.
- 16. Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, et al. (2012) The Pfam protein families database. Nucleic Acids Res 40: D290–D301.
- 17. Sigrist CJA, Cerutti L, de Castro E, Langendijk-Genevaux PS, Bulliard V, et al. (2010) PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res 38: D161–D166.
- 18. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, et al. (2000) The Protein Data Bank. Nucleic Acids Res 28: 235–242.
- 19. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning, Second Edition. New York: Springer.
- 20. Wang L, Zhu J, Zou H (2008) Hybrid huberized support vector machines for microarray classification and gene selection. Bioinformatics 24: 412–419.
- 21. Wu TF, Lin CJ, Weng RC (2004) Probability estimates for multi-class classification by pairwise coupling. J Mach Learn Res 5: 975–1005.
- 22. Hsu CW, Chang CC, Lin CJ. A practical guide to support vector classification. Available: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf. Accessed 2014 Feb 15.
- 23. Fawcett T (2006) An introduction to ROC analysis. Pattern Recognit Lett 27: 861–874.
- 24. Hunter S, Jones P, Mitchell A, Apweiler R, Attwood TK, et al. (2012) InterPro in 2011: new developments in the family and domain prediction database. Nucleic Acids Res 40: D306–D312.
- 25. Bairoch A, Apweiler R (1997) The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res 25: 31–36.
- 26. MIT StarCluster software. Available: http://star.mit.edu/cluster. Accessed 2014 Feb 15.
- 27. Lal A, Panos R, Marjanović M, Walker M, Fuentes E, et al. (2013) A gene expression profile test to resolve head & neck squamous versus lung squamous cancers. Diagn Pathol 8: 44.
- 28. Aliferis CF, Statnikov A, Tsamardinos I, Mani S, Koutsoukos XD (2010) Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation. J Mach Learn Res 11: 171–234.