PredHSP: Sequence Based Proteome-Wide Heat Shock Protein Prediction and Classification Tool to Unlock the Stress Biology

Ravindra Kumar; Bandana Kumari; Manish Kumar

doi:10.1371/journal.pone.0155872

Abstract

Heat shock proteins (HSPs) are chaperone proteins that are present in every domain of life. They play a crucial role in the folding and unfolding of proteins, in their sorting and assembly into multi-protein complexes, and in cell cycle control, and they also protect the cell during stress. Because no web-based predictor is available for simultaneous prediction and classification of HSPs, a method that can predict and classify them efficiently is needed. In this study, we developed PredHSP, a two-tier method based on coupled amino acid composition and support vector machines, which identifies heat shock proteins (1^st tier) and classifies them into different families (2^nd tier). At the 1^st tier, we achieved a maximum accuracy of 76.66% with MCC 0.43, while at the 2^nd tier we achieved maximum accuracies of 96.36% with MCC 0.87 for HSP20, 91.91% with MCC 0.83 for HSP40, 95.96% with MCC 0.72 for HSP60, 91.87% with MCC 0.71 for HSP70, 98.43% with MCC 0.70 for HSP90 and 97.48% with MCC 0.71 for HSP100. We have also developed a webserver and a standalone package for use by the scientific community, which can be accessed at http://14.139.227.92/mkumar/predhsp/index.html.

Citation: Kumar R, Kumari B, Kumar M (2016) PredHSP: Sequence Based Proteome-Wide Heat Shock Protein Prediction and Classification Tool to Unlock the Stress Biology. PLoS ONE 11(5): e0155872. https://doi.org/10.1371/journal.pone.0155872

Editor: Eugene A. Permyakov, Russian Academy of Sciences, Institute for Biological Instrumentation, RUSSIAN FEDERATION

Received: December 30, 2015; Accepted: May 5, 2016; Published: May 19, 2016

Copyright: © 2016 Kumar et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper.

Funding: This study was supported by the University Grant Commission Major Research Project (http://www.ugc.ac.in) (grant no. 41-38/2012(SR)) (MK); University Grants Commission of India (http://www.ugc.ac.in) (grant no. 20-12/2009(ii)EU-IV) (RK); Science & Engineering Research Board (SERB), Department of Science & Technology, Government of India under Fast Track Scheme for Young Scientist grant (http://www.serb.gov.in/home.php#/) (grant no. SR/FT/LS-84/2010) (BK). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Heat shock proteins (HSPs) are stress-induced proteins found ubiquitously in all organisms, from bacteria to humans. They are one of the largest groups of molecular chaperones that assist in the correct folding of partially folded or denatured proteins. Depending on molecular weight and core function, six major families of HSPs have been reported: (i) HSP20 or small heat shock proteins (sHsp), (ii) Hsp40 or J-class proteins, (iii) Hsp60 or chaperonins, (iv) Hsp70, (v) Hsp90, and (vi) Hsp100/ClpB proteins [1, 2]. HSPs play a vital role in the cellular response to unfavourable environmental conditions, whether physical (temperature elevation) or chemical (changes in pH, salinity, or oxygen concentration). To protect the cell from the destructive effects of stress, HSPs help partially denatured proteins attain their functional conformation [3]. The activities of stress proteins are not limited to chaperoning other proteins but also include other functions, such as modulation of their own synthesis [4], regulation of the stress kinase JNK [5], participation in signal transduction pathways [6] and rRNA processing [7]. Because of this wide range of functional activities, malfunctioning of HSPs leads to a number of life-threatening diseases, including Parkinson’s disease [8], Alzheimer’s disease [9], cardiovascular diseases [10] and cancer [11].

Owing to the availability of rapid and relatively inexpensive genome sequencing technologies, a large number of protein sequences are continuously added to the databases. A major fraction of these sequences is not annotated. Considering the time and resources involved in experimental annotation, these sequences are very unlikely to be annotated in the near future. This makes computational pipelines, which are inexpensive and high-throughput, an ideal choice for annotation. Given the importance of HSPs in cellular metabolism and the number of un-annotated sequences in the databases that might be HSPs, a computational method that identifies HSPs and classifies their family on the basis of the primary protein sequence alone would have a far-reaching effect. Two attempts at HSP annotation have already been made, by (i) Feng et al. [1] and (ii) Ahmad et al. [12], but only for classification into different HSP families. These methods have the following shortcomings: (i) they assign a query protein to an HSP family without first verifying whether it is an HSP at all, and (ii) the method developed by Ahmad et al. [12] does not provide any web-based tool or standalone software for prediction.

Here, we describe PredHSP, which addresses the shortcomings of the existing methods. PredHSP can predict HSPs and also their different families. It uses coupled amino acid composition (CAA) to encode the sequence as input and a support vector machine (SVM) as the prediction engine.

Materials and Methods

Data Source

Training Dataset.

To develop PredHSP, we used the same dataset that was recently reported for the development of iHSP-PseRAAAC [1]. The dataset was originally derived from the HSPIR database [2]. Feng et al. [1] then removed sequences having ≥40% sequence similarity within the same subset using CD-HIT [13], obtaining 2225 sequences from the different HSP families (Table 1). In addition, 10000 non-HSP sequences were randomly picked from SwissProt such that no two sequences were homologous. During training, HSP sequences were used as the positive dataset and non-HSP sequences as the negative dataset.

Table 1. Protein distribution in training dataset.

https://doi.org/10.1371/journal.pone.0155872.t001

Independent Dataset.

We built two independent datasets containing sequences of different HSP families (Table 2): (i) an HGNC dataset [14] of 95 human HSPs, collected from the HUGO Gene Nomenclature Committee (HGNC) database, and (ii) a mixed dataset of 55 rice HSPs. The mixed dataset combines HSPs reported in two different research papers: 31 HSPs were obtained from Wang et al. [15] and 24 HSPs of a single family, namely HSP70, were obtained from Sarkar et al. [16].

Table 2. Distribution of HSPs across different families in independent datasets.

The HGNC dataset contains human HSPs obtained from HGNC [14], and the mixed dataset contains rice HSPs obtained from Wang et al. [15] and Sarkar et al. [16].

https://doi.org/10.1371/journal.pone.0155872.t002

Genome Wide Prediction of HSPs

We downloaded nine proteomes from UniProt: one from archaea (Methanothermobacter thermautotrophicus), two from bacteria (Escherichia coli, Mycobacterium tuberculosis) and six from eukaryotes, comprising baker's yeast (Saccharomyces cerevisiae), plants (Arabidopsis thaliana, Oryza sativa) and animals (Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens). Using the PredHSP annotation pipeline, we predicted the HSPs and annotated their families at proteome level. The proteomes contained 1868 (Methanothermobacter thermautotrophicus), 4305 (Escherichia coli), 3993 (Mycobacterium tuberculosis), 6721 (Saccharomyces cerevisiae), 31480 (Arabidopsis thaliana), 37386 (Oryza sativa), 26612 (Caenorhabditis elegans), 22006 (Drosophila melanogaster) and 70076 (Homo sapiens) proteins.

Prediction Schema

Considering the heterogeneous nature of HSPs, a multi-class classification approach is generally used to predict the various HSP families. Multi-class predictors assume that the query sequence belongs to the class whose sub-class is to be predicted. This assumption holds during training, which is done on curated data, but during blind prediction a non-member may be submitted as a query and will then be wrongly assigned to a sub-class of a class to which it does not belong. To reduce the likelihood of such wrong classification, we adopted a two-tier approach. At the 1^st tier, non-HSPs were filtered out, and only sequences predicted as HSPs were passed to the 2^nd tier, where the family was predicted (Fig 1).

Fig 1. Flow chart to show the prediction schema of HSPs and its families.

https://doi.org/10.1371/journal.pone.0155872.g001
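The flow in Fig 1 can be sketched in a few lines of Python. This is a minimal illustration, not the PredHSP implementation: the function and type names are hypothetical, the scorers stand in for trained SVM models, and choosing the highest-scoring family at the 2^nd tier is an assumed decision rule that the text does not specify.

```python
from typing import Callable, Dict, Optional

Scorer = Callable[[str], float]  # maps a protein sequence to a decision value


def predict_two_tier(
    sequence: str,
    tier1: Scorer,
    tier2: Dict[str, Scorer],
    threshold: float = 0.0,
) -> Optional[str]:
    """Return a family name, or None when tier 1 rejects the query as non-HSP."""
    # Tier 1: filter out non-HSPs before any family is considered.
    if tier1(sequence) <= threshold:
        return None
    # Tier 2: one binary scorer per family; take the highest-scoring family
    # (assumed decision rule, not stated in the text).
    return max(tier2, key=lambda family: tier2[family](sequence))


# Toy stand-ins for trained models (purely illustrative, not real classifiers).
def toy_tier1(seq: str) -> float:
    return seq.count("K") + seq.count("E") - 2.0


toy_tier2: Dict[str, Scorer] = {
    "HSP20": lambda seq: float(seq.count("E")),
    "HSP70": lambda seq: float(seq.count("K")),
}

accepted = predict_two_tier("MKKKEA", toy_tier1, toy_tier2)
rejected = predict_two_tier("MAAAAA", toy_tier1, toy_tier2)
```

The point of the design is that a sequence rejected at the 1^st tier never reaches the family classifiers, so it cannot be forced into one of the six families.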

Support Vector Machine

The support vector machine (SVM) is a popular classifier [17] that has been used to develop many bioinformatics prediction methods [18–20]. We used the SVM_light package [21] in this work.
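For readers unfamiliar with SVM_light, its training and test files hold one example per line in the sparse form `<target> <index>:<value> ...`, with feature indices starting at 1 and given in increasing order. The helper below is a small sketch of how a composition vector could be written in that form; the function name and the six-decimal formatting are illustrative choices, not part of PredHSP.

```python
from typing import Sequence


def to_svmlight_line(label: int, features: Sequence[float]) -> str:
    """Format one example as '<target> <index>:<value> ...' with 1-based indices."""
    if label not in (1, -1):
        raise ValueError("label must be +1 (positive) or -1 (negative)")
    pairs = " ".join(f"{i}:{v:.6f}" for i, v in enumerate(features, start=1))
    return f"{label:+d} {pairs}"


positive_line = to_svmlight_line(1, [0.5, 0.25])
negative_line = to_svmlight_line(-1, [0.0, 1.0])
```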

SVM Model Generation

To develop the 1^st tier of the predictor, which discriminates HSPs from non-HSPs, we trained an SVM model on 10,000 non-HSPs and 2,225 HSPs, labelled as the negative and positive datasets respectively. For the 2^nd tier, which is a multi-class classification problem, a series of binary classifiers was developed, each able to predict the heat shock proteins of one particular family. These classifiers were SVM models trained on the HSPs only (Table 1). During training, all proteins of the family for which the SVM model was being generated were labelled positive, and the proteins of the remaining families were labelled negative. The same approach has been used in a number of earlier studies, such as prediction of sub-cellular localization [18, 22, 23], β-lactamase and its class [19], G-protein coupled receptors [24] and nuclear receptor protein sub-families [20, 25–27].
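The one-family-versus-the-rest labelling used at the 2^nd tier can be illustrated as follows. This is a sketch with hypothetical names; it only builds the label vectors and does not train any model.

```python
from typing import Dict, List

FAMILIES = ["HSP20", "HSP40", "HSP60", "HSP70", "HSP90", "HSP100"]


def one_vs_rest_labels(families: List[str]) -> Dict[str, List[int]]:
    """For each target family, label its members +1 and all other HSPs -1."""
    return {
        target: [1 if fam == target else -1 for fam in families]
        for target in FAMILIES
    }


# Family annotation of four hypothetical training sequences.
example = ["HSP20", "HSP70", "HSP70", "HSP100"]
labels = one_vs_rest_labels(example)
```

Each of the six label vectors covers the same HSP sequences, so six binary models are trained on one shared dataset that differs only in its labels.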

Cross-Validation and Performance Evaluation

Cross-validation estimates the performance of a prediction model on data that were not used to train it. It involves partitioning the data into multiple sub-sets, performing the analysis on one part (the training set) and validating the analysis on the other part (the testing set). The former process is called training and the latter testing. To reduce the variability in performance that arises from a particular partition, multiple rounds of cross-validation were performed using different partitions, and the final result was obtained by averaging the results over all partitions. In the present work, five-fold cross-validation (FFCV) was used at the 1^st tier and leave-one-out cross-validation (LOOCV), also known as the jackknife approach, at the 2^nd tier.

FFCV divides the whole dataset into five sub-sets, each consisting of one-fifth of the HSPs and one-fifth of the non-HSPs. In each cycle, four sub-sets were combined to make the training set and the remaining sub-set was used for testing. This process was repeated five times so that each sub-set was used once for testing. LOOCV partitions the data into as many training and test set pairs as there are sequences in the dataset. In each pair, the training set contains all sequences except one, and the testing set contains the remaining one. At the 1^st tier, where we had to train on a large dataset of 12,225 sequences, FFCV was used, because LOOCV on such a dataset is time consuming: the number of training-test pairs equals the number of sequences. At the 2^nd tier, where we had relatively little data from each HSP family, LOOCV was used. For each selected parameter setting, an SVM model was generated using the training set and its performance was evaluated on the corresponding test set. On the basis of the actual and predicted state, each prediction was classified into one of four categories: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). Their meaning in the context of the prediction schema is described below.
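The two partitioning schemes can be sketched with index lists. The code is an illustration under stated assumptions: the function names are hypothetical, and examples are dealt to folds in their original order, whereas a real run would normally shuffle them first.

```python
from typing import Iterator, List, Sequence, Tuple

Split = Tuple[List[int], List[int]]  # (training indices, test indices)


def stratified_k_fold(labels: Sequence[int], k: int = 5) -> Iterator[Split]:
    """Yield k splits in which each test fold holds about 1/k of every class."""
    folds: List[List[int]] = [[] for _ in range(k)]
    # Deal the members of each class round-robin across the folds.
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        for pos, idx in enumerate(members):
            folds[pos % k].append(idx)
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(labels)) if i not in held_out]
        yield train, sorted(test)


def leave_one_out(n: int) -> Iterator[Split]:
    """Yield n splits; each holds out exactly one example for testing."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]


# Small example: 10 positives and 20 negatives.
toy_labels = [1] * 10 + [-1] * 20
ffcv_splits = list(stratified_k_fold(toy_labels, k=5))
loocv_splits = list(leave_one_out(len(toy_labels)))
```

The cost argument in the text is visible here: FFCV always trains five models, while LOOCV trains one model per sequence.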

At tier 1, TP is the number of proteins that are actually HSPs and are also predicted as HSPs. TN is the number of proteins that are actually non-HSPs and are also predicted as non-HSPs. FP is the number of non-HSPs predicted as HSPs, while FN is the number of proteins that are actually HSPs but are predicted as non-HSPs (Fig 2). At tier 2, since the classification predicts the family of a known HSP, the meanings of TP, TN, FP and FN change accordingly. For a hypothetical family X, TP is the number of sequences that belong to family X and are correctly predicted as such; TN is the number of non-members of family X that are also predicted as non-members; FP is the number of sequences wrongly predicted to belong to family X; and FN is the number of sequences that actually belong to family X but are predicted as non-members (Fig 2).

Fig 2. Schematic illustration of categorization of prediction into different categories.

https://doi.org/10.1371/journal.pone.0155872.g002

These four counts were used to calculate three additional measures, namely sensitivity, specificity and accuracy. A sensitivity of 100% implies that the classifier identifies all HSPs (tier 1) or all members of a family (tier 2) correctly. A specificity of 100% means that all non-HSPs or non-family members were correctly predicted. Accuracy gives the overall picture and shows how well the classifier identifies true positives and true negatives over the entire prediction; 100% accuracy denotes a perfect prediction.

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \tag{1}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \times 100 \tag{2}$$

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 \tag{3}$$

Another criterion used for evaluating the predictions was the Matthews correlation coefficient (MCC), which takes over- and under-predictions into account [28]. MCC = 1 denotes a perfect prediction, MCC = 0 indicates a completely random assignment, and MCC = -1 means a completely reverse prediction. MCC is defined as follows:

$$\text{MCC} = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4}$$
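The four evaluation measures follow directly from the TP, TN, FP and FN counts. The sketch below uses hypothetical function names, assumes labels of +1 and -1 with both classes present, and returns an MCC of 0 when the denominator is zero, which is a common convention rather than something the text states.

```python
import math
from typing import Dict, Sequence


def confusion_counts(actual: Sequence[int], predicted: Sequence[int]) -> Dict[str, int]:
    """Count TP, TN, FP and FN for labels coded as +1 (positive) and -1 (negative)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if a == 1:
            counts["TP" if p == 1 else "FN"] += 1
        else:
            counts["FP" if p == 1 else "TN"] += 1
    return counts


def evaluate(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Sensitivity, specificity and accuracy in percent, plus MCC."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": 100.0 * tp / (tp + fn),
        "specificity": 100.0 * tn / (tn + fp),
        "accuracy": 100.0 * (tp + tn) / (tp + tn + fp + fn),
        # Assumed convention: MCC is reported as 0 when it is undefined.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


counts = confusion_counts([1, 1, -1, -1], [1, -1, -1, 1])
scores = evaluate(tp=80, tn=90, fp=10, fn=20)
```

On unbalanced data such as the tier 1 set (2,225 positives against 10,000 negatives), accuracy alone can look favourable even when one class is poorly predicted, which is why MCC is reported alongside it.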

Input Feature Encoding

Any SVM based prediction method requires a fixed length input. In order to extract fixed length vector from the protein sequences of different lengths, a number of encoding methods have been used to represent different forms of amino acid compositions viz., discrete amino acid composition (AA) [20, 29], pseudo amino acid composition (PseAA) [19, 30], coupled amino acid composition [20, 31] and split amino acid composition (SAA) [18, 32]. In this work, we used discrete amino acid composition and coupled amino acid composition to encode variable length protein sequence information into fixed length input to train SVM.

Discrete Amino Acid Composition.

Discrete amino acid composition is the simplest and most popular way to represent a protein sequence. It is the fraction of each amino acid present in the sequence, and hence it encapsulates a protein sequence in a vector of 20 dimensions. It is calculated using the expression:

$$\text{comp}(i) = \frac{R_i}{N} \tag{5}$$

where comp(i) is the amino acid composition of residue type i, R_i is the number of residues of type i (i = 1 to 20), and N is the total number of amino acids in the sequence.
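A minimal sketch of this encoding is given below. The function name and the alphabetical ordering of the 20 residues are illustrative choices, and counting only the 20 standard residues towards N (so that the vector sums to 1) is an assumption; the text does not say how non-standard letters are handled.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def aa_composition(sequence: str) -> List[float]:
    """Return the 20-dimensional amino acid composition of a sequence."""
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)  # assumption: non-standard letters are ignored
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


comp = aa_composition("AAGK")
```

Sequences of any length map to the same 20 numbers, which is what makes the vector usable as fixed-length SVM input.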

Coupled Amino Acids Composition.

One of the main drawbacks of discrete amino acid composition is that it uses only the overall amino acid content and ignores the local order of amino acids in the protein. To incorporate local sequence order information along with amino acid composition, coupled amino acid composition was also used as input. It provides a fixed pattern length of 400 (20 × 20 possible residue pairs). It is calculated using the following expression:

$$\text{Coupled AA}(j) = \frac{M_j}{N_{coupled\ AA}} \tag{6}$$

where Coupled AA(j) is the coupled amino acid composition of pair type j, M_j is the number of coupled amino acids of type j (j = 1 to 400), and N_{coupled AA} is the total number of possible coupled amino acids in the sequence.
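The 400-dimensional encoding can be sketched as follows. The names are hypothetical, and two details are assumptions: coupled amino acids are taken to be overlapping adjacent residue pairs (so a sequence of length N yields N - 1 pairs), and pairs containing a non-standard letter are skipped.

```python
from collections import Counter
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered residue pairs, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
VALID = set(DIPEPTIDES)


def coupled_composition(sequence: str) -> List[float]:
    """Return the 400-dimensional coupled (dipeptide) composition of a sequence."""
    seq = sequence.upper()
    # Overlapping adjacent pairs: positions (0,1), (1,2), ...
    pairs = (seq[i:i + 2] for i in range(len(seq) - 1))
    counts = Counter(p for p in pairs if p in VALID)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no valid residue pairs")
    return [counts[d] / total for d in DIPEPTIDES]


vec = coupled_composition("AAGA")  # pairs: AA, AG, GA
```

Unlike the 20-dimensional vector, this encoding is order-sensitive: "AG" and "GA" have identical amino acid compositions but different coupled compositions.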

Results and Discussion

Amino Acid Composition Analysis

To analyse the general trend of amino acids in heat shock proteins and in their families, we performed amino acid composition analysis using Composition Profiler [33]. Statistical significance was assessed at P-value ≤ 0.05. Composition Profiler calculates the fractional difference between the content of a particular amino acid (say aa) in two different samples (X and Y) as follows:

$$\text{Fractional difference} = \frac{C_X(aa) - C_Y(aa)}{C_Y(aa)} \tag{7}$$

where C_X(aa) and C_Y(aa) are the contents of aa in samples X and Y respectively.

The fractional difference determines the relative enrichment/depletion of aa in query sample X, against the aa in background sample Y.
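As a worked illustration of the fractional difference, the sketch below computes it for one amino acid. The names are hypothetical, each sample's content is taken as the pooled fraction of that residue over all its sequences (an assumption about how the samples are summarized), and the significance test performed by Composition Profiler is not reproduced.

```python
from typing import Sequence


def content(sequences: Sequence[str], aa: str) -> float:
    """Pooled fraction of residue `aa` over all sequences in a sample."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        raise ValueError("sample is empty")
    return sum(s.upper().count(aa) for s in sequences) / total


def fractional_difference(query: Sequence[str], background: Sequence[str], aa: str) -> float:
    """(C_X - C_Y) / C_Y: positive means enriched in the query, negative depleted."""
    c_query = content(query, aa)
    c_background = content(background, aa)
    if c_background == 0:
        raise ValueError(f"residue {aa!r} is absent from the background sample")
    return (c_query - c_background) / c_background


# Toy samples: K makes up 50% of the query but 25% of the background.
enrichment_k = fractional_difference(["KKAA"], ["KAAA"], "K")
depletion_a = fractional_difference(["KKAA"], ["KAAA"], "A")
```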

To analyse the behaviour of amino acids in heat shock proteins, we used all HSPs of the training dataset as query while all non-HSPs were used as background sample. The result shows that the HSPs were enriched with charged (both positive and negative) and polar residues but depleted of hydrophobic and aromatic residues (Fig 3A).

Fig 3. Relative enrichment and depletion of amino acids in HSP and their families with reference to non-HSP and other HSP families respectively.

(3a) HSPs vs. Non-HSPs; (3b) HSP20 vs. remaining HSP family; (3c) HSP40 vs. remaining HSP family; (3d) HSP60 vs. remaining HSP family; (3e) HSP70 vs. remaining HSP family; (3f) HSP90 vs. remaining HSP family; (3g) HSP100 vs. remaining HSP family.

https://doi.org/10.1371/journal.pone.0155872.g003

At 2^nd level i.e. at family level, one family of HSPs was used as query group and remaining all families were together used as background. For example, to analyse the amino acid enrichment and depletion pattern of HSP20, sequences belonging to HSP20 were used as query sample and remaining sequences (belonging to the HSP40, HSP60, HSP70, HSP90 and HSP100) were used as background.

In the HSP20 family (Fig 3B), negatively charged residues were enriched, while aromatic and hydrophobic residues were depleted. In the HSP40 family (Fig 3C), aromatic, polar and positively charged residues were enriched, while hydrophobic residues were depleted. In the HSP60 family (Fig 3D), aromatic, charged (both positive and negative) and polar residues were depleted. In HSP70 (Fig 3E), aromatic, positively charged and polar residues were depleted, while negatively charged and hydrophobic residues were enriched. In HSP90 (Fig 3F), aromatic, negatively charged and polar residues were enriched, while positively charged residues showed no significant difference. In the HSP100 family (Fig 3G), hydrophobic residues were enriched, aromatic and polar residues were depleted, and charged residues (both positive and negative) showed no significant difference.

Performance of SVM during Cross Validation

1^st tier of Prediction.

Using FFCV and discrete amino acid composition as SVM input, we were able to achieve 72.98% overall accuracy with MCC 0.34. When coupled amino acid composition was used as input, the overall accuracy increased to 76.66% while MCC rose to 0.43 (Table 3). The result clearly shows that coupled amino acid composition based model performed better than discrete amino acid composition based model.

Table 3. Performance of discrete amino acid and coupled amino acid composition based SVM models during FFCV at 1^st tier.

https://doi.org/10.1371/journal.pone.0155872.t003

2^nd tier of Prediction.

At the 2^nd tier, the prediction identifies the family to which an HSP (predicted as such at the 1^st tier) belongs. As at the 1^st tier, the coupled amino acid composition based SVM model achieved higher accuracy than the discrete amino acid composition based model for each family (Table 4).

Table 4. Performance of discrete amino acid and coupled amino acid composition based SVM models during LOOCV at 2^nd tier.

https://doi.org/10.1371/journal.pone.0155872.t004

Receiver Operating Characteristics Curve Analysis.

The receiver operating characteristics (ROC) curve is a plot of sensitivity against false positive rate [34]. It shows the trade-off between sensitivity and specificity and can be used to assess the performance of a classifier. The area under the ROC curve is called the AUC value [35], which quantifies the performance of the classifier. A higher AUC value indicates better prediction, and an AUC value of 1 indicates perfect prediction. We used the ROCR package [36] to plot ROC curves and to calculate AUC values. The ROC curves and AUC values of the tier 1 and tier 2 SVM models also suggested that coupled amino acid composition was a better choice than discrete amino acid composition (Fig 4, Table 4). Hence, in further work, we used the coupled amino acid composition based SVM models for the prediction of HSPs and their families, and termed the method PredHSP.

Fig 4.

ROC curves of SVM models based on amino acid and coupled amino acid composition for prediction of (4a) HSPs and (4b) different families of HSPs. The solid line represents the discrete amino acid composition (AA) based SVM model, while the broken line represents the coupled amino acid composition (CAA) based SVM model.

https://doi.org/10.1371/journal.pone.0155872.g004
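To make the ROC and AUC definitions concrete, the sketch below builds the curve from decision values and integrates it with the trapezoidal rule. It is a plain Python stand-in for illustration only; the study itself used the ROCR package, and the function names here are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (false positive rate, sensitivity)


def roc_curve(labels: Sequence[int], scores: Sequence[float]) -> List[Point]:
    """Sweep the decision threshold from high to low and record ROC points."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points: List[Point] = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, y) in enumerate(ranked):
        if y == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score,
        # so that tied scores are treated as one threshold.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points: Sequence[Point]) -> float:
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


toy_labels = [1, 1, -1, 1, -1, -1]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_auc = auc(roc_curve(toy_labels, toy_scores))
```

The AUC equals the fraction of positive-negative pairs in which the positive example receives the higher score, which is why a value of 1 corresponds to perfect separation.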

Comparative Performance vis-à-vis Existing Methods

It is important to compare the performance of a newly developed prediction method with that of existing ones. The method developed by Ahmad et al. [12] does not report family-wise performance of HSP class prediction, so we compared PredHSP only with the method developed by Feng et al., named iHSP-PseRAAAC [1]. It was developed using the 2,225 HSPs, with reduced amino acid composition as input, to classify a query protein into one of the six HSP families. In their paper, Feng et al. [1] described the performance of five different types of reduced amino acid composition, namely CP(13), CP(11), CP(9), CP(8) and CP(5). Among the five, CP(11) was reported to have the best performance; hence we compared PredHSP with the model developed using CP(11). The comparison was possible only for the 2^nd tier SVM models, because iHSP-PseRAAAC was not intended to differentiate between HSP and non-HSP sequences and reported classification performance for the six families only.

Table 5 shows the jackknife success rate of identification in iHSP-PseRAAAC and PredHSP. The comparison clearly shows that the performance of PredHSP is better than iHSP-PseRAAAC both in terms of sensitivity and specificity. The higher success rate of PredHSP also shows that coupled amino acid composition encapsulates protein sequence attributes better than the simple/discrete as well as reduced amino acid composition.

Table 5. Comparison of performance of PredHSP with iHSP-PseRAAAC at 2^nd tier.

https://doi.org/10.1371/journal.pone.0155872.t005

PredHSP has two advantages over iHSP-PseRAAAC: (i) unlike iHSP-PseRAAAC, PredHSP does not require the query to be a known HSP, because its 1^st tier discriminates between HSPs and non-HSPs (76.66% accuracy with MCC 0.43 during FFCV), and (ii) PredHSP has shown better performance than iHSP-PseRAAAC at family classification. It is anticipated that PredHSP will become a useful high-throughput tool for speeding up the identification and classification of heat shock proteins.

Performance of PredHSP on Independent Datasets

We also benchmarked the performance of PredHSP on two independent datasets, from human (HGNC dataset) and rice (mixed dataset). Among the 11 human HSP20 proteins, PredHSP predicted only 2 as non-HSPs and classified 1 into a wrong family (HSP40). Of the 49 human HSP40 proteins, PredHSP predicted only 4 as non-HSPs and assigned none to a wrong family. Among the 14 HSP60 proteins, PredHSP predicted 4 as non-HSPs and 1 into a wrong family (HSP70). For the other two families, HSP70 and HSP90, there was no wrong prediction.

The proteins of the different families of rice HSPs were obtained from Wang et al. [15] and Sarkar et al. [16]. Of the 14 HSP20 proteins, PredHSP predicted only 2 as non-HSPs, while for HSP60, HSP70, HSP90 and HSP100 it did not give any false prediction (Table 6). Of the 24 HSP70 proteins obtained from Sarkar et al. [16], PredHSP correctly predicted 23 as HSP70 and misclassified only one, as HSP20.

Table 6. Performance of PredHSP on human HSPs obtained from HGNC [14] and rice HSPs obtained from Wang et al. [15] and Sarkar et al. [16].

TP represents true prediction and FP represents false prediction.

https://doi.org/10.1371/journal.pone.0155872.t006

Genome Wide Identification of HSPs

Since HSPs are present in all three domains of life, we selected nine proteomes from archaea, bacteria and eukaryotes for annotation. We found 43 HSPs in M. thermautotrophicus, 51 in E. coli, 123 in M. tuberculosis, 145 in S. cerevisiae, 814 in A. thaliana, 2192 in O. sativa, 556 in C. elegans, 331 in D. melanogaster and 979 in H. sapiens (Table 7). The results show that both plant species included in our study, Arabidopsis and Oryza, contain a higher percentage of HSPs than the other organisms. This might be because plants, being immobile, must tolerate additional abiotic stresses such as heat, drought, salinity, chemical toxicity, extreme temperature and oxidative stress, as well as biotic stresses such as pathogen infection and insect attack, and the effects of human activities [37, 38].

Table 7. Genome wide annotation of heat shock proteins in different organisms.

https://doi.org/10.1371/journal.pone.0155872.t007
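The percentage of predicted HSPs per proteome can be recomputed from figures quoted in the text: the predicted counts given in this section and the proteome sizes listed under Genome Wide Prediction of HSPs. The snippet is a simple arithmetic check with hypothetical names; Table 7 remains the authoritative source for the reported values.

```python
from typing import Dict, List, Tuple

# organism: (predicted HSPs, total proteins in the proteome), as quoted in the text
PROTEOMES: Dict[str, Tuple[int, int]] = {
    "M. thermautotrophicus": (43, 1868),
    "E. coli": (51, 4305),
    "M. tuberculosis": (123, 3993),
    "S. cerevisiae": (145, 6721),
    "A. thaliana": (814, 31480),
    "O. sativa": (2192, 37386),
    "C. elegans": (556, 26612),
    "D. melanogaster": (331, 22006),
    "H. sapiens": (979, 70076),
}


def hsp_percentage(n_hsp: int, n_total: int) -> float:
    """Predicted HSPs as a percentage of all proteins in the proteome."""
    return 100.0 * n_hsp / n_total


# Organisms ranked from highest to lowest percentage of predicted HSPs.
ranked: List[Tuple[str, float]] = sorted(
    ((name, hsp_percentage(h, t)) for name, (h, t) in PROTEOMES.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, pct in ranked:
    print(f"{name:24s} {pct:5.2f}%")
```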

Webserver

We have also established a webserver so that the scientific community can use PredHSP. It is freely available at http://14.139.227.92/mkumar/predhsp/index.html. A standalone version of PredHSP, which can be used to handle large datasets, is also available at the same link.

Conclusions

HSPs are one of the largest groups of chaperones and play a key role in protein folding and unfolding. In this work, we reported an SVM based two-tier prediction method, PredHSP, to identify HSPs and their families, namely HSP20, HSP40, HSP60, HSP70, HSP90, and HSP100. Discrete amino acid composition and coupled amino acid composition were used as SVM input; the latter performed better at both levels. This may be because discrete amino acid composition does not carry sequence order information. The performance results show that PredHSP is more efficient than the existing HSP classifier, iHSP-PseRAAAC. It is anticipated that PredHSP will be useful for high-throughput prediction of HSPs and will aid basic research as well as drug development.

Acknowledgments

We gratefully acknowledge Dr. Neelja Singhal (Department of Microbiology, University of Delhi South Campus, New Delhi, India) for critically reading the manuscript.

Author Contributions

Conceived and designed the experiments: MK. Performed the experiments: MK RK. Analyzed the data: MK RK BK. Wrote the paper: MK RK BK.

References

1. Feng PM, Chen W, Lin H, Chou KC. iHSP-PseRAAAC: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal Biochem. 2013;442(1):118–25. pmid:23756733
2. Ratheesh RK, Nagarajan SN, Arunraj S, P., Sinha D, Veedin Rajan VB, Esthaki VK, et al. HSPIR: a manually annotated heat shock protein information resource. Bioinformatics. 2012;28(21):2853–5. pmid:22923302
3. Morimoto RI. Regulation of the heat shock transcriptional response: cross talk between a family of heat shock factors, molecular chaperones, and negative regulators. Genes Dev. 1998;12(24):3788–96. pmid:9869631
4. Blaszczak A, Georgopoulos C, Liberek K. On the mechanism of FtsH-dependent degradation of the sigma 32 transcriptional regulator of Escherichia coli and the role of the Dnak chaperone machine. Mol Microbiol. 1999;31(1):157–66. pmid:9987118
5. Gabai VL, Meriin AB, Yaglom JA, Volloch VZ, Sherman MY. Role of Hsp70 in regulation of stress-kinase JNK: implications in apoptosis and aging. FEBS Lett. 1998;438(1–2):1–4. pmid:9821948
6. Louvion JF, Abbas-Terki T, Picard D. Hsp90 is required for pheromone signaling in yeast. Mol Biol Cell. 1998;9(11):3071–83. pmid:9802897
7. Ruggero D, Ciammaruconi A, Londei P. The chaperonin of the archaeon Sulfolobus solfataricus is an RNA-binding protein that participates in ribosomal RNA processing. The EMBO journal. 1998;17(12):3471–7. pmid:9628882
8. Wu YR, Wang CK, Chen CM, Hsu Y, Lin SJ, Lin YY, et al. Analysis of heat-shock protein 70 gene polymorphisms and the risk of Parkinson's disease. Hum Genet. 2004;114(3):236–41. pmid:14605873
9. Hamos JE, Oblas B, Pulaski-Salo D, Welch WJ, Bole DG, Drachman DA. Expression of heat shock proteins in Alzheimer's disease. Neurology. 1991;41(3):345–50. pmid:2005999
10. Pockley AG. Heat shock proteins, inflammation, and cardiovascular disease. Circulation. 2002;105(8):1012–7. pmid:11864934
11. Goldstein MG, Li Z. Heat-shock proteins in infection-mediated inflammation-induced tumorigenesis. J Hematol Oncol. 2009;2:5. pmid:19183457
12. Ahmad S, Kabir M, Hayat M. Identification of Heat Shock Protein families and J-protein types by incorporating Dipeptide Composition into Chou's general PseAAC. Comput Methods Programs Biomed. 2015;122(2):165–74. pmid:26233307
13. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9. pmid:16731699
14. Kampinga HH, Hageman J, Vos MJ, Kubota H, Tanguay RM, Bruford EA, et al. Guidelines for the nomenclature of the human heat shock proteins. Cell Stress Chaperones. 2009;14(1):105–11. pmid:18663603
15. Wang Y, Lin S, Song Q, Li K, Tao H, Huang J, et al. Genome-wide identification of heat shock proteins (Hsps) and Hsp interactors in rice: Hsp70s as a case study. BMC Genomics. 2014;15:344. pmid:24884676
16. Sarkar NK, Kundnani P, Grover A. Functional analysis of Hsp70 superfamily proteins of rice (Oryza sativa). Cell Stress Chaperones. 2013;18(4):427–37. pmid:23264228
17. Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995;20(3):273–97.
18. Kumar R, Jain S, Kumari B, Kumar M. Protein Sub-Nuclear Localization Prediction Using SVM and Pfam Domain Information. PloS one. 2014;9(6):e98345. pmid:24897370
19. Kumar R, Srivastava A, Kumari B, Kumar M. Prediction of β-lactamase and its Class by Chou’s Pseudo-amino Acid Composition and Support Vector Machine. J Theor Biol. 2015;365:96–103. pmid:25454009
20. Kumar R, Kumari B, Srivastava A, Kumar M. NRfamPred: A proteome-scale two level method for prediction of nuclear receptor proteins and their sub-families. Scientific reports. 2014;4:6810. pmid:25351274
21. Advances in Kernel Methods—Support Vector Learning. MIT Press; 1999
22. Hua S, Sun Z. Support vector machine approach for protein subcellular localization prediction. Bioinformatics. 2001;17(8):721–8. pmid:11524373
23. Bhasin M, Garg A, Raghava GP. PSLpred: prediction of subcellular localization of bacterial proteins. Bioinformatics. 2005;21(10):2522–4. pmid:15699023
24. Bhasin M, Raghava GPS. GPCRsclass: a web tool for the classification of amine type of G-protein-coupled receptors. Nucleic Acids Res. 2005;33 (Web Server issue):W143–7. pmid:15980444
25. Xiao X, Wang P, Chou KC. iNR-PhysChem: a sequence-based predictor for identifying nuclear receptors and their subfamilies via physical-chemical property matrix. PloS one. 2012;7(2):e30869. pmid:22363503
26. Bhasin M, Raghava GP. Classification of nuclear receptors based on amino acid composition and dipeptide composition. The Journal of biological chemistry. 2004;279(22):23262–6. pmid:15039428
27. Wang P, Xiao X, Chou KC. NR-2L: a two-level predictor for identifying nuclear receptor subfamilies based on sequence-derived features. PloS one. 2011;6(8):e23505. pmid:21858146
28. Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et biophysica acta. 1975;405(2):442–51. pmid:1180967
29. Garg A, Bhasin M, Raghava GP. Support vector machine-based method for subcellular localization of human proteins using amino acid compositions, their order, and similarity search. The Journal of biological chemistry. 2005;280(15):14427–32. pmid:15647269
30. Xu Y, Wen X, Wen LS, Wu LY, Deng NY, Chou KC. iNitro-Tyr: Prediction of Nitrotyrosine Sites in Proteins with General Pseudo Amino Acid Composition. PloS one. 2014;9(8):e105018. pmid:25121969
31. Chou KC. Using pair-coupled amino acid composition to predict protein secondary structure content. J Protein Chem. 1999;18(4):473–80. pmid:10449044
32. Kumar M, Verma R, Raghava GP. Prediction of mitochondrial proteins using support vector machine and hidden Markov model. The Journal of biological chemistry. 2006;281(9):5357–63. pmid:16339140
33. Vacic V, Uversky VN, Dunker AK, Lonardi S. Composition Profiler: a tool for discovery and visualization of amino acid composition differences. BMC Bioinformatics. 2007;8:211. pmid:17578581
34. Fawcett T. An introduction to ROC analysis. Pattern Recog Lett. 2006;27:861–74.
35. Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition. 1997;30:1145–59.
36. Sing T, Sander O, Beerenwinkel N, Lengauer T. ROCR: visualizing classifier performance in R. Bioinformatics. 2005;21(20):3940–1. pmid:16096348
37. Park CJ, Seo YS. Heat Shock Proteins: A Review of the Molecular Chaperones for Plant Immunity. Plant Pathol J. 2015;31(4):323–33. pmid:26676169
38. Al-Whaibi MH. Plant heat-shock proteins: A mini review. Journal of King Saud University—Science. 2011;23(2):139–50.
