iNR-PhysChem: A Sequence-Based Predictor for Identifying Nuclear Receptors and Their Subfamilies via Physical-Chemical Property Matrix

Nuclear receptors (NRs) form a family of ligand-activated transcription factors that regulate a wide variety of biological processes, such as homeostasis, reproduction, development, and metabolism. Human genome contains 48 genes encoding NRs. These receptors have become one of the most important targets for therapeutic drug development. According to their different action mechanisms or functions, NRs have been classified into seven subfamilies. With the avalanche of protein sequences generated in the postgenomic age, we are facing the following challenging problems. Given an uncharacterized protein sequence, how can we identify whether it is a nuclear receptor? If it is, what subfamily it belongs to? To address these problems, we developed a predictor called iNR-PhysChem in which the protein samples were expressed by a novel mode of pseudo amino acid composition (PseAAC) whose components were derived from a physical-chemical matrix via a series of auto-covariance and cross-covariance transformations. It was observed that the overall success rate achieved by iNR-PhysChem was over 98% in identifying NRs or non-NRs, and over 92% in identifying NRs among the following seven subfamilies: NR1thyroid hormone like, NR2HNF4-like, NR3estrogen like, NR4nerve growth factor IB-like, NR5fushi tarazu-F1 like, NR6germ cell nuclear factor like, and NR0knirps like. These rates were derived by the jackknife tests on a stringent benchmark dataset in which none of protein sequences included has pairwise sequence identity to any other in a same subset. As a user-friendly web-server, iNR-PhysChem is freely accessible to the public at either http://www.jci-bioinfo.cn/iNR-PhysChem or http://icpr.jci.edu.cn/bioinfo/iNR-PhysChem. Also a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics involved in developing the predictor. It is anticipated that iNR-PhysChem may become a useful high throughput tool for both basic research and drug design.


Introduction
Found within cells, nuclear receptors (NRs) are a class of proteins responsible for sensing steroid and thyroid hormones and certain other molecules. In response, these receptors work with other proteins to regulate the expression of specific genes, thereby controlling the development, homeostasis, and metabolism of the organism. A unique property of NRs that distinguishes themselves from other classes of receptors is their ability to directly interact with and control the expression of genomic DNA, and hence they are also classified as transcription factors [1,2]. Since NRs bind small molecules that can be easily modified by chemical manipulation, and also since NRs control the functions closely associated with major diseases (such as cancer, osteoporosis, and diabetes), they have become promising pharmacological targets [3,4,5].
Grouped into a superfamily that includes receptors for steroid hormones, vitamin D, ecdysone, retinoic acid and thyroid hormone [6,7], NRs are modular proteins composed of six distinct regions (A-F) [8,9] that correspond to functional and structural domains. Not all the NRs contain all the six domains. Regions C and E display the highest degree of conservation. C is involved in DNA binding and E involved in ligand binding and dimerization. Owing to its high conservation, the C domain is the signature motif of the superfamily. It is composed of two zinc fingers; the presence of such feature facilitates the identification of NRs [5]. Based on the alignments of the conserved domains [4,10], the superfamily has been subdivided into seven subfamilies [11,12].
The importance of NRs has prompted a rapid accumulation of the relevant data from a great diversity of fields of research: sequences, expression patterns, 3-D (three-dimensional) structures, protein-protein interactions, target genes, physiological roles, mutations, etc. These accumulated data are very helpful for data mining and knowledge discovery. Since the function of a NR is closely correlated with which subfamily it belongs to, facing the avalanche of protein sequences generated in the post-genomic age, it is highly desired to develop automated methods for rapidly and effectively identifying NRs and their subfamilies according to their sequences information alone, because the knowledge thus acquired may benefit both basic research and drug development. Actually, some efforts have already been made in this regard.
In 2004, Bhasin and Raghava [13] have proposed a method for predicting the subfamilies of NRs using SVM as the prediction engine and the amino acid composition and dipeptide composition as the input. In 2009, Gao et al. [14] reconstructed the benchmark dataset for NRs and introduced the pseudo amino acid composition (PseAAC) [15] to represent the protein samples in hope to improve the prediction quality. As pioneering efforts in this area, the works by Bhasin and Raghava [13] and Gao et al. [14] did play a stimulating role in this area. However, the above two predictors have the following problems needed to be further addressed: (1) The benchmark datasets used to train the two predictors only covered four subfamilies, too narrow for the coverage scope. (2) There are many high homologous sequences included in their benchmark datasets because the cutoff threshold set by these authors to remove homologous sequences was 90%; a much more stringent threshold should be adopted to avoid homology bias. (3) The predictions by the two predictors were made under the assumption that the input query sequences are already known belonging to NRs; in other words, they cannot be used to identify whether a query protein as a NR or non-NR. (4) No web-server was provided [13] or the web-server provided by [14] is currently not working, and hence their methods cannot be easily used by the majority of experimental scientists to acquire the desired data for basic research and drug development.
To address the aforementioned four problems, recently a different predictor was proposed by extending the coverage scope from the four subfamilies of NRs as covered in [13,14] to seven subfamilies [16]. The name of that predictor is called NR-2L, where 2L means that it is a two-level predictor. The 1 st level is for identifying query proteins as NRs or non-NRs, while the 2 nd level for identifying the NRs among their seven subfamilies.
In view of the importance of NRs to both basic research and drug development, the present study was initiated in an attempt to further improve the prediction quality of NR-2L by developing a new and more powerful predictor for identifying NRs and their subfamilies.
According to a recent review [17], to establish a really useful statistical predictor for a protein system, we need to consider the following procedures: (1) construct or select a high quality benchmark dataset to train and test the predictor; (2) formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform crossvalidation tests to objectively evaluate the anticipated accuracy of the predictor; (5) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to follow the above procedures to develop a new predictor that can further enhance the prediction quality in identifying NRs and their subfamilies.

Benchmark Datasets
In this study, we selected the datasets from [16] as the benchmark dataset. The reason for doing so is because that the datasets constructed in [16] for establishing the predictor NR-2L are relatively more rigorous, and that it is also more convenient to compare our new predictor with NR-2L by using a same benchmark dataset. The benchmark dataset in [16] can be formulated as where S nNR contains 500 non-NR protein sequences; while S NR contains 159 protein sequences classified into the following seven subfamilies: (1) NR1: thyroid hormone like (thyroid hormone, retinoic acid, RAR-related orphan receptor, peroxisome proliferator activated, vitamin D3-like), (2) NR2: HNF4-like (hepatocyte nuclear factor 4, retinoic acid X, tailless-like, COUP-TF-like, USP), (3) NR3: estrogen like (estrogen, estrogen-related, glucocorticoid-like), (4) NR4: nerve growth factor IB-like (NGFI-B-like), (5) NR5: fushi tarazu-F1 like (fushi tarazu-F1 like), (6) NR6: germ cell nuclear factor like (germ cell nuclear factor), and (7) NR0: knirps like (knirps, knirps-related, embryonic gonad protein, ODR7, trithorax) and DAX like (DAX, SHP). For the dataset S nNR , none of the proteins therein has §60% pairwise sequence identity to any other; for each of the seven subsets in S NR , none of the proteins included has §60% pairwise sequence identity to any other in a same subset. Listed in Table 1 is a breakout of the proteins in the benchmark dataset used in the current study. The codes and sequences for the proteins in the benchmark dataset S can be obtained from the Supporting Information S1 of [16] or directly downloaded from the website http://icpr.jci.edu.cn/ bioinfo/NR2L/Supp.html.

Protein Sequence Formulation
One of the keys in developing a method for identifying protein attributes is to formulate the protein samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted [18]. However, it is by no means an easy job to realize this because this kind of correlation is usually deeply ''buried'' or ''hidden'' in piles of complicates sequences.
The most straightforward method to formulate the sample of a query protein P was just using its entire amino acid sequence, as can be generally described by where R 1 represents the 1 st residue of the protein P, R 2 the 2 nd residue, …, R L the L-th residue, and they each belong to one of the 20 native amino acids. In order to identify its attribute, the sequence-similarity-search-based tools, such as BLAST [19,20], was utilized to search protein database for those proteins that have high sequence similarity to the query protein P. Subsequently, the attribute annotations of the proteins thus found were used to deduce the attribute for the query protein P. Unfortunately, this kind of straightforward sequential model, although quite intuitive and able to contain the entire information of a protein sequence, failed to work when the query protein P did not have significant sequence similarity to any attribute-known proteins. Thus, various non-sequential or discrete models to formulate protein samples were proposed in hopes to establish some sort of correlation or cluster manner by which to enhance the prediction power.
Among the discrete models for a protein sample, the simplest one is its amino acid (AA) composition or AAC [21]. According to the AAC-discrete model, the protein P of Eq. 4 can be formulated by [22] 20) are the normalized occurrence frequencies of the 20 native amino acids in protein P, and T the transposing operator. Many methods for predicting various protein attributes were based on the AAC-discrete model (see, e.g., [21,23,24,25,26]). However, as we can see from Eq. 3, if using the ACC model to represent the protein P, all its sequence-order effects would be lost, and hence the prediction quality might be considerably limited. To avoid completely losing the sequence-order information, instead of the simple amino acid composition (AAC), the pseudo amino acid composition (PseAAC) was proposed [15] to represent the protein samples.
According to a recent review article [17], the general form of PseAAC for a protein P can be formulated as where the subscript V is an integer and its value as well as the components y 1 , y 2 , … will depend on how to extract the desired information from the amino acid sequence of P.
Below, we are to use the ''Physical-Chemical Property Matrix'' and ''Auto-and Cross-Covariance Transformation' to define the V elements in Eq. 4.
Thus, according to the ten PC properties, the protein P of Eq. 2 can be formulated with a 10|L physical-chemical property matrix as given by where PC i (R j ) is the value of PC i (i~1, 2, Á Á Á , 10) for residue R j (j~1, 2, Á Á Á , L).
Of the ten PC properties, the values for the first six can be directly obtained from the website http://www.csbio.sjtu.edu.cn/ bioinf/PseAAC/PseAAReadme.htm, a part of the web-server PseAAC established for computing pseudo amino acid compositions of proteins according to their sequences [63]. The remainder can be obtained from AAindex (http://www.genome.jp/ aaindex/), which is a database of numerical indices for various physicochemical and biochemical properties of amino acids and pairs of amino acids. All data in this database [64,65] are derived from published literatures. Listed in Table 2 are the values for the ten PC properties of the 20 amino acids, respectively. However, before submitting them into Eq. 5, all the data in Table 2 were subject to a standard conversion through the following equation [66]: where x i (i~1,2, Á Á Á , 20) stands for the original score of the ith amino acid, mean(x) for the mean score of the 20 amino acids, and std(x) for the corresponding standard deviation. The converted values obtained via Eq. 6 will have a zero mean value over the 20 amino acids, and will remain unchanged if they go through the same conversion procedure again [66]. Thus, given a protein with L amino acids, it can be expressed as a 10|L numerical matrix via the ten physical-chemical properties as given in Eq. 5. Such a matrix is called the physical-chemical property matrix or PC matrix, for protein P. It is assumed that those NRs that belong to a same type should have a similar PC matrix, or vice versa.
2.2. Auto-covariance and cross-covariance. In statistics, the auto-covariance is the covariance of a stochastic process against a parameter-shifted version of itself (Fig. 1a), while the cross-covariance is used to refer to the covariance between two random vectors (Fig. 1b). Here, let us use the two concepts of covariance to transform the matrix of Eq. 5 to a length-fixed feature vector, as described below.
According to the concept of auto-covariance (AC), the correlation of the same PC property between two subsequences separated by l amino acids can be formulated as where l~(1, 2, Á Á Á , vL) [15] and PC i represents the mean value of the ith horizontal line in Eq. 5, as given by As we can see from Eq. 7, using auto-covariance on the physicalchemical property matrix of Eq. 5, we can generate 10|l autocovariance components. On the other hand, according to the concept of cross-covariance (CC), the correlation between two subsequences with each belonging to a different PC property can be formulated by Hence, using cross-covariance on the physical-chemical property matrix of Eq. 5, we can generate 10|9|l cross-covariance components. Accordingly, a total of (10|lz10|9|l)~100|l components can thus be generated from Eq. 5. However, it was indicated by preliminary computations and analyses that when l~10, better results would be obtained. Thus, in this study the PseAAC for protein P is expressed as where y u is the uth components generated by operating the above auto-covariance and cross-covariance on the physical-chemical property matrix of Eq. 5. 2.3. Support vector machines. Support vector machines (SVMs) are a set of related supervised learning methods that are usually used to analyze data and recognize patterns. The original SVM algorithm was proposed by Vapnik [67] and the current standard incarnation (soft margin) was proposed by Cortes and Vapnik [68]. When used in the current study, its mathematical principles can be briefly described as follows.
where P k can be regarded as the k-th protein sample or vector as formulated by Eq. 10, and R t is an Euclidean space with t dimensions. For the current case, R t is actually a PseAAC space with t~1000 (cf. Eq. 10). The SVM algorithm performs a mapping of the vectors of proteins in the training dataset from the space R t into a higher dimensional space R H by a kernel function and finds an optimal separating hyperplane, which maximizes the margin between the hyperplane and the nearest data points of each class in the space R H . Different kernel functions define different SVMs. In principle, SVM is a two-class classifier. With the recent improvements, the SVM can now be directly used to cope with multi-class classification problem via the one-against-all or pairwise approach. For the detailed mathematical formulations, see Eqs. 3-18 in [69], where instead of the 1000-D PseAAC space, a protein sample was defined in the 2005-D FunD (functional domain) composition space. SVM has been widely used to classify various attributes of proteins according to their sequences information (see, e.g., [33,43,69,70,71,72,73,74]). In this study, the LIBSVM package [75] was used as an implementation of SVM, which can be downloaded from http://www.csie.ntu.edu.tw/,cjlin/libsvm/, the popular radial basis function (RBF) was taken as the kernel function. For the current SVM classifier, there were two unknown parameter: penalty parameter C and kernel parameter c. The method of how to determine the two parameters will be discussed later.
The predictor established via the aforementioned procedures is called iNR-PhysChem, where the character ''i'' stands for ''identifying'', ''NR'' for ''nuclear receptors and their subfamilies'', and ''PhysChem'' for ''using physical-chemical property features''. To provide an intuitive overall picture, a flowchart is provided in Fig. 2 to illustrate the process of how iNR-PhysChem works in identifying nuclear receptors and their subfamilies.
2.4. Performance metrics. The performance of the predictor is evaluated by the overall accuracy, which is the most commonly used metric for assessing the global performance of a multi-class problem. The overall accuracy (ACC) is defined as the ratio of correctly predicted samples to all tested samples: where CN is the number of proteins whose attribute have been correctly identified and N the total number of proteins in the benchmark dataset. Also, to examine the stability of the pretictor, the Matthew's correlation coefficient (MCC) is computed according to the following formulation: where TP represents the true positive; TN, the true negative; FP, the false positive; and FN, the false negative (Fig. 3). 2.5. Web-server guide. The mathematic equations presented above are just for the integrity in developing the iNR-PhysChem predictor. For those who are interested in only using iNR-PhysChem, a web-server has been established. Below, let us give a step-by-step guide on how to use it to get the desired results without the need to follow the complicated mathematic equations at all.
Step 1. Open the web server at either http://www.jci-bioinfo. cn/iNR-PhysChem or http://icpr.jci.edu.cn/bioinfo/iNR-PhysChem, and you will see the top page of the predictor on your computer screen, as shown in Fig. 4. Click on the Read Figure 2. A flowchart to show the prediction process of iNR-PhysChem. T1 represents the benchmark dataset from [16] for training the 1 st -level prediction; T2 represents the benchmark dataset from [16] for training the 2 nd -level prediction. See the text for further explanation. doi:10.1371/journal.pone.0030869.g002 Me button to see a brief introduction about the iNR-PhysChem predictor, and its anticipated accuracy.
Step 2. Either type or copy/paste the query protein sequence into the input box at the center of Fig. 4. The input sequence should be in the FASTA format. A sequence in FASTA format consists of a single initial line beginning with a greater-than symbol (''.'') in the first column, followed by lines of sequence data. The words right after the ''.'' symbol in the single initial line are optional and only used for the purpose of identification and description. The sequence ends if another line starting with a ''.'' appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box. The maximum number of query proteins allowed for each submission is 500.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the two query protein sequences in the Example window as the input, after clicking the Submit button, you will see from your computer screen that the 1 st query protein (THB2_RAT) is a ''NR'' belonging to the subfamily of ''NR1 (Thyroid hormone like)'', and that the 2 nd query protein (E1FMC1_LOALO) is a ''non-NR''. All these results are fully consistent with the experimental observations. It only took a few seconds to get the above results. If the input contains 500 query protein sequences, the job will be finished in less than 2 minutes.
Step 4. Click on the Data button to download the benchmark datasets used to train and test the iNR-PhysChem predictor.
Step 5. Click on the Citation button to find the relevant paper that documents the development of the iNR-PhysChem predictor.

Results and Discussion
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, K-fold (such as 5fold, 7-fold, or 10-fold) subsampling test, and jackknife test [76]. However, as elucidated by [77] and demonstrated by Eqs. 28-32 of [17], among the three cross-validation methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (see, e.g., [37,39,50,53,78,79]). Therefore, in this study the jackknife crossvalidation was adopted to calculate the success prediction rates as well.
However, for a system involved with two uncertain parameters (C and c), it will need a lot of computational times to find their optimal values. Therefore, as a fist step, let us determine the values of C and c for the current SVM operation engine just by optimizing the overall 5-fold cross-validation success rate thru a 2-D grid search (Fig. 5). The values thus obtained for the two parameters are given by C~2 7 , c~2 {3 for the 1st-level prediction C~2 7 , c~2 {5 for the 2nd-level prediction ( ð14Þ where the 1 st -level prediction is for identifying a query protein as NR or non-NR; while the 2 nd -level prediction is for identifying a NR among its seven subfamilies (cf. Table 1).
Subsequently, using the parameters values of Eq. 14 for the SVM operation engine, the jackknife tests were performed on the benchmark dataset S (cf. Eq. 1).
The results thus obtained in identifying proteins as NRs or non-NRs are given in Table 3; while those in identifying NRs among their seven subfamilies are given in Table 4. For facilitating comparison, the corresponding results obtained by the predictor NR-2L [16] are also listed in the two tables.
As we can see from Table 3, the overall jackknife success rate in identifying NRs and non-NRs by the current iNR-PhysChem is 98.16%, which is obviously higher than the corresponding rate by the NR-2L predictor [16]. Meanwhile, the overall MCC by iNR-PhysChem is 0.96, which is also more close to 1 than that by the NR-2L predictor [16]. Also, it can be seen from Table 4, the overall jackknife success rate in identifying NRs among their seven subfamilies and the overall MCC by the current iNR-PhysChem are 92.45% and 0.91, respectively, which are also higher than the corresponding rates by the NR-2L predictor [16]. All these results indicate that the current iNR-PhysChem is superior to NR-2L [16] not only in achieving higher success rates, but also in getting more stable predicted results.
The higher success rates with more stability indicate that it is a promising strategy to use the physical-chemical matrix to investigate the attributes of proteins, and that it can catch the essential features of NRs by representing their sequence samples with PseAAC consisting of the components derived from their physical-chemical matrix via the auto-covariance and crosscovariance transformation.
It is anticipated that iNR-PhysChem may become a useful high throughput tool for both basic research and drug development.
Finally, people might be interested to know how to rank the impacts of the ten amino acid properties (cf. Eq. 5) for their roles in identifying the NRs and their subfamilies. To address this problem, a leave-one-out test was performed for each of the ten amino acid properties. The property would be deemed having the most significant impact if the overall success rate dropped down the most after excluding it from the ten properties. It was observed that for the 1 st -level prediction (i.e., in identifying query proteins as NRs or non-NRs), their impacts were ranked as where the symbol 4 means ''greater than in impact'', and the symbol ¼ D means ''equal to in impact''. For the 2 nd -level prediction (i.e., in identifying the NRs among their seven subfamilies), the impacts of the ten amino acid properties were ranked as In other words, pK1 had the highest impact in identifying query proteins as NRs or non-NRs, followed by pK2 and PI, and so forth (cf. Section 2.1 of Materials and Methods); while pK2 had the highest impact in identifying NRs among their seven subfamilies, followed by pK1 and PI, and so forth.