iNuc-PhysChem: A Sequence-Based Predictor for Identifying Nucleosomes via Physicochemical Properties

Nucleosome positioning has important roles in key cellular processes. Although intensive efforts have been made in this area, the rules defining nucleosome positioning is still elusive and debated. In this study, we carried out a systematic comparison among the profiles of twelve DNA physicochemical features between the nucleosomal and linker sequences in the Saccharomyces cerevisiae genome. We found that nucleosomal sequences have some position-specific physicochemical features, which can be used for in-depth studying nucleosomes. Meanwhile, a new predictor, called iNuc-PhysChem, was developed for identification of nucleosomal sequences by incorporating these physicochemical properties into a 1788-D (dimensional) feature vector, which was further reduced to a 884-D vector via the IFS (incremental feature selection) procedure to optimize the feature set. It was observed by a cross-validation test on a benchmark dataset that the overall success rate achieved by iNuc-PhysChem was over 96% in identifying nucleosomal or linker sequences. As a web-server, iNuc-PhysChem is freely accessible to the public at http://lin.uestc.edu.cn/server/iNuc-PhysChem. For the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results without the need to follow the complicated mathematics that were presented just for the integrity in developing the predictor. Meanwhile, for those who prefer to run predictions in their own computers, the predictor's code can be easily downloaded from the web-server. It is anticipated that iNuc-PhysChem may become a useful high throughput tool for both basic research and drug design.


Introduction
In eukaryotic cells, genomic DNA is highly compacted into several levels of chromatin structures that ultimately make up the chromosomes. At the lowest level of compaction, a ,147 bp DNA sequence is tightly wrapped around the histone-octamer core ( Fig. 1) into the elementary structural unit of chromatin, known as nucleosome [1]. The packaging of DNA around the histoneoctamer modulates the accessibility of genomic regions to regulatory proteins. There are close relationships between nucleosome positioning and key cellular processes, as demonstrated in mRNA splicing, DNA replication, and DNA repair [2,3,4]. Consequently, revealing the mechanism involved in controlling nucleosome positioning is fundamentally important for in-depth understanding the subsequent steps of gene expression.
High-resolution genome-wide nucleosome maps are now available for several model organisms, such as Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster and Homo sapiens [5,6,7,8,9]. These high-resolution data provide unprecedented opportunities for further investigating the roles of nucleosome positioning in gene regulation. However, experimental approach is expensive to perform genome-wide analysis of nucleosome distribution. In this regard, computational methods can be applied to the entire genome without this kind of disadvantage. Since the report of the nucleosome positioning code (,10 bp repeating pattern of dinucleotides AA-TT-TA/GC) in yeast [8], lots of theoretical works have been done attempting to elucidate nucleosome occupancy signals that determine the preference of a particular region in binding to histones and forming a nucleosome [10,11,12]. Although of great interest and value, sequence-based predictions of nucleosome positioning have been limited in their accuracy and resolution, and to which extent nucleosome positioning in vivo is really dictated by the DNA sequence [10] is still an issue of controversy [13].
It was reported by Miele et al. [7] that DNA physical-chemical properties may determine nucleosome occupancy. Moreover, the recent study by Nozaki et al. [14] also suggested the existence of a highly bendable, fragile structure for nucleosomal DNA, implying that nucleosomal sequences indeed have distinct structural properties when compared with linker sequences.
In view of this, the present study was initiated in an attempt to develop a new method for predicting nucleosomal sequences based on the physicochemical properties of DNA.
According to a recent review [15], to establish a really useful statistical predictor for a biological system, we need to consider the following procedures: (1) construct or select a valid benchmark dataset to train and test the predictor; (2) formulate the biological samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (3) introduce or develop a powerful algorithm (or engine) to operate the prediction; (4) properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (5) establish a user-friendly web-server for the predictor that is accessible to the public. Below, let us describe how to deal with these steps.

Benchmark Dataset: Nucleosomal and Linker Sequences
The reference genome sequence of Saccharomyces cerevisiae was obtained from the Saccharomyces Genome Database (http:// www.yeastgenome.org/). The nucleosome positions of Saccharomyces cerevisiae were derived from the published data obtained by Lee et al. [16], where each of the 1,206,683 DNA fragments in the dataset constructed by these authors was assigned a nucleosome formation score using a lasso model, with the high or low score to reflect its high or low propensity in forming nucleosome, respectively. The low score can also be interpreted as the propensity to inhibit the formation of nucleosome. To prepare a high quality benchmark dataset, 5,000 fragments of 150 bp with the highest scores were selected as the nucleosome-forming sequence samples to construct the positive set S z , and 5,000 fragments of 150 bp with the lowest scores were selected as the nucleosome-inhibiting (or linker) sequence samples to construct the negative set S { ; i.e., the benchmark dataset S in this study can be formulated as where | represents the symbol for ''union'' in the set theory, and S z contains 5,000 nucleosome-forming samples For the convenience of readers, the 5,000 sequences in S z and 5,000 sequences in S { are given in the Information S1.
In order to quantitatively analyze the physical and chemical properties of the DNA sequence samples, we firstly converted the retrieved nucleosomal and linker sequences into numerical profiles according to the following schemes as validated by Florquin et al. [18]. The detailed procedures are as following steps.
Step 1. For any 2 base pair (bp) piece of DNA, there is a corresponding numerical value associated with any one of the aforementioned 12 physicochemical properties. Since the values of the 12 properties were at different levels, to make them easier to be handled, we normalized them into the range ½{1,z1 by means of the following equation where x 0 ij is the original value of the i-th DNA physicochemical property (i = 1, 2, …, 12) for the j-th (j = AA, AC, AG, AT, CA, CC, CG, CT, GA, GC, GG, GT, TA, TC, TG, TT) dinucleotide (see Information S2); while x ij the corresponding normalized value (see Table 1).
Step 2. By means of a sliding window [33,34] approach with a window size of 2 bp and a step size of 1 bp, a DNA sequence was replaced by the corresponding normalized physicochemical values. Thus, each of the sequences in S was translated into a numerical vector consisting of (150{1)~149 components, i.e., a 149-D (dimensional) numerical vector.
Step 3. After going through the above step with all the 12 physicochemical properties, each of the sequences in S was translated to 12 different 149-D vectors corresponding to the 12 physicochemical features. By combining the 12 vectors, we obtain an integrated vector containing (149|12)~1,788 components; i.e., each of the nucleosomal sequences in S can be formulated as a 1788-D vector where the first 149 components were derived from the property ''A-philicity'' or P(1), the second 149 components from the property ''base stacking'' or P(2), the last 149 components from the property ''Z-DNA'' or P(12) (cf. Table 1), and T the transposing operator.
Its formulation can be briefly described as follows. Suppose the standard feature vectors for the DNA sequences in S z and S { are, respectively, expressed by where f z u,k is the u-th component of the feature vector for the k-th sequence in the positive dataset S z , f { u,k that for the k-th sequence in the negative dataset S { , N z the total number of DNA sequences in S z , N { that in S { , and V the total number of components in a feature vector. For the current case, we have N z~N {~5 ,000 (cf. Information S1) and V~1,788 (cf. Eq.4).

Performance Evaluation
In statistical prediction, the following three cross-validation methods are often used to examine a predictor for its effectiveness in practical application: independent dataset test, subsampling (Kfold cross-validation) test, and jackknife test. However, as elaborated in ref. [53] and demonstrated by Eqs.28-32 of [15], among the three cross-validation methods, the jackknife test is deemed the least arbitrary that can always yield a unique result for a given benchmark dataset, and hence has been increasingly used and widely recognized by investigators to examine the accuracy of various predictors (see, e.g., [54,55,56,57,58,59,60]). However, since the current study would involve feature selection as described below, to reduce the computational time, the 5-fold crossvalidation test would be adopted as done by many investigators using SVM (Support Vector Machine) as the prediction engine.
Also, to use a more intuitive and easier-to-understand method to measure the prediction quality, according to the definition [33,61], the rates of correct predictions for the nucleosome-forming dataset S z and the nucleosome-inhibiting dataset S { are respectively defined by where N z is the total number of nucleosome-forming sequences concerned and m z the number of nucleosome-forming sequences missed in prediction; N { the total number of nucleosomeinhibiting sequences concerned and m { the number of nucleosome-inhibiting sequences missed in prediction. The overall success prediction rate is given by It is clear from Eqs.12-13 that, if and only if none of nucleosomeforming sequences and nucleosome-inhibiting sequences are mispredicted, i.e., m z~m{~0 and L z~L{~1 , we have the overall success rate L~1. Otherwise, the overall success rate would be smaller than 1.

Feature Selection
Inclusion of redundant and noisy information would cause poor prediction results and increase computational time. To improve the prediction quality and gain deeper insights into the physicochemical properties of nucleosomal sequences, we performed feature selection using the wrapper-type feature selection algorithm called ''fselect.py'', which can be downloaded at http:// www.csie.ntu.edu.tw/,cjlin/libsvmtools. The basic idea of this algorithm is to rank each of the features involved according to a score as elaborated by Chen and Lin [62]. The ranked feature with a higher score indicates that it is a more highly relevant one for the target to be predicted. Based on the ranked features, we used the Incremental Feature Selection (IFS) [63] to determine the optimal number of features. During the IFS procedure, features in the ranked feature set were added one by one from higher to lower rank. A new feature set was composed when one feature had been added. Thus, the N feature sets thus formed would be composed of N ranked features. The t-th feature set can be formulated as For each of the N feature sets, a CD prediction model (cf. Eq.7) was constructed and examined with the 5-fold cross-validation on the benchmark dataset. By doing so, we obtained an IFS curve in a 2D Cartesian coordinate system with index t as its abscissa (or Xcoordinate) and the overall success rate L as its ordinate (or Ycoordinate). The optimal feature set is defined by with which the IFS curve reaches its peak. In other words, in the 2D coordinate system, when X~H the value of L is the maximum. Thus, we can use the H features in Eq.15 to build the final predictor. The predictor established by going through all the above procedures is called iNuc-PhysChem. Meanwhile, a userfriendly web-server for the predictor was also established as will be describe at the end of the paper.

Graphic Profiles of Nucleosome and Non-nucleosome Sequences
Different from the previous methods [10,11,12] that were mostly based on the sequence compositional features, we carried out a graphic profile comparison between nucleosomal and linker (non-nucleosomal) sequences in order to explore the specific features possessed by nucleosomal sequences. Using graphic approaches to study biological problems can provide an intuitive picture or useful insights for revealing complicated relations in these systems, as demonstrated by many previous studies on a series of important biological topics, such as enzyme-catalyzed reactions [64,65,66,67], inhibition of HIV-1 reverse transcriptase [68,69], protein folding kinetics [70], drug metabolism systems [71], and using wenxiang diagram or graph [72] to study proteinprotein interactions [73,74,75]. To introduce graphic approach for the current study, let us use the conversion scheme [18] to transform the nucleosome and non-nucleosome sequences into the numerical vectors (cf. Eq.4). To intuitively show the difference between these two different types of sequences, a graphic expression of the standard feature vector (cf. Eq.5) for the nucleosomal sequences and that for the non-nucleosomal sequences are given in Fig. 2, which consists of 12 panels corresponding to 12 physicochemical properties of DNA sequences (cf. Section 2 of Materials and Method). The curves in the ''Aphilicity'' panel reflect the first 149 components in the two standard feature vectors, those in the ''base stacking'' panel reflect the second 149 components, and so forth. It is interesting to note that, except for the ''B-DNA twist'' panel and ''Protein-DNA twist'' panel, the differences between the nucleosomal and nonnucleosomal sequences are quite remarkable in all the other 10 panels. These findings suggest that the two physicochemical properties might play a less role in distinguishing nucleosomal and non-nucleosomal sequences than the other 10 properties.

Comparison of the 12 Properties in Classification Performance
In order to compare the 12 physicochemical properties for the classification performance, the feature vector Eq.4, standard vector Eq.5, and classifier Eq.7 were reduced from the original 1788-D working space to twelve 149-D sub-working spaces. Each of the sub-working spaces corresponds to one of the 12 physicochemical properties. Shown in Fig. 3 are their success rates in the classification performance when examined by the 5fold cross-validation on the benchmark dataset S. As can be seen from Fig. 3, the success rates obtained by using the ''B-DNA twist'' and ''protein-DNA twist'' properties are indeed remarkably lower than those by most of the other properties, quite consistent with the graphic profile analysis of last section.

Selection of Position Specific DNA Features
To identify the key features for nucleosomal sequence prediction, we used the wrapper-type feature selection algorithm and IFS approach as described in Section 5 of Materials and Method. By adding the ranked features one by one according to the scores calculated by fselect.py, we built 1,788 individual CD predictors for the 1,788 sub-feature sets. We then tested the prediction performance for each of the 1,788 predictors and plotted the IFS curve as shown in Fig. 4, from which we can see that, when the top ranked 884 features were used, the overall success rate reached its peak, i.e., L~96:70% (cf. Eq.13), with L z~9 9:60% for the nucleosome-forming sequences and L {~9 3:86% for the nucleosome-inhibiting sequences (cf. Eq.12).
In other words, we have H~884 (cf. Eq.15) and the optimal feature set for the current biological system should be To provide an overall view, a distribution of the 12 physicochemical features and their roles for the prediction model is given in Fig. 5, where the green boxes indicate the features that were not contained in the optimal feature set S 884 . The red and purple boxes indicate the features that were included in the optimal feature set S 884 : features in red boxes were positively correlated with nucleosomal sequences, while those in purple boxes were negatively correlated with nucleosomal sequences.

Comparison with Existing Methods
Based on the 2-mer absolute frequency of nucleotides, Zhang et al. [76] proposed a model to distinguish nucleosomal and linker sequences. When tested by the 5-fold cross-validation on the benchmark dataset, their method achieved an overall success rate of 95.70%, which is lower than that by the present method. Furthermore, our model trained on the yeast data was also applied to the human genome. According to the human reference genome (hg 18), we randomly extracted 1000 nucleosomal and 1000 linker sequences from the high-resolution experimental data of human CD 4 + T cell [9]. Our model achieved an overall success rate of 98.5% for classifying the experimentally confirmed nucleosomal and linker sequences in the human genome. This result is higher than 93.8% obtained by using the model proposed by Peckham et al. [10], which has also been applied to predict human nucleosomal sequences by Gupta et al. [12]. All these results indicate that it is a quite promising approach by incorporating the DNA physicochemical features for predicting the nucleosomal sequences, and also suggest a conserved mechanism of nucleosome positioning across genomes.
Different with most current nucleosome positioning prediction methods that were solely relied on local sequence compositional information, in this study we developed a new method by incorporating the physicochemical features of DNA sequences. Our rationale to do so is that, different from the other nucleotide information, the physicochemical properties might affect DNA binding of regulatory proteins, either directly by hampering or favoring complex formation, or indirectly through the modulation of the chromatin structure and hence the DNA accessibility [77]. Therefore, the current method may become a useful vehicle for indepth studying nucleosomes.

Web-Server Guide
For the convenience of the vast majority of experimental scientists, below let us give a step-by-step guide on how to use the iNuc-PhysChem web-server to get their desired results without the need to follow the complicated mathematic equations that were presented just for the integrity in developing the predictor.
Step 1. Open the web server at http://lin.uestc.edu.cn/ server/iNuc-PhysChem and you will see the top page of iNuc-PhysChem on your computer screen, as shown in Fig. 6. Click on the Read Me button to see a brief introduction about the predictor and the caveat when using it.
Step 2. Either type or copy and paste the query DNA sequence into the input box at the center of Fig. 6. The input sequence should be in the FASTA format. A sequence in FASTA  format consists of a single initial line beginning with a greater-than symbol (''.'') in the first column, followed by lines of sequence data. The words right after the ''.'' symbol in the single initial line are optional and only used for the purpose of identification and description. All lines should be no longer than 120 characters and usually do not exceed 80 characters. The sequence ends if another line starting with a ''.'' appears; this indicates the start of another sequence. Example sequences in FASTA format can be seen by clicking on the Example button right above the input box.
Step 3. Click on the Submit button to see the predicted result. For example, if you use the three query DNA sequences in the Example window as the input, after clicking the Submit button, you will see the following shown on the screen of your computer: the outcome for the 1 st query sample (with 150 bp long) is ''nucleosome''; the outcome for the 2 nd query sample (with 150 bp long) is ''linker''; the outcome for the 3 rd query sample (with 502 bp long) contains (502{150z1)~353 sub-results, in which the outcomes for the segments from #1 to #61 are of Figure 5. A distribution overall view for the 12 physicochemical features. The features that were included in the optimal feature set S 884 are shown in the red and purple boxes: the former was positively correlated with nucleosomal sequences, while the latter negatively correlated with nucleosomal sequences. Those features that were not in the optimal feature set S 884 are shown in the green boxes. doi:10.1371/journal.pone.0047843.g005 ''linker'', those for the segments from #62 to #198 are of ''nucleosome'', and those from #199 to #353 are of ''linker''. All these results are fully consistent with the experimental observations as summarized in the Information S1. It takes about few seconds for the above computation before the predicted result appears on your computer screen; the more number of query sequences and longer of each sequence, the more time it is usually needed.
Step 4. Click on the Citation button to find the relevant papers that document the detailed development and algorithm of iNuc-PhysChem.
Step 5. Click on the Data button to download the benchmark datasets used to train and test the iNuc-PhysChem predictor.
Step 6. The program is also available by clicking the button download on the lower panel of Fig. 6.

Some Remarks
In this study although iNuc-PhysChem was trained by the dataset derived from Saccharomyces cerevisiae, it can be successfully used to identify nucleosome positioning for an independent DNA segment extracted from the Saccharomyces cerevisiae genome, as demonstrated by the 3 rd sequence in the Example window of the iNuc-PhysChem web-server. Particularly, it can be also successfully used to classify nucleosomal and linker sequences in the human genome, as elaborated in Section 4 of Results and Discussion. Therefore, it is anticipated that iNuc-PhysChem can be successfully used to identify nucleosome in the whole genome as well.
The current study was focused on the demonstration that the physicochemical properties of DNA are important for nucleosome positioning prediction. Since the physicochemical properties of DNA can be used to describe the interaction between DNA and chromatin remodeling complexes in vivo, here we just used the in vivo data for the current study. However, it is instructive to point out that although in vivo and in vitro nucleosome maps are similar, promoters and DNA replication regions, where nucleosomal sequences are depleted in vivo, are strongly affected by nucleosome remodeling [78,79]. In view of this, we shall consider in our future work to use in vitro nucleosome maps [78] and the raw data from [80] to train the prediction model. Also, it is intriguing to analyze the impacts of different conformations (such as B-and Z-form) of DNA to nucleosome positioning, and will be investigated in our future studies as well.
Based on the results as reported in Section 4 of the Results and Discussion, we believe that the user-friendly web-server iNuc-PhysChem as proposed in this paper may serve as a useful tool for studying nucleosome positioning. Or at the very least, it can play a complimentary role to the existing methods in this area. Meanwhile, we also sincerely hope to hear any feedbacks (either positive or negative) from the users in using iNuc-PhysChem to generate their desired data. Their feedbacks will be very useful for us to improve the performance of iNuc-PhysChem.