DFA7, a New Method to Distinguish between Intron-Containing and Intronless Genes

Intron-containing and intronless genes have different biological properties and statistical characteristics. Here we propose a new computational method to distinguish between intron-containing and intronless gene sequences. Seven feature parameters , , , , , , and based on detrended fluctuation analysis (DFA) are fully used, and thus we can compute a 7-dimensional feature vector for any given gene sequence to be discriminated. Furthermore, support vector machine (SVM) classifier with Gaussian radial basis kernel function is performed on this feature space to classify the genes into intron-containing and intronless. We investigate the performance of the proposed method in comparison with other state-of-the-art algorithms on biological datasets. The experimental results show that our new method significantly improves the accuracy over those existing techniques.


Introduction
An important problem for geneticists as well as computer scientists involves classifying particular items into common groups. Here we focus on classifying gene sequences as either introncontaining or intronless. Intron-containing and intronless genes have different biological properties and statistical characteristics. For example, congruent with the Spearmans rank correlation, the comparison of intron-containing and intronless genes shows significantly reduced expression for intronless genes when compared to intron-containing genes [1]. Furthermore, introncontaining and intronless genes usually play important roles in evolution of proteins [2][3][4]. These observations raise interesting questions about the classification of intron-containing and intronless genes.
Peng et al. [5] have discovered that long-range correlation exists in the intron-containing genes but does not exist in the intronless genes. This work was based on a simple random-walk model of gene sequences, in which a pyrimidine led to a step up and a purine a step down. Consequently, the walk resulted in a definite landscape for a given sequence and only one parameter was calculated based on the landscape. This parameter was proposed to distinguish between the intron-containing and intronless genes. However, further study showed that this finding can not be used as a general method to identify intronless genes [6,7]. Zhang et al. [7,8] introduced a Z-curve consisting of three parameters. As an application, they used the Z-curve method to classify a dataset consisting of 100 intron-containing and 100 intronless genes. The discriminant accuracy as high as 89.0% can be obtained by using Fisher's linear discriminant algorithm based on Z-curve. However, although the distributions of three different biological types were displayed in Z-curve, it did not reveal the cross-correlations of distances between the nucleic bases, which are also important parameters to classify genes into intron-containing and intronless. In a similar way, Ma [9] created a model based on position weight function to describe genes by transforming them into quaternary numbers. Especially, this method indicates that E.coli K12s genome and the eukaryote yeasts genome have different strengths of single nucleotide periodicities. Yau et al. [10] firstly developed two-dimensional DNA graphical representation without degeneracy. Since then Yau and his collaborators have been studying efficient methods to cluster and classify DNA and proteins [11][12][13][14][15][16][17][18].
Some successful programs for exon/intron parsing are also proposed. For example, GENSCAN [19] was shown to be dramatically more accurate than the previous state-of-the-art prediction algorithms. It is based on a generalized hidden Markov model (GHMM) framework, and remains a popular bioinformatics tool. More recent de novo gene predictors have also been created, including N-SCAN [20] and EXONSCAN [21]. De novo gene predictors additionally made use of aligned gene sequence from other genomes [22]. Alignments can increase predictive accuracy since protein-coding genes exhibit distinctive patterns of conservation. These modern gene-finding or gene-parsing systems provide a prediction of precise (predicted) splice sites of the exons/ introns in genes, while also producing the intron-bearing status of genes.
Here we propose a new approach, DFA7, to classify genes as to their intron-bearing status. We investigate three new parameters which are based on the cross-correlations between the distributions of distances of nucleic bases in gene sequences. Those new parameters together with Zhang et al. 's original three parameters [7] and the value of their total standard deviation can be used to significantly improve the accuracy of classification on intronbearing status of genes. We perform our DFA7 method on three large gene datasets. The experimental results show that our method significantly improves the discriminant accuracy over those existing techniques. In addition, we examine our 7dimensional feature vector by one-by-one feature deletion, and compare the SVM's efficiency with other machine learning approaches.

Background
The Z-curve theory of DNA sequences was firstly developed by Zhang et al. [7,8]. Consider a DNA sequence with N bases. Let the number of steps be denoted by n (n~1,2, . . . ,N). We count the cumulative numbers of base A, C, G, T which occur in the subsequence from the first to the nth base in the DNA sequence. The cumulative numbers are denoted by A n ,C n ,G n and T n , respectively. The Z-curve is a three-dimensional curve which consists of a series of nodes P n (n~1,2, . . . ,N), whose coordinates are denoted by x n , y n and z n . It is shown that where n~1,2, . . . ,N and A 0~C0~G0~T0~0 . The connection of the nodes P 0~0 , P 1 , . . . ,P N one by one by straight lines is defined as the Z curve of the DNA sequence.
Detrended fluctuation analysis (DFA), firstly introduced by Peng et al. [5], is a scaling analysis method used to estimate long-range power law correlation parameters in noisy signals. By using this technique, Zhang et al. [7] calculated three exponents a, b, and c for a given sequence based on its Z-curve. A 3-dimensional space is spanned by the three exponents. Each DNA sequence may be represented by a point in this space. For any query gene sequence, calculate its 3 exponents a, b, and c, corresponding to a point in the 3-dimensional space. If the point is situated at the upper region of the separating plane, the gene is discriminated as an intronless one; otherwise, the gene is an intron-containing one.
For pursuing higher classification accuracy, more intrinsic parameters are needed. Here we propose a novel method, DFA7 method. In this approach, we introduce 4 new feature parameters l, h, w, and s for each DNA sequence based on DFA. Combining with 3 known parameters a, b, and c from Z-curve we can generate a 7-dimensional feature space, which can be used to classify gene sequences into intron-containing and intronless with much higher discriminant accuracy.

DFA7 method
In a DNA sequence, D m j represents the cumulative distance of all nucleotides of nucleic base j (j = A, C, G, T) to the first nucleotide (regarded as origin) in m steps. Let t i j be the distance from the first nucleotide to the ith nucleotide if the ith nucleotide is Similarly, we get D 10 A~7 , D 10 G~1 z6~7 and D 10 T~4 z9~13. Thus three types of cumulative distances can be defined as follows: where n~1,2, . . . ,N and N is the length of the DNA sequence. D n , E n , and H n reveal the cross-correlation of the ''position'' of each nucleic base in a DNA sequence. These cumulative distances are natural objects from the original DNA sequence, and embody more sequence information which Z-curve fails to provide. Now we construct a 3|3 cumulative distance matrix based on D n ,E n and H n , and then use this matrix to compute 3 new feature parameters l, h, and w. The algorithmic steps of setting the new parameters l, h, and w are provided as follows: (1) Set a window with width l, l~2 n , n~1,2,3,4,5, and move the window from the site l 0 . (2) Calculate the variation of each distribution at the two ends of the window, (3) Shift the window sequentially from the beginning site l 0~1 to l 0~2 and so on, up to l 0~N {l, where N is the length of the sequence. For each value of l 0 starting from 1 to N{l, calculate each corresponding DD l , DE l , DH l . (4) Define the fluctuation functions where the bars indicate an average over all site l 0 in the sequence. Then the matrix of fluctuation functions is defined as follows: Obviously, F is a real and symmetric matrix. Denote the three eigenvalues of F by E 1 ,E 2 and E 3 , such that E 1 §E 2 §E 3 . Based on fluctuation analysis, we can get that where l, h, and w are three parameters determined by the slopes in the log-log plots. In other words, Because of the nonlinear scaling of the axes, a function of the form y~a|x b will appear as a straight line on a log-log graph, in which b is the slope of the line. Therefore, the parameters l, h, and w can be computed by estimating each slope of log-log graph corresponding to E 1 ,E 2 and E 3 from the numerical data. (5) Estimate the slopes l, h, and w of each log-log graph corresponding to E 1 , E 2 , and E 3 computed in step (4).
Thus, for any given DNA sequence, we can calculate three parameters l, h, and w by using the above five algorithmic steps. In step (1), in order to reduce the error for determining the slope w and improve the computational efficiency, the values of l~2 n (n~1,2,3,4,5) are adopted. The line fitted by those l's is perfect. Even if the linearity is not so perfect in several cases, the squared error with respect to the slope and intercept parameters is minimized and the unique straight line can also be obtained by performing a least-squares fit of the data.
After determining l, h, and w, we have a 7-dimensional feature vector consisting of parameters a, b, c, l, h, w, and s, where s is the sample standard deviation of the first 6 features. Then a machine learning method based on a support vector machine (SVM) equipped with a Gaussian radial basis kernel function (RBF) is used for prediction of intronless and intron-containing genes based only on the primary sequences.
We pick up 12 gene sequences as an illustration. The corresponding 7 feature parameters are calculated and listed in Table 1. The first 6 genes are intronless and the last 6 are introncontaining. Then an optimal hyperplane for separating intronless and intron-containing genes can be obtained by implementing SVM classifier based on this 7-dimensional feature space. Figure 1 and Figure 2 show the linearity of log-log plots of one introncontaining gene (Z31371) and one intronless gene (A10909) on the value l, h, and w. We can see that eigenvalues E 1 , E 2 , and E 3 are perfectly fitted by the lines with slope l, h and w when l~2 n , n~1,2,3,4,5.

SVM parameter optimization
The kernel function K( : , : ) dominates the learning capability of the SVM [23]. We use radial basis kernel function K(x i ,x j )~exp({cDDx i {x j DD 2 ) to predict the intronless from intron-containing genes. As in many multivariate applications, the performance of the SVM for classification depends on the combination of several parameters. In general, the SVM involves two classes of parameters: the penalty parameter C and kernel type K. C is a regularization parameter that controls the tradeoff between maximizing the margin and minimizing the training error. The kernel type K is another important parameter. In the radial basis function used in this study, c is an important parameter to dominate the generalization ability of SVM by regulating the amplitude of the kernel function. Accordingly, two parameters C and c should be optimized. The parameter optimization is performed by using a grid search approach within a limited range. Prediction accuracy associated with mean-squareerror (MSE) is used to select the parameters:  In the SVM classification, each data point represents a pair (geneID, y); if the gene is experimentally intronless, y is assigned to 1, otherwise, y is {1.

K-fold cross-validation
After all the seven parameters are determined, we can perform the K-fold cross-validation to estimate the accuracy of our predictive model. In a K-fold cross-validation, the original sample is randomly partitioned into K subsamples. Of the K subsamples, a single subsample is retained as the validation data for testing the model, and the remaining K{1 subsamples are used as training data. The cross-validation process is then repeated K times (the folds), with each of the K subsamples used exactly once as the validation data. Then the K results from the folds are averaged (or otherwise combined) to produce a single estimation. Five-fold cross-validation is performed on this work. Using a grid search method, the model with best (C,c) is obtained, which yields a minimum misclassification rate. The program implementing SVM comes from the R package ''e1071'' which is based on the libsvm 2.8 package [24]. The discriminant accuracy is defined as follows: p~T he number of all correct discriminations The number of sequences in the testing dataset GENESCAN, N-SCAN, Z-curve method, and our DFA7 method are implemented on the dataset in order to compare the results. Since the output of these gene parsing and finding systems provides us with the (predicted) beginning and ending coordinates of exons/introns in these sequences, it is easy for us to determine whether or not the gene is intron-bearing based on the prediction. For the intronless genes, the prediction is regarded as ''false'' if it predicts the splice sites between exons and introns. This confirms the existence of introns. For the intron-containing genes, the prediction is regarded as ''false'' if prediction shows ''single exon''.
For the intron-containing or intronless genes which do contain exons, the prediction is regarded as ''false'' if the predicted answer is ''no exon''. By this way, the programs of GENESCAN and N-SCAN are performed directly on the testing set.

On 2000 mixed prokaryotic and eukaryotic genes
We test our DFA7 method on a large dataset which contains 1000 intronless genes (from prokaryotic genomes completely) selected randomly from UniProtKB/Swiss-Prot (release 15.1) and 1000 intron-containing genes selected randomly from Genbank database (release 170). These genes come from human, thale cress, mus musculus, and other eukaryotes in order to avoid similarity. The classical gene parsing systems GENSCAN, N-SCAN, Z-curve method, and DFA7 method are implemented. To avoid the bias of the discriminant accuracy defined in [7] and the similarity of testing dataset, five-fold cross-validation is used. In SVM classification, the parameter ranges are given as follows: Using the optimal values of C and c, the prediction model is constructed based on the training set by using the SVM learning algorithm with RBF. To minimize data dependence on the prediction model, five-fold cross-validation sampling method is prepared. Each training set consists of 1600 sequences; half of them are randomly selected from data of intronless sequences, and the other half are randomly selected from data of introncontaining sequences. Each testing set is constructed using the left 400 sequences. The prediction results are listed in Table 2 and Figure 3. In Table 2, one can see that our DFA7 approach has higher accuracy than Z-Curve, N-SCAN, and GENSCAN methods.  In order to further illustrate the efficiency of our approach we test our method on another dataset. This carefully-selected dataset contains 1000 genes: 500 of them are human intronless genes which are randomly chosen from Intronless Gene Database [25], and the other 500 are intron-containing genes which are randomly chosen from Genbank database (release 170). Here we must emphasize that all the 1000 genes are from eukaryotic organisms. We partition the 1000 sequences into 5 parts; each part contains 200 sequences (100 intronless genes and 100 intron-containing genes). We treat each part as testing dataset, and the corresponding leftover (800 sequences) as training dataset.
We test two models: (1) using all 7 parameters (our DFA7 method), (2) using the first 3 parameters only (Z-Curve method). For each model, we firstly run 10-fold cross-validation on training dataset to get the best SVM tuning parameters C and c, then fit the SVM model with the chosen parameters C and c, and finally use the fitted SVM model to predict the labels of training dataset (get training errors) and the labels of testing dataset (get testing errors). We show the test results in Table 3. We can see that, for the gene dataset from eukaryotic organisms, our DFA7 method still has higher accuracy than Zhang et al. 's method.
We also compare our DFA7 method with GENSCAN, which can predict the locations and exon-intron structures of genes in genomic sequences from a variety of organisms. We use the same dataset partitions as before (the 1000 sequences into 5 parts; each part contains 200 sequences: 100 intronless genes and 100 introncontaining genes) for this program. In Table 4, we can see that, GENSCAN seems to have higher accuracy (82.50%) than ours (80.12%). However, when we are using GENSCAN, some parameters (for example, organism) are needed to be specified. There are only three organisms to choose from: Vertebrate, Arabidopsis, and Maize. Since our datasets are from eukaryotes (actually, most sequences are from human), we choose organism parameter as ''Vertebrate''. In this case, in order to get highaccuracy result for GENSCAN, we must know the prior information for the sequences, at least the hosts of these genes. Otherwise, for example, if we choose ''Maize'' as the organism parameter, GENSCAN got much lower accuracy (62.60%) as shown in Table 4. Thus it is a disadvantage for GENSCAN method. On the contrary, our DFA7 method does not need any prior information for the gene sequences. The 7 parameters are 7 natural quantities from the original DNA sequence, not set by any artificial intervention.

On 1200 eukaryotic genes
We also test our approach on another large dataset including 600 intronless genes and 600 intron-containing genes from three very different eukaryotic genomes (human, drosophila, and yeast). The 600 intronless genes include 200 human genes, 200 drosophila genes, and 200 yeast genes. Similarly, the 600 introncontaining genes also include 200 human genes, 200 drosophila genes, and 200 yeast genes. These genes are chosen from Intronless Gene Database [25], Berkeley Drosophila Genome Project [26], and Saccharomyces Genome Database [27]. We partition the 1200 sequences into 5 parts; each part contains 240 sequences (120 intronless genes and 120 intron-containing genes). We treat each part as testing dataset, and the corresponding leftover (960 sequences) as training dataset.
We test two models: (1) using all 7 parameters (our DFA7 method), (2) using the first 3 parameters only (Z-Curve method). For each model, we firstly run 10-fold cross-validation on training dataset to get the best SVM tuning parameters C and c, then fit SVM model with the chosen parameters C and c, and finally use Table 6. Prediction results of 1000 eukaryotic genes based on DFA7 method by one-by-one feature deletion testing. the fitted SVM model to predict the labels of training dataset (get training errors) and the labels of testing dataset (get testing errors). We show the test results in Table 5. We can see that, for the discriminant accuracy, our DFA7 method still largely outperforms Zhang et al. 's method.
Furthermore, we compare our DFA7 method with GENSCAN with the same dataset. When we are using GENSCAN, the parameter organism is needed to set. If we choose the organism parameter as ''Vertebrate'', the average accuracy rate for this dataset is only 40%. Actually, the genes in this dataset are from three very different eukaryotic organisms: mammalian (human), invertebrate (drosophila), and unicellular (yeast). The diversity of our dataset leads to the very low accuracy rate. Therefore, the GENSCAN can not output meaningful prediction results with the genes from very diverse hosts. However, our approach can be used on a universal dataset.

Examine DFA7 method by one-by-one feature deletion
To test whether there is any overfitting issue, we use the one-byone feature deletion to justify our DFA7 method. Here we use the previous dataset of 1000 eukaryotic genes. We randomly divide the 500 intronless sequences into 250 and 250, and randomly divide 500 intron-containing sequences into 250 and 250, then use 250 intronless and 250 intron-containing genes as training dataset, and use the rest 250 intronless and 250 intron-containing genes as testing dataset. Thus, in this case, the training dataset and the testing dataset are totally independent.
For one-by-one feature deletion, we test 8 models: (1) using all 7 parameters, (2) using 6 parameters after deleting a, (3) using 6 parameters after deleting b, (4) using 6 parameters after deleting c, (5) using 6 parameters after deleting l, (6) using 6 parameters after deleting h, (7) using 6 parameters after deleting w, (8) using 6 parameters after deleting s. For each model, we run 10-fold crossvalidation on training dataset to get the best SVM tuning parameters C and ''c'', fit SVM model with the chosen parameters C and ''c'', then use the fitted SVM model to predict the label of training dataset (get training errors) and the label of testing dataset (get testing errors). We show the test results in Table 6.
From the results, we can see that, except ''deleting h'', the model using all 7 parameters gives the highest average accuracy in Table 6. Here we must point out that h is not an overfitting parameter. The model of ''deleting h'' gives more error counts than the original DFA7 model for many partitions. However, there is only one random partition dataset in which the ''deleting h'' model gives much less errors than DFA7. This big difference causes that the final average accuracy of the ''deleting h'' model is slightly higher than the DFA7 model. Actually, for most random partitions of dataset, our DFA7 model gives much higher accuracies.

Comparison with other machine learning approaches
Based on our proposed new features, we also use other machine learning techniques to test the classification results. Using the same dataset of 1000 eukaryotic genes, we compare the performance of SVM, Backpropagation network (BPN), and Radial Basis Function network (RBFN) [28][29][30]. To train a BPN, we use the function ''neuralnet'' in the R package ''neuralnet''. To train an RBFN, we use the function ''rbf'' in the R package ''RSNNS''. Both of them are available at http://cran.r-project.org/web/packages/.
The training errors and testing errors following the same setup of partitions as in section ''On 1000 eukaryotic genes'' are shown in Table 7 and Table 8, respectively. Here SVM7 indicates SVM with all the 7 parameters, SVM3 indicates SVM with the first 3 parameters, and so on. Note that BPN7 is not available because the convergence of training procedure of BP network with all the 7 parameters is too slow. Based on the cross-validation results, we can see that BPN and RBFN have much more errors than SVM on the dataset.

Conclusion
In this work, we propose a new computational approach to distinguish between intron-containing and intronless gene sequences. In comparison with previous literature, the predictive performance of our method has been significantly enhanced. It is anticipated that the current method can be a complementary tool for distinguishing intronless genes from intron-containing genes. Seven feature parameters a,b,c, l, h, w, and s can be computed using the algorithmic steps as we described. SVM classifier with RBF function is also performed on those seven parameters to classify the genes. Our new feature parameters can be used to discover more information hidden within the genes. Our DFA7 method mainly focuses on distinguishing intron-containing and intronless gene sequences. Further studies of this method will be needed to make specific splice-site predictions available. We will also evaluate the relative importance of the feature parameters and find more valuable features, which could help to classify genes and proteins with higher accuracy based on their structures. The datasets used in this work are available at http://www.math.uic. edu/,jyang06/publications/datasets/.