Short Exon Detection via Wavelet Transform Modulus Maxima

The detection of short exons is a challenging open problem in the field of bioinformatics. Due to the fact that the weakness of existing model-independent methods lies in their inability to reliably detect small exons, a model-independent method based on the singularity detection with wavelet transform modulus maxima has been developed for detecting short coding sequences (exons) in eukaryotic DNA sequences. In the analysis of our method, the local maxima can capture and characterize singularities of short exons, which helps to yield significant patterns that are rarely observed with the traditional methods. In order to get some information about singularities on the differences between the exon signal and the background noise, the noise level is estimated by filtering the genomic sequence through a notch filter. Meanwhile, a fast method based on a piecewise cubic Hermite interpolating polynomial is applied to reconstruct the wavelet coefficients for improving the computational efficiency. In addition, the output measure of a paired-numerical representation calculated in both forward and reverse directions is used to incorporate a useful DNA structural property. The performances of our approach and other techniques are evaluated on two benchmark data sets. Experimental results demonstrate that the proposed method outperforms all assessed model-independent methods for detecting short exons in terms of evaluation metrics.


Introduction
As an initial step in the analysis of eukaryotic genome sequences, detecting exons would lead to a good understanding of the structure and function of a protein that is synthesized by these exons [1,2]. Unlike prokaryotes, eukaryotic genes are further divided into relatively small exons called protein coding regions and the introns called non-coding regions, as shown in Fig 1. In the past twenty years or so, many algorithms have been proposed for exon detection and good detection rate has been achieved in the recognition of exon and intron regions . But despite the efforts spent, it is still an open question whether the strengths of the statistical features are sufficient to identify short exons. Typically, human exons are much shorter in length (137 base pairs (bp) in average) [6]. So the task for accurate and reliable methods to automatically determine the lengths and locations of short exons still needs to be solved today [2,[8][9][10][11][12][13][14][15][32][33]. In addition, accurate location of short exons in genomes can help to design drugs and cure diseases. For example, some short exons of BRCA1 gene are related to ovarian and breast cancer [34][35][36][37]. Therefore, it is essential to develop an efficient technique for detecting the short exons of eukaryotic DNA sequences.
The detection of exons suggested in the literature can be classified into two categories, model-dependent methods and model-independent methods [38]. Model-dependent methods employ previously known genomic information or learning models to train the classifiers in the design of analysis stage [3][4][5][6][7][8][9][10][11], which means these techniques tend to be more precise. However, the issue of detecting short exons in eukaryotic genomes by using these methods still remain to be solved. Because on the one hand, assessing the coding potential of short sequences is not self-evident as intrinsic signals are harder to detect with shorter sequence lengths [11]; and on the other hand, methods such as Markov models rely on more sequences that contain short exons into the training sets [12]. Further, training datasets have limitations in dealing with the next-generation genome sequencing projects and can attach ascertainment bias to unknown genes or exons with a typical organization or structural components [10,31]. By contrast, the methods designed to detect universal and statistical features of exons are model-independent as they do not assume any a prior information to train models or estimate parameters [38]. Model-independent methods for short exons detection, is considerably more challenging for some reasons. First, it is difficult to reveal the underlying property in the data [14]. For example, the statistic features obtained from short exons may be unreliable and the performance for short exon detection may be poor. Second, sequencing and frame shift errors such as insertions and deletions can also vitiate the techniques when the exons are short [15]. Nevertheless, model-independent methods may be adaptive enough to analyze the novel sequences when prior genomic information or training dataset is unavailable from the existing databases [2,[26][27][30][31]. This paper is aimed at improving the detection accuracy of short exons in model-independent domain. It is well-known that the exons exhibit three-base periodicity (TBP), which is absent in other regions such as intergenic and intron regions in eukaryotes [39][40][41][42][43]. Modelindependent methods based on digital signal processing (DSP) techniques and TBP are built upon the phenomenon that exon regions have a prominent power spectrum peak at frequency f = 1/3 [16][17]. During recent years, a large number of model-independent methods have been proposed to detect exons [1,2,[16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31]. The detecting techniques which rely on discrete Fourier transform (DFT) include the spectral content measure [16] and its various improved versions [17][18][19][20][21]. The DFT-based methods have been the basis of several techniques in many ways. Vaidyanathan and Yoon [22] utilized anti-notch filters to solve the same problem. Yin and Yau [25] developed a novel exon detection algorithm, known as exon prediction by nucleotide distribution (EPND) to investigate the relationship between TBP and nucleotide distributions. Mena-Chalco et al. [26] introduced a method based on modified Gabor-wavelet transform (MGWT) for improving exon detection. This technique outperforms the sliding window approaches with respect to detection accuracy. Zhang and Yan [27] combined DNA structural profiles and empirical mode decomposition to improve short exon detection. This novel method named as the fast Fourier transform plus empirical mode decomposition (FFTEMD) provided a pictorial view of spectrum analysis of non-stationary signal. Recently, Marhon and Kremer [31] proposed the wide-range wavelet window (WRWW) method for extracting exon components. The WRWW can adapt its width to accommodate the change in the window length and it outperforms all assessed model-independent methods with relatively short and long exons.
To perform accurate detection of short exons, a natural solution of the problem is to extract and enhance their weak features embedded in background noise of intron regions. Existing methods provide a description of the overall regularity of exons, but they are not well adapted for finding the location and spatial distribution of singularities represented by short exons. This is the major motivation to study the singular measure and its application to short exon detection. As a generalized multifractal formalism for fractal functions and singular measures, the wavelet transform modulus maxima (WTMM) [44][45] has proved its success in studying the long-range correlation properties of genomic DNA sequences [45][46][47][48] and analyzing the strand compositional asymmetry profiles in relation to transcription and replication [49][50][51][52], by providing "oscillating boxes" to get rid of possible smooth behavior that may either mask the singularities or perturb the estimation of their strength. A comprehensive review of the research of wavelet-based multi-scale signal processing and WTMM on genomic information has been summarized in [53].
In this work, we will follow the WTMM-based strategy inspired from the one previously used in the analysis of genomic information [45][46][47][48][49][50][51][52][53] to detect short exons. Firstly, the pairednumerical representation is employed to incorporate a useful structural property and reduce the computational cost. We apply this numerical representation to construct the nucleotide distribution sequence. In this design, the statistical properties in the three reading frames is extracted from DNA sequences to differentiate between exon and intron regions. Then the nucleotide distribution sequence is used to calculated the TBP spectrum by an optimized WTMM-based method. Finally, to reflect a reality of the structure of double helix DNA, the output values are calculated along the forward and reverse directions. Case studies indicate that the proposed method improves detection accuracy of exons in comparison with existing methods and exceeds its counterparts in the ability to detect short exons.

Related Works
2.1.1 Numerical representation of a DNA sequence. A symbolic DNA sequence is composed of four different nucleotides (or bases) named Adenine (A), Thymine (T), Cytosine (C) and Guanine (G). In order to employ DSP tools on a DNA sequence, the first step in the process is to convert symbolic DNA sequences into numerical sequences. In this paper, we introduce the paired-numerical representation [21]. The paired representation exploits the property of the frequency distribution of nucleotides A−T and C−G, which is based on the fact that intron regions are rich in nucleotides A and T while exon regions are rich in nucleotides C and G. This representation could provide more differentiation between exon and intron regions when the TBP is investigated. This paired representation scheme could decrease the computation time compared to three-(Z-curve) [9] and four-(Voss) [42] sequence representations. Using this paired-numerical representation, the presence of nucleotide A or T at a particular base pair position is denoted by 1, and the absence of it is denoted by 0. Similarly, the existence of nucleotide C or G in a relative base pair position is represented by -1 and the absence of the nucleotide is represented by 0. An example of this representation scheme of a DNA sequence is presented in Table 1.
2.1.2 Calculating the TBP spectrum from occurrence frequencies of nucleotides. The nucleotide distribution in the three reading frames incorporates the reality of proteins' prefer specific amino acid compositions [39]. There exists a relationship between the occurrence frequencies of a nucleotide in the three codon positions and the TBP spectrum [24]. In this section, we introduce an efficient way to compute the TBP spectrum by calculating the nucleotide occurrence frequencies in the three codon positions. This approach can reduce the computing time significantly by comparing with Fourier transform.
Let u x (x 2 {A−T, C−G}) be a paired-numerical sequence of length N, the DFT power spectrum of a length-N block of u x is defined as follows: where i 2 = −1. The TBP in exon regions of a DNA sequence suggests that the DFT power spectrum S x [k] corresponding to k = N/3 (where N is chosen to be a multiple of 3) in each DFT sequence should be large [16,17]. By sliding a window across the sequence, we can determine the TBP value at each position.
denote the occurrence frequencies of a nucleotide in the first, the second and the third codon position of u x , respectively. The occurrence frequency f x (p) of nucleotide x at the codon position p is defined as Table 1. An example of the paired-numerical representation scheme of a DNA sequence.

Procedures of Singularity Detection
The singularities (or sharp variation points) often carry the most important information in the analysis of transient signal. Mallat's WTMM theory proves that the singularities of a signal can be measured from the evolution of the wavelet transform maxima across scales [44]. In this section, we first describe the implementation issues of the stationary wavelet transform (SWT) [54][55][56][57]. The second part of this section introduces the basic principles of the WTMM method. Then we give a fast reconstruction algorithm of wavelet coefficients from WTMM points, based on Hermite interpolation [58][59]. Finally, an optimized algorithm of singularity detection has been proposed for tracking WTMM and extracting useful information easily and correctly. 2.2.1 SWT method. The SWT method having been independently discovered for different purposes and given a number of different names [54][55][56][57], including shift/translation invariant wavelet transform, undecimated discrete wavelet transform, or redundant wavelet transform.
The key point is that it is redundant, shift invariant, and it gives a denser approximation to the continuous wavelet transform than the approximation provided by the orthonormal discrete wavelet transform (DWT). Let h 2 ' 2 ðZÞ and g 2 ' 2 ðZÞ be the scaling and wavelet filters of an orthonormal DWT, respectively. The scaling filter and wavelet filter of an SWT at scale j + 1 is defined recursively as where h 0 ½k ¼ h½k and (4), where w j is the wavelet coefficients at scale j and v J is the scaling coefficients at the coarsest resolution. The passage from one resolution to the next one is obtained by where j = 0, . . .J−1 and " Ã " denotes convolution.

Basic principles of WTMM.
Definition [44]: Let W a, t be the wavelet transform of a signal s(t) at the scale a and the position t.
• A WTMM is defined as a point (a 0 , t 0 ) such that jW a 0 ;t j < jW a 0 ;t 0 j when t belongs to either a right or the left neighborhood of t, and jW a 0 ;t j jW a 0 ;t 0 j when t belongs to the other side of the neighborhood of t 0 . We call maxima line, any connected curve in the scale space (a, t) along which all points are WTMM.
• For any point is Lipschitz α at t 0 , if and only if there exists a constant A at each modulus maxima (a, t) it holds that jW a;t j Aa a ; i:e:; logjW a;t j logðAÞ þ alogðaÞ: Let a = 2 j , we can rewrite Eq (6) in the form: log|W j, t | log(A) + αj. Mallat has proved that the signal has singularities whose Lipschitz regularity are positive, the amplitudes of the WTMM would increase as j increases, while the noise creates singularities whose Lipschitz regularity is negative, its modulus maximum will decrease with the increasing of the level j.
2.2.3 Reconstruction of wavelet coefficients using Hermite interpolation. As a method of piecewise-polynomial interpolation, Hermite interpolation matches the observed values of their first derivatives, which agrees well with the fact that the absolute value of the first derivative of a WTMM is zero. In [58] and [59], a Hermite interpolation polynomial is utilized to reconstruct wavelet coefficients from WTMM points, which yields to high computational efficiency compared with Mallat's alternating projection reconstruction algorithm. In this study, we use Hermite interpolation to accelerate the reconstruction process of wavelet coefficients. For simplicity, let W j, t be the wavelet coefficients of a function at the position t, where j represents the resolution level. Let A j, r (r = 1, 2, . . ., r 0 ) denote the local maxima of W j, t for modulus maximum lines at the position t j, r , the reconstructed wavelet coefficients can be given as follows: W 0 j; t ¼ A j; r 1 À 2 t À t j; r t j; r À t j; rþ1 " # t À t j; rþ1 t j;r À t j; rþ1 " # 2 þ A j; rþ1 1 À 2 t À t j; rþ1 t j; rþ1 À t j; r " # t À t j; r t j; r À t j; rþ1 where t 2 [t j, r , t j, r+1 ] and r = 1, 2, . . .,r 0 −1. Eq (7) gives the wavelet coefficient points between adjacent local maxima at every resolution level. Note that the degree of the interpolation polynomial is only three. Therefore, this reconstruction scheme simplifies and accelerates the reconstruction process.

An optimized algorithm for singularity detection with WTMM.
In this subsection, we attempt to discriminate between the singular points of exons and introns, both presented in the DNA sequences. Given the input signal and the input noise (detailed in Section 2.3.2), our singularity detection algorithm for exon detection is described as follows: 2. Calculate the local modulus maxima and their positions on each level from the wavelet coefficient set W = {w 1 , . . ., w J } of the input signal. At level J, retain the local modulus maxima which is greater than a threshold value and delete the local modulus maxima which is less than this threshold. In this paper, the threshold value is decided by where c is the weight value determined by experience.
3. At level j = J−1, search modulus maxima that propagate from level J to J−1. Set j = j−1, repeat the process until J = 1. The maxima set for modulus maximum lines is marked as W 0 ¼ fw 0 1 ; . . . ; w 0 J g.

4.
Utilize the Hermite interpolation polynomial method to reconstruct the wavelet coefficients from W 0 ¼ fw 0 1 ; . . . ; w 0 J g, then perform the inverse SWT to obtain the desired signal.

Method for Short Exon Detection Based on WTMM
In this section, we first describe a method to estimate the background noise of intron regions by using a notch filter with its notch centered at ω 0 = 2π/3. The second part details the construction algorithm of nucleotide distribution sequence. Finally, we present our WTMM-based method for short exon detection. 2.3.1 Estimate background noise using notch filter. In order to estimate the noise level in intron regions, we suppress the TBP components from background information by filtering the genomic sequence through a notch filter with passband centered at ω 0 = 2π/3. The notch filter can be obtained by starting from a second order all-pass filter [22] AðzÞ ¼ which has poles at Re ±jθ and zeros at 1/Re ±jθ . Then the notch filter G(z) has the form where K = (1 + R 2 )/2, cosω 0 = 2Rcosθ/(1 + R 2 ) and R = 0.992. By choosing ω 0 = 2π/3, the filter G(z) can be used to suppress the TBP components of the DNA effectively [30]. The response of G(z) is shown in Fig 2.  3. With the DNA sequence u x (x 2 {A−T, C−G}) taken as input, let u ðnÞ x denote the output of the notch filter G(z) of Eq (10). Set u x ¼ u ðnÞ x , and repeat the step (1) and step (2), the nucleotide distribution sequence for estimated noise is marked as Step 2: From Eqs (3) and (15), the TBP spectrum for the l-th segment in one direction is Step 3: In view of Eq (16), the output feature value can be calculated based on the following equation: PðlÞ¼ X k P k ðlÞ: ð17Þ Step 4: Interpolate P(l) to get the length of the signal back to N.

General Setting
In this section, we provide experiments of exon detection by using the WTMM-based method. A window length of 90 points is used throughout the experiments, and the input is represented up to 4 resolution levels through the biorthogonal wavelet function Bior1.3. In our WTMMbased method, the moderate value of weight coefficient of Eq (8) is set to c 2 [0.6, 0.7]. Herein, the WTMM-based method with weight coefficient c = 0.7 is denoted by WTMM I, while the WTMM II denotes the WTMM approach for the weight coefficient c = 0.6. As a comparison, we have evaluated the general performances of three other methods: EPND [25], MGWT [26] and FFTEMD [27]. In the case of MGWT, the window length is 1200, and the scale parameter is set to 40 exponentially-separated values between 0.2 and 0.7 for four input-sequences. In order to investigate the performance of MGWT on short exon detection, the operation on MGWT with window length 1200 points and scale values exponentially separated between 0.1 and 0.7 is denoted by MGWT I, while MGWT II denotes the MGWT with window length 900 points and scale values exponentially separated between 0.2 and 0.7. To quantify the comparison, the magnitudes of TBP spectrums obtained from the WTMM I, WTMM II, MGWT, MGWT I, MGWT II and FFTEMD on each DNA sequence have been normalized with values between 0 and 1.

Data Resources
To evaluate the performance of various methods, the DNA sequence AB021866 of H. sapien (GenBank file AB021866) is used in our study. In order to comprehensively evaluate and compare our WTMM-based algorithm with the other methods, the two benchmark data sets HMR195 [60] and BG570 [61] have been considered. Table 2 summarizes the statistics of the considered test data sets. Fig 4 shows the distribution of the exon lengths in these two benchmark data sets.

Evaluation Metrics
In order to evaluate the performance of considered methods, we compute the following counts: TP is the true positive, which is the length of nucleotides of correctly detected exons; TN is the true negative, which is the length of nucleotides of correctly detected introns; FN is the false negative, which is the length of nucleotides of wrongly detected introns; and FP is the false positive, which is the length of nucleotides of wrongly detected exons. Following the evaluation metric included in [60], the performances of various methods are measured in terms of the correlation coefficient (CC) which provides a measure of overall detection accuracy. We also calculate the sensitivity Sn and specificity Sp: Sp The sensitivity provides a measure of the proportion of exon nucleotides that have been correctly detected as exons, and the specificity provides the proportion of intron nucleotides that have been correctly detected as introns. We use the average value of Sn and Sp as a measure of the probability of correct detection. For validation of the classification of sequences, we employ the receiver operating characteristic (ROC) curve [62,63], which explores the effects on Sensitivity and 1 − Specificity.

Performance Evaluation on a Typical Sequence
The DNA sequence AB021866 of 3,611 bp contains seven exons located at relative positions of 43-93, 225-259, 1,573-1,681, 2,393-2,543, 2,683-2,801, 2,874-2,962 and 3,392-3,413. The experimental results obtained from consider methods on this particular sequence are presented in

Performance Evaluation on Benchmark Data Sets
To evaluate the performances of the considered methods over a larger number of sequences, we carry the classification experiments to compare the efficiencies of the introduced methodologies over HMR195 and BG570 data sets. In order to set up a comprehensive comparison for exon detection, especially short exon detection, we first conduct an experiment to calculate the CC values achieved in considered methods over exons in five ranges of lengths (length 50 bp, length 100 bp, length 150 bp,  show that the WTMM I outperforms the performance of the other consider methods at these five ranges of exon lengths. The WTMM II gives good results at the first three ranges and is close to the MGWT at the range >150, while it slightly exceeds the other methods at the range 50. In addition, the performances of the MGWT I and MGWT II outperform the MGWT at ranges 50, 100 and 150, but less than the MGWT at the ranges of >150 and all length. In general, our WTMM-based method consistently generates higher detection accuracy when it deals with exons that are either relatively short or long in length. Table 3 summaries the experiment results of the different considered methods at each group of exons from the HMR195&BG570 in terms of the correlation coefficient. Similar   Tables 4 and 5 corresponding to the HMR195 and BG570, respectively. By comparison with the existing methods, the detection results of Table 3 show that: 1) Our WTMM-based method exhibits at least improvement of 74.2%, 12.3% and 21.6% on the exons of lengths not greater than 50 bp, 100 bp and 150 bp, respectively; 2) The WTMM-based method reveals at least improvements of 10.3% and 14.1% over the exons of lengths greater than 150 bp and all length, respectively.
An additional classification experiment is designed to assess the performances of introduced methods. Fig 10 depicts the best accuracy in terms of the average of Sn and Sp calculated at each group of exons from the HMR195&BG570. It should be noted that the EPND generates high accuracy at the range 50, which can be contributed to the statement [25]: "If a DNA region less than 50 base pairs is identified as an intron, and is flanked by two exon regions, this region is often a false negative, and is reset as exon region; similarly, if a DNA region less than 50 base pairs is identified as an exon, and is flanked by two intron regions, this region is often a false positive, and is reset as an intron region.". These plots again establish the superiority of our WTMM-based algorithm over other methods in short exon detection.
The ROC curves obtained from the WTMM I, WTMM II, MGWT, MGWT I, MGWT II and FFTEMD methods for HMR195&BG570 are shown in Fig 11. It can be observed that the WTMM-based method generates higher detection accuracy compared with its counterparts. The experiments in ROC plots are consistent with the results of Figs 7 and 10.

Summary
This research is aimed at detecting short exons in eukaryotic DNA sequences. The misinterpretation of exon locations will lead to a misunderstanding of the gene or exon properties. This open problem poses a challenge to understand and define precisely the biochemical processes and information involved in the pathway from DNA to proteins. Consequently, knowledge pertaining to exon locations may result in the design of customized drugs and new cures for  . In gene BRCA1, the mutations in exon 11 are associated with the forms of breast cancer [34][35][36][37]. It is clear that our WTMM-based algorithm can generate high peaks in the region of exon 11 than the other methods.  To summarize, we have used the WTMM technique to detect exons in eukaryotic genome sequences. Compared with three existing algorithms, the technique described in this paper provides an improvement in short exon detection. The WTMM-based method introduced in this study has the following three features: (1) This technique first constructs a sequence of nucleotide distribution to measure the 3-base periodicity. The spectral contents are calculated from the nucleotide frequencies in the three reading frames, which incorporates the reality of proteins' prefer specific amino acid compositions and reduces the computational cost. (2) The important feature of our method is to explore the evolution of singularities of short exons across scales from the local maxima of their wavelet transform modulus. Unlike the description of the overall regularity for exons, the proposed measure provides significant patterns that are rarely observed with the traditional methods. (3) The method calculates the output values from a paired-numerical representation in both forward and reverse directions, which reflects the reality of the structure of DNA and increases computational efficiency.

Conclusion
In this paper, a model-independent method based on the singularity detection with wavelet transform modulus maxima has been developed for detecting eukaryotic exons, especially short exons. The WTMM technique employed in our method makes it capable of working with short exon detection in the performance of experimental evaluation. The analysis of the proposed method has shown that it has an adaptive response to the location and spatial distribution of singular points represented by short exons. Experimental results show that the WTMM-based method outperforms the assessed model-independent methods for short exon detection in terms of evaluation metrics.