Exon prediction based on multiscale products of a genomic-inspired multiscale bilateral filtering

Multiscale signal processing techniques such as wavelet filtering have proved to be particularly successful in predicting exon sequences. Traditional wavelet predictor is domain filtering, and enforces exon features by weighting nucleotide values with coefficients. Such a measure performs linear filtering and is not suitable for preserving the short coding exons and the exon-intron boundaries. This paper describes a prediction framework that is capable of non-linearly processing DNA sequences while achieving high prediction rates. There are two key contributions. The first is the introduction of a genomic-inspired multiscale bilateral filtering (MSBF) which exploits both weighting coefficients in the spatial domain and nucleotide similarity in the range. Similarly to wavelet transform, the MSBF is also defined as a weighted sum of nucleotides. The difference is that the MSBF takes into account the variation of nucleotides at a specific codon position. The second contribution is the exploitation of inter-scale correlation in MSBF domain to find the inter-scale dependency on the differences between the exon signal and the background noise. This favourite property is used to sharp the important structures while weakening noise. Three benchmark data sets have been used in the evaluation of considered methods. By comparison with four existing techniques, the prediction results demonstrate that: the proposed method reveals at least improvement of 4.1%, 50.5%, 25.6%, 2.5%, 10.8%, 15.5%, 11.1%, 12.3%, 9.2% and 2.4% on the exons length of 1–24, 25–49, 50–74, 75–99, 100–124, 125–149, 150–174, 175–199, 200–299 and 300–300+, respectively. The MSBF of its nonlinear nature is good at energy compaction, which makes it capable of locating the sharp variations around short exons. The direct scale multiplication of coefficients at several adjacent scales obviously enhanced exon features while the noise contents were suppressed. We show that the non-linear nature and correlation-based property achieved in proposed predictor is greater than that for traditional filtering, which leads to better exon prediction performance. There are some possible applications of this predictor. Its good localization and protection of sharp variations will make the predictor be suitable to perform fault diagnosis of aero-engine.


Introduction
Recent advancement in high-throughput analysis, such as next-generation sequencing, has resulted in the development of computational techniques for the rapid prediction of exons in DNA sequences.Although great progress has been made in the development of exon prediction algorithms, the challenge of determining the lengths and locations of short exons urgently needs to be solved [1][2][3].The main difficulty in predicting short exons is that the intrinsic properties, such as codon biases, are harder to determine [3,4].To date, there is no consensus about the definition and classification of short exons.Saeys et al. thought that the exons with lengths of <200 base pair (bp) might be considered small [2].Recently, two independent studies by Irimia et al. [5] in Cell and by Li et al. [6] in Genome Research defined one class of short exons called microexons and uncovered the features regulating the inclusion of these microexons.Irimia et al. reveal that the regulation of microexons (defined as exons with lengths of 3-15 bp) is highly dynamic during neuronal differentiation and the inclusion of these microexons can modulate the function of interaction domains of proteins involved in neurogenesis [5].In another study, Li et al. demonstrate that microexons (defined as exons with lengths of �51 bp) exhibit a high level of sequence conservation and they may possess brain-specific functions [6].Thus, knowledge pertaining to short exons in genomes is very important for understanding the functioning of proteins and the life processes.Therefore, the challenge of determining the lengths and locations of short exons urgently needs to be solved.In another work [7], we have briefly outlined the intrinsic advantages and limitations of the existing methods for predicting exons.In this paper, we focus on the development of a spectral analysis technique for finding exons in eukaryotic DNA sequences, as described below.
The discrete nature of DNA information has been driving a surging interest in the application of the principles of spectral analysis to develop efficient exon-prediction techniques.Spectral analysis techniques are attractive because they are easy to implement, entail reduced computational complexity, and mostly do not require any training of the genomic data [8][9][10].
In the spectral analysis of DNA sequences, the three-base periodicity (TBP) exhibited by exons is a good discriminator of coding potential.The determination of TBP due to codon usage bias is built upon the phenomenon that exon regions have a prominent power spectrum peak at frequency f = 1/3 [8,9].Numerous advanced exon-finding algorithms have been developed by tracking the strength of TBP along a DNA sequence [3,4,[7][8][9][10][11][12][13][14][15][16][17][18][19][20][21].Such methodologies have a strong mathematical basis, including Fourier transform measures [8,[11][12], digital-filterbased methods [10,13], wavelet-based techniques [3,7,[14][15][16][17][18] and other analysis tools [9].Wavelets have proved highly successful in the manipulation and analysis of biomedical signals [22][23][24][25][26][27][28].Among exon-finding methods, wavelet-based techniques are said to be distinctive.The examination of local variations in scale of the multiscale transform data of the sequence makes the wavelet predictor more powerful.Traditionally, the base idea of wavelet predictor, such as the modified Gabor-wavelet transform (MGWT) [14] and the wide-range wavelet window (WRWW) [18], is to computes a weighted sum of nucleotide values over a large neighbourhood at different scales.Although wavelet-based methods yield good predictions, they do not perform well in preserving the short exons and the exon-intron boundaries due to their linear nature.The multiscale bilateral filtering (MSBF) methods have been widely used in medical image processing field [29][30][31].The nonlinear regularization of MSBF makes it an excellent solution for enhancing the high frequency structures and suppressing image noise.In this paper, we will follow the MSBF based strategy inspired from the one previously used in the analysis of image information to predict exons.
Our intuition is that nucleotides in the codon position p(p = 1,2,3) are close to each other not only if they occupy nearby spatial locations but also if they have some similarity at the reading frame p.For this purpose, we propose a genomic-inspired MSBF that can incorporate domain and similarity by means of multiplication.Like traditional wavelet predictor, a domain filtering named B-spline wavelet transform is designed to extract TBP by weighing nucleotide values with complex coefficients.Similarly, we define range filtering, which measures similarity by counting the sum of difference for variable sequence coverage.Another object of this paper is to investigate the inter-scale correlation (or multiscale products) information in MSBF domain and its application to exon prediction.We formulate the problem of investigating the correlated features in terms of the differences between exon and intron coefficients at two adjacent scales.We pursue this investigation which results from the HMR195 dataset by calculating the Jensen-Shannon divergence and the histogram distributions.Experimental results demonstrate that through MSBF and multiscale products, detection accuracy can be significantly improved with only a small loss in exon prediction.The proposed technique, termed multiscale products in MSBF domain (MP-MSBF), is more effective than locating exons directly from the linear filtering data, leading to superior exon prediction results.

Numerical representation of a DNA sequence
The representation of DNA character strings into numerical sequences is the first step in DNA spectral analysis.In this paper, the paired-numerical representation [8] is introduced to map DNA characters (i.e., A, C, G, and T) into numeric values.A particular advantage of this representation is that it exploits the structural differences between exon and intron regions to facilitate the TBP extraction, in addition to reducing complexity.Eq (1) provides an example of this representation scheme for the short DNA fragment . ..CTGCAGTGGT. ..: u ¼ f. . .À 1; 1; À 1; À 1; 1; À 1; 1; À 1; À 1; 1 . ..g: ð1Þ

Genomic-inspired MSBF
To introduce our genomic-inspired MSBF, we first describe in Section A the domain filtering called B-spline wavelet transform.This wavelet function exhibits a higher degree of freedom for curve design, which can be adapted to analyse complex genome.In the next section, we first define a continuous representation of the average magnitude difference function (AMDF) inspired by Akhtar et al. [8], and a range filtering built with AMDF is designed to find certain information about nucleotide similarity in a specific codon position.Finally, the genomicinspired MSBF is suggested for differentiating between intron noise and meaningful data.
A. Domain filtering.In this work, domain filtering given by B-spline windows are formulated.The B-spline window β m (t) of order m, which is time-limited in [−T/2,T/2], is built as follows [32]: where

following Eqs (2) and (3).
To fully analyse the DNA sequences characterized by a specific periodicity, the task here is to extract the TBP at different scales while keeping the analysis frequency constant.From Eqs The domain filtering of Eq (6) measures the geometric distance between the center nucleotide and its neighbourhood.
In the case of domain filter, the length of the domain filter is 2400, and the scale parameter is set to 10 exponentially separated values between 1/60 and 1/6 for an input sequence.For practical purposes, the order of the B-spline function β m (t) is truncated to 6.
B. Range filtering.Before continuing to our genomic-inspired MSBF, we first use the AMDF to design a range filtering for measuring nucleotide similarity in a specific codon position.A continuous representation of AMDF for a signal u, as a function of the grid spacing τ 0 , is defined as where L is equal to the window length of Eq (4), τ 0 is set to 3 for TBP.Before applying AMDF to a DNA sequence, the authors in [8] suggest passing it first through a second-order resonant filter centered at frequency 2π/3 [13].
For efficient implementation, a multiscale and sliding window will move along the filtered sequence to compute AMDF for the whole sequence.The complex envelope of φ d (t,b,a) given in Eq (4) is then used to calculate the window: From Eqs ( 7) and ( 8), the proposed range filtering for a signal u can be formulated as follows: The range filtering of Eq (9) measures the radiometric distance between the center nucleotide and its neighbourhood.Finally, the expressions given in Eqs ( 6) and ( 9) are used to design our genomic-inspired MSBF of a signal u, having the non-linear property: Uðb; aÞ ¼ U d ðb; aÞ � U r ðb; aÞ: ð10Þ Given a DNA sequence of length N, the projection of the MSBF coefficients onto the position axis is defined as a function of b(b = 0,1,. ..,N−1).

Multiscale products
Several exon-finding techniques take advantage of traditional wavelet transform to filter short exons with small scales and long exons with large scales.This approach implies that they do not exploit the dependencies between adjacent scales.To explore the MSBF inter-scale correlations we multiply the adjacent MSBF sub-bands to distinguish intron noise from meaningful data while preserving the sharp variations of short exons.The core idea behind the multiscale products method is based on our research (see Section 3.4): namely, for DNA sequences represented by MSBF, the multiscale transform coefficients related to intron noise are less correlated across scales than the coefficients associated with exon signals.
Let U(b,a j ) be the MSBF of a signal u at the scale a j (j = 1,2,. ..,J) and the position b.The multiscale products (or inter-scale correlation) MP j (b) of the MSBF contents at two adjacent scales is defined as MP j ðbÞ ¼ jUðb; a j Þj � jUðb; a jþ1 Þj; j ¼ 1; 2; . . .
With the observation of experimental results, we can imagine that multiplying the MSBF at adjacent scales would amplify exon structures and dilute noise (see Section 3).

Multiscale products of multiscale bilateral filtering
Our multiscale products of multiscale bilateral filtering (MP-MSBF) for exon prediction is described briefly in Table 1.The input DNA sequence of length N is referred to as u.

Data resources
To evaluate and compare the performance of the proposed MP-MSBF with that of other methods, the two benchmark data sets BG570 [33] and HMR195 [34] have been considered.Furthermore, we conduct an additional classification experiment using 29 genes of the ENm001-004 data set (part of EGASP) [35] (see S1 File for detailed information on these sequences).Table 2 summarizes the features of the considered data sets.

General setting
In this section, we first conduct an experiment to establish a comprehensive analysis of the inter-scale correlation of the differences between exon and intron coefficients.Next, we present experiments in exon prediction using the proposed method.For comparison, MP-MSBF presents comparable performance to that of four popular existing methods: the paired and weighted spectral rotation measure (PWSR) [8], the MGWT [14], the fast Fourier transform plus empirical mode decomposition (FFTEMD) [11] and the WRWW [18].To evaluate the general performances of these measures, the TBP data for each DNA sequence considered have been normalized with values between 0 and 1.

Evaluation metrics
To investigate the inter-scale correlation of the differences between exon and intron sequences, the distance criterion of Jensen-Shannon (JS) divergence [36] is adopted.In probability theory and statistics, the JS divergence is a method of measuring the similarity between two probability distributions.The JS divergence is a convenient divergence measure for our purpose because it is symmetric and bounded between 0 and 1.The distance between two probability Table 1.Exon predictor algorithm using the MP-MSBF technique.
1. Convert an input DNA sequence into the numerical sequence u using the paired-numerical representation.
2. Apply the MSBF to the whole sequence.The transform of the numerical sequence is given by where a j (j = 1,2,. ..,J) is the scale parameter and b denotes the nucleotide position along the DNA sequence.
With a set of results obtained by running a predictor on a test data set, the true positive (TP), true negative (TN), false negative (FN) and false positive (FP) counts can be determined.Using these counts, the performances of various methods in handling exons of different lengths are measured in terms of the approximate correlation (AC) [33] AC ¼ 1 4 To evaluate the general performance of the method under consideration, the receiver operating characteristic (ROC) curve [37] is used to explore the effects on sensitivity and specificity.The sensitivity and specificity are given by The area under an ROC curve (AUC) can be used as an indicator of prediction performance.

Inter-scale correlation analysis
The coefficients of the input DNA sequences obtained from the multiresolution decomposition include exon-structure information together with intron noise.The general purpose of inter-scale correlation analysis is to investigate the dependency information on the differences between exon and intron coefficients.We apply the schemes proposed in this paper to analyse the correlation for a large number of exon and intron regions.Herein, the JS divergences are employed to investigate whether the coefficients related to introns are less correlated across scales than the coefficients associated with exons.This distance criterion has been applied in genome comparison [38], bioinformatics [39] and protein surface comparison [40].Table 3 summarizes the JS divergences of the MSBF coefficients between two adjacent scales, a j,j+1 (j = 1,2,. ..,9), for the exon and intron nucleotides of the HMR195 data set.The results of Table 3 reveal that the JS divergences of exons are smaller than those of introns at consecutive scales, while there is a difference of one order of magnitude between the exon and intron regions at the last eight consecutive scales.This property can assist in discriminating exon features from introns in the multiscale transform domain.
To further justify our assumption, histograms with fitted distributions are calculated for the exon and intron nucleotides of HMR195 at different scales.give the distributions of exon and intron nucleotides using the MSBF and inter-scale correlation, respectively.This result indicates that the most relevant exon information represented by the correlation at each scale is captured by large-valued coefficients, whereas the intron information is captured by a large number of small-valued coefficients.Fig 5 clearly illustrates that the distance between the exon and intron curves obtained from inter-scale correlation is greater than that obtained from MSBF.In other words, the MSBF coefficients of the exon sequences have a strong correlation on various decomposition scales, whereas the MSBF coefficients of noise are weakly correlated.These plots justify our assumption.

Performance evaluation on benchmark data sets
Exons have significant functional constraints, and their length plays an important role in splice site selection.Rogic's evaluation work [34] stated, "These constraints have shaped the exon length distribution quite differently from geometric distribution.The length distribution depends on the exon type."In our analysis, we grouped exons into ten ranges of exon lengths, namely, (0,25), [25,50) Table 4 summaries the performances of various methods for exons using the BG570, HMR195 and ENm001-004 data sets.By comparison with the PWSR, MGWT, FFTEMD and WRWW methods, the prediction results show that: our MP-MSBF exhibits at least improvement of 4.1%, 50.5%, 25.6%, 2.5%, 10.8%, 15.5%, 11.1%, 12.3%, 9.2% and 2.4% on the exons of the ranges (0,25), [25,50)  An additional classification experiment on all sequences of considered data sets is designed to assess the general performance of our proposed technique and other methods.Fig 7 presents the ROC curves obtained from the different methods tested in this experiment.The MP-MSBF method has higher prediction accuracy than its counterparts.Our MP-MSBF method consistently exhibits higher prediction accuracy than its counterparts for exons that are either relatively short or long in length.

Summary
In this work, we have introduced a new, robust and efficient method to predict exons in eukaryotes.Unlike some prediction techniques that detect exons directly by linear filtering, the proposed scheme incorporates a genomic-inspired multiscale bilateral filtering and its inter-scale dependencies and then applies these features to better differentiate exon structures from background noise.The first key concept of our method is its nonlinear nature, which exploits geometric distance in the spatial domain and nucleotide similarity in the range.The second is that this technique compacts the energy of the exons into coefficients with large amplitudes and spreads the energy of the introns over a large number of coefficients with small amplitudes.This phenomenon has led to improved results with respect to exon preservation and noise suppression.The proposed MP-MSBF method requires neither prior information nor training models for exon prediction, and so it can be applied to analyse unknown and novel genomes.It should be noted that all five methods considered here tend to have low accuracy in predicting microexons shorter than 25 bp (there were only 121) and microexons with lengths of 25-49 bp (there were only 227) as shown in Table 4.A possible explanation for this phenomenon is that these microexons are too short to be efficiently spliced in vivo without special splicing activation sequences [41].In other words, the length of these microexons is too short to be clearly distinguished from surrounding noncoding regions.For almost all the methods, the accuracies slowly rise with the length of annotated exons between 75 and 300+ nucleotides.Although not good for exons shorter than 50 bp, the results obtained from our method are acceptable.Our MP-BSBF should encourage further development of existing methods in prediction of microexons.

Conclusion
Exons encode the biochemical processes and information involved in the pathway from DNA to proteins.In genomic sequence analysis, exon prediction based on the annotated sequences  in the online databases is an important problem.For exon prediction, extracting the relevant features of short coding sequences is a major task because the subtle features of short exons are obscured by the strong presence of background noise.In practice, spectral analysis is an important tool for the discovery of interesting patterns and structures in exon data.In this paper, we present a new exon-finding spectral analysis method that overcomes some of the shortcomings of current predicting techniques.The MP-MSBF predictor takes advantage of the nonlinear filtering and the dependency information between scales, which makes it capable of short exon prediction.We see some possible applications of this predictor.The correlationbased property and nonlinear nature of this technique allow the selection of a characteristic frequency from surrounding noise and thereby makes it possible to offer good localization and protection of sharp variations for locating hot spots in proteins and performing fault diagnosis of aero-engine.

( 2 )Fig 1 (
Fig 1(B) and Fig 1(C) illustrate our domain filter with two different scales in the time and frequency domains.The proposed domain filtering of a signal u is given by

Fig 1 .
Fig 1. Examples of B-spline windows, and domain filter (order 6) with two different scales in the time and frequency domains.(A) Examples of B-spline windows; (B) Magnitude response of domain filter in the time domain; (C) Frequency response of domain filter.https://doi.org/10.1371/journal.pone.0205050.g001 wðt; b; aÞ ¼ jφ d ðt; b; aÞj: ð8Þ In other words, the window w(t,b,a) is the magnitude response of φ d (t,b,a) in time domain.

Fig 2
shows the prediction plots of the sequence HUMDZA2G locus (AZGP1 gene) of Homo sapiens (GenBank accession number D14034) using MSBF and its inter-scale correlation (or multiscale products) at different scales.The sequence HUMDZA2G locus (AZGP1 gene) contains four exons at positions 5322-5388, 9329-9589, 12907-13182 and 14052-14335.The peaks corresponding to the exon regions of the original data appear much stronger in Fig 2(B) than those in Fig 2(A).The results demonstrate that inter-scale correlation can suppress intron noise while retaining more exon details.Fig 3 compares the prediction results of the sequence HUMDZA2G locus (AZGP1 gene) using the tested methods.Our MP-MSBF algorithm identified the localized peaks better and located the short coding sequence (exon 1) more accurately.

Fig 2 .Fig 3 .
Fig 2. Prediction plots for sequence HUMDZA2G locus (AZGP1 gene) at different scales.The abscissa axes of all the plots represent the relative base positions, the actual locations of the exons are marked with rectangles in red dashed lines.Part (A) shows the MSBF result; and (B) shows the result of inter-scale correlation.https://doi.org/10.1371/journal.pone.0205050.g002 Fig 4(A) and Fig 4(B) , [50,75), [75,100),[100,125),[125,150),[150,175),[175,200),[200,300)  and [300,300+).We thought the exons of these ranges are relatively short and long in length.The best accuracies achieved by the tested methods are calculated in terms of the AC values for each group of exons.Fig 6(A) and Fig 6(B) depict the experimental results obtained from various methods using the HMR195 and BG570 data sets, respectively.The MP-MSBF exhibits good accuracies in these ten ranges.In Fig 6(A), MP-MSBF presents results close to those of MGWT in the ranges (0,25) and WRWW in the ranges [300,300+), while it exceeds the performance of other methods in the other ranges.The results of Fig 6(B) show that the performance of the MP-MSBF method is close to those of FFTEMD at the range (0,25) and MGWT at the range [75,100), while it outperforms the performance of the other methods at the other ranges of exon lengths.Similar results obtained with the sequences in the ENm00-004 data set are shown in Fig 6(C); however, no exons of length <25 occur in these sequences.The MP-MSBF exhibits good accuracies in these nine ranges and presents results close to those of FFTEMD at the ranges [25,75) and WRWW in the ranges [100,125), [200,300) and [300,300+), while it slightly exceeds the performance of other methods in the other ranges.

Fig 4 .
Fig 4. Histogram distributions at different scales for MSBF and inter-scale correlation applied to HMR195.For all the plots, blue lines represent exons, red lines indicate introns, the abscissa axes represent the magnitude values, and the ordinate axes represent the number of coefficients.Part (A) shows the MSBF result; and (B) shows the result of inter-scale correlation.https://doi.org/10.1371/journal.pone.0205050.g004