On DNA numerical representations for genomic similarity computation

  • Gerardo Mendizabal-Ruiz,

    Affiliation Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México

  • Israel Román-Godínez,

    Affiliation Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México

  • Sulema Torres-Ramos,

    Affiliation Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México

  • Ricardo A. Salido-Ruiz,

    Affiliation Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México

  • J. Alejandro Morales

    alejandro.morales@cucei.udg.mx

    Affiliation Departamento de Ciencias Computacionales, División de Electrónica y Computación, Universidad de Guadalajara, Guadalajara, Jalisco, México


Abstract

Genomic signal processing (GSP) refers to the use of signal processing for the analysis of genomic data. GSP methods require the transformation or mapping of the genomic data to a numeric representation. To date, several DNA numeric representations (DNR) have been proposed; however, it is not clear what the properties of each DNR are and how the selection of one will affect the results when using a signal processing technique to analyze them. In this paper, we present an experimental study of the characteristics of nine of the most frequently-used DNR. The objective of this paper is to evaluate the behavior of each representation when used to measure the similarity of a given pair of DNA sequences.

Introduction

Genomic signal processing (GSP) refers to the use of signal processing theory, algorithms, and mathematical methods for the analysis, transformation, and interpretation of the information contained in genomic data. It has been an active field of research for the past 25 years. While most current GSP methods focus on identifying protein-coding regions in DNA sequences (e.g., [1–10]), other applications include searching for genomic repeats [11], determining the structural, thermodynamic, and bending properties of DNA [12], biological sequence querying [13], estimation of DNA sequence similarity [14–16], and sequence alignment [17].

GSP methods require the transformation or mapping of the genomic information usually represented as a string of characters (i.e., A, T, G and C) to a numeric representation in the form of a single or multidimensional array of numeric values (i.e., a signal) [18]. Current DNA numerical representations (DNR) may be divided into three categories: single-value mapping, multidimensional sequence mapping, and cumulative sequence mapping.

Single-value representations are characterized by the use of a single one-dimensional numerical value for each nucleotide in the DNA sequence. In this category we find: (i) “integer representation”, where a numeric vector is generated by replacing each of the four possible nucleotide letters by a fixed integer value [19]; (ii) “real number representation”, which employs positive decimal values for the pyrimidines (i.e., C and T), and negative decimal values for the purines (A and G) [20, 21]; (iii) “paired numeric representation”, which incorporates the complementarity property of the nucleotides in the DNA strand [5]; (iv) “atomic number representation”, which assigns the atomic number of each nucleotide [22]; and, (v) “electron-ion interaction potential (EIIP) representation”, which employs numeric values that represent the distribution of the free electrons' energies along the DNA sequence [23].
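As a minimal sketch of single-value mapping, the snippet below replaces each nucleotide with one number. The numeric values are ones commonly cited for these representations and are assumptions here, since the mapping values of Table 1 are not reproduced in this text; the names `SINGLE_VALUE_MAPS` and `encode_single_value` are hypothetical.

```python
# Illustrative lookup tables; the values are assumed, not taken from Table 1.
SINGLE_VALUE_MAPS = {
    "integer": {"T": 0, "C": 1, "A": 2, "G": 3},
    "real": {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5},
    "atomic": {"A": 70, "C": 58, "G": 78, "T": 66},
    "eiip": {"A": 0.1260, "C": 0.1340, "G": 0.0806, "T": 0.1335},
}


def encode_single_value(sequence, scheme):
    """Map every nucleotide of `sequence` to one number using `scheme`."""
    table = SINGLE_VALUE_MAPS[scheme]
    return [table[base] for base in sequence.upper()]


signal = encode_single_value("ATTCGCAT", "real")
# signal == [-1.5, 1.5, 1.5, 0.5, -0.5, 0.5, -1.5, 1.5]
```

The resulting list has the same length as the sequence and can be treated directly as a one-dimensional signal.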

Multidimensional representations replace every nucleotide in the DNA sequence with a vector that represents a point in a space of two or more dimensions. In this category, we find: (i) “Voss representation”, which employs four binary indicator sequences to denote the presence of a nucleotide of each type [24]; and (ii) “Tetrahedron representation”, in which each nucleotide corresponds to a vertex of a three-dimensional structure that is characterized by having equal distances between every pair of vertices [25].
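The Voss representation can be sketched in a few lines: one binary indicator sequence per nucleotide type, each marking the positions where that nucleotide occurs. The function name `voss_encode` is hypothetical.

```python
def voss_encode(sequence):
    """Return four binary indicator sequences, one per nucleotide type."""
    seq = sequence.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in "ACGT"}


indicators = voss_encode("ATTCG")
# indicators["A"] == [1, 0, 0, 0, 0]
# indicators["T"] == [0, 1, 1, 0, 0]
```

At every position exactly one of the four indicator sequences holds a 1, so the four signals together preserve the full sequence.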

Cumulative representations can use single or multidimensional vectors, and are characterized by employing a random walk model in which a curve is constructed by the aggregate contribution of consecutive numeric values assigned to each nucleotide. In this category we find: (i) “DNA walk representation”, which consists of taking a step upwards if the nucleotide is a pyrimidine, and downwards if it is a purine [26]; and (ii) “Z-curve representation”, which constructs a three-dimensional curve in which the first dimension relates to the distributions of the types of nitrogenous base rings (purines vs. pyrimidines), while the second reflects the type of chemical functional groups (i.e., amino vs. keto), and the third represents the strength of the hydrogen bonds in the nucleotide molecules (i.e., strong H-bonds vs. weak H-bonds) [27].
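The two cumulative representations can be sketched as running sums. The unit step size of the DNA walk and the sign conventions of the Z-curve axes follow the usual formulation and should be read as assumptions; `dna_walk` and `z_curve` are hypothetical names.

```python
from itertools import accumulate

PURINES = {"A", "G"}


def dna_walk(sequence):
    """Cumulative walk: +1 for a pyrimidine (C, T), -1 for a purine (A, G)."""
    steps = [-1 if base in PURINES else 1 for base in sequence.upper()]
    return list(accumulate(steps))


def z_curve(sequence):
    """Cumulative 3-D curve built from running nucleotide counts."""
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    points = []
    for base in sequence.upper():
        counts[base] += 1
        a, c, g, t = (counts[b] for b in "ACGT")
        points.append((
            (a + g) - (c + t),  # purines vs. pyrimidines
            (a + c) - (g + t),  # amino vs. keto
            (a + t) - (c + g),  # weak vs. strong hydrogen bonds
        ))
    return points


walk = dna_walk("ATCG")   # [-1, 0, 1, 0]
curve = z_curve("AC")     # [(1, 1, 1), (0, 2, 0)]
```

Because every value depends on all preceding nucleotides, a single insertion or deletion shifts the remainder of the curve.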

To date, no DNR can be considered the “gold standard”, nor is there any study or comparison of the properties of the different DNRs in a common task. In this paper, we present an experimental study and comparison of the characteristics of nine common DNRs when used to estimate the similarity between DNA sequences, employing the frequency power spectrum obtained by the fast Fourier transform (FFT). The principal contribution of this paper is the exploration of the characteristics of the existing DNRs, which helps to provide insight into the features that may be desirable for proposing new DNRs and GSP methods.

Materials and methods

Nine of the DNRs in the literature were selected for analysis and comparison (Table 1). For each DNR, we performed synthetic and biological data experiments consisting of the computation of pairwise DNA sequence similarity. The details of the proposed experimental methodology are described in the following section.

Sequence similarity computation

Consider a DNA sequence α (e.g., α = ATTCGCAT…) and let $s_\alpha$ denote the digital signal version of that sequence that has been obtained using a DNR method. By applying the FFT to $s_\alpha$ it is possible to compute its power spectral density (PSD) $P_\alpha$, which describes how the power of the signal (energy per unit time) is distributed over the different frequencies [28].
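A minimal sketch of the spectrum computation follows. It assumes the PSD is taken as the squared magnitude of the discrete Fourier transform without further scaling, and it uses a naive O(n²) transform from the standard library so that the example has no dependencies; an FFT routine such as `numpy.fft.fft` yields the same coefficients faster. The name `power_spectrum` is hypothetical.

```python
import cmath


def power_spectrum(signal):
    """Squared magnitude of each DFT coefficient of a numeric signal."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


# A signal alternating every sample concentrates its power at index n/2.
spectrum = power_spectrum([1, -1, 1, -1])
```

For this alternating input, all the power falls in the component at index 2, the highest frequency representable with four samples.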

Consider two DNA signals $s_\alpha$ and $s_\beta$ corresponding to two DNA sequences α and β, respectively. The relatedness or similarity score of these two sequences can be estimated by comparing their frequency power spectra using a similarity metric.

In this work, we explore four widely-used metrics: Euclidean distance [29], Normalized Squared Euclidean distance [30], Correlation coefficient [29], and Manhattan distance [29]. To compare the PSD of two signals, both spectra must have the same number of elements k, and every element in both vectors must correspond to the same frequency component. However, since the length of the signal representation of two different DNA sequences can differ, this condition may not be satisfied. To overcome this challenge, we apply zero padding to the DNA signal with the smaller length before computing the FFT [31]. Also, the first entry of a power spectrum (e.g., $P_\alpha(0)$ for the spectrum of sequence α) is known as the zero-frequency (DC) component and represents the average intensity of the DNA signal. In this work, we chose not to consider the DC component in the spectrum comparisons for two reasons: (i) this value does not provide information about the possible patterns present in the DNA sequences; and, (ii) this value is affected by the zero padding, which will have an impact on the computed similarity score.
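The two preprocessing steps described above, zero padding to a common length and discarding the DC component, can be sketched as follows. Padding at the end of the shorter signal is an assumption, as is the unscaled PSD; `power_spectrum` and `comparable_spectra` are hypothetical names, and the naive transform stands in for an FFT.

```python
import cmath


def power_spectrum(signal):
    """Squared DFT magnitudes (naive transform, same values as an FFT)."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))) ** 2
            for k in range(n)]


def comparable_spectra(signal_a, signal_b):
    """Zero-pad to a common length, then drop the DC term of each spectrum."""
    k = max(len(signal_a), len(signal_b))
    padded_a = list(signal_a) + [0.0] * (k - len(signal_a))
    padded_b = list(signal_b) + [0.0] * (k - len(signal_b))
    # Index 0 is the zero-frequency (DC) component, which is excluded.
    return power_spectrum(padded_a)[1:], power_spectrum(padded_b)[1:]


spec_a, spec_b = comparable_spectra([1.0, 2.0, 3.0], [1.0, 0.0, 1.0, 0.0, 1.0])
```

After this step both spectra have the same number of entries, and entry i of one refers to the same frequency as entry i of the other.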

Euclidean and normalized squared Euclidean distances.

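The Euclidean distance (Eq (1)) is a metric used to define the distance between two points in an N-dimensional space. By considering each of the k frequency components of a DNA signal spectrum as a dimension, a DNA sequence may be represented as a point in a k-dimensional space. Therefore, the Euclidean distance can be employed to determine the relatedness or similarity between sequences. A Euclidean distance of zero can be interpreted as meaning that the two DNA sequences are identical or closely related, while a larger value means that the sequences are different. Additionally, since the Euclidean distance is unbounded (i.e., there is no limit for the largest value), we compute the normalized squared Euclidean distance (Eq (2)), which provides similarity values in the interval [0, 1].

$$d_{E}(P_\alpha, P_\beta) = \sqrt{\sum_{i=1}^{k} \left(P_\alpha(i) - P_\beta(i)\right)^2} \tag{1}$$

$$d_{NE}(P_\alpha, P_\beta) = \frac{1}{2}\,\frac{\sum_{i=1}^{k}\left[\left(P_\alpha(i) - \bar{P}_\alpha\right) - \left(P_\beta(i) - \bar{P}_\beta\right)\right]^2}{\sum_{i=1}^{k}\left(P_\alpha(i) - \bar{P}_\alpha\right)^2 + \sum_{i=1}^{k}\left(P_\beta(i) - \bar{P}_\beta\right)^2} \tag{2}$$

where $P_\alpha(i)$ and $P_\beta(i)$ are the i-th frequency components of the power spectra of sequences α and β, and $\bar{P}_\alpha$ and $\bar{P}_\beta$ are the mean values of those spectra.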
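Both distances can be sketched for two spectra of equal length. The normalized form below is the commonly used mean-centered definition, which is bounded in [0, 1]; treating it as the form used in this work is an assumption. The function names are hypothetical.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def normalized_squared_euclidean(p, q):
    """Half the squared distance of the mean-centered spectra, normalized
    by the sum of their squared norms (assumed definition, range [0, 1])."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    cp = [a - mean_p for a in p]
    cq = [b - mean_q for b in q]
    numerator = sum((a - b) ** 2 for a, b in zip(cp, cq))
    denominator = sum(a * a for a in cp) + sum(b * b for b in cq)
    return 0.5 * numerator / denominator  # undefined if both are constant


d_e = euclidean([0.0, 0.0], [3.0, 4.0])                                # 5.0
d_same = normalized_squared_euclidean([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])  # 0.0
d_opp = normalized_squared_euclidean([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])   # 1.0
```

Identical spectra give 0 and perfectly opposed (mean-centered) spectra give 1, which illustrates the bounded range.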

Manhattan distance.

The Manhattan distance described in Eq (3) (also known as Taxicab geometry or L1-Norm), is also used to determine the distance between two points in an N-dimensional space; however, it considers distance only in orthogonal directions. This metric is usually used to assess the differences in discrete space distributions, in contrast to the Euclidean metric. Thus, this property makes it suitable for use as a measure of similarity between the PSD of DNA signals.

$$d_{M}(P_\alpha, P_\beta) = \sum_{i=1}^{k} \left|P_\alpha(i) - P_\beta(i)\right| \tag{3}$$

where $P_\alpha(i)$ and $P_\beta(i)$ are the i-th of the k frequency components of the power spectra of sequences α and β.
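For two spectra of equal length, the Manhattan distance reduces to a sum of absolute differences, as in this short sketch (the function name is hypothetical):

```python
def manhattan(p, q):
    """Sum of absolute component-wise differences between two spectra."""
    return sum(abs(a - b) for a, b in zip(p, q))


d_m = manhattan([1.0, 2.0, 3.0], [2.0, 0.0, 3.0])  # 1 + 2 + 0 = 3.0
```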

Correlation coefficient.

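The correlation coefficient (Eq (4)) measures the strength and direction of a linear relationship between two variables and so can be used to measure the degree of similarity between the PSD of two DNA signals. The correlation coefficient is bounded in the interval [−1, 1]. In general, a correlation value greater than 0.8 is assumed to be strong, whereas a correlation smaller than 0.5 is assumed to be weak.

$$r(P_\alpha, P_\beta) = \frac{\sum_{i=1}^{k}\left(P_\alpha(i) - \bar{P}_\alpha\right)\left(P_\beta(i) - \bar{P}_\beta\right)}{\sqrt{\sum_{i=1}^{k}\left(P_\alpha(i) - \bar{P}_\alpha\right)^2}\,\sqrt{\sum_{i=1}^{k}\left(P_\beta(i) - \bar{P}_\beta\right)^2}} \tag{4}$$

where

$$\bar{P}_\alpha = \frac{1}{k}\sum_{i=1}^{k} P_\alpha(i), \qquad \bar{P}_\beta = \frac{1}{k}\sum_{i=1}^{k} P_\beta(i) \tag{5}$$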
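A minimal sketch of the correlation coefficient between two spectra of equal length, using the standard Pearson form (the function name is hypothetical):

```python
import math


def correlation(p, q):
    """Pearson correlation coefficient between two equal-length spectra."""
    mean_p = sum(p) / len(p)
    mean_q = sum(q) / len(q)
    cov = sum((a - mean_p) * (b - mean_q) for a, b in zip(p, q))
    norm_p = math.sqrt(sum((a - mean_p) ** 2 for a in p))
    norm_q = math.sqrt(sum((b - mean_q) ** 2 for b in q))
    return cov / (norm_p * norm_q)  # undefined if either spectrum is constant


r_pos = correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # perfectly correlated
r_neg = correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # perfectly anti-correlated
```

Note that the coefficient is insensitive to scaling and offset: the first pair differs by a factor of two yet is perfectly correlated.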

Synthetic data experiments

Data generation.

To evaluate how different changes in a DNA sequence affect the similarity score when using a DNR, we generated a baseline DNA sequence of length 1,000 in which each element was selected randomly with an equal probability of 0.25 for each type of nucleotide (i.e., A, C, G, T). A total of 42 datasets were generated, corresponding to the combinations of seven types of modification and six percentages of change (i.e., 1%, 2%, 4%, 8%, 16% and 32%) with respect to the baseline sequence. The seven types of modification comprise the three basic types of change, namely insertion (i), deletion (d), and substitution (s), and their combinations: insertion and deletion (i-d), insertion and substitution (i-s), deletion and substitution (d-s), and insertion, deletion, and substitution (i-d-s). For each type of change, the position of the nucleotide to be inserted, removed, or replaced was selected randomly using a uniform distribution. The kind of nucleotide to be inserted or used as a replacement was also selected with an equal probability of 0.25 for each type of nucleotide. For each of the 42 datasets, we generated a sample of 400 sequences out of the total number of possible variations of the baseline sequence. The sample size of 400 was determined by computing the minimal number of modified sequences needed for statistically significant experiments with a confidence level of 95% (Z = 1.96), an expected true proportion of p = 0.5, and a confidence interval (margin of error) of c = 0.05 [32].
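The data generation can be sketched as below. The sample-size function uses the standard formula n = Z²·p·(1 − p)/c², which gives about 385 for the stated parameters and is consistent with the rounded figure of 400. How changes are split among types in the combined datasets is not stated, so choosing the type of each change uniformly is an assumption; all names are hypothetical.

```python
import math
import random


def sample_size(z=1.96, p=0.5, c=0.05):
    """Minimal sample size for a proportion: Z^2 * p * (1 - p) / c^2."""
    return math.ceil(z ** 2 * p * (1 - p) / c ** 2)


def random_sequence(length, rng=None):
    """Baseline sequence with probability 0.25 for each nucleotide."""
    rng = rng or random
    return "".join(rng.choice("ACGT") for _ in range(length))


def mutate(sequence, percent, kinds=("i", "d", "s"), rng=None):
    """Apply `percent`% changes; each change type is drawn from `kinds`."""
    rng = rng or random
    seq = list(sequence)
    n_changes = round(len(sequence) * percent / 100)
    for _ in range(n_changes):
        kind = rng.choice(kinds)  # assumed: uniform choice among the types
        if kind == "i":
            seq.insert(rng.randrange(len(seq) + 1), rng.choice("ACGT"))
        elif kind == "d":
            del seq[rng.randrange(len(seq))]
        else:
            # The replacement may equal the original nucleotide (p = 0.25).
            seq[rng.randrange(len(seq))] = rng.choice("ACGT")
    return "".join(seq)


rng = random.Random(0)
baseline = random_sequence(1000, rng)
dataset = [mutate(baseline, 8, ("i", "d", "s"), rng) for _ in range(400)]
```

Passing a seeded `random.Random` instance keeps a generated dataset reproducible between runs.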

We performed three experiments using the synthetic data:

  1. The first experiment was designed to evaluate how a DNR is affected by the different types of change. We computed the mean similarity score of all the modified sequences within every data set compared to the baseline sequence using the Euclidean distance, normalized squared Euclidean distance, Manhattan distance, and correlation coefficient.
  2. The second experiment was designed to evaluate how the different percentages of change affect the frequency components of the power spectrum generated with a DNR. To this end, we computed the variance of the similarity score of all the modified sequences within the i-d-s data set, then we divided the power spectrum frequency axis into ten frequency ranges. For each range of frequencies, we computed the average variance and mapped it to a color value to generate an image that depicts the changes in the variance of the frequency components with respect to the percentage of change.
  3. The third experiment consisted of evaluating the genetic similarity score obtained with each selected DNR when comparing a DNA sequence with its corresponding complementary sequence (e.g., the complementary sequence of ATCG is TAGC), and with its reverse complementary sequence (e.g., the reverse complementary sequence of ATCG is CGAT). To achieve this, we generated the complementary and reverse complementary sequences of the baseline sequence, and computed the similarity by comparing the power spectra with the Euclidean distance, normalized squared Euclidean distance, Manhattan distance, and correlation coefficient.
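The complementary and reverse complementary sequences used in the third experiment can be obtained as in this short sketch (function names are hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(sequence):
    """Replace every nucleotide with its Watson-Crick partner."""
    return sequence.upper().translate(COMPLEMENT)


def reverse_complement(sequence):
    """Complement the sequence and read it in the opposite direction."""
    return complement(sequence)[::-1]


comp = complement("ATCG")              # "TAGC"
rev_comp = reverse_complement("ATCG")  # "CGAT"
```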

Biological data experiments

To evaluate the characteristics of the selected DNRs for estimating the similarity between real biological sequences, we generated a database consisting of the DNA sequences that correspond to the ribosomal protein encoding gene RP-S18 [33], downloaded from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [34, 35]. The main reason for employing the RP-S18 is that this gene can be found in all eukaryotes. Thus, each sequence represents one species in the eukaryote tree, allowing us to evaluate the performance of each DNR in computing the similarity between highly-related species (e.g., H. sapiens vs. P. troglodytes), as well as distantly-related species (e.g., H. sapiens vs. S. cerevisiae). Twenty-six sequences were selected in order to generate various clusters that were highly distinct from each other, i.e., eutherians, insects, and plants. Furthermore, at least one sequence was located outside every group (e.g., M. domestica is external to eutherians and together they constitute mammals, for which there are two sequences external to them, and so on), where S. cerevisiae is the furthest external sequence. Fig 1 depicts the species selected for the RP-S18 gene, organized according to the taxonomy tree.

Fig 1. Biological species selected for gene RP-S18 similarity comparison.

https://doi.org/10.1371/journal.pone.0173288.g001

We performed experiments that consisted of computing the pairwise similarity score of every DNA sequence compared to (i) H. sapiens (representing the mammals group), and (ii) S. cerevisiae (representing the species external to all others), employing the Euclidean distance, normalized squared Euclidean distance, Manhattan distance, and correlation coefficient.

In the first experiment, the expectation was that the species belonging to the mammals group (red) would be clustered together with a high similarity score compared to H. sapiens, and that the insect (orange) and plant (green) species would be grouped with their corresponding groups with a lower similarity score with respect to all eutherians. On the other hand, the most external species, S. cerevisiae, was expected to obtain the lowest similarity score with respect to H. sapiens. In the second experiment, the expectation was that every species would obtain low similarity scores with respect to S. cerevisiae, with no particular grouping order.

Additionally, we performed a second set of experiments using the Cytochrome c oxidase subunit 1 (COX1), a widely-known gene that has been branded as a general molecular marker [36]. A total of 41 sequences were obtained from the KEGG database (Orthology: K02256), corresponding to 17 mammals, 6 insects, 7 plants, 9 other vertebrates that can be located between the mammals and the insects, 1 organism located between the insects and plants, and the yeast S. cerevisiae as the external Eukaryota group. We performed comparisons of each group with respect to one organism: H. sapiens for mammals, D. melanogaster for insects, O. sativa for plants, and S. cerevisiae as the external group.

We selected the COX1 gene because of its ability to allow the differentiation from Phyla to Order with a mean pairwise divergence value of 11.3% among animals [37]. While it can dissect the insect order appropriately [37] and perform reasonably well for all vertebrates [38, 39], its value has been questioned for plants, where its mutation rate is even slower and chloroplast genes are preferred [40].

Based on this rationale, the expectation was that in the experiments comparing sequences with respect to mammals or insects, the animals would be distributed adequately. Also, when comparing with respect to plants, two groups were expected to appear clearly: the plants, and the remaining species in seemingly undifferentiated clumps.

Fig 2 depicts the selected species for the COX1 genes organized according to the taxonomy tree.

Fig 2. Biological species selected for gene COX1 similarity comparison.

https://doi.org/10.1371/journal.pone.0173288.g002

Results

Synthetic data results

Fig 3 depicts plots of the mean Euclidean distance scores for 400 synthetic sequences in each one of the 42 datasets when using the selected DNRs.

Fig 3. Mean Euclidean distance scores for 400 synthetic sequences in each one of the 42 datasets when using each of the selected DNR (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).

https://doi.org/10.1371/journal.pone.0173288.g003

Note that the increases in the mean Euclidean distance score for the real, paired numeric, Voss, and tetrahedron representations present a similar behavior for all types of changes despite the fact that they correspond to different DNR types (see Introduction). For these DNRs, the curve with the shortest distance scores corresponds to substitutions. The curves corresponding to deletions-substitutions, insertions-deletions, and insertions-deletions-substitutions present notable differences when a small percentage of changes is present, and an almost identical rate of increase for changes above 16%.

The curve corresponding to deletions exhibits a decrease in the Euclidean distance score for changes above 4% for the real, paired numeric, and tetrahedron representations. Finally, the curves corresponding to insertion-substitutions and insertions are the most distant from the baseline sequence in these DNRs. For the integer representation, the curve with the smallest distance with respect to the baseline sequence is also the one corresponding to substitutions.

Note that there is an important difference between the distance scores for the remaining types of changes, with the exception of the curves corresponding to insertions-deletions and insertion-deletion-substitutions, for which the rate of change is very similar. The plots corresponding to the EIIP and atomic number representations are also almost identical, with slight differences in the rate of increase of distance for the curve corresponding to insertions.

Note as well that there is a brief decrease in the distance score at 4% for insertions, and at 8% for the insertion-substitution curves. It is notable that for the DNRs of cumulative type (i.e., DNA walk and Z-curve), the curve corresponding to substitutions is not the one with the lowest Euclidean distance score, as is the case in the other DNRs. For these DNRs, the lowest distance scores correspond to the insertion-deletions, followed by the insertion-deletions-substitutions. Finally, note that the deletion, substitution, and insertion-deletion curves present notable differences in their order in both plots after changes by 16%.

Table 2 lists the angle (in degrees) of the rate of change in the mean Euclidean distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes.

Table 2. Angle (in degrees) of the rate of change in the mean Euclidean distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).

https://doi.org/10.1371/journal.pone.0173288.t002

Note that for almost all DNRs the angles are close to 90° in the range of 1%–2%, which implies that small differences between two DNA sequences will produce high Euclidean distance scores. The exception to this is the EIIP representation, for which the angle is close to 45°. This DNR presents smaller angles than the others, which means that the Euclidean distance score will not be dramatically affected even when there are great differences between a pair of DNA sequences. The integer and real representations behave similarly in terms of angles, as do the Voss and tetrahedron representations. The angles corresponding to the atomic number, Z-curve, and DNA walk representations are large for all the ranges, which indicates that any magnitude of difference between two DNA sequences will produce large Euclidean distance scores.
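One plausible way to obtain such angles is to take the arctangent of the slope between consecutive points of a mean-score curve. The exact procedure is not described in the text, so the sketch below is an assumption, including the use of percentage points as the horizontal unit; `slope_angles` is a hypothetical name and the input values are illustrative only.

```python
import math


def slope_angles(percentages, mean_scores):
    """Angle (degrees) of the segment joining each pair of consecutive points."""
    points = list(zip(percentages, mean_scores))
    return [math.degrees(math.atan2(s1 - s0, p1 - p0))
            for (p0, s0), (p1, s1) in zip(points, points[1:])]


# Illustrative values only: a rise of 1 over 1 percentage point, then a plateau.
angles = slope_angles([1, 2, 4], [0.0, 1.0, 1.0])
```

Under this reading, an angle near 90° means the score grows much faster than the percentage of change, while an angle near 0° means the score barely responds. The angle depends on the units of both axes, so values are only comparable within one metric.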

Fig 4 depicts plots of the mean normalized squared Euclidean distance in the same synthetic data set. Note that the integer, EIIP and atomic number representations present a similar behavior to that observed when using the Euclidean distance. The real, paired numeric and tetrahedron representations present very small differences for deletions, deletions-substitutions, and insertion-substitutions after approximately 15% of changes, which may make them unreliable for differentiating among such types of changes. The Voss representation appears to preserve the same structure as with the Euclidean distance, but with more noticeable differences in the distances between the deletion-substitutions, insertion-deletions and insertion-deletion-substitutions. Moreover, it is noteworthy that the cumulative DNRs are more sensitive to deletions, compared to the large sensitivity to insertions when using the Euclidean distance.

Fig 4. Mean normalized squared Euclidean distance scores for 400 synthetic sequences in each of the 42 datasets when using each of the selected DNR (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).

https://doi.org/10.1371/journal.pone.0173288.g004

Table 3 lists the angle (in degrees) of the rate of change in the mean normalized squared Euclidean distance scores for the type of change corresponding to insertions-deletions-substitutions for five ranges of the percentage of change. Note that the angle of the rate of change is relatively small for all DNRs compared to the angle of the rate of change observed when using the unbounded Euclidean distance (Table 2). This is explained by the normalization step, which bounds the maximum possible score to the value of one, and therefore, has the effect of “compressing” the relative difference scores.

Table 3. Angle (in degrees) of the rate of change in the mean normalized squared Euclidean distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).

A = ×10⁻², B = ×10⁻³.

https://doi.org/10.1371/journal.pone.0173288.t003

Fig 5 depicts the results corresponding to the use of the Manhattan distance in the synthetic DNA signal data set. Note that for all DNRs the substitutions present the highest similarity with respect to the original sequence, while the insertions represent the largest differences. Moreover, the order of the curves in the plots indicates that this distance may be more robust with respect to the DNRs employed.

Fig 5. Mean Manhattan distance scores for 400 synthetic sequences in each one of the 42 datasets when using each of the selected DNR (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).

https://doi.org/10.1371/journal.pone.0173288.g005

Table 4 lists the angle (in degrees) of the rate of change in the Manhattan distance scores for the type of change corresponding to insertions-deletions-substitutions for five ranges of the percentage of change. Note that the angles remain large for all the DNRs with the exception of the EIIP, compared to the Euclidean distance, which means that the difference score will continue to increase as the percentage of change increases.

Table 4. Angle (in degrees) of the rate of change in the mean Manhattan distance scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).

https://doi.org/10.1371/journal.pone.0173288.t004

Fig 6 depicts the mean complementary correlation coefficient scores (i.e., 1-Correlation) for the same data when using each one of the selected DNRs.

Fig 6. Mean 1-Correlation scores for 400 synthetic sequences in each one of the 42 datasets when using each of the selected DNRs (i: insertion, d: deletion, s: substitution, i-d-s: insertion-deletion-substitution, i-d: insertion-deletion, i-s: insertion-substitution, d-s: deletion-substitution).

Note that the range of each box is not fixed to [0, 1]; instead, the ranges vary in order to provide a better visualization.

https://doi.org/10.1371/journal.pone.0173288.g006

Note that several DNRs, in particular the EIIP, the atomic number, and the cumulative representations, present high correlation scores even when large changes occur (e.g., a correlation of approximately 0.94 for changes of 32% with the baseline sequence in the curve corresponding to deletions). Note also that for all the non-cumulative DNRs, the highest mean correlation score is obtained by the type of change corresponding to substitutions.

The mean correlation coefficient corresponding to the other types of changes behaves similarly for the real, paired Numeric, and Tetrahedron representations, with the curve corresponding to substitutions far above the other curves, and an apparent convergence of these curves as the percentage of changes increases. The Voss representation behaves similarly, with the difference that the mean correlation coefficient scores are higher for all the curves, and a better separation of the other curves as the percentage of changes increases, as well as its distinct behavior for insertions, which is most similar to the one for the integer representation.

The integer representation also behaves similarly, with the difference that the other curves score higher than the Voss representation curves. The EIIP and Atomic number curves behave like each other, with minor differences in the mean correlation coefficients. Finally, the DNA-walk and Z-curve representations present a quasi-linear decrease in their mean correlation scores with respect to increasing percentages of changes.

For these DNRs, the highest score is obtained by the insertion-deletion curve. Unlike the other DNRs, the i-d-s and i-d curves are better separated. Note how these results are consistent with those obtained with the other distances, with the main difference that the correlation coefficient is always in the range of [−1, 1], while the Euclidean and Manhattan distances range within [0, ∞) and the normalized Euclidean distance within [0, 1].

Table 5 lists the angle (in degrees) of the rate of change in the mean correlation coefficient scores for the type of change corresponding to insertions-deletions-substitutions for five ranges of the percentage of change.

Table 5. Angle (in degrees) of the rate of change in the mean correlation coefficient scores for the type of change corresponding to insertions-deletions-substitutions, for five ranges of the percentage of changes (2-1, 4-2, 8-4, 16-8, 32-16).

A = ×10⁻², B = ×10⁻³.

https://doi.org/10.1371/journal.pone.0173288.t005

Similar to the case of the normalized squared Euclidean distance, the angles are subtle for almost all DNRs. Note that for the atomic number, EIIP, and DNA walk representations, the angles are near zero for every range of percentage of change. Therefore, the similarity between two sequences may be impossible to estimate using these DNRs with this metric. The integer representation method may be more sensitive to differences between two signals, while the real, paired numeric, and cumulative representations may be a better option for estimating the correlation between two sequences.

Fig 7 depicts the mean variance of the frequency components according to the percentage of change for the selected DNRs using a color palette where red and blue represent high and low variances, respectively. Note that the tetrahedron representation concentrates the variability around the higher frequencies, as well as around the frequency corresponding to approximately 1/5 of the maximum frequency, for percentages of change around 8%, and shows a more homogeneous spread of variability for higher percentages of change. The integer, real, and Voss representations have a significant variability in the high-frequency components, and in some of the low- and mid-frequency components.

Fig 7. Mean variance of the frequency components according to the percentage of change for the selected DNRs using a color palette where red and blue represent high and low variances, respectively.

HF stands for high frequencies and LF for low frequencies.

https://doi.org/10.1371/journal.pone.0173288.g007

The paired numeric representation concentrates the variability in the mid-frequency components. The EIIP, atomic number, and cumulative representations concentrate an extremely high variability in the low frequencies for percentages of change larger than 8%, which depresses the variability in the other frequency components among the remaining percentages of change. This explains the high correlation scores for these DNRs, since almost the entire power spectrum may seem similar in comparison to the largest possible value differences for the low frequency components.

Table 6 lists the scores for the comparison of the synthetic baseline sequence with its corresponding complementary sequence, and the reverse complementary sequence. Note that the real and paired numeric representations obtain scores that indicate the identity of the power spectra of the complementary and reverse complementary sequences for all metrics. This can be explained because these DNRs consider the complementarity property of the DNA strands for the numeric mapping and, therefore, generate the same patterns in the signals.

Table 6. Complementary sequence scores for each DNR.

EC stands for Euclidean distance, CC for correlation coefficient, NE for normalized Euclidean distance, and MD for Manhattan distance.

https://doi.org/10.1371/journal.pone.0173288.t006

This behavior may be an advantage in some cases of analysis where it is desirable to account for the structural complementarity of the DNA (for example, for determining the similarity between two DNA sequences A and B without the need to determine which of the two strands of A or B needs to be employed). However, this may be a disadvantage in cases where a detailed analysis of the differences between two DNA sequences is required.

Note that the Z-curve and DNA Walk representations also provide scores that indicate identical power spectra compared to the complementary sequence. However, in the case of the reverse complementary sequences, these two scores indicate a large difference between their frequencies. This property can be explained by the cumulative characteristic of these DNRs, which generates different DNA signals when taking the reverse direction. This response could represent an important criticism of these DNRs, since in their formulation the authors justify the mapping values employed arguing that they consider the DNA complementarity property, while this does not apply for computing similarities. Integer, EIIP, Atomic Number, Voss and Tetrahedron present the same scores for the complementary sequence and the reverse complementary sequence, respectively. The latter is due to the symmetry property of the frequency spectrum (i.e., the frequency spectrum of a numeric sequence is the same even if this numeric sequence is sorted in reverse order).

Voss does not present such behavior because of the procedure used to transform a multidimensional signal to a single-dimensional signal in which, for each dimension, the power spectrum is computed and then concatenated one after the other.
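The symmetry arguments above can be checked on a toy sequence. The following is a minimal sketch, not the code used in the study: it assumes the real mapping values A = -1.5, T = 1.5, C = 0.5, G = -0.5, uses a plain periodogram in place of whichever spectral estimator the experiments employ, and uses a cumulative sum of the real mapping as a stand-in for a cumulative DNR. All function names are hypothetical.

```python
import cmath
import math

# Assumed mapping values for the real representation (not given in this section).
REAL = {"A": -1.5, "T": 1.5, "C": 0.5, "G": -0.5}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def power_spectrum(signal):
    """Squared magnitude of a naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def real_signal(seq):
    """Non-cumulative signal: one assumed real value per nucleotide."""
    return [REAL[base] for base in seq]


def cumulative_signal(seq):
    """Running sum of the real signal (stand-in for a cumulative DNR)."""
    total, walk = 0.0, []
    for value in real_signal(seq):
        total += value
        walk.append(total)
    return walk


def complement(seq):
    return "".join(COMPLEMENT[base] for base in seq)


def reverse_complement(seq):
    return complement(seq)[::-1]


def same_spectrum(a, b, tol=1e-6):
    """True when the two signals have bin-wise equal power spectra."""
    return all(math.isclose(x, y, abs_tol=tol)
               for x, y in zip(power_spectrum(a), power_spectrum(b)))


seq = "AAAACCGT"  # toy sequence whose real-mapped values do not sum to zero
comp, revcomp = complement(seq), reverse_complement(seq)

print(same_spectrum(real_signal(seq), real_signal(comp)))
print(same_spectrum(real_signal(seq), real_signal(revcomp)))
print(same_spectrum(cumulative_signal(seq), cumulative_signal(comp)))
print(same_spectrum(cumulative_signal(seq), cumulative_signal(revcomp)))
```

Under these assumptions the first three comparisons report identical spectra, because complementing only negates the signal and reversing a real-valued signal leaves the magnitude of its transform unchanged. The last comparison does not, because the running sum of the reversed signal is not a reversed copy of the original running sum.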

Biological data results

Fig 8 depicts the distribution of the similarity scores of all the selected species of the gene RP-S18 with respect to H. sapiens (left column) and S. cerevisiae (right column) when using the four selected similarity metrics. Note that all the non-cumulative DNRs were successful in clustering all mammals with a large similarity score when compared to H. sapiens. Also, note that Macaca mulatta and Pan troglodytes were the closest species to H. sapiens, as was to be expected.

Fig 8. Biological experiment results for the similarity computation of the selected gene RP-S18 sequences with respect to H. sapiens (left column) and S. cerevisiae (right column) when using the four selected similarity metrics.

https://doi.org/10.1371/journal.pone.0173288.g008

When using the Euclidean distance, only the real, Voss, and tetrahedron representations successfully assign a lower similarity score to S. cerevisiae than to all other species (i.e., the black cross is marked on top of all other markers). However, when using the normalized squared Euclidean distance and the correlation coefficient, the integer, real, paired numeric, Voss, and tetrahedron representations depict the black cross above every other marker. When using the Manhattan distance, the EIIP representation also depicts S. cerevisiae as the most unrelated specimen. Note that for the non-cumulative DNRs all species tend to cluster together with low similarity scores when compared to S. cerevisiae. The DNA walk and Z-curve representations do not show this clustering and present a more uniform distribution of the similarity scores for all metrics. When using the correlation coefficient, a similar behavior can be observed, with the main difference being in the atomic number and EIIP representations, where the species are grouped with a high similarity score when compared to S. cerevisiae (i.e., around 98% to 99% correlation). This implies that all species are very similar to S. cerevisiae, which is incorrect. Similarly, the cumulative representations yield high similarity scores with respect to this species.

Figs 9–12 depict the distribution of the similarity scores of all the selected species of the gene COX1 with respect to H. sapiens, Drosophila melanogaster, and Oryza sativa when using the Euclidean distance, normalized squared Euclidean distance, Manhattan distance, and correlation coefficient as the similarity metrics, respectively.

Fig 9. Biological experiment results for the similarity computation of the selected gene COX1 sequences with respect to H. sapiens (top), Drosophila melanogaster (middle), and Oryza sativa (bottom) when using the Euclidean distance as the similarity metric.

https://doi.org/10.1371/journal.pone.0173288.g009

Fig 10. Biological experiment results for the similarity computation of the selected gene COX1 sequences with respect to H. sapiens (top), Drosophila melanogaster (middle), and Oryza sativa (bottom) when using the normalized squared Euclidean distance as the similarity metric.

https://doi.org/10.1371/journal.pone.0173288.g010

Fig 11. Biological experiment results for the similarity computation of the selected gene COX1 sequences with respect to H. sapiens (top), Drosophila melanogaster (middle), and Oryza sativa (bottom) when using the Manhattan distance as the similarity metric.

https://doi.org/10.1371/journal.pone.0173288.g011

Fig 12. Biological experiment results for the similarity computation of the selected gene COX1 sequences with respect to H. sapiens (top), Drosophila melanogaster (middle), and Oryza sativa (bottom) when using the correlation coefficient as the similarity metric.

https://doi.org/10.1371/journal.pone.0173288.g012

Note that, overall, the distance measurements remain similar for the single and multidimensional representations. However, this is not the case for the cumulative DNRs, which smear all the species without any chance of resolving them even at the phylum level. In contrast, the atomic number and EIIP representations present an erratic clustering of the taxa.

Note that the main difference among the explored distance metrics is the scale at which they differentiate the organisms, ordered from lowest to highest: Euclidean < Correlation < Norm L2 < Manhattan. At first glance, the Manhattan distance may seem to disperse adequately through the relevant order layers, but when reviewed over all the comparisons it becomes clear that this measurement saturates quickly and assigns maximum distances to groups that the COX1 gene may still differentiate at the phylum level. Likewise, the Norm L2 distance can barely differentiate between the phyla before reaching saturation.
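For reference, the four measures compared above can be written compactly for two equal-length power spectra. This is a hedged sketch rather than the implementation in the supporting Matlab code: the function names are hypothetical, the correlation coefficient is assumed to be Pearson's, and the normalized squared Euclidean distance is assumed to follow the Wolfram Language definition cited as [30] (half the squared norm of the difference of the mean-centered vectors, divided by the sum of their squared norms).

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def _centered(u):
    mean = sum(u) / len(u)
    return [a - mean for a in u]


def normalized_squared_euclidean(u, v):
    """Assumed definition: 0.5 * |uc - vc|^2 / (|uc|^2 + |vc|^2), uc and vc mean-centered."""
    uc, vc = _centered(u), _centered(v)
    num = sum((a - b) ** 2 for a, b in zip(uc, vc))
    den = sum(a * a for a in uc) + sum(b * b for b in vc)
    return 0.5 * num / den


def correlation(u, v):
    """Pearson correlation coefficient of the two vectors."""
    uc, vc = _centered(u), _centered(v)
    num = sum(a * b for a, b in zip(uc, vc))
    den = math.sqrt(sum(a * a for a in uc) * sum(b * b for b in vc))
    return num / den


u = [1.0, 2.0, 3.0]
v = [1.0, 2.0, 4.0]
print(euclidean(u, v), manhattan(u, v),
      normalized_squared_euclidean(u, v), correlation(u, v))
```

The two unnormalized distances grow with the magnitude of the spectra, whereas the normalized squared Euclidean distance and the correlation coefficient are bounded, which is one reason the four measures operate at different scales.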

An interesting result is that Bos mutus is consistently the farthest specimen in almost every comparison, independently of the DNR and distance measurement employed. A more detailed examination of its KEGG entry (bom:102267288) showed that, although it is listed as a COX1 gene, it is registered in RefSeq as cytochrome c oxidase subunit 1-like. This means that a distant homologous gene was introduced, and it acted as the external group, since it showed a greater distance than Saccharomyces cerevisiae. This shows that the methodology presented in this work is capable of discriminating between close orthology and more distant homologies.

Discussion

The proposed DNRs may be grouped into two categories, according to the values to be assigned to each nucleotide [18]: fixed value-based mapping methods characterized by employing arbitrary numeric values for each DNA letter, and biological-based mappings characterized by their use of numerical values that are somehow justified by some biochemical or biophysical properties of the DNA molecules.

We believe that the robustness of the fixed value-based mapping methods such as the integer and real representations is questionable, since they do not consider any biological property. Moreover, it is evident that the use of different values generates different results. If we regard the EIIP and atomic number representations as fixed value mapping methods, since they employ characteristics that may not directly affect the biological properties or the dynamics associated with the DNA molecules, we can verify that the use of arbitrary values and intervals leads to different results. In that respect, biological-based mappings such as the Voss and tetrahedron representations, which consider the properties of the DNA molecules and their interactions, may represent a better choice.

In the research presented in this paper, we performed experiments employing synthetic DNA sequences that were generated and altered with different types of change in a cumulative manner, using a uniform probability distribution for the selection of each type of nucleotide. This procedure may not be valid for modeling real biological DNA sequences, since the relative proportions of bases in DNA are not even [41]. However, given that the numeric values assigned to each nucleotide are different among the selected DNRs, the uniform probability distribution employed seems to be appropriate to avoid a possible bias in the results due to a high frequency of appearance of a certain numeric value.

It is interesting that the EIIP and atomic number representations behave similarly to each other, and unlike the rest of the single-dimensional DNRs (Figs 3–6). We believe this is because of the cost of changing a nucleotide, in a given sequence, to a different one. Such a cost is determined by the arithmetical difference between the values of the two nucleotides to be interchanged (the larger the difference, the greater the cost). In the case of the integer, real, and paired numeric representations, the costs are relatively lower in comparison with the costs when using the EIIP and atomic number representations. In these latter DNRs, large differences between sequences will tend to generate disproportionately large low-frequency components, as can be verified in Fig 7.

It is thus evident that the cumulative representations obtained the worst results with respect to our hypothesis. In particular, we believe that these types of representations are not suitable for FFT-based GSP methods, because of their lack of stationarity, which is a desideratum when using digital signal processing methods [42]. Moreover, the cumulative representations tend to generate disproportionately greater lower frequencies, similarly to the EIIP and atomic number representations (Fig 7).
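The low-frequency dominance of a cumulative signal can be illustrated with a short sketch. It assumes paired numeric values A = T = 1 and C = G = -1, a pseudo-random sequence with uniform nucleotide probabilities, and a plain periodogram. The running sum stands in for a cumulative DNR, and the names are hypothetical.

```python
import cmath
import random

# Assumed paired numeric values (not given in this section).
PAIRED = {"A": 1, "T": 1, "C": -1, "G": -1}


def power_spectrum(signal):
    """Squared magnitude of a naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))) ** 2
            for k in range(n)]


def low_frequency_share(signal, n_low):
    """Fraction of non-DC power held by the n_low lowest positive-frequency bins."""
    positive = power_spectrum(signal)[1:len(signal) // 2 + 1]
    return sum(positive[:n_low]) / sum(positive)


rng = random.Random(0)  # fixed seed so the run is repeatable
seq = "".join(rng.choice("ACGT") for _ in range(128))

flat = [PAIRED[base] for base in seq]

walk, total = [], 0
for value in flat:  # running sum: the cumulative counterpart of `flat`
    total += value
    walk.append(total)

flat_share = low_frequency_share(flat, 8)
walk_share = low_frequency_share(walk, 8)
print(flat_share, walk_share)
```

For a sequence of this kind, the share of power in the lowest bins is expected to be much larger for the running sum than for the non-cumulative signal, which is consistent with the non-stationarity argument above.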

In this sense, the multidimensional representations may be considered more appropriate choices, since their structure makes it possible to have equal costs for the replacement of any two nucleotide types. From the results obtained using biological data, we verified that, indeed, the multidimensional representations are more accurate with respect to what was expected as a result of the biological experiments. The paired numeric and real representations also seem to be adequate for GSP, since they consider the structural characteristics of the DNA molecule (i.e., the complementarity property). This can be verified as well in the biological results (Figs 8–12).

In fact, we can verify that all the selected non-cumulative DNRs are sub-spaces of the space generated by the Voss representation. For example, the integer, real, EIIP, atomic number, and paired numeric representations can be derived from the Voss representation by multiplying each Voss indicator sequence by the value assigned to the corresponding nucleotide type in each of the DNRs, and then performing a sum over the four dimensions.
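The derivation just described amounts to a weighted sum of the four indicator sequences. A minimal sketch follows, assuming integer mapping values T = 0, C = 1, A = 2, G = 3 purely for illustration; the function names are hypothetical.

```python
BASES = "ACGT"

# Assumed integer mapping values, used only for illustration.
INTEGER = {"T": 0, "C": 1, "A": 2, "G": 3}


def voss(seq):
    """Four binary indicator sequences, one per nucleotide type."""
    return {base: [1 if s == base else 0 for s in seq] for base in BASES}


def from_voss(indicators, values):
    """Scale each indicator sequence by its nucleotide value, then sum over the four dimensions."""
    length = len(indicators[BASES[0]])
    return [sum(values[base] * indicators[base][i] for base in BASES)
            for i in range(length)]


def direct(seq, values):
    """Single-dimensional DNR obtained by direct lookup."""
    return [values[s] for s in seq]


seq = "GATTACA"
derived = from_voss(voss(seq), INTEGER)
print(derived)
print(derived == direct(seq, INTEGER))
```

Replacing `INTEGER` with the value table of any other single-dimensional, non-cumulative DNR yields that representation from the same indicator sequences.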

From the results obtained in this research, we believe that an adequate DNR could consist of a multidimensional mapping that employs different values corresponding to the biological properties of the DNA molecules in each dimension. Moreover, we believe that the notion of neighboring nucleotides must be considered. In this sense, the use of the k-tuples approach could be useful when defining a new DNR.

An application of the presented approach is the assessment of the similarity among sets of DNA sequences without the need to perform alignment over the DNA characters. This will allow faster comparisons among large databases, especially if the sequences are stored in DNA signal form with their corresponding power spectra. In fact, given the increase in algorithms and computational methods based on graphics processing units (GPUs), we believe that it is very likely that most GSP methods will come to be based on these technologies. Our future work includes the implementation of our methods on GPUs, and the evaluation and development of additional DNRs and methods for DNA analysis based on GSP techniques.

Conclusion

We have presented an experimental study on the characteristics of nine DNRs belonging to three categories. Our results indicate that the multidimensional DNRs such as the Voss and tetrahedron representations are more appropriate for the computation of the similarity between DNA signals than are the other DNRs.

Supporting information

S1 File. MatlabCode.zip.

Matlab code and scripts for running the experiments we describe in this work.

https://doi.org/10.1371/journal.pone.0173288.s001

(ZIP)

S2 File. libraries.zip.

Matlab functions needed for running the experiments we describe in this work.

https://doi.org/10.1371/journal.pone.0173288.s002

(ZIP)

S3 File. datasets.zip.

The datasets employed in this work.

https://doi.org/10.1371/journal.pone.0173288.s003

(ZIP)

S4 File. README.txt.

Instructions on how to use the code and run the experiments.

https://doi.org/10.1371/journal.pone.0173288.s004

(TXT)

Author Contributions

  1. Conceptualization: GM IR ST RS JAM.
  2. Data curation: IR ST.
  3. Formal analysis: GM IR ST RS.
  4. Investigation: GM IR ST RS JAM.
  5. Methodology: GM JAM.
  6. Project administration: JAM.
  7. Software: GM IR ST RS.
  8. Validation: JAM.
  9. Visualization: ST JAM.
  10. Writing – original draft: GM IR ST RS JAM.
  11. Writing – review & editing: GM IR ST RS JAM.

References

  1. Das B, Turkoglu I. Fourier-based filtering approach for identification of protein-coding regions in DNA sequences. In: IEEE Signal Processing and Communications Applications Conference, 2015. p. 2529–2532.
  2. Inbamalar TM, Sivakumar R. Filtering Approach to DNA Signal Processing. In: International Proceedings of Computer Science and Information Tech. vol. 28; 2012. p. 1–5.
  3. Marhon S, Kremer SC. Gene prediction based on DNA spectral analysis: a literature review. Journal of computational biology. 2011;18(4):639–76. pmid:21381961
  4. Akhtar M, Epps J, Ambikairajah E. Signal Processing in Sequence Analysis: Advances in Eukaryotic Gene Prediction. Journal of Selected Topics in Signal Processing. 2008;2(3):310–321.
  5. Akhtar M, Epps J, Ambikairajah E. On DNA Numerical Representations for Period-3 Based Exon Prediction. In: IEEE International Workshop on Genomic Signal Processing and Statistics. 2; 2007. p. 1–4.
  6. Rushdi A, Tuqan J. Gene Identification Using the Z-Curve Representation. In: IEEE International Conference on Acoustics Speech and Signal Processing Proceedings. vol. 2; 2006. p. 1024–1027.
  7. Yin C, Yau SST. A Fourier characteristic of coding sequences: origins and a non-Fourier approximation. Journal of computational biology. 2005;12(9):1153–65. pmid:16305326
  8. Kotlar D. Gene Prediction by Spectral Rotation Measure: A New Method for Identifying Protein-Coding Regions. Genome Research. 2003;13(8):1930–1937. pmid:12869578
  9. Anastassiou D. Frequency-domain analysis of biomolecular sequences. Bioinformatics. 2000;16(12):1073–81. pmid:11159326
  10. Tiwari S, Ramachandran S, Bhattacharya A, Bhattacharya S, Ramaswamy R. Prediction of probable genes by Fourier analysis of genomic sequences. Bioinformatics. 1997;13(3):263–270. pmid:9183531
  11. Sharma D, Issac B, Raghava GPS, Ramaswamy R. Spectral Repeat Finder (SRF): identification of repetitive sequences using Fourier transformation. Bioinformatics. 2004;20(9):1405–12. pmid:14976032
  12. Gabrielian A, Pongor S. Correlation of intrinsic DNA curvature with DNA property periodicity. FEBS Letters. 1996;393(1):65–68. pmid:8804425
  13. Ravichandran L, Papandreou-Suppappola A, Spanias A, Lacroix Z, Legendre C. Time-frequency based biological sequence querying. In: IEEE International Conference on Acoustics Speech and Signal Processing; 2010. p. 4174–4177.
  14. Yin C, Yin XE, Wang J. A Novel Method for Comparative Analysis of DNA Sequences by Ramanujan-Fourier Transform. Journal of Computational Biology. 2014;21(12):867–879. pmid:25302665
  15. Borrayo E, Mendizabal-Ruiz EG, Vélez-Pérez H, Romo-Vázquez R, Mendizabal AP, Morales JA. Genomic signal processing methods for computation of alignment-free distances from DNA sequences. PloS one. 2014;9(11):e110954. pmid:25393409
  16. Cheever E, Searls D, Karunaratne W, Overton G. Using signal processing techniques for DNA sequence comparison. In: Proceedings of the Fifteenth Annual Northeast Bioengineering Conference, 1989. p. 173–174.
  17. Skutkova H, Vitek M, Sedlar K, Provaznik I. Progressive alignment of genomic signals by multiple dynamic time warping. Journal of theoretical biology. 2015;385:20–30. pmid:26300069
  18. Kwan HK, Arniker SB. Numerical representation of DNA sequences. In: IEEE International Conference on Electro/Information Technology, 2009. p. 307–310.
  19. Cristea PD. Conversion of nucleotides sequences into genomic signals. Journal of cellular and molecular medicine. 2002;6(2):279–303. pmid:12169214
  20. Chakravarthy N, Spanias A, Iasemidis LD, Tsakalis K. Autoregressive Modeling and Feature Analysis of DNA Sequences. Journal on Advances in Signal Processing. 2004 Jan;2004(1):13–28.
  21. Zhao J, Yang XW, Li JP, Tang YY. DNA sequences classification based on wavelet packet analysis. In: Wavelet Analysis and Its Applications. Springer; 2001. p. 424–429.
  22. Holden T, Subramaniam R, Sullivan R, Cheung E, Schneider C, Tremberger JG, et al. ATCG nucleotide fluctuation of Deinococcus radiodurans radiation genes. In: Optical Engineering+ Applications, International Society for Optics and Photonics; 2007. p. 669417.
  23. Nair AS, Sreenadhan SP. A coding measure scheme employing electron-ion interaction pseudopotential (EIIP). Bioinformation. 2006 Jan;1(6):197–202. pmid:17597888
  24. Voss RF. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Physical Review Letters. 1992 Jun;68(25):3805–3808. pmid:10045801
  25. Silverman BD, Linsker R. A measure of DNA periodicity. Journal of Theoretical Biology. 1986;118(3):295–300. pmid:3713213
  26. Berger JA, Mitra SK, Carli M, Neri A. Visualization and analysis of DNA sequences using DNA walks. Journal of the Franklin Institute. 2004;341(1):37–53.
  27. Zhang R, Zhang CT. Z curves, an intutive tool for visualizing and analyzing the DNA sequences. Journal of Biomolecular Structure and Dynamics. 1994;11(4):767–782. pmid:8204213
  28. Welch PD. The use of fast Fourier transform for the estimation of power spectra: A method based on time averaging over short, modified periodograms. IEEE Transactions on audio and electroacoustics. 1967;15(2):70–73.
  29. Deza MM, Deza E. Encyclopedia of distances. In: Encyclopedia of Distances. Springer; 2009. p. 1–583.
  30. Wolfram Research I. Normalized Squared Euclidian Distance; 2010. Available from: https://reference.wolfram.com/language/ref/NormalizedSquaredEuclideanDistance.html.
  31. Rao KR, Kim DN, Hwang JJ. Fast Fourier Transform-Algorithms and Applications. Springer Science & Business Media; 2011.
  32. Hamburg M. Basic Statistics: A Modern Approach. NY Harcourt Brace Jovanovich; 1974.
  33. Chassin D, Bellet D, Koman A. The human homolog of ribosomal protein S18. Nucleic acids research. 1993;21(3):745. pmid:8441687
  34. Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic acids research. 2000;28(1):27–30. pmid:10592173
  35. Kanehisa M, Goto S, Sato Y, Kawashima M, Furumichi M, Tanabe M. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic acids research. 2014;42(D1):D199–D205. pmid:24214961
  36. Patwardhan A, Ray S, Roy A. Molecular Markers in Phylogenetic Studies-A Review. Journal of Phylogenetics & Evolutionary Biology. 2014.
  37. Hebert PD, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proceedings of the Royal Society of London B: Biological Sciences. 2003;270(1512):313–321.
  38. Russo C, Takezaki N, Nei M. Efficiencies of different genes and different tree-building methods in recovering a known vertebrate phylogeny. Molecular Biology and Evolution. 1996;13(3):525–536. pmid:8742641
  39. Zardoya R, Meyer A. Phylogenetic performance of mitochondrial protein-coding genes in resolving relationships among vertebrates. Molecular biology and evolution. 1996;13(7):933–942. pmid:8752002
  40. Palmer JD. Mitochondrial DNA in plant systematics: applications and limitations. In: Molecular systematics of plants. Springer; 1992. p. 36–49.
  41. Bansal M. DNA structure: Revisiting the Watson-Crick double helix. Current Science. 2003;85(11):1556–1563.
  42. Rioul O, Vetterli M. Wavelets and signal processing. IEEE signal processing magazine. 1991;8(4):14–38.