The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances

doi:10.1371/journal.pone.0228070

Fig 1.

Spaced-word match between two DNA sequences S₁ and S₂ at (2,3) with respect to a pattern P = 1100101 representing match positions (‘1’) and don’t-care positions (‘0’).

The same spaced word TA**A*C occurs at position 2 in S₁ and at position 3 in S₂.
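To make the definition in the caption concrete, here is a minimal Python sketch of how a spaced word can be read off a sequence for a given pattern. The function names and the two example sequences are hypothetical, chosen only so that the spaced word TA**A*C appears at position 2 in the first sequence and position 3 in the second; they are not the sequences shown in Fig 1. Positions are 1-based, as in the caption.

```python
def spaced_word(seq, pos, pattern):
    """Return the spaced word of `seq` at 1-based position `pos`.

    Characters at match positions ('1') are kept; characters at
    don't-care positions ('0') are replaced by '*'.
    """
    start = pos - 1
    window = seq[start:start + len(pattern)]
    if len(window) < len(pattern):
        raise ValueError("pattern does not fit into the sequence at this position")
    return "".join(c if p == "1" else "*" for c, p in zip(window, pattern))


def is_spaced_word_match(s1, i, s2, j, pattern):
    """True if s1 at position i and s2 at position j carry the same spaced word."""
    return spaced_word(s1, i, pattern) == spaced_word(s2, j, pattern)


# Hypothetical example sequences (assumption, not taken from the figure).
P = "1100101"
S1 = "GTACGATCA"
S2 = "CCTATTAGCG"

print(spaced_word(S1, 2, P))
print(is_spaced_word_match(S1, 2, S2, 3, P))
```

A contiguous k-mer is the special case in which the pattern consists of k match positions only.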


Fig 2.

Test run on Shigella dysenteriae 1 197 (4.44 Mb) and E. coli strain UTI89 (5.15 Mb).

F(k), as defined in (3), is plotted against the word length k for contiguous words. From the length of the sequences, the values of k_min and k_max are calculated with (14) and (15) as k_min = 19 and k_max = 24.


Fig 3.

For distance values d between 0.05 and 1.0, we generated pairs of simulated DNA sequences of length L = 100 kb with a Jukes-Cantor distance d, i.e. with an average of d substitutions per sequence position.

Distances between the sequences were estimated with Slope-SpaM, (a) based on k-mers and (b) based on spaced words with random patterns, in which each position is a match position with probability 0.5. For each value of d, 20,000 sequence pairs were generated; their average estimated distances are plotted against the real distances. Standard deviations are shown as error bars.
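The simulation setup described in this caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: it uses the standard Jukes-Cantor relation p = 3/4 · (1 − exp(−4d/3)) between the distance d and the per-site mismatch probability p, and substitutes each position independently with probability p. All function names are hypothetical, and whether the original patterns force a match at the first and last position is not stated here, so `random_pattern` does not enforce it.

```python
import math
import random


def jc_mismatch_prob(d):
    """Expected fraction of differing sites at Jukes-Cantor distance d."""
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc_distance(p):
    """Inverse relation: Jukes-Cantor distance for mismatch fraction p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def simulate_pair(length, d, rng):
    """Generate two sequences of the given length at Jukes-Cantor distance d."""
    p = jc_mismatch_prob(d)
    s1 = [rng.choice("ACGT") for _ in range(length)]
    s2 = []
    for a in s1:
        if rng.random() < p:
            # Substitute by one of the three other nucleotides, chosen uniformly.
            s2.append(rng.choice([b for b in "ACGT" if b != a]))
        else:
            s2.append(a)
    return "".join(s1), "".join(s2)


def random_pattern(length, rng, match_prob=0.5):
    """Random binary pattern: '1' (match) with probability match_prob, else '0'."""
    return "".join("1" if rng.random() < match_prob else "0" for _ in range(length))


rng = random.Random(1)          # fixed seed (assumption) for reproducibility
s1, s2 = simulate_pair(100_000, 0.5, rng)
observed_p = sum(a != b for a, b in zip(s1, s2)) / len(s1)
print(observed_p, jc_distance(observed_p))
```

Comparing `jc_distance(observed_p)` with the chosen d gives a quick sanity check of the simulated pair before any alignment-free estimator is applied to it.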


Fig 4.

Estimated vs. real distances for pairs of simulated sequences as in Fig 3, but with sequences of length L = 1 Mb.


Fig 5.

Distances estimated by the alignment-free programs Mash [9], Skmer [37], Slope-SpaM (this paper) and Spaced [40] between semi-artificial sequences consisting of 252 homologous genes from 19 strains of Wolbachia, embedded in unrelated random sequences.

Ten data sets were generated by adding random sequences of different lengths. The x-axis is the proportion of homologous sequence within the semi-artificial sequences; the y-axis is the ratio between the distance estimated from the semi-artificial sequences and the distance estimated from the original, homologous gene sequences. See the main text for more details.


Table 1.

Test results on five sets of genome sequences from AFproject.

Pairwise distance values calculated with different alignment-free methods were used as input for Neighbor-Joining [60]; the table contains normalized Robinson-Foulds distances between the resulting trees and the reference trees. Smaller values therefore indicate trees that are more similar to the reference trees. The table also shows the median results from 74 methods evaluated in the AFproject study, together with three of the best-performing programs in that study (all results, except for Slope-SpaM, are taken from [44]).
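As an illustration of the measure reported in the table, the following Python sketch computes a normalized Robinson-Foulds distance between two trees on the same taxa. It assumes one common normalization, namely the number of non-trivial splits found in only one of the trees divided by the total number of non-trivial splits in both trees; the exact normalization used by AFproject is not stated in this caption. Trees are written as nested tuples of taxon names, a hypothetical input format chosen to keep the example self-contained.

```python
def leaf_set(tree):
    """All taxon names below a node of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def nontrivial_splits(tree):
    """Set of non-trivial bipartitions, each stored as the side without a reference taxon."""
    taxa = leaf_set(tree)
    ref = min(taxa)  # fixed reference taxon makes each split's representation unique
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if ref in clade else clade
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(taxa) - 1:
            found.add(side)
        return clade

    visit(tree)
    return found


def normalized_rf(tree_a, tree_b):
    """Normalized Robinson-Foulds distance in [0, 1] for trees on the same taxa."""
    splits_a = nontrivial_splits(tree_a)
    splits_b = nontrivial_splits(tree_b)
    total = len(splits_a) + len(splits_b)
    return len(splits_a ^ splits_b) / total if total else 0.0


# Two small example trees on five hypothetical taxa.
t1 = ((("A", "B"), "C"), ("D", "E"))
t2 = ((("A", "C"), "B"), ("D", "E"))
print(normalized_rf(t1, t2))
```

A value of 0 means that both trees contain exactly the same splits, and a value of 1 means that they share none.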


Table 2.

Runtime in seconds of Slope-SpaM (with spaced words) and six other alignment-free programs on three different sets of genomes that were used as benchmark data in AFproject [44].

On the largest data set, the set of plant genomes, co-phylog and kmacs were unable to produce results.
