These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

doi:10.1371/journal.pone.0101271

Figure 1.

Comparison of the time taken by k-mer counting tools to calculate k-mer abundance histograms, with time (y axis, in seconds) plotted against data set size (x axis, in number of reads).

All programs ran in time approximately linear in the number of input reads.

Figure 2.

Memory usage of k-mer counting tools when calculating k-mer abundance histograms, with maximum resident program size (y axis, in GB) plotted against the total number of distinct k-mers in the data set (x axis, billions of k-mers).

Table 1.

Benchmark soil metagenome data sets for k-mer counting performance, taken from [11].

Figure 3.

Disk storage usage of different k-mer counting tools to calculate k-mer abundance histograms in GB (y axis), plotted against the number of distinct k-mers in the data set (x axis).

Note that khmer does not use the disk during counting or retrieval, although its hash tables can be saved for reuse.

Figure 4.

Time for several k-mer counting tools to retrieve the counts of 9.7 million randomly chosen k-mers (y axis), plotted against the number of distinct k-mers in the data set being queried (x axis).

BFCounter, DSK, Turtle, KAnalyze, and KMC do not support this functionality.

Figure 5.

Relation between average miscount (the amount by which the count for a k-mer is incorrect) on the y axis, plotted against false positive rate (x axis), for five data sets.

The five data sets were chosen to have the same total number of distinct k-mers: one metagenome data set; a set of randomly generated k-mers; a set of reads, chosen with 3x coverage and 1% error, from a randomly generated genome; a simulated set of error-free reads (3x coverage) chosen from a randomly generated genome; and a set of E. coli reads.

Table 2.

Data sets used for analyzing miscounts.

Figure 6.

Relation between percent miscount (the amount by which the count for a k-mer is incorrect, relative to its true count) on the y axis, plotted against false positive rate (x axis), for five data sets.

The five data sets are the same as in Figure 5.
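
The two error measures in Figures 5 and 6 can be illustrated with a minimal sketch. It assumes a toy Count-Min-style counter (several tables, report the minimum), an MD5-based hash, and per-k-mer averaging; the names `ToyCounter` and `miscount_stats` are hypothetical, and none of this is taken from khmer's actual implementation.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq, one per starting position."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


class ToyCounter:
    """Toy Count-Min-style counter (an assumption for illustration only)."""

    def __init__(self, table_sizes):
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # One deterministic hash value, reduced modulo each table size.
        digest = int(hashlib.md5(kmer.encode()).hexdigest(), 16)
        return [digest % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def get(self, kmer):
        # Collisions can only inflate a slot, so the minimum is the best estimate.
        return min(t[s] for t, s in zip(self.tables, self._slots(kmer)))


def miscount_stats(reads, k, table_sizes):
    """Return (average miscount, percent miscount, per-k-mer offsets)."""
    true_counts = Counter()
    counter = ToyCounter(table_sizes)
    for read in reads:
        for kmer in kmers(read, k):
            true_counts[kmer] += 1
            counter.add(kmer)
    # Offset = reported count minus true count, for each distinct k-mer.
    offsets = {km: counter.get(km) - n for km, n in true_counts.items()}
    n_distinct = len(offsets)
    average = sum(offsets.values()) / n_distinct
    # Assumed definition: mean offset relative to the true count, in percent.
    percent = 100 * sum(offsets[km] / true_counts[km] for km in offsets) / n_distinct
    return average, percent, offsets


if __name__ == "__main__":
    avg, pct, _ = miscount_stats(["ACGTACGTAC", "TTGACCAGTA"], 4, [2, 3])
    print(f"average miscount: {avg:.2f}, percent miscount: {pct:.1f}%")
```

In this sketch the reported count can never fall below the true count, so both measures are non-negative and shrink as the tables grow relative to the number of distinct k-mers.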

Figure 7.

Number of unique k-mers (y axis) by starting position within read (x axis) in an untrimmed E. coli 100-bp Illumina shotgun data set, for k = 17 and k = 32.

The increasing number of unique k-mers reflects the increasing sequencing error rate towards the 3′ end of reads. Note that a 100-base read has only 69 starting positions for 32-mers (100 − 32 + 1).
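
The per-position tally in Figure 7 can be sketched as follows. This is an illustration only: it assumes "unique" means a k-mer seen exactly once across all reads, ignores reverse complements, and uses the hypothetical function name `unique_kmers_by_position`.

```python
from collections import Counter


def unique_kmers_by_position(reads, k):
    """Count, for each starting position, the k-mers that occur only once overall."""
    # First pass: abundance of every k-mer across all reads.
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    # A read of length L has L - k + 1 starting positions for k-mers.
    longest = max((len(read) for read in reads), default=0)
    per_position = [0] * max(0, longest - k + 1)
    # Second pass: attribute each abundance-1 k-mer to its starting position.
    for read in reads:
        for i in range(len(read) - k + 1):
            if counts[read[i:i + k]] == 1:
                per_position[i] += 1
    return per_position


if __name__ == "__main__":
    # Two toy reads that differ only in their last base, mimicking a 3' error.
    print(unique_kmers_by_position(["ACGTA", "ACGTT"], 4))
```

A single substitution near the end of a read creates k-mers seen nowhere else, which is why the tally rises towards later starting positions in error-prone data.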

Table 3.

Iterative low-memory k-mer trimming.

Table 4.

Low-memory digital normalization.

Table 5.

E. coli genome assembly after low-memory digital normalization.
