Figure 1.
Comparison of the time it takes for k-mer counting tools to calculate k-mer abundance histograms, with time (y axis, in seconds) against data set size (in number of reads, x axis).
All programs ran in time approximately linear in the number of input reads.
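For readers unfamiliar with the quantity being benchmarked, the following is a minimal sketch of a k-mer abundance histogram computed exactly with a Python dictionary. It is illustrative only and is not how any of the benchmarked tools work. The function name `kmer_abundance_histogram` and the toy reads are assumptions, and reverse complements are not merged.

```python
from collections import Counter


def kmer_abundance_histogram(reads, k):
    """Return a mapping {abundance: number of distinct k-mers with that abundance}."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k across the read.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    # Histogram: how many distinct k-mers occur once, twice, and so on.
    return Counter(counts.values())


# Toy input (hypothetical): two overlapping reads.
hist = kmer_abundance_histogram(["ACGTAC", "CGTACG"], k=3)
```

Exact counting of this kind needs memory proportional to the number of distinct k-mers, which is the cost the benchmarked tools are designed to reduce.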
Figure 2.
Memory usage of k-mer counting tools when calculating k-mer abundance histograms, with maximum resident program size (y axis, in GB) plotted against the total number of distinct k-mers in the data set (x axis, billions of k-mers).
Table 1.
Benchmark soil metagenome data sets for k-mer counting performance, taken from [11].
Figure 3.
Disk storage usage of different k-mer counting tools to calculate k-mer abundance histograms in GB (y axis), plotted against the number of distinct k-mers in the data set (x axis).
Note that khmer does not use the disk during counting or retrieval, although its hash tables can be saved for reuse.
Figure 4.
Time for several k-mer counting tools to retrieve the counts of 9.7 million randomly chosen k-mers (y axis), plotted against the number of distinct k-mers in the data set being queried (x axis).
BFCounter, DSK, Turtle, KAnalyze, and KMC do not support this functionality.
Figure 5.
Relation between average miscount (the amount by which the count for a k-mer is incorrect; y axis) and false positive rate (x axis) for five data sets.
The five data sets were chosen to have the same total number of distinct k-mers: one metagenome data set; a set of randomly generated k-mers; a set of reads, chosen with 3x coverage and 1% error, from a randomly generated genome; a simulated set of error-free reads (3x) chosen from a randomly generated genome; and a set of E. coli reads.
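The two quantities plotted here can be illustrated with a toy probabilistic counter. The sketch below assumes a Count-Min-Sketch-style structure (several tables, one hash reduced modulo each table size, minimum reported) and is not khmer's actual implementation. The class and function names, the table sizes, and the occupancy-based false positive estimate, which treats the tables as independent, are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Toy probabilistic counter: several tables, report the minimum."""

    def __init__(self, table_sizes):
        self.tables = [[0] * size for size in table_sizes]

    def _slots(self, kmer):
        # One hash per k-mer, reduced modulo each table size.
        h = int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        return [h % len(table) for table in self.tables]

    def add(self, kmer):
        for table, slot in zip(self.tables, self._slots(kmer)):
            table[slot] += 1

    def get(self, kmer):
        # Collisions can only inflate a slot, so the minimum is the best estimate.
        return min(t[s] for t, s in zip(self.tables, self._slots(kmer)))


def average_miscount(sketch, true_counts):
    """Mean of (reported count - true count) over the distinct k-mers."""
    total = sum(sketch.get(kmer) - n for kmer, n in true_counts.items())
    return total / len(true_counts)


def false_positive_rate(sketch):
    """Estimated chance that an unseen k-mer lands on an occupied slot in every table."""
    rate = 1.0
    for table in sketch.tables:
        rate *= sum(1 for c in table if c > 0) / len(table)
    return rate


# Toy data (hypothetical): five k-mers, four of them distinct.
kmers = ["ACGT", "CGTA", "GTAC", "ACGT", "TTTT"]
true_counts = Counter(kmers)

sketch = CountMinSketch(table_sizes=[5, 7, 11])
for kmer in kmers:
    sketch.add(kmer)
miscount = average_miscount(sketch, true_counts)
fp_rate = false_positive_rate(sketch)
```

In a structure of this kind, reported counts are never below the true counts, so the average miscount is non-negative and grows as the tables fill up.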
Table 2.
Data sets used for analyzing miscounts.
Figure 6.
Relation between percent miscount (the amount by which the count for a k-mer is incorrect, relative to its true count; y axis) and false positive rate (x axis) for five data sets.
The five data sets are the same as in Figure 5.
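One plausible way to write the two miscount measures down, reading the caption wording literally, is given below. The symbols are assumptions: $c(x)$ is the true count of k-mer $x$ and $\hat{c}(x)$ is the count reported by the counting structure. Whether the plotted value averages the per-k-mer ratio or takes a ratio of totals is not stated in the caption.

```latex
\[
\text{miscount}(x) = \hat{c}(x) - c(x),
\qquad
\text{percent miscount}(x) = 100 \times \frac{\hat{c}(x) - c(x)}{c(x)}
\]
```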
Figure 7.
Number of unique k-mers (y axis) by starting position within read (x axis) in an untrimmed E. coli 100-bp Illumina shotgun data set, for k = 17 and k = 32.
The increasing number of unique k-mers reflects the increasing sequencing error rate towards the 3′ end of reads. Note that there are only 69 starting positions for 32-mers in a 100-base read (100 - 32 + 1 = 69).
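A minimal sketch of how such a per-position tally could be computed follows. It assumes that "unique" means a k-mer seen exactly once in the whole data set and uses exact in-memory counting. The function name and the toy reads are hypothetical, and the function expects at least one read.

```python
from collections import Counter


def unique_kmers_by_position(reads, k):
    """For each start position, count k-mers that occur once in the whole data set."""
    counts = Counter(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    longest = max(len(read) for read in reads)
    # A read of length L has L - k + 1 start positions.
    per_position = [0] * max(longest - k + 1, 0)
    for read in reads:
        for i in range(len(read) - k + 1):
            if counts[read[i:i + k]] == 1:
                per_position[i] += 1
    return per_position


# Toy reads (hypothetical): the second read differs only in its last base,
# mimicking a sequencing error at the 3' end.
profile = unique_kmers_by_position(["ACGTACGT", "ACGTACGA"], k=4)

# Start positions in a 100-base read for the two k values in the figure.
positions_k17 = 100 - 17 + 1
positions_k32 = 100 - 32 + 1
```

In the toy input, the only k-mer seen exactly once is the one covering the altered final base, so the tally is nonzero only at the last start position.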
Table 3.
Iterative low-memory k-mer trimming.
Table 4.
Low-memory digital normalization.
Table 5.
E. coli genome assembly after low-memory digital normalization.