Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

The number of 10-mers needed to hit all 30-long sequences in four genomes: two bacterial genomes (A. tropicalis and C. crescentus), the worm C. elegans, and a mammalian genome, H. sapiens.

Genome sizes are quoted after removing all Ns and ambiguous codes. We tested three algorithms: a minimizer picking the lexicographically smallest 10-mer, a minimizer picking the first 10-mer in a random k-mer ordering, and selection using the set produced by DOCKS. When a 30-long window contained multiple DOCKS-selected 10-mers, the lexicographically smallest was chosen. "# mers" is the number of distinct 10-mers selected, and "avg. dist." is the average distance between two selected 10-mers.
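The three selection schemes described above can be illustrated with a minimal Python sketch. It is not the code used to produce the table. The function names (`select_kmers`, `random_order`, `summarize`) and the `allowed` parameter are hypothetical. `allowed` stands in for a precomputed DOCKS set, which this sketch does not compute. The sketch also assumes that "avg. dist." means the mean gap between consecutive selected positions, and that ties between identical k-mers in a window go to the leftmost occurrence.

```python
import random


def select_kmers(seq, k=10, L=30, order=None, allowed=None):
    """Pick one k-mer per L-long window; return the sorted selected positions.

    order:   optional dict mapping k-mer -> rank (hypothetical random ordering);
             if None, plain lexicographic order is used.
    allowed: optional set of k-mers standing in for a DOCKS-produced set;
             only k-mers in this set may be selected.
    """
    rank = (lambda m: m) if order is None else (lambda m: order[m])
    positions = set()
    for start in range(len(seq) - L + 1):
        # All k-mers fully contained in the current window, with their positions.
        candidates = [(seq[i:i + k], i) for i in range(start, start + L - k + 1)]
        if allowed is not None:
            candidates = [c for c in candidates if c[0] in allowed]
            if not candidates:
                # Cannot occur if `allowed` really hits every L-long sequence.
                continue
        # Smallest k-mer under the ordering; leftmost occurrence breaks ties.
        best = min(candidates, key=lambda c: (rank(c[0]), c[1]))
        positions.add(best[1])
    return sorted(positions)


def random_order(seq, k=10, seed=0):
    """Assign a random rank to every distinct k-mer occurring in seq."""
    kmers = sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})
    random.Random(seed).shuffle(kmers)
    return {m: r for r, m in enumerate(kmers)}


def summarize(seq, positions, k=10):
    """Return (# distinct selected k-mers, mean gap between consecutive picks)."""
    distinct = {seq[p:p + k] for p in positions}
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    avg = sum(gaps) / len(gaps) if gaps else 0.0
    return len(distinct), avg
```

To compare schemes on one sequence, call `select_kmers(seq)` for the lexicographic minimizer, `select_kmers(seq, order=random_order(seq))` for the random-order minimizer, and `select_kmers(seq, allowed=docks_set)` for set-based selection, where `docks_set` is a set of 10-mers loaded from elsewhere. Pass each result to `summarize`.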

Table 2