Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
Genome sizes are reported after removing all Ns and ambiguity codes. We tested three selection schemes: minimizers that pick the lexicographically smallest 10-mer, minimizers that pick the first 10-mer in a random k-mer ordering, and selection from the set produced by DOCKS. When a window of length 30 contained multiple DOCKS-selected 10-mers, the lexicographically smallest was chosen. "# mers" is the number of distinct 10-mers selected, and "avg. dist." is the average distance between two adjacent selected 10-mers.
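The three selection schemes and the two reported statistics can be illustrated with a minimal Python sketch. It is not the code used to produce the table. It assumes that the window length is measured in nucleotides (so a window of length w contains w - k + 1 k-mers), that ties are broken by the leftmost position, and that "avg. dist." is the mean gap between adjacent selected positions. The names `select_positions`, `summarize`, and `random_order` are hypothetical, and the `allowed` set stands in for a DOCKS output, which is not computed here.

```python
import random


def random_order(seed=0):
    """Return a rank function that assigns each k-mer a fixed random rank."""
    rng = random.Random(seed)
    ranks = {}
    # setdefault keeps the first rank drawn for a k-mer, so ranks stay stable.
    return lambda kmer: ranks.setdefault(kmer, rng.random())


def select_positions(seq, k, w, rank, allowed=None):
    """Pick one k-mer per window of w bases; return sorted start positions.

    rank    -- maps a k-mer to a sortable key (smaller is preferred)
    allowed -- optional set of k-mers (e.g. a universal hitting set);
               if given, only k-mers in this set may be picked
    """
    kmers_per_window = w - k + 1  # assumption: w counts bases, not k-mers
    picked = set()
    for start in range(len(seq) - w + 1):
        candidates = list(range(start, start + kmers_per_window))
        if allowed is not None:
            candidates = [i for i in candidates if seq[i:i + k] in allowed]
            if not candidates:
                continue  # cannot happen if `allowed` hits every window
        # Ties on rank are broken by the leftmost position (an assumption).
        picked.add(min(candidates, key=lambda i: (rank(seq[i:i + k]), i)))
    return sorted(picked)


def summarize(seq, k, positions):
    """Return (# distinct selected k-mers, mean gap between adjacent picks)."""
    distinct = {seq[i:i + k] for i in positions}
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    avg_dist = sum(gaps) / len(gaps) if gaps else 0.0
    return len(distinct), avg_dist


if __name__ == "__main__":
    toy = "ACGTACGT"
    lex = select_positions(toy, 2, 4, rank=lambda m: m)
    rnd = select_positions(toy, 2, 4, rank=random_order(seed=1))
    hit = select_positions(toy, 2, 4, rank=lambda m: m, allowed={"GT", "TA"})
    for name, pos in [("lexicographic", lex), ("random", rnd), ("set-based", hit)]:
        print(name, pos, summarize(toy, 2, pos))
```

For the setting described in the caption, the same functions would be called with k = 10 and w = 30 on a genome string with Ns removed, passing a DOCKS-produced set of 10-mers as `allowed` for the third scheme.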