Fig 1.
Minimizer and syncmer schemes.
In both examples the lexicographic order is used, and only forward k-mers are considered. The underlying sequence is shown at the top. By convention, the leftmost position is selected in the case of a tie. (A) Minimizers. Here w = 3 and k = 5, so the minimizer is the least 5-mer in every window of length 7 (3 consecutive 5-mers span w + k − 1 = 7 bases). The minimizer of each window is highlighted in yellow. (B) Syncmers. Here we show the 1-parameter syncmer with k = 5, s = 2 and x1 = 3. It selects a 5-mer if its 2-minimizer appears at position 3. The 2-minimizer in each 5-mer is underlined in red, and selected k-mers are highlighted in yellow. The start positions of the k-mers in the underlying sequence that are selected by each scheme appear in red and are marked with red arrows at the top. Sequence positions 6–7 constitute a gap in the syncmer selection as they are not covered by any selected k-mer.
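The two selection rules described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, returned start positions are 0-based, and the syncmer offset `x` is assumed to be 1-based (so that x = 3 means the third s-mer of the k-mer). Lexicographic order and leftmost tie-breaking follow the caption.

```python
def kmers(seq, k):
    """All forward k-mers of seq, in order of start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer_positions(seq, k, w):
    """0-based start positions of the least k-mer in every window of
    w consecutive k-mers (each window spans w + k - 1 bases)."""
    km = kmers(seq, k)
    selected = set()
    for start in range(len(km) - w + 1):
        # min() returns the first minimal item, i.e. the leftmost on ties.
        offset = min(range(w), key=lambda j: km[start + j])
        selected.add(start + offset)
    return sorted(selected)


def syncmer_positions(seq, k, s, x):
    """0-based start positions of k-mers whose least s-mer (the
    s-minimizer) starts at offset x inside the k-mer.
    Assumption: x is 1-based."""
    selected = []
    for i, kmer in enumerate(kmers(seq, k)):
        smers = kmers(kmer, s)
        offset = min(range(len(smers)), key=lambda j: smers[j])
        if offset + 1 == x:
            selected.append(i)
    return selected
```

Note that the syncmer rule looks only at the k-mer itself, whereas the minimizer rule depends on the neighbouring k-mers in the window.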
Fig 2.
The selected positions of three different selection schemes S1, S2 and S3 on the same sequence. Selected k-mers are highlighted and underlined. All schemes have the same number of selected k-mers, but the metrics are different. S1: ℓ = 0.529, ℓ2 = 2.974. S2: ℓ = 0.529, ℓ2 = 1.81. S3: ℓ = 0.647, ℓ2 = 2.808. While S1 and S2 have the same ℓ value, the k-mers selected by S2 are more evenly spread and thus S2 has much lower ℓ2. Some of the k-mers selected by S3 overlap, resulting in a higher ℓ value than the other schemes. However, because the gaps between covered bases are more evenly spread, the ℓ2 value is lower than that of S1. Intuitively, it will be easier to map reads using seeds selected by S3 than S1 despite the higher ℓ value, suggesting that ℓ2 is a more appropriate metric.
Fig 3.
Illustration of s-minimizers generating syncmers.
A window of α = 5 consecutive 11-mers. (A) When s = 5 and t = 3, the s-minimizer of the entire window generates a syncmer when its starting index is in the green region. If the s-minimizer is in one of the red regions, then a syncmer may be generated by the s-minimizer of the remaining part of the window. (B, C) For a two-parameter scheme, the s-minimizer creates two syncmer-generating regions that may be disjoint (B) if s < t2 − t1 or overlapping (C) if s > t2 − t1. In this example, t1 = 3 in both panels, with t2 = 9 in B and t2 = 6 in C.
Table 1.
Basic information about the reference genomes used in our experiments. # scaffolds is the number of individual sequences present in the reference genome FASTA file and can include unplaced scaffolds, alternates, etc. Length is the total length (in nt) of all of the scaffolds together, excluding ambiguous bases.
Table 2.
The long-read datasets used in our experiments. Source names are from Table 1 where relevant. PB = PacBio, ONT = Oxford Nanopore Technologies.
Table 3.
Performance metrics of minimizer and syncmer schemes on real sequences with simulated mutations.
Substitutions were introduced in the references at a rate of 15%. The values shown are for the conserved selected k-mers. # conserved is the number of k-mers selected by a scheme that were conserved under mutation. Best performance is shown in bold. “Optimal PSS” refers to the PSS with the lowest theoretical ℓ2,θ (Table SD2) for θ = 0.15.
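The counting of conserved selected k-mers under substitution can be sketched as follows. This is only an illustration under stated assumptions: a selected k-mer is treated as conserved when the same start position is selected in both the original and the mutated sequence and the k-mer itself is unchanged, which may differ from the exact definition used for the table. All function names are hypothetical, and a plain minimizer scheme stands in for whichever selection scheme is being evaluated.

```python
import random


def mutate(seq, rate, rng, alphabet="ACGT"):
    """Substitute each base independently with probability `rate`,
    always replacing it by a different base from `alphabet`."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in alphabet if b != base]))
        else:
            out.append(base)
    return "".join(out)


def minimizer_positions(seq, k, w):
    """Stand-in selection scheme: least k-mer (lexicographic, leftmost
    on ties) in every window of w consecutive k-mers."""
    km = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(km) - w + 1):
        selected.add(start + min(range(w), key=lambda j: km[start + j]))
    return sorted(selected)


def conserved_selected(seq, mutated, k, select):
    """Start positions selected in both sequences whose k-mer is identical.
    Assumption: this is what 'conserved' means; `select` maps a sequence
    to an iterable of 0-based start positions."""
    both = set(select(seq)) & set(select(mutated))
    return sorted(p for p in both if seq[p:p + k] == mutated[p:p + k])


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    reference = "".join(rng.choice("ACGT") for _ in range(10_000))
    mutated = mutate(reference, 0.15, rng)
    kept = conserved_selected(
        reference, mutated, 15, lambda s: minimizer_positions(s, 15, 10)
    )
    print(len(kept), "conserved selected k-mers")
```

Because only substitutions are introduced, positions in the original and mutated sequences stay aligned, which is what allows the comparison by start position.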
Fig 4.
The percentage of unmapped and incorrectly mapped reads—simulated data.
Top: Percent unmapped for low, medium and high compression. (A) PacBio reads simulated from the CHM13X sequence mapped against ChrX sequences from GRCh38; (B) 1000 ONT reads simulated from CHM13 mapped against GRCh38. Bottom: The percentage of incorrectly mapped reads for low, medium and high compression. (C) PacBio reads simulated from the CHM13 ChrX sequence mapped against CHM13X; (D) PacBio reads simulated from the 15 bacterial species in BAC pooled together and mapped against the union of their references.
Fig 5.
Percentage of unmapped reads—Real datasets.
Percentage is shown as a function of compression rate. PSS parameters were chosen to achieve the desired compression with the lowest ℓ2,mut. (A) Pooled PacBio bacterial reads mapped against BAC. (B) ONT human cell-line reads mapped against GRCh38.
Fig 6.
Impact of percent sequence identity on mapping quality.
We varied the mutation rate of 1000 simulated PacBio reads from CHM13X. The panels show the percentage of reads that were unmapped or incorrectly mapped by each method. (A) % unmapped reads. (B) % of the mapped reads that were incorrectly mapped.
Fig 7.
Memory usage and runtime vs. compression—Real data.
(A,B) Runtime in seconds to index the reference and map reads by each method. (C,D) Peak RAM usage in GB to index the reference and map reads. (A) and (C) are on PacBio bacterial reads. (B) and (D) are for ONT human cell-line reads.
Table 4.
Time (in seconds) and RAM (in GB) needed to index the reference and map the simulated reads by each of the tools. The second and third datasets use the same reference. Syncmer variant parameters were selected to match the minimap2 compression rates as above.