Parameterized syncmer schemes improve long-read mapping

doi:10.1371/journal.pcbi.1010638

Fig 1.

Minimizer and syncmer schemes.

In both examples the lexicographic order is used, and only forward k-mers are considered. The underlying sequence is shown at the top. By convention the leftmost position is selected in the case of a tie. (A) Minimizers. Here w = 3 and k = 5, so the minimizer is the least 5-mer in every window of length 7. The minimizer of each window is highlighted in yellow. (B) Syncmers. Here we show the 1-parameter syncmer with k = 5, s = 2 and x₁ = 3. It selects a 5-mer if its 2-minimizer appears at position 3. The 2-minimizer in each 5-mer is underlined in red, and selected k-mers are highlighted in yellow. The start positions of the k-mers in the underlying sequence that are selected by each scheme appear in red and are marked with red arrows at the top. Sequence positions 6–7 constitute a gap in the syncmer selection as they are not covered by any selected k-mer.
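
The two selection rules described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, positions returned are 0-based start indices, and the syncmer offsets are assumed to be 1-based (so x₁ = 3 means the third s-mer of the k-mer). Lexicographic order, forward k-mers only, and leftmost tie-breaking follow the caption.

```python
def minimizer_positions(seq, k, w):
    """Start positions of the least k-mer in every window of w consecutive k-mers."""
    selected = set()
    n_kmers = len(seq) - k + 1
    for start in range(n_kmers - w + 1):
        # min() returns the first minimal element, so ties go to the leftmost k-mer.
        best = min(range(start, start + w), key=lambda i: seq[i:i + k])
        selected.add(best)
    return sorted(selected)


def syncmer_positions(seq, k, s, offsets):
    """Start positions of k-mers whose leftmost least s-mer lies at one of `offsets`.

    `offsets` holds 1-based positions inside the k-mer (an assumption of this sketch).
    A single offset gives a 1-parameter scheme; several offsets give a multi-parameter one.
    """
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Index (0-based) of the leftmost smallest s-mer inside this k-mer.
        pos = min(range(k - s + 1), key=lambda j: kmer[j:j + s])
        if pos + 1 in offsets:
            selected.append(i)
    return selected


if __name__ == "__main__":
    toy = "ACGTTGCA"  # toy sequence, not the one drawn in the figure
    print(minimizer_positions(toy, k=3, w=2))
    print(syncmer_positions(toy, k=3, s=2, offsets={1}))
```

Note that the syncmer decision depends only on the k-mer itself, whereas the minimizer decision depends on the neighbouring k-mers in the window.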

Fig 2.

ℓ vs. ℓ₂ metric.

The selected positions of three different selection schemes S₁, S₂ and S₃ on the same sequence. Selected k-mers are highlighted and underlined. All schemes have the same number of selected k-mers, but the metrics are different. S₁: ℓ = 0.529, ℓ₂ = 2.974. S₂: ℓ = 0.529, ℓ₂ = 1.81. S₃: ℓ = 0.647, ℓ₂ = 2.808. While S₁ and S₂ have the same ℓ value, the k-mers selected by S₂ are more evenly spread and thus S₂ has much lower ℓ₂. Some of the k-mers selected by S₃ overlap, resulting in a higher ℓ value than the other schemes. However, because the gaps between covered bases are more evenly spread, the ℓ₂ value is lower than that of S₁. Intuitively, it will be easier to map reads using seeds selected by S₃ than S₁ despite the higher ℓ value, suggesting that ℓ₂ is a more appropriate metric.

Fig 3.

Illustration of s-minimizers generating syncmers.

A window of α = 5 consecutive 11-mers. (A) When s = 5 and t = 3, the s-minimizer of the entire window generates a syncmer when its starting index is in the green region. If the s-minimizer is in one of the red regions, then a syncmer may be generated by the s-minimizer of the remaining part of the window. For a two-parameter scheme, the s-minimizer creates two syncmer-generating regions that may be disjoint (B) if s > t₂ − t₁ or overlapping (C) if s < t₂ − t₁. In this example, t₁ = 3 and t₂ = 9 in B and t₂ = 6 in C.

Table 1.

Reference genomes.

Basic information about the reference genomes used in our experiments. # scaffolds is the number of individual sequences present in the reference genome fasta file and can include unplaced scaffolds, alternates, etc. Length is the total length (in nt) of all of the scaffolds together, excluding ambiguous bases.

Table 2.

Reads information.

The long-read datasets used in our experiments. Source names are from Table 1 where relevant. PB = PacBio, ONT = Oxford Nanopore Technologies.

Table 3.

Performance metrics of minimizer and syncmer schemes on real sequences with simulated mutations.

Substitutions were introduced in the references at a rate of 15%. The values shown are for the conserved selected k-mers. # conserved is the number of k-mers selected by a scheme that were conserved under mutation. Best performance is shown in bold. “Optimal PSS” refers to the PSS with the lowest theoretical ℓ_2,θ (Table SD2) for θ = 0.15.
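
The counting behind "# conserved" can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to produce the table: a selected k-mer is treated as conserved when none of its bases was substituted and its start position is still selected in the mutated sequence. The helper names, the toy syncmer selector, and the seeded random generator are all hypothetical.

```python
import random


def substitute(seq, rate, rng):
    """Return a mutated copy of seq and the set of substituted positions."""
    out = list(seq)
    changed = set()
    for i, base in enumerate(seq):
        if rng.random() < rate:
            # Always pick a different base so every hit is a real substitution.
            out[i] = rng.choice([b for b in "ACGT" if b != base])
            changed.add(i)
    return "".join(out), changed


def toy_syncmers(seq, k=5, s=2, offset=3):
    """Toy 1-parameter syncmer selector (1-based offset; an assumption of this sketch)."""
    picked = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        pos = min(range(k - s + 1), key=lambda j: kmer[j:j + s])
        if pos + 1 == offset:
            picked.append(i)
    return picked


def count_conserved(original, mutated, changed, k, select):
    """Count k-mers selected in both sequences that contain no substituted base."""
    before = set(select(original))
    after = set(select(mutated))
    return sum(
        1
        for p in before & after
        if all(q not in changed for q in range(p, p + k))
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    ref = "".join(rng.choice("ACGT") for _ in range(1000))
    mut, changed = substitute(ref, 0.15, rng)
    print(len(toy_syncmers(ref)), count_conserved(ref, mut, changed, 5, toy_syncmers))
```

Passing the selector as a function lets the same counting code be reused for minimizers or other schemes.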

Fig 4.

The percentage of unmapped and incorrectly mapped reads—simulated data.

Top: Percent unmapped for low, medium and high compression. (A) PacBio reads simulated from the CHM13X sequence mapped against ChrX sequences from GRCh38; (B) 1000 ONT reads simulated from CHM13 mapped against GRCh38. Bottom: The percentage of incorrectly mapped reads for low, medium and high compression. (C) PacBio reads simulated from the CHM13 ChrX sequence mapped against CHM13X; (D) PacBio reads simulated from the 15 bacterial species in BAC pooled together and mapped against the union of their references.

Fig 5.

Percentage of unmapped reads—Real datasets.

Percentage is shown as a function of compression rate. PSS parameters were chosen to achieve the desired compression with the lowest ℓ_2,mut. (A) Pooled PacBio bacterial reads mapped against BAC. (B) ONT human cell-line reads mapped against GRCh38.

Fig 6.

Impact of percent sequence identity on mapping quality.

We varied the mutation rate of 1000 PacBio reads simulated from CHM13X. The panels show the percentage of reads left unmapped and the percentage incorrectly mapped by each method. (A) % unmapped reads. (B) % of the mapped reads that were incorrectly mapped.

Fig 7.

Memory usage and runtime vs. compression—Real data.

(A,B) Runtime in seconds to index the reference and map reads by each method. (C,D) Peak RAM usage in GB to index the reference and map reads. (A) and (C) are on PacBio bacterial reads. (B) and (D) are for ONT human cell-line reads.

Table 4.

Runtime and memory.

Time (in seconds) and RAM (in GB) needed to index the reference and map the simulated reads by each of the tools. The second and third datasets use the same reference. Syncmer variant parameters were selected to match the minimap2 compression rates as above.
