JASPER: A fast genome polishing tool that improves accuracy of genome assemblies

doi:10.1371/journal.pcbi.1011032

Fig 1.

A typical k-mer count histogram for low-error-rate sequencing data (Illumina or PacBio HiFi).

The red region contains error k-mers that are due to sequencing errors in the reads. The blue region represents the distribution of the counts of correct k-mers in the reads. The x-position labeled R_t is defined as half of the x coordinate of the local minimum of the distribution.
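As an illustration of the definition of R_t above, the following minimal sketch locates the first local minimum of a k-mer count histogram and halves its x coordinate. The function name `first_local_minimum`, the list-based histogram layout, and the example counts are assumptions made for this illustration; they are not taken from the JASPER source code, and whether JASPER rounds the result is not stated in the caption.

```python
def first_local_minimum(hist):
    """Return the count (x coordinate) of the first local minimum.

    hist[c] is assumed to hold the number of distinct k-mers that occur
    exactly c times in the reads; hist[0] is unused.
    """
    c = 1
    # Walk down the error peak until the histogram stops decreasing.
    while c + 1 < len(hist) and hist[c + 1] < hist[c]:
        c += 1
    return c


# Hypothetical histogram: an error peak at low counts, then a peak of
# correct k-mers around count 9.
hist = [0, 1000, 300, 80, 20, 15, 40, 120, 300, 500, 420, 200]

local_min = first_local_minimum(hist)
r_t = local_min / 2  # R_t is half of the x coordinate of the local minimum
print(local_min, r_t)
```

Under this reading, k-mers with counts below R_t would be treated as likely error k-mers.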


Fig 2.

Illustration of the six most common error types that JASPER can detect and fix.

Error bases are in lowercase. The error k-mers (K = 5) are shown above the sequence. Black arrows indicate locations of deletions.
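To make the relationship between an error base and its error k-mers concrete, the sketch below lists every k-mer that overlaps a single erroneous position, using K = 5 as in the figure. The sequence, the error position, and the helper name `kmers_covering` are hypothetical and chosen only for illustration; the error base is written in lowercase to mirror the figure's convention.

```python
def kmers_covering(seq, pos, k):
    """Return all k-mers of seq that include position pos (0-based)."""
    first_start = max(0, pos - k + 1)
    last_start = min(pos, len(seq) - k)
    return [seq[s:s + k] for s in range(first_start, last_start + 1)]


# Hypothetical sequence with one substitution error at index 5.
seq = "ACGTTgCATGA"
error_kmers = kmers_covering(seq, 5, 5)
for kmer in error_kmers:
    print(kmer)
```

A single substitution away from the sequence ends therefore affects K consecutive k-mers, which is why an isolated error shows up as a run of low-count k-mers.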


Fig 3.

Number of errors remaining after polishing (green, left axis) and the run time of JASPER (green, right axis) for different numbers of polishing iterations (-p parameter), on A. thaliana data with simulated errors and K = 63.

The number of errors stops decreasing after three iterations, and it never increases for up to 10 iterations. The run time increases approximately linearly with the number of iterations. We do not show the original number of errors (pass 0), because it is large (118,563).


Table 1.

Comparison of polishing tools on assemblies of A. thaliana.

Numbers correspond to the number of bases in substitutions, insertions, or deletions detected by aligning the reads to the assemblies and calling variants with the freebayes software. Any variant with reference allele frequency 0 and alternative allele frequency >1 was considered an error. For substitution errors we counted the number of bases in substitutions, and for indel errors we counted the number of inserted or deleted bases. Columns labeled “Simulated” refer to experiments in which we introduced random errors into a reference and polished it with short reads simulated from the unaltered reference. Columns labeled “Real” refer to experiments that used real Illumina reads to polish a draft assembly of the data. Timing data were collected on a 24-core Intel Xeon Gold 6248R @ 3.0 GHz server with 24 threads. Execution time is given for the real data. The best value in each column (lower is better) is shown in bold.


Table 2.

Statistical analysis of the corrections made by different polishing methods on an A. thaliana genome with simulated errors, polished using simulated reads.

True Positives are errors that we introduced and that were corrected; False Positives are spurious corrections that polishers made where there were no errors introduced; and False Negatives are errors in the genome that were not corrected. The best numbers in each column are in bold. JASPER is the most accurate polisher with the lowest number of False Positives.
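The three counts defined above are often summarized as precision, recall, and F1 score. The sketch below shows that standard arithmetic; it is offered only as an illustration, since the caption does not state that these derived metrics appear in the table. The function name and the example counts are hypothetical and are not values from the paper.

```python
def polishing_metrics(tp, fp, fn):
    """Return (precision, recall, f1) from correction counts.

    tp: introduced errors that were corrected
    fp: spurious corrections where no error was introduced
    fn: introduced errors that were left uncorrected
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only.
precision, recall, f1 = polishing_metrics(tp=90, fp=10, fn=30)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A polisher with few False Positives scores high on precision, while a polisher that leaves few errors uncorrected scores high on recall.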


Fig 4.

Effectiveness of ntEdit and JASPER polishing on real A. thaliana data as a function of the k-mer size used for polishing.

We show the number of remaining substitution errors in blue, the number of remaining indel errors in orange, and the total number of remaining errors in green. Solid horizontal lines mark the number of errors before correction. The number of errors remaining is plotted as solid lines for JASPER and dashed lines for ntEdit. JASPER corrects more substitution errors at every k-mer size, and more indel errors at all but two values of k. For all values of k, the total number of errors remaining is lower for JASPER.


Table 3.

Comparison of the polishers on the human CHM13 assembly polished with Illumina reads.

Timing data were collected on a 24-core Intel Xeon Gold 6248R @ 3.0 GHz server with 24 threads. JASPER and ntEdit are the two fastest polishers. JASPER is slower than ntEdit, but it produces a more accurate polished assembly.
