Fig 1.
Overview of the AIEdit polishing pipeline.
In the pre-processing stage, ntCard and ntStat construct Bloom filters (orange cylinders) containing high-confidence k-mers and spaced seeds from the sequencing reads (upper-left grey box). These Bloom filters are used to identify unsupported k-mers in the assembly and form the input to the error pattern model. Black (white) cells in the inputs to the error pattern model represent spaced seed “care” (“do not care”) positions; patterns and hits are passed to the pink and cyan GRUs, respectively. The final hidden states of these two GRUs are concatenated and passed to the linear layer (dark blue), which predicts the error pattern that guides base-level corrections. Finally, a single round of ntEdit is applied as a post-processing step to correct potentially missed errors, reusing the k-mer Bloom filter. Input and output files are shown in grey boxes.
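The error pattern model described above (two GRU encoders over the pattern and hit sequences, whose final hidden states are concatenated and fed to a linear layer) can be sketched as follows. This is a minimal NumPy illustration of the dataflow only: the hidden size, seed length, number of error-pattern classes, and gate conventions are illustrative assumptions, not AIEdit's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def make_params(d_in, d_h):
    # Random weights for the update (z), reset (r), and candidate (h) gates.
    return ({g: rng.normal(scale=0.1, size=(d_h, d_in)) for g in "zrh"},
            {g: rng.normal(scale=0.1, size=(d_h, d_h)) for g in "zrh"},
            {g: np.zeros(d_h) for g in "zrh"})

def gru_encode(seq, params):
    """Run a standard GRU over seq and return the final hidden state."""
    W, U, b = params
    h = np.zeros(W["z"].shape[0])
    for x in seq:
        z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])      # update gate
        r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])      # reset gate
        cand = np.tanh(W["h"] @ x + U["h"] @ (r * h) + b["h"])
        h = (1.0 - z) * h + z * cand
    return h

# Hypothetical sizes, for illustration only.
SEED_LEN, D_H, N_CLASSES = 18, 32, 5

# Spaced-seed "care"/"do not care" patterns and Bloom filter hit indicators.
patterns = [rng.integers(0, 2, 4).astype(float) for _ in range(SEED_LEN)]
hits = [rng.integers(0, 2, 1).astype(float) for _ in range(SEED_LEN)]

# Encode each input stream with its own GRU, then concatenate final states.
h_cat = np.concatenate([gru_encode(patterns, make_params(4, D_H)),
                        gru_encode(hits, make_params(1, D_H))])

# Linear layer over the concatenated states predicts the error pattern class.
W_out = rng.normal(scale=0.1, size=(N_CLASSES, 2 * D_H))
logits = W_out @ h_cat
probs = np.exp(logits - logits.max())
probs /= probs.sum()
predicted_pattern = int(probs.argmax())
```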
Fig 2.
Polishing accuracy and computational performance of AIEdit compared to other tools.
Polishing accuracy is reported as the average number of mismatches and indels per 100 kbp. The numbers of mismatches and indels in the unpolished assemblies are shown as grey circles, and their QV scores are shown as grey horizontal lines. For simulated short-read datasets (a, b, and c), AIEdit (dark cross), ntEdit alone (blue cross), and POLCA (green cross) achieved similar error correction rates; their results are shown on a logarithmic scale to highlight minor differences. The run times for the simulated long-read experiments (d, e, and f) are also on a logarithmic scale due to the high run times of POLCA and Medaka (orange cross, when available) compared to the other tools. All other plots are on a linear scale. “Baseline” (grey cross) refers to the unpolished assembly. Run time and peak memory usage include the computational resources allocated for each tool’s entire pipeline, as reported by /usr/bin/time -pv. For experimental long-read datasets (g, h, and i), k-mer QV scores of the polished assemblies are calculated by Merqury.
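The two accuracy measures in the legend are directly related: errors per 100 kbp is a scaled per-base error rate, and QV is the same rate on the Phred scale (QV = -10 log10 of the per-base error probability). A minimal sketch of this bookkeeping, with made-up example counts:

```python
import math

def errors_per_100kbp(n_errors, n_bases):
    """Average mismatches + indels per 100 kbp of assembly sequence."""
    return n_errors / n_bases * 100_000

def qv(n_errors, n_bases):
    """Phred-scaled quality: QV = -10 * log10(per-base error rate)."""
    return -10.0 * math.log10(n_errors / n_bases)

# Hypothetical example: 100 remaining errors in a 100 kbp assembly gives a
# per-base error rate of 1e-3, i.e. roughly QV 30; 10x fewer errors adds
# 10 QV points.
rate_a = errors_per_100kbp(100, 100_000)   # 100 errors per 100 kbp
qv_a = qv(100, 100_000)                    # about QV 30
qv_b = qv(10, 100_000)                     # about QV 40
```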