Fig 1.
Graphical representation of the CLAW workflow.
A linear reference genome is circularised (see Fig 2). Long reads (ONT or PacBio) are mapped to the circularised reference genome. Mapped reads are filtered for length and quality, then a random subsample of these reads is used for genome assembly via Flye and/or Unicycler.
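The filtering and subsampling step described in this caption can be illustrated with a minimal Python sketch. It is not CLAW's implementation: the function name `filter_and_subsample`, its parameters, and the use of a simple arithmetic mean of Phred scores are all assumptions made for illustration, and no threshold values are implied.

```python
import random


def filter_and_subsample(reads, min_length, min_mean_quality, n_keep, seed=0):
    """Keep reads passing length and quality filters, then subsample at random.

    `reads` is a list of (name, sequence, qualities) tuples, where
    `qualities` is a list of per-base Phred scores. All names and
    thresholds here are hypothetical, chosen only for illustration.
    """
    kept = []
    for name, sequence, qualities in reads:
        if len(sequence) < min_length or not qualities:
            continue
        # Simplification: arithmetic mean of Phred scores. Dedicated
        # read-filtering tools may compute read quality differently.
        mean_quality = sum(qualities) / len(qualities)
        if mean_quality >= min_mean_quality:
            kept.append((name, sequence, qualities))

    # If fewer reads pass than requested, return them all, in input order.
    if len(kept) <= n_keep:
        return kept

    # A seeded generator makes the random subsample reproducible.
    rng = random.Random(seed)
    return rng.sample(kept, n_keep)
```

Fixing the seed is a design choice that lets the same subsample be redrawn when an assembly is repeated.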
Fig 2.
Demonstration of ‘circularising’, in silico, linear reference chloroplast genome sequences after download from online databases.
Linearised chloroplast genome sequences introduce artificial breakpoints in the sequence (orange and purple circles). These artificial breakpoints may lead to poor alignment of long reads (green lines), which may in turn affect chloroplast read enrichment/baiting. In silico re-circularisation of the linear sequence may allow long reads to map across the artificial breakpoints.
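One common way to emulate circularity for a linear mapper is to append a copy of the start of the sequence to its end, so that a read spanning the artificial breakpoint has a contiguous target. The sketch below illustrates that idea only. Whether CLAW circularises references in exactly this way is not stated in this caption, and the function name `circularise` and the `overlap` parameter are hypothetical.

```python
def circularise(sequence, overlap):
    """Return `sequence` with its first `overlap` bases appended to the end.

    Assumption for illustration: appending a prefix is one way to let
    reads align across the artificial breakpoint of a linearised
    circular genome. It is not necessarily the method used by CLAW.
    """
    if not 0 <= overlap <= len(sequence):
        raise ValueError("overlap must be between 0 and the sequence length")
    return sequence + sequence[:overlap]


# Toy example: the read spans the junction between the end and the start.
genome = "AAACCCGGGTTT"
read = "TTTAAA"

found_in_linear = read in genome
found_in_circularised = read in circularise(genome, 3)
```

In this toy case the read is absent from the linear sequence but present in the extended one, which mirrors the mapping behaviour sketched in the figure.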
Table 1.
Information on the ONT long reads used as input for CLAW and the Flye-generated chloroplast genome assembly statistics.
Fig 3.
Representative reference-sample chloroplast genome alignments for (A) a monocot species, (B & C) two dicot species, and (D) an algal species. The reference genome is represented on the X-axis, and the genome assembled by CLAW is represented on the Y-axis. Species names and reference genome NCBI accession numbers appear on the X-axis. Please refer to Table 1 for the accession numbers of the ONT long reads used by CLAW for genome assembly with Flye. Each alignment presented here follows the canonical chloroplast genome-genome alignment pattern, and the large single copy (LSC; dark blue line in A), small single copy (SSC; red line in A), and two inverted repeat (IR; green lines in A) regions are clearly identifiable (broken black line in A indicates the boundaries of each region).
Table 2.
Mean (± SE) time to completion, RAM used, percent identity to the reference genomes, and number of contigs generated by CLAW following the Flye and Unicycler workflows with ONT or PacBio data as input.
All jobs were run on Intel x86-64 Haswell and Skylake CPUs.
Fig 4.
Repeat graphs for the 19 chloroplast genomes (4 algal species, 4 monocot species, and 11 eudicot species) assembled by CLAW using ONT long reads as input for Flye.
The ‘lasso’ style genome plots represent assemblies in which the IR regions (the flat lines connecting two circular pieces) are perfectly palindromic in the assembly, while the circular style genome plots represent assemblies in which the IR regions are not identical. The order of species in this plot follows the order of species in Table 1. The colouring of segments of each genome represents genome annotations assigned using the BLAST-based genome annotation feature of Bandage. Genomes were annotated using publicly available coding region annotations from each of the reference chloroplast genomes. Panels (C) and (K) contain additional contigs that could not be annotated using chloroplast coding regions because they are mitochondrial genome fragments.