Can we use it? On the utility of de novo and reference-based assembly of Nanopore data for plant plastome sequencing

The chloroplast genome harbors plenty of valuable information for phylogenetic research. Illumina short-read data is generally used for de novo assembly of whole plastomes. PacBio or Oxford Nanopore long reads are additionally employed in hybrid approaches to enable assembly across the highly similar inverted repeats of a chloroplast genome. Unlike for PacBio, plastome assemblies based solely on Nanopore reads are rarely found, due to their high error rate and non-random error profile. However, the actual quality decline connected to their use has rarely been quantified. Furthermore, no study has employed reference-based assembly using Nanopore reads, which is common with Illumina data. Using Leucanthemum Mill. as an example, we compared the sequence quality of seven chloroplast genome assemblies of the same species, using combinations of two sequencing platforms and three analysis pipelines. In addition, we assessed the factors which might influence Nanopore assembly quality during sequence generation and bioinformatic processing. The consensus sequence derived from de novo assembly of Nanopore data had a sequence identity of 99.59% compared to Illumina short-read de novo assembly. Most of the errors detected were indels (81.5%), and a large majority of them is part of homopolymer regions. The quality of reference-based assembly is heavily dependent upon the choice of a close-enough reference. When using a reference with 0.83% sequence divergence from the studied species, mapping of Nanopore reads results in a consensus comparable to that from Nanopore de novo assembly, and of only slightly inferior quality compared to a reference-based assembly with Illumina data. For optimal de novo assembly of Nanopore data, appropriate filtering of contaminants and chimeric sequences, as well as employing moderate read coverage, is essential. Based on these results, we conclude that Nanopore long reads are a suitable alternative to Illumina short reads in plastome phylogenomics. Few errors remain in the finalized assembly, which can be easily masked in phylogenetic analyses without loss in analytical accuracy. The easily applicable and cost-effective technology might warrant more attention by researchers dealing with plant chloroplast genomes.


Introduction
The chloroplast (cp) is the most characteristic organelle of plant cells. It carries its own unique genome, whose genes are mainly related to housekeeping functions and photosynthesis [1,2]. Most cp genomes are highly conserved regarding structure, size, and functionality of their genes. Their molecular conformation has long been thought to be exclusively circular, but growing evidence suggests that plastid chromosomes often form concatemers of several molecules and can occur in circularized and/or linear form [3]; for further references see [1]. One genome equivalent has a quadripartite structure consisting of two regions of unique DNA termed the large and small single-copy region (LSC and SSC, respectively), and a pair of near-identical large inverted repeats (IR B and IR A ) situated in between them [4]. Low substitution rates [5], very low levels of recombination, and high copy numbers in cells make the haploid genomes an easily accessible source of valuable phylogenetic information [6], for example by Sanger sequencing of molecular markers following PCR. Since the advent of next-generation sequencing (NGS) techniques, a growing number of research projects has taken advantage of the possibility to build phylogenetic analyses upon sequences of whole cp genomes instead of only a few molecular markers. In consequence, as of 19/10/2019, 2,982 angiosperm chloroplast genomes have been available in NCBI's Genbank (https://www.ncbi.nlm.nih.gov/genbank/). Apart from a substantially higher number of informative characters, larger-scale features like major inversions and gene losses can now be studied in detail.
The standard method applied for plastome recovery is Illumina short-read sequencing and assembly. The reads are of high quality (app. 0.3% error rate for the Illumina MiSeq [7]), but currently limited to 300 basepairs (bp) paired-end reads. This means that a cp genome containing two IRs cannot be reliably assembled into one continuous contig, as repeat regions can only be resolved up to the length of the sequenced insert size. By contrast, thirdgeneration single-molecule sequencing yields reads of much higher lengths: with PacBio SMRT DNA sequencing, average read length is now up to 20 kb, (according to Pacific Biosciences website at https://www.pacb.com/products-and-services/sequel-system/latestsystem-release/, accessed 30/10/2019). The alternative provided by Oxford Nanopore MinION features average read lengths of more than 6-8 kb, with also more than 20 kb regularly achieved depending on input DNA quality; and maximum read lengths exceeding 150 kb [8]. The main problem with these techniques is their high initial error rate. PacBio raw reads have an error rate of 11-15%; circular consensus sequencing can lower this value to < 1% [9], however at the cost of reduced sequencing depth and/or multiplexing capacity. Nanopore reads, despite several improvements, still feature a raw error rate of 5-15% [10].
This problem can be alleviated by the use of consensus techniques [via Partial Order Additionally, unlike Illumina or PacBio, the device is low-priced and easily installed in an average lab.
The aims of the present study were therefore: 1) to compare the sequence quality of a plastome derived from Nanopore data de novo assembly to a corresponding assembly based on Illumina data, with the goal to quantitate their differences on the basepair level; 2) to explore the potential of reference-based assembly using Nanopore data and to evaluate the quality of these assemblies with respect to Nanopore or Illumina de novo assemblies; 3) to categorize the error types associated with Nanopore-derived plastome sequences and to assess if and how these errors and possible biases in their distribution might impact the use of Nanopore-only cp genomes for phylogenomic research; and 4) to provide practical guidelines for both de novo and mapping assembly of Nanopore data for researchers interested in establishing Nanopore sequencing in their lab.
To this end, we here present complete cp genomes from two species of the genus Leucanthemum Mill., which belongs to subtribe Leucantheminae in the Compositae tribe Leucanthemum. Subsequently, the L. vulgare plastome served as "gold standard" reference for comparisons of different assembly methods and types of data. To achieve this, the cp genome sequence was re-generated six times: first, reference-based assembly of the short reads was performed, with a closely (L. virgatum) as well as a more distantly related taxon (Artemisia frigida Willd., belonging to subtribe Artemisiinae of tribe Anthemideae [25]) as reference for mapping, to assess the influence of reference choice on assembly quality. De novo and reference-based assembly approaches were then repeated using long-read Nanopore data instead of Illumina short reads. In a third step, we tested whether a correction of the error-prone Nanopore raw reads by high-quality Illumina reads improves the results of a subsequent Nanopore de novo assembly (hybrid approach).

Materials and Methods
We provide a detailed account on the methods, tools and parameters employed in the present study, as correct setting of suitable options is essential for obtaining good results in several cases. A schematic workflow is available in

Fig 1. Schematic bioinformatic workflow for comparative assembly of chloroplast genomes.
Tools used for each processing step are given in italics. Leucanthemum virgatum was assembled de novo with Illumina data and subsequently used as reference for mapping.

Plant material and DNA extraction
Silica-gel dried leaf tissue from two accessions of L. vulgare and L. virgatum, collected during field excursions, was used for the present study (for voucher details see Supporting Information S1 Table). Genomic DNA (gDNA) was extracted following a CTAB extraction protocol adapted from [26] and [27]. Each extraction was quality-checked via agarose gel electrophoresis, and DNA concentration and purity was determined using a NanoDrop spectrophotometer One (Thermo Fisher Scientific, Waltham MA, USA).

Amplification via long-range PCR
Chloroplast DNA was amplified using 32 primer pairs by [28] [30]. Based on the estimated product sizes and overlap lengths in A. frigida, for fragments 2-5, other primers were chosen (and sometimes slightly modified) from the list of primers given in [28]; these fragments span the region of the LSC harboring the two inversions which are characteristic for Asteraceae [31,32], but are not found in Orobanchaceae [28]. Furthermore, primer ndhI.194R (fragment 15) was replaced by the reverse-complement of primer ndhA.535F (fragment 14) due to amplification problems, and several primers were slightly modified. All used primer sequences are available in Supporting Information S2

Complementary Sanger sequencing
For exact determination of the LSC/IR/SSC boundaries, primers were designed spanning the four junctions (Supporting Information S2 Table)  and primer sequences deriving from the long-range PCRs were trimmed sequentially from 5'and 3'-ends of reads. This is important as primer sequences need not necessarily match the true sequence of the sample and thus might introduce wrong signal into an assembly.
Afterwards, paired-end reads were subjected to quality trimming and length filtering also via BBDuk, keeping only reads composed of regions with Phred quality scores of Q = 20 or higher, and of minimum length = 50 bp. Trimming and filtering success was verified by examining the fastq files using FastQC v.0.11.8 [36]. Processed reads were used as input for de novo as well as mapping assemblies.

Reference-based mapping assembly of Illumina data
Reference-based assembly of L. vulgare reads used two different references (see also section 2.7.2), the sequence from the Illumina de novo assembly of L. virgatum and the chloroplast genome of Artemisia frigida [further referred to as "mapping (virgatum)" and "mapping (Artemisia)"]. The gapped read aligner BBMap (bbmap.sh in BBTools software suite v. 38.58), which also provides several mapping statistics, was chosen for mapping paired-end reads in order to accommodate larger indels possibly present between the two genomes. BBMap was run in very slow mode (vslow = t); the maximum indel size to search for was set to 300 bp for mapping to L. virgatum and to 1,500 bp for A. frigida. Unpaired reads were considered unmapped (pairedonly = t), killbadpairs set to true and pairlen to 180 bp. If a read had more than one top-scoring mapping location, one site was selected randomly (ambiguous = random), so that the resulting .sam files did not contain secondary mappings.
The .sam files were converted to .bam format, sorted and indexed with SAMtools v.1.9 [37]. The resulting .vcf files were subsequently compressed and indexed with bgzip and tabix from the HTSlib package v.1.9 [37]. Variant statistics were calculated using VCFtools v.0.1.15 [40]. The consensus sequences were generated from the variants using the BCFtools [37] command bcftools consensus, setting missing genotypes to be represented by Ns instead of skipping them.

De novo assembly of Illumina data
De novo assembly of L. vulgare and L. virgatum reads was done with the Unicycler pipeline v.0.4.7 [41], which mainly relies on the de Bruijn graph-based de novo assembler SPAdes v.3.13.0 [42] and the assembly polishing tool Pilon v.1.22 [43]. Running the pipeline in conservative mode ensured the lowest possible misassembly rate. As a result, contigs linked in a circularized assembly graph were obtained, which was visualized and assessed with Bandage v.0.8.1 [44]. and copyundefined = t. As references, primer sequences were given in normal, complement, reverse, and reverse-complement orientations. Trimmed reads were then subjected to a length-and quality-filtering step in NanoFilt v.2.2.0 [45], leaving only reads with lengths ≤ 12,600 bp and a minimum average read Q-score of 7 in the datasets. Raw reads as well as trimmed and quality-and length-filtered reads ("processed reads") were examined using FastQC. As base quality values were to be preserved for reference-based mapping assembly, no further processing of reads was done for this approach. length-filtered, decontaminated, and putatively chimera-free reads are referred to as "fully processed reads" hereafter.

Reference-based mapping assembly of Nanopore data
Processed reads from L. vulgare were analyzed to evaluate the suitability of the mapping approach for Nanopore data. To avoid problems caused by too high read coverage (see Discussion, section 4.6.), a reduced dataset similar to that employed for de novo assembly (see below) was used: only reads with a minimum length of 4,200 bp were kept after another filtering step with NanoFilt executed with option -l 4200. References for mapping were identical to those from the mapping assembly of Illumina data. Importantly, to avoid secondary mappings, but also mismappings and subsequent false variant calls as far as possible, the references were designed to contain only those parts of the second inverted repeat (IR A ) needed to assure correct mapping of long reads from fragments 15 and 16, which span the borders between the SSC and the IR A , and the IR A and the LSC, respectively. Mapping was done using the mapPacBio.sh script (BBTools software suite v. 38.58), with settings analogous to those employed for Illumina data in bbmap.sh regarding the vslow, ambiguous, and maxindel parameters. Additionally, as the script only correctly processes reads up to a certain length, the long Nanopore reads were split into 3,000-bp chunks using the maxlen = 3000 option. Processing and evaluation of the resulting mappings also followed procedures described in section 2.6.2., except for the removal of duplicates, as Nanopore reads are derived from unfragmented PCR products, and hence were expected to feature high portions of identical starting positions in the mapping. Variant calling again was done with callvariants2.sh. Settings for ploidy and minreadmapq corresponded to those for Illumina data, and no trimming was performed. However, to improve mapping accuracy around indels (which are expected to occur more often when mapping Nanopore data), as including a pairing rate in score calculation was not applicable, usepairing was set to false. The resulting .vcf files were processed and the consensus sequences generated as described for Illumina data.

De novo assembly of Nanopore data
Fully processed Nanopore reads were used for de novo assembly with Canu v.1.8 [48]. The program was run with the following parameters: the estimated genome size was set to 180 kilobases (kb), somewhat above the 151 kb known from A. frigida. To account for the small overlaps present between some of the long-range PCR fragments, minOverlapLength was lowered to 300 bp. Correspondingly, the minimum length for reads being available for assembly was set to 300 bp as well. The correctedErrorRate was lowered to 0.134 and the three MhapSensitivity parameters set to low, as recommended by the program's parameter reference for datasets with high coverage > 60-fold. corOutCoverage was set to 6,000, i.e.
higher than the total input coverage to ensure that in principle, all loaded reads would be corrected. However, to avoid a detrimentally high coverage of reads during assembly (see Discussion, section 4.6.) while assuring that the program uses the longest reads first, readSamplingCoverage was given a value of 400 and readSamplingBias was set to 2.0, which results in the shortest reads (initially loaded into the sequence store) being flagged and excluded from analysis until a coverage of 400-fold remains. After assembly, resulting contigs supported by fewer than 10 reads and / or with a read coverage below 5-fold for more than 50% of their length were filtered by Canu. For checking purposes, stopOnReadQuality = true was applied. The remaining settings were kept on default.
Assembled contigs were first locally blasted (megaBLAST) against the A. frigida reference without the IR A to assure that no chimeric or contaminant contigs were present. Chimeric contigs with all parts mapping to A. frigida were split with the fastasubseq option in Exonerate [49] and (where necessary) reverse-complemented for further use. Contigs were then polished with Nanopolish v.0.10.2 (available at https://github.com/jts/nanopolish; [11]) as follows, using the original .fast5 files from the sequencing run. First, Minimap2 v.2.14 [50] with setting -ax map-ont was used for mapping raw reads to assembled contigs. After indexing the mapping with SAMtools, the nanopolish variants command was executed on each contig in consensus-calling mode, with --max-haplotypes set to 30,000 and --mincandidate-frequency to 0.2. Based on the extracted variants, the improved contigs were then generated via nanopolish vcf2fasta. Polished contigs were mapped to the A. frigida IR A -free reference with Minimap2, using setting --ax asm20, and their correct order and starting position identified in IGV. Merging of the contigs, together with the Sanger sequence of the gap between long-range PCR fragments 14 and 15 was done in BioEdit v.7.2.5 [51]. In one case, where the overlapping regions between two contigs were not identical but contained five single basepair differences, one of the contigs was arbitrarily chosen over the other. As the resulting assembly still lacked the IR A , the latter was added by copying and reversecomplementing the respective sequence of the IR B and pasting it at the end of the assembled sequence after the SSC. Read coverage values were obtained by mapping the reads used by Canu for the first assembly step after correction and trimming (the trimmedReads.fasta) to the de novo assembled cp genome sequence without the IR A , using the mapPacBio.sh script as described in section 2.7.2. but setting maxindel = 100.
Additionally, Qualimap was used for plotting coverage across the reference.

Hybrid de novo assembly of Nanopore and Illumina data
Hybrid de novo assembly was performed using two approaches, once by analyzing reads from both platforms simultaneously in Unicycler, which also supports hybrid assembly, and once by using the information contained in high-quality Illumina data for improving the errorrich Nanopore raw reads before assembly. For this, fully processed Nanopore reads from L.
vulgare were corrected based on sequence information from corresponding processed short Illumina reads using Nanocorr v.0.01 [52]. The resulting reads were used for an improved de novo assembly with Canu. Settings were kept identical to the assembly with unimproved reads, with a few exceptions. Due to their now increased quality, higher read coverage is not expected to be as problematic as with unimproved reads; hence, coverage was only limited by setting minReadLength to 2000; readSamplingBias was turned off by setting it to 0.0 (meaning that readSamplingCoverage being set to 200 would have no effect). Additionally, for these settings stopOnReadQuality was set to false. Assembled contigs were processed and merged and the final assembled sequence generated as described in section 2.7.3.
However, no polishing was performed since polishing with the original, unimproved reads could have deteriorated the quality of the contigs.

Annotation of the chloroplast genomes
The two complete, finished sequences from Illumina de novo assemblies of L. vulgare and L.
virgatum were annotated using the online tool GeSeq [53]

Long-range PCR, sequencing, and genome assembly
The lengths of long-range PCR fragments obtained from Leucanthemum vulgare and L.
virgatum were similar to those found in Artemisia frigida, with the shortest and longest fragments in L. vulgare being 6,086 bp (fragment 4) and 12 Number of raw reads is given for each of both MiSeq 300-bp and 80-bp paired-end sequencing runs. Lengthfiltered reads include those which were too short after trimming. Read numbers for Nanopore contaminant and filtered reads as well as remaining reads include only processed, not fully processed reads (see text). L., Leucanthemum.
For the mapping assemblies, detailed statistics are available in for the Nanopore mapping (Artemisia). The respective reference used for mapping (L. virgatum Illumina de novo assembly or A. frigida as obtained from Genbank, both without the IRA) is given in parentheses. Mapped reads include ambiguously mapped reads. The percentages given for both refer to all input reads, the fraction of remaining reads after deduplication is relative to mapped reads. No deduplication was performed for Nanopore reads. Nanopore read lengths range from 1-3,000 bp due to necessary splitting of long reads into 3,000-bp chunks by the mapping software. L., Leucanthemum; stdev, standard deviation; n.a., not applicable.
De novo assembly of Illumina data from L. virgatum produced five contigs (see Table 3  Three de novo assemblies were produced for L. vulgare, using Illumina data, Nanopore data or Nanopore data improved by Illumina data (hybrid approach). Assembly was done with Unicycler for Illumina reads and Canu for Nanopore reads. Nanopore input reads are given after Canu correction and trimming steps. Values for 'fraction at coverage x or higher' refer to the final assembled sequence without the IRA and denote the percentage with a certain read coverage after mapping of input reads. L., Leucanthemum; stdev, standard deviation; no, number; n.a., not applicable.

Fig 3. Comparison of the inverted repeat (IR) boundaries in Leucanthemum vulgare and L. virgatum.
Genes above the green bars are transcribed in reverse direction, those below in forward direction. For both taxa, the IR extends into the ycf1 and rps19 genes, resulting in two pseudogenes (denoted by a ψ) at the single-copy / IR junctions. Lengths are given for whole genes and their duplicated fragments as well as the large single-copy (LSC), small single-copy (SSC) and IR regions (IRA and IRB). Arrows show basepair (bp) distance from the junctions for the ndhF and trnH genes. The SSC in L. virgatum is exemplarily reverse-complemented with respect to that in L. vulgare; both configurations exist in individual plants according to [73]. The figure is not to scale.
The arrangement and order of genes is identical in the two species: the plastomes contain 80 predicted protein-coding genes, four rRNA genes, and 30 tRNA genes coding for all 20 amino acids. Seven protein-coding genes, all rRNA genes and seven tRNA genes are duplicated due to the IR, raising the total number of genes to 132. Six of the 30 tRNA genes and 12 of the 80 protein-coding genes contain introns; of the latter, two harbor two introns while the rest has one single intron. Two pseudogenes are present in Leucanthemum: ψ-ycf1 and ψ-rps19 are located at the IR/SC boundaries (Fig 3), their lack of functionality being due to only partial duplication. The trans-spliced rps12 gene is found in the LSC (5'-end) and the IR regions (duplicated 3'-end). A summary of all genes is given in Table 4.   showed six regions of high variance (Fig 4), corresponding to three intergenic spacers and two genes (ndhF and ycf1), which vary in 8-28 positions and have a variability of 0.55-1.38% (

Regions chosen based on high-variability (> 8 variants) windows in a mapping of Leucanthemum virgatum
Illumina reads to the L. vulgare Illumina de novo assembly (Fig 4). For each region, the position on the L. vulgare genome and the variable fraction of the marker relative to its length are given. Hotspot regions pertain to peak regions in Fig 4; their position is given based on the first and last variant in the region.

Illumina and Nanopore data
The seven assembly methods tested on L. vulgare resulted in plastome sequences of lengths ranging between 125,633 bp (100%) in the Illumina de novo assembly and 124,851 bp in the Illumina mapping (Artemisia) assembly ( the other approaches, most erroneous substitutions were introduced by Nanopore de novo assembly (18.5%). Table 6. Comparison of six assemblies of the Leucanthemum vulgare plastome, obtained using different methods and data types, with a "gold-standard" Illumina de novo assembly.
Two data types (Illumina, Nanopore reads) and three analysis pipelines were used to assemble the L. vulgare plastome: de novo assembly, reference-based (mapping) assembly and hybrid de novo assembly using Nanopore reads corrected by Illumina data. Comparisons were made based on alignment of the respective assembly to L. vulgare de novo (Illumina). Percentages of identity and 'alignment positions with N' are referable to the length of the alignment; 'alignment positions with N' denotes the amount of Ns required for alignment to the Illumina de novo assembly. L., Leucanthemum; IRA, inverted repeat A; n.a., not applicable; d.n., de novo; map, mapping assembly; virg, L. virgatum Illumina de novo assembly used as reference for mapping; Art, Artemisia frigida as obtained from Genbank used for mapping; both references without the IRA.

Enrichment of chloroplast DNA using long-range PCR
For the present study, we relied on enrichment of cp DNA via long-range PCR. For this approach, several primer sets are available (e.g., [60], producing nine fragments). The protocol chosen here [28] resulted in 16 fragments with a maximum length of 12,499 bp.
Compared with other approaches, the advantage of long-range PCR is that in most cases, it will exclusively enrich for plastid DNA, including segments possibly transferred there from the nucleus or mitochondrion, while omitting nuclear inserts of plastid DNA and requiring substantially less leaf material. Of course, the inclusion of paralogous cp genes from the nucleus or mitochondrion cannot be fully ruled out [2]. Another limitation is that highly degraded DNA cannot be used for long-range PCR, which limits its use with herbarium material [61,6]. The method is also unsuitable if larger rearrangements in gene order are to be expected, as have been found for example in Passiflora L. [62]. In any case, optimization of primers for the study group will almost always be necessary to amplify all fragments (in the present study, different primers had to be used for five out of 16 fragments, and several primers were slightly modified). It is also essential to use a high-quality-, proofreading polymerase to avoid misinterpretation of erroneously incorporated bases as true variants, and to remove unwanted primer sequence during read processing.
The main disadvantage of PCR-based enrichment however seems to be the resulting uneven read coverage, which is partly due to doubled sequencing at PCR-fragment overlaps.
However, apart from that, although fragments were pooled at equimolar ratios before sequencing in the present study, fragment coverage across the cp genome turned out to be very variable for Illumina as well as Nanopore reads (Fig 5). Possible reasons for such a behavior include adapter ligation biases and differences among targets regarding denaturation and annealing during library preparation [2]. The same effect was mentioned by [63], who by contrast found much lower variation in coverage when using DNA from isolated chloroplasts. Altogether, long-range PCR thus cannot be regarded the optimal approach for plastome sequencing, also because exceptionally high coverage in only a subset of regions might cause problems during Nanopore de novo assembly (see section 4.6.). A suitable alternative might be "genome skimming" [64,6] or solution-based hybridization ("target enrichment") via custom-designed oligonucleotide probes representing the cp genome [65,66]. However, both approaches require a closely related reference for chloroplast read extraction or bait design. Chloroplast DNA can also be obtained by isolating whole chloroplasts before sequencing. However, a high amount of fresh leaves is required for isolation [6], which may exceed 5 g or even 20 g [67].

Assembly of chloroplast genomes using NGS
The assembly of cp genomes using Illumina short reads is still regarded as the "gold standard" regarding data quality and integrity. The most common outcome of de novo plastid assembly today are three long contigs corresponding to the LSC, SSC, and IR regions [6], as presented for example in [21] and the present study. The G+C content of the two Illumina de novo assemblies presented here is typical for cp genomes [68] and suggests that no large portions of nuclear DNA have been erroneously incorporated. Both plastomes are highly similar (0.83% sequence divergence); as either species belongs to one of two main clades found in Leucanthemum [69,70], this might represent the average cp genome divergence within the genus. The cp genome of Artemisia frigida, belonging to a different subtribe, still shows only 3.69% divergence from L. vulgare, which points toward close relationships within Anthemideae as a whole. The SSC region of L. virgatum was assembled to be reversecomplemented with regard to that in L. vulgare. While such differences have been highlighted as "unique sequence rearrangement event" by [71] and also in other studies (see [72]), it has already been shown by [73] that molecules of both SSC orientations exist equimolarily within the same individual. It is also important to realize that with the use of PCR fragments that do not span the entire IR, it is per se impossible to distinguish between the two orientations.
The comparison of the two new plastomes with regard to sequence variability yielded five potential markers. The intergenic spacers trnE-rpoB, ndhF-rpl32 and rpl32-trnL were already described by other studies as being phylogenetically informative within angiosperms or specifically within Asteraceae [74,75]. The ycf1 gene has been recognized as an interesting region for barcoding, since it contains a high percentage of parsimony-informative characters, which are mostly due to the frequent occurrence of SSRs in the region [76,77].
Another interesting aspect is the high variability at the end of the ndhF gene, in the region close to the IR/SSC boundary. However, sequencing of the entire ycf1 or ndhF gene will not be too effective in terms of variability per sequenced length, so if their use as a molecular marker is intended, it would be advisable to design suitable primers around their hotspot regions. The generally lower substitution rate of the IR region, which has been demonstrated in several plant groups (see [78]), is attributed to its duplicative nature in combination with the effects of biased gene conversion [79,80].

Hybrid assembly approaches
Three basic procedures can be used for hybrid genome assembly: either, an assembly is first generated based on Nanopore data alone and is subsequently polished with Illumina data (e.g., in [81]), often by using the tool Pilon. Alternatively, the recently published pipeline Unicycler follows a short-read-first approach, by assembling short reads to an assembly graph which is subsequently resolved using long reads. Third, long reads can be corrected by aligning short reads prior to assembly, for the benefit of providing high-quality input to the latter; this can be done by using the tool Nanocorr.
In the comparison of hybrid and non-hybrid approaches presented by , the authors found that the hybrid approach performed best, yielding a single high-quality (final error rate: 0.0007 per base of a set of validation reads) contig representing the whole plastome, as long as 20-fold coverage of long and short reads was provided [21]. Hybrid assembly was also superior to Nanopore-only assembly in [82], where the authors sequenced a mitogenome from Chrysanthemum L. The same effect was observed in the present study. Hybrid assembly using Unicycler (results not shown) yielded results identical to de novo assembly of short reads alone (three contigs), but was not able to assemble a continuous contig covering the two IRs. This came as no surprise as the longest Nanopore read was too short with only 12,532 bp long, matching the length of the longest PCR fragment sequenced. The use of long-range PCR thus limits the possibilities of Unicycler and Nanopore reads in general. [21] also note that the performance of the hybrid assembly was "highly dependent on the length of the long-reads": assembly of a single contig was already possible with a coverage of 5-fold of Nanopore reads of lengths > 30 kb (together with at least 8-fold short-read coverage), but failed when Nanopore read lengths were < 20 kb long (i.e., shorter than the IR region). Another hybrid approach pursued in the present study (correction of long reads using Nanocorr before assembly) resulted in five contigs of which two had to be split, and the final sequence was not completely identical to the Illumina de novo (and thus, the Unicycler hybrid) assembly (99.98%). This means that assembly based on short reads but supplemented with information from long reads produced better results than proceeding the other way around.

De novo assembly based on Nanopore data alone
Only very few studies have applied Nanopore sequencing exclusively for plastome assembly.
Bethune et al. (2019) used Nanopore combined with target enrichment to sequence seven taxa from Poaceae and Palmaceae, of which one was subsequently used as reference for the others [24]. From the six assemblies which were conducted with the Flye assembler [83], none recovered the plastomes in a single contig, but two or more contigs of varying length.
The authors [24] specifically noted the influence of input DNA quality on assemblies. DNA isolated from fresh tissue resulted in longer median read lengths and only two contigs with higher plastome coverage (with respect to the reference) than silica gel-dried samples. As mentioned above, this again underscores the importance of read length for Nanopore assembly. , in their comparative study mentioned above, found that Canu was able to assemble the Eucalyptus pauciflora Sieber ex Spreng. plastome into one or two contigs depending on read coverage, which however lacked some genome regions or covered others twice [21]. No assembly covered the complete cp genome; the final error rate after polishing with Nanopolish decreased to 0.0022 (compared to 0.0007 in the hybrid assemblies). Our Nanopore-only assembly yielded slightly better results: the Canu assembly produced six contigs, which however covered all of the cp genome, except the first 152 bp.
Although overlap among the contigs was up to ~830 bp, no regions were covered twice. As a side note it is important to realize that despite the often advocated advantage of de novo assembly, namely no need for a reference genome, is not completely met for cases in which more than one contig is assembled. In our analyses, the six contigs all overlapped, but with very variable overlap lengths, with one overlap being as little as seven bp, five of which were a C-homopolymer. In such cases, reliable merging of contigs is often realized by mapping to a reference sequence (e.g., [84]; see also [85]), as was done here.
Although the Nanopore-only de novo assembly presented in our study could not fully keep up with that from the hybrid approach using Nanocorr (sequence identity to Illumina de novo 99.98%, see Table 6), it reached a comparatively high sequence identity of 99.59%. This is comparable to the final 99.5% nucleotide identity of the assembly by Loman et al. (2015) [11], whose methods were also applied in the present study (read correction with a POA graphs -dependent consensus algorithm as implemented in Canu, polishing of the final assembly with Nanopolish). Regarding the remaining errors in our assembly, almost 70% of all mismatches were gaps; the accumulation in deletion errors in homopolymer runs is clearly visible in visualizations of mappings prepared for the mapping approach and increases with homopolymer length. This fits the expectation for non-random Nanopore sequencing errors mentioned in [7].

Reference-based assembly using Nanopore data
Apart from de novo assembly, plastome sequences can also be obtained by simply aligning reads to a reference genome and creating a consensus sequence from the mapping. This approach is only suitable for studies of closely related taxa with uniform chloroplast genomes [2,6]. The reason for that is that a mapping consensus will always adopt characteristics of the reference from which it was inferred. Structural rearrangements or additional transferred genes might easily be missed, especially when mapping short reads; the structure of repeat regions cannot be distinguished by mapping based approaches at all [86]. For the assembly of the globe artichoke (Cynara L.) cp genome with Illumina reads, [76] compared mapping and de novo approaches and found the results to be almost identical. [87] analyzed intraspecific diversity in Brachypodium P.Beauv. with Illumina data. According to their results, the reference-guided approach yielded fewer and longer contigs than de novo assemblies in most cases. In the present study, mapping assemblies based on Illumina data were found to be of somewhat lower quality, being 99.27% and 99.74% identical to the de novo assembly, respectively, depending on the mapping reference used.
However, to our knowledge, no study has yet employed Nanopore sequencing reads for reference-based assembly of plant chloroplast genomes. The mapping assemblies of L.
vulgare presented here were produced using read lengths of up to 3,000 bp and two closelyrelated references, which do not show any structural reorganization compared to L. vulgare and also have the same gene order (nucleotide identities to L. vulgare Illumina de novo: L. virgatum 99.17%, A. frigida 96.31%). Illumina mappings were superior to Nanopore mappings for both references, but surprisingly, nucleotide divergence (from the Illumina de novo assembly) was only slightly worse in Nanopore mappings, with the difference between the two being as low as 0.23% when using the L. virgatum reference. However, this difference rose to 0.98% when using Artemisia. This loss in quality caused by using the more distantly related Artemisia instead of L. virgatum was found for both datatypes, but seemed to be worse with Nanopore: Illumina assemblies suffered a quality decline of 0.47% nucleotide identity when switching to the more distant reference, Nanopore assemblies declined by 1.22%. This emphasizes the great importance of choosing a suitable reference in mapping-based assemblies of Nanopore data. Using the L. virgatum reference produced a consensus of almost comparable quality to the de novo Nanopore assembly, and superior to the Illumina assembly based on the more distant reference (see Table 6).
A second essential point for obtaining high-quality reference-based assemblies with Nanopore as well as Illumina data is the choice of software used for mapping and variant calling. Preliminary tests with popular short-read mappers (Bowtie2 [88], BWA-MEM [89], and BBMap) showed that despite all three tools claim to support gapped alignment, only BBMap (which also supports long reads) was able to correctly map reads spanning large gaps (for mapping against the Artemisia reference, the largest deletion occurring in L. vulgare had a length of 459 bp). Bowtie2 and BWA-MEM either soft-clipped the respective reads or did not map them at all, leaving the site of deletion with a coverage gap. However, correct mapping across deletions is of utter importance as most variant callers will call the reference sequence at positions where read information is lacking, which would distort the resulting consensus sequence. Tested consensus-calling pipelines also included BCFtools mpileup plus call for Illumina data, which however was not able to correctly call indels as variants. For Nanopore data, variant calling was only successful using callvariants2.sh after mapPacBio.sh mapping.
An advantage of reference-based assembly is that the assembly process can be easily monitored by comparing the visualized mapping (e.g., in IGV) to the final called consensus.
Specific patterns produced by the used datatype or algorithm can thus be taken into account and potential errors avoided. For example, when mapping Nanopore reads, especially insertions will often not be mapped at a precise location, but slightly shifted up-or downstream in several reads, thus blurring the signal for callvariants2.sh, which will ultimately call the reference in these cases. This problem is aggravated with more distant references as exact alignment gets more difficult. Regarding the variant-calling process, mismatches and small (1-bp or 2-bp) indels were usually captured correctly in both Illumina and Nanopore mappings. By contrast, larger insertions in L. vulgare with respect to the reference were more likely to be ignored, leading to the large number of deletions observed in the consensus sequences (56.2-89.4% of all mismatches, see Table 6). This effect seems to be owed to the used algorithm, intensifies with insertion length as well as more distant references, and is generally more pronounced with Nanopore data due to blurred mapping as described above. Deletions in the mapped taxon with respect to the reference were however more often called correctly from Nanopore reads, and almost always from Illumina reads. Callvariants2.sh also deals well with declining read coverage in homopolymer runs and mostly makes the correct full-length call.

Optimizing Nanopore de novo and reference-based assembly
A read coverage of at least 30-fold is generally recommended for Illumina de novo as well as reference-based plastome assembly [64,6]. In our Illumina de novo assembly of L. vulgare, only 0.26% of bases received coverage below that threshold (Table 3). Accordingly, the Unicycler pipeline yielded the lowest possible number of contigs when using short reads and no read extension methodology [6]. The L. virgatum assembly, with 6.62% of bases having fewer than 30-fold coverage, resulted in five contigs. Although a complete plastome sequence was easily obtained, this somehow suboptimal result might be due to a lowered kmer frequency as a consequence of low coverage in certain regions. Regarding Nanopore de novo assembly, a coverage of 30-fold will likely not yield optimal results (but see [11]).
However, a minimum of 50-fold coverage in over 99.5% of the bases seemed to be sufficient for reliable assembly (Table 3). If coverage is too low, problems might arise with reads derived from PCR fragments with only small initial overlap; after adapter-, primer-, and quality-trimming, too many reads might get too short, precluding a continuous assembly. Fragment 9 and 10 in our analyses had an overlap of only 134 bp between the two primer sequences; this resulted in a very low coverage of down to 5-fold in the Nanopore-only assembly, and even a gap in a preliminary run with different settings (not shown). In Nanopore mapping assemblies, very low coverage might result in variants not being called, which will compromise the resulting consensus.
However, a much more important problem observed with Nanopore data is the detrimental effect of too high coverage.  recognized an effect of too low or high coverage on assembly quality, stating that "there is a complex relationship between assembly accuracy and input read coverage when assembling chloroplast genomes with ONT data using Canu" [21]. [6] mention that de novo assembly performance could be improved by downsampling the input reads, and [86]  coverage-reduced dataset should be used for de novo assembly, for example via the readSamplingBias parameter in Canu. This has the side-effect of preferentially using the longest reads for analysis, which might also help to improve the results. With long-range PCR based reads, however, care must be taken to ensure that reads originating from short fragments are not accidentally filtered completely.
Filtering plastome reads from possible nuclear or contaminant reads, for example by mapping to a known reference, is a common procedure prior to assembly. However, applying too stringent filters might result in the loss of true plastid reads in the case of atypical plastomes or cp genomes that have incorporated mitochondrial or nuclear DNA [6]. For this reason, read filtering other than removing the PhiX spike-in was omitted in Illumina de novo analyses shown here. Unicycler proved relatively robust to this kind of perturbation: apparently, present contamination reads in the L. vulgare sample were assembled into an additional contig (showing BLAST matches in Proteobacteria) that could simply be excluded from further processing. In contrast to Unicycler, Canu was much more susceptible to contaminant-related errors. Even though input reads were subjected to BLAST filtering prior to de novo assembly, the latter still yielded a chimeric contig with one large part matching the Artemisia chloroplast and a small part blasting within Proteobacteria. Although the removal of this contig did not deteriorate the quality of the final assembly, the risk of obtaining erroneous contigs might be higher with Canu.
Another severe problem with the Nanopore reads used here for de novo assembly was the occurrence of chimeric reads. Two types of chimeras were observed. The first comprised reads with a first part being identical to a second part, but in reverse-complement orientation (so-called palindromes). It remains unclear what the source of this kind of error was. [47] describe a similar phenomenon in long reads that can occur during whole-genome amplification, so palindrome formation in cycles of long-range PCR at least seems conceivable. In contrast to the short, fragmented Illumina reads, the long Nanopore reads will introduce these errors into assembly. To avoid this, before starting de novo assembly we subjected reads to Pacasus, which splits reads containing palindromic sequences. Splitting proved essential, as initial assemblies with uncorrected reads produced a large number of (sometimes palindromic) contigs, rendering impossible the generation of a proper consensus sequence.
According to [90], who analyzed several types of chimeric reads originating from MinION sequencing, these can also consist of dissimilar fragments which are concatenated by the same sequencing adapter accidentally ligated in between them during library preparation.
This type of chimera might also result from a very rare, Nanopore-specific process which has been called "in silico chimerism" by [90]. Here, too fast pore reloading results in the recognition of two fragments as one by the base-calling algorithm. A corresponding filtering step is incorporated in the Nanopore bioinformatic pipeline (Porechop). In the analyses presented here, 22.6% of the total flow cell reads were discarded due to putative middle adapters. Although the filtering was set to be quite stringent, it seems possible that chimeric reads passed the filter, and acted as bridges to connect sequence regions in Canu contigs which are actually distant from each other on the plastome. Indeed, from the ten total contigs produced by Canu for Nanopore and hybrid assemblies, three had to be split due to such effects. It must be stressed here that without a reference plastome, detection of chimeric contigs would not have been possible, and this technique of course harbors the risk of missing true rearrangements in the plastome to be assembled.
As a concluding remark, the frequency of chimeras in our dataset seems to be unusually high, considering the fact that all types of chimeras found by [90] together comprised only 1.7% of the reads and [47] described palindromic reads as occurring during whole-genome amplification, not long-range PCR. Other, unknown processes may thus be responsible for the phenomena observed. At least, both contaminant as well as chimeric reads are negligible problems for reference-based assemblies, as the initial mapping step will result in dismissal of the former as unmapped, while the latter will likely result in supplementary mappings, which are also excluded from the output if BBMap is used with the settings described here.

Conclusions -Lessons learned
Based on the results obtained in this study, we provide some rough guidelines on the sequencing of plant cp genomes. Regarding data production, PCR-mediated genome reduction has the advantage of being easy to realize in an ordinary lab, but will depend on the performance of universal primer sets in the studied group. Several shortcomings must also be taken into account (see section 4.1.). For enrichment of plastid DNA with subsequent Nanopore sequencing, the use of PCR must be regarded as suboptimal as it nullifies some of the main advantages of the latter. While Nanopore sequencing can now produce reads with lengths of more than two Mb [91], PCR-derived reads will currently have a maximum length of 23 kb [60]. This also precludes the disambiguation of the two IRs, which may otherwise be possible. If the use of long-range PCR plus Nanopore sequencing is intended, it is advisable to • use a high-quality, proofreading polymerase for minimal entry of PCR artifacts into reads; • choose fragment sizes as large as possible; this also requires the use of appropriate DNA extraction techniques; • avoid too short overlaps among PCR fragments (preferably, > 300 bp) as in combination with low Nanopore read coverage, these might lead to fragmented assemblies; • downsample regions of uneven coverage, for example fragment overlaps with doubled coverage, using VariantBam [92] or related tools; and • account for palindromic sequences which might occur whenever PCR is used; these must be properly handled prior to assembly with the Pacasus tool or other appropriate software.
Hybrid assembly approaches with both Nanopore and Illumina data still yield better results than only using Nanopore reads. However, depending on the research question, Nanopore sequencing is also capable of delivering high-quality results at a comparatively lower cost.
Our results show that the vast majority of the 221 deletion errors within the Nanopore de novo assembly lies within homopolymer regions of five or more bases; the same is true for a number of insertion errors. Exclusion of such regions will greatly reduce the error rate of the respective assembly. If phylogenetic analysis of several cp genomes in a plant group is intended, the genomes can be easily sequenced and analyzed after masking the respective homopolymer regions, which should be excluded from this type of analyses anyway. Of course, the degree of relatedness in the respective plant group must be considered. While in the Nanopore de novo assembly, the error rate of 0.41% sequence identity (including all errors) is much lower than the average divergence between A. frigida and L. vulgare To ensure that high-quality assemblies can be obtained from analyzing Nanopore data alone, several important points should be taken into account: • reads with middle adapters (i.e., chimeras consisting of two different reads) must be rigorously removed or at least bioinformatically split, to avoid the introduction of wrong proximity information into assembly • for de novo assembly using Canu, the elimination of contaminant reads (nuclear DNA, other organisms) prior to assembly is vital. This can be done by blasting against a curated database of known plant genomes • read coverage should be kept at reasonable heights for de novo assembly, especially when using uncorrected reads (as a gross starting point, not higher than 300-fold for the final unitigging step), for example by subsampling reads before assembly (e.g., with SAMtools or BBTools) or by applying the Canu parameters readSamplingBias / readSamplingCoverage • preferentially including the longest reads might improve assembly quality; use the above Canu parameters or the minReadLength parameter in Canu In some situations, for example with limited computational resources or large amounts of samples to assemble, reference-based assembly might be favored over de novo approaches. However, it should not be conducted in groups with known plastome variability or structural reorganization, but rather for population genomics or in genera of very closely related species. If these requirements are met, mapping assemblies can yield a consensus comparable to that of de novo assembly, as shown in the present study. For this, it is essential to • use suitable software, for example BBMap plus the callvariants2.sh script. However, discrimination of the two IR copies will not be possible with BBMap as it cannot handle very long reads; • correctly set software parameters. This may require some optimization; for a new plant group, it might be advisable to use parallel Illumina sequencing in the first plastome. for example, variant calling based on a majority vote for alleles might not work for the errorprone Nanopore reads. The minimum allele fraction required for a call could therefore be lowered (e.g., to 0.45); and • choose the closest reference available (from the same genus, ideally the same species).
According to our results, the quality of Nanopore reference-based assemblies decreases disproportionally fast with more divergent reference genomes, even if these still have sequence identities above 95% to the studied taxon. The reference should include the complete IR A or only parts of it, depending on read length, to allow unambiguous mapping of all reads.

Acknowledgements
The authors wish to thank Gunter Meister and Norbert Eichner at the Department of Biochemistry I (RNA-Biology) at the University of Regensburg, for kindly providing access to their Illumina sequencing facility and Tapestation device, and for helpful advice regarding library preparation and sequencing. We are also grateful to Dina Grohmann and Felix Grünberger from the Department of Microbiology (Archaea center) for providing equipment and laboratory material needed for Nanopore sequencing, and for kind help regarding Nanopore data processing and analysis. Michael Rehli and Alexander Fischer (Laboratory for Mononuclear Phagocyte Biology at the University Medical Center Regensburg) are acknowledged for spike-in of two library samples into one of their MiSeq sequencing runs and helpful suggestions regarding size selection. Special thanks are due to Ulrich Lautenschlager and Tankred Ott for their patient and constant help regarding all aspects of bioinformatic analyses, and for providing scripts and useful advice. We thank Anja Heuschneider for technical assistance and invaluable help in the lab. Susann Wicke is acknowledged for her kind advice regarding chloroplast genome annotation.