Mitogenome sequence accuracy using different elucidation methods

Mitogenome sequences are highly desired because they are used in several biological disciplines. Their elucidation has been facilitated through the development of massive parallel sequencing, accelerating their deposition in public databases. However, sequencing, assembly and annotation methods might induce variability in their quality, raising concerns about the accuracy of the sequences that have been deposited in public databases. In this work we show that different sequencing methods (number of species pooled in a library, insert size and platform) and assembly and annotation methods generated variable completeness and similarity of the resulting mitogenome sequences, using three species of predaceous ladybird beetles as models. The identity of the sequences varied considerably depending on the method used and ranged from 38.19 to 90.1% for Cycloneda sanguinea, 72.85 to 91.06% for Harmonia axyridis and 41.15 to 93.60% for Hippodamia convergens. Dissimilarities were frequently found in the non-coding A+T rich region, but were also common in coding regions, and were not associated with low coverage. Mitogenome completeness and sequence identity were affected by the sequencing and assembly/annotation methods, and high within-species variation was also found for other mitogenome depositions in GenBank. This indicates a need for methods to confirm sequence accuracy, and guidelines for verifying mitogenomes should be discussed and developed by the scientific community.


Introduction
Mitochondrial DNA (mtDNA, mitochondrial genome or mitogenome) is used as a fundamental genetic marker in many research areas, including population genetics, evolutionary biology, phylogenetics and phylogeography, biodiversity studies, and molecular ecology [1][2][3]. Insect mitochondrial genomes are mostly about 15-22 kbp, circular, double-stranded DNAs that encode a set of 37 genes and a large control region (designated as the A+T-rich region in insects) [3,4]. The genes comprise 13 protein-coding genes (PCGs), 22 transfer RNAs (tRNAs) and two ribosomal RNAs (rRNAs). The A+T-rich region is responsible for regulating a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Despite acknowledgment that assembly and annotation software/methods vary considerably in their quality [25], there has been limited information about the accuracy of the deposited sequences and their annotations, introducing uncertainty about the mitogenome sequences. Public DNA databases are repositories (libraries) and the depositor is expected to provide sequences that meet basic quality criteria regarding the PCG annotations (e.g., presence of start and stop codons, no internal stop codon, etc). The depositor is assumed to have accurate sequences and is not required to provide methodological detail or a proof of quality.
Pooling multiple species in a single library would reduce the cost of library construction compared to single species libraries [20,22], but may increase the probability of interspecific chimeras, resulting in less accurate sequences. De novo assembly methods may generate more fragmented assemblies, but use of reference genomes may bias the sequence to the reference or propagate errors in the reference. In this work, we examined the effect of using different sequencing methods (number of species in a library, insert size, sequencing platform) and different assembly and annotation methods in the elucidation of mitogenomes of three species of ladybird beetles.

Sample preparation and DNA shotgun-sequencing
Adult Cycloneda sanguinea, Harmonia axyridis and Hippodamia convergens (Coleoptera: Coccinellidae) were collected from agricultural fields in the Federal District, Brazil. All collections were authorized by SISBIO (authorization number 36950), and access to the genetic heritage and transportation of biological material were authorized by IBAMA (authorization number 02001.008598/2012-42). These species are widespread and important natural enemies of agricultural pests in Brazil and other countries. Three specimens of each species (about 25 mg of body weight) were used to extract the total DNA separately from each species by DNAeasy Blood and Tissue Kit (Qiagen, Germany), following the manufacturer's instructions. The purified DNA was quantified with Qubit HS (Invitrogen, USA). Two sequencing methods were used. In the first, the three species were pooled without individual tagging after the DNA in the samples was normalized to an equal concentration of 1 μg to make one TruSeq Nano library (Illumina, USA). The library was sequenced on the MiSeq (2×250 bp, insert size 850 bp, 500 cycle kit) using 15% of the flowcell (Illumina, USA). In the second, single species libraries were used with 1μg of DNA per sample to construct individual TruSeq Nano libraries and sequenced with HiSeq 2500 Rapid Mode (2×250 bp, insert size 550 bp, 250 cycle kit) using 2% of the flowcell (Illumina, USA). We switched to the HiSeq platform due to its much higher sequencing depth and cost-effectiveness.
To assign the assembled mitogenomes to particular species in the pooled samples, we used barcodes cox1 and cytb as 'baits'. Fragments of the 5' portion ('barcode region') of the mitochondrial genes cytochrome oxidase 1 (cox1) and cytochrome oxidase b (cytb) were determined independently for each species by sequencing the PCR amplicons generated respectively by the mitochondrial universal primer pairs Jerry & Stev-Pat_R and Sytb-F & Sytb_R, according to Timmermans et al. [30]. The amplicons were checked on 1% agarose gel, purified using QIAquick PCR Purification kit (Qiagen), and sequenced by Sanger methodology (ABI3700 sequencer). The barcode baits were assigned to the assembled mitogenomes using the Custom BLAST tool (E-value = 0) in Geneious v7.0.5 [29], selecting hits with a minimum length of 95% of the query coverage with at least 98% identity.
In the second method, the quality of the raw dataset was evaluated by FastQC v0.11.3 (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and the adaptors were trimmed by Fastqc-mcf v1.04.807 [31] and Cutadapt v.1.9.1 [32] before assembly. The mitogenomes were assembled by MITObim v1.8 [15] using as reference the mitogenome of the coccinellid Coccinella septempunctata (GenBank: JQ321839.1) and the 5' portion of the genes cox1 and cytb of each target species as baits. The same reference mitogenome was used for the three target coccinellid species because it was the most closely related species to all three targets that was available at GenBank at the time of the analysis. The reference and all three targets are in the subfamily Coccinellinae and tribe Coccinellini, and no other species in the tribe was available at GenBank to use as a reference, except for H. axyridis. However, we considered that using H. axyridis as a reference would unduly bias the assembly of the mitogenomes.
The sequencing coverage [33] for each of the four elucidated mitogenomes for each species was obtained by mapping the reads (Q20<90%) in a source library to the respective mitogenome with Geneious v7.0.5 [29] using the parameters: no gaps allowed, minimum overlap 30, maximum mismatches per read 0, minimum overlap identity 99%, maximum ambiguity 4. For methods 2 and 4, in which the assembly method was MITObim, we mapped the reads retrieved in the final iteration.

Annotations
Two annotation methods were used. The first was used with the de novo assembly method and consisted of annotating the obtained mitogenome scaffold for the expected tRNA sequences using COVE v2.4.4 [24] with covariance models of Coleoptera [30] according to Crampton et al. [20]. The annotated tRNAs were imported to Geneious v7.0.5 [29] and the target mitogenome was aligned with the annotated mitogenome of the coccinellid Co. septempunctata (GenBank: JQ321839.1), which was used again as a reference, using standard parameters in MAFFT v7.017 [34]. After this, manual curation was performed to check: a) the expected order of the PCGs and tRNAs; b) the match of the predicted open reading frames (invertebrate mitochondrial genetic code) with the pattern of the reference mitogenome (cox1 does not start with an ATG as methionine and some nad genes start with ATT); c) the presence of stop codons in the middle of the coding regions of the target sequence; d) any deletions or insertions that would change the translation frame; e) the stop codons at the end of the coding sequence (the mitochondrial stop codons are TAA, TAG, TA or just T); and f) the orientation of the genes (e.g., some nad genes are reverse-oriented). Then, the annotation was transferred to the target mitogenome (selection "Annotation">"Copy to all selected region"), and the annotated target mitogenome was extracted from the alignment. Mitogenome completeness was checked when the length was longer than 16 kb and when both nad2 and 12S rRNA were present by searching for overlap between the starting and ending regions by aligning the first and last few 100 bp using MAFFT v7.017 [34].
For the second method, which was used with the MITObim assembly method, mitogenome annotations were carried out by MITOS (v763), available for online use at http://mitos.bioinf. uni-leipzig.de/ [18]. The sequences (Fasta format) of the mitogenomes were uploaded individually for each species to the MITOS server, using the default parameters and invertebrate genetic code. The GFF files generated by MITOS were imported to Geneious v7.0.5 and combined with the original Fasta sequences.
The annotated mitogenomes generated using the combination of all these methods ( Table 1)

Mitogenome comparisons
The mitogenomes from the four different methods (two sequencing methods each with two assembly/annotation methods) were compared regarding number of genes obtained and sequence identity using MAFFT alignment v7.017 [34] with the algorithm auto, scoring matrix BLOSUM62, gap opening penalty 1.53, and offset value of 0.123 within Geneious v7.0.5. Annotations were compared regarding gene location/position, length, and order, number of genes annotated and gene transcriptional direction.

Results
The number of raw reads after quality control for the pooled library was 1,663,759 and for the single libraries: C. sanguinea 4,587,115; H. axyridis 3,233,855; Hi. convergens 2,827,085. The lower number of raw reads in the pooled library might have been related to pooling of the species, the longer insert size (850 versus 550 bp) and the use of the MiSeq platform. There was variation in the number of reads mapped to the assembled mitogenomes across pipelines (Table 2), which was directly related to the coverage (Fig 2), but not related to the number of raw reads. Different methods conferred different average coverage. The mitogenomes of C. sanguinea and Hi. convergens were elucidated for the first time.

Mitogenome completeness
The mitogenomes of the three ladybird beetle species differed across pipelines within species regarding completeness (number of genes) ( Table 2). Overall, the coverage was good (>10 [35]) except for parts of C. sanguinea method 2 and a few nucleotides in the other mitogenomes (Fig 2). The mitogenomes assembled de novo with customized annotation generally were incomplete, lacking some PCGs and/or tRNAs. Overall for all the species, assembly using MITObim (methods 2 and 4) resulted in the complete number of mito-genes, whether species were pooled or not in the libraries. The only exception was for the C. sanguinea mitogenome obtained by method 2, which for unknown reasons had no or low coverage in several regions (Fig 2). However, even for methods 2 and 4, it is not clear if the length is complete, as none of the sequences could be circularized.

Mitogenome sequence identities across pipelines
In addition to differences in completeness, there were differences in the sequence identities across pipelines for a same species. Table 3    3 would reveal the combined interaction effects of the two sets of factors assembly and annotation method with sequencing method. The low identities (Table 3) found among the mitogenome sequences are related in part to gaps in one of the assemblies, but there are substantial differences in the sequences even in regions without gaps (Fig 3). The effect of the assembly and annotation method differed depending on the species. The mitogenome sequences of a species from the pooled library were more similar (1 vs 2) than from the separate species libraries (3 vs 4), suggesting that the different assembly and annotation methods generated greater differences in the sequences when the species were in individual libraries than when pooled in the same library. However, mitogenomes from C. sanguinea showed the opposite; being more similar from the individual species library.
Similarly, the effect of sequencing method differed depending on the assembly and annotation method and the species. For two of the species, the mitogenome sequences of a species were more similar when assembled by MITObim (2 vs 4) that when done de novo (1 vs 3). This indicated that the effect of sequencing method was greater when using de novo assembly than MITObim. Again, mitogenomes from C. sanguinea showed the opposite; being more similar using de novo assembly than MITObim.
Finally, for all species, mitogenome sequences of a species were more similar comparing pooled de novo assembly with single-species MITObim assembly (1 vs 4) than methods 2 vs 3. Thus, the effects of the two sets of factors depended on each other.
The gray and black bars above the annotated and aligned sequences in Fig 3 show the points of dissimilarity among the four methods for each species mitogenomes. While some of the dissimilarity may have been due to low coverage (e.g., C. sanguinea method 2 for nad2 through the first half of cox1), dissimiliarity persisted even when coverage was high (e.g., C. sanguinea method 2, second half of cox1 to the first half of cox2 and the second half of cox3 to the first half of nad5). Overall, there was no clear relation between low coverage and dissimilarity among the sequences.
A more detailed examination of the differences generated by the four pipelines is shown in Fig  4 for C. sanguinea. For example, at position 2,347 of the consensus sequence of the nad2 gene, methods 1 and 3 resulted in an adenine (A), but methods 2 and 4 resulted in a guanine (G), indicating that different assembly methods gave different sequences. At position 5,288, methods 1 and 2 had an A and methods 3 and 4 had a G, indicating that different sequencing methods gave different sequences. Finally at positions 2,379 and 4,795, methods 1 and 4 resulted in the same base (thymine (T) or G) while methods 2 and 3 resulted in a different base (A), indicating that both sequencing and assembly methods matter. These differences were not associated with insufficient (considered to be <10-fold redundancy [35]) or differences in coverage (Table 4).

Mitogenome annotations
As expected, the annotations of a species varied across different combinations of libraries and assembly/annotation methods, even when the same annotation method was used  (Figs 2 and 3). Methods 1 and 4 provided the most similar annotations for C. sanguinea, going from the gene trnQ to rrnS, although with a length difference in the nad2 gene. These were the only methods providing a sequence with all of the expected mito-genes. Methods 2 and 3 provided shorter mitogenomes with fewer annotated genes. For H. axyridis, methods 2, and 4 provided sequences with all of the expected mito-genes. These were equivalent to that from method 1, except for the length of the nad2 gene, which in method 1 was split into two pieces, and in the presence and position of trnI. Method 3 provided the most incomplete annotation (missing initial tRNAs and cox1, and rrnS at the other end) and had different gene size annotations for cox2, rrnL, rrnS, although it was the only one that presented trnW before rrnL gene. The gene nad2 is also smaller and presented immediately after rrnL. For Hi. convergens, methods 2 and 4 provided sequences with all of the expected mito-genes, starting from trnI to rrnS. The genes nad2, cox1 and rrnL differed in size according to the pipeline used. Method 1 provided a shorter mitogenome sequence and an incomplete annotation, although it was the only one that presented trnW before rrnL gene. Overall, method 4 (use of MITObim to assemble and MITOS to annotate) seemed to be the one that provided the most complete annotation for all the species.

Discussion
Our main result is that for three related species, different sequencing and assembly/annotation methods produced dissimilar mitogenome sequences within each of the three species. In many cases, different library data sets resulted in different nucleotides at the same position when the assembler was the same; the same library data set resulted in different nucleotides when the assembler was different; and different library data sets and assemblers resulted in different nucleotides. In addition, the effects of this methodological variation differed among the species. This sensitive dependence on sequencing and assembly/annotation method raises concerns. First, if we had relied on only one method, we would not know to question the veracity of the resulting mitogenome sequence. Three of the four methods produced at least one complete mitogenome sequence for the three species, and the remaining method might also produce complete sequences if we had used it on more species. Second, good coverage (>10-fold redundancy) did not eliminate dissimilarity, but might provide illusory confidence in the accuracy of the resulting sequence. Third, the methodological effects differed among the three species, suggesting that finding a "best" method may be challenging, possibly because the mitogenomes of some species may be more difficult to elucidate than others. Although MITObim often provided the mitogenomes with the full mito-genes (Table 2), it did not always do so (e. g., C. sanguinea with the pooled library).
We expected that the individual species libraries on the HiSeq platform with smaller insert size would provide higher mitogenome completeness because of the greater sequencing depth and elimination of complications due to the relatedness of the species. However, the mitogenomes of all three species obtained by method 3 (single libraries and de novo assembly) were incomplete, despite the high coverage and sequencing depth. Pooling different species in a same library without tagging did not seem to affect significantly the quality of the resulting sequences. Other researchers have successfully pooled species [22,36,37]. According to Tang et al. [22] species as close as congeners can be pooled and mitogenomes successfully assembled with more than 50 species in a single lane of HiSeq 2000 (with 100 ng genomic DNA). Although the creation of chimeras is a concern, especially among closely related species, in our case, their effect was minimal as the annotations were equivalent across sequencing pipelines (Figs 2 and 3), with no differences in gene order or transcriptional direction, and only a few differences in gene length. We compared the identity of the mitogenome sequences of H. axyridis obtained in this work with the H. axyridis KR108208 mitogenome deposited in GenBank [38]. The KR108208 sequence is likely to be a high quality sequence, although it is missing trnI and trnQ, because it was elucidated using LR-PCR and primer walking coupled with Sanger sequencing, and BLAST search in GenBank and tRNAscan-SE to assemble and annotate the mitogenome (personal communication, SJ Wei). There was 91.07, 97.81, 74.05 and 89.62% identity, respectively with our four sequences (Fig 5). The dissimilarities mostly occurred in non-coding regions such as the A+T-rich region and in the terminus. Method 2 was most similar to this comparator and perhaps provided better results for H. axyridis mitogenome elucidation and annotation. However, method 4 differed mainly in the A+T-rich region that was missing from the others.
Our results led us to question the reliability of the mitogenome sequences deposited in Gen-Bank. In general, these sequences do not have information about the sequencing, assembly or annotation method, so it is not possible to evaluate independently the completeness or accuracy of the deposited sequences. We examined the 2,469 complete Insecta mitogenomes deposited at GenBank and found 60 species with mitogenomes provided by different research groups on species that did not have subspecies or biotypes. Of those, eight species had mitogenome sequence identity <90%. For example, Bemisia afer, KR819174 (15, [39] found mitogenomes of Nilaparvata lugens (14,371 bp) that were much smaller than that reported previously (15,190 bp) [40]. Unfortunately, in many cases, the methods used to elucidate and annotate the mitogenomes were not provided in the GenBank deposition or in the associated publication (when available). While it is possible that these dissimilarities might be due to misidentification of the species, several of the eight species are easy to identify, so this is an unlikely explanation for all of the large intraspecific variations in sequence similarity in GenBank. Studies on intraspecific variation in parts of the mitogenome typically reported identities >97% [8,39,41]. Hence, similarities below 90% are much lower than expected for intraspecific variation, and are more characteristic of interspecific or intergeneric variation.
We suggest that there is a need to adopt additional reporting requirements and a validation method for mitogenome deposition at public DNA databases. Based on our results, at minimum, information about the sequencing, assembly and annotation methods are relevant, and should be associated with a Sequence Read Archive (SRA) deposition. In addition, coverage should be reported, although this should not be relied on to validate the sequence. As biological inferences in several biological fields have been made and are being made relying on these deposited sequences, a check on accuracy is an issue that urgently needs to be addressed by the scientific community as more and more sequences are expected to be provided (Fig 1) through different bioinformatics pipelines [42].
More broadly, we suggest that guidelines need to be developed by the scientific community to verify the accuracy of mitogenome sequences prior to deposition, perhaps modeled on those for qPCR analysis (The MIQE guidelines) [43]. Until then, one alternative to verify accuracy before database deposition might be based on an approach similar to that used by Tang et al. [22] to assure sequence quality when 49 species were pooled in a library. This alternative could include the use of more than one assembly method (a hybrid combination of de novo and reference genome assemblers might be an interesting choice), concatenate their scaffolds in another assembler, and map the reads to identify regions with lower coverage. Sanger sequencing could be used to verify regions that are assembled only by one of the assemblers with lower coverage or that exhibit dissimilarity or gaps among assemblers.

Conclusions
Mitogenome completeness and sequence accuracy was affected by a number of methodological factors in the experimental design. The scientific community urgently needs to address the issue related to the lack of information regarding the quality of the mitogenomes sequences  [38] using standard parameters in MAFFT v7.017 [34]. The KR108208 has 16,387 bp, 13 PCGs, 20 tRNAs (missing trnI and trnQ) and 2 rRNAs. The mitogenomes sequenced here are displayed with highest similarity on top. From the top to the bottom are: the mitogenome obtained by method 2; KR108208; followed by methods 4, 1 and 3, respectively. The annotated mitogenomes are in color, with arrows representing the transcriptional sense of the gene, and small arrows representing tRNA genes. Grey bars represent similar nucleotides and vertical black bars represent dissimilar nucleotides in the target sequence compared to the consensus sequence, and horizontal black lines represent gaps in the target sequence.
https://doi.org/10.1371/journal.pone.0179971.g005 deposited considering that they have been generated by different methods for sequencing, assembly and annotation, among other factors. The deposition rate of mitogenomes has exploded due to the technical advancements in DNA sequencing methods and bioinformatics pipelines. Guidelines for verifying and validating mitogenome sequences should be discussed and established to assure mitogenome quality for future studies.