Characterization of Penaeus vannamei mitogenome focusing on genetic diversity

The diversity of the Penaeus vannamei mitochondrial genome has still been poorly characterized, there are no validated mitochondrial markers available for populational studies, and the heteroplasmy has not yet been investigated in this species. In this study, metagenomic reads extracted from the muscle of a single individual were used to assemble the mitochondrial genome (mtDNA). These data associated with mitochondrial genomes previously described allowed to evaluate the inter-individual variability and heteroplasmy. Comparison among 45 mtDNA control regions led to the detection of conserved and variable segments and the characterization of two hypervariable regions. The analysis of diversity revealed mostly low frequency polymorphisms, and heteroplasmy was found in practically all mitochondrial genes, with a high occurrence of indels. These results indicate that the design of mitochondrial markers for P. vannamei must be done with caution. The mapping of conserved and variable regions and the characterization of heteroplasmy presented here will contribute to increasing the efficiency of mitochondrial markers for population or individual studies.


Introduction
Mitochondrial genomes are short circular sequences composed of a small conserved gene repertoire and are poor in non-coding regions. These properties make the mitochondrial markers ideal for phylogenetic studies, allowing for the assessment of the population genetic structure [1]. The most common mitochondrial molecular markers are cytochrome oxidase and a noncoding region called the mitochondrial control region (CR). Inside the CR there is a highly variable sub-region called the "displacement loop" (d-loop) which is involved in mitochondrial genome replication [2,3].
Among penaeid shrimp, the characterization of CR has only been performed for Artemesia longinaris and Farfantepenaeus duorarum [4,5]. A study analyzing five F. duorarum populations revealed that CR presents more informative sites and haplotypes than the 16S rRNA or cytochrome oxidase subunit I gene [4]. In Penaeus vannamei, mitochondrial markers have a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 been used to assess the genetic diversity in few studies. These studies presented over 40 CR haplotypes across farmed shrimp from Mexico [6,7].
Currently, newer sequencing technologies, such as next-generation sequencing (NGS), are being applied to investigate the genetic variability in P. vannamei using shotgun sequencing approaches [8,9], transcriptomic studies to discover SNPs associated with growth [10], genotyping for genetic selection in brood stocks [11], acquisition of high-density linkage disequilibrium maps [12], and even a combination of short reads, long reads, and BAC-end sequencing to obtain its first genomic draft [13].
Despite these advances, to date, no studies have explored the heteroplasmy in P. vannamei. In this study we explored the genetic variability of the P. vannamei mitochondrial genome and its CR at single individual (intra-individual) and inter-individual levels, characterizing their respective genetic variability, and discuss the potential impacts of the application of mitochondrial markers to study shrimp diversity.

Data acquisition
The data used to investigate heteroplasmy were obtained from a previous study of shotgun sequencing of the caudal muscle of juvenile shrimp infected with the white spot syndrome virus-WSSV [14]. These data is available at https://www.ncbi.nlm.nih.gov/sra/PRJNA524694. To asses inter-individual diversity, sequences from four P. vannamei complete mitochondrial genomes (EF584003.1, NC_009626.1, KT596762.1, and DQ534543.1) were downloaded from GenBank. The sequence EF584003 was removed because it was curated by NCBI, receiving the new code NC_009626. In addition, 41 sequences of the P. vannamei mitochondrial control region (CR) were obtained from NCBI's PopSet database (n = 41; https://www.ncbi.nlm.nih. gov/popset/260667508) [7].

Mitochondrial genome assembly and annotation
For mitochondrial genome (mtDNA) assembly and variant calling, raw reads were trimmed using Fastpv0.19.6 [19], removing bases with a PHRED score below 20 at both ends (--cut_front 20 --cut_tail 20) and removing reads with size inferior to 50 bp (--length_required 50). Processed reads were then mapped to the reference P. vannamei mitochondrial genome (NC_009626.1) using the 'mem' algorithm from Burrows-Wheeler Aligner [20]. Aligned reads were extracted with Samtools 1.9 [21] and used in SPAdes v3.14 [22] for de novo assembly of the mitochondrial genome using the '--careful' parameter. Geneious r9 [23] was used to circularize the largest contig and then transfer annotations from the reference sequence using a 95% similarity threshold. The read coverage depth of the assembly was obtained directly from the SPAdes' assembled scaffold sequences. Detailed mapping statistics of the assembly is presented in S1 File.

Alignment of the mitochondrial genomes and control region sequences
The four complete mitochondrial genomes (mitogenomes) were rotated to a common starting point using the software MARS [24]. The rotated mitogenomes were then aligned with the other 41 control region sequences using MAFFT online v7.423 [25] (https://mafft.cbrc.jp/ alignment/software/), using the option 'adjust direction according to the first sequence'. The mtDNA assembled in this study or its control region were used as first sequence in both MAFFT alignments. The control regions from the complete mitogenomes were then extracted from the alignment using Geneious and re-aligned with the other 41 CRs via MAFFT. An alignment with only the four rotated mitogenomes was also performed using the same MAFFT parameters.

Characterization of conserved and variable segments in the P. vannamei mitochondrial control region
The alignment of the 45 CRs in MAFFT was used to detect conserved regions with DnaSP v6.12 [26] adopting the "Conserved DNA regions. . ." analysis. The following user-defined parameters were provided: "Minimum window length" set to 20 and "Conservation threshold" set to 100. The regions found were labeled as conserved segments (CS) and the remaining regions were considered variable segments (VS). We considered CSs with a p-value less than 0.05.

Metrics of genomic variability
To measure genomic variability, we used three parameters: (1) the number of variable sites per gene, distinguishing the genes with more polymorphic sites. (2) a variability index (VI) to express variability as a proportion of polymorphic sites per gene length (VI = n˚of polymorphic sites/gene length); since, by chance, one can expect that longer genes will have more polymorphic sites. (3) the distribution of the number of polymorphic sites over frequency intervals,to distinguish the frequency of each variant in each respective gene.
In the case of both inter-individual analyses, which use multiple sequence alignments, the table generated by Geneious' SNP/polymorphism detection includes reference alleles as variants, distorting the variant count in a few situations. To prevent this, the following criteria were adopted: (1) variants with frequencies above 50% were assumed as reference alleles and, therefore, were not considered as variants for the given site; (2) in cases of bi-allelic sites, where both variants have a 50% frequency, the site would be considered as having a single variant since one should be the reference allele; (3) in cases of tri-and tetra-allelic sites, the most frequent allele was also considered the reference and the others, variants. To prevent that multiple variants assume the same frequency interval, we used an weighted distribution of the variant alleles per site. (4) classification of polymorphic sites in transversion-containing, transition-containing, and/or indel-containing sites to assert the types of variability found in genes.

Measurement of inter-individual variation
Inter-individual variability analysis was determined based on the alignment of the four mitogenomes and 41 CRs. In both cases, the Geneious' "Find Variations/SNPs . . ." function was used using "minimum variant frequency" set to 0 and unchecking the "Maximum Variant P-value" and "Minimum strand-bias P-value" filters. P-values were obtained using "Calculate Variant P-values" option assuming a PHRED quality of 20 and using a 30% penalty because the reads were acquired from an Ion Torrent.

Measurement of heteroplasmy
Trimmed reads were mapped back to the assembled mitogenome (mtDNA-BR) using BWA's "mem" algorithm. Variants were first called using Samtools v1.9 "mpileup" command using the "-B" parameter and providing the assembled mtDNA as reference to the "-f" parameter. The output was then piped to VarScan 2.3.9 (KOBOLDT et al., 2012), where variants were called using the parameter "mpileup2cns" with the following parameters: --min-reads2 4 -min-freq-for-hom 0.5 --min-var-freq 0 --output-vcf. The resulting VCF file was imported and visualized in Geneious, obtaining the read coverage, and variants from the annotation track "not called for any genotype" were not taken into consideration in our analysis.

Assembly of P. vannamei mitochondrial genome and cluster analyses
The mapping of reads to the reference mitogenome NC_009626.1 yielded 8.362 mapped reads with a mean coverage of 128.4 (standard deviation = 51.3). De novo assembly of the reads resulted in a large contig (15.989 bp, 43.9 × mean coverage according to SPAdes) and three short contigs (< 260 bp) with mean coverages of 34.34, 56.13, and 1.5x. The large contig obtained comprised all annotations in accordance with the reference P. vannamei mitochondrial sequence ( Fig 1A). The mtDNA-BR had similar characteristics to previously described mitochondrial genomes (Table 1).
UPGMA clustering of the whole mitochondrial genomes ( Fig 1B) and of the 45 CRs ( Fig  1C) corroborates that the genomes NC_009626 and KT596762 are the most similar. Considering the information regarding CR, three groups can be considered. Among these, the group 3 had greater sequence diversity. The mtDNA-BR sequence was grouped close to group 1 with DQ534543 ( Fig 1C). In some subgroups within groups 1 and 2 the resolution of the analysis was insufficient to discriminate all individuals.

Inter-individual variability
The alignment of the 45 CRs revealed 7 CS comprising about 20% of the CR (196 bp) and 7 VS containing 182 polymorphic sites (ps) with an average distance of 5.2 bp (std. dev. = 5.8) ( Table 2). The length of the conserved regions varied greatly, ranging from 11 bp (VS5) to 354 bp (VS3). The regions with the most polymorphic sites were VS3 (95 ps) and VS6 (26 ps), while VS1 (7 ps) and VS5 (3 ps) had the least. Considering the proportion of ps per length, VS3 and VS5 were the most variable regions (VI = 27%), while VS1 was the least variable (VI = 13%) ( Table 2). The analysis of the distribution of the variable sites per frequency interval revealed that most variants occurred in the lowest frequency interval ( Table 3). The most prevalent polymorphisms were transition-containing sites (143 ps; 78.3%), followed by transversion-containing sites (16 ps; 8.8%), sites containing alleles for both transition and transversions (11 ps; 6%), indel-containing sites (9 ps; 4.9%), and lastly, site containing alleles for indels and transitions (3 ps; 1.6%) (S1 Table).
Analysis of the four complete mitochondrial genomes revealed 212 ps, with an average distance of 75.4 bp (std. dev. = 148.7) ( Table 4). Considering the VIs, the top 10 VS or genes included VIs ranging from 9.94 to 1,79% including all VS, and with COX1 as the most variable gene, followed by several tRNA associated genes ( Table 4). The distribution of the polymorphic sites in the frequency intervals 25% and 50% (fi 25 and fi 50 , respectively) revealed that among the 27 genes evaluated, more variants were found within fi 25 (168 ps) than within fi 50 (44 ps), especially in the COX1 gene that concentrated most of the fi 25 variants (92 ps) ( Table 4). The CR alone showed more fi 50 variants (30 ps) than all the other genes combined (14 ps), with more than half of the variants located in VS3 (17 ps). Since all polymorphic sites were bi-allelic, they were distributed in a single category per site. Transition-containing sites were the most prevalent, corresponding to 79.7% of polymorphic sites (n = 169), followed by transversion-containing sites 17.9% (n = 38), and indel-containing sites 2.3% (n = 5) (S2 Table).

Heteroplasmy
Mapping the 8,496 processed reads back to the assembled mitochondrial genome revealed 186 polymorphic sites, showing an average distance of 85.9 bp (std. dev. = 99.4) ( Table 5). Among these, 184 polymorphic sites correspond to indels (112 insertions and 72 deletions) and 2 were substitutions (two transitions). Detailed information is presented in S3 Table. The intervals between the polymorphic sites reached up to 548 bp, revealing large invariable regions. Among the insertions, single insertions of thymine (n = 66) and adenine (n = 44) were the majority. Among the indels, most were single-base deletions of thymine (n = 36) and adenine (n = 20). Indels involving cytosine (n = 9) and guanine (n = 6) occurred mainly in cases of deletions, representing 21% of the 72 deletions found. The largest indels found were 2 bases long (n = 3): AT and CT (insertions) and AA (deletion).
The CR showed only four ps at VS2, VS3, VS7, and CS6, with the CS6 region being the most polymorphic. The top 10 VIs considering the whole mitochondrial genome ranged from 1.52% to 3.69%, with the ND3 gene as the most variable, followed by tRNA genes (Arg, Ala, Lys), two CR regions (CS6 and VS7), ND1, ND6, and ATP8 genes ( Table 5). from the 5' to 3' of the mitochondrial sequence. The tRNA (purple) and protein coding genes (green) were also presented. (B) UPGMA phylogeny comparing full mtDNAs and (C) control regions (Bootstrap N = 1,000). The numbers 1, 2 and 3 indicate the occurrence of paraphyletic and monophyletic groups. The "R" in the sequence accessions indicate that these sequences were inverted in MAFFT to match the first sequence (see methodology for details).  In general, protein-coding genes differed from genes that do not encode proteins because their variants (almost all indels) showed a pattern of occupying a frequency interval before occupying the next one (Table 5). For example, the lowest frequency observed in protein-coding genes were between 0 and 10% while in non-coding genes the variants could directly start at a higher frequency interval (ex.: CS6, VS2, tRNA-Gly, etc.) or would start at low frequency intervals but presenting "frequency gaps" (i.e.: l-rRNA has no variants at~30-50% frequencies). The only exception to this was the ND4L gene having a single insertion with a 23.7% frequency.

Comparison of inter-and intra-individual variability in the mitogenome
Among the 37 genes/regions analyzed, 9 showed only inter-individual variation, 10 showed only intra-individual variation, and 18 showed variation in both individuals and within the same individual (Fig 2A). In general, the variable regions differ more inter-individually, and the genes for tRNAs vary more intra-individually. Considering the sum of the VIs of the two analyses, the VS3 region had the largest and the NDL4 gene had the least number of variations among all the analyzed genes/regions. Among the 13 genes that encode proteins, only COX1 and COX2 showed inter-individual variation higher than intra-individual variation. The VS7 region and ND4L gene showed the same VI inter-and intra-individually. Considering the inter-individual analysis, VS3 and VS6 were the segments with the largest number of variations. A bimodal pattern around these two VS was observed, indicating two distinct hyper variable regions (Tables 2 and 3 and Fig 2B).

Discussion
This work was motivated by the great deficiency that still exists in the characterization of mitochondrial markers for P. vannamei. It was possible to explore heteroplasmic (intra-individual) and species-wide (inter-individual) variabilities, using full genomes and partial sequences already available, and a metagenome from a muscle sample.

PLOS ONE
Characterization of Penaeus vannamei mitogenome focusing on genetic diversity CR phylogeny presented relative consistency with the full genome analysis, but it did not bring necessary information to individual identification (Fig 1C). The 41 CR sequences used in this study were from haplotypes found in broodstocks of five hatcheries from Mexico and were found to descend from a recent common ancestor [6,7]. This fact and the low difference among the sequences in the polyphyletic branches can explain the polyphyletic groups within the haplotype sequences. However, DQ534543.1 [29] and mtDNA-BR also clustered in this same polyphyletic group, suggesting a recent common ancestor among them.
The presence of CS and VS indicates differential selective pressures in CR. Considering the inter-individual analysis, VS3 and VS6 were the segments with the largest number of variations. A bimodal pattern around these two VS was observed, indicating two distinct hyper variable regions (Tables 2 and 3 and Fig 2B). Similarly, regions known as "hyper variable segments" (HVS) characterized in humans were often used to haplotype individuals [30]. Our results suggest that P. vannamei CR also contains two highly informative HVS: a more variable HVS1 (ranging from VS1 to VS4) and HVS2 (from VS5 to VS7) (Fig 2B).
In the intra-individual analysis, the CR had a lower read coverage than other regions of the mitogenome (~23x), probably due to its high mutation rate, which makes detection by similarity unfeasible. This possibility is corroborated by the low variation observed in the CR (only four indel variants were found) but with high frequency (20% to 46%), which unexpectedly included the CS6. The high frequency variant from CS6 could also indicate a differential selective pressure at the muscular tissue level. In fact, the metabolic diversity in tissues can exert different pressures on their mitochondria [31]. This also raises the possibility of direct or indirect influence in the observed intra-individual variation by the concurrent WSSV infection, requiring further studies.
The amount of indels identified, especially in the intra-individual analysis, caught our attention. It was no possible to verify the impact of these indels on the ORFs, since there are stop codons even considering the sequences already annotated. In agreement, Liu et al, suggest that P. vannamei mitochondrial genomes often use ATN as start codons [28]. These data show that the identification of mitochondrial genes in P. vannamei is not complete, and it is not possible to determine precisely which mutations influence each protein.
In the intra-individual analysis, a distinct frequency pattern of indels for protein-coding and non-coding genes was observed, indicating distinct selective pressures among them (Table 5). In general, protein-coding genes had variants (mostly single-base indels) that progressively regressed with increasing frequency intervals, whereas in non-coding genes, they were up to 40% more frequent than the second most frequent indel or were simply lone super frequent indels (i.e.: CS6). This indicates that small indels need to accumulate in protein-coding genes before increasing their frequency, probably by reducing their deleterious effects. In other words, this suggests that the occurrence of high frequency indels in coding genes requires other indels, perhaps to correct their frame shifts. It is most likely that the occurrence of 'solitary' indels is more disadvantageous for protein-coding genes than for non-coding genes (rRNA, tRNAs, and mtDNA control region). In fact, it was observed that purifying selection is stronger for indels found in genes related to elementary biochemical reactions [32]. Hence, this raises concern about which type of gene (coding or non-coding) is being used to trackindividual or populational variabilities.
Other factors affecting the observed intra-individual variation can be PCR errors, neutral mutations, or genomic paralogs that cannot be distinguished by the presented methodology [33,34]. On the other hand, despite their eventual presence, these confounding factors did not affect the "invariable regions" observed in this study, which reached up to 548 bp. In the presence of these multiple factors, the length of these invariable regions would be shorter. This suggests that these invariable regions are loci under strong selective pressure. These regions could even influence the variation observed in other regions. Therefore, these results suggest the potential of using NGS to find regions under selective pressure, even without prior knowledge.
Outside the CR, genes with low-frequency variants were more abundant in both analyses (Tables 3-5 and Fig 2). Most of the genes encoding proteins analyzed showed greater intraindividual variation than inter-individual variation. This can be explained by the small number of mitochondrial sequences available, and by the fact that, generally, the diversity of these sequences is usually suppressed because only the most frequent allele could be represented during the sequence assembly. Despite this, greater inter-individual diversity was observed in the COX1 and COX2 genes. However, COX1 variants were low in frequency, indicating a high mutability but low retention of variants in this gene. The large variation in the COX1 and COX2 genes, also corroborates the fact that these genes may be related to adaptation to their cellular environment or still, to the WSSV infection, known to upregulate genes related to oxidative respiration to increase energy and, supposedly, raising the survival of infected cells to oxidative stress [35].

Conclusions
The existence of intra-individual variation indicates that the use of CR as a marker for genetic and phylogenetic identification should be adopted with caution. In this context, the mapping of conserved and variable regions presented in this study will contribute to increasing the efficiency of markers for population and/or individual studies.
The possibility of the occurrence of an indel affecting the frequency of others is also a point to be considered when applying mitochondrial markers. This behavior, associated with the fact that most of the variations found are of low frequency, indicates that most of the identified polymorphisms do not have the appropriate characteristics to be used as SNPs.