The Complete Female- and Male-Transmitted Mitochondrial Genome of Meretrix lamarckii

Bivalve mitochondrial genomes show many uncommon features, like additional genes, high rates of gene rearrangement, high A-T content. Moreover, Doubly Uniparental Inheritance (DUI) is a distinctive inheritance mechanism allowing some bivalves to maintain and transmit two separate sex-linked mitochondrial genomes. Many bivalve mitochondrial features, such as gene extensions or additional ORFs, have been proposed to be related to DUI but, up to now, this topic is far from being understood. Several species are known to show this unusual organelle inheritance but, being widespread only among Unionidae and Mytilidae, DUI distribution is unclear. We sequenced and characterized the complete female- (F) and male-transmitted (M) mitochondrial genomes of Meretrix lamarckii, which, in fact, is the second species of the family Veneridae where DUI has been demonstrated so far. The two mitochondrial genomes are comparable in length and show roughly the same gene content and order, except for three additional tRNAs found in the M one. The two sex-linked genomes show an average nucleotide divergence of 16%. A 100-aminoacid insertion in M. lamarckii M-cox2 gene was found; moreover, additional ORFs have been found in both F and M Long Unassigned Regions of M. lamarckii. Even if no direct involvement in DUI process has been demonstrated so far, the finding of cox2 insertions and supernumerary ORFs in M. lamarckii both strengthens this hypothesis and widens the taxonomical distribution of such unusual features. Finally, the analysis of inter-sex genetic variability shows that DUI species form two separate clusters, namely Unionidae and Mytilidae+Veneridae; this dichotomy is probably due to different DUI regimes acting on separate taxa.

Species with DUI are characterized by the presence of two different sex-linked mitochondrial lineages, being transmitted independently by the two sexes. One lineage is called F (from Female-transmitted) and it is transmitted by females through eggs, while the other is called M (Male-transmitted) and it is transmitted by males through sperm.
Reaction conditions (S1 Table) were first set up according to manufacturer's instruction and further modified whenever necessary. After the initial denaturation step (92°C for 2'), 30 cycles were used as follows: denaturation at 92°C for 20", annealing at 48-56°C for 20-30" and extension at 68°C for 10'. The final extension step was carried out at 68°C for 8'. Primers were designed using the Primer3 online tool [21].
PCR results were visualized by electrophoresis onto a 1% agarose gel stained with ethidium bromide and then purified through a standard isopropanol protocol, Wizard1 SV Gel and PCR Clean-Up System (Promega) and, in some cases, with an empirically modified PEG precipitation protocol [22]. Successfully purified products were sequenced with the Sanger method thanks to Macrogen Europe facility (Amsterdam, The Netherland).
Data analysis and genome annotation MEGA 6.06 [23] was used to examine and edit electropherograms and further merge the entire mitochondrial genomes according to overlapping sequences. This software was also used to compute codon usage and nucleotide percentages.
Nucleotide trends and A-T skew at four-fold degenerate sites were used to identify the Control Region (CR) and the Origin of Replication (OR) of either strand. A dedicated R [24] script was written to (i) compute these statistics over a sliding window, (ii) plot results and (iii) test for significance of the linear correlation. Any size and step for the sliding window can be specified: for the present work, we used a 700-bp sliding window with a step of 300 bp. Autocorrelograms for each nucleotide are also produced in order to evaluate the amount of autocorrelation between sliding windows and therefore the validity of the linear model approach. The user is allowed to analyze several genomes at the same time and to set any gene as starting point for the four-fold degenerate sites analysis; the script is available as S1 Script along with test files and a detailed tutorial. A GitHub repository was created, which can be found at the URL https://github.com/mozoo/4F.git.
Protein Coding Genes (PCGs) were predicted using both MITOS [25] and NCBI's ORF Finder online tool [26], using invertebrate mitochondrial genetic code and alternative start codons. Potential PCGs were identified through homologous sequence similarity using BLAST [27][28]. F-and M-cox2 sequences were aligned using the software MUSCLE [29] and the alignment was graphically edited thanks to the TeXshade package [30]. Putative tRNA genes were detected using MITOS, tRNAScan-SE [31][32] and ARWEN v1.2 [33] online softwares, using default settings. rRNA sequences have been identified by comparisons with other rRNAs present in GenBank using BLAST. In both F and M genomes, the rRNAs sequences were then annotated assuming that the first base comes immediately after the last base of the previous gene and that the last base comes immediately before the first base of the following gene. Finally, the two whole-genome maps were created using GenomeVx online tool [34], conventionally setting cox1 as the starting point of the mtDNA. All putative secondary structures of rRNAs and non-coding regions were predicted using Mfold server [35] and then graphically edited through VARNA 3.7 [36].
Repeats were identified through the online software Tandem Repeat Finder [37]. Phobius [38], InterProScan [39], TMPred [40] and HMMTOP [41] online softwares were all used to predict additional Trans Membrane Helices (TMHs). Introns were searched using @TOME2 [42]. The analysis of additional putative ORFs was done using the software Glimmer3 [43]. The EMBOSS [44] package was used to extract all the possible ORFs from all available bivalve complete mitochondrial genomes (GenBank consulted in August, 2014). An alignment was computed for F_ORF141 and M_ORF138 using HHBlits [45] and the last UniProt release. The computed hidden Markov model was used as a query against the database of all bivalve mitochondrial ORFs, which was built using HHBlits and the Pdb70 database.

Phylogeny and evolutionary comparisons
A phylogenetic analysis was carried out using other complete mitochondrial genomes from the family Veneridae that were available at August, 2014. Three heterodonts, Coelomactra antiquata (Mactridae), Hiatella arctica (Hiatellidae), and Acanthocardia tuberculata (Cardiidae) were selected as outgroups; the complete dataset is available as S2 Table. The 13 PCGs and the 2 rRNAs were extracted and aligned with PSI-BLAST [46], MUSCLE, ProbconsRNA [47], RNAplfold [48], and MAFFT [49] through the T-Coffee algorithm [50][51], using the pipeline PSI-Coffee > Expresso > accurate for PCGs and the MR-Coffee mode for rRNAs. Alignments were masked using BMGE 1.1 [52]; the best partitioning scheme was selected using Partition-Finder 1.1.0 [53] under the Bayesian Information Criterion and a greedy approach. The final Maximum Likelihood (ML) tree search was carried out with RAxML 8.2.0 [54] performing 1,000 bootstrap replicates. The consensus tree was computed with PhyUtility 2.2 [55] and graphically edited with Dendroscope 3 [56].
Finally, we compared sex-linked divergence in different DUI systems; all the DUI species whose complete mitochondrial genomes were available in January, 2015 were selected for this analysis. M and F coding sequences of M. lamarckii and other DUI species were aligned gene by gene using the T-Coffee algorithm: as above, the accurate method was used for PCGs, while MR-Coffee was chosen for rRNAs and tRNAs. To account for multiple substitutions, nucleotide Jin-Nei [57] and aminoacid Kimura [58] corrected distances were then computed using the EMBOSS suite along with uncorrected p-distances. Principal Component Analysis was carried out through R on concatenated Jin-Nei and Kimura distances using the packages FactoMi-neR [59] for computations and ggplot2 [60] for graphics. We also computed average distances within Unionidae, within Amarsipobranchia sensu [61] (i.e., Pteriomorphia + Heterodonta; in this case, Mytilidae + Veneridae), and within the complete dataset. Nucleotide single-gene average values were ranked and rankings within Unionidae and Amarsipobranchia were compared through R using the Spearman ρ and the Kendall's τ.

Overall genomic features
The Meretrix lamarckii complete F and M mitochondrial genomes are 20,025 bp and 19,688 bp long, respectively (Fig 1). Sequences are available in GenBank under the accession numbers KP244451 and KP244452, respectively. All genes are located on the same "+" strand and the two lineages share the same gene order. The only exceptions are three tRNAs, which were found only in M-mtDNA: trnL(AAG), an additional copy of trnQ(UUG), and trnF(AAA). Genome annotations are shown in Table 1.
Nucleotide composition is reported in Table 2. M. lamarckii A-T content reaches 66.01% in F-mtDNA and 67.17% in M-mtDNA. This value increases considering only the 3 rd base nucleotide composition of PCG codons (73.31% for F-mtDNA and 75.08% for M-mtDNA).  Both F-type and M-type mtDNAs contain a large numbers of Unassigned Regions (URs; 27 in F-mtDNA and 29 in M-mtDNA), which are detailed in S3 Table. Protein Coding Genes (PCGs) We found all 13 canonical protein coding genes, including the atp8 gene, reported as missing in several bivalve species [7,[62][63]. ATG start codon is used in cox2, cytb, atp6, nad3, and cox3 of the F genome and in nad2, cox2, atp8 and cox3 of the M one. Like most invertebrate mitochondrial genomes, the two M. lamarckii mtDNAs show alternative start codons: GTG, ATC, ATA, ATT and TTG (see Table 1). Observed stop codons are TAG and TAA, as expected. Overall, TAA is the most common stop codon, while TAG is used in F-cox1, nad4L, cox2, cytb, atp6, nad3, and nad5 and in M-nad4L, cytb, atp8, atp6, and nad3. Truncated TA-/T -stop codons ( [8,63]; and reference therein) were not found in M. lamarckii mtDNAs.   (Table 3). In both F and M mtDNAs, the most used codons is UUU (410 and 434 hits, respectively). Less used codons are CGC and ACC (both 7 hits) in F-mtDNA and UGC (2 hits) in M-mtDNA. The most common aminoacid in both F-and M-mtDNA is leucine, while the rarest is glutamic acid.
Post-transcriptional cleavage sites could be indicated by the presence of a tRNA between two PCGs [64]. In absence of a tRNA, the cleavage role can be played by intergenic non-coding sequences that form a stem-loop secondary structure ( [8]; and reference therein). According to the previous statement, for each M. lamarckii unassigned region located between a pair of PCGs, the predicted hairpin was determined and reported in S1 Fig. Finally, M-cox2 gene is significantly different from the F one. More specifically, it includes a 100-aminoacid long region in the middle of the gene, which is not present in F-cox2 (Fig 2).

rRNAs and tRNAs
Standard rRNAs were found in both genomes: rrnS is located between trnT and trnC, while rrnL between cytb and atp8. The F and the M rRNAs predicted secondary structures are reported in S2 Fig. F-mtDNA shows all 22 canonical tRNAs, with two serine-encoding tRNAs and two leucineencoding tRNAs. They may differ from each other in terms of anticodon. Like many other metazoan taxa (see, f.i., [65,8,63]), both F-and M-trnS(UCU) present a shortened DHU arm.
To better understand their origin, M supernumerary tRNAs were compared with the URs mapped in the same position of F-mtDNA. In all cases, a very similar sequence was found, albeit the canonical cloverleaf structure is essentially unrecoverable (S5 Fig).

Long Unassigned Region (LUR)
A Long Unassigned Region (LUR) is located between trnG and trnQ. F-LUR measures 1,855 bp, whereas M-LUR is apparently divided in two regions (LUR1 and LUR2 of 694 bp and 690 bp, respectively) by the putative supernumerary trnF(AAA) (see above).
A complex secondary structure was found in both M and F mtDNAs in the middle LUR sequence. This highly folded structure is comprised between bases 15,698 and 15,952 in F-LUR and between bases 15,838 and 16,512 in M-LUR. In the F-LUR two tandem-repeated motifs were also found, both with two tandem copies. The first motif is 15  The first PCG downstream of the LUR is cox3; therefore, we set cox3 as the starting point of the sliding window computing nucleotide composition at four-fold degenerate sites. We found 1,911 degenerate sites in the F M. lamarckii genome (15.01% of PCG sites) and 1,907 in the M one (14.67%). In both cases, four-fold degenerate sites are highly T-rich (59.65% for F and 59.20% for M) and definitely weak, but significant trends were uncovered (Fig 3A): while a significant A trend was never found, we detected a significant increase in C in both sexes, even if with very low R 2 values. A negative trend for G was found in F-mtDNA, while a positive one for G and T was found in M-mtDNA, again with very low R 2 values. The autocorrelograms show a significant value of the autocorrelation function (acf) only at lag-1 for F genomes, or at lag-1 and lag-2 (or lag-5) for three M nucleotides (S8 Fig). Given the T-richness of four-fold degenerate sites, the A-T skew is always negative or equal to 0, but two peaks were found, corresponding to the LUR and to the atp6/nad3 boundary (Fig 3B).
The structure of the F-LUR is comparable to the LUR of the published genome of M. lamarckii (GenBank Accession Number NC_016174), with some differences. The LUR of the available M. lamarckii mtDNA is found at positions 14,982-18,044; again, a highly folded region can be inferred (15,433). At the 5' side of the highly folded region there is a sequence very similar to that of the F-ORF (15,392); this sequence would be a putative ORF located on the reverse strand, were it not for a stop codon (TAA) right after the start one and for an insertion of a G, which triggers a frameshift mutation leading to the loss of the stop codon (S9 Fig). Conversely, at the 3' side of the highly folded region a 100-bp repeated motif was found; the repeat unit shows some similarities with the 109-bp motif of our F genome (S10 Fig). However, in the GenBank M. lamarckii mtDNA it is repeated 13 times; this higher number of repeats accounts for the great difference in length between the two LURs (1,855 against 3,063 bp).

Supernumerary Open Reading Frames (ORFs)
Several additional putative Open Reading Frames (ORFs) were found within the LUR of both F and M M. lamarckii mtDNAs. Among all these sequences, we found only one ORF in each genome (F_ORF141 and M_ORF138) that does not overlap with the highly folded structure revealed in the LUR (see above). To better understand whether these putative ORFs are expressed or not, the prediction software Glimmer3 was used. At first, the software was trained with M. lamarckii standard gene data. All mitochondrial PCGs were given a score comprised between 8.71 and 16.34 (for F) and between 10.47 and 18.08 (for M). According to these values, the two potential supernumerary ORFs should not be considered as expressed, because they showed extremely low scores (i.e., 2.34 for F_ORF141 and 3.36 for M_ORF138; see S4 Table).
The presence of F_ORF141 was also searched for in all available bivalve complete mitochondrial genomes using HHBlits. In all the other available Meretrix species, a homolog was found in the reverse strand, within the LUR. All homologous ORFs have a probability over 90%, while E-value and p-value were always lower than 0.05; this holds also for the F_ORF141/ M_ORF138 comparison (S5 Table).

Genetic variability
Nucleotide Jin-Nei distances and aminoacid Kimura distances were calculated between M. lamarckii F-mtDNA and M-mtDNA for each PCG, tRNA, rRNA, UR and for concatenations  Table 4 and S6 Table (for single tRNAs).
Between the F-and the M-mtDNA the nucleotide Jin-Nei distance of the complete genome (coding + non-coding) is 53.13, corresponding to a 16.19% divergence. Jin-Nei nucleotide distances are 81.53 (25.81%) for non-coding regions and 49.65 (14.89%) for coding genes. PCG concatenation has an aminoacid Kimura score of 19.57 and, within that, the highest values belong to cox2 (43.17) and nad5 (33.45), while lowest values are associated with nad1 (4.82) and atp8 (9.71). The average Jin-Nei distance between rRNAs is 37.86; the average Jin-Nei distance between tRNAs is 27.41.
We also compared the two M. lamarckii mitochondrial genomes obtained in this paper (F and M) to the already sequenced M. lamarckii mtDNA present in literature (GenBank Accession Number NC_016174). The uncorrected distance (p-distance) between this genome and our F genome is 9.81% for all the coding DNA, being 10.67% for all PCGs, 8.17% for all rRNAs and 5.33% for all tRNAs. On the other side, the divergence between this genome and our M genome scored is 14.65% for all coding DNA, scoring 15.76% for all PCGs, 11.43% for all rRNAs and 10.72% for all tRNAs. In both cases, the most divergent gene is cox2 (15.09% and 21.72%, respectively), whereas the less divergent is atp8 (4.17% and 9.17%, respectively). The aminoacid Kimura distance was also computed to account for synonymous substitutions (S7 Table): again, the NC_016174 sequence is always more similar to the F-mtDNA than to the M-mtDNA and the most divergent gene is cox2 (19.28 and 39.86, respectively), while the less divergent genes are atp8 (0.00 and 8.13) and nad1 (0.67 and 4.12). Generally speaking, with the exception of cox1 and cox2, the Kimura distance from M-mtDNA is always one order of magnitude higher than from F-mtDNA.
The F vs M distances were also computed for all known DUI species whose complete mitochondrial genomes have been published (see Table 4 The resulting PCA plot (Fig 5) uses the first two components to explain the 73.20% + 6.19% = 79.39% of distance variability (using Jin-Nei and Kimura distances together). Datasets are roughly arranged by overall divergence levels along the first principal component (Hyriopsis spp. < Mytilidae+Veneridae < Unionidae); the second principal component further separates venerids and mytilids from unionids. Finally, it is impossible to reject the null hypothesis that nucleotide distance rankings among single PCGs in Unionidae and Mytilidae + Veneridae are unrelated, using both the Spearman ρ (p = 0.2392) and the Kendall's τ (p = 0.2044). For example, cytb and nad1 are highly divergent for Unionidae, but they are among the least variable for Amarsipobranchia, while the opposite is true for nad4 and nad5.

Comparison with the previously published Meretrix lamarckii mitogenome
The two mitochondrial genomes of Meretrix lamarckii (F and M) sequenced here are slightly shorter than the one previously reported in GenBank: 20,025 (F) and 19,688 (M) bp against 21,209 bp [18]. This genome was extracted from a somatic tissue (the foot muscle; [18]) and, indeed, it was previously attributed to the female type by Plazzi and colleagues [10]. The phylogenetic analysis of the present work further corroborates this hypothesis (Fig 4). It shares the same gene content and gene order, with the exception of trnL(AAG), trnQ(UUG) and trnF (AAA), which have been found only in M-mtDNA. This, again, strengthens the idea that the previously published genome is from the female lineage.  However, comparison between either sex and the published F genome showed surprisingly high divergence values. These were generally one order of magnitude higher when comparing it with our M genome, and, as in the case of our F/M comparison, divergence ranking is similar: f.i., highest values are obtained from cox2, while lowest scores are obtained from atp8. Significant divergence values are still observed at the aminoacid level for both sexes; the distances between our M-mtDNA and the published F genome are comparable to those computed between the two sexes in the present study (Table 4 and S6 Table).
It is possible to find similarities in the LUR structure between the two F genomes, a highly folded region being the divide between a supernumerary ORF and a region with tandem repeats of about 100 bp in length. However, in the published F genome it was not possible to find a functional ORF (S9 Fig). This may be due to sequencing errors; if the available sequence is confirmed, it is hard to say whether an ORF was originally present in the species and was subsequently pseudogenized in some populations or a novel ORF appeared in some others. Given the widespread presence of supernumerary mitochondrial ORFs in bivalves [71,62,72,63,5,[73][74], we largely favor the first hypothesis. On the other side, it is possible to align the 3' repeated motifs (S10 Fig). The great variation in length between the two LURs is due to the different number of repeats: 2 in our F genome, 13 in the published one. This difference, in turn, accounts for the aforementioned difference in length between the two genomes.
Interestingly, intra-specific variability in the number of repeats in a mitochondrial LUR has been reported elsewhere [75][76] for (DUI) bivalves. The two M. lamarckii specimens sampled for this research come from the Tokyo area (Japan), whereas the F genome available in Gen-Bank comes from Zhejiang (China) [18]. Recall that the Chinese specimen was only tentatively identified as M. lamarckii due to its similar morphology, despite showing some differences in color, shell shape and thickness [18], we cannot completely rule out the hypothesis that these specimens belong to different species; however, it is not unconceivable that the differences found here simply reflect the high degree of taxonomic distinctness between Japanese and Chinese clams belonging to very distant populations.

Nucleotide composition and codon usage
A-T content in M. lamarckii F and M genomes is slightly higher with respect to the average A-T content in bivalves, but it is comparable with those of other DUI organisms such as V. philippinarum and M. senhousia [63]. The coding strand (+) is G-T rich: this is expected [77][78] and in good agreement with [63], where it was stated that a higher G-T percentage is related with mtDNAs characterized by most (if not all) genes located on the same strand.
Codons usage reflects the general nucleotide composition of the two genomes, with a high presence of T in most used codons. In almost all cases, except for trnL(TAA), trnK(TTT), and trnM(CAT) in F-mtDNA and for trnL(TAA), trnF(AAA) and trnK(TTT) in M-mtDNA, within four-fold or two-fold degenerate codon families the most used codons do not have a complementary anticodon in mitochondrially-encoded tRNAs (Table 3). Moreover, they differ for only one base (the third one) with respect to the synonymous codon for which a complementary tRNA exists in the mitochondrion. The codon usage table demonstrates the presence of high degrees of third-base wobbling in M. lamarckii, as previously seen in other bivalves [8,63]: a tRNA can have a non-standard base at the first anticodon position pairing with more than one base and allowing to bind codons that are not perfectly complementary.

PCGs and the cox2 insertion
With the exception of the supernumerary ORF, all genes are located on the same strand in both F and M mtDNAs of M. lamarckii. This is commonly found in all Amarsipobranchia, while unionids [71,9,63] and Solemya velum [63] encode genes on either strand. This finding reinforces the hypothesis that the one of M. lamarckii is a derived state, which evolved once in the common ancestor of Pteriomorphia and Heterodonta ( [9]; and reference therein).
The atp8 gene was declared as missing in several bivalve species [8], especially in the genus Mytilus [79], even if it was recently found in some bivalves like Solemya velum [63], Musculista senhousia [8], Venerupis philippinarum [80], and presently in M. lamarckii. In addition, recent studies [9,81] found this gene in species in which it was not previously annotated.
The use of Jin-Nei corrected distance to evaluate nucleotide divergence unveiled that atp8 is not less conserved than other mitochondrial genes (see Table 4). As suggested elsewhere [5,82], there is a strong possibility that the absence off this gene is simply due to past annotation difficulties or inaccuracy. Up to now, the presence/absence of ATPase subunit 8 does not appear linked with DUI [9], but rather, if confirmed, to phylogeny [63].
It was suggested to be the commonest situation in metazoans that the two ATPase subunits, atp8 and atp6, are adjacent and overlapping [79]. This especially holds if the co-translation of these genes from a bicistronic transcript (as is the case in mammals; [83]) is confirmed as a widespread rule. In fact, this association is present in the Unionidae [71] and it was recently found in S. velum [63]; however, a disjointed location of atp8 and atp6 has already been highlighted for some heterodont bivalves, like Hiatella arctica [80] and Macoma balthica [82]. Similarly, in M. lamarckii atp8 and atp6 are not neighboring in either F-and M-mtDNA: again, the contiguity of these genes may be an example of an ancestral state that was subsequently lost in derived bivalves.
M M. lamarckii genome presents an insertion of 100 codons in cox2 gene, which is totally absent in the F counterpart (Fig 2). It is not the first time that a M-cox2 gene is longer than F-cox2; generally, however, these extensions map to the 3' end of the gene. In fact, the M-cox2 3' tail is present in all three subfamilies of Unionidae [5]. This extension (Mcox2e) has been found only in M mtDNAs and varies in length between 177 and 192 bp [84][85]. Mcox2e has been found in poly-adenylated transcripts of cox2 obtained from male gonads, and also proved to be translated and localized in both inner and outer mitochondrial membranes [84][85][86]. The structural analysis of unionid Mcox2e sequences reveals the presence of the two canonical Nterminal trans-membrane helices (TMHs). In addition to that, several additional TMHs were found in Mcox2e [87]. For the above mentioned reasons, a proposed hypothesis was that such extension may be a mitochondrial tag implicated in male mitochondria survival to elimination and differential segregation during development [87].
Outside from the unionid family, the pattern of cox2 variations among DUI M and F lineages is unclear and not easy to unravel. In mytilids, no extensions were found in M genomes of Mytilus [5], but a duplicated cox2 gene (cox2b) is found in M M. senhousia [8], with the duplicated gene being longer than the original one at 3'. A putative TMH of 41 residues was found in the cox2b tail [8], allowing the authors to hypothesize a correlation between the unionid Mcox2e and the cox2b tail of M. senhousia.
V. philippinarum is with M. lamarckii the only known DUI species of the family Veneridae: a duplication of the cox2 gene, similar to that of M. senhousia (i.e. longer at 3'), was found, but, contrastingly, it is located in the F genome [8]. However, additional TMHs (either in insertions or tails) are not detectable in V. philippinarum cox2, nor in M. lamarckii (S11 Fig). Moreover, @TOME analysis did not find any intron, and the coding frame is apparently kept (see Fig 2).
Concluding, it was impossible to properly assign a function in silico to this region, and further analyses are therefore mandatory in this regard. Non-canonical features in cox2 gene are often coupled with DUI, but a general rule is still not evident and each DUI system seems to follow its own evolutionary pathway. However, despite the relationship between cox2 variations and DUI phenomenon has not been demonstrated yet, the finding of a new M-cox2 gene insertion (albeit differently located in the gene) in another DUI bivalve is an interesting clue.

Supernumerary tRNAs in M-mtDNA
As mentioned above, M and F genomes basically share the same gene arrangement, the only difference being three tRNAs in M-mtDNA. As a consequence of the high variability of their mitochondrial genomes, additional tRNA copies are common in bivalves [88][89][90]. In fact, when aligning the M additional tRNAs with the region mapped in the same position in the F-mtDNA, high levels of sequence similarity were always detected (see S5 Fig). Therefore, we may hypothesize that the duplication of trnL, trnQ, and trnF took place before the separation of the two sex-linked lineages, and that, afterwards, the F copies became pseudogenes or remain functional tRNAs that the in silico methods are not able to retrieve. Anyway, it has also to be noted that the anticodon region of the F counterpart of M-trnQ(TTG) would be complementary to the stop codon TAA.
The presence of a tRNA in the middle of M-LUR (trnF(AAA)) is intriguing and deserves further investigation: possibly, the cloverleaf structure of a tRNA was co-opted as part of the signaling structure of the putative control region (see below) and, thus, would not correspond to a functional tRNA. However, it is noteworthy that the anticodon of the middle-LUR tRNA is AAA, which is complementary to TTT, the most used codon in both genomes (see Table 3).
The presence of a functional tRNA in the middle of a control region, where it may work also as a signaling sequence, would make of the trnF(AAA) gene of M-mtDNA of M. lamarckii a good example of an evolutionary spandrel [91] and/or a case of molecular exaptation: this region, being a tRNA, necessarily had a complex secondary structure, and this became useful in the wider context of the control region as well (or vice versa, even if the presence of a degenerated tRNA in the F-mtDNA makes us to prefer the first hypothesis). The presence of a tRNAlike structure was already signaled by [67] in the Mytilus spp. LUR, but in the case of M. lamarckii it seems that the tRNA maintained its functionality.
Other expected non-canonical tRNA structures are found in our genomes: f.i., in both F-and M-mtDNA two trnS were found and the DHU arm was not recovered in trnS(UCU). However, as mentioned by [8], this unusual tRNA has been found in several other animal groups and it evolved early in Metazoans group [92]. In vitro analysis further confirmed its functionality [93].

Control Region (CR) and the Origin of Replication (OR)
Several parameters have been proposed to identify the mtDNA control region (CR). The most used are the presence of repetitive elements, palindromes, length, high A-T content and secondary structures with T-rich loops [67,94,71]. M. lamarckii Long Unassigned Region (LUR) is the longest UR in both F and M mtDNAs, although the M one is apparently split into two parts by a phenylalanine supernumerary tRNA (as mentioned above). A-T content is roughly the same found in the entire genomes, 64.2% for F and 65.8% for M, even if several poly-T have been found during (and heavily hampered) sequencing.
The short (15-bp long) repeat is essentially a stretch of G and A and may simply reflect the general G-T-/A-T-richness of both genomes in a region where less selective constraints are working; however, this repeat is conserved in both F and M genomes and it is known that similar G-rich sequences are present in Mytilus and human control regions, being related with replication and/or transcription [67].
The 109-bp long repeats are located near to the 3' end of the F-LUR sequence, and, due to their proximity with the putative origin of replication (see below), they may play a functional role in F mtDNA duplication (but recall that they are not detectable in M). Both M-and F-LUR present a central region which appears to be heavily folded (S6 Fig): again, this secondary structure may play some role for the replication/transcription process to begin [67,71].
The nucleotide composition at fourfold degenerate sites is related with single-strand state duration during mtDNA replication. As detailed in ( [95][96][97][98]71]; and reference therein), the more the heavy (H) strand remains unpaired, the more the spontaneous hydrolytic deamination of C to U and A to hX (hypoxanthine) takes place. Such an increase of T and hX in the H strand leads to a corresponding increase in the percentages of A and C in the complementary lagging (L) strand where the H strand remains for longer time in the single-stranded condition, i.e. near to the OR. Moreover, single-stranded-guanine may spontaneously oxidize to 8-hydroxyguanine, which basepairs with adenine: thus, in this case, G decreases and T increases on the H strand. In a nutshell, T will only tend to accumulate near to the origin of replication of the H strand, while the opposite is true for A and C; finally, G may behave in either way [95,98]. This asymmetrical composition can leave a neutral signature in fourfold degenerate sites, being them under no or weak selection.
The 700-bp sliding window analysis on these sites is in agreement with this model (Fig 3): with the exception of A (and of T in F-mtDNA), all correlations are significant, even if R 2 values are very low (<25%). Setting cox3 as the starting point of the pattern, T in M-mtDNA tends to decrease, C tends to increase, and G decreases in the F-mtDNA and increases in the M one, which would be expected if the OR is located upstream to cox3. This also point to the conclusion that the "+" strand is in fact the H strand (as predictable, being all genes located on it).
The A-T skew at four-fold degenerate sites is known to be correlated with the position of the ORs as well: extreme (i.e., closer to ±1) values are associated with PCGs located near to the OR of the H strand, while balanced (i.e., closer to 0) values are associated with PCGs located near to the OR of the L strand [95,97,99]. Given the overall high T-richness of these genomes, the A-T skew at four-fold degenerate sites is always negative: however, lowest values (i.e., closer to −1) are associated with the LUR and with the cox3 gene (Fig 3B), while highest values (i.e., closer to 0), are associated with the nad2/nad4L and atp6/nad3 regions. Therefore, we have further evidence that the LUR contains the OR of the H strand; moreover, it is tempting to conclude that either the nad2/nad4L or the atp6/nad3 region is the OR of the L strand. Both regions are neighbored by a two-or three-tRNA cassette, and it has been shown that an array of tRNAs on a strand may act as OR in the opposite one through alternative secondary structure [100]. If the OR of the L strand were located in the atp6/nad3 region, that shows A-T skews closer to 0 ( Fig 3B) and is near to three tRNAs in either sex (Fig 1), this would leave the OR of the L strand quite distant from the OR of the H strand, a situation very similar (if not more extreme) to that of Mytilus [98] and unionids [71]. However, it is possible that more complex patterns are shadowed by the presence of all genes on the same strand and by the high T-richness of both genomes (recall that the A-T skew for the third codon position of PCGs is −0.44 for both genomes; Table 2).
As a conclusion, we gathered seven pieces of evidence that the F-LUR and the M-LUR are the control regions of M. lamarckii mtDNAs; as detailed above, most of these features are shared with other DUI species, namely Mytilus spp. [67,98] and Unionidae [71]. First of all, we have (i) a complex secondary structure, that, if the supernumerary ORFs are expressed (see below), would involve the complete LUR. Within that, we found, approximately from 5' to 3': (ii) the presence of G-rich elements; (iii) the presence of a tRNA (only in M); (iv) a sequence with some homology to the human TAS element (only in F); (v) the 109-bp long repeats (only in F). Finally, downstream from the LUR, we detected (vi-vii) the two above-mentioned nucleotide composition trends. Being these features unique to this region, we propose the LUR to act as the CR and to contain the OR of the H strand.

Supernumerary ORFs
Many DUI species, like M. senhousia and Mytilus spp., present supernumerary ORFs with no known homologies with other proteins (i.e., ORFans; [101]), which are located in the LUR. M. lamarckii is no exception. For such ORFs a correlation with the DUI phenomenon has been suggested [71][72]5], even if the opposite was also proposed, interpreting the RNA transcripts as degradation intermediates [102]. Supernumerary ORFs were also found in the basal species S. velum, leading to the hypothesis that they constitute a plesiomorphy among bivalves [63]. Although some of these ORFs have uncontrovertibly proved to be translated [71][72][73], it is uncertain whether the M. lamarckii ones are even transcribed. This issue can be assessed only by looking at expression data, but, currently, without an available transcriptome/proteome of M. lamarckii, we cannot confirm nor disprove the functionality of either ORF.
However, a precise homology between F and M ORFans was detected, which would not be expected if these sequences did not share a common ancestor; furthermore, an ORFan with high homology to F_ORF141 has been found in all species belonging to genus Meretrix. Interestingly, this would make of this supernumerary ORF the only gene located on the reverse strand in all the Meretrix genomes (S5 Table).

Sex-linked mtDNA diversification and evolution in DUI bivalves
The two entire M. lamarckii genomes diverge by a 16% on average (see also [10]), hence the divergence between F and M mtDNAs is somewhat lower than other DUI species. On the other hand, the most diversified genomes belong to unionids (around 35%), followed by mytilids (around 25%). The other venerid, V. philippinarum, shows levels of divergence comparable to those of mytilids (26%).
In this work, nucleotide Jin-Nei and aminoacid Kimura distance values of all DUI species (whose complete mitochondrial genomes are available in GenBank) were calculated between M-and F-type mitogenomes to estimate divergences and give an idea of the rate of independent evolution between the two sex-linked genomes. We strongly advocate the use of corrected distance methods, like the Jin-Nei and Kimura formulae, over the uncorrected p-distance, because of the high divergence between sex-linked mtDNAs in many DUI species and the overall high variability of the molluscan mitochondrial genome, where significant level of saturation and multiple-hits events are quite common (see, f.i., [61]).
Both Hyriopsis cumingii and H. schlegelii Jin-Nei distance values are surprisingly low, in contrast with all other sequenced unionids (and, in general, DUI) mtDNAs. However, there is a chance that there was an error in assigning the paternal route of transmission to genomes retrieved from males. Actually, as reported in GenBank, H. cumingii M genome (GenBank Accession Number HM347668) was indeed extracted from mantle tissue, whereas the source of the F (GenBank Accession Number NC_011763) is not reported. Conversely, H. schlegelii F and M genomes (GenBank Accession Numbers NC_015110 and HQ641407, respectively) were extracted by gonad tissue-not from gametes-which is known to contain somatic cells carrying the F genome. Furthermore, the study of cox2 gene reveals that M mtDNAs of both species do not have additional putative TMHs typical of all other unionids M-cox2 (see above; S11 Fig). These evidences point to the fact that both genomes do belong to F-type and their minimal divergence is due to normal intraspecific variability.
More interestingly, in the PCA of distance scores (Fig 5), DUI species clustering follows the taxonomic arrangement of bivalves. Two large assemblages are visible: unionid species, one side, and Amarsipobranchia (i.e. Veneridae + Mytilidae), the other. In fact, the divergence of the two mtDNAs is higher in unionids (Table 4). This may point to the conclusion that DUI is somehow different in these two lineages, leading to distinct patterns of sequence evolution. This is not a new observation, since differences in many respects of DUI were repeatedly evidenced between Unionidae and Mytilidae + Veneridae (see, e.g., [5,14]; and reference therein). In particular, the main difference is that unionids have established M-and F-mtDNA lineages earlier than species radiation, thus leading to a higher divergence between sex-linked lineages and, thus, to a very strict "gender-joining" phylogenetic pattern [84][85]5].
Conversely, in Amarsipobranchia, two tentative, perhaps overlapping explanations were given to account for the observed "species-joining" pattern, with the only exception of the fairly recent Mytilus edulis species complex ( [5]; see also Fig 4 for Veneridae): (i) multiple role reversal events, as well as reversions to SMI, may have blurred the phylogenetic and diversification pattern [103][104]5], and/or (ii) DUI and the establishment of the two sex-linked mitogenomes may have happened many times in different lineages. This was a conceivable hypothesis given the model described in [105,5], where a relatively simple switch, called factor Z, is proposed to trigger DUI/SMI swaps.
However, it is also worth pointing out that genetic divergence behaves differently in singlegenes pairwise comparisons, and this is not expected if we consider the observed variability as a function of the DUI onset time only. For example, unionids show high distance values for cytb and nad1, which are among the most conserved within Amarsipobranchia, and vice versa for nad4 and nad5 (Table 4). Currently, it is not possible to speculate on the reasons of such divergence patterns, and more comparative and structural analyses have to be done.

Conclusions
The present phylogenetic reconstruction (Fig 4) corroborates previous evolutionary trees of venerids [106,10] and, above all, indicates future research lines: the detection of DUI in other genera of the family Veneridae and/or in other species of the genus Meretrix would add consistency in the single DUI origin hypothesis (at least for Heterodonta; [14]), while the direct observation of SMI in those groups would probably lead to a re-evaluation of the parsimony approach to the origin of DUI proposed in [14]. Furthermore, investigating the distribution of DUI within the genus Meretrix would open the field for comparisons with the Mytilus species complex, which is the only known case of a gender-joining pattern among Amarsipobranchia.
The great genome variability shown by bivalves at the mitochondrial level may somehow veil mtDNA similarities between distantly related DUI species, so that comparisons between taxonomically closer DUI species are needed to further characterize and understand the DUI mechanism and the related molecular machinery. This opportunity was unavailable for venerids so far. Therefore, the sequencing and characterization of M. lamarckii mtDNAs presented here makes this species a useful experimental counterpart of V. philippinarum, which in turn has been thoughtfully characterized in recent years (see, f.i., [107][108][109][110][111]74]). Phobius predictions of cox2 residue locations for all DUI species used for this work. The cox2 length is reported in the x axis; the y axis refers to the posterior probability of a given position to be part of a TMH (gray), cytoplasmic (green), non-cytoplasmic (blue), or part of a signal peptide (red). (PDF) S1 Script. R script used to compute nucleotide composition and A-T content at four-fold degenerate sites over a sliding window. The script is called 4F; example files and a tutorial are also provided. The same script can be downloaded at the GitHub repository https://github. com/mozoo/4F.git. (GZ) S1 Table. Primers used to amplify Meretrix lamarckii F and M mitochondrial genomes. Primers are listed by pairs, showing for each pair the forward (F) and the reverse (R) primers in the column "Strand". In the column "Sex" it is specified if a given pair was used for the female genome (F), for the male genome (M), or for both (both). For amplicons > 2,000 bp the Herculase enzyme was used (see text for details). Where two annealing temperatures are listed, the first one refers to the female genome and the second one to the male genome. (PDF) S2  Table. Present study vs published genome Kimura distances. The Kimura aminoacid distance is listed for each PCG. F, female genome; M, male genome; NC_016174, published Meretrix lamarckii mitochondrial genome. The F-atp8 gene has 11 aminoacids at the 5' end that are lacking in the published atp8 gene; as the remaining part of the peptide sequence is identical, the pairwise deletion led to a Kimura distance of 0. (PDF)