Chloroplast genome features of Moricandia arvensis (Brassicaceae), a C3-C4 intermediate photosynthetic species

Moricandia arvensis, a plant species originating from the Mediterranean, has been classified as a rare C3-C4 intermediate species, and it is a possible bridge during the evolutionary process from C3 to C4 plant photosynthesis in the family Brassicaceae. Understanding the genomic structure, gene order, and gene content of chloroplasts (cp) of such species can provide a glimpse into the evolution of photosynthesis. In the present study, we obtained a well-annotated cp genome of M. arvensis using long PacBio and short Illumina reads with a de novo assembly strategy. The M. arvensis cp genome was a quadripartite circular molecule with the length of 153,312 bp, including two inverted repeats (IR) regions of 26,196 bp, divided by a small single copy (SSC) region of 17,786 bp and a large single copy (LSC) region of 83,134 bp. We detected 112 unigenes in this genome, comprising 79 protein-coding genes, 29 tRNAs, and four rRNAs. Forty-nine long repeat sequences and 51 simple sequence repeat (SSR) loci of 15 repeat types were identified. The analysis of Ks (synonymous) and Ka (non-synonymous) substitution rates indicated that the genes associated with “subunits of ATP synthase” (atpB), “subunits of NADH-dehydrogenase” (ndhG and ndhE), and “self-replication” (rps12 and rpl16) showed relatively higher Ka/Ks values than those of the other genes. The gene content, gene order, and LSC/IR/SSC boundaries and adjacent genes of the M. arvensis cp genome were highly conserved compared to those in related C3 species. Our phylogenetic analysis demonstrated that M. arvensis was clustered into a subclade with cultivated Brassica species and Raphanus sativus, indicating that M. arvensis was not involved in an independent evolutionary origin event. These results will open the way for further studies on the evolutionary process from C3 to C4 photosynthesis and hopefully provide guidance for utilizing M. arvensis as a resource for improvinng photosynthesis efficiency in cultivated Brassica species.


Ethical statement
This study did not involve in any human or animal research participant data. The plant sample tested in this study is not endangered species, and the collection of sample didn't cause any environmental problem.

Plant materials and DNA library preparation
Seeds from a pure M. arvensis line were cultivated in a glasshouse at Guizhou Normal University (Guiyang, China). When the plants had seven true leaves, 5 g of fresh leaves were collected for total DNA isolation using a commercial DNA extraction kit (TIANGEN, KG203, Beijing) according to the manufacturer's instructions. After checking the integrity of the DNA with the Agilent Technologies 2100 Bioanalyzer (Agilent Technologies, USA),~1 μg DNA was fragmented to~450 bp to construct a short-insert library for Illumina sequencing (HiSeq X Ten). Approximately 5 μg of DNA were used to construct the library with insert sizes of 20 kb for PacBio sequencing (PacBio, Menlo Park, USA), according to the manufacturer's instructions. The raw sequence data reported in this paper were deposited in the Genome Sequence Archive of the National Genomics Data Center under the accession number CRA003542 (https://bigd. big.ac.cn/search/?dbId=gsa&q=CRA003542).

Cp genome assembly and genome feature analysis
The Illumina platform produced 150 bp of paired-end reads. Then, the Trimmomatic software (version-0.39) with default settings was used to remove low-quality reads and trim out the adapters to obtain clean reads [29]. To identify the cp-related reads, the clean reads were mapped to the published Arabidopsis thaliana cp genome (NC_000932) using the BLASR software under basic local alignment with default settings [30]. The cp-related reads were combined into contigs using the SOAPdenovo software (version 2.04) with default parameters [31]. After removing the debased PacBio reads with read quality < 0.80 or read length < 500 bp, long PacBio subreads were used to repair the gaps between contigs using PBjelly [32]. Then, the BWA software (version 0.5.9) was employed to correct possible misassembly and errors throughout these cp-related Illumina reads [33]. Finally, the frameshift errors were manually corrected during gene prediction.
GeSeq (https://chlorobox.mpimp-golm.mpg.de/geseq.html/) was used to annotate M. arvensis cp genome using default settings. Basic Local Alignment Search Tool (BLAST) was used to define the start and stop codons of each gene through homology searches [34]. The complete gene map of the M. arvensis cp genome was plotted using the OGDraw software (version 1.2) [35]. Finally, the well-annotated M. arvensis cp genome was submitted to public Gen-Bank under the accession number MW279233.

Codon usage bias of the coding sequences
Codon usage bias is believed to affect translational dynamics, including protein folding, translation accuracy, and efficiency [38]. The CodonW1.4.2 program (http://downloads.fyxm.net/ CodonW-76666.html/) with default settings was used to study the translational dynamics of the M. arvensis cp genome [39].

Comparison of the M. arvensis cp genome with C3 cp genomes of other Brassicaceae species
To detect sequence divergence between the M. arvensis cp genome and other related C3 species, the mVISTA web service was used to visualize genome divergence, using the M. arvensis cp genome as a reference. The related cp genome sequences included B. rapa (NC_040849), B. oleracea (NC_041167), B. juncea (NC_0282720), Raphanus sativus (NC_024469), and Orychophragmus diffuses (NC_033498), and they were downloaded from the NCBI. Additionally, IRscope was used to compare the LSC/IRB/SSC/IRA junction regions among the selected cp genomes.

Ks/Ka substitution rate calculation
Synonymous (Ks) and non-synonymous (Ka) nucleotide substitution rates are valuable markers for evaluating genomic evolution [40,41]. To calculate the Ks and Ka ratios, total pairwise comparisons of 77 common shared coding genes (S1 Table) [42]. Pairwise alignments of these genes were carried out using MAFFT [43] with default settings.

Phylogenetic analysis
The maximum likelihood (ML) method with the Tamura-Nei model which was automatically recommended by the MEGA7 was used to determine the genealogical relationship between M. arvensis and related Brassicaceae species [44]. The initial tree for the heuristic search was obtained automatically by applying the Neighbor-Join and BioNJ algorithms to a matrix of pairwise distances estimated by the Maximum Composite Likelihood (MCL) approach. A total of 60 cp Brassicaceae species genomes (S2 Table) were downloaded from GenBank to construct the phylogenetic trees. To increase the efficiency of the phylogenetic analysis, 66 homologous coding sequences (S3 Table) shared by the studied cp genomes were used. A total of 1,000 bootstrap replications were used to increase confidence.

The features of M. arvensis cp genome
A total of 6,834 Mb of raw data (45,560,522 raw reads) were obtained from the Illumina sequencing platform. After filtering, 6,543.7 Mb of clean data (43,846,822 clean reads) were obtained with an average Q20 value of 98.46%. A total of 3,530 PacBio subreads with a mean length of 12,382 bp were obtained (S4 Table). After being mapped to the Arabidopsis thaliana cp genome, 482.3 Mb (7.37% of clean reads) cp-related reads was obtained, representing an coverage of 3,306× over the cp genome. And 187 PacBio subreads (5.3% of total PacBio subreads) were demonstrated to cp-related subreads. Because of the relatively low coverage of the cp genome, these cp-related PacBio subreads were only used to fill the gaps between the contigs constructed by the Illumina reads.
The cp genome of M. arvensis had a quadripartite structure of 153,312 bp in size, including an SSC region of 17,786 bp and an LSC region of 83,134 bp, which were separated by two IR regions of 26,196 bp (Table 1 and Fig 1). Its average GC content was 36.37%, and the IR region had the highest GC content (42.34%), followed by LSC (34.15%), and SSC (29.17%). The M.

PLOS ONE
arvensis cp genome encoded 112 genes, including four rRNAs, 29 tRNAs, and 79 protein-coding genes. Among these genes, 20 genes (four rRNAs, eight tRNAs, and eight protein-coding genes) were duplicated. The coding sequence region was 78,396 bp in size, accounting for 51.13% of the whole genome. Among these genes, 82 genes, including 59 coding genes and 23 tRNAs, were identifed in the LSC region. Twelve genes were observed in the SSC region, whereas 18 genes, including seven tRNAs, four rRNAs, and seven protein coding genes, were duplicated in the IR regions. In the whole genome, 12 genes (four tRNAs and eight coding genes) had one intron, whereas two genes (ycf3 and clpP) had two introns ( Table 2). The rps12 was identified to be a trans-spliced gene harboring one intron.

Comparisons of whole cp genomes
In the comparison of the gene contents of cp genomes among the five related species (R. sativus, B. juncea, B. rapa, B. oleracea, and O. diffuses), novel genes were not observed in the M. arvensis cp genome. The mVISTA program was used to determine whether a change in gene

PLOS ONE
order has occurred in the M. arvensis cp genome compared to the above mentioned cp genomes using the complete cp genome of M. arvensis as a reference. The results showed that all selected cp genomes had high similarity (>95%), indicating a high degree of gene synteny, and no gene rearrangement was detected (Fig 2). However, the untranslated regions showed relatively high divergence among the selected cp genomes. The LSC/IR/SSC junction regions are believed to be changeable and are considered to be the main factor in structural variations in the cp genomes of higher plants [45]. To detect the divergence of the junction regions between the M. arvensis cp genome and related species, the LSC/IR/SSC junction regions and adjacent genes in the M. arvensis cp genome were compared with those in the five above-mentioned species (R. sativus, B. juncea, B. rapa, B. oleracea, and O. diffuses). The results showed that the M. arvensis cp genome was extremely similar to all tested Brassica cp genomes, except for the B. juncea and R. sativus cp genomes at the junction regions. Differences only occurred in the distance/cover of the ycf1 gene to the SSC/IRa and IRb/SSC junctions (Fig 3). The rps19 gene was always located at the LSC/IRB junction in all cp genomes.

Repeat sequence and SSR loci detection
We found 49 pairs of long repeat sequences ranging from 30 to 69 bp in the M. arvensis cp genome using REPuter, and we found that these sequences were predominantly composed of palindromic repeats (28 pairs), followed by forward repeats (19 pairs), and reverse repeats (two pairs), but no complementary repeats were observed (Table 3). Most long repeat sequence pairs (67.35%) were found in the same gene or in the same intergenic spacer region (IGS) region. Sixteen long repeat sequence pairs, most of which were belonged to the forward type, were identified in different genes or intergenic spacer (IGS) regions.

PLOS ONE
Chloroplast genome features of Moricandia arvensis ( Table 4). The longest SSR was a 22 bp long mononucleotide repeat located within ycf1. Only A-type (12 SSRs) and T-type (21 SSRs) mononucleotide repeats were observed. AT/TA was present in all 12 dinucleotide repeats, and the longest type of dinucleotide repeat was AT (20 bp). The unique trinucleotide repeat was ATT (12 bp), which was found in the intergenic spacer IGS region between trnT-UGU and trnL-UAA. Three tetranucleotide repeat types (CAAA, TAAA, and ATAG) and one hexanucleotide repeat (GAAAGT) were also detected in the M. arvensis cp genome. Among the SSR loci, 32 were found in the intergenic spacer IGS regions, accounting for 62.75% of the total SSRs. The remaining SSR loci were distributed in 11 genes, and the pseudogene ycf1 harbored most SSRs (six SSR loci).

Analysis of codon usage bias
The sequences of 79 protein-coding genes generated 25,447 codons. Among them, the leucine (Leu) codon was the most common, accounting for 10.46% of total codons, followed by those encoding isoleucine (8.67), whereas the encoding cysteine (Cys) showed the lowest frequency of only 1.21% (Table 5)

Analysis of Ka and Ks substitution rate
The Ka/Ks ratio has been extensively used to assess how genomic evolution and selection pressure affect genes [41,46]. The ratio of Ka/Ks < 1, Ka/Ks = 1, and Ka/Ks > 1 indicate genes that underwent purifying, neutral, and positive selections, respectively [40]. To determine whether the Ka/Ks ratio provides clues for the evolution of photosynthesis, we calculated the Ka/Ks ratios of 77 homologous coding genes (S3 Table) between the cp genome of M. arvensis and cp genome of the five related species (R. sativus, B. juncea, B. rapa, B. oleracea, and O. diffuses).
The results showed that almost all selected genes had Ka/Ks values < 1 (except petD in the comparison of B. juncea vs. M. arvensis and ycf2 in the comparison of R. sativus vs. M. arvensis). Some genes associated with "Subunits of ATP synthase" (atpB), "Subunit of acetyl-CoA" (accD), "Subunits of NADH-dehydrogenase" (ndhG and ndhE), and "Self-replication" (rps12 and rpl16), generally showed relatively higher Ka/Ks values (> 0.5) than those of other genes, indicating that these genes have likely undergone relatively higher purifying selection pressure.

Phylogenetic analysis
To determine the genealogical position of M. arvensis within Brassicaceae and to determine whether M. arvensis with C3-C4 characters underwent a unique evolutionary origin event, we downloaded cp genomes of 60 Brassicaceae species from the NCBI to construct a phylogenetic tree. The phylogenetic tree (Fig 4) demonstrated that M. arvensis was not clustered into an

Discussion
C4 plants have been studied extensively because of their effective photosynthetic carbon fixation efficiency, accounting for up to 25% of the Earth's primary productivity [14,47]. These C3-C4 intermediate plants are believed to be a bridge during C3--C4 evolution [48,49].
Understanding the cp genome of such C3-C4 intermediates will likely provide essential information on the evolution of photosynthesis. The present study, which reports the complete plastome sequence of M. arvensis confirmed, that joint application of two sequencing platforms, PacBio with long reads and Illumina with short reads, provide efficient and reliable approach for assembling and annotation of chloroplast genomes [17,50]. The M. arvensis cp genome had a typical quadripartite structure of 153,312 bp, which is comparable with the size of other Brassicaceae cp genomes [17,[51][52][53]. The gene content, gene order, and LSC/IR/SSC junction regions of the M. arvensis cp genome were highly conserved compared to those of related C3 species, indicating that no significant change has occurred in the cp genome during C3 to C3-C4 intermediate evolution. A recent study aimed at deciphering the cp genomes of C3, Kranz type C4 and single cell C4 photosynthetic members in Chenopodiaceae also showed that the cp genomes of C3, C4, and single cell C4 species had similar organizations, gene orders, and contents [54].
RLSB binding of chloroplastic rbcL mRNA is associated with C3 to C4 evolution in Flaveria [14]. However, rbcL did not reveal high Ka or Ks ratios in the M. arvensis cp genome (S3 Table) compared to those in related C3 species. In contrast, the genes associated with "Subunits of ATP synthase" (atpB), "Subunits of NADH-dehydrogenase" (ndhG and ndhE), and "Self-replication" (rps12 and rpl16) had relatively higher Ka/Ks values than those of other genes, indicating that these genes have likely accompanied the evolution of C3 to C3-C4 intermediates.
Phylogenetic studies have demonstrated that C3-C4 intermediates are more closely related to their C4 relatives than to their C3 relatives in families comprising C3, C3-C4 intermediate, and C4 (C4-like) species [24]. However, C4 plants are not distributed equally in the plant kingdom [55], and they have not been identified in Brassicaeae. To determine whether M. arvensis underwent a special evolutionary origin event compared to its C3 relatives, a phylogenetic tree was constructed based on common coding sequences. M. arvensis was not clustered into an independent branch, but it constituted a subclade with the cultivated Brassica species and R. sativus. This result is in accordance with the results of the phylogenetic analysis carried out by Warwick and Sauder [56] based on cp trnL intron sequences. A recent phylogenetic study of different Moricandia lines based on ITS data revealed a close relationship between C3-C4 intermediates and their C3 siblings [8]. These results indicate that the C3-C4 intermediate M. arvensis did not evolve through an independent evolutionary origin event.

Conclusion
In the present study, we obtained a well-annotated cp genome of M. arvensis with a C3-C4 intermediate character. We did not detect conspicuous genomic divergence in the M. arvensis cp genome compared to that of other related C3 species. However, the Ka/Ks analysis showed that some genes associated with photosynthesis had high Ka/Ks values compared with the genes of several related C3 species, indicating that these genes have likely accompanied the evolution of C3 to C3-C4 intermediates. Our phylogenetic analysis demonstrated that M. arvensis was clustered into a subclade with cultivated Brassica species and R. sativus, indicating that M. arvensis was not involved in an independent evolutionary origin event. These results will provide guidance for utilizing M. arvensis as a resource for improving photosynthesis in cultivated Brassica species and enable the collection of more data regarding the evolutionary path from C3 to C4. In future studies, deciphering nuclear genomic information will be vital to reveal the key steps in the evolution from C3 to C3-C4 intermediates.