Complete chloroplast genome features and phylogenetic analysis of Eruca sativa (Brassicaceae)

Eruca sativa Mill. (Brassicaceae) is an important edible vegetable and a potential medicinal plant due to the antibacterial activity of its seed oil. Here, the complete chloroplast (cp) genome of E. sativa was de novo assembled with a combination of long PacBio reads and short Illumina reads. The E. sativa cp genome had a quadripartite structure that was 153,522 bp in size, consisting of one large single-copy region of 83,320 bp and one small single-copy region of 17,786 bp which were separated by two inverted repeat (IRa and IRb) regions of 26,208 bp. This complete cp genome harbored 113 unique genes: 79 protein-coding genes, 30 tRNA genes, and four rRNA genes. Forty-nine long repetitive sequences and 69 simple sequence repeats were identified in the E. sativa cp genome. A codon usage analysis of the E. sativa cp genome showed a bias toward codons ending in A/T. The E. sativa cp genome was similar in size, gene composition, and linearity of the structural region when compared with other Brassicaceae cp genomes. Moreover, the analysis of the synonymous (Ks) and non-synonymous (Ka) substitution rates demonstrated that protein-coding genes generally underwent purifying selection pressure, expect ycf1, ycf2, and rps12. A phylogenetic analysis determined that E. sativa is evolutionarily close to important Brassica species, indicating that it may be possible to transfer favorable E. sativa alleles into other Brassica species. Our results will be helpful to advance genetic improvement and breeding of E. sativa, and will provide valuable information for utilizing E. sativa as an important resource to improve other Brassica species.


Introduction
Eruca sativa Mill. is an annual or perennial species of Eruca (Brassicaceae), mainly distributed in Europe and Western Asia. E. sativa is believed to have originated from the Mediterranean region and has been widely used as an oil crop and edible vegetable [1]. Due to the fragrance of its leaves, E. sativa is also a popular salad and spice in Middle Eastern and European countries [2]. Moreover, recent studies have demonstrated that E. sativa has many medicinal and reads were aligned to the published Arabidopsis thaliana (NC_000932) cp genome using BLASR (Basic Local Alignment with Successive Refinement) [26] (E-value: . Then, these selected short reads were used to assemble scaffolds using SOAPdenovo v2.04 according to the default parameters [27]. The low-quality PacBio reads (minimum read length of 500 bp and minimum read quality of 0.80) were removed from the raw data. The long selected PacBio reads were employed to fill the gaps within the scaffolds with PBJelly [28]. To correct possible mis-assemblies and errors, the Illumina reads were aligned to the assembled cp genome using BWA (version 0.5.9) with the default settings [29]. Frame-shift errors were manually corrected during gene prediction. The Dual Organellar Genome Annotator was used to annotate the E. sativa cp genome with default settings [30]. The start and stop codons of each gene were verified by homology searches using BLAST (Basic Local Alignment Search Tool). Then, the E. sativa circular gene map was drawn in OGDraw software version 1.2 [31]. The well-annotated cp genome of E. sativa is available in the public GenBank database (https://www.ncbi.nlm.nih.gov/) under the accession number of MT013255.

Codon bias usage analysis
To understand the translation dynamics of the E. sativa cp genome, the CodonW1.4.2 program [34] (http://downloads.fyxm.net/CodonW-76666.html) was employed to calculate the synonymous codon usage of the protein-coding genes with default settings. The relative synonymous codon usage (RSCU) of all coding genes was also analyzed.

Analysis of the molecular evolution of coding genes
Pairwise comparisons of 79 protein-coding genes shared between E. sativa and six related Brassicaceae species were employed to calculate non-synonymous (Ka) and synonymous (Ks) substitution rates. Pairwise alignments of these genes were carried out by with MAFFT, and the Ka/Ks value was determined with KaKs calculator (version 2.0) according to the default parameters.

Cp genome assembly and features
The Illumina sequencing platform generated 8,078 Mb of raw data, resulting in an average coverage of more than 50,000 over the cp genome. After removing the adapters and low-quality reads, 7,722 Mb of clean data were obtained with an average Q20 of 97.62%. The PacBio platform generated 20,921 subreads with an N50 length of 4,698 bp and an average length of 4,002 bp (S2 Table). Both the Illumina reads and the PacBio subreads were used together to assemble the E. sativa cp genome (see Materials and Methods section). The complete E. sativa cp genome had a quadripartite structure comprised of 153,522 bp, including an SSC region of 17,786 bp and an LSC region of 83,320 bp, which were separated by a pair of inverted repeats (IRa and IRb) of 26,208 bp ( Table 1, Fig 1). The average GC content of the cp genome was  [37]. The average gene length was 867 bp, and protein-coding gene regions accounted for 65.65% of the total sequence. The total length of the genic regions was 74,547 bp, representing 48.46% of the whole cp genome. A total of 82 genes, including 59 protein-coding genes and 23 tRNAs, were observed in the LSC regions. A total of 28 genes: five protein-coding genes, five tRNAs, and four rRNAs, were repeated in the IR regions, while only 11 genes were found in the SSC regions. Among the 113 genes, 14 genes (eight protein-coding genes and five tRNAs) harbored a single intron, whereas three genes (rps12, clpP, and ycf3) possessed two introns (Table 2). Moreover, rps12 was a trans-spliced gene, as reported previously [38]. Detailed information about the gene copy number, the number of introns, and the gene functions are listed in Table 2.

Long repeat sequence and SSR analysis
Long repeat sequences exist widely throughout the genome, and play an essential role in gene expression and regulation. Furthermore, due to the high polymorphism present in these regions, long repeat sequences are ideal for generating genetic and physical maps [39,40]. In the present study, 49 pairs of long repeat sequences were identified, including 19 forward repeats, 28 palindromic repeats, one reverse repeat (44 bp), and one complementary repetition (49 bp) ( Table 3; Fig 2). The longest repeat (85 bp) was a forward type located in the LSC region. Among these long repeats, 30 (61.2%) repeats were found in the LSC region. A majority of the repeat pairs (37 of 49, 75.5%) were found in the same regions, including 30 repeats in the LSC region, five repeats in the IR regions, and two repeats in the SSC region. However, 12 repeats, comprising nine palindromic types and three forward types, were detected in different regions. Complementary repeats are infrequent in other cp genomes in Brassicaceae [41,42]. SSR loci are widely found in various species and are useful for studies of molecular evolution and genetic diversity as well as the development of molecular markers essential for plant breeding [43,44]. Sixty-nine SSRs were identified in the E. sativa cp genome (Table 4; Fig 2), including 44 mononucleotides, 15 dinucleotides, four trinucleotides, two tetranucleotides, two pentanucleotides, and two hexanucleotides with a length of at least 10 bp. Fewer SSR loci were detected than the number of SSR loci reported in other cp genomes of Brassicaceae [16,18,41,45]. Among the 44 mononucleotide repeats (including 15 A type, 27 T type, one G type, and one C type), the longest SSR was one T type of 17 bp, which was found in the SSC regions.
Large subunit of ribosome rps3, rps4, rps7 a , rps8, rpl14, rpl16 b , rpl2 a,b , rpl20, rpl22, rpl23a,rpl32, rpl33 Other genes Maturase matK Similar distributions of mononucleotide repeats were observed in the cp genomes of B. napus [45], Raphanus sativus [18], Nasturium officinale [41], and Sinapis alba [16]; however, the mononucleotide repeats of the G type was only observed in this cp genome. The AT/TA type contributed to all 14 dinucleotides, and the longest type of dinucleotides was AT type of 20 bp. Four trinucleotide repeats were detected, including two AAT (12 bp) types and two ATT (12 bp), which were located in the LSC and IR regions respectively. Two tetranucleotides repeats, CAAA (12 bp) and ATAG (12 bp), were found in the LSC and SSC regions, respectively. Two pentanucleotides (TGTTG and CAACA) and two hexanucleotides (GAAAGT and GTTAGA) were also detected in this cp genome. The number and types of SSR loci varied extensively compared to other cp genomes in Brassicacea using the same identified software and criteria, which supports the idea that these SSRs can be made into lineage-specific markers for genetic diversity analysis. Furthermore, SSRs have been used as markers to understand the evolutionary history [46,47].

Codon biased usage analysis
Codon usage bias exists widely in plastoms and is believed to play a key role in reshaping the cp genome [48]. Moreover, the codon usage bias of some genes in plastoms likely responds to outside pressure [49]. In this study, the codon usage bias and RSCU were analyzed based on 85 CDs (coding sequences) sequences in the E. sativa cp genome. These CDs generated a sequence of 74,547 bp in length, which encoded 24,849 amino acids (Table 5). Of these acids, 2,658 (10.70%) were leucine, representing the most popular type, followed by isoleucine (2,134 codons, 8.59%) and serine (1915 codons, 7.71%), whereas only 300 (1.21%) cysteines were detected (Fig 3). The detailed information of codon usage of the 85 CDs in the E. sativa cp genome is listed in Fig 4. The RSCU values of 29 codons were >1, indicating that these codons

PLOS ONE
The complete chloroplast genome of Eruca sativa were preferentially used. Among these biased codons, the codon for leucine (UUA), was the most preferred codon with an RSCU value of 2.03. The UGG (tryptophan) and AUG (methionine) codons showed no biased usage (RSCU = 1). All of the biased codons ended with A/U, except UUG, which agrees with the results for the N. officinale [41] and S. alba cp genomes [16], suggesting that codon usage of the cp genome in Brassicaceae is conserved.

Comparative analysis of cp genomes of six related species
To detect divergence between the E. sativa cp genome and its related species, six reported cp genome sequences in Brassicaceae were downloaded from the NCBI, including five important Brassica species (B. rapa, B. oleracea, B. juncea, B. nigra, and B. napus), and the model species: A. thaliana. As shown in Table 6, these cp genomes were generally highly conserved. Briefly, the sequences ranged from 152,860 bp (B. napus) to 154,478 bp (A. thaliana) in length, and each component of the quadripartite cycle was comparable among the selected cp genomes. The overall GC content was also quite similar (36.29-36.39%) among these cp genomes. The gene content in these cp genomes was consistent, except that the ycf15 gene was missing in the A. thaliana cp genome. The ycf15 gene was also missing in the S. alba cp genome [16],

PLOS ONE
The complete chloroplast genome of Eruca sativa indicating that this gene was varied widely in Brassicaceae. The adjacent genes and boundaries of LSC/IRb/SSC/IRa among the seven related cp genomes were compared (Fig 4), because the variable boundary regions that are believed to be adriving force for the variation in the angiosperm cp genomes [50]. In this study, the IRb/LSC boundary was located within the coding region of the rps19 gene in all seven selected species. Furthermore, the rps19 gene had an expansion of 113/114 bp in the IRb region in all selected cp genomes. The trnH-GUG gene and rpl2 gene resided at the LSC/IRa border, respectively, in all seven cp genomes, and the trnH gene was 2-30 bp from the border. Additionally, the boundary of IRb/SSC was located in the repetitive regions of the ycf1 and ndhF genes in all seven species, with only 1 bp of ycf1 and 36 bp of ndhF located in the SSC region. The IRa/SSC junction extended into the ycf1 gene in  the other five species and the length of the ycf1 genes ranged from 1,027 bp to 1,030 bp in the IRa region, but the ycf1 gene was missing from the IRb/SSC regions in the E. sativa and B. juncea cp genomes. Similar results were observed in mustard species, such as S. alba [16] and S. arvensis.
To further detect divergence of the cp genomes among related species, and to verify whether genome rearrangement had taken place in the E. sativa cp genome, we used mVISTA to compare the homology of the whole cp sequence among the seven selected cp genomes of Brassicaceae, using the E. sativa cp genome as a reference (Fig 5). The results showed that no genome structural rearrangement had occurred in any of the selected cp genomes, and the selected cp genomes were highly conserved with a genome similarity of > 90%. However, the non-coding regions were more divergent than the coding regions, and the LSC and SSC regions were also more divergent than the IR regions. Furthermore, the matk, atpA, rpoC2, accD, rpoA, rps19, ycf2, ycf1 and ccsA genes were quite mutable. Synonymous (Ks) and nonsynonymous (Ka) nucleotide substitution patterns of genes are important indicators of gene evolution [18]. The Ka/Ks ratio is used to assess whether there are selective pressures on protein-coding genes or to evaluate the rate of gene divergence. Ka/Ks ratios > 1, close to 1, or < 1 indicate that the gene has undergone positive selection, neutral selection, or purifying selection, respectively [51]. In this study, we calculated the Ka/Ks ratios of the E. sativa cp genome compared to six closely related species, including B. rapa, B. oleracea, B. juncea, B. nigra, B. napus and A. thaliana (Fig 6). A total of 79 homologous CDs were selected to calculate the Ka/Ks among the selected cp genomes. The results showed that the average Ka/Ks ratio was 0.17, after removing the genes with Ka or Ks of 0, indicating that the genes in the E. sativa cp genome were subject to strong purifying selection pressures. The majority of genes

PLOS ONE
The complete chloroplast genome of Eruca sativa had a Ka/Ks ratio < 0.5 in all comparisons, and the Ka/Ks ratios of most of the genes were comparable among each of the comparisons, except the rps3, accD, ndhB, rpl22, ycf1, ycf2, and rps12 genes of which the Ka/Ks ratio was elusive. For example, the Ka/Ks ratio of ycf1 was 1.144 in the comparison of B. oleracea vs. E. sativa, but the ratio was < 1 in the other five comparisons. Similar results were observed for ycf2 and rpl22. The ycf1 and ycf2 genes are considered to be pseudogenes and have been believed to have no function in plants for a long time [52,53]. However, knockout studies have shown that the ycf1 gene is indispensable for plant cell survival [45]. The various Ka/Ks ratios of the ycf2 gene observed in the S. alba cp genome compared to other related species, indicate that the ycf2 gene was reshaped in response to outside selection stress. It is the largest plastid gene reported in angiosperms, but the function of ycf2 is largely unknown [54].

Phylogenetic analysis of the Brassicaceae
Cp genomes containing a large amount of genetic information have been more accessible with the development of high-throughput sequencing technology. Accordingly, several studies have employed cp genomes to detect the phylogenetic relationships in the Brassicaceae family [15,16,18,41]. Although several studies have systematically elucidated the phylogenetic relationships of Brassicaceae species based on nuclear gene [55][56][57] and cp gene information [16,41,58], the systematic position of E. sativa remains unclear. Based on the AG (cis-regulatory sequences of the floral homeotic gene) gene, E. sativa is closely related to S. alba [59]. Another study employing several cp genes to construct the phylogenetic relationship demonstrated that the genus Eruca had the closest relationship with Diplotaxis harra [58]. However, both studies indicated that the genus Eruca is closely related to the genus Brassica. In this study, the cp genomes of 59 Brassicaceae species were used for the phylogenetic analysis based on 62 homologous CDs. The phylogenetic tree generated 58 branches with node values > 48% (Fig 7). Among these branches, 50   closer to the genus Sinapsis than other Brassica species at the cp genomic level [16]. These studies support that the genera Brasscia, Eruca, Sinapsis, and Raphanus share a similar ancestor, or exchanges/captures of maternal genetic information occurred among these species during speciation.

Conclusions
In the present study, we obtained the complete cp genome of E. sativa using the combination of PacBio Sequel and Illumina HiSeq reads. The results showed that the cp genome had a typical quadripartite structure of 153,522 bp, consisting of two copies of inverted repeat (IRa and IRb) regions of 26,208 bp separated by one LSC region of 83,320 bp and one SSC region of 17,786 bp. The cp genome harbored 112 unique genes, including 79 protein-coding genes, 29 tRNA genes, and four rRNA genes. The synonymous (Ks) and non-synonymous (Ks) substitution rate analysis showed that protein-coding genes generally underwent purifying selection pressure, except ycf1 and ycf2. A phylogenetic analysis revealed that E. sativa is closely related to agriculturally important Brassica species, and most closely related to B. juncea, indicating that it may be possible to transfer favorable E. sativa alleles into other Brassica species. These results are helpful to further genetic improvement and breeding of E. sativa, and also provide valuable information for understanding the evolutionary history of E. sativa.
Supporting information S1