Complete Chloroplast Genomes of Erianthus arundinaceus and Miscanthus sinensis: Comparative Genomics and Evolution of the Saccharum Complex

The genera Erianthus and Miscanthus, both members of the Saccharum complex, are of interest as potential resources for sugarcane improvement and as bioenergy crops. Recent studies have mainly focused on the conservation and use of wild accessions of these genera as breeding materials. However, the sequence data are limited, which hampers the studies of phylogenetic relationships, population structure, and evolution of these grasses. Here, we determined the complete chloroplast genome sequences of Erianthus arundinaceus and Miscanthus sinensis by using 454 GS FLX pyrosequencing and Sanger sequencing. Alignment of the E. arundinaceus and M. sinensis chloroplast genome sequences with the known sequence of Saccharum officinarum demonstrated a high degree of conservation in gene content and order. Using the data sets of 76 chloroplast protein-coding genes, we performed phylogenetic analysis in 40 taxa including E. arundinaceus and M. sinensis. Our results show that S. officinarum is more closely related to M. sinensis than to E. arundinaceus. We estimated that E. arundinaceus diverged from the subtribe Sorghinae before the divergence of Sorghum bicolor and the common ancestor of S. officinarum and M. sinensis. This is the first report of the phylogenetic and evolutionary relationships inferred from maternally inherited variation in the Saccharum complex. Our study provides an important framework for understanding the phylogenetic relatedness of the economically important genera Erianthus, Miscanthus, and Saccharum.


Introduction
The Poaceae is the grass family comprised of approximately 700 genera and more than 10,000 species and grouped into two major clades, BEP (the subfamilies Bambusoideae, Ehrhartoideae, and Pooideae) and PACMAD (the subfamilies Panicoideae, Arundinoideae, Chloridoideae, Micrairoideae, Aristidoideae, and Danthonioideae) [1][2][3]. The Andropogoneae is one of the tribes of the Panicoideae that includes many economically important C 4  Recent advances in pyrosequencing, which allows high-throughput sequence analysis of a wide range of genomes, has simplified sequencing, considerably increased its speed, and reduced the cost. This approach enables faster and more efficient determination of whole cp genome sequences, and has been applied to many plant species, including those in the Poaceae [33,34]. In this study, we present the complete cp genome sequence of Erianthus arundinaceus determined using pyrosequencing. On the basis of this sequence, we designed a primer set that is useful for validation of ambiguous sites such as homopolymeric and gap regions in Poaceae cp genomes, and also for sequencing of the entire cp genomes; we used these primers to sequence the whole cp genome of Miscanthus sinensis. Our analysis of these cp genomes provides detailed data on the distribution of SNPs, indels, and microsatellites in Saccharum and related genera. We also discuss the evolution of the Saccharum complex based on the sequence variations of these cp genomes.

Assembly and annotation of the chloroplast genomes of Erianthus arundinaceus and Miscanthus sinensis
The E. arundinaceus cp genome was sequenced using pyrosequencing on the 454 GS FLX system. A total of 481,406 sequence reads (average, 336 bp; range, 30-897 bp) were generated, representing a 162-Mbp sequence. After filtering the reads by local BLASTN analysis with the S. officinarum cp genome (GenBank accession No. NC006084) as a reference, 5,052 reads (average, 362 bp) were retained; a 12-fold coverage of the cp genome was reached. There were 30 homopolymeric stretches (!10 bp), which may lead to errors in the assembled sequences [35]. The accuracy of these regions and the inverted repeat (IR) junction regions in assembled sequences was confirmed by using PCR-based sequencing. Thus, the complete E. arundinaceus cp genome sequence was obtained. To determine the complete sequence of the M. sinensis cp genome, we used the Sanger sequencing with primers designed from the E. arundinaceus cp genome sequence. Sixteen overlapping regions were amplified with specific primers (Table 1) and a total of 320 sequence reads were obtained by using 258 primers, among which 253 primers (98.1%) were identical to both M. sinensis and S. officinarum cp genome sequences and 251 (97.3%) were also identical to that of Sorghum bicolor (S1 Table).
The complete cp genomes of E. arundinaceus (141,210 bp) and M. sinensis (141,416 bp) had typical circular structures (Fig 1). The cp genome of E. arundinaceus included a large singlecopy (LSC) region (83,170 bp) and a small single-copy (SSC) region (12,516 bp), which were separated by a pair of IRs (IRa and IRb; 22,762 bp each); that of M. sinensis consisted of an LSC (83,141 bp), an SSC (12,681bp), and two IRs (22,797 bp each). The GC content was 38.5% in the E. arundinaceus genome and 38.4% in the M. sinensis genome; these values were similar to those of other Panicoideae including S. officinarum [31], M. sinensis [33], and S. bicolor [36]. The number of genes was 143 in E. arundinaceus and 141 in M. sinensis, including 86 and 84 protein-coding genes, respectively. Each genome contained 8 ribosomal RNA (rRNA) genes and 49 transfer RNA (tRNA) genes. Coding genes accounted for 58.9% (E. arundinaceus) and 58.4% (M. sinensis) of the genomes ( Table 2). The difference in the gene number was due to a difference in ycf68 in the IR regions, which appeared to be a pseudogene in M. sinensis because of a frame-shift mutation. S. officinarum and E. arundinaceus have the complete ycf68 open reading frame, whereas S. bicolor has a frame-shift mutation at the same position as in M. sinensis. The members of the Saccharum complex also have lost accD, ycf1, and ycf2, which are absent in the cp genomes of other Panicoideae grasses [33,36,37]. We also found that the start codons of the rpl2 and rps19 genes are likely to convert to ACG and GTG via RNA editing during translation both in E. arundinaceus and M. sinensis, as reported in other species [37][38][39].

Sequence variations in cp genomes
We compared sequences determined in this study with those previously registered in GenBank (28 pertial sequences including 10 regions for E. arundinaceus and the whole cp genome sequence for M. sinensis). For E. arundinaceus, sequence variations were identified at ten sites in seven regions, of which four sites (in trnG-trnfM, atpB-rbcL, trnK intron, and rpl16 intron) were mutated in repeat regions (poly A or T). Base substitutions were detected at six sites (atpA-rps14, three sites in the rpl16 intron, rps16-trnQ, and rps3). In the atpA-rps14 intergenic spacer region we found an adenine-to-cytosine transition (A-to-C; A in Japanese accessions and C in Indonesian accessions), which could reflect geographical variation (S1 Fig). Detailed comparison between the M. sinensis sequence determined in this study and the previously reported one [33] (NC028721) detected three SNPs and nine indels. Of these, an indel in rpoC2 and a SNP in ycf3 resulted in amino acid sequence changes (S2 Table).

Whole-genome comparison in the Saccharum complex
A global alignment of the Saccharum complex cp genomes with the Zea mays cp genome (NC001666) as a reference is shown in Fig 2. High sequence similarities in the protein-coding regions were detected. The IR regions showed lower levels of sequence divergence than the single-copy regions, although there was some gene loss. The gene order was identical in E. arundinaceus, M. sinensis, and S. officinarum. However, detailed comparisons within the Saccharum complex revealed a number of SNPs and indels ( Table 3). The rates of SNP substitutions (nonsynonymous [dN] and synonymous [dS]) and their ratio (dN / dS) among the 76 protein-coding genes in comparison with those of Z. mays are shown in Table 4. The dN (0.0039) and dS (0.0170) values of E. arundinaceus were slightly higher than those of the other genera. The dN/ dS values of the Saccharum complex were smaller than 1.0, similar to those of other Poaceae [40][41][42]; these values suggest purifying selection of the cp protein-coding genes in these genera.   1 Length is indicated in base pairs. 2 Including genes detected in this study (not annotated in GenBank). 3 Including ycf15 and ycf68. 4 The numbers of duplicated genes are shown in parentheses. The distribution of microsatellites (also called simple sequence repeats) in the cp genomes of E. arundinaceus and M. sinensis is shown in Table 5. A total of 40 microsatellite regions (!8 bp) were identified in E. arundinaceus, including 36 mono-, 3 tri-, and one tetranucleotide repeats. In M. sinensis, a total of 38 regions were identified, including 36 mono-, one tri-, and one tetranucleotide repeats. The majority of repeats were located in non-coding regions, whereas some were found in genes such as psbC, rpoB, ndhK, infA, and rpl22. Two microsatellites (in rps16-trnQ/UUG and trnR/UCU-trnfM/CAU) were found in E. arundinaceus but not in M. sinensis.  (Fig 3). Maximum parsimony (MP) analysis generated a single most parsimonious tree with a length of 11,454 (consistency index, 0.57; retention index, 0.86; data not shown). The ML and MP trees had similar topology, which was also similar to those of the published phylogenetic trees of grasses based on complete cp genomes [37,43]. The 39 monocot taxa were divided into two major groups, one containing Poales, including the Saccharum complex, and the other one containing all other monocots. E. arundinaceus, M. sinensis, and S. officinarum were grouped into the PACMAD clade, which is one of the major Poaceae lineages. S. officinarum was more closely related to M. sinensis than to E. arundinaceus, in line with previous phylogenetic analyses [14,44].  (2) ndhA (1)  33 Transcription rpoA (4), rpoB (11), rpoC1 (7), rpoC2 (14)  36 Ribosomal proteins (Small subunit) 12 Ribosomal proteins (Large subunit) Rubisco rbcL (5)  5 Non-coding Intron (38)

Features of the chloroplast genomes of E. arundinaceus and M. sinensis
In this study, we determined the complete cp genome sequences of the members of the Saccharum complex, E. arundinaceus and M. sinensis, using 454 GS FLX pyrosequencing and Sanger sequencing. Pyrosequencing has been increasingly used for the sequencing of entire cp genomes, including those of species from several genera of the Poaceae family [33,36,37], because of its high throughput and low cost. However, homopolymer stretches (mononucleotide repeats) cause errors in pyrosequencing data; these errors are generally difficult to correct by increasing sequence read depth [45,46]. In addition, alignment gaps are often allowed in the assembled sequences [45]. In this study, we designed 258 primers, which made it possible to complete sequencing of the entire E. arundinaceus cp genome, and applied these primers to M. sinensis. These primers have high identity with other plant cp genome sequences such as those of S. officinarum and S. bicolor (S1 Table), and could be used, together with pyrosequencing, for resequencing of ambiguous sites such as homopolymeric and gap regions in Poaceae cp genomes, but also for sequencing of entire cp genomes. Homopolymers are often present in cp genomes and may be used as microsatellite markers. Because the cp genome sequences are highly conserved among grasses, microsatellite primers for cp genomes are transferable across species and genera. In addition, homopolymers are highly polymorphic, and are valuable markers for the analysis of differentiation and population structure, although overall the cp genome sequences are highly conserved. Inter-and intraspecific variations of cp microsatellites have been used to estimate the genetic diversity and phylogenetic relationships among species and genera [47]. With a threshold of !8 bp, we found 40 microsatellite loci for E. arundinaceus and 38 for M. sinensis, including 3 tri-and one tetranucleotide repeats, which were located mostly in non-coding regions. This information could be useful for the development of microsatellite markers for the analysis of genetic diversity in Erianthus, Miscanthus, and related genera.

Comparison of the sequences within and among Saccharum complex species
Comparison of the sequences determined in this study and the sequences previously registered in GenBank identified some polymorphisms. Most of them were found in homopolymeric regions in E. arundinaceus. A base substitution identified in the atpA-rps14 intergenic spacer region reflects geographic heterogeneity. Comparison of the whole cp genome sequences of two M. sinensis accessions detected SNPs and indels at 12 sites. These results indicated the presence of intraspecific mutations in the highly conservative cp genome and could be useful for the analysis of genetic diversity and evolution of Erianthus, Miscanthus, and related genera. However, Yook et al. [48] have reported (on the basis of phenotypic and nuclear SSR genotypic The gene contents differ slightly among the three genera of the Saccharum complex because of a frame-shift mutation that resulted in a premature stop codon and loss of the hypothetical gene ycf68 in these genera. Similar mutations have been reported in some other plant species [49]. Intact copies of another hypothetical gene, ycf15, were detected in both E. arundinaceus and M. sinensis cp genomes, although in some other species this gene contains several internal stop codons and is thus nonfunctional [49]. The validity of ycf15 and ycf68 as protein-coding genes is questionable: according to Raubeson et al. [50], their pattern of evolution is not consistent with them encoding proteins. Therefore, these genes were excluded from subsequent analysis in this study and further investigation is required to understand their functions.

Phylogenetic relationships and evolution
Our phylogenetic analysis based on the variation of the nucleotide sequences of 76 proteincoding genes in cp genomes separated Poales from other monocot groups with a bootstrap value of 100%, which is largely consistent with a recent analysis of other cp genome sequences [40]. Our data suggest that S. officinarum is more closely related to M. sinensis than to E. arundinaceus. We estimated that S. officinarum and M. sinensis diverged 3.6 mya, which is in good agreement with divergence times previously estimated on the basis of nuclear (3.1-3.8 mya) genome diversity [51,52]. A study based on restriction fragment length polymorphism analysis, which used 12 cp-specific probes and examined 32 Saccharum complex genotypes, showed that Erianthus diverged from other lineages early in the evolution of the subtribe Saccharinae [14]. Our analysis estimated the divergence time as 9.1 mya. In addition, E. arundinaceus diverged from the subtribe Sorghinae before the divergence of S. bicolor and the common ancestor of S. officinarum and M. sinensis. The present study showed that the cp genome of E. arundinaceus is more closely related to that of S. bicolor than to those of other members of the Saccharum complex. These data support the suggestion of Sobral et al. [14] that the evolutionary history of Erianthus may differ from that of other members of the Saccharum complex.
In the Old World, Erianthus species comprise four cytotypes: diploid (2n = 2x = 20), triploid (2n = 3x = 30), tetraploid (2n = 4x = 40), and hexaploid (2n = 6x = 60), with a basic number of x = 10 [4]. The present study does not clarify how Erianthus was established, and additional investigations are required. Inclusion of different cytotypes in phylogenetic analysis based on cp genome sequences may provide useful information on the origin and establishment of this genus. Maternal origin of hybrids and polyploids of several species has been investigated using cpDNA variations [53][54][55]. The use of combined data on nuclear and cpDNA variations may help determine the origin and evolutionary history of polyploids [56]. In the subtribe Saccharinae, comparative analysis of nuclear genome variations in Saccharum and Miscanthus suggested that a whole-genome duplication occurred in their common ancestor [51]. This molecular phylogenetic approach, which is used to elucidate the origin and history of polyploidization, could also contribute to characterization of the phylogenetic relationships of Erianthus. Therefore, understanding nuclear genome variations, especially in low-copy nuclear loci [52,57], together with cp genome variations would also be useful for clarifying the evolution of the Erianthus polyploid complex. Understanding its evolution could help us to gain more insight into the phylogenetic relationships of the Saccharum complex genera and provide useful information on their ancestor and polyploidization, which is critical for genetic studies and breeding in these genera.

Conclusion
Comparison of the complete cp genomes provided detailed information on genetic variations among three economically important genera, Saccharum, Erianthus, and Miscanthus. Comparison of the sequences indicated that S. officinarum and M. sinensis are more closely related to each other than to E. arundinaceus. We suggest that E. arundinaceus diverged from the subtribe Sorghinae before the divergence of S. bicolor and the common ancestor of S. officinarum and M. sinensis. This is the first report of phylogenetic and evolutionary relationships among the three genera of the Saccharum complex inferred from maternally inherited variations in whole cp genomes and gene data sets. Our results provide an important framework for understanding the phylogeny and evolutionary history of the Saccharum complex. Molecular data for the other genera of the complex, Narenga and Sclerostachya, are limited and further studies on these genera are needed to improve our understanding of the phylogeny and evolution of the Saccharum complex.

Plant materials and DNA extraction
The E. arundinaceus accession JW630 (Genebank accession number JP173957 at the Genetic Resources Center of the National Institute of Agrobiological Sciences, Japan; https://www. gene.affrc.go.jp/index_en.php) is a wild hexaploid collected in Shizuoka prefecture, Japan (the northernmost area of the wild E. arundinaceus range in Japan). The M. sinensis accession Niigata 410 (JP177091) is a wild diploid collected in Niigata prefecture, Japan. Plants were cultivated in a greenhouse at the National Agriculture and Food Research Organization, Institute of Livestock and Grassland Science (NARO-ILGS), and genomic DNA was isolated from fresh green leaves using the CTAB method [58].

E. arundinaceus chloroplast genome sequencing and assembly
The E. arundinaceus cp genome was sequenced by using pyrosequencing. Total E. arundinaceus genomic DNA was sheared by nebulization and then amplified by emulsion PCR. Amplification products were sequenced on a 454 GS FLX Titanium platform (Roche, Basel, Switzerland) [59]. Chloroplast sequence reads were extracted by local BLASTN searches using the cp genome of S. officinarum [31] as a reference and assembled with Newbler software (v 2.5; Roche). Homopolymer regions (poly A/T and poly G/C) and the junctions between single-copy regions (LSC and SSC) and IRs were amplified and confirmed using primers designed from the E. arundinaceus cp sequence (S1 Table) and PrimeSTAR HS DNA polymerase (TaKaRa, Shiga, Japan). PCR products were purified in a QuickStep2 PCR Purification system (Edge Biosystems, Gaithersburg, MD, USA). They were cycle-sequenced with a BigDye Terminator Cycle Sequence Kit v3.1 (Life Technologies, Foster city, CA, USA) and sequenced using an ABI3130xl genetic analyzer (Life Technologies) using primers described below (S1 Table).

M. sinensis chloroplast genome sequencing
The M. sinensis cp genome was sequenced by using Sanger sequencing of PCR products. Sixteen primers to amplify overlapping products (1,747-12,913 bp) were designed from the E. arundinaceus cp genome sequence for initial amplification of the M. sinensis cp genome (Table 1). Amplification reactions and cycle-sequencing were performed as described above for E. arundinaceus. A total of 258 primers (S1 Table) were used to sequence the entire M. sinensis cp genome.

Annotation, microsatellite analysis, and comparison of the chloroplast genomes
The entire sequences of the E. arundinaceus and M. sinensis cp genomes were annotated using Dual Organellar GenoMe Annotator (DOGMA) software [60]. The predicted annotations were manually checked and verified by comparison with sequences from other PAC-MAD clade species. The circular chloroplast genome maps were drawn by GenomeVx software [61].
Genome structures among the genera of the Saccharum complex were compared using mVISTA software in Shuffle-LAGAN mode [63]; sequence annotation of Z. mays was used.

Substitution rates
Substitution rates were calculated using the PAMLX package [64]. The program CODEML in PAMLX was employed to estimate the rates of nonsynonymous (dN) and synonymous (dS) substitutions and their ratio (dN / dS) in 76 cp protein-coding genes aligned by using PAL2-NAL [65]. The maximum likelihood (ML) tree (see below) was used as a topologically constrained tree. The F3 × 4 model was adopted for codon frequencies under the branch-site model (model = 2, NSsites = 2, and cleandata = 1).

Phylogenetic analysis
Nucleotide sequences of 76 cp protein-coding genes of 37 monocot angiosperms and one dicot angiosperm (Artemisia frigida) available in the GenBank database, and those of E. arundinaceus and M. sinensis were concatenated and aligned using Clustal W [66]. After manual editing, phylogenetic analyses using ML and maximum parsimony (MP) were performed with MEGA6 [67] using subtree-pruning-regrafting and nearest-neighbor-interchange algorithms, respectively. The gaps in the alignment were treated as missing data and statistical support at each node was assessed by bootstrapping [68] with 1,000 replicates. Bootstrap values are indicated on the tree.

Estimation of divergence time of the Saccharum complex
A set of 76 protein-coding genes was aligned and used for the estimation of divergence time. The analysis was performed with nine species including three species of the Saccharum complex with a focus on the PACMAD clade (Fig 4) using the BEAST2 program, which infers tree topology, branch lengths, and node ages by using Bayesian inference and Markov Chain Monte Carlo (MCMC) analysis [69]. The AIC (Akaike Information Criterion) analysis was performed by using jModelTest 2.1.6 [70] to identify the best fit of the substitution model for mutation rates. BEAUti in the BEAST2 program was used to set the criteria for the analysis. We used the GTR (general-time reversible) model of nucleotide substitution with five categories of gamma-distributed rate. An uncorrelated lognormal model of rate variation among branches was assumed and a Yule prior on the birth rate of new lineages was employed [71]. A single divergence time was previously estimated, assuming that the major diversification of the grass groups occurred 80 mya and the Andropogoneae crown diverged 20 mya [72,73]; these two time points were used to calibrate the age of the stem nodes. Two independent MCMC runs were performed for 10 million generations with tree sampling every 1,000 generations. The results were checked with Tracer 1.6 [74], and the sampled trees were summarized by using TreeAnnotator v.2.1.2 available in the BEAST2 package, and edited by using FigTree v. 1.4.2 [75]. The mean and the estimated 95% highest posterior density interval for the divergence time are given for the major PACMAD lineages.