Characterization of the whole chloroplast genome of Chikusichloa mutica and its comparison with other rice tribe (Oryzeae) species

Chloroplast genomes are a significant genomic resource in plant species and have been used in many research areas. Complete genomic information from wild relatives of crop species could supply a valuable genetic reservoir for breeding. Chikusichloa mutica is one of the most important wild distant relatives of cultivated rice. In this study, we sequenced and characterized its complete chloroplast (cp) genome and compared it with those of other species in the same tribe. The whole cp genome sequence is 136,603 bp in size and exhibits a typical quadripartite structure, with large and small single-copy regions (LSC, 82,327 bp; SSC, 12,598 bp) separated by a pair of 20,839-bp inverted repeats (IRA and IRB). A total of 110 unique genes are annotated, including 76 protein-coding genes, 4 ribosomal RNA genes and 30 tRNA genes. The genome structure, gene order, GC content, and other features are similar to those of other angiosperm cp genomes. In the comparison of cp genomes between the Oryzinae and Zizaniinae subtribes, the main differences were found in the junction regions and in the distribution of simple sequence repeats (SSRs). Between the two Chikusichloa species, the genomes differed by only 40 bp in length, and 108 polymorphic sites were found across the whole cp genomes, including 83 single nucleotide polymorphisms (SNPs) and 25 insertion-deletions (Indels). The complete cp genome of C. mutica will be an important genetic tool for future breeding programs and for understanding the evolution of wild rice relatives.


Introduction
The grass family (Poaceae) is one of the most diverse angiosperm families and contains numerous economically important crop species [1].

Complete chloroplast genome of Chikusichloa mutica
Fresh leaves of Chikusichloa mutica were collected from a plant (originally collected in the wild by Prof. Song Ge, #GS0601, for [34]) grown in the greenhouse of the Institute of Botany of the Chinese Academy of Sciences in Beijing. Total cellular DNA was extracted using the cetyltrimethyl ammonium bromide (CTAB) method and purified by phenol extraction [34]. PCR amplification and Sanger sequencing were employed to complete the whole chloroplast genome of C. mutica. Based on the conserved features of the chloroplast genome in land plants [21,24] and our previous results [14,15], we used the chloroplast primers from Wu et al. [35] to amplify the entire chloroplast genome in overlapping fragments. Conditions for PCR amplification were 4 min of initial denaturation at 94°C; 35 cycles of 45 s at 94°C, 45 s of annealing at 52°C, and 90 s of extension at 72°C; and a final 10-min incubation at 72°C. The PCR products were purified as described in Tang et al. [34] and directly sequenced on an ABI 3730 (Applied Biosystems, Foster City, CA, USA). The final Sanger sequences were trimmed and assembled with the ContigExpress program from the Vector NTI Suite 6.0 (Informax Inc., North Bethesda, MD).

Chloroplast genome annotation
The final assembled chloroplast sequence was submitted to DOGMA (Dual Organellar GenoMe Annotator, http://dogma.ccbb.utexas.edu/) for annotation. The original DOGMA draft output contained many errors caused by variation in the exon-intron boundaries of genes or by questionable positioning of the start and stop codons. To finalize the annotation, we inspected all the inaccurate positions and performed BLAST searches against the published chloroplast genomes of related species to guide manual adjustments. Both tRNA and rRNA genes were identified by combining BLASTN searches against related species in the rice tribe [14] with the DOGMA tools. The final annotation was submitted to GenBank, and the diagrammatic annotation of the chloroplast genome was plotted using Circos 0.67 [36] (Fig 1).

Polymorphism detection
To compare polymorphisms in detail between the whole chloroplast genomes within Chikusichloa, the published genome of C. aquatica (KR078265) [16] was compared with our newly completed chloroplast genome of C. mutica. Because the structure of chloroplast genomes is conserved within the grass family [14,37], the two genome sequences could be aligned by synteny. MAFFT v7.221 [38] was used to conduct the whole chloroplast genome alignment under the FFT-NS-2 setting, followed by manual adjustment. The number and position of the polymorphic sites, including SNPs (single nucleotide polymorphisms) and Indels (insertions/deletions), were extracted from the two aligned genome sequences with DnaSP v5.10 [39].
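The counting step can be illustrated with a minimal Python sketch. It is not the DnaSP procedure used in this study; it assumes a pairwise alignment given as two equal-length strings with "-" for gaps, treats each uninterrupted run of gap columns in one sequence as a single Indel event, and ignores ambiguity codes. The function name and the toy sequences are hypothetical.

```python
def count_polymorphisms(seq_a, seq_b, gap="-"):
    """Return (snp_columns, indel_events) for two aligned sequences.

    indel_events holds (start, end, gapped_sequence) tuples with 0-based,
    inclusive alignment columns; gapped_sequence is "a" or "b".
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")

    snp_columns = []
    indel_events = []
    open_event = None  # [start, end, gapped_sequence] while a gap run is open

    for col, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        gap_a, gap_b = a == gap, b == gap
        if gap_a != gap_b:  # exactly one sequence is gapped in this column
            which = "a" if gap_a else "b"
            if open_event and open_event[2] == which and open_event[1] == col - 1:
                open_event[1] = col  # extend the current gap run
            else:
                if open_event:
                    indel_events.append(tuple(open_event))
                open_event = [col, col, which]
            continue
        if open_event:  # the gap run ended in the previous column
            indel_events.append(tuple(open_event))
            open_event = None
        if not gap_a and a != b:  # two different bases: a substitution
            snp_columns.append(col)

    if open_event:
        indel_events.append(tuple(open_event))
    return snp_columns, indel_events


# Toy alignment (invented for illustration, not real chloroplast sequence).
snps, indels = count_polymorphisms("ACGTTTGA-CA", "ACCT--GATCA")
```

Counting gap runs rather than gap columns matters here because a multi-base insertion or deletion is normally scored as one mutational event.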

Simple sequence repeats (SSRs)
Simple sequence repeats (SSRs), also known as microsatellites, have repeat motifs 1-6 bp long and are common genomic features with high rates of polymorphism due to their slipped-strand mispairing mutation mechanism [40]. They have been widely used as co-dominant molecular markers in marker-assisted breeding, population genetics, and genetic linkage mapping [41].
To identify the distribution of SSRs across the chloroplast genome, the public Perl script MISA (http://pgrc.ipk-gatersleben.de/misa/) was employed. SSRs were identified for motif sizes from one to six nucleotides, with minimum repeat thresholds of 6, 5, 4, 3, 3, and 3 repeat units for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide SSRs, respectively. Chikusichloa mutica and 13 other species in the rice tribe were examined for SSRs. Potamophila parviflora (GU592210) and Microlaena stipoides (GU592211) were excluded from this analysis because their chloroplast genomes are incomplete.
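The thresholds above can be made concrete with a short Python sketch of a perfect-SSR scan. This is an illustration only and does not reproduce MISA itself: it assumes an unambiguous A/C/G/T sequence, reports only perfect repeats, and skips motifs that are themselves repeats of a shorter unit (so a poly-A run is not also reported as an "AA" dinucleotide SSR). The names MIN_REPEATS, is_primitive and find_ssrs, and the example sequence, are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text above.
MIN_REPEATS = {1: 6, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repeat of a shorter unit (e.g. 'ATAT')."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples, 0-based starts."""
    seq = sequence.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # One motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        pos = 0
        while pos < len(seq):
            match = pattern.match(seq, pos)
            if match and is_primitive(match.group(1)):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), match.group(1), repeats))
                pos = match.end()  # continue after this SSR
            else:
                pos += 1
    return sorted(hits)


# Invented example: a poly-A run, an (AT)5 run and an (AGGT)3 run.
example = "GGC" + "A" * 7 + "CGC" + "AT" * 5 + "GCC" + "AGGT" * 3 + "C"
ssrs = find_ssrs(example)
```

Scanning each motif length separately keeps the sketch simple; SSRs of different motif lengths may therefore overlap at their ends, which a production tool may handle differently.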

Chloroplast phylogenomics analysis
As an important target in plant systematics, the chloroplast genome has been widely used to resolve phylogenetic relationships among plant lineages [19]. To further determine and validate the phylogenetic relationships of C. mutica with other Oryzeae species, published chloroplast genomes were included in the phylogenetic analysis, comprising 15 species from the subfamily Ehrhartoideae (Table 1) and one species (Phyllostachys propinqua) from Bambusoideae. Together with C. mutica, whole chloroplast genome data from a total of 17 species were analyzed. Because chloroplast genome structure is conserved across the grass family [14,37,42], the complete chloroplast genome alignment of the 17 species was used to construct the phylogenetic tree. The alignment was performed with MAFFT v7.221 [38] using the same settings as described for polymorphism detection above. The final alignments (S1 File) were used to resolve relationships with three different phylogenetic inference methods: maximum parsimony (MP) analysis in PAUP* 4.0b10 [43], Bayesian inference (BI) in MrBayes 3.1.2 [44], and maximum likelihood (ML) in PHYML version 2.4.5 [45], applying the settings described previously [14].

Genome assembly and features
Using the full set of primers from Wu et al. [35], the complete chloroplast genome of C. mutica was sequenced and assembled. Each amplicon was Sanger sequenced in both directions to obtain high-quality base calls. After assembly and editing, the whole chloroplast genome sequence was 136,603 bp in length. The genome was annotated following the methods of Wu and Ge [14] and deposited in GenBank under accession number KU696970. The chloroplast genome of C. mutica has a typical quadripartite structure consisting of a pair of inverted repeats (IRs), each 20,839 bp in length, separated by a small single-copy region (SSC) of 12,598 bp and a large single-copy region (LSC) of 82,327 bp (Fig 1; Table 1). It is an AT-rich genome, typical of most land plants [18], with a GC content of only 39.04%, similar to most of the published chloroplast genomes in the rice tribe (Table 2). The GC content of the two IR regions was 44.37%, higher than that of the LSC region (37.20%) and the SSC region (33.37%) (Table 1). The higher GC content of the IR regions was due to the high GC content (54.78%) of the four ribosomal RNA (rRNA) genes. The overall average GC content of the rice tribe species was 38.99% (±0.0004), with the highest GC content in the IR region (44.34%) and the lowest in the SSC region (33.31%) (Table 2).
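The reported region sizes and GC values can be cross-checked with a few lines of Python. The sketch below uses only the figures stated above (the two IR copies are entered separately as IRA and IRB); the gc_content helper is a hypothetical illustration of how a GC percentage is computed from a sequence and is not taken from the study's pipeline.

```python
# Region lengths (bp) and GC percentages as reported in the text.
REGIONS = {
    "LSC": (82_327, 37.20),
    "SSC": (12_598, 33.37),
    "IRA": (20_839, 44.37),
    "IRB": (20_839, 44.37),
}

# The four regions should add up to the full genome length.
total_length = sum(length for length, _ in REGIONS.values())

# Length-weighted mean of the regional GC values approximates the genome GC.
weighted_gc = sum(length * gc for length, gc in REGIONS.values()) / total_length


def gc_content(sequence):
    """GC percentage of a nucleotide sequence (A/C/G/T only)."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * sum(base in "GC" for base in seq) / len(seq)
```

The regions sum to 136,603 bp, and the length-weighted GC value agrees with the reported genome-wide 39.04% to within rounding of the regional percentages.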
To understand the structural differences between chloroplast genomes in the rice tribe, we compared 15 genomes in the rice tribe and one from bamboo (Table 2). The total length variation between the complete genomes was approximately 2 kb, ranging from 134,494 bp to 136,603 bp, with the genomes in Zizaniinae longer than those in Oryzinae. The main contribution to the difference in length came from the LSC regions, which ranged from 80,411 bp to 82,327 bp (Table 2). The other regions, including the two IR regions and the SSC region, are relatively conserved in length within the rice tribe.
It has been shown that chloroplast genomes are conserved in gene content and gene order across the grass family [46]. In the final annotation, we predicted a total of 128 functional genes in the chloroplast genome of C. mutica, comprising 110 unique genes and 18 genes duplicated in the IR regions (Fig 1, S1 Table). Among the 110 unique genes, 76 were protein-coding genes and 34 were RNA genes, including 30 tRNA genes and four rRNA genes (S1 Table). The 18 duplicated genes in the IR regions comprised six protein-coding genes, eight tRNA genes, and four rRNA genes (S1 Table). Sixteen genes contained introns; 14 contained a single intron (eight protein-coding and six tRNA genes) and ycf3 contained two introns. The rps12 gene was found to be trans-spliced, with the 5′-end exon located in the LSC region and the two 3′-end exons duplicated in the IR region. The trnK-UUU gene had the largest intron (2,487 bp), with the matK gene located within this intron. The total length of the 76 protein-coding genes was 55,521 bp, and the GC content at the first, second, and third codon positions was 47.75%, 39.57%, and 31.04%, respectively (Table 1). The lower GC content at the third codon position in our dataset agrees with previous findings that third codon positions are AT-biased in the chloroplasts of land plants.
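How GC content is partitioned by codon position can be shown with a small Python sketch. It assumes an in-frame coding sequence whose length is a multiple of three and contains only A/C/G/T; the function name and the toy sequence are hypothetical, and the study's own values come from its annotated protein-coding genes rather than from this code.

```python
def gc_by_codon_position(cds):
    """Return GC percentages for codon positions 1, 2 and 3 of an in-frame CDS."""
    seq = cds.upper()
    if not seq or len(seq) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of three")
    percentages = []
    for offset in range(3):
        bases = seq[offset::3]  # every third base, starting at this codon position
        gc = sum(base in "GC" for base in bases)
        percentages.append(100.0 * gc / len(bases))
    return tuple(percentages)


# Toy CDS with three codons (ATG GCC TAA), invented for illustration.
gc1, gc2, gc3 = gc_by_codon_position("ATGGCCTAA")
```

For a genome-wide figure, the same counts would be pooled over all protein-coding genes before dividing, so that longer genes contribute proportionally.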

Simple sequence repeats (SSRs)
SSR markers have been widely used in plant genetics studies and will constitute an important genomic resource as NGS (next-generation sequencing) technologies develop [41]. In this study, we identified a total of 133 SSR loci in the whole chloroplast genome of C. mutica, including 115 mononucleotide, four dinucleotide, three trinucleotide, ten tetranucleotide, and one pentanucleotide SSRs (Table 3). The majority of the SSR loci were mononucleotides (86.47%), and of those, 91.30% were A/T motifs. These results agree with the observation that SSRs in chloroplast genomes are commonly composed of polyadenine (polyA) or polythymine (polyT) repeats [47]. In addition to SSR identification, we conducted a comparative analysis of chloroplast SSRs across the rice tribe (Table 3). The main source of variation came from mononucleotide SSRs: Zizaniinae chloroplasts possessed more than 110 mononucleotide SSRs eight or more nucleotides long, whereas the Oryzinae species sampled possessed fewer than 100 such SSRs. All other SSR motif classes were of the same length across the chloroplast genomes of all species examined.
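The SSR class counts above can be checked for internal consistency with simple arithmetic, sketched here in Python. The class counts and percentages are those reported in the text; the number of A/T mononucleotide loci is not stated directly, so the value computed below is only the count implied by the reported 91.30%.

```python
# SSR loci per motif class in C. mutica, as reported in the text.
ssr_counts = {"mono": 115, "di": 4, "tri": 3, "tetra": 10, "penta": 1}

total_ssrs = sum(ssr_counts.values())

# Share of mononucleotide SSRs among all loci, in percent.
mono_share = 100.0 * ssr_counts["mono"] / total_ssrs

# Derived, not reported: number of A/T mononucleotide loci implied by 91.30%.
implied_at_mono = round(0.9130 * ssr_counts["mono"])
```

The class counts sum to the reported 133 loci, and 115 of 133 gives the reported 86.47%.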

Dynamic variation of the junctions
The typical quadripartite structure of the chloroplast genome possesses four junctions (JLA, JLB, JSA, and JSB) between the two IRs (IRA and IRB) and the two single-copy regions (LSC and SSC) (Fig 2) [21,48]. Expansion or contraction of the two IR regions produces variation at the four junctions and provides a valuable signal for phylogenetic analysis [48]. Dynamic variation in the IR regions can also change the size of the chloroplast genome. For example, previous studies have shown that junction variation in Oryza exceeds that in Zizania [15]. Between C. mutica and C. aquatica, no junction length variation was found, and the same was true of the two Zizania species (Fig 2). Limited junction length variation within these groups indicates a conserved structure in the Zizaniinae subtribe. We also compared the dynamic variation of junctions between the Zizaniinae and Oryzinae subtribes (Fig 2). For JLA, located in the rps19-psbA intergenic region, the distance between rps19 and JLA varied from 41 bp to 49 bp and the distance between psbA and JLA varied from 81 bp to 83 bp in Oryzinae. In Zizaniinae, those distances were 41 bp to 44 bp and 81 bp to 82 bp, respectively. For JLB, positioned between rpl22 and rps19, the distance between rpl22 and JLB varied from 24 bp to 30 bp in Oryzinae, whereas in Zizaniinae the distance was consistently 24 bp. At both of these junctions, variation in Oryzinae was greater than in Zizaniinae. However, the variability in distances at JSA and JSB was greater than at JLA and JLB. For JSA, the ndhH gene spanned the junction in all species. The length by which ndhH overlapped the junction varied from 163 bp to 625 bp in Oryzinae, while in Zizaniinae the overlap was consistently 181 bp. For JSB, near the ndhF gene, the distance varied from 17 bp to 42 bp in Oryzinae but from 89 bp to 93 bp in Zizaniinae.
The junction comparisons indicate that structural variation is greater in the Oryzinae subtribe than in Zizaniinae. Furthermore, they indicate that JLA and JLB are less variable in length than JSA and JSB. Variation at JSB could therefore be used as a molecular marker to separate the two subtribes, given that the distance at JSB in Zizaniinae was more than twice that in Oryzinae.

Polymorphic variation
The two chloroplast genomes from Chikusichloa differed by only 40 bp in length, with C. mutica shorter than C. aquatica (Table 2). In addition to total length differences, we assessed SNP and Indel variation between the entire chloroplast genomes of C. mutica and C. aquatica (Fig 1 and Table 4). In total, only 83 SNPs and 25 Indels were detected in the genome comparison. Of the SNPs, 58, 8 (16 counting both IR copies) and 9 were from the LSC, IR and SSC regions, respectively; of the 25 Indels, 21, 1 (2) and 2 were within the LSC, IR and SSC regions. Most of the SNPs and Indels were found in non-coding regions (64 SNPs and 24 Indels), distributed as follows: 41, 8 (16) and 7 SNPs were from the LSC, IR and SSC regions, and 20, 1 (2) and 2 Indels were within the LSC, IR and SSC regions, respectively. Nineteen SNPs and one Indel were found in coding regions, with the single coding Indel located 21 bp into the rps18 gene. Thirteen of the coding SNPs were synonymous substitutions, and only six were non-synonymous (S2 Table). Those six non-synonymous substitutions occurred in six different genes: matK, rpoB, rpoC2, ndhJ, rpl16 and ndhD. The 83 SNPs comprised 41 transitions and 42 transversions, and the 25 Indels comprised 16 homopolymer repeats, 4 repeat-related Indels and 5 independent Indels. Eleven of the 16 homopolymer variations were A/T single-base repeats, which is consistent with previous findings [47].
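The distinction between transitions and transversions used above follows the standard definition: a transition exchanges two purines (A, G) or two pyrimidines (C, T), and a transversion exchanges a purine for a pyrimidine. A minimal Python sketch of that classification is given below; the function name is hypothetical, and the counts are those reported in the text.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def classify_substitution(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or {ref, alt} - (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    # Same chemical class on both sides means a transition.
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


# Counts reported for the C. mutica / C. aquatica comparison.
reported = {"transition": 41, "transversion": 42}
ts_tv_ratio = reported["transition"] / reported["transversion"]
```

With the reported counts, the transition/transversion ratio is close to 1.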

Phylogeny
The chloroplast genome has been widely used as an important source of molecular markers in plant systematics [49,50]. With the development of high-throughput sequencing, the whole chloroplast genome has recently been used in phylogenetic studies, an approach known as chloroplast phylogenomics [14,19,27]. The conserved structure of chloroplast genomes among grass species has also been reported in other lineages [14,37] (S2 Fig). In this study, we used the whole chloroplast genome alignment and three different methods to resolve the phylogenetic relationships among 16 species from the Ehrhartoideae subfamily, with one bamboo species as an outgroup (Fig 3). Two clades corresponding to the subtribes Oryzinae and Zizaniinae were resolved with high support (100 for ML and MP and 1.0 for BI). Within each clade, the relationships among species matched the topology of previous studies, which used partial chloroplast and/or nuclear genes [6,34]. In subtribe Zizaniinae, the two Chikusichloa species, C. mutica and C. aquatica, clustered closely together as sister species with equal branch lengths. The two species in Zizania were resolved on branches of different lengths. The differing branch lengths in the Oryzinae suggest heterogeneous evolutionary histories between these clades with regard to chloroplast evolution.

Discussion
In this study, we completely sequenced the chloroplast genome of Chikusichloa mutica using the traditional Sanger sequencing method. As part of the rice germplasm, the complete chloroplast genome provides a valuable genetic resource for breeding and molecular analysis. Furthermore, the set of conserved primers used in this study could be widely employed in all rice tribe species, as well as in Poaceae in general [14,35]. The chloroplast genome of C. mutica is highly conserved in structure compared with other published grass chloroplast genomes, with the same gene content and gene number [14,15,16,51]. In comparison with the other sequenced species in Chikusichloa, C. mutica showed very limited variation (Fig 1) across the whole chloroplast genome.

Sequencing and assembly strategy
Since the first two complete chloroplast genomes were reported from liverwort [52] and tobacco [53] in 1986, knowledge of the organization and evolution of chloroplast genomes has increased rapidly. Currently, more than 1,000 fully sequenced chloroplast genomes have been deposited in public databases, brought about by recent developments in NGS technologies [23] as well as innovations in bioinformatics algorithms for assembly [54]. However, the sequencing quality of traditional Sanger sequencing remains higher than that of NGS technologies, although the Sanger method of genome sequencing and assembly is more laborious and costly [22]. With the development of NGS and corresponding assembly methods, dozens or hundreds of chloroplast genomes can be completed in less time [55,56]. However, the assembly quality of those genomes should be carefully scrutinized [22]. For example, using the Sanger method, Wu et al. [22] sequenced one wild rice chloroplast genome and compared it with another published genome generated by an NGS short-read method. They found that the assembled chloroplast genomes differed in both coding and noncoding regions. Although NGS methods can produce high coverage for the assembled genome, some problems remain unresolved. For example, short-read NGS data are difficult to assemble across repeat regions of the genome [57]. Longer reads do not fully solve this problem, because they appear to carry more sequencing errors [58]. The traditional Sanger sequencing method is therefore still one of the most effective ways to complete high-quality genomes, in spite of its higher cost and time investment compared with NGS methods. By employing the Sanger method to complete a high-quality chloroplast genome for one wild rice relative, C. mutica, this study provides many valuable informative markers for future studies.
However, with newer generations of sequencing technology, these high error rates could be greatly reduced, which will change the way genomes are sequenced. Third-generation sequencing technologies have been widely used in many species [59,60]. For example, Pacific Biosciences' Single Molecule Real-Time (SMRT) long-read sequencing can generate reads with an average size of ~20 kb, but the error rate of raw reads can be up to 15% [61]. If SMRT reads are combined with short sequencing reads such as those from Illumina, or are self-corrected given sufficient sequencing data, the accuracy of the assembled genome can be improved to over 99.99%.

Conserved chloroplast genome features in the grass family
The typical and stable quadripartite structure of chloroplast genomes, in which a pair of IRs separates the LSC and SSC regions, has been reported in thousands of species [21,26]. This conserved structure has been reported in all published chloroplast genomes of the grass family [14,34,37]. First, with regard to genome size, the length of the whole chloroplast genome varies from 132 kb to 141 kb across Poaceae [14,37]. In comparison, the SSC region is more stable in length than the LSC and IR regions, with a length of approximately 12. (Table 1). Secondly, the four junctions of the chloroplast genome [48] were consistently located in the same gene regions (Fig 2). Dynamic placement of junctions reflects variation in the IR regions [21], and as such, the junction positions could be used in phylogenetic analyses [48]. For example, in Chikusichloa, the distances at all four junctions were the same, but they differed in other species (Fig 2). Thirdly, the gene content of all published chloroplast genomes in the grass family is the same as that of C. mutica (S1 Table). A total of 78 unique protein-coding genes, 30 tRNA genes and 4 rRNA genes were annotated among all grass species [14,37]. All monocots have lost the infA, accD, ycf1 and ycf2 genes from their most recent common ancestors with dicots [62]. Although the overall features of the chloroplast genome are highly conserved in the grass family, numerous microstructural variations (such as small insertions and deletions and SSR variation) have been found and constitute a valuable resource for phylogenetic and population analyses [22,63]. The high-quality chloroplast genome of C. mutica reported here will be a valuable asset for discovering chloroplast variation in other Poaceae species.

Limited variation within the Chikusichloa genus
Polymorphic markers in chloroplast genomes have provided an abundance of informative loci between species for plant systematics and barcoding research [49,64]. In this study, we comprehensively compared the polymorphisms, including SNPs and Indels, between the two fully sequenced chloroplast genomes of C. mutica (KU696970) and C. aquatica (KR078265). We found extremely limited variation, with only 83 SNPs and 25 Indels in the 136,640-bp alignment matrix between the two species. Most of the polymorphisms in coding genes are synonymous; only six SNPs from six genes were identified as non-synonymous. This suggests that these polymorphisms are rarely adaptive. In contrast to Chikusichloa, 744 SNPs and 137 Indels were reported between Z. latifolia and Z. aquatica in Zizania [15]. Several reasons might explain the difference between the two genera. First, if Zizania had diverged earlier than Chikusichloa, more variation could have accumulated. However, the divergence times within the two genera were nearly equal, at approximately 4 MYA [34]. Thus, divergence time does not explain the difference in polymorphism between the genera. Second, the distribution of species might drive the difference: all three species in the genus Chikusichloa are located in Southeast Asia, whereas Zizania has a broad geographic distribution, with Z. latifolia and Z. aquatica distributed separately in Asia and North America [8]. The geographic pattern of these species, which indicates a broad radiation and/or a long-distance dispersal event, might explain the difference in polymorphism. Lineage-specific variation in each chloroplast genome may reflect this long-distance separation [25,65]. This can be seen in the phylogenetic relationships (Fig 3): the branches of the two Chikusichloa species are of equal length, while the branches between the two Zizania species are longer.
Several other factors could also cause such differences, such as the efficiency of the organelle's DNA polymerase, differences in the molecular evolutionary rate, and demographic history. Additional work is needed to clarify the causes of the different rates of polymorphism found in Zizaniinae.

Conclusion
Using traditional high-quality Sanger sequencing technology, we presented the complete chloroplast genome of Chikusichloa mutica, performed comparative analyses with related species of the rice tribe, and deposited the genome in GenBank under accession number KU696970. The gene content, gene number and genome organization of C. mutica were identical to those of other chloroplast genomes from Poaceae. In the whole genome comparison, limited variation was found between the two Chikusichloa species, with only 83 SNPs and 25 Indels between them. Phylogenetic analysis using whole genome sequences from 17 grass species demonstrated the close relationship of the two Chikusichloa species and confirmed their phylogenetic position in relation to other rice tribe species. The full chloroplast genome data of C. mutica will facilitate the biological study of this important wild rice relative. Furthermore, the chloroplast genome sequence is a valuable genetic resource that can be used to conduct population studies for this species and to help shed light on its genetic mechanisms and evolutionary history.