Complete chloroplast genome of Castanopsis sclerophylla (Lindl.) Schott: Genome structure and comparative and phylogenetic analysis

Castanopsis sclerophylla (Lindl.) Schott is an important species of evergreen broad-leaved tree in subtropical areas and has high ecological and economic value. However, there are few studies on its chloroplast genome. In this study, the complete chloroplast genome sequence of C. sclerophylla was determined using the Illumina HiSeq 2500 platform. The complete chloroplast genome of C. sclerophylla is 160,497 bp long, including a pair of inverted repeat (IR) regions (25,675 bp) separated by a large single-copy (LSC) region of 90,255 bp and a small single-copy (SSC) region of 18,892 bp. The overall GC content of the chloroplast genome is 36.82%. A total of 131 genes were found; of these, 110 genes are unique and annotated, including 79 protein-coding genes, 27 transfer RNA genes (tRNAs), and four ribosomal RNA genes (rRNAs). Twenty-one genes were found to be duplicated in the IR regions. Comparative analysis indicated that IR contraction might be the reason for the smaller chloroplast genome of C. sclerophylla compared to three congeneric species. Sequence analysis indicated that the LSC and SSC regions are more divergent than IR regions within Castanopsis; furthermore, greater divergence was found in noncoding regions than in coding regions. The maximum likelihood phylogenetic analysis showed that four species of the genus Castanopsis form a monophyletic clade and that C. sclerophylla is closely related to Castanopsis hainanensis with strong bootstrap values. These results not only provide a basic understanding of Castanopsis chloroplast genomes, but also illuminate Castanopsis species evolution within the Fagaceae family. Furthermore, these findings will be valuable for future studies of genetic diversity and enhance our understanding of the phylogenetic evolution of Castanopsis.


Introduction
Castanopsis sclerophylla (Lindl.) Schott is a monoecious, broad-leaved tree of the genus Castanopsis belonging to the Fagaceae family. The genus contains approximately 120 known species, of which 58 are native and 30 are endemic to China. However, C. sclerophylla is widely distributed in East and South Asia, and the tree has been introduced to North America [1,2].

Chloroplast genome assembly and annotation
The high-throughput raw reads were trimmed by FastQC. Next, the trimmed paired-end reads and references (C. hainanensis, C. echinocarpa, and C. concinna) were used to extract chloroplast-like reads, which were assembled by NOVOPlasty [21]. NOVOPlasty assembled partial reads and extended them as far as possible until a circular genome was formed. A high-quality complete chloroplast genome was ultimately obtained. The assembled genome was annotated using CpGAVAS [22]. BLAST and Dual Organellar Genome Annotator (DOGMA) were applied to check the annotation results [23]. tRNAs were identified by tRNAscan-SE [24]. Circular gene maps of C. sclerophylla were drawn with the OGDRAW v1.2 program [25]. To analyze variation in synonymous codon usage, MEGA7 was used to compute relative synonymous codon usage (RSCU) values, codon usage, and GC content [26]. RSCU represents the ratio of the observed frequency of a codon to the expected frequency and is a good indicator of codon usage bias [27]. An RSCU value below 1 indicates that a codon is used less frequently than expected, whereas a value above 1 indicates that it is used more frequently than expected [28].
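The RSCU definition can be illustrated with a short Python sketch. This is not how MEGA7 computes the values internally; the function name rscu and the toy input sequence are hypothetical, and the standard genetic code is assumed. For each amino acid, the expected count of a codon is the total count of its synonymous family divided by the number of codons in that family.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built in TCAG order (assumption: standard table).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def rscu(cds_sequences):
    """Return {codon: RSCU} for sense codons of amino acids that occur."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            # Skip stop codons and codons with ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        family_total = sum(counts[c] for c in codons)
        if family_total == 0:
            continue  # amino acid absent from the input
        expected = family_total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy coding sequence (hypothetical): ATG TTA TTA TTG TGG
toy = rscu(["ATGTTATTATTGTGG"])
print(toy["TTA"], toy["TTG"], toy["ATG"])  # 4.0 2.0 1.0
```

In the toy input, leucine has six synonymous codons and three occurrences, so each leucine codon is expected 0.5 times; TTA (observed twice) therefore has RSCU = 4, while the single-codon amino acids methionine and tryptophan have RSCU = 1.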

Comparative analysis and phylogenetic analysis
MUMmer [29] was employed for pairwise sequence alignment of the chloroplast genomes. Sequence divergence was computed as the pairwise distance between each pair of species from protein-coding sequences using MEGA 5.0 with the Kimura 2-parameter model [30]. The mVISTA [31] program was used to compare the complete chloroplast genome of C. sclerophylla to three other published chloroplast genomes of the genus Castanopsis, i.e., Castanopsis concinna voucher Strijk_1489 (KT793041.1), C. echinocarpa (KJ001129.1), and C. hainanensis (MG383644.1), in Shuffle-LAGAN mode, adopting the annotation of C. concinna as a reference.
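As an illustration of the Kimura 2-parameter distance, the following Python sketch applies the textbook formula d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitional and transversional differences between two aligned sequences. It is a simplified sketch rather than the MEGA implementation; the function name k2p_distance and the example sequences are hypothetical, and sites containing gaps or ambiguous bases are simply skipped.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases
        compared += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine

    if compared == 0:
        raise ValueError("no comparable sites")

    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too divergent
    # for the model (argument not positive).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition (A<->G) in ten aligned sites (hypothetical data).
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))  # 0.1116
```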
In total, 20 chloroplast genomes belonging to Fagaceae were analyzed in this study, including the newly generated chloroplast genome of C. sclerophylla and all of the published chloroplast genomes (data present in NCBI GenBank on 31.12.2018). The sequences were aligned [32], and the multiple sequence alignment was then visualized and manually adjusted in BioEdit [33]. The maximum likelihood (ML) analysis was conducted using RAxML web servers [34]. For ML analyses, the general time reversible (GTR) + G model was used with 1,000 bootstrap replicates and the default hill-climbing tree search algorithm [30,35,36].
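The idea behind a bootstrap replicate can be sketched in Python as follows. This is only an illustration of nonparametric column resampling, not the procedure RAxML runs internally; the function name bootstrap_alignment and the toy alignment are hypothetical. Each replicate draws alignment columns with replacement until it has the same length as the original alignment, and a tree would then be inferred from each replicate.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence string.
    rng: a random.Random instance, passed in for reproducibility.
    """
    names = list(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")

    # The same sampled column indices are used for every sequence so that
    # homologous sites stay together.
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(alignment[name][c] for c in columns)
        for name in names
    }


# Toy alignment (hypothetical data), three replicates.
toy = {"taxon_1": "ACGTAC", "taxon_2": "ACGAAC", "taxon_3": "ATGAAC"}
rng = random.Random(42)
replicates = [bootstrap_alignment(toy, rng) for _ in range(3)]
print(len(replicates), len(replicates[0]["taxon_1"]))  # 3 6
```

Bootstrap support for a clade is then the fraction of replicate trees in which that clade appears.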

Characteristics of C. sclerophylla cpDNA
A total of 65 million paired-end reads were obtained, yielding 10.44 Gb of high-quality clean data with a mean Q30 higher than 88.28% after removal of low-quality reads and adapter sequences. The remaining high-quality reads were utilized in the further assembly. The complete chloroplast genome sequence of C. sclerophylla is 160,497 bp in length; it has been deposited in GenBank under accession number MK387847. The genome has a typical quadripartite structure including a pair of IR (IRa and IRb) regions of 25,675 bp that are separated by an LSC region of 90,255 bp and an SSC region of 18,892 bp (Fig 1, Table 1). The overall GC content of the chloroplast genome is 36.82%, which is similar to that of other Fagaceae species [37][38][39]. However, a few differences in GC content were found among the chloroplast genomes. The GC contents of the LSC, SSC, and IR regions are 34.65%, 30.94%, and 42.78%, respectively (Table 2). The GC content is highest in IR regions (42.78%), likely due to the presence of four duplicated ribosomal RNA genes in this region, a pattern also found in the chloroplast genome of C. hainanensis [38]. The overall GC content is an important species indicator [40].
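The reported figures can be cross-checked with a few lines of Python. The sketch below uses only the region lengths and regional GC contents stated above; the variable names are illustrative. The four regions sum to the full genome length, and the length-weighted mean of the regional GC contents comes out close to the reported overall value (the small difference is expected because the regional percentages are rounded).

```python
# Region lengths (bp) and GC contents (%) as reported in the text.
regions = {
    "LSC": (90_255, 34.65),
    "SSC": (18_892, 30.94),
    "IRa": (25_675, 42.78),
    "IRb": (25_675, 42.78),
}

# Quadripartite structure: LSC + SSC + two IR copies = whole genome.
total_length = sum(length for length, _ in regions.values())

# Length-weighted average of the regional GC contents.
weighted_gc = sum(length * gc for length, gc in regions.values()) / total_length

print(total_length)           # 160497
print(round(weighted_gc, 2))  # 36.81
```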
A total of 131 genes were found in the C. sclerophylla chloroplast genome, including 86 protein-coding genes, 37 tRNA genes, and 8 rRNA genes (Fig 1, Table 1). Of these 131 genes, 110 genes are unique and annotated and divided into three categories: 79 protein-coding genes, 27 tRNA genes, and four rRNA genes (Table 3). In addition, 21 functional genes (seven protein-coding genes, four rRNA genes, and 10 tRNA genes) are duplicated in the IR regions (Fig 1). The LSC region comprises 62 protein-coding genes and 22 tRNA genes, whereas the SSC region comprises 11 protein-coding genes and one tRNA gene (S1 Table). There are 14 intron-containing genes, including eight protein-coding genes and six tRNA genes. Twelve genes contain one intron, and clpP and ycf3 have two introns. trnK-UUU contains the longest intron (2,511 bp), and trnL-UAA the shortest (485 bp) (Table 4). A similar phenomenon is also present in Quercus acutissima [41]. ycf3 gene expression results in stable accumulation of photosystem I complexes [42]. Therefore, we herein focus on the ycf3 intron gain in C. sclerophylla, which may be helpful for further study of the photosynthesis mechanism.

Codon usage analysis
Relative synonymous codon usage (RSCU) values were computed for the C. sclerophylla chloroplast genome using protein-coding sequences (S2 Table), as codon usage plays a vital role in shaping chloroplast genome evolution [43]. In total, 23,131 codons are present. Leucine (10.61%) is the most commonly encoded amino acid, with 2,454 codons, followed by isoleucine (8.85%) with 2,048 codons; cysteine (1.13%) is the least commonly encoded amino acid, with 262 codons (Fig 2). Similar ratios for amino acids were previously reported for chloroplast genomes [44,45]. Moreover, methionine and tryptophan are each encoded by only one codon, indicating no codon bias for these two amino acids (RSCU = 1). Nearly all of the codons ending with A and U had RSCU values greater than one (RSCU > 1), whereas the codons ending with C and G had RSCU values of less than one. The AU contents for the first, second, and third codon positions were calculated to be 54.07%, 56.29% and 70.20%, respectively. The high AU content at the third codon position is similar to that reported for other plants [46].
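The position-specific AU content can be computed as in the following Python sketch. It is a minimal illustration under stated assumptions, not the tool used in this study: the function name au_content_by_position and the toy sequence are hypothetical, the input is assumed to be in-frame coding sequences, and ambiguous bases are ignored.

```python
def au_content_by_position(cds_sequences):
    """Return AU content (%) at codon positions 1, 2 and 3."""
    au_counts = [0, 0, 0]
    totals = [0, 0, 0]
    for seq in cds_sequences:
        seq = seq.upper().replace("T", "U")  # treat DNA input as RNA
        usable = len(seq) - len(seq) % 3     # ignore a trailing partial codon
        for i in range(usable):
            base = seq[i]
            if base not in "ACGU":
                continue  # skip ambiguous bases
            position = i % 3  # 0, 1, 2 -> first, second, third position
            totals[position] += 1
            if base in "AU":
                au_counts[position] += 1
    return [
        100.0 * au / total if total else 0.0
        for au, total in zip(au_counts, totals)
    ]


# Toy in-frame sequence (hypothetical): AUA GCU
print(au_content_by_position(["ATAGCT"]))  # [50.0, 50.0, 100.0]
```

The amino acid percentages quoted above follow directly from the codon counts; for example, 2,454 leucine codons out of 23,131 gives 10.61%.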
The expansion and contraction of IR regions at the borders are the major reason for chloroplast genome size variation and play vital roles in evolution [47][48][49]. A detailed comparison of four junctions (JLB, JSB, JSA, and JLA) between the two single-copy regions (LSC and SSC) and the two IRs (IRa and IRb) was performed for C. sclerophylla, C. hainanensis, C. echinocarpa and C. concinna by analyzing exact IR border positions and adjacent genes (Fig 4). Overall, IR regions are relatively conserved in the genus Castanopsis, and this result agrees with reports for the genus Quercus [41]. The rps19 gene is located at the junction of the LSC and IRb regions in C. concinna. However, in the C. sclerophylla, C. hainanensis, and C. echinocarpa chloroplast genomes, the rps19 gene is located in the LSC region and is 11 bp, 11 bp, and

Phylogenetic analysis
Phylogenetic analysis was performed by ML based on the 22 aligned chloroplast genome sequences (Fig 5). C. fargesii and E. umbra were used as outgroups. The ML-based phylogenetic analysis showed that these four species of the genus Castanopsis form a monophyletic clade and that C. sclerophylla is closely related to C. hainanensis with strong bootstrap values. The ML tree indicated that Castanopsis is closely related to Castanea. Surprisingly, Quercus species do not form a clade, and Quercus is not divided into two clusters containing either evergreen or deciduous tree species. The phylogenetic status of these genera is consistent with previous reports [41,52,53]. The relatively high variation in Quercus may be related to its wide distribution, which requires local adaptation to different environments. Notably, F. engleriana is the first to diverge in Fagaceae, which indicates relatively high genetic divergence between F. engleriana and the other species, followed by T. doichangensis, indicating that both are early diverging taxa in Fagaceae [54]. Moreover, the same topology for the genus Fagus was confirmed by research based on nuclear markers [55].
Little is known to date about the chloroplast genome of Castanopsis, and only three chloroplast genome sequences of Castanopsis species can be found in GenBank, which has greatly hampered the study of the phylogenetic relationships of this genus. Therefore, more research on the complete chloroplast genomes of Castanopsis species needs to be conducted in the future.

Conclusions
C. sclerophylla is an important evergreen broad-leaved species in the Castanopsis genus of the Fagaceae family. In this study, the complete chloroplast genome sequence of C. sclerophylla was determined using the Illumina HiSeq 2500 platform. The C. sclerophylla chloroplast genome exhibits a typical quadripartite and circular structure similar to those of the chloroplast genomes of three congeneric species. Compared to the chloroplast genomes of the three other Castanopsis species, that of C. sclerophylla is the smallest (160,497 bp). In the ML phylogenetic tree, the phylogenetic relationships among 22 angiosperms strongly support the known classification of C. sclerophylla, and ML analysis showed that these four Castanopsis species form a monophyletic clade and that C. sclerophylla is closely related to C. hainanensis with strong bootstrap values. In addition, Castanopsis is closely related to Castanea. The genus Castanopsis contains approximately 120 known species, nearly half of which are native to China. Indeed, China has a large amount of Castanopsis germplasm resources, and the availability of chloroplast genomes provides a powerful genetic resource for phylogenetic analysis and biological study. Therefore, further research on the complete chloroplast genomes of the genus Castanopsis is necessary in the future. The data will contribute to the development of genetic resources and the identification of evolutionary relationships and also facilitate the exploration, utilization and application of conservation genetics for the genus.
Supporting information S1