Complete Plastid Genome Sequencing of Trochodendraceae Reveals a Significant Expansion of the Inverted Repeat and Suggests a Paleogene Divergence between the Two Extant Species

The early-diverging eudicot order Trochodendrales contains only two monospecific genera, Tetracentron and Trochodendron. Although an extensive fossil record indicates that the clade is perhaps 100 million years old and was widespread throughout the Northern Hemisphere during the Paleogene and Neogene, the two extant genera are both narrowly distributed in eastern Asia. Recent phylogenetic analyses strongly support a clade of Trochodendrales, Buxales, and Gunneridae (core eudicots), but complete plastome analyses do not resolve the relationships among these groups with strong support. However, plastid phylogenomic analyses have not included data for Tetracentron. To better resolve basal eudicot relationships and to clarify when the two extant genera of Trochodendrales diverged, we sequenced the complete plastid genome of Tetracentron sinense using Illumina technology. The Tetracentron and Trochodendron plastomes possess the typical gene content and arrangement that characterize most angiosperm plastid genomes, but both genomes have the same unusual ∼4 kb expansion of the inverted repeat region to include five genes (rpl22, rps3, rpl16, rpl14, and rps8) that are normally found in the large single-copy region. Maximum likelihood analyses of an 83-gene, 88-taxon angiosperm data set yield a tree topology identical to that of previous plastid-based trees, and moderately support the sister relationship between Buxaceae and Gunneridae. Molecular dating analyses suggest that Tetracentron and Trochodendron diverged between 44 and 30 million years ago, which is congruent with the fossil record of Trochodendrales and with previous estimates of the divergence time of these two taxa. We also characterize 154 simple sequence repeat loci from the Tetracentron sinense and Trochodendron aralioides plastomes that will be useful in future studies of population genetic structure for these relict species, both of which are of conservation concern.


Introduction
The eudicot order Trochodendrales [1] contains only two extant genera, both of which are monotypic: Trochodendron Sieb. & Zucc. and Tetracentron Oliver. Historically, these two genera have been treated either as the separate families Trochodendraceae and Tetracentraceae, or as the combined family Trochodendraceae [1][2][3][4][5][6][7]. The Trochodendraceae sensu APG III [1] appear to have been widespread in the Northern Hemisphere during the Paleogene and Neogene [7][8][9][10][11][12][13][14][15]. However, the two extant species of the family have small geographic ranges and are restricted to eastern Asia [16]. Trochodendron aralioides Sieb. & Zucc. is a large, evergreen shrub or small tree native to the mountains of Japan, South Korea, Taiwan, and the Ryukyu Islands [2,17], whereas Tetracentron sinense Oliver is a deciduous tree occurring in southwestern and central China and the eastern Himalayan regions. Both species are characterized by apetalous flowers arranged in cymose inflorescences and by loculicidal capsules that dehisce to release winged seeds [2,5,7,18]. Although earlier researchers reported that the wood of Trochodendrales lacked vessels and thus suggested that Trochodendrales were among the earliest-diverging angiosperms, recent research has documented the presence of vessels in the wood of both genera [2,7,19].
Complete plastid genome sequences have been used increasingly over the past decade to resolve deep-level phylogenetic relationships that have been unclear based on only a few genes. For example, recent plastid phylogenomic studies have helped to resolve key relationships among the earliest-diverging Mesangiospermae [33] as well as early-diverging Eudicotyledoneae and Pentapetalae [26,34]. Indeed, the plastid genome represents an excellent source of characters for plant phylogenetics due to the generally strong conservation of plastid genome structure and its mix of sequence regions that vary tremendously in evolutionary rate [35][36][37], which enable plastid genome sequence data to be applied to phylogenetic problems at almost any taxonomic level in plants [26][38][39][40][41][42][43]. It is now relatively inexpensive to generate complete plastid genome sequences due to rapid improvements in next-generation sequencing (NGS) technologies [25][44][45] and due to the relatively small size of the plastid genome (~150 kb) and its structural conservation, which enable dozens of plastomes to be multiplexed per sequencing lane and facilitate relatively straightforward genome assembly [45][46][47][48]. Despite the promise of NGS technology for plastid genomics, the complete plastomes of only eight genera of early-diverging eudicots have been reported: Ranunculus (Ranunculaceae, Ranunculales), Megaleranthis (Ranunculaceae, Ranunculales), Nandina (Berberidaceae, Ranunculales), Nelumbo (Nelumbonaceae, Proteales), Platanus (Platanaceae, Proteales), Meliosma (Sabiaceae, Sabiales), Trochodendron (Trochodendraceae, Trochodendrales) and Buxus (Buxaceae, Buxales).
Previous phylogenetic analyses based on some of these complete genomes have not fully resolved the relationships among early-diverging eudicots; in addition to the uncertainty surrounding relationships of Buxales, Trochodendrales, and Gunneridae, the positions of Sabiales and Proteales remain poorly supported [26][27]. Plastome taxon sampling is still sparse in these clades, however, and additional sampling may help elucidate these recalcitrant relationships.
In addition to their important role in phylogenetics, plastid genomes may be rich sources of population-level data. The nonrecombination and uniparental inheritance of most plastid genomes can make plastid genomes extremely useful for population genetics, particularly for tracing maternal lineages [49][50]. For example, chloroplast simple sequence repeats (cpSSR) have been widely used in plant population genetics [51], including within early-diverging eudicots, where numerous cpSSR loci have been reported from the plastid genome of the endangered species Megaleranthis saniculifolia (Ranunculaceae) [52].
Here we report the complete plastid genome sequences of Tetracentron sinense and Trochodendron aralioides (the protein-coding and rRNA genes of the Trochodendron cp genome were used for phylogenetic analyses in Moore et al. [26], but the cp genome structure of this genus has never been reported), as well as the results of new phylogenetic analyses based on adding the Tetracentron and Megaleranthis genomes [52] to the 83-gene data set of Moore et al. [26]. We also compare the plastid genome structure of Trochodendron and Tetracentron, including the characterization of a significant expansion of the inverted repeat in both taxa, and we estimate the divergence time between the two genera. Finally, we characterize the distribution and location of cpSSRs in both Tetracentron sinense and Trochodendron aralioides, which provide further opportunity to study the population genetic structure of these two ancient relict species.

Sequencing and Genome Assembly
Illumina paired-end sequencing produced 892.11 Mb of data for Tetracentron sinense. We obtained 9,912,310 raw reads of 90 bp in length. The N50 of contigs was 13,981 bp and the summed length of contigs was 143,709 bp. The mean coverage of this genome was 5,424.26×. After de novo and reference-guided assembly, we obtained a cp genome containing nine gaps, which were filled by PCR and Sanger sequencing. The four junction regions between the IRs and the SSC/LSC were first determined based on de novo contigs and subsequently confirmed by PCR amplification and Sanger sequencing. The Sanger-sequenced products were compared directly with the assembled genome, and no mismatches or indels were observed, which validated the accuracy of our assembly. The genome sequences of Tetracentron sinense and Trochodendron aralioides have been submitted to GenBank (GenBank IDs: KC608752 and KC608753).
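The read count, read length, and coverage reported above can be cross-checked with simple arithmetic. The following is a minimal sketch, not the authors' pipeline: it assumes that mean coverage is total sequenced bases divided by the final Tetracentron genome length (164,467 bp), which the text does not state explicitly, and the function names are illustrative. Under that assumption the coverage comes out close to the reported value. The `n50` helper shows the usual definition of contig N50 on a toy list of lengths, since the individual contig lengths are not given.

```python
def mean_coverage(n_reads, read_len_bp, genome_len_bp):
    """Average depth: total sequenced bases divided by genome length."""
    return n_reads * read_len_bp / genome_len_bp


def n50(contig_lengths):
    """Length of the contig at which the running sum of lengths,
    taken from longest to shortest, first reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Values taken from the text; the genome length used as the
# denominator is an assumption (final Tetracentron assembly).
total_bases = 9_912_310 * 90
coverage = mean_coverage(9_912_310, 90, 164_467)

print(f"total bases: {total_bases / 1e6:.2f} Mb")
print(f"mean coverage: {coverage:.2f}x")
print("toy N50:", n50([2, 3, 4, 5, 6]))
```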

General Features of the Tetracentron and Trochodendron Plastomes
The plastid genome size of Tetracentron sinense is 164,467 base pairs (bp) (Figure 1), and that of Trochodendron aralioides is 165,945 bp (Figure 2). Both genomes show typical quadripartite structure, consisting of two copies of an inverted repeat (IR) separated by the large single-copy (LSC) and small single-copy (SSC) regions (Table 1). The IR exhibits a significant expansion relative to most other angiosperms at the LSC/IR junction; specifically, the IR in both Tetracentron and Trochodendron has expanded to include the entirety of the rps19, rpl22, rps3, rpl16, rpl14, and rps8 genes (Figures 1, 2). The SSC/IR boundary occurs within the ycf1 gene, as is typical in angiosperms, but is slightly expanded in the Trochodendron genome to include 1461 bp of the 5′ end of ycf1 (versus 1083 bp in Tetracentron; Figure 3). This expansion of the IR at the SSC junction contributes to the difference in length between the two Trochodendrales plastomes; the remainder of the difference is largely the result of length differences among various noncoding regions (Table 2). Both genomes contain 119 genes (79 protein-coding genes, 30 tRNA genes, and 4 rRNA genes) arranged in the same order, of which 24 are duplicated in the IR regions (Table 3). Sequence divergence between Tetracentron and Trochodendron in coding regions is low (Table 4, Figures 4, 5). Only 7 genes (rps11, rpoA, rpl32, rps16, ndhF, ycf1, and rpl36) exhibit divergences of more than 2%, and 12 genes have an identical sequence (Table 4, Figure 4). The genes ndhF, ycf1, and rpl36 have the highest sequence divergences (2.7%, 3.5% and 4.4%, respectively). The coding regions account for 57.5% and 57.3% of the Tetracentron and Trochodendron plastid genomes, respectively. For both cp genomes, single introns are present in 18 genes, whereas three genes (rps12, clpP, and ycf3) have two introns (Table 5).
The overall genomic G/C nucleotide composition is 38.1% and 38.0% for Tetracentron and Trochodendron, respectively; detailed A/T contents of different regions of the plastome for both genomes are listed in Table 6. Due to the lower A/T content of the four rRNA genes, the IR regions possess lower A/T content than the single-copy regions.
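Base composition figures of this kind are straightforward to recompute from a sequence. The sketch below is illustrative only: the helper names and the toy region sequences are hypothetical, and it assumes that ambiguous bases and gaps are excluded from the denominator, which the text does not specify.

```python
def gc_percent(seq):
    """Percent G+C among unambiguous bases (A, C, G, T) in seq."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def at_percent(seq):
    """Percent A+T, the complement of gc_percent."""
    return 100.0 - gc_percent(seq)


# Toy placeholder sequences, NOT real plastome regions.
regions = {
    "LSC": "ATTTAAGCATTA",
    "IR": "GGCATCGCGTAA",
}
for name, region_seq in regions.items():
    print(name, f"A/T = {at_percent(region_seq):.1f}%")
```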

Characterization of SSR Loci
In all, 154 SSR loci (77 each from Tetracentron sinense and Trochodendron aralioides) were detected in the two plastid genomes, of which 123 are mononucleotide repeats, 28 are dinucleotide repeats, two are trinucleotide repeats, and one is a tetranucleotide repeat (Table 7). Nearly all of the SSR loci are composed of A/T repeats (Table 7), and these SSR loci are mostly present in noncoding regions. The tetranucleotide locus identified in Tetracentron is in the first intron of ycf3. The two trinucleotide loci in Trochodendron are both located in the spacer region between trnK-UUU and rps16. The unique C mononucleotide repeat from Trochodendron is present in the trnV-ndhC intergenic spacer region.
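To make the notion of mono-, di-, tri-, and tetranucleotide loci concrete, the following is a minimal regular-expression sketch of SSR detection. It is not Msatfinder and does not reproduce its algorithm; the minimum repeat counts in `MIN_REPEATS` are hypothetical placeholders, because the thresholds used in this study are not given in the text. Motifs that are themselves repeats of a shorter unit (for example "ATAT") are skipped so that each locus is reported once under its shortest unit.

```python
import re

# Hypothetical thresholds: unit length -> minimum number of repeats.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3}


def is_primitive(motif):
    """True if motif is not a repetition of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq, min_repeats=None):
    """Return sorted (start, motif, n_repeats) tuples, 0-based starts."""
    thresholds = MIN_REPEATS if min_repeats is None else min_repeats
    seq = seq.upper()
    hits = []
    for unit_len, min_n in sorted(thresholds.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_n - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                n_repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, n_repeats))
                pos = match.end()
            else:
                # Non-primitive motif: step one base and keep scanning.
                pos = match.start() + 1
    return sorted(hits)


# Toy sequence with one mono-, one di-, and one trinucleotide locus.
toy = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GCG" + "TTC" * 4 + "G"
for start, motif, n in find_ssrs(toy):
    print(start, motif, n)
```

Real analyses would also record whether each locus falls in a coding or noncoding region, which requires the genome annotation.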

Phylogenetic and Molecular Dating Analyses
ML analyses of the 83-gene, 88-taxon data set yielded a tree (Figure 6) with a topology and bootstrap support (BS) values similar to those of the plastid phylogenomic study of Moore et al. [26]. The clades of Trochodendron+Tetracentron and Ranunculus+Megaleranthis were supported with 100% ML BS support. Trochodendrales are sister to the clade of Buxaceae and Gunneridae with high support (BS = 100%), but Buxaceae are sister to Gunneridae with only 67% BS support.

Expansion of the IR Region in Trochodendrales Plastomes
The plastid genomes of Tetracentron and Trochodendron exhibit the typical gene content and genome structure of angiosperms [37][53][54], with the notable exception of a significantly expanded IR region (Figures 1, 2, 3). This ~4 kb expansion is responsible for the relatively large size of both Trochodendrales plastomes, which are ~4-5 kb larger than the typical upper size range of angiosperm plastid genomes, including those of nearly all other early-diverging eudicots (Table 8). Significant expansion, contraction, and even loss of the IR appear to be evolutionarily uncommon phenomena but are nonetheless associated with much of the more significant variation in plastome size in angiosperms. For example, the largest known angiosperm plastome, that of Pelargonium × hortorum, also possesses the largest known IR, at ~76 kb in length [55]. Other significant IR expansions and contractions have been found in Campanulaceae [56][57], Apiaceae [58], and Lemna (Araceae) [59].

Impact of Additional Taxon Sampling on Basal Eudicot Phylogeny
The inclusion of Megaleranthis and Tetracentron in our analyses had no effect on the relationships among the major early-diverging eudicot lineages, and very little effect on support values. Of the basal splits among the eudicots with BS values less than 100% in both the current tree and that of Moore et al. [26], all differed by no more than 3% in BS value. For example, the sister relationship of Buxales and Gunneridae is 70% in Moore et al. [26] vs. 67% with the inclusion of Megaleranthis and Tetracentron, and the sister relationship of Sabiales and Proteales has BS support of 80% in Moore et al. [26] vs. 83% in the current analyses. These similar values are unsurprising given that Tetracentron and Trochodendron are found to be relatively closely related in our analyses. Indeed, the relatively low sequence divergence between the Tetracentron and Trochodendron plastid genomes supports the taxonomic placement of Tetracentraceae within Trochodendraceae, as advocated by APG III [1]. Although it is possible that the addition of the noncoding regions of the plastid genome (or at least those noncoding regions that can be aligned) to our data set may improve support for these relationships, we may have to look to the other plant genomes for a confident resolution of relationships among the early-diverging eudicots. In fact, the sister relationship of Buxales and Gunneridae received high support (BS = 98%) in the 17-gene analyses of Soltis et al. [28], which employed a combination of 11 plastid genes, 18S and 26S nuclear rDNA, and 4 mitochondrial genes. However, the sister relationship of Sabiales and Proteales was more poorly supported (BS = 59%) in Soltis et al. [28].

Divergence Time Between Tetracentron and Trochodendron
Cenozoic Trochodendrales fossils are known throughout the Northern Hemisphere, with the Paleocene Nordenskioldia the earliest certain fossil of the order [7][8][9][10][11][12][13][14][15]. Both Tetracentron and Trochodendron had wide distributions in the Northern Hemisphere during the Paleogene and Neogene. Fossil remains of Tetracentron have been found in Japan [60][61], Idaho [62], Princeton, British Columbia and Republic, Washington [63], and Iceland [15]; Trochodendron fossil remains have been reported from Kamchatka [64], Japan [11], Idaho and Oregon [11][12], Washington [7], and British Columbia [63]. Our estimate of the divergence time between the two genera of Trochodendraceae (44-30 mya) encompasses the recent estimate of 37-31 mya from Bell et al. [65], which was based on analysis of 567 taxa and three genes, as well as the mid-Eocene estimate of ~45 mya derived from the rbcL analysis of Anderson et al. [66], which employed numerous fossil constraints from the early-diverging eudicots. The congruence among these studies and with the fossil record suggests that a mid- to late Eocene divergence for the two extant Trochodendraceae lineages may be a reasonable estimate.

Analysis of Plastid SSR Loci in the Trochodendrales
Because microsatellite loci, including cpSSRs, often exhibit high variation within species, they are considered valuable molecular markers for population genetics [67][68][69]. A limited number of SSR loci were recently characterized for Tetracentron [70], but no cpSSR loci have been available for Trochodendraceae. The 77 cpSSR loci identified in each of Tetracentron and Trochodendron represent ~42% more loci than the 54 loci reported in the plastid genome of Megaleranthis (Ranunculaceae), the only other early-diverging eudicot for which a comprehensive analysis of cpSSR loci is available. The abundant and varied cpSSR loci identified in Trochodendrales will be useful in characterizing the population genetics of both extant species, which are of conservation interest in the wild because of their relatively narrow, presumably relictual distributions and decreasing numbers [71]. Tetracentron is officially afforded second-class protection in China.

Sample Preparation, Sequencing, and Assembly
Fresh leaves of Tetracentron sinense were collected from the Kunming Institute of Botany at the Chinese Academy of Sciences, and a voucher was deposited at the Herbarium of Wuhan Botanical Garden, Chinese Academy of Sciences (HIB). Chloroplast DNA was isolated following the protocol of Zhang et al. [45], and an Illumina library was constructed following the manufacturer's protocol (Illumina). The DNA was indexed by tag and

Genome Annotation and Analysis
The Tetracentron and Trochodendron plastid genomes were annotated with DOGMA [73] and BLAST tools from NCBI (the National Center for Biotechnology Information). Physical maps were generated using GenomeVx [74] with subsequent manual editing. Sequence divergence between the Tetracentron and Trochodendron plastid genomes was evaluated using DnaSP version 5.10 [75], and genome sequence identity plots were generated using mVISTA [76] (http://genome.lbl.gov/vista/mvista/submit.shtml). Msatfinder ver. 1.6.8 [77] was used to identify SSR loci by manually setting repeat units.
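As an illustration of what a per-gene sequence divergence value means, the sketch below computes the uncorrected proportion of differing sites (p-distance) between two aligned sequences. This is an assumption about the kind of measure involved, not a description of DnaSP's internals; the function name is illustrative, and sites with a gap or ambiguous base in either sequence are simply skipped.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or a non-ACGT character
    are excluded from both the numerator and the denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for base_a, base_b in zip(seq_a.upper(), seq_b.upper()):
        if base_a in "ACGT" and base_b in "ACGT":
            compared += 1
            if base_a != base_b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


# Toy aligned fragments, NOT real gene sequences.
gene_a = "ATGGCTAAAGGT-CAA"
gene_b = "ATGGCTAAGGGTACAA"
print(f"divergence: {100 * p_distance(gene_a, gene_b):.1f}%")
```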

Phylogenetic and Divergence Time Analyses
All protein-coding sequences, as well as all rRNA sequences, were extracted from the Tetracentron and Megaleranthis [52] plastomes and added manually to the 83-gene, 86-taxon alignment of Moore et al. [26]. ML analyses were performed on the concatenated 83-gene data set using the following partitioning strategy: (1) codon positions 1 and 2 together; (2) codon position 3; and (3) rRNA genes. The optimal nucleotide substitution model was selected for each partition in jModelTest 2.1.1 under the Decision Theory (DT) criterion [78]. The following models were selected: TVM+I+Γ for codon positions 1+2 and for codon position 3, and TIM1+I+Γ for rRNA.
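The three-partition scheme described above can be sketched for a single taxon as follows. This is a minimal illustration under stated assumptions, not the authors' workflow: the function names and dictionary keys are hypothetical, each protein-coding gene is assumed to be aligned in frame with a length divisible by three, and in practice partitions are usually specified to the ML program as column ranges rather than by physically splitting sequences.

```python
def split_codon_positions(cds):
    """Split an in-frame coding sequence into (positions 1+2, position 3)."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    pos12 = "".join(base for i, base in enumerate(cds) if i % 3 != 2)
    pos3 = cds[2::3]
    return pos12, pos3


def build_partitions(protein_coding, rrna):
    """Concatenate one taxon's genes into the three partitions.

    protein_coding and rrna are lists of aligned gene sequences,
    given in the same gene order for every taxon.
    """
    parts12, parts3 = [], []
    for cds in protein_coding:
        pos12, pos3 = split_codon_positions(cds)
        parts12.append(pos12)
        parts3.append(pos3)
    return {
        "codon12": "".join(parts12),
        "codon3": "".join(parts3),
        "rRNA": "".join(rrna),
    }


# Toy sequences, NOT real genes.
example = build_partitions(["ATGGCC", "ATGTAA"], ["ACGT"])
print(example)
```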
Partitioned ML analyses were conducted using GARLI 2.0 [79]. A total of ten search replicates were conducted to find the optimal tree, and nonparametric bootstrap support was assessed with 100 replicates [80]. All ML searches used random taxon addition to build starting trees.
Divergence times were estimated using BEAST version 1.7.4 [81], using the same dating strategies employed in Moore et al. [26]. In addition to the three calibration points used in Moore et al. [26] (minimum ages of 131.8 mya for angiosperms [82][83][84][85], 125 mya for eudicots [83,86], and 85 mya for the most recent common ancestor of Quercus and Cucumis [26]), we constrained the stem lineage of Malpighiales using a minimum of 89.3 my [87] and the node uniting Calycanthus and Liriodendron using 98 my [88], and set the age of Proteales to a minimum of 98 my [89].