Comparative Genome Analysis and Phylogenetic Relationship of Order Liliales Insight from the Complete Plastid Genome Sequences of Two Lilies (Lilium longiflorum and Alstroemeria aurea)

Monocots are one of the most diverse, successful and economically important clades of angiosperms. We attempt to analyse the complete plastid genome sequences of two lilies and their lengths were 152,793bp in Lilium longiflorum (Liliaceae) and 155,510bp in Alstroemeria aurea (Alstroemeriaceae). Phylogenetic analyses were performed for 28 taxa including major lineages of monocots using the sequences of 79 plastid genes for clarifying the phylogenetic relationship of the order Liliales. The sister relationship of Liliales and Asparagales-commelinids was improved with high resolution. Comparative analyses of inter-familial and inter-specific sequence variation were also carried out among three families of Liliaceae, Smilacaceae, and Alstroemeriaceae, and between two Lilium species of L . longflorum and L . superbum . Gene content and order were conserved in the order Liliales except infA loss in Smilax and Alstroemeria . IR boundaries were similar in IRa, however, IRb showed different extension patterns as JLB of Smilax and JSB in Alstroemeria . Ka/Ks ratio was high in matK among the pair-wise comparison of three families and the most variable genes were psaJ, ycf1, rpl32, rpl22, matK, and ccsA among the three families and rps15, rpoA, matK, and ndhF between Lilium.


Introduction
The plastid genome generally contains 30-50 different RNA genes and about 100 protein-coding genes in land plants, which are roughly classified into two main groups: genes involved in the expression and translation machinery of the chloroplast, and genes related to bioenergetics and photosynthetic function [1]. It is highly conserved in organization, gene order and content with a typical circular form [2]. In the past several years, numerous data for genomic sequences have been pouring in and are being applied in the plant phylogeny field due to the development and advancement of next-generation sequencing (NGS) technologies. This method has contributed to improve many studies in plant biology: molecular marker development, hybridisation and introgression, transcriptome investigation, phylogenetic and ecological studies and polyploidy genetics [3].
Monocots are one of the major radiations of angiosperms. They have relatively uniform characters: a single cotyledon, parallel leaf venation, floral parts in threes, sieve-tube plastids with several cuneate protein crystals, scattered vascular bundles in their stems, no vascular cambium-producing secondary phloem and secondary xylem with some exceptions [23]. They comprise ca 61,000 species in 78-80 families and 11 orders [24,25] and are one of the most diverse, morphologically varied, ecologically successful and economically important clades of angiosperms [12].
Although many molecular phylogenetic studies have led to a remarkable reclassification and understanding of relationships within and among the orders, many relationships remain weakly supported among the orders and families. In contrast to the morphological data, molecular data placed Acorus on the root node within monocots [26,27]. This was also revealed by recent plastome sequence analyses [12,22,28]. Chase and colleagues [26] suggested a well-defined ordinal relationship in monocots using combined data from seven genes: four plastid genes, one mitochondrial gene and two nuclear ribosomal genes. Twelve orders were positioned subsequently from the root: Acorales, Alismatales, Petrosaviales, Dioscoreales and Pandanales, Liliales, Asparagales and commelinids, which includes the five orders of Arecales, Commelinales, Zingiberales, Dasypogonales and Poales. Although all orders were strongly supported, the sister relationship between Liliales and Asparagales remains uncertain as does the ordinal relationship within commelinids. Givnish and colleagues [12] tried to resolve these uncertain relationships using 81 plastome-encoded gene sequences and an increase in taxon sampling, including 32 families of monocots, and the results showed an improved support value for those relationships in the maximum likelihood (ML) tree. However, their positions and relationships are still ambiguous, and in addition, the position of Liliales and Dioscoreales+Pandanales is reversed in the most parsimonious (MP) tree. These problematic relationships were also described in the phylogenetic study of Moore and colleagues [28] using plastid inverted repeat sequences. Recently Liu and colleagues [29] reported a complete plastid genome of Smilax china (Smilaceae), a member of Liliales, and tried to analyse the phylogenetic relationship of monocots using three different combined data matrix of 63 chloroplast genes (excluding ndhA-K, infA, rps16, and ycf2 from 77 genes data matrix), 77 protein coding genes, and 81 genes (including rRNA genes from 77 genes data matrix) focused on the position of Liliales. It showed a sister relationship of Liliales and Dioscoreales-Pandanales that was strongly supported by 63 genes, while a sister relationship of Liliales with the commelinids-Asparagales clade was supported moderately on ML analysis by the 81 genes. They suggested a rapid divergence among Liliales, Dioscoreales-Pandanales, and commelinids-Asparagales.
We analyzed the complete plastid genomes of two lilies from different families, the Easter lily (Lilium longiflorum, Liliaceae) and Peruvian lily (Alstroemeria aurea, Alstroemeriaceae), which are famous ornamental flowers in the order Liliales. The former is recognized as most distinctive genera being closer to Smilacaceae and the latter is near to the basal clade in the order Liliaes [30]. Although they have not generated the complete sequence and structure, the partial protein coding gene sequences of Lilium superbum was repeatedly referred as a representative of Liliales in many studies. We also attempted to compare the organisation of the whole plastid genome among 3 families of Liliales and to clarify the phylogenetic position of the order Liliales within monocots using 79 protein-coding gene sequences. In addition, we confirmed the sequence variation in species level of the order Liliales using the 78 plastid genes of two Lilium species, newly analyzed L. longiflorum and L. superbum downloaded from the NCBI database. We also described here a new method for extracting plastid DNA with a simple buffer composition compared topreviously applied methodologies [31][32][33], in the hope that it will be applied in monocot plastome research without the centrifuge equipped swinging bucket rotor.

Plastid isolation from two lilies
We established a simple method for isolating plastids of Lilium longiflorum and it was successfully applied to Alstroemeria aurea. We prepared 50 g of fresh young leaves and deposited in a refrigerator for 2 days. They were then cut in 1-2-cm 2 pieces and blended with isolation buffer [CIB; 0.35 M sorbitol, 50 mM Tris-HCl, 5 mM EDTA, 0.1% bovine serum albumin (BSA; w/v, Sigma A4503), 0.01% DL-DTT]. The slush was filtered through a 4-fold gauze by squeezing and a 3-fold miracloth (Calbiochem, cat. no. 475855) without squeezing. The filtrate was centrifuged at 200 g for 3 min and the supernatant was centrifuged again at 1,000 g for 7 min. After the pellet was completely diluted with CIB buffer without BSA, the plastid band was extracted using a 40/80% Percoll gradient by centrifuging at 3,200 g for 15 min. To purify the plastid solution, 3 volumes of CIB buffer without BSA, were added to the isolated solution and centrifuged at 1,700 g for 1 min. The pellet was dissolved again with CIB buffer without BSA, and plastid DNA was used as a template for 454 pyrosequencing and finally extracted using a DNeasy Plant Mini Kit (Qiagen, cat. no. 69104). Until the plastid DNA was isolated, the highspeed centrifuge was maintained at 4°C.

Genome sequencing, assembling and annotation
The genome sequencing was performed at SolGent Co. (Daejeon, Korea). The quality of the DNA was assessed by gel electrophoresis (Fig. S1) and the quantity was estimated by a fluorescence-based method using the Quant-iT PicoGreen dsDNA Kit (Invitrogen). A whole-genome shotgun library was generated from 5 µg of the plastid DNA with the GS DNA Library Preparation Kit (Roche Applied Science) according to the manufacturer's protocol. The DNA library was titrated by means of sequencing on the Genome Sequencer FLX system (Roche Applied Science). Based on the results of the titration sequencing run, an appropriate amount of the DNA library was used for the emulsion PCR setup. Subsequently, the clonally amplified DNA fragments bound to capture beads were enriched and sequenced on four medium regions of a PicoTiter Plate using standard sequencing chemistry (Roche Applied Science). Upon sequencing and processing of the raw data, a de novo assembly was performed using the GS de novo Assembler software version 2.5.3 with default settings. Gaps between the contigs were filled using designed primer sets and whole plastid genome sequence was obtained. The plastid genome of Lilium longiflorum and Alstroemeria aurea were annotated using DOGMA (Dual Organellar GenoMe Annotator, http://dogma.ccbb.utexas.edu/, [34]). Annotation of the transfer RNA gene was performed using DOGMA and the tRNAscan-SE programme (ver. 1.23 [35]). Intron and exon boundaries for intron containing genes were determined by comparison of reference sequences of monocots.

Taxon and gene sampling, sequence alignment and phylogenetic analysis
The 28 taxa selected here represent eight orders of monocots with two ancestral angiosperms and three gymnosperms as outgroups ( Table 1). All of the plastid genome sequences were available in GenBank including the partial coding gene sequences of three species, Pandanus utilis, Yucca schidigera, and Lilium superbum. The character matrix for phylogenetic analysis consisted of the nucleotide sequences of 79 protein-coding genes and all of the pseudogenes were also included in the data matrix (Data File S1), although several gene and intron losses were found. We excluded four ribosomal RNA genes because they affected an obviously incongruent phylogenetic tree of monocots based on the data matrix of gene partitions [29]. They were aligned by MUSCLE [36] and manually adjusted. Both RAxML, BI, and MP analyses were performed on the concatenated 79-gene data set which generated by Geneious R6 (ver. 6.0.5 available from http://www.geneious.com/ [37]). The length of the 79 aligned genes used for phylogenetic analysis was 96,692 bp and the aligned matrix is available from the authors on request. Akaike Information Criterion (AIC) via jModelTest (ver. 0.1.1 [38]) was used to determine the most appropriate substitution model for the full data matrix in addition to 79 separated gene data matrix (Table S1). The RAxML tree was generated with the RAxML BlackBox web-server (ver. 7.2.8 [39]) which performs under the GTR+G model of nucleotide substitution, with gamma distributed rate heterogeneity and a proportion of invariant sites. And also rapid bootstrap was performed with 100 replications. Bayesian inference analysis was performed using MrBayes plug-in [40] in Geneious 6.0.5 [37] with default setting upon the GTR model and gamma rate variations. The MP tree was also constructed by heuristic search and bootstrap was performed with 1000 replicates using PAUP* 4.0b10 [41]. For the heuristic analyses, tree searches were performed with 1000 random sequence additions and tree-bisection-reconnection (TBR) branch swapping, permitting 10 trees to be held at each step to reduce time searching suboptimal 'islands' of trees (e.g. [26,42]). Bootstrap analysis [43] used the same settings as above.

Comparative analyses of plastid genome sequences among the familial and species level in the order Liliales
We compared the plastid genome structures of three families from the order Liliales using the data of Lilium longiflorum, Alstroemeria aurea, and Smilax china [29]. The structural changes in the plastid genome were confirmed by gene content and order comparisons performed using MultiPip-maker [44]. IR boundaries were also described among three families and the substitution rates of 78 genes, which excluded infA gene because of their loss in Smilax and Alstroemeria, were calculated using DnaSP [45]. We also analysed the tandem repeat sequences distribution in the plastid genome of three families using Tandom Repeat Finder [46]. The sequence variation of 78 plastid genes between two Lilium species, L. longiflorum and L. superbum, were also confirmed.

Plastid genome features of the Easter and Peruvian lily
We sequenced the complete plastid genome of the Easter lily (Lilium longiflorum, Liliaceae, KC968977) and the Peruvian lily (Alstroemeria aurea, Alstroemeriaceae, KC968976). Plastid genome of L. longiflorum is 152,793 bp in length and composed of LSC region of 82,230 bp, two IR copies of 26,520 bp and SSC region of 17,523 bp (Figure 1). A total of 136 predicted coding regions were detected, 92 of which were different and 22 of which were duplicated in the IR. The coding regions included 47 protein coding genes, 37 transfer RNAs, 4 duplicated ribosomal RNAs, and 26 ribosomal proteins ( Table  2). Sixteen genes containing introns are trnA-UGC, trnI-GAU, trnK-UUU, trnL-UAA, petB, petD, atpF, ndhA, ndhB, clpP, rpl2, rpl16, rps12, rps16, ycf1 and ycf3. Both clpP and ycf3 were composed of two introns but the others had a single intron. Three genes, psbT, rpl2 and ndhD, possess ACG and rps19 has GUG as start codons. infA may be a pseudogene consisting of 231 bp modified at the 5' end.
A. aurea possess a structurally similar plastid genome to L. longiflorum and it is consisted of 155,510 bp. It includes LSC region of 84,241 bp, two IRs of 26,701 bp and SSC region of 17,867 bp ( Figure 1). It shows the same gene contents and orders excluding the loss of infA and ycf15.

Phylogenetic analyses
Tree topologies for RAxML (-lnL of 564070.1945), Bayesian inference analysis (BI), and the most parsimony analysis (MP) were congruent with each other and all clades were strongly supported in those trees ( Figure 2). Monocots were monophyletic with strong support (BP 100 in both the RAxML and MP tree, PP 1.0 in the BI tree) and Acorus (order Acorales) was positioned in the basal clade within the monocots. Lemna (order Alismatales) was subsequently divergent. Dioscorea (order Dioscoreales) and Pandanus (order Pandanales) weresister to each other and made a strongly supported clade. A sister relationship between order Liliales and order Asparagales -commelinids clade was also improved (BP 98 in RAxML, PP 100 in BI, and BP 92 in MP). Within Liliales, acloser relationship between Liliaceae and Smilacaceae was formed, rather than Alstroemeriaceae. Phoenix (Arecales) was a sister of the Poales clade and Typha (Typhaceae) was located in the basal position in Poales. We constructed a phylogenetic tree according to their substitution model, by a combined data matrix of genes with GTR-model (GTR+G, GTR +I, GTR+I+G, total of them), TVM-model (TVM+G, TVM+I+G, total of them), and K81uf-model (K81uf+G and total of them) generated the RAxML tree, which has the same topology with overall data matrix combined of 79 genes even with the supporting values of branches often decreased (data not shown).
Major gene and intron deletions were found in several orders and are mapped on Figure 2: accD in Acorales, rps16 in Dioscoreales, three genes (accD, ycf1 and ycf2) and two introns (rpoC1 and clpP) in Poaceae and ndhF and ndhA in Orchidaceae of Asparagales (Oncidium has a partial ndhA gene). In addition, ndhK, ndhH, both of rps14 and ycf4, psaJ and rpl33 were lost in Oncidium, Phalaenopsis, Festuca, O. sativa and Triticum. infA was absent in Lemna (Alismatales) and Liliales, although Lilium has pseudo-infA.

Comparison of the plastid genome sequences of three families of the order Liliales
Results of whole-genome comparison of three families, Liliaceae, Smilacaceae, and Alstroemeriaceae, revealed similar plastid genome structures and the gene contents were highly conserved within order Liliales (data not shown) except infA, which was deleted in Smilax and Alstroemeria. They also showed similar identities of 87-88% when compared to the plastid genome sequence of Phalaenopsis (Asparagales). Plastid genome of Smilax was longer than the other two genomes and it arises from the slightly larger LSC, IRs, and SSC and in their similar G+C contents (Table 3). IR junctions were varied among three families, despite the same expansion pattern in Lilium and Alstroemeria (Figure 3). Junction of LSC-IRb (JLB in Figure 3) was confirmed at rps19 in Lilium and Alstroemeria, while it was at rpl22 in Smilax. IRb-SSC boundary (JSB in Figure 3) was extended to ndhF gene only in Alstroemeria and 1bp of it was included in Lilium. IRa boundaries were positioned near to 3' end of ycf1 (JSA in Figure 3) and with an end point of partial rps19 in Lilium and Alstroemeria, and complete rps19 with partial rpl22 in Smilax, although thses was not annotated in their original submission (JLA in Figure 3).
Comparative Genome Analysis of Order Liliales PLOS ONE | www.plosone.org

Sequence variations between two Lilium species
Inter-specific sequence variation was analyzed in 78 genes, excluding cemA which was absent in the database of L. superbum (Table S3). Our analysis included partial sequences for L. superbum because of its incomplete sequence information. The variation ratio was calculated among truly aligned sequences between two species. We described sequence information of the ten most variable genes in Table 4 and it was highest in rps15 with 3.66% variation. Most of them are comprised of less than 1,000bp except three genes rpoA, matK, and ndhF. In contrast, 11 genes were completely conserved, petN, psaI, psaJ, psbF, psbJ, psbL, psbT, psbZ, rpl23, rpl 32, and rps7. We found that some of the genes have indel sequences. 6bp insertion (TTGGCG) of repeat sequence was found in L. longiflorum compared with L. superbum. This similar pattern was also distributed in infA and another repeat sequence composed 6bp (CTTTTA) was deleted in L. longflorum. 6bp insertion (CTTTAG) and 13bp deletion (ATATCTATTTTGATGATAGTGACA) were also detected in rpl20 and ycf2, respectively. In ndhG, eight T of poly-T region was deleted (Table S3).

Discussion
Here, we analyzed the complete plastid genome sequences of two important ornamental lily species (Lilium longflorum and Alstroemeria aurea) of the order Liliales and it allowed us to solve the confused sister relationship of this order. We also achieved a comparative analysis in inter-familial and interspecific variation of true lilied plastid genomes.

Sister relationship of the order Liliales in monocots
Phylogenetic analyses of 79 plastid protein-coding genes produced a well-resolved phylogeny of seven monocot orders and it was mostly congruent with the results of Chase and colleagues [26]. The basal position of Acorales (Acorus) and the subsequent divergence of Alismatales (Lemna) were reconfirmed. These two basal orders are commonly aquatic or emergent aquatic [23]. Petrosaviales, which was recognised in APG III [25], was regarded as a sister of the lilied/commelinid clade [26,27], but unfortunately, it was not included in this study. In recent phylogenetic studies using plastid IR sequences, these relationships were also weakly supported as the Liliales and Asparagales+commelinids clade with BP 68 and Asparagales and commelinids clade with BP 53 [28]. Givnish and colleagues [27] introduced a well-defined relationship of monocots using a ML tree composed of branches with strong support based on 81 plastid genes. Although they focused on the phylogenetic relationship of Poales and a MP tree using the same data matrix, they showed a different topology with a closer relationship between Dioscoreales and Asparagales. The sister relationships among the Dioscorelaes-Pandanales, Liliales, Asparagales, and commelinids clade, which showed the rapid divergence among the groups, were well supported. However, the contradiction between poor resolution and closely related group around Liliales still remained even though an attempt using various data matrix of chloroplast genes [29]. Our phylogenetic tree revealed a strong sister relationship of Liliales to the Asparagales+commelinids clade with improved resolution compared to the previous phylo-genomic study [29] using Smilax and L. superbum data as a representatives of Liliales. Additionally, in Poales, Typhaceae was located on the root of Table 2. List of genes found in Lilium longiflorum chloroplast genome.

Group of gene
Name of gene* No.   [11,13]. Gene loss events in major clade of monocots was congruent with the previous studies [22,29,47], except infA gene loss in Liliales, although it was present in Lilium as a pseudo-gene.

Inter-familial variation of plastid genome structure in Liliales
The gene contents and orders in the plastid genome were congruent among three families, except infA loss in Smilax and Alstroemeria. If unclear IR boundary information is given, incorrect gene order phylogenies are recovered. Therefore, the maintenance of the IR is necessary in the evolution of chloroplast genomes in most cases. Yue and colleagues [48] proposed that the IR provides an insulation mechanism that stabilizes the genome structure and the genes in single copy regions do not commute across the IR. Generally, IRs of monocots contained a trnH-rps19 gene cluster near the IRa-LSC junction; moreover, they expand more progressively than non-monocot angiosperms [49]. They explained that a doublestrand break (DSB) event first occurred at IRb and led to the expansion of the IR to trnH, followed by a successive DSB event within IRa leading to the expansion of the IR to rps19 or to rpl22. Results comparing IR expansion among three families of Liliales showed that they have a typical JLA of monocots with a trnH-rps19 cluster. However, JLB was extended to rpl22 in Smilax plastid genome. Moreover, we found that JSB was moved to ndhF gene in Alstroemeria plastid genome (Figure 3) which explains the length variation in the total sequences of three different plastid genomes, as well as a length variation in inter-genic spacer region (Table 3). Ka/Ks ratios according to the pair-wise comparison of substitution among three families were different to the gene partition using whole data of major monocots lineages of Liu et al. [29]. From the results, we suggest that ccsA, matK, ndh gene series (A, B, D, F, and H),  rpoA, and rbcL will be suitable genes for phylogenetic study of monocots concerning available length using PCR amplification reaction and proportion of variable sites, though ndh genes were often missing especially in some lineage of Asparagales.
Most of repeat sequences were distributed in the non-coding regions of inter-genic spacer or intron. Although, many of the repeat sequences were in Smilax, and the largest repeat, over 100bp, was uniquely found in Lilium.

infA divergence among three families
Millen and colleagues [50] suggested that infA, which codes for translation initiation factor 1, has been entirely lost or has become a pseudogene ca 24 separate times in 309 angiosperms. In four species, this gene was regarded to be transferred from chloroplast to nuclear DNA independently during angiosperm evolution. According to their results, this parallel event occurred in Tricyrtis and Smilax, a member of Liliales. Our data revealed that this event also occurred in Alstroemeria, which is closer to the basal Liliales than Lilium or Smilax. Interestingly, even in the Lilium plastid genome, infA seems to have lost its function because it has AAT instead of AGT in the start codon position and includes two premature stop codons, although we do not know what kind of mutation has occurred in the start position of infA. Further study is needed to improve our understanding of infA gene evolution in Liliales.

Inter-specific variation between two Lilium species
Results of the complete plastid genome sequence of two lilies in this study make it possible to analyse the details of sequence variation in species level although it was restricted just in the coding region. We compared the variation of 78 plastid genes between L. longiflorum and L. superbum, which has been used as a representative of the order Liliales until this time. 11genes of 78 genes were conserved between two species, 19 genes were variable over 1% in the entire aligned sequence, and 5 indels, including two repeat sequences composed of 6bp, were also found. The result will provide an informative guide line for recognizing the species and genus of the order Liliales in the near future when more sequence data is accumulated.
Although there was a limitation of material amount, we describe here a simple method for extracting plastid DNA with a simple buffer composition, in the hope that it will be applied in monocot plastome research, which has already been applied in some of the Asparagales members, for example, Asparagaceae, Alliaceae, and Orchidaceae. Figure S1. Electrophoretogram of isolated plastid DNA in the present study. M) lambda-HindIII digest, Lil) Lilium longiflorum (268 ng/µl), Als) Alstroemeria aurea (314 ng/µl). (TIF)