The Complete Chloroplast and Mitochondrial Genome Sequences of Boea hygrometrica: Insights into the Evolution of Plant Organellar Genomes

The complete nucleotide sequences of the chloroplast (cp) and mitochondrial (mt) genomes of the resurrection plant Boea hygrometrica (Bh, Gesneriaceae) have been determined, with lengths of 153,493 bp and 510,519 bp, respectively. The smaller chloroplast genome contains more genes (147) with a 72% coding sequence, and the larger mitochondrial genome has fewer genes (65) with a coding fraction of 12%. Similar to other seed plants, the Bh cp genome has a typical quadripartite organization with conserved genes in each region. The Bh mt genome has three recombinant sequence repeats of 222 bp, 843 bp, and 1,474 bp in length, which divide the genome into a single master circle (MC) and four isomeric molecules. Compared to other angiosperms, one remarkable feature of the Bh mt genome is the frequent transfer of genetic material from the cp genome during recent Bh evolution. We also analyzed organellar genome evolution in general with regard to genome features as well as the compositional dynamics of sequence and gene structure/organization, providing clues for understanding the evolution of organellar genomes in plants. The cp-derived sequences, including tRNAs, found in angiosperm mt genomes support the conclusion that frequent gene transfer events may have begun early in the land plant lineage.


Introduction
Plastids and mitochondria are essential organelles in plant cells. Chloroplasts conduct photosynthesis in the presence of sunlight and mitochondria indirectly supply energy within plant cells; together they form the powerhouses of the cell. Both chloroplasts and mitochondria possess their own genomes. The chloroplast (cp) and mitochondrial (mt) genomes are often used for the study of plant evolution [1,2]. Most sequenced cp genomes range from 120 to 160 kb in length and have GC contents of 30 to 40%. The quadripartite organization is shared by almost all cp genomes, consisting of a large-single-copy region (LSC; 80-90 kb) and a small-single-copy region (SSC; 16-27 kb), as well as two copies of inverted repeats (IRs) of ~20 to 28 kb in size. The gene content and structure of angiosperm cp genomes are highly conserved [3,4]. Expansion and contraction of the IR as well as gene and intron losses have been documented in a wide range of angiosperms [5,6].
The mt genome plays crucial roles in plant development and productivity [7]. In comparison to non-plant eukaryotes, plants have large and complex mt genomes [8,9]. Mt genomes of seed plants are unusually large and vary in size by at least an order of magnitude. Much of this variation occurs within a single family [10]. Seed plant mitochondrial genomes are characterized by their very low mutation rate [11], frequent uptake of foreign DNA by intracellular and horizontal gene transfer [12], and dynamic structure [13]. The evolving land plants have gained new mechanisms to facilitate more frequent gene exchanges between mt and cp genomes as well as between mt and nuclear genomes, which cause mt genomes to increase in size [14].
In the past several years, we have witnessed a dramatic increase in the number of complete organellar genomes, especially those of plants. To date, 206 complete cp genomes and 47 mt genomes have been deposited in GenBank Organelle Genome Resources. With the emergence of next-generation sequencers, new approaches for genome sequencing have been gradually applied owing to their high throughput, time savings, and low cost. With a new gene-based strategy and by combining data from two next-generation sequencing platforms, pyrosequencing (Roche GS FLX) and ligation-based sequencing (Life Technologies SOLiD), we successfully assembled the cp and mt genomes of the resurrection plant B. hygrometrica (Bunge) R. Br [15]. B. hygrometrica, or Bh, is a small dicotyledonous, homoiochlorophyllous resurrection plant in the Gesneriaceae family. It is widespread in China, inhabiting shallow rock crevices from humid tropical regions to arid temperate zones [16,17]. In this study, we analyze the genomic features and structures of both the cp and mt genomes of B. hygrometrica. Through organellar genome comparison with other lower plants and angiosperms, we provide information for a better understanding of organellar genome evolution in land plants.

Results and Discussion
Features of B. hygrometrica cp genome and mt genome
The Bh cp genome is 153,493 bp in length and has a GC content of 37.59%. Similar to those of other angiosperms [4,18,19], the Bh cp genome maps as a circular molecule with the typical quadripartite structure: a pair of IRs (25,450 bp each, covering 16.6%) separated by the LSC (84,692 bp, covering 55.1%) and SSC (17,901 bp, 11.7%) regions (Figure 1). It encodes 147 predicted functional genes, 19 of which are duplicated in the IR regions. Among these 147 genes, we identified 103, 36, and 8 protein-coding, tRNA, and rRNA genes, respectively (Table 1 and Table 2). Non-coding sequence, including introns, intergenic spacers, and pseudogenes, accounts for 38% of the genome. All the rRNA genes (rrn16, rrn23, rrn5 and rrn4.5) and 7 tRNA genes (trnI-CAU, trnL-CAA, trnV-GAC, trnI-GAU, trnA-UGC, trnR-ACG, and trnN-GUU) are located in the IR regions. Similar to other dicot species, Bh has two genes (rps19 and trnH) located at the IR/LSC junctions. This differs from monocots, such as rice and maize, whose cpDNAs have a fully duplicated rps19 gene at the IR/LSC junctions [18]. The average length of intergenic regions is 385 bp, varying from 1 to 2,221 bp. There are 4 cases of overlapping genes (psbD-psbC, ndhK-ndhC, atpE-atpB, and ycf1-ndhF), resulting in an average coding density (including conserved genes, unique ORFs and introns) of 1/1,058 bp. The cp genome has 19 intron-containing genes, 12 of which are protein-coding genes and 7 of which are tRNA genes. In terms of size, gene content, and intron composition, Bh cpDNA is most similar to Olea europaea cpDNA (155,872 bp, GC 37%) [20] among all angiosperms. Sequence alignment shows that 93% (142,189 bp) of the Bh cp genome sequence is covered by that of O. europaea with 95.4% identity (Figure S1). A total of 36 tRNAs are detected, enabling the B. hygrometrica cp genome to decode all 61 codons. All 3 stop codons are present, with UAA being the most frequently used (UAA 40%, UAG 33% and UGA 27%) (Table S1).
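The region lengths and gene counts quoted above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures reported in the text; the function and variable names are illustrative and do not come from the original analysis.

```python
# Region lengths (bp) as reported in the text for the Bh cp genome.
LSC = 84_692
SSC = 17_901
IR = 25_450  # length of one IR copy; the genome carries two copies


def region_shares(lsc, ssc, ir):
    """Return the total genome length and each region's share (%) of it."""
    total = lsc + ssc + 2 * ir
    shares = {
        "LSC": 100 * lsc / total,
        "SSC": 100 * ssc / total,
        "IR (each)": 100 * ir / total,
    }
    return total, shares


total, shares = region_shares(LSC, SSC, IR)
print(total)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")

# Gene tally from the text: protein-coding + tRNA + rRNA genes.
gene_total = 103 + 36 + 8
print(gene_total)
```

The sum of the four regions reproduces the reported genome length, and the three gene classes add up to the reported 147 genes.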
Phylogenetic analysis based on 63 protein-coding sequences from 12 cp genomes (one green alga as outgroup, one land plant, and 10 seed plants) (Table S2) indicates that, among the analyzed dicots, B. hygrometrica is positioned close to V. vinifera (Figure S2).
The Bh master mt genome is assembled into a circular molecule of 510,519 bp in length (Figure 2) and has an average GC content of 43.27%. Among dicots, it is larger than the mt genome of A. thaliana (366,924 bp) [21] but smaller than that of Vitis vinifera (773,279 bp) [12]. With only 12% of coding sequences, the largest part (88%) of this genome is non-coding, containing 1.45% repeat and 10.52% cp-derived sequences (Table 1). The mt genome has 65 genes, including 33 protein-coding, 4 rRNA, and 28 tRNA genes, and 10 genes have exon-intron structure. Similar to other angiosperms, the Bh mitochondrion uses the canonical genetic code. All 61 codons are present in the mt genome, and UAA (46%) is the most frequently used stop codon (Table S3). The 33 known protein-coding genes in the mt genome are similar to those of other published angiosperm mt genomes (Table 2), such as 9 subunits of NADH dehydrogenase (complex I), 5 subunits of ATP synthase (complex V), and 3 subunits of cytochrome c oxidase (complex IV). Compared to other angiosperms, we observed that there is one sdh3 and no sdh4 in the Bh mt genome. There are 3 copies of sdh3 in the mt genome of Nicotiana tabacum [22], and both sdh3 and sdh4 are present in that of V. vinifera [12]. The Bh cox1 has an intron/exon structure that is unlike that of other higher seed plants (Table S4). There are two 5S rRNA (rrn5) copies and one copy of rrn5 from its cp genome. The best sequence alignment score belongs to the V. vinifera mt genome, with 23% (119,377 bp) of the Bh mt genome being alignable to that of V. vinifera with 94.2% identity (Figure S3).
Plant cells often contain multiple clones or copies of cp and mt genomes, and thus the organellar genomes can be regarded as a population with genetic heterogeneity [4]. Polymorphic sites can be detected by aligning thousands of high-quality reads to the assembled cp or mt genome [23]. Our SNP analysis shows that there are no intravarietal SNPs (intraSNPs) in the Bh cp genome. However, we identified 729 SNPs in the Bh mt genome (Table S5), corresponding to a rate of 1 SNP per 700 bp. We detected only 9 SNPs in gene regions, with 2, 1, 1, and 5 in rrn5, rrn26, trnM-CAU, and rrn18, respectively (Table 3). No intraSNPs are detected in known protein-coding gene regions. IntraSNPs have been demonstrated to be present in both cp and mt genomes of rice [23,24]. As an indicator of the heterogeneous nature of cp and mt populations, the intravarietal polymorphisms provide useful markers for future genetic studies on B. hygrometrica.
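The SNP rate quoted above follows directly from the genome length and the SNP count. The sketch below recomputes it from the reported figures; the helper name snp_spacing is hypothetical and introduced only for illustration.

```python
def snp_spacing(genome_length_bp, snp_count):
    """Average number of bases per SNP, or None when no SNPs were found."""
    if snp_count == 0:
        return None
    return genome_length_bp / snp_count


# Figures reported in the text.
mt_spacing = snp_spacing(510_519, 729)  # mt genome
cp_spacing = snp_spacing(153_493, 0)    # cp genome, no intraSNPs detected

# SNPs that fall in annotated gene regions (Table 3 as summarized in the text).
genic_snps = {"rrn5": 2, "rrn26": 1, "trnM-CAU": 1, "rrn18": 5}
genic_total = sum(genic_snps.values())

print(f"mt genome: one SNP per {mt_spacing:.0f} bp on average")
print(f"genic SNPs: {genic_total} of 729")
```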

Structural dynamics of Bh mitochondrial genome in ontogeny
All previous studies on complete sequencing of flowering plant mt genomes are based on the master circle (MC) hypothesis [21,22,25-28]. Arrieta-Montiel et al. reported on the structural dynamics of the common bean mt genome [29]. The analysis of 10 recombinant clones supports the existence of the MC molecule in the wheat mt genome [7]. However, the result of field-inversion electrophoresis suggests that the Physcomitrella mt genome does not consist of a multipartite structure, as seen in angiosperms [30]. As mitochondrial gene orders are significantly different between lower plants and higher flowering plants, the multipartite structures seen among angiosperms may have originated during the evolution of pteridophytes or seed plants [9,30].
Repeat prediction by REPuter shows that there are 14 repeat pairs in the Bh mt genome (Table 4). Since not all the repeats are involved in recombination, we detected from the mt assembly [15] 3 repeat-specific contigs that are candidates for recombination among the MC and other isomeric (IO) and subgenomic molecules, and these have been confirmed by SOLiD sequencing [15]. Those 3 repeat contigs correspond to two palindromic matches (1,474 bp and 843 bp) and one forward match (222 bp). By aligning all SOLiD long mate-pair reads to both ends of the repeats, we constructed the MC and 4 isomeric molecules (Figure 3). The subgenomic molecules are not discussed further because we have yet to find significant differences among the sequence reads.
The recombinant repeats of Bh (222 bp, 843 bp, and 1,474 bp) differ in length from those of V. radiata, in which recombination occurs across short mt repeats (38-297 bp) [31]. The longer repeats are reminiscent of those found in other angiosperm mitochondrial genomes, which are involved in mt genome rearrangements and can result in stoichiometric shifting of subgenomic mt genome topologies, occasionally pushing one or more of the alternative DNA topologies below the detection level [29,32]. The 3 recombinant repeats are located in gene-rich regions and split the mt genome into 6 segments. Each segment carries some essential conserved genes, such as nad1 e4 and rps13 in segment A, and atp6, nad2 e3-5 and coxb in segment B. For this reason, it is possible that the mt genome of Bh does not have subgenomic molecules. There are 4 genes (nad1, nad2, nad5 and sdh3) with exon-intron structure that are separated by the recombinant repeats (Table 5). The nad1 gene in the MC molecule is a cross-strand gene, with exon 4 on the positive strand and exons 1-3 on the negative strand. Recombination involving introns might lead to rearranged molecules without loss of essential genes [30]. The set of trans-spliced genes differs among the 4 rearranged isomeric molecules, and the IO3 molecule has all 4 trans-spliced genes. Trans-splicing of group II introns is widely distributed in the mt genomes of higher plants [33-35]. We compared gene structures of 15 mt genomes from lower to higher plants and found that 3 conserved genes (nad1, nad2, and nad5) contain trans-splicing introns in Bh as in other higher plants (Table 6), whereas these genes have no introns in Chara vulgaris. The structure of these 3 genes supports the view that the multipartite structures formed by multiple recombination events may have arisen with the earliest tracheophytes [30,32], and it can serve as a molecular signature of plant evolution.
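The way repeats generate alternative molecules can be illustrated with a toy model. The sketch below represents a circular genome as an ordered list of oriented segments and applies the textbook rule that recombination across a forward (direct) repeat pair splits a circle into two subcircles, whereas recombination across a palindromic (inverted) repeat pair inverts the intervening region and yields an isomer. The segment labels and the function names are hypothetical, and the model is not the procedure used to build the MC and isomers from mate-pair reads.

```python
def flip(seg):
    """Reverse the orientation sign of a segment label such as 'A+'."""
    name, strand = seg[:-1], seg[-1]
    return name + ("-" if strand == "+" else "+")


def recombine(circle, i, j):
    """Recombine a circular molecule across the repeat copies at indices i < j.

    Copies in the same orientation (direct repeat) split the circle into
    two subcircles. Copies in opposite orientation (inverted repeat) invert
    the region between them and return a single isomeric circle.
    """
    ri, rj = circle[i], circle[j]
    if ri[:-1] != rj[:-1]:
        raise ValueError("positions i and j must hold copies of the same repeat")
    if ri[-1] == rj[-1]:
        # Direct repeat: each product keeps one repeat copy.
        return [circle[i:j], circle[j:] + circle[:i]]
    # Inverted repeat: reverse the order and flip the strand of what lies between.
    inverted = [flip(s) for s in reversed(circle[i + 1:j])]
    return [circle[:i + 1] + inverted + circle[j:]]


# Hypothetical layouts; 'R' marks a repeat copy, letters mark unique segments.
direct = ["R+", "A+", "R+", "B+"]
inverted = ["R+", "A+", "B+", "R-", "C+"]

print(recombine(direct, 0, 2))    # two subcircles
print(recombine(inverted, 0, 3))  # one isomer with A and B inverted
```

Under this model, each inverted repeat pair toggles the master circle between two orientations of the enclosed region, which is one way to picture how a small number of repeat pairs can account for several isomeric forms.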

Comparative analysis of cp genome organization
We compared 12 cp genomes ranging from green algae to angiosperms (Table S6). The GC contents of cp genomes are lowest in lower plants (Charophyta and Bryophyta) and highest in Cycads. Monocots seem to have slightly higher cp GC contents than dicots. Genome size and structure also differ among these cp genomes. For example, C. vulgaris (184,933 bp; Charophyta) has the largest genome, while the smallest genome is found in the lower plant Marchantia polymorpha (121,024 bp). The genome size of angiosperms is more stable than that of lower plants, with dicots larger than monocots. Compared to lower plants, the most variable portions of angiosperm cp genomes are the percentages of the IR (34% in A. thaliana) and LSC (54.5% in A. thaliana) regions. This is the result of IR expansion into the LSC region from green algae to angiosperms [4].
The cp genome contains genes that encode structural and functional components of the organelle. Although some genes and gene clusters are well conserved among all plants, the overall structure of cp genomes shows remarkable differences (Table S6). First, there are 63 core protein-coding genes shared among all plants, whereas 3 additional core genes (chlB, chlL and ycf12) are found only in the lower plant lineage (Charophyta, Bryophyta, and Cycads). The 63 core cp genes are involved in photosynthesis, energy metabolism, and other housekeeping functions. Second, 10 genes (psaM, rpl5, rpl12, rpl19, tufA, ycf20, ycf62, ycf66, odpB, and ftsH) are unique to green algae. All of them reside in the LSC region except ycf20, which is duplicated in the IR regions. In contrast to seed plants, only one gene, ribosomal protein L21 (rpl21), is conserved in both green algae and liverwort. Third, all four ribosomal RNA genes (rrn4.5, rrn5, rrn16, and rrn23) are present as 2 copies in the IR regions, except that rrn4.5 is lost in Charophyta. Fourth, gene loss and transfer to the nucleus is a common feature of cp genomes [36]. We detected 3 genes (petL, petN, and ycf3) that are lost at the base of the Bryophyta lineage and 2 genes (accD and ycf2) that are lost in monocots as compared to dicots. There are also some species-specific gene loss events, such as the loss of psaJ in O. sativa and of nadJ-ccsA in O. europaea [20]. The unique loss of psbZ in the LSC region testifies to the convergent evolution of B. hygrometrica and O. europaea.
The order of cp genes in plants is not constant, changing among different regions of the genome as large gene clusters become rare. Among the 63 core protein-coding genes, 50 always reside in the LSC region, 5 (psaC, ndhD, ndhE, ndhG, and ndhI) in the SSC region, and 8 (ndhA, ndhB, ndhF, ndhH, rpl2, rpl23, rpl32 and rps7) in variable positions among the 12 examined cp genomes. No conserved protein-coding genes are consistently found in the IR regions. These mobile genes may serve as lineage markers, since 4 of them (ndhB, rpl2, rpl23, and rps7) are located in the LSC region in lower plants and have migrated to the IR regions in higher plants. The gene residing at the LSC/IRA or IRB/LSC boundary is usually ribosomal protein S12 (rps12) in higher plants, and the position-conserved ycf1 is more likely to be present at the IRA/SSC and SSC/IRB boundaries in dicots.

Comparative analysis of plant mt genomes
Plant mt genomes are exceptionally variable in size, structure, and sequence content, and the accumulation of repetitive sequences contributes the most to such variation [31]. From the feature comparison of 15 plant mt genomes (Table 7), we noticed that their genome sizes vary from 67,737 bp in C. vulgaris to 773,279 bp in V. vinifera. Recently, a larger mt genome of 982,833 bp has been reported in Cucurbita pepo [10]. The GC contents of these genomes also vary, from 40 to 47%. There is a massive difference in the proportion of coding sequence between lower and higher plants. The coding sequence in C. vulgaris is 90.7%, whereas it is 4.94% in Tripsacum dactyloides. Repeat content ranges from 1 to 41% among seed plants and is smallest in B. hygrometrica, at only 1.45% of the genome. Both large (>1,000 nt) and small (<50 nt) repeats affect recombination in seed plants [7,31,37]. The numbers of protein-coding genes and tRNAs in mt genomes also vary widely because of the large number of function-unknown proteins or ORFs in mt genomes and frequent plastid DNA insertions, including cp tRNA genes [26,38].
We also carefully examined conserved genes in different plant lineages (Table S4). First, there are 14 conserved core protein-coding genes shared among all lineages, including seven subunits of NADH dehydrogenase (Complex I), one subunit of ubiquinol cytochrome c reductase (Complex III), three subunits of cytochrome c oxidase (Complex IV), and three subunits of ATP synthase (Complex V). All these genes play important roles either in proton movement across the inner membrane of the mitochondrion or in electron transfer reactions in the respiratory chain. However, the gene structures are not conserved among them, and only two genes (nad4 and cox2) have exon-intron structures in all mt genomes. For comparison, there are 9 genes (nad3, nad4L, nad6, nad9, cob, cox3, atp1, atp9, and ccmFN) without exon-intron structure in both seed and early land plants and with exon-intron structure in at least one lower plant. Intron structure in mt genes is common, as we detected only 6 genes (sdh3, sdh4, atp4, atp8, ccmB, and ccmFN) that have no introns among all plants. Second, gene loss is more frequent in dicots than in monocots, as genes in cytochrome c biogenesis are lost in both B. vulgaris and A. thaliana. Three species (Nicotiana tabacum, V. vinifera and B. hygrometrica) gained sdh3, as observed in this study. The number of ribosomal protein genes differs among plant mt genomes. The largest number of ribosomal protein genes (23) is present in the V. vinifera genome. In contrast to higher plants, no matR is detected in liverwort and green algae in this study. However, it is reported that the mosses Takakia and Sphagnum have part of matR [35,39]. Most plant mt genomes have 3 ribosomal RNAs (rrn5, rrn18, and rrn26), but multiple copies are found in angiosperms (such as T. aestivum and B. vulgaris). Copy number-variable mt genes are reported in wheat, rice, and maize [7].
In summary, because the coding fraction is much lower in mt genomes than in cp genomes, even conserved genes vary in content, structure, and intron positioning [30].

Plastid DNA insertions in mt genome
One of the important events in determining mt genome size in angiosperms is the frequent capture of sequences from the cp genome [10,26,28,40]. A recent study demonstrates that frequent DNA transfer from cp to mt genomes occurred as far back as the common ancestor of the extant gymnosperms and angiosperms, about 300 MYA, and that the frequency of cp-derived sequence transfer is positively correlated with variations in mt genome size [41]. For instance, the B. hygrometrica mt genome contains fragments of cp origin ranging from 50 to 5,146 bp in length (Table S7). The total length of cp-derived sequences present in the Bh mt genome is 53,440 bp, 10.5% of the whole mt genome. Most of these insertions are conserved, as evidenced by the observation that 45 out of 80 insertions (over 50 bp) are identified in mt genomes of

tRNAs transfer between cp and mt genomes
To investigate whether the mt genome encodes a full set of tRNA species necessary for protein synthesis in the organelle, we identified 28 tRNA genes from the complete Bh assembly based on tRNA structures and found that all 61 codons are used by Bh mitochondria (Table S3). However, the tRNA genes encoded by the mt genome alone are not sufficient to decode all codons; for instance, trnA is missing in Bh, which suggests that the role of the missing tRNA is supplied by either the cp or the nuclear genome [22,26,41]. tRNAs that originated from plastids are called cp-derived tRNAs and their counterparts are native mt tRNAs. Half of the 28 mt tRNAs in B. hygrometrica are identified as cp-derived tRNAs (Table S8), and 19 amino acids are encoded by only one codon except for leucine (UAA and CAA) and serine (GCU, UGA, and GGA).
In contrast to the protein-coding genes, mt tRNA genes appear to have been constantly transferred from cp genomes during the evolution of angiosperms (Figure 4 and Table S8), and the proportion of cp-derived tRNAs in mt genomes increases from 8% in Charophyta to 55% in dicotyledonous plants. There are 17 cp-derived tRNAs and 14 mt native tRNAs in the mt genome of V. vinifera, which has the most cp-derived tRNAs among dicotyledonous plants. Seven mt-native tRNA genes (trnD-GUC, trnE-UUC, trnI-CAU, trnK-UUU, trnM-CAU, trnS-GCU, trnS-UGA and trnW-GUA) and one cp-derived tRNA gene (trnF-GAA) are common to all 15 species. Compared to the mt native tRNA genes in lower plants, there are three tRNA genes (trnH-GUG, trnN-GUU, and trnW-GUA) integrated as part of large cp genomic fragments into mt genomes among angiosperms [26]. This indicates that frequent DNA transfer from cp to mt genomes occurred as far back as the emergence of seed plants [41]. We detected two different tRNA transfer events in seed plants. One is the trnC-GCA transfer in monocots and the other involves two (trnD-GUC and trnQ-UUG) gene transfer events in dicots. Cp-derived tRNA genes that replace their mt counterparts were identified in all sequenced angiosperms, and even in the mtDNA of the gymnosperm Cycas. However, these replacements have not occurred in Marchantia, Reclinomonas, Cyanidioschyzon, Nephroselmis, Chara, and Physcomitrella [43]. The mt-native tRNA gene trnG-GCC has been lost in all monocots, and six mt-native genes (trnA-UGC, trnG-UCC, trnL-UAG, trnR-UCU, trnR-ACG, and trnT-GGU) are mostly lost in all angiosperms.

Gene gain and loss in plant organelle genome
Starting from the Bh organellar genomes, we have systematically analyzed representative cp and mt genomes of various lineages, and our results provide information for a better understanding of organellar genome evolution and function. Sequence-based phylogenetic analysis clearly supports the conclusion that Bh is very close to V. vinifera.
Structural dynamics of the Bh mt genome suggest that the multipartite structures may have started during the evolution of seed plants [30]. However, mechanisms for rapid mt genome rearrangement and expansion among plant lineages remain enigmatic. Based on eleven known cp and mt genomes of different lineages, we showed a strong relationship between the changing organellar genomes among angiosperms, and some of the lineage-associated gene gain and loss may provide excellent markers for phylogenetic studies (Figure 5). For instance, 9 cp and 4 mt genes are lost during the evolution from green algae to lower land plants. In our study, monocots seem to have a faster rate of organellar genome evolution than dicots, because 3 cp and 9 mt genes are lost in monocots and only 2 mt genes are lost in dicots. In addition, gene structures and positioning in cp and mt genomes are also very informative for the understanding of land plant evolution. In agreement with the results of several previous studies, most of the angiosperm sequences transferred from cp to mt genomes become degenerated and are regarded as junk sequences, whereas some of the cp-derived tRNAs are still functional in mt genomes [26,28,41]. As more plant organellar genome sequences become available, the details and mechanisms of plant organellar genome evolution will be unveiled.

Genome sequencing and assembly
We developed an efficient procedure for Bh organellar genome sequencing and assembly using whole-genome data from the 454 GS FLX sequencing platform [15]. Briefly, we collected fresh leaves and extracted genomic DNA for 454 GS FLX sequencing (see the GS FLX Titanium manuals for details). To validate the genome assembly and to confirm the assembly of the master circle (MC), we constructed two mate-pair libraries (2×50 bp) for the SOLiD 4.0 sequencing platform with insert sizes of 1-2 kb and 3-4 kb by following the SOLiD Library Preparation Guide. The method for assembling the organellar genomes was based on the correlation between contig read depth and copy number in the genome [44]. We first filtered cp reads from the raw data according to known plant cp genome sequences and then assembled the "clean" reads into the major segments of the cp genome: the large-single-copy (LSC), small-single-copy (SSC) and inverted repeat (IR) regions. The mt genome assembly is more complicated than that of the cp genome. We selected the contigs that include conserved mt genes (such as NADH dehydrogenase and succinate dehydrogenase) and removed contaminating cp sequences. The gene-based method for assembling mt genomes has been reported earlier [7]. By mapping all the SOLiD mate-pair reads to the mt contigs with BioScope, we obtained the major contig relationship map in the repeat regions to assemble the MC.
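The idea of separating organellar from nuclear contigs by read depth can be sketched as follows. This is only an illustration under stated assumptions: it takes the median contig depth as a proxy for nuclear copy number and uses made-up fold-over-median cutoffs (cp_ratio, mt_ratio). The actual thresholds and procedure are those described in [15] and [44], not the ones shown here.

```python
from statistics import median


def contig_depth(read_bases, contig_length):
    """Mean per-base read depth of a contig."""
    return read_bases / contig_length


def classify_by_depth(contigs, cp_ratio=50.0, mt_ratio=5.0):
    """Label contigs by their depth relative to the median contig depth.

    contigs: dict mapping contig name -> (total read bases mapped, length in bp)
    cp_ratio, mt_ratio: hypothetical fold-over-median cutoffs (assumptions).
    """
    depths = {name: contig_depth(b, n) for name, (b, n) in contigs.items()}
    baseline = median(depths.values())
    labels = {}
    for name, depth in depths.items():
        ratio = depth / baseline
        if ratio >= cp_ratio:
            labels[name] = "cp-like"
        elif ratio >= mt_ratio:
            labels[name] = "mt-like"
        else:
            labels[name] = "nuclear-like"
    return labels


# Toy data: (read bases, contig length); not taken from the Bh assembly.
toy = {
    "c1": (10_000, 1_000),
    "c2": (12_000, 1_000),
    "c3": (9_000, 1_000),
    "c4": (200_000, 2_000),
    "c5": (3_000_000, 3_000),
}
labels = classify_by_depth(toy)
print(labels)
```

A depth-only classification is a first pass at best; in the workflow described above it is combined with gene-based filtering and mate-pair evidence.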

Genome annotation
The cp genome was annotated by using the program DOGMA (Dual Organellar GenoMe Annotator) [45] coupled with manual corrections for start and stop codons. Protein-coding genes were identified by using the plastid/bacterial genetic code. Codon usage was predicted by using CodonW (http://codonw.sourceforge.net/). We constructed a custom-designed amino acid database for protein-coding genes and nucleotide databases for rRNA and tRNA genes, compiled from all previously annotated plant mt genomes available at the organelle genomic biology website at NCBI (http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html). NCBI BlastX and BlastN searches of the mt genome against these databases allowed us to find protein and RNA genes, respectively. All BlastN and BlastX searches were carried out by using the default settings with an E-value cutoff of 1e-10. Putative RNA editing sites were inferred to create proper start and stop codons as well as to remove internal stop codons. We also used tRNAscan-SE [46] to corroborate tRNA boundaries identified by BlastN. The annotated GenBank files of the cp and mt genomes of Bh were used to draw gene maps with the OrganellarGenomeDRAW tool (OGDRAW) [47]. The maps were then examined for further comparison of gene order and content.
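Stop-codon usage of the kind reported in the Results can be tallied directly from annotated coding sequences. The sketch below is not CodonW and does not account for RNA editing; it simply counts the terminal codon of each sequence, and the toy sequences are invented for illustration.

```python
from collections import Counter

STOPS = ("TAA", "TAG", "TGA")


def stop_codon_usage(cds_list):
    """Fraction of coding sequences ending in each standard stop codon."""
    counts = Counter()
    for cds in cds_list:
        # Accept RNA or DNA notation; look only at the last three bases.
        last = cds.upper().replace("U", "T")[-3:]
        if last in STOPS:
            counts[last] += 1
    total = sum(counts.values())
    if total == 0:
        return {stop: 0.0 for stop in STOPS}
    return {stop: counts[stop] / total for stop in STOPS}


# Invented example sequences, not Bh genes.
toy_cds = ["ATGAAATAA", "ATGCCCTAG", "ATGGGGTAA", "ATGTTTTGA", "ATGCATTAA"]
usage = stop_codon_usage(toy_cds)
for stop, frac in usage.items():
    print(f"{stop}: {frac:.0%}")
```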

Analyses on SNPs, repeats, and cp-derived sequences
We identified intra-specific SNPs in both cp and mt genomes. Using BioScope, we mapped two runs of SOLiD mate-pair reads to both cp and mt genomes (BioScope Software User Guide). We carried out repeat sequence analysis using the REPuter web-based interface (http://bibiserv.techfak.uni-bielefeld.de/reputer/) [48], including forward, palindromic, reverse and complemented repeats with a minimal length of 50 bp. Transposable elements and other repeated elements were mapped with the RepeatMasker Web Server (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker) running under the cross_match search engine. Cp-derived sequences were identified with a BlastN search of the mt genome against the annotated Bh cp genome (identity ≥80%, E-value ≤1e-5, and length ≥50 bp). The cp-derived sequences were then aligned to all known plant mt genomes by using BlastN (identity ≥80%, E-value ≤1e-5, and coverage ≥50%). tRNAs transferred to the mt genome were identified by aligning to all tRNAs in the cp genome of the same species by using BlastN (identity ≥80%, E-value ≤1e-5, and coverage ≥50%).
We aligned amino acid sequences from individual genes using MUSCLE v3.8.31 [49], removed ambiguously aligned regions in each alignment using GBLOCKS 0.91b [50], and concatenated the aligned sequences. We used the maximum likelihood method implemented in PhyML v3.0 [51] under the Jones-Taylor-Thornton (JTT) model of sequence evolution, with a gamma distribution of rates across sites with four categories, to construct phylogenetic trees. Confidence of branch points was estimated based on 100 bootstrap replications. We obtained the best tree after heuristic search with the help of Modelgenerator [52].
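The cutoffs used for cp-derived sequences can be applied to BLAST tabular output as in the sketch below. It assumes the standard 12-column tabular format (BLAST+ -outfmt 6: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) with the mt genome as query. Merging overlapping hits before summing is an assumption of this sketch; the text does not state how overlapping hits were handled.

```python
def parse_hits(lines, min_identity=80.0, max_evalue=1e-5, min_length=50):
    """Keep hits from BLAST tabular output that pass the cutoffs.

    Returns (start, end) intervals on the query, 1-based and inclusive.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        pident, length = float(fields[2]), int(fields[3])
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if pident >= min_identity and evalue <= max_evalue and length >= min_length:
            kept.append((min(qstart, qend), max(qstart, qend)))
    return kept


def merged_length(intervals):
    """Total number of bases covered after merging overlapping intervals."""
    covered = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered


# Invented example rows; identifiers and coordinates are not real Bh data.
toy_rows = [
    "mt\tcp\t95.0\t100\t5\t0\t1\t100\t500\t599\t1e-30\t150",
    "mt\tcp\t90.0\t80\t8\t0\t51\t130\t700\t779\t1e-20\t100",
    "mt\tcp\t75.0\t200\t50\t0\t300\t499\t900\t1099\t1e-10\t90",  # identity too low
    "mt\tcp\t99.0\t40\t0\t0\t600\t639\t10\t49\t1e-8\t70",        # too short
]
hits = parse_hits(toy_rows)
covered = merged_length(hits)
print(hits, covered)
```

Dividing the covered length by the mt genome length gives the cp-derived fraction; with the figures reported in the Results, 53,440 bp out of 510,519 bp is about 10.5%.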
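The concatenation step that precedes tree building can be sketched in a few lines. The example below joins per-gene alignments into a single matrix and pads taxa that lack a gene with gap characters; that padding choice and the function name are assumptions of this sketch, and the alignment, trimming, and tree inference themselves were done with MUSCLE, GBLOCKS, and PhyML as described above.

```python
def concatenate_alignments(gene_alignments, taxa, gap="-"):
    """Join per-gene alignments into one concatenated alignment.

    gene_alignments: list of dicts mapping taxon -> aligned sequence.
    taxa: ordered list of taxon names to include.
    A taxon missing from a gene is padded with gap characters (assumption).
    """
    parts = {taxon: [] for taxon in taxa}
    for alignment in gene_alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(alignment.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Invented toy alignments; 'Bh' and 'Vv' are placeholder taxon labels.
gene1 = {"Bh": "MKT-", "Vv": "MKTA"}
gene2 = {"Bh": "LLS"}
supermatrix = concatenate_alignments([gene1, gene2], ["Bh", "Vv"])
print(supermatrix)
```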