Chloroplast genome features of an important medicinal and edible plant: Houttuynia cordata (Saururaceae)

Houttuynia cordata (Saururaceae), an ancient and relic species, has been used as an important medicinal and edible plant in most parts of Asia. However, because of the lack of genome information and reliable molecular markers, studies on its population structure, or phylogenetic relationships with other related species are still rare. Here, we de novo assembled the complete chloroplast (cp) genome of H. cordata using the integration of the long PacBio and short Illumina reads. The cp genome of H. cordata showed a typical quadripartite cycle of 160,226 bp. This included a pair of inverted repeats (IRa and IRb) of 26,853 bp, separated by a large single-copy (LSC) region of 88,180 bp and a small single-copy (SSC) region of 18,340 bp. A total of 112 unique genes, including 79 protein-coding genes, 29 tRNA genes, and four rRNA genes, were identified in this cp genome. Eighty-one genes were located on the LSC region, 13 genes were located on the SSC region, and 17 two-copy genes were located on the IR region. Additionally, 48 repeat sequences and 86 SSR loci, which can be used as genomic markers for population structure analysis, were also detected. Phylogenetic analysis using 21 cp genomes of the Piperales family demonstrated that H. cordata had a close relationship with the species within the Aristolochia genus. Moreover, the results of mVISTA analysis and comparisons of IR regions demonstrated that the cp genome of H. cordata was conserved with that of the Aristolochia species. Our results provide valuable information for analyzing the genetic diversity and population structure of H. cordata, which can contribute to further its genetic improvement and breeding.


Introduction
Piperales, which show discrete vascular bundles in the stem and threefold flower parts [1], are placed in the Magnoliids clade, which is an early evolutionary branch in the angiosperm tree [2]. The category of Piperales has changed in different ways over time. According to the recent APG IV system, Piperales includes three families (Piperaceae, Aristolochiaceae and a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 Saururaceae), 17 genera, and more than 4,000 species, most of which belong to the Piperaceae family [3]. Saururaceae is the smallest family in Piperales, including only four genera and six species, most of which are aromatic herbs with creeping rhizomes [3][4][5].
Houttuynia cordata, the single member of the Houttuynia genus in the Saururaceae family, is widely used as a leafy vegetable and medicinal herb throughout much of Asia [6]. Its characteristic extract, houttuynin, has been proven to have diverse pharmacological effects including anticestodal [7], antibacterial [8], and antiviral activity [9]. Moreover, extracts of H. cordata have had important functions in improving the immune system of patients with severe acute respiratory syndrome (SARS) [10]. Although H. cordata is the only species in the Houttuynia genus of the Saururaceae family, its populations from different regions vary widely at chromosome numbers (2n = 24-128) and polyploidy levels [6,11]. This probably resulted from the prevailing cytomixis and meiotic abnormalities that occur during microsporogenesis [11]. Almost all previous studies on this plant have focused mainly on its physiological and biochemical properties [7,9,10,12,13], whereas few studies were aimed at deciphering the genetic diversity, population structure, and taxonomic status of H. cordata in Piperales. This is mostly likely due to the limited genome information and development of markers.
Chloroplasts have their own circular genome and play a vital role in photosynthesis, physiology and development in most plants. For most angiosperms, chloroplast (cp) genomes typically range in size from 120 to 170 kilobase pairs (kb) [14]. Compared to the nuclear genome, the cp genome is more conserved in terms of gene size and content, genome structure, and linear order of the genes [15]. Generally, the cp genome has a quadripartite cycle, comprising a pair of inverted repeat regions (IRA and IRB) that are separated by one large single-copy (LSC) region and one small single-copy (SSC) region. Previous studies have demonstrated that, compared to the nuclear genome, fewer substitutions of nucleotides and rearrangements of genome structure occur in the cp genome [16,17], making it an ideal model to decipher genome evolution and phylogenetic relationships in complex angiosperm families [18,19]. The development of next-generation sequencing technologies has provided highly efficient, low-cost DNA sequencing platforms that produce large volumes of short reads [20]. Moreover, the third-generation sequencing technology, PacBio sequencing platform, otherwise called single-molecule teal-time (SMRT) sequencing technology, generates long reads with lengths of up to 30 kb that can easily close the gaps in current reference assemblies through extended repetitive regions. Already, thousands of cp genomes have been completely revealed. Recently, two short communications have reported the length and gene contents of cp genomes in H. cordata [21,22]. However, a comprehensive analysis of this cp genome has not yet been performed.
In the present study, we de novo assembled the complete cp genome of H. cordata, a quadripartite cycle of 160,226 bp, through a combination of PacBio and Illumina sequencing platforms. The rates of synonymous (Ks) and non-synonymous (Ka) substitutions for shared common genes between H. cordata and related species were also calculated. Phylogenetic analysis using 21 Piperales species demonstrated that H. cordata had a close relationship with the species within the Aristolochia genus. Our results provide valuable information for future research in genetic improvement, population genetics, and population diversity of H. cordata.
with minor modifications. Briefly, 5 g fresh leaves excluding petioles and veins were homogenized in 400 mM sorbitol, 5 mM MgCl 2 , 5 mM MnCl 2 , 2 mM EDTA, 10 mM NaHCO3, 0.5% (w/v) bovine serum albumin (BSA), 5 mM ascorbate, and 20 mM Tricine-NaOH (pH 8.4). After centrifugation at 3000 g for 5 min, the pellet was gently suspended in 400 mM sorbitol, 5 mM MgCl 2 , 2.5 mM EDTA, and 50 mM 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid (HEPES)-KOH (pH 7.6). The intact chloroplasts were further purified using 40% (v/v) Percoll (Zhuoyue, Beijing Baiaolaibo, China, http://www.baiaolaibo.com/). Isolated intact chloroplasts were suspended in 3 mM MgCl 2 and 25 mM HEPES-KOH (pH 7.6), followed by centrifugation at 10,000 g for 3 min. Then, the cp DNA was isolated from the chloroplasts using the DNeasy Plant Mini Kit (Qiagen, USA) according to the manufacturer's instructions. After assessing the quality and quantity of the cp DNA, one DNA library with 20 kb insertion using the PacBio Sequel platform (PacBio, USA) and one DNA library with 450 bp insertion using the Illumina HiSeq X Ten platform (Illumina, USA) were sequenced based on the manufacturer's instructions. All the raw data, including short Illumina reads and long PacBio reads used in this study are available at the BIG Sub (http://bigd.big.ac.cn), under the accession number: CRA002843.

Genome assembly and gene annotation
The 150 bp paired-end reads were generated by the Illumina HiSeq platform. After removing sequencing adapters and low-quality reads with QC values less than 20%, the clean reads were firstly compared to the complete cp genome of the related Aristolochia tubiflora (Gen-Bank accession: NC_041455) using BLASTn software [24] (E-value: 10-6) to select cp genome sequences. These selected Illumina reads were then de novo assembled into scaffolds using SOAP denovo v2.04 (http://soap.genomics.org.cn/soapdenovo.html) with a K-mer value of 51. Low-quality PacBio reads (minimum read length of 500 bp and minimum read quality of 0.80), were removed from the raw data. Then the filtered PacBio reads were errorcorrected to remove single nucleotide insertion/deletions using Illumina short reads by the PacBioToCA module of the Celera Assembler with default parameter settings (-length 500, -partitions 200) [25]. Afterwards, the corrected PacBio reads were used to gap-fill the scaffolds by PBjelly (https://sourceforge.net/projects/pb-jelly/) [26] with all PacBio reads >8kb to generate a circular cp genome map. Frame-shift errors were manually corrected during gene prediction.

Codon usage analysis
Codon usage bias is a universal feature of all genomes and has been proposed to regulate translation dynamics such as translation efficiency and accuracy, as well as protein folding [31]. To further analyze H. cordata cp genome evolution, the CodonW1.4.2 program (http:// downloads.fyxm.net/CodonW-76666.html) was employed to calculate the synonymous codon usage of protein-coding genes with default settings.

Whole cp genome sequence comparisons
To provide comprehensive knowledge of cp sequence divergence, the H. cordata cp genome was compared to four cp genomes from the Aristolochia genus. The divergence of the LSC/ IRB/SSC/IRA boundary regions among these related species was visualized by IRscope (https://irscope.shinyapps.io/irapp/) based on the annotations of their available cp genomes in GenBank. Additionally, the mVISTA program (http://genome.lbl.gov/vista/mvista/submit. shtml) was also used to compare the whole cp genome divergence among these related species.

Synonymous and nonsynonymous substitution rate calculations
To determine synonymous (Ks) and non-synonymous (Ka) substitution rates, we performed pairwise comparisons of the 79 protein-coding genes between H. cordata cp genome and four close Aristolochia species. Pairwise alignments of the common genes among species were carried out by MAFFT [34], and the Ka/Ks ratios were calculated with KaKs_calculator 2.0 [35] using the default parameters for plant plastid code.

Chloroplast genome assembly and genome features
The Illumina sequencing platform gave rise to 3,763 Mb of raw data. After trimming, 3,390 Mb of clean reads with a Q20 value of 95.59% was obtained. The PacBio platform generated 137,091 subreads with an average length of 4,871 bp and an N50 length of 7,337 bp (S2 Table). Both Illumina reads and PacBio subreads were used to construct the complete cp genomes of H. cordata.
The H. cordata cp genome showed a typical quadripartite cycle of 160,226 bp, comprising a pair of inverted repeats (IRa and IRb) of 26,853 bp, which were divided by a large single-copy (LSC) region of 88,180 bp and a small single-copy (SSC) region of 18,340 bp (Table 1). Regarding GC content, the IR regions showed the highest GC content of 43.03%, followed by 36.81% in the LSC region, whereas the SSC region exhibited the lowest GC content of 32.15%. The overall GC content of the cp genome was 38.36%. In total, 112 unique genes were identified in the H. cordata cp genome, including 79 protein-coding genes, 29 tRNA genes, and four rRNA genes (Table 1). Furthermore, out of 112 unique genes, 82 and 13 genes were found in the LSC and SSC regions, respectively; while 17 genes, including six protein-coding genes, seven tRNAs, and four rRNAs were duplicated in the IR regions (Fig 1). Among the 112 unique genes, 18 genes (comprosing 11 protein-coding genes and seven tRNA genes) had one intron, and only two genes (clpP and ycf3) harbored two introns ( Table 2).

Repeat sequence and Simple Sequence Repeat (SSR) detection
In this study, a total of 48 repeat sequences with lengths ranging from 30 bp to 69 bp were detected, including 27 forward repeats and 21 palindromic repeats, whereas no reverse or complement repeats were identified (S3 Table ; Fig 2A). Among these repeats, 14 were 30-39 bp in length, 11 were 40-49 bp, 14 were 50-59 bp, and nine were 60-69 bp (Fig 2A). All the repeats were found within seven protein-coding genes (ccsA, petA, petN, psaB, rpl32, ycf2, and ycf3) (Fig 2B). Out of 30 repeats (62.5%), 15 forward repeats and 15 palindromic repeats, were contributed by ycf2, which is essentially a pseudogene. Most paired repeats (36 repeats, 75%) were located in the same genes; however, 12 repeats, that were all forward types, were seen in different genes (S3 Table).
In total, 86 SSR loci of 17 different types with a length of at least 10 bp were also detected by MISA (Table 3). Among the SSR loci, mononucleotide repeats were the most abundant with 68 SSR motifs (79.07%) of only two types (A/T). Eleven tetranucleotide repeats representing 10 different types and four trinucleotide repeats representing three different types (TTA/TAT/ ATA) were identified. The dinucleotide repeats were only TA type with two motifs, and the pentanucleotide repeats (TCTTT) was observed only once. No hexanucleotide repeats were observed, and the longest SSR was a tetranucleotide repeat (TTTA) of 24 bp in length.

Codon usage analysis
The coding sequence of 79 protein-coding genes gave rise to 22,816 codons. Among these, the leucine codons were the most biased, with a frequency of 10.36%, whereas the cysteine codons had the lowest usage frequency of only 1.11% (S4 Table; Fig 3A). To gain knowledge of synonymous codon usage bias of H. cordata cp genome, we also calculated the relative synonymous codon usage (RSCU) value. The results showed that the RSCU values of 31 codons were greater than 1, indicating these codons were preferentially used. Among these preferential codons, the majority of codons ended with A (13 of 31) or U (16 of 31) except UUG and UCC (S4 Table; Fig 3B).

Phylogenetic analysis and whole cp genome sequence comparisons
To analyze phylogenetic relationships among the Piperales family, we downloaded 20 complete cp genomes covering four genera (S1 Table) from NCBI to construct the phylogenetic trees. To reduce data redundancy, 75 homologous CDs of all 21 cp genomes were used to generate phylogenetic tree by the Maximum Likelihood method with 1000 bootstrap replicates. The phylogenetic tree generated 19 nodes, and Asarum sieboldii (NC_037190) formed the outgroup. Out of 19 nodes, 18 nodes bootstrap values were �92%. Generally, the phylogenetic tree showed that each genus (Aristolochia, Piper, and Chloranthus) constituted a monophyletic group. H. cordata together with the species of Aristolochia formed a subgroup with bootstrap values of 100%, indicating that they shared a more closer relationship (Fig 4). Additionally, mVISTA analysis was performed to evaluate divergence in the genome sequences between H. cordata and Aristolochia species with reference to the annotation of H. cordata. The results revealed that all selected cp genomes showed generally high similarity (>85%). However, several minor inserted regions and one large inserted region beyond 1 kb were observed (Fig 5), indicating that the H. cordata cp genome underwent evolutionary divergence. Overall, the coding regions were more conserved than the non-coding regions and the obviously divergent genes were ycf2, rpl14, rpl19, atpH, and rpl22.
To gain a more comprehensive comparison, the IR regions that were believed to contribute to the size variation of the cp genome were also compared among these five species (Fig 6). The results showed that, compared to IRB/SSC and SSC/IRA junctions, the distribution of the genes in the LSC/IRB and IRA/LSC border regions were relatively conserved. The rps19 and rpl2, and trnH genes were distributed at the LSC/IRB and IRA/LSC junction in all five cp genomes, differing only in the distance of three genes to the junction. For SSC/IRA boundaries, the ycf1 gene located on the SSC region had a 169 bp, 171 bp, and 1431 bp extension to the IRA region in A. contorta, A. debilis, and H. cordata, respectively. However, a 26 bp and 36 bp distance of the ycf1 gene to the boundary were found in A. tagala and A. tubiflora, respectively. For IRB/SSC boundaries, the ndhF gene located on the SSC region was either near or

Synonymous (Ks) and non-synonymous (Ka) substitutions rate analysis
The Ka/Ks ratio has been considered as well-recognized marker for assessing genome evolution and selection pressure affecting genes [36,37]. The Ka/Ks ratio of the pairwise common 79 protein-coding genes between H. cordata cp genome and four related Aristolochia species was calculated (S5 Table). All Ka/Ks values of common genes were below 1 in all four comparisons. Of 79 homologous CDs, 76 CDs (96.20%) had a Ka/Ks value below 0.5 in all tested comparisons, and only the ycf2 gene always had a Ka/Ks value above 0.5. Overall, the average Ka/ Ks value of all common genes was 0.149. Additionally, 31 common genes (44.93%) had an average Ka/Ks value below 0.1, suggesting that these genes have undergone strong purifying selection pressures in H. cordata cp genome.

Discussion
H. cordata, known for its use as a vegetable and medicinal for functions, is a member of the Saururaceae family, which is believed to be an ancient and relic family [38]. The Saururaceae, Piperaceae, and Aristolochiaceae families together form the Piperales, which is an early diverging lineage [3,38]. Because of the lack of a related cp genome, we employed the combination of PacBio and Illumina sequencing platforms, an effective strategy for the assembly of cp genome without a reference [39,40] to de novo assemble the H. cordata cp genome. The complete cp genome of H. cordata reported here showed a typical quadripartite cycle of 160,226 bp in length, which was consistent with a previous report of cp genome of H. cordata [22]. However, unlike the previous study, we found one more protein-coding gene and one less tRNA were obtained in the present study. Another short communication also reported a complete cp genome of H. cordata 161,090 bp in length, harboring 81 protein-coding genes [21]. These studies indicate that the cp genome of H. cordata has undergone divergence whether it is more or less. Because the two reports could not provide detailed information on gene contents and genome information, the divergent hot spots in cp genome of H. cordata could not be detected. We also noticed that the cp genome of H. cordata obtained in the present study was comparable with that of published Aristolochia (Aristolochiaceae) species (159,308-160,520 bp; S1 Table) [41,42], Piper (Piperaceae) species (159,909-161,721 bp) and Saruma henryi (Aristolochiaceae) (159,914 bp). This finding indicates that cp genomes of the Piperales species were conserved in length. However, the cp genome of H. cordata was slightly larger than that of Passiflora edulis (151,406 bp) at length [43]. Moreover, the gene content between H. cordata and P. edulis cp genome was much more divergent, where one genes (trnE-UUC) were missing in H. cordata cp genome, whereas six genes (including rpl20, rpl22, rps7, rps16, infA, and accD), were missing in P. edulis cp genome. The variable IR region and boundary construction in SSC/IR and LSC/IR have been considered as the main driving force for the length variation of angiosperm cp genomes [44]. As shown in Fig 6, the size of the IR region and the adjacent genes in the boundaries of H. cordata were similar to those of the four selected Aristolochia species. Only minor shifts of these adjacent genes occurred within the boundaries.
Repeat sequences are believed to play an important role in genome rearrangements and sequence variations through illegitimate recombination and slipped-strand mispairing [45,46]. Forty-eight repeat sequences of 30-69 bp were detected in H. cordata cp genome. Furthermore, among the reported Aristolochia species, 38-138 repeats were identified [41,42], suggesting that repeat sequences were variable between lineages, which can be used as genomic markers for phylogenetic analysis [42,47]. Intriguingly, the largely pseudogene, ycf2 contributed nearly two-thirds (30 of 48) of the repeats. Similar results were observed in the cp genomes of Haberlea rhodopensis, Vernicia fordii and Nasturtium officinale [40,48,49], indicating that ycf2 was the most variable in cp genome. A total of 86 SSR motifs of 17 different types were observed in H. cordata cp genome, fewer than those of the reported Aristolochia species (ranging from 95 to 156) [41,42]. Among these SSRs, mononucleotide repeats with the A/T type were the largest in number, similar to the results of previous studies [40-42, 50, 51]. This is most likely due to high proportions of polyadenine (polyA) and polythymine (polyT) in the cp genome. These results suggest that SSRs reshape the cp genome and are powerful tools for identifying the genetic diversity among different species.

Conclusion
Herein, the complete H. cordata cp genome was de novo assembled by integrating the Illumina and PacBio platforms. The H. cordata cp genome showed a typical quadripartite cycle of 160,226 bp, which comprised 79 protein-coding genes, 29 tRNA genes, and four rRNA genes. A total of 48 repeat sequences and 86 SSR loci were identified, which could be used for marker development as well as phylogenetic and population studies in H. cordata. Moreover, codon usage analysis revealed that the Leu codon ending with A/U was preferentially utilized. The phylogenetic tree of 21 Piperales species, constructed based on homologous protein-coding genes, demonstrated that H. cordata had a close relationship with Aristolochia species. In summary, this research lays a foundation for future phylogenetic studies on Piperales species and provides useful information for the genetic improvement and breeding of H. cordata.
Supporting information S1