Complete chloroplast genome sequence and phylogenetic analysis of Spathiphyllum 'Parrish'

Spathiphyllum is an important tropical genus grown as a small potted ornamental plant in South China, with an annual output value of hundreds of millions of yuan. In this study, we sequenced and analyzed the complete nucleotide sequence of the Spathiphyllum 'Parrish' chloroplast genome. The whole chloroplast genome is 168,493 bp in length, and includes a pair of inverted repeat (IR) regions (IRa and IRb, each 31,600 bp), separated by a small single-copy (SSC, 15,799 bp) region and a large single-copy (LSC, 89,494 bp) region. Our annotation revealed that the S. 'Parrish' chloroplast genome contained 132 genes, including 87 protein-coding genes, 37 transfer RNA genes, and 8 ribosomal RNA genes. In the repeat structure analysis, we detected 281 simple sequence repeats (SSRs) in the S. 'Parrish' chloroplast genome, comprising mononucleotides (223), dinucleotides (28), trinucleotides (12), tetranucleotides (11), pentanucleotides (6), and hexanucleotides (1). In addition, we identified 50 long repeats, comprising 18 forward repeats, 13 reverse repeats, 17 palindromic repeats, and 2 complementary repeats. Single nucleotide polymorphism (SNP) and insertion/deletion (indel) analyses of the S. 'Parrish' chloroplast genome relative to that of the related species S. cannifolium revealed 962 SNPs and 158 indels (90 insertions and 68 deletions). Phylogenetic analysis of five species found S. 'Parrish' to be more closely related to S. kochii than to S. cannifolium. This study identified the characteristics of the S. 'Parrish' chloroplast genome, which will facilitate species identification and phylogenetic analysis within the genus Spathiphyllum.


Introduction
Spathiphyllum is a genus of approximately 41 species [1] of monocotyledonous flowering plants in the family Araceae, and is one of the most popular ornamental plants. Members of this genus are evergreen herbaceous perennial plants with large leaves that are 12-65 cm long [...]
[...] quality value less than Q20); (4) the reads containing 10% Ns were removed; and (5) adapters and small segments less than 50 bp in length after quality trimming were excluded.
At the same time, we used another method, single-molecule real-time (SMRT) circular consensus sequencing, to obtain the whole chloroplast genome of S. 'Parrish', following the standard protocol provided with the PacBio platform (Biozeron, Shanghai, China). To obtain more accurate assembly results, the original sequencing data were processed by filtering out the following: (1) polymerase reads whose length was less than 100 bp; (2) polymerase reads with a quality less than 0.80; (3) subreads extracted from polymerase reads and adapter sequences; and (4) subreads whose length was less than 500 bp.

Chloroplast genome assembly and validation
First, SOAPdenovo (v2.04) [26] was used to preliminarily assemble the Illumina sequencing data. Second, the PacBio sequencing data were aligned using BLASR (San Diego, CA, USA) [27]. To reduce single-base errors and insertions/deletions (indels) in the long PacBio sequences, the data were corrected according to the alignment results. The PacBio raw reads were pre-processed by trimming the adapter sequences and removing low-quality reads (Q < 0.80), short reads (length < 100 bp) and short subreads (length < 500 bp). Finally, the PacBio clean data were used for the assembly.
NOVOPlasty (v2.7.2) software (https://github.com/ndierckx/NOVOPlasty) was used for chloroplast genome assembly. The S. kochii chloroplast genome was used as the reference genome for the assembly of S. 'Parrish' samples. The rbcL gene of the reference genome was used as a seed sequence. The other parameters were set to the defaults. Then, clean reads were compared with the scaffold obtained by assembly. The results were locally assembled and optimized by paired-end and overlap relations of reads. The gaps in the assembly results were repaired using GapCloser (v1.12, http://soap.genomics.org.cn/soapdenovo.html) software, with the default parameters. Finally, the reference genome was used to correct the location and direction of the four chloroplast partitions (LSC/IRa/SSC/IRb), and the initial position of the chloroplast assembly sequence was determined to obtain the final chloroplast genome sequence.

Gene annotation and codon usage
The protein-coding, transfer RNA (tRNA) and ribosomal RNA (rRNA) genes of the chloroplast genome of S. 'Parrish' were predicted by DOGMA (http://dogma.ccbb.utexas.edu/) [28] software. The parameters were set as follows: (1) genetic code for Blastx: 11; (2) percent identity cutoff for protein-coding genes: 60; (3) percent identity cutoff for RNAs: 60; and (4) COVE threshold for mitochondrial tRNAs: 20. Then, the redundancy in the initial genes predicted by DOGMA was eliminated. The ends of the genes and the exon/intron boundaries were manually corrected to obtain a high-accuracy gene set, using the protein-coding genes of the reference genome as a reference. Using the S. kochii chloroplast genome as the reference genome, the genome of S. 'Parrish' was assembled. Finally, OrganellarGenomeDRAW software (http://ogdraw.mpimp-golm.mpg.de/cgi-bin/ogdraw.pl) [29] was used to display a circle map.
The degree of codon preference can be measured by the relative frequency of a particular codon among the synonymous codons that encode the same amino acid. To obtain the codon preference value, relative synonymous codon usage (RSCU) was calculated by CUSP (EMBOSS v6.6.0.0) with default parameters.
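The RSCU calculation described above can be illustrated with a minimal Python sketch. This is not the CUSP implementation; it only applies the usual RSCU definition (observed codon count divided by the count expected if all synonymous codons for that amino acid were used equally). The function name `rscu` and the input format (a list of in-frame coding sequences) are assumptions made for illustration, and the codon table is the standard amino acid assignment, which genetic code 11 shares.

```python
from collections import Counter, defaultdict

# Standard codon-to-amino-acid assignments (shared by genetic code 11),
# built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds_sequences):
    """Return {codon: RSCU} for a list of in-frame coding sequences (hypothetical helper)."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper()
        # Walk the sequence codon by codon, ignoring a trailing partial codon.
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families by the amino acid they encode.
    families = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        families[amino_acid].append(codon)

    values = {}
    for codons in families.values():
        family_total = sum(counts[c] for c in codons)
        if family_total == 0:
            continue  # amino acid not observed; RSCU undefined
        for c in codons:
            # observed / expected, where expected = family_total / family size
            values[c] = counts[c] * len(codons) / family_total
    return values
```

An RSCU value above 1 means the codon is used more often than expected under equal usage of its synonyms, and a value below 1 means it is used less often.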

SSRs and long repeat structure
Microsatellite analysis of contig sequences was carried out with the MIcroSAtellite (MISA) identification tool [30]. The parameters (unit_size-min_repeats) were set to 1-10, 2-6, 3-5, 4-5, 5-5, and 6-5; that is, an SSR was called for 10 or more repeats of one base, 6 or more repeats of two bases, and 5 or more repeats of three, four, five or six bases. The minimum distance between two SSRs was set to 100 bp, so that two microsatellites separated by less than 100 bp were treated as one composite microsatellite. Finally, primers were designed for the SSR sequences by Primer3 (v.0.4.0, http://primer3.ut.ee).
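As an illustration of these thresholds, the following Python sketch scans a sequence for perfect repeats and groups neighbouring hits into composite microsatellites. It is a simplified stand-in for MISA, not its implementation; the function names (`find_ssrs`, `group_compound`) and the tuple layout are assumptions for illustration only.

```python
# Minimum repeat counts per motif length, as stated in the text.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}
MAX_GAP = 100  # hits closer than this (bp) are grouped as composite


def is_primitive(unit):
    """True if the motif is not itself a repeat of a shorter motif (e.g. 'AA')."""
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_ssrs(seq):
    """Return sorted (start, end, motif, repeat_count) tuples, 0-based, end-exclusive."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        i = 0
        while i + size <= len(seq):
            unit = seq[i:i + size]
            if set(unit) <= set("ACGT") and is_primitive(unit):
                n = 1
                # Count how many consecutive copies of the motif follow.
                while seq[i + n * size:i + (n + 1) * size] == unit:
                    n += 1
                if n >= min_rep:
                    hits.append((i, i + n * size, unit, n))
                    i += n * size
                    continue
            i += 1
    return sorted(hits)


def group_compound(hits, max_gap=MAX_GAP):
    """Group position-sorted hits whose separation is below max_gap."""
    groups = []
    for hit in hits:
        if groups and hit[0] - groups[-1][-1][1] < max_gap:
            groups[-1].append(hit)
        else:
            groups.append([hit])
    return groups
```

For example, `find_ssrs("GC" + "A" * 12 + "GC" + "AT" * 6 + "GC")` reports one mononucleotide and one dinucleotide repeat, and `group_compound` places them in a single composite group because they are only 2 bp apart.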

SNP and indel detection and annotation
Using MUMmer4 alignment software (Maryland, USA) [31], global alignment between each sample and reference sequence was carried out, the sites that differed between the sample sequence and the reference sequence were identified, and the potential SNP loci were detected through preliminary filtering. The 100-bp sequences on both sides of the reference sequence SNP loci were extracted, and the extracted sequence and assembly results were aligned using BLAT (v35, http://genome.ucsc.edu) software to verify the SNP loci [32]. If the length of the alignment was less than 101 bp, the unreliable SNP was removed; if the alignment was repeated many times, the SNP that was considered to be a duplicate was also removed, and finally, a reliable SNP was obtained.
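The verification step can be sketched as follows. This is only a schematic of the filtering logic, assuming the BLAT results have already been summarised as a list of alignment lengths per SNP; the helper names and the input format are hypothetical, and interpreting "repeated many times" as "more than one hit" is an assumption, since the text does not give the exact cutoff.

```python
def flank_window(reference, pos, flank=100):
    """Return the SNP base plus up to `flank` bases on each side (pos is 0-based)."""
    start = max(0, pos - flank)
    end = min(len(reference), pos + flank + 1)
    return reference[start:end]


def reliable_snps(hits_by_snp, min_aln_len=101, max_hits=1):
    """Keep SNPs whose flanking window aligns uniquely and over enough bases.

    hits_by_snp maps a SNP position to the alignment lengths (bp) of all
    hits of its flanking window against the assembly (assumed input format).
    """
    kept = []
    for pos, aln_lengths in sorted(hits_by_snp.items()):
        if not aln_lengths or len(aln_lengths) > max_hits:
            continue  # unaligned, or aligned to multiple places (treated as duplicate)
        if min(aln_lengths) < min_aln_len:
            continue  # alignment shorter than 101 bp (treated as unreliable)
        kept.append(pos)
    return kept
```

With these defaults, `reliable_snps({10: [201], 20: [80], 30: [201, 201]})` keeps only position 10.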
Candidate indels were identified by aligning the samples to the reference sequences using LASTZ (v1.03.54, http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00) software. The best alignments were then selected with the axt_correction, axtSort and axtBest programs to give a preliminary set of indels. Then, 150 bp upstream and downstream of the reference sequence indel locus were compared with the sequence reads of the sample by BWA (http://bio-bwa.sourceforge.net) software and SAMtools (http://samtools.sourceforge.net/), and reliable indels were obtained by filtering.

Phylogenetic analysis
An evolutionary tree was constructed based on the population SNP matrix of the sample and reference genome. For each sample, all SNPs were concatenated in the same order to obtain FASTA format sequences of the same length, one of which was the reference sequence; these sequences were used as the input file for the construction of the evolutionary tree. An evolutionary tree was also constructed based on the core genes: single-copy core genes identified by gene clustering were aligned as protein sequences using MUSCLE (v3.8.31) software [35], and the results were used to construct an evolutionary tree. PhyML (v3.0, http://www.atgc-montpellier.fr/phyml/) and 1000 bootstraps were used to construct the phylogenetic tree with the maximum likelihood (ML) method [36]. Data files used in the phylogeny analysis have been added to the supplemental files (S1 and S2 Data).
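The construction of the SNP-based input can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the pipeline actually used: the function name, the dictionary-based input format, and the rule that a sample without a call at a site carries the reference allele are all assumptions.

```python
def snp_matrix_to_fasta(snp_calls, reference_alleles, reference_name="reference"):
    """Concatenate SNP alleles in a fixed site order into equal-length FASTA records.

    reference_alleles: {position: reference base} for every SNP site.
    snp_calls: {sample name: {position: alternative base}} (assumed format);
    sites missing for a sample are filled with the reference base.
    """
    positions = sorted(reference_alleles)  # same site order for every sequence
    records = {reference_name: "".join(reference_alleles[p] for p in positions)}
    for sample, calls in snp_calls.items():
        records[sample] = "".join(
            calls.get(p, reference_alleles[p]) for p in positions
        )
    return "".join(f">{name}\n{seq}\n" for name, seq in records.items())
```

Because every record is built over the same sorted list of positions, all sequences have identical length and can be passed directly to a tree-building program as an alignment.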

Features of S. 'Parrish' chloroplast genome DNA
The length of the S. 'Parrish' chloroplast genome is 168,493 bp. The genome has a quadripartite structure with an SSC of 15,799 bp and an LSC of 89,494 bp, which are separated by two IR regions (IRa and IRb, each 31,600 bp) (Fig 1 and Table 1). The GC content of the overall chloroplast genome and the LSC, SSC, and IR regions is 36.19, 34.72, 29.35, and 39.98%, respectively (Table 1); these values are similar to those found for the genome of S. kochii [20]. The GC content of the two IR regions is higher than that of the LSC and SSC, which is a very common pattern in other plants [21], and this phenomenon is mostly attributable to rRNA genes and tRNA genes [37].
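As a consistency check on these figures, the region lengths should sum to the genome length, and the length-weighted mean of the regional GC contents should reproduce the overall GC content. The short Python sketch below uses only the values reported above and assumes that both IR copies have the same GC content.

```python
# Region lengths (bp) and GC content (%) as reported in the text;
# both IR copies are assumed to share the reported IR GC content.
REGIONS = {
    "LSC": (89_494, 34.72),
    "SSC": (15_799, 29.35),
    "IRa": (31_600, 39.98),
    "IRb": (31_600, 39.98),
}

total_length = sum(length for length, _ in REGIONS.values())
# Length-weighted mean GC content across the four regions.
overall_gc = sum(length * gc for length, gc in REGIONS.values()) / total_length

print(total_length)          # 168493
print(round(overall_gc, 2))  # 36.19
```

Both values agree with the reported genome length of 168,493 bp and overall GC content of 36.19%.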
The S. 'Parrish' chloroplast genome encodes 132 genes in total, comprising 87 protein-coding genes, 37 tRNA genes and 8 rRNA genes (Table 2). Each IR region includes 7 protein-coding genes, 7 tRNA genes and 4 rRNA genes. The SSC contains 11 protein-coding genes and 1 tRNA gene, while the LSC contains 62 protein-coding genes and 22 tRNA genes (Fig 1).
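The regional counts reconcile with the genome-wide totals when the IR counts are taken as per-copy values and counted twice. The Python sketch below shows this tally; the zero rRNA counts for the LSC and SSC are an assumption inferred from the text, which lists no rRNA genes in those regions.

```python
# Gene counts per region as given in the text; IR values are per copy.
PER_REGION = {
    "LSC": {"protein": 62, "tRNA": 22, "rRNA": 0},  # rRNA = 0 is an assumption
    "SSC": {"protein": 11, "tRNA": 1, "rRNA": 0},   # rRNA = 0 is an assumption
    "IR": {"protein": 7, "tRNA": 7, "rRNA": 4},
}
COPIES = {"LSC": 1, "SSC": 1, "IR": 2}  # the IR is present twice (IRa and IRb)

totals = {
    kind: sum(PER_REGION[region][kind] * COPIES[region] for region in PER_REGION)
    for kind in ("protein", "tRNA", "rRNA")
}

print(totals)                # {'protein': 87, 'tRNA': 37, 'rRNA': 8}
print(sum(totals.values()))  # 132
```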
The frequency of codon usage was inferred based on the sequence of protein-coding genes and tRNA genes (Table 3). In total, 28,423 codons were detected across all genes in S. 'Parrish'. Of these codons, 2,903 (10.21%) encode leucine, which is the most frequent amino acid in the chloroplast genome, and 333 (1.17%) encode cysteine, which is the least frequent.
The chloroplast genome of S. 'Parrish' contains 19 intron-containing genes, including 6 tRNA genes and 13 protein-coding genes. The ycf3 and clpP genes each contain two introns, and the other 17 genes each contain one intron (Table 4). The intron (2,569 bp) of the trnK-UUU gene, which is the largest intron, includes the matK gene. The rps12 gene is a trans-spliced gene with the 5' end located in the LSC region and the duplicated 3' ends in the IR regions. The ycf3 gene is required for the stable accumulation of the photosystem I complex [38,39]. The introns in the S. 'Parrish' chloroplast genome may be useful for further studies of the mechanism of photosynthesis evolution.
Intron or gene gain or loss can be found in chloroplast genomes [8,40-42] and may be significant during evolution. However, few studies have reported on the mechanism of photosynthesis evolution in Spathiphyllum. In this paper, we compared the chloroplast genome of S. 'Parrish' to that of other species of monocotyledons. These results provide a theoretical foundation for Spathiphyllum chloroplast genome research, breeding and molecular marker development.

SNP and indel detection and annotation
Analysis of SNPs and indels in the chloroplast genome of S. 'Parrish' relative to that of S. cannifolium revealed 962 SNPs in S. 'Parrish'. Of these SNPs (S6 Table), 704 were located in intergenic regions, representing the most frequently occurring mutations, and the coding regions included 134 synonymous SNPs, 123 nonsynonymous SNPs, and 1 stop mutation. There were 158 indels, including 90 insertions and 68 deletions, in the S. 'Parrish' chloroplast genome relative to the S. cannifolium chloroplast genome (S1 Fig and S7 Table). Of these 158 indels, 57 (36.08%) were single-base indels, which differed from the numbers in maize and sugarcane [8,43,44]. This indicates that nucleotide substitution events were more frequent between the chloroplast genomes of Spathiphyllum species than between those of species of Oryza and Kaempferia. The analysis of these SNP and indel molecular markers can provide a theoretical basis for species identification in the future. [...] these species, respectively (Fig 4).
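The distinction between synonymous, nonsynonymous and stop SNPs in coding regions can be illustrated with a small Python sketch. It is not the annotation tool used in this study; the function name and arguments are hypothetical, and treating any change that creates or removes a stop codon as a "stop" mutation is an assumption about how that category was defined.

```python
# Standard codon-to-amino-acid assignments (shared by genetic code 11).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon, offset, alt_base):
    """Classify a substitution at codon position `offset` (0, 1 or 2)."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:offset] + alt_base.upper() + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"  # encoded amino acid is unchanged
    if "*" in (ref_aa, alt_aa):
        return "stop"  # a stop codon is created or removed (assumed definition)
    return "nonsynonymous"  # encoded amino acid changes
```

For instance, CTT to CTC leaves leucine unchanged (synonymous), CTT to ATT replaces leucine with isoleucine (nonsynonymous), and TGG to TGA converts tryptophan into a stop codon.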

Comparative chloroplast genome analysis
Comparative analysis of chloroplast genomes is an essential step in genomics [47,48]. A comparison of the structural differences among Araceae chloroplast genomes indicates that the chloroplast genome of S. kochii is the smallest (Fig 4; S. 'Parrish', 168,493 bp; S. cannifolium, 171,420 bp; D. seguine, 163,704 bp; S. kochii, 163,368 bp; and P. ternata, 164,013 bp). To assess the level of genome divergence, the whole sequence identity of the five Araceae chloroplast genomes was calculated using mVISTA with S. 'Parrish' as a reference (Fig 5). The IR (A/B) regions exhibited less divergence than the SSC and LSC regions. In addition, the noncoding regions showed more differences than the coding regions. Apart from the noncoding regions, the most highly divergent regions between S. 'Parrish' and S. cannifolium were mainly in ndhF-ndhE in the IRa and SSC regions (Fig 5), the length of which was approximately 10 kb. Apart from the noncoding regions, the most frequently divergent regions between S. 'Parrish' and S. kochii were mainly in the coding regions of the ycf1 sequence in the IRa and SSC regions (Fig 5), the length of which was approximately 7 kb. The difference in regional structure between the two segments may be responsible for the closer relationship between S. 'Parrish' and S. kochii than between S. 'Parrish' and S. cannifolium.

Phylogenetic analysis
The complete chloroplast genome of S. 'Parrish' provides information that can be used to analyze the phylogenetic relationships of S. 'Parrish' with 15 other monocots. Multiple sequence alignment was performed using the whole chloroplast genome (Fig 6A) and the protein-coding genes (Fig 6B) in 15 monocots. The B. napus and R. sativus chloroplast genomes were used as outgroups. We used ML to construct a phylogenetic tree. In the tree, S. 'Parrish' was closer to S. kochii than to S. cannifolium. These results (Fig 6A and 6B) suggest that the two methods produce similar multiple sequence alignments, and the phylogenetic tree analysis shows that the chloroplast genome sequence is useful for species identification and genetics. The difference in scale causes a difference in the alignment of the protein-coding sequence and whole chloroplast genome. Second, we performed an alignment analysis on the complete sequences of three samples, S. 'Parrish' (MK391158), S. cannifolium (MK372232) and S. kochii (KR270822), and found that the sequence similarity of the three chloroplast genomes was 99.53% (S2 Fig). Percentage values are shown on the evolutionary branches.

Conclusions
In this study, we reported and analyzed the complete chloroplast genome of S. 'Parrish', which is one of the most popular ornamental plants worldwide. A comparison of the structure of the Araceae chloroplast genomes revealed that the IRa and SSC regions were more divergent than the other two regions, and the noncoding regions showed more differences than the coding regions. In the repeat structure analysis, we detected 281 SSRs, which included 223 mononucleotides, 28 dinucleotides, 12 trinucleotides, 11 tetranucleotides, 6 pentanucleotides, and 1 hexanucleotide, in the S. 'Parrish' chloroplast genome. In addition, 50 long repeats, comprising 18 forward repeats, 13 reverse repeats, 17 palindromic repeats, and 2 complementary repeats, were identified. Analysis of SNPs and indels in the S. 'Parrish' chloroplast genome relative to the S. cannifolium chloroplast genome revealed 962 SNPs and 158 indels in the S. 'Parrish' chloroplast genome. Phylogenetic analysis among five species found S. 'Parrish' to be more closely related to S. kochii than to S. cannifolium. The results of this study provide an assembly of the whole chloroplast genome of S. 'Parrish' and information on its divergence from the chloroplast genome of other members of Spathiphyllum, which might be useful for future breeding and biological discoveries.