Mining of simple sequence repeats (SSRs) loci and development of novel transferability-across EST-SSR markers from de novo transcriptome assembly of Angelica dahurica

Angelica dahurica is a widely grown plant species with multiple uses, especially in the medical field. However, the frequent introduction of A. dahurica to new areas has made it difficult to distinguish between varieties. Simple sequence repeats (SSRs) detected based on transcriptome analyses are very useful for constructing genetic maps and analyzing genetic diversity. They are also relevant for the molecular marker-assisted breeding of A. dahurica. We identified 33,724 genic SSR loci based on transcriptome sequencing data. A total of 114 primer pairs were designed for the SSR loci and were tested for their specificity and diversity. Ten SSR loci in untranslated regions were ultimately selected. Subsequently, 56 A. dahurica ecotypes collected from different regions were analyzed. The SSR loci comprised 2–8 alleles, with a mean of 5.2 alleles per locus. The polymorphic information content value and Shannon’s information index were 0.2702–0.6274 (average of 0.4091) and 0.5618–1.3040 (average of 0.8475), respectively. The 10 novel SSRs identified in this study were largely in accordance with Hardy-Weinberg equilibrium and will be useful for analyzing A. dahurica genetic relationships. The results of this study confirm the potential value of transcriptome databases for the development of new SSR markers.


Introduction
Angelica dahurica [1,2], which is known as 'Baizhi' in China, is a perennial dicotyledonous herb of the family Apiaceae. It originated in Taiwan and is widely grown in Korea, China, Japan, and Russia [3]. A. dahurica has been used in traditional Chinese medicine [4] because of its anti-proliferative [5], anti-inflammatory [6], anti-depressive [7], anti-oxidative [8], and anti-microbial [9] effects. It has also been applied to regulate hormones [10], decrease blood glucose levels [11], and treat headaches, ulcers, toothaches, abscesses [12], and other diseases. Angelica dahurica root extracts contain furanocoumarin compounds [13], phenolic compounds, and volatile oils, all of which account for the multiple pharmacological effects of this plant species. A. dahurica ecotypes may vary in their root surface texture and color, stem color, active ingredient compositions and abundances, and other characteristics. However, it remains unknown whether these differences are the result of phenotypic plasticity or ecotypic variation. Analyses of molecular data may clarify this issue. Angelica dahurica has a functionally diploid genome (2n = 22) [14], and is cultivated in Sichuan, Zhejiang, Henan, Hebei, Anhui, and other provinces in China. Additionally, there are many wild A. dahurica populations. The abundance of available A. dahurica germplasm resources is ideal for genetic studies as the core of germplasm resource diversity is genetic diversity. The development of molecular genetic markers may be useful for clarifying the genetic variation among varieties. Eukaryotic genes are broken up by introns [15]. In the process of eukaryotic gene expression, introns are spliced out to form different mRNAs that guide protein expression and ultimately produce phenotypic diversity.
Therefore, SSR markers can be divided into two types: genic markers, such as those in mRNA coding regions and untranslated regions (UTRs), and genomic markers, including markers in non-functional genes and intron regions. SSR markers are usually developed through transcriptome sequencing. Compared with genomic markers, genic markers are more practical and may be closely associated with traits because variations in the nucleotide sequences of functional genes often result in phenotypic diversity.
Simple sequence repeats (SSRs), which are also known as microsatellites, are one of the most efficient types of genetic markers because of their reproducibility, multiallelic nature, codominant inheritance, relative abundance, and high genome coverage [16]. SSR loci are generally divided into genomic SSRs and genic SSRs based on their genomic locations. Genic SSRs occur in gene coding regions, making them suitable for explaining the phenotypic and functional diversity in various populations, gene mapping, and analyses of evolution [17].
The cross-species transferability of SSRs, which is dependent on phylogenetic closeness, is greater for genic SSRs than for genomic SSRs. Advances in next-generation (i.e., high-throughput) sequencing technology have improved the efficiency and reliability of transcriptome sequencing experiments [18]. Consequently, many expressed sequence tag (EST)-SSR markers exhibiting cross-species transferability have been developed via the transcriptome sequencing of various plant species, such as Euphorbiaceae family members [19], Lolium multiflorum [20], and Saccharina japonica [21]. The frequent introduction of A. dahurica to new regions in China has made it difficult to distinguish between different populations, which is problematic for the development of A. dahurica-specific EST-SSR markers. Thus, A. dahurica-specific EST-SSR markers have not been reported. We herein present the first report of the development of genic SSR markers in A. dahurica. These markers may provide researchers with new opportunities for assessing the molecular phylogeny and genetic diversity among A. dahurica varieties.

Plant materials
Our research object was A. dahurica, which is not an endangered or protected species. No specific permission was required for the materials we collected because they are ecotypes that have been planted for a long period, and the exchange of planting areas occurs frequently. We collected them with the owners' permission. We collected 56 A. dahurica ecotypes from Sichuan, Zhejiang, Hebei, Jiangsu, Anhui, and Guizhou provinces as well as the Chongqing municipality (Table 1) for the subsequent development of genic SSR markers.
Leaves, stems, and phloem and xylem tissues of A. dahurica cultivar 'Chuan zhi No. 2' were collected at the seedling and bolting stages for transcriptome sequencing. The samples were immediately frozen in liquid nitrogen and stored at −80˚C for subsequent RNA extraction. Three biological replicates were prepared for each sample.

RNA extraction, sequencing, and analysis of the A. dahurica transcriptome and SSR locus identification
Total RNA was extracted with the mirVana miRNA Isolation Kit (Thermo Fisher, Carlsbad, CA, USA). The 2100 Bioanalyzer RNA Nanochip (Agilent, Santa Clara, CA, USA) was used to verify RNA quality. All samples had an RNA Integrity Number > 7 and a 28S/18S ratio > 2. The TruSeq Stranded mRNA LT Sample Prep Kit (Illumina, San Diego, CA, USA) was used to purify and fragment the mRNA, after which double-stranded cDNA was synthesized with SuperScript II Reverse Transcriptase (Invitrogen, Carlsbad, CA, USA). Short fragments were modified with an A-tail on the 3′ end and adapters were ligated via PCR amplification. The resulting modified fragments were sequenced with the Illumina Genome Analyzer HiSeq X Ten sequencing platform. Data analyses and base calling were performed with the Illumina instrument software. The transcriptome datasets are available in the NCBI Sequence Read Archive (accession number SRX247927).
The Trinity software (version: trinityrnaseq_r20131110) [1] was used to splice and obtain transcript sequences. Additionally, the MISA software (http://pgrc.ipk-gatersleben.de/misa/) was used to detect genic SSR markers based on the de novo transcriptome sequencing data [22]. The SSR loci containing repeat units of 1-6 nucleotides were identified, and the minimum SSR length was set at six iterations. SSR primers were designed using the Primer3 software (http://pgrc.ipk-gatersleben.de/misa/) with the following parameters: primer length of 18-22 bp, with an optimum length of 20 bp; an annealing temperature of 60˚C; a PCR product size of 100-280 bp; and a GC content of 40-60%, with an optimum of 50% [23].
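The SSR search described above can be illustrated with a minimal Python sketch. This is not the MISA implementation; it is a simplified scan for perfect repeats only, written under two assumptions: that "six iterations" means a minimum of six tandem copies for every unit size from 1 to 6 nucleotides, and that compound SSRs are ignored. The function names find_ssrs and is_primitive are hypothetical.

```python
def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'AGAG')."""
    n = len(motif)
    return not any(n % d == 0 and motif[:d] * (n // d) == motif
                   for d in range(1, n))


def find_ssrs(seq, min_repeats=6, max_unit=6):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Assumption: one minimum repeat count is applied to all unit sizes;
    real MISA runs are configured per unit size.
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        i = 0
        while i + k <= len(seq):
            motif = seq[i:i + k]
            if set(motif) <= set("ACGT") and is_primitive(motif):
                # Count consecutive tandem copies of the motif starting at i.
                r = 1
                while seq[i + r * k:i + (r + 1) * k] == motif:
                    r += 1
                if r >= min_repeats:
                    hits.append((i, motif, r))
                    i += r * k  # skip past the repeat tract
                    continue
            i += 1
    return sorted(hits)


example = "TTGC" + "AG" * 7 + "CCGT" + "ATC" * 6 + "GG"
print(find_ssrs(example))
```

Requiring a primitive motif keeps a dinucleotide tract such as (AG)n from also being reported as a tetranucleotide or hexanucleotide repeat.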

Germplasm resource DNA extraction
The seeds of the various ecotypes were sown in pots and germinated in a growth chamber under the following conditions: 22˚C, 75% relative humidity, and a 16-h light/8-h dark cycle.
Leaves were collected at the seedling stage, and genomic DNA was extracted according to a modified cetyltrimethylammonium bromide protocol, with polysaccharides removed using high salt concentrations and polyphenols removed using polyvinyl pyrrolidone, an extended RNase treatment and phenol-chloroform extraction [24]. The DNA quality was assessed by 1% agarose gel electrophoresis (m/v) and OD260/OD280 with a NanoDrop 2000 spectrophotometer (Wilmington, DE, USA). The DNA quantity was assessed with a NanoDrop 2000 spectrophotometer.

Development and validation of SSR markers
A total of 114 SSR primer pairs were synthesized by Chengdu TsingKe Biotechnology Co., Ltd. (Chengdu, China). PCR amplifications were completed with the T100 Thermal Cycler (Foster City, CA, USA). The PCR program was as follows: 94˚C for 5 min; 35 cycles of 94˚C for 30 s, 55-65˚C for 30 s, and 72˚C for 2 min; 72˚C for 10 min. The specificity of the PCR products and the ideal annealing temperature were determined by 1% agarose gel electrophoresis. Primers that amplified a single band of the expected size were selected. The forward primers were labeled with the HEX fluorescent dye by Chengdu TsingKe Biotechnology Co., Ltd. (Chengdu, China) to analyze fragments on the Applied Biosystems 3730xl DNA Analyzer (Carlsbad, CA, USA). For genotyping, 0.5 μL PCR amplification products in 10 µL ABI highly deionized (Hi-Di) formamide and ABI GeneScan 500 LIZ Size standard mixture (130:1) were first denatured at 95˚C for 5 min and then separated by capillary electrophoresis. Alleles were detected with the GeneMapper software (version 4.1) [25]. Markers with a strong tendency to form stutter peaks were excluded in this step. The sequences of SSR products were determined by Chengdu TsingKe Biotechnology Co., Ltd. (Chengdu, China).

Data analysis
We used the GenAlEx software (version 6.501) [26] to edit the data and transform them into the codominant format automatically. The number of alleles (Na), number of effective alleles (Ne), observed heterozygosity (Ho), expected heterozygosity (He), Shannon's information index (I), and Hardy-Weinberg equilibrium (HWE) were calculated with the Popgene software (version 1.32) [27]. Additionally, the polymorphic information content (PIC) was calculated with the PIC_Calc program.
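For readers unfamiliar with these indices, the following Python sketch shows how they can be computed for a single codominant locus from textbook definitions. It is not the Popgene or PIC_Calc code, and it assumes the uncorrected formulas Ne = 1/Σp², He = 1 − Σp², I = −Σp ln p, and PIC = 1 − Σp² − ΣΣ2p_i²p_j²; Popgene may additionally report sample-size-corrected values. The function name locus_stats and the allele sizes in the example are hypothetical.

```python
import math
from collections import Counter


def locus_stats(genotypes):
    """Diversity indices for one locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    Assumption: uncorrected estimators (no small-sample adjustment).
    """
    alleles = [a for pair in genotypes for a in pair]
    counts = Counter(alleles)
    freqs = [c / len(alleles) for c in counts.values()]
    homozygosity = sum(p * p for p in freqs)
    # Cross term of the PIC formula, summed over unordered allele pairs.
    cross = sum(2 * pi ** 2 * pj ** 2
                for i, pi in enumerate(freqs) for pj in freqs[i + 1:])
    return {
        "Na": len(counts),
        "Ne": 1 / homozygosity,
        "Ho": sum(1 for a, b in genotypes if a != b) / len(genotypes),
        "He": 1 - homozygosity,
        "I": -sum(p * math.log(p) for p in freqs),
        "PIC": 1 - homozygosity - cross,
    }


# Hypothetical fragment sizes (bp) for four individuals at one locus.
print(locus_stats([(150, 150), (150, 154), (154, 154), (150, 154)]))
```

With two equally frequent alleles, as in this example, Ne equals 2 and I equals ln 2, which gives a quick sanity check on the formulas.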

Transcriptome sequencing and de novo assembly
The quality of the extracted total RNA was appropriate for transcriptome sequencing (see S1 Table, S1 Fig). After a stringent quality assessment and data filtering step, 49,580,458.67 raw reads were selected for further analyses and deposited in the NCBI SRA database (PRJNA523076). An overview of the sequencing results is presented in Table 2. All high-quality reads were assembled with the Trinity software, which produced a contig N50 length of 1,858 bp and a mean length of 1,299.83 bp. The de novo assembly of the clean reads yielded 110,251 unigenes in the A. dahurica transcriptome, with a total length of 143,307,954 bp. The GC content of the unigenes was mostly 30-50% (Fig 1). The mean A. dahurica transcriptome GC content was 43.33%.
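As a reading aid, the assembly statistics quoted above can be sketched in Python. The snippet assumes the usual definition of N50 (the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length); the function name n50 and the toy length list are hypothetical. The last lines check that the reported total length and unigene count are consistent with the reported mean length.

```python
def n50(lengths):
    """N50 of a collection of sequence lengths (standard definition assumed)."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


# Toy example: total length 20, so N50 is reached after 6 + 5 = 11 >= 10.
print(n50([2, 3, 4, 5, 6]))

# Mean unigene length implied by the reported totals.
total_bp, n_unigenes = 143_307_954, 110_251
print(round(total_bp / n_unigenes, 2))
```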

Identification of simple sequence repeats from the A. dahurica transcriptome
A total of 33,724 potential SSR loci were identified across 26,455 unigenes from the transcriptome of A. dahurica. The total size of the examined unigene sequences was 143,307,954 bp, meaning one SSR site was detected per 4,249 bp. Moreover, 20,734 unigenes (78.37%) contained a single SSR, 5,723 unigenes (21.63%) had more than one SSR, and 1,993 unigenes (7.53%) contained compound SSRs (Table 3). Thus, A. dahurica contains many types of SSRs, and all of the repeats comprising 1-6 nucleotides were detected.
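The density and percentages above follow directly from the counts in Table 3. A short Python sketch of the arithmetic, using only the reported numbers (the function name ssr_summary is hypothetical):

```python
def ssr_summary():
    """Recompute SSR density and unigene percentages from the reported counts."""
    total_bp = 143_307_954      # total size of examined unigene sequences
    n_ssr = 33_724              # identified SSR loci
    with_ssr = 26_455           # unigenes containing SSR loci
    single, multi, compound = 20_734, 5_723, 1_993
    return {
        "bp_per_ssr": int(total_bp / n_ssr),
        "pct_single": round(100 * single / with_ssr, 2),
        "pct_multi": round(100 * multi / with_ssr, 2),
        "pct_compound": round(100 * compound / with_ssr, 2),
    }


print(ssr_summary())
```

All percentages use the 26,455 SSR-containing unigenes as the denominator, not the full set of 110,251 unigenes.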

Development and validation of identified SSR markers
A total of 114 primer pairs were designed and synthesized (S1 Table). After PCR amplification, 14 primer pairs that produced a single band of the expected size and many polymorphisms were selected for subsequent analysis. All of the forward primers were labeled with the HEX fluorescent dye and then used to analyze the fragments from the 56 A. dahurica ecotypes collected from different locations in China (Table 1). Loci that failed to provide clear signals in the expected size range or that lacked polymorphisms were eliminated. Finally, 10 novel SSR loci were retained. The results showed that Loci 1, 2, 3, 4, 7, 9, and 10 were moderately polymorphic and in accordance with HWE.

Developing SSRs based on the transcriptome is reliable and efficient
With the advent and rapid development of high-throughput sequencing technology, RNA sequencing and the development of molecular markers have become easy and reliable in many species [28][29][30]. Multiple SSR loci can be identified by analyzing the transcriptome, which is significant for the localization of genes responsible for specific traits and functions. This is the first report describing the application of high-throughput sequencing to the development of A. dahurica genic SSR markers based on unigenes. A total of 33,724 genic SSR loci were detected, and a set of 10 novel genic SSR markers was developed to provide additional tools for analyzing the genetic diversity of A. dahurica. The 10 polymorphic loci were then validated in 56 individuals. These loci showed abundant polymorphism, implying they are useful for analyzing A. dahurica relationships and confirming the potential value of an A. dahurica transcriptome database for the development of new SSR markers.

The infrequent observation of GC repeats may be related to CpG clusters and GC contents
GC repeat motifs were the least common among the mononucleotide and dinucleotide repeats, possibly because of the presence of CpG clusters. Previous studies have indicated that CpG dinucleotides often occur in discrete regions, and almost 60% of these CpG-rich clusters are located in or close to genes, mostly at the 5′ end [31,32]. The GC content and CpG pattern, along with the chromatin condensation and B-Z transition, in a gene and around the promoter can determine gene expression characteristics through transcription factor binding sites [33]. Cell differentiation involving the deamination of mC to T has been proposed to result in CpG deficiency [34], and CpG deficiency is associated with a corresponding TpG (CpA) excess [35]. The mean GC content of the A. dahurica transcriptome unigenes was 43.33% (i.e., less than the AT content), which may be related to the limited number of GC dinucleotide repeats or G/C mononucleotides. Moreover, CpG clusters may be essential for regulating gene expression, and are therefore infrequently observed.

Table 3. A. dahurica transcriptome SSR general statistics.

Items                                                        Number
Total number of unigene sequences examined                   110251
Total size of examined unigene sequences (bp)                143307954
Total number of identified SSRs                              33724
Number of sequences containing SSR loci                      26455
Number of sequences with 1 SSR locus                         20734
Number of unigene sequences with more than 1 SSR locus       5723
Number of SSRs present in compound formation                 1993

https://doi.org/10.1371/journal.pone.0221040.t003

The presence of 10 novel SSRs in untranslated regions implies SSR diversity is related to different phenotypes
Cis-acting elements involved in post-transcriptional control are generally located in the UTRs of mRNAs [36]. In the current study, 10 novel A. dahurica SSRs were localized to UTRs. UTRs are vital at the interface between mRNAs and proteins. They regulate stability, transport, and translation efficiency, as well as the function and subcellular localization of translated proteins, and they increase the coding capacity of the genome [36][37][38][39]. In future studies, the relationship between SSRs and regulatory elements in the UTR should be characterized to clarify phenotypic diversity.

Conclusion
The A. dahurica transcriptome characterization and the substantial body of transcripts reported here will facilitate research to develop the medicinal and nutritional properties of this species. Ten novel polymorphic A. dahurica SSR markers were developed, providing a foundation for genetic diversity analysis, genetic mapping, and marker-assisted breeding in A. dahurica.
Supporting information S1 Fig. RNA