The taxonomic significance of ddRADseq based microsatellite markers in the closely related species of Heracleum (Apiaceae)

Many studies on Heracleum have shown poor correspondence between observed molecular clusters and established taxonomic classification amongst closely related species. This might reflect both unresolved taxonomy but perhaps also a lack of good genetic markers. This lack of appropriate and cost effective species-specific genetic markers hinders a resolved relationship for the species complex, and this in turn causes profound management challenges for a genus that contains both endemic species, with important ecological roles, and species with an invasive potential. Microsatellites are traditionally considered markers of choice for comprehensive, yet inexpensive, analyses of genetic variation, including examination of population structure, species identity, linkage map construction and cryptic speciation. In this study, we have used double digest restriction site associated DNA sequencing (ddRADseq) to develop microsatellite markers in Heracleum rechingeri. Genomic DNA from three individuals were digested with Sbf1 and Nde1 and size selected for library construction. The size-selected fragments were sequenced on an Ion Torrent sequencer and a total of 54 microsatellite sequences were bioinformatically confirmed. Twenty five loci were then tested for amplification, resulting in 19 of these being successfully amplified across eight species, comprising both the so-called thick-stemmed species (H. persicum, H. rechingeri, H. gorganicum and H. lasiopetalum), and thin-stemmed species (H. anisactis, H. pastinasifolium and H. transcaucasicum). Both Bayesian and distance-based clustering, and principal coordinate analyses clearly separated these into two groups. Surprisingly, three H. pastinacifolium populations were not separated from populations of the morphologically similar endemic species, H. anisactis, suggesting lack of genetic differentiation. Likewise, high genetic similarity was found between H. persicum and H. rechingeri populations, questioning taxonomic separation at the species level between these taxa. Further analyses are needed to re-evaluate the taxonomic significance of observed morphological variability currently applied to distinguish these sister taxa. Nevertheless, our results represent progress in the effort to develop cost-efficient molecular tools for species discrimination in this genus.


Introduction
In many plant species complexes, factors such as hybridizations, phenotypic plasticity and high variability across the geographical distribution range makes species identification based on morphological characters challenging [1][2][3]. Heracleum is a taxonomically complex genus, and previous investigations have found incongruent results between observed molecular clusters, morphological data and taxonomic classifications [4][5][6]. Lack of appropriate species-specific genetic markers hinders progress to untangle clear relationships within a species complex in a cost-effective way [2][3][4]6]. This is particularly challenging for a genus with species having important ecological roles, or which are rare, or alternatively invasive, or of special interest for conservation management.
The ecological significance of the Heracleum in terms of the negative effect on the local biodiversity has recently been reported in its invasive range in northern Europe [7]. Beside its significant impact on ecosystem, various species of the plant are used for variety of purposes which include as a fodder crop, ornamental use, food additives like spices, and in pharmaceutical applications [8,9]. Yet certain species also present a human health risk mainly due to the production of phototoxic derivatives in the exuded stem and leaf sap. Correspondingly, the eradication of noxious Heracleum species in human populated areas is considered resourcedemanding [10].
Molecular studies have shown that the seemingly cosmopolitan genus Heracleum (Apiaceae) is not monophyletic [5,11]. There is also extensive variation associated with genetic, chemical and morphological traits at various geographical levels within and among species [3][4][5]12]. The genus consists of more than 100 species found mainly in the West-Asian/Caucasus and Syno-Hymalian regions, the former region being suggested as a center of origin for the genus [3,11,13]. Eight [14]. On the other hand, the species within each section are reported to be morphologically very similar. For instance, the main morphological character differentiating the Pubescentia and Wendia sections is primarily the thickness of the stem. Furthermore, based on diagnostic taxonomic characters described in Mozaffarian et al. [15] Iranian Heracleum species can be classified into two major groups, based on the size of the plant and the stem diameter, namely the "thinstemmed" (H. pastinacifolium, H. anisactis, H. transcaucasicum and H. rawianum) and the "thick-stemmed" (H. persicum, H. rechingeri, H. gorganicum). However, recent molecular studies have shown lack of distinctiveness among species within these suggested groups [12].
Genetic markers can be used to clarify how diversity is partitioned within the species complex. Microsatellite markers, short segments of DNA consisting of 2 to 6 base pair (bp) repetitive motifs, may be particularly useful. The main features of these markers include codominant inheritance, a high mutation rate, and thus often appreciable variability within and between populations observed in a range of Eukaryotes [16][17][18][19][20]. Another important attribute is presumed selective neutrality of microsatellite markers, sometimes even when found within genic regions [21]. Besides high frequency throughout the genome, and often high variability among individuals, their utility also includes a highly consistent amplification within and among closely related species.
Microsatellites have been extensively used to study species complexes and cryptic speciation, particularly in cases with unclear and unresolved phenotypic relationships [e.g. 2,22]. For instance, species-specific markers have successfully been applied to discriminate between large and small-glanded individuals of the Dalechampia scandens species complex (Euphorbiaceae) [23]. Likewise, such markers have been applied to delimitate the species boundaries in genus Phoenix (Arecaceae), showing that the P. dactylifera clade, unexpectedly, is very distinct from the related species of P. atlantica and P. canariens [24]. Additionally, it has been shown that microsatellite markers are often readily transferrable between species [2], but this is not always the case [e.g. 22].
In recent years, next generation sequencing (NGS) techniques have been widely adopted to identify molecular markers including both microsatellite and single nucleotide polymorphism (SNP) in non-model organisms. This can be done at low cost and in a short time, even without an associated reference genome [25,26]. Furthermore, Restriction Associated DNA sequencing (RADseq), a technique for reproducibly subsampling a target genome [27], has made it possible to evaluate population genomic data for virtually any organism to a high degree of accuracy, with longer sequence reads and in a cost-effective nature [27,28].
In this study, we wanted to use ddRADseq to identify and develop microsatellite markers to help resolve the Heracleum species complex. Specifically we wanted to develop microsatellite markers first, based on the H. rechingeri genome, and subsequently try to utilize these markers in other closely related species of Heracleum. We wanted to test whether thick-stemmed species complex including H. rechingeri and H. persicum and H. gorganicum represent a distinct genetic lineage, and also whether we are able to discriminate endemic thin-stemmed species H. anisactis and H. transcacasicum from the widespread H. pastinacifolium.

Plant materials and ddRADseq analysis
First, leaf tissue from three individuals of Heracleum rechingeri were collected from a population in Fandoghlu (Namin, Ardebil province, Iran) and immediately placed on silica gel (Table 1). H. rechingeri has been treated as a separate species with a few distinct morphological traits, including the shape of mericarp and stylopodium compared to the widespread sister taxa H. persicum (see Table 2 extracted from Flora Iranica). The species occurs in a narrow distribution range in the western parts of Alborz mountains in northern Iran [15].
DNA was extracted from the dried tissues using the ENZA SP Plant DNA Kit as specified by the manufacturer's protocol (Omega Bio-Tek, Norcross, USA) and eluted in water. DNA was quantified using the broad range Qubit assay (Thermofisher Scientific) and normalized to a working concentration of 5 ng ul -1 . We used the double digest Restriction site Associated DNA (ddRADseq) protocol as described by Vivian-Smith and Sønstebø [29] from which we developed microsatellite markers. Genomic DNA (200 ng) was digested in NEB 4 buffer with the Sbf1-HF and Nde1-HF restriction enzymes (New England Biolabs) and ligated to adaptors all in a single unidirectional one-tube reaction [29]. The individual PCR-amplified libraries were pooled and size selected with the Pippin Prep instrument (Sage Science Inc.), using a 2% gel cassette (Cat No. CSD2010), and a size fractionation window of 340 to 420 bp was collected. Where required, the selected fragments were purified throughout the protocol with 1.1 volumes of AMPure XP (Beckman Coulter). Libraries were quantified on a Bioanalyzer High Sensitivity chip (Agilent Technologies), and a concentration of 35 pM selected for the library to be templated onto Ion Sphere Particles (ISPs). This was performed with an Ion Torrent One-Touch 2 (Thermofisher Scientific), and the Ion Torrent PGM Template OT2 400 Kit (Cat No. 4479878 with MAN0007218). Templating efficiency was monitored after enrichment using the Ion Sphere Quality Control kit (Cat. No. 4468656). The ISP libraries were then sequenced using the Ion Torrent PGM (Thermofisher Scientific), based on 316 v2 chips (Cat. No. 4488149). Fastq files were produced after base calling and de-replication with the Torrent Suite software version 4.0.2. The raw sequence reads generated from Ion Torrent sequencer were deposited in the Sequence Read Archive (SRA) database (accession number PRJNA595715). Sequence files were processed to identify microsatellite containing motifs and primer pairs which were designed based on both consensus and singleton sequences with the  Table 2. Variation in key morphological traits of taxonomic importance amongst the Heracleum species studied. The data is reproduced from Flora Iranica [14].

Species Section
The shape of mericarp QDD program [30]. Consensus sequences were obtained with a default threshold of 95% identity being applied to merge reads with a high similarity.

Validation of primers and microsatellite genotyping
Primer pairs were examined for efficiency with additional PCR reactions for both positive amplification and the ease of microsatellite identification in other species of Heracleum includ- ing H. persicum, H. gorganicum and H. antasiaticum, H. anisactis, H. pastinasifolium and H. transcaucasicum and H. lasiopetalum as listed in Table 1. Twenty five microsatellite markers were selected to test for amplification against a panel of 48 individuals from species of Heracleum (Table 1). Each PCR reaction contained 2.5 μl of Taq DNA Master Mix Red 2× (Amplicon, Denmark), 0.2 μM of each forward and reverse primer and 1 μl of genomic DNA (ca. 15 ng) in a final volume of 5 μl. Thermal cycles consisted of the initial denaturation at 95˚C for 5 min, 30 cycles at 95˚C for 30s, 60˚C for 45s and 72˚C for 45s, followed by and 72˚C for 10 min in a Gene Amp 9700 thermal cycler (Applied Biosystems, USA). PCR products were checked for amplification using a 2% agarose gel.
To examine the efficiency of the developed markers in screening for genetic variation, the allele sizes for microsatellite loci were determined using a polyacrylamide gel. The procedure to prepare the polyacrylamide gel was identical to that already described in Movahedi et al. [31]. The size of the alleles were determined using a GeneRuler 50bp DNA ladder (Thermo Fisher Scientific).

Data analysis
Clustering analysis was conducted based on pairwise genetic distances among populations, Da [32], using the Populations1.2.32 program. A two dimensional plot of principal coordinate analysis obtained based on genetic distance was constructed by plotting the two first components. In addition, population genetic structure was determined with a Bayesian approach, implemented as the Markov chain Monte Carlo (MCMC) algorithm, in Structure V.2.3.4 [33], to evaluate the likely population of origin of individuals. We used an admixture model with correlated allele frequency model and a 1.5 × 10 5 burn-in period followed by 3 × 10 5 iterations to compute the parameters. For each K, ranging from K = 1 to K = 6, 20 independent runs were applied. The likely optimum number of genetic clusters was determined by plotting the mean estimated ln probability of data, Pr(D), for each K and the plot was obtained with Structure Harvester [34]. For graphic visualization of the Structure results, we utilized the ClumpAK algorithm [35]. This enabled us to compare the results of independent runs and to estimate the similarity across Q matrices, and a plot was constructed to illustrate the proportions of memberships of individuals to each inferred cluster estimated from highly similar runs.

Results
In total, 367 051 reads with average length of 229 bp were obtained from the three studied H. rechingeri individuals. The results revealed that 7886 reads (2.1% of total) contained microsatellite motifs with two, three, four, five and six nucleotide repeats, respectively. From these reads, 54 singleton and consensus sequences contained adequate flanking regions next to the microsatellite repeats enabling the design of primers (S1 Table). Twenty five loci were selected randomly to test for PCR amplification, and 19 primer pairs were successfully amplified in H. rechingeri and produced amplicons with the expected size. The nucleotide sequences of amplified loci were submitted to GenBank with accession number KX928682.1-KX928692.1 (Table 3).
Six loci comprised more than two alleles across all individuals from the studied species (Table 4). However H.rech482 consisted of only two invariant alleles throughout all individuals  We analyzed the genetic structure of the Heracleum populations with both model-based Bayesian structure and by distance-based analyses. Principal coordinate and clustering analyses showed that the group of thick-stemmed species including H. persicum, H. rechingeri, H. gorganicum and H. antasiaticum were recovered together as a single cluster (Figs 2 and 3).
The results of model-based clustering analysis suggested that the model with K = 2 was substantially better than other models (Fig 4a). A similar result to that observed by distance-based clustering was also obtained by model-based Bayesian genetic structure analyses. This exhibited two distinct genetic clusters, each corresponding to the thick-stemmed and thin-stemmed species groups, respectively (Fig 4b).

Discussion
Novel microsatellite markers were obtained by screening ddRADseq next generation data from H. rechingeri. The development of these microsatellite markers then allowed the successful cross-amplification of those loci in congeneric taxa. We found that these markers proved to be efficient in distinguishing between the major sections of Heracleum, including the thickand thin-stemmed species (Table 2). However, these markers were not able to distinguish and resolve sister taxa of the species complex mainly within the sections of Pubescentia and Wendia [14].

Efficiency of ddRADseq in developing microsatellite markers
Previous investigations have reported (GA) n as an abundant motif in various plant species (e.g. see [36] for a review). Our results, on the other hand, show that (AT) n is a far more abundant dinucleotide repeat, but (AG) n is also an abundant repeat in the taxa studied here. The percentages of total reads containing microsatellite motifs may vary among species and the types of sequencing methodology. A comparison amongst sequences obtained by traditional Sanger sequencing and NGS techniques, based on a review of 71 different studies, showed that NGS was approximately 40 times more effective in microsatellite marker development than Sanger sequencing [25]. In comparison, the development of microsatellite markers in Satureja bachtiarica and Heracleum persicum using 454 sequencing with an enrichment step for microsatellite loci, resulted in a discovery of 16% (8371 out of 51051 sequences) and 15% (3904 out of 25951) microsatellite motifs, respectively [2,31]. RADseq data have also recently been used to identify microsatellites markers in various organisms, including several plant species. The application of RADseq data in Solanum melongena and Cedrus atlantica, has shown that 0.02% (2000 reads) and 8.5% (5714 reads) consisted of microsatellite motifs, respectively [37,38]. Likewise, screening the next generation sequence data obtained by ddRADseq in Eleusine coracana and Taxus florinii have shown that 0.15% (10327 reads out of 6810971) and 1% (94851 of 8823053 reads) contain microsatellite motifs, respectively [39,40]. The output of the present ddRADseq analyses, 2.1%, is comparable to that reported in other studies.

Taxonomic application of microsatellite markers
Both model-based and distance-based clustering methods revealed two major genetic clusters, each representing a morphologically distinct group. Each genetic cluster consists of almost all samples from either thick-stemmed or thin-stemmed species, respectively. Our results also show that populations of H. persicum do not form a distinct group compared to the other closely related sister species, including H. rechingeri (Figs 2, 3 and 4b, Table 2). Based on taxonomic records there are few phenotypic traits that have been used as reliable taxonomic criteria to discriminate between H. persicum and H. rechingeri (Table 2). In Flora Iranica the presence of abaxial leaf trichome is listed as a diagnostic criteria in the latter species [14]. However, other molecular data from these species, including an ITS and trnL-trnF nucleotide analysis, together with a volatile oil composition study, also confirm our observed pattern which suggests a lack of differentiation amongst H. persicum and H. rechingeri populations [4,6,12], and perhaps that the abaxial leaf trichome is an unreliable trait in resolving H. rechingeri and H. persicum. Since inter-specific hybridization has been reported to be prevalent in Heracleum [2,3], the observed patterns can also be interpreted as a lack of fixed differentiation resulting from an introgression of alleles between the congeneric species.  H. pastinasifolium, H. transcacasicum, H.  persicum, H. rechingeri, H. gorganicum, H. lasiopetalum, H. antasiaticum and H.  H. antasiaticum has several distinctive morphological traits including leaf and fruit shape and trichome density, which when taken together, distinguish it from the other thick-stemmed species (Table 2). Our clustering analyses confirm the genetic differentiation among this and other species. Three thin-stemmed species, H. pastinacifolium, H. anisactis and H. transcacasicum are also clearly distinct from the former group. In addition, the three populations from H. pastinacifolium are not genetically differentiated from populations of a morphologically similar species, H. anisactis and H. transcacasicum. Similar studies using ITS and rpl32-trnL nucleotide sequences, or from the analysis of the chemical constituents of essential oils derived from the fruits, have also delimited the thin-and thick stem species, even though H. pastinacifolium and H. anisactis remain similar [4,12]. Limited numbers of cross-amplified microsatellite markers in H. lasiopetalum reveal a distant relatedness of this species within Heracleum. This species is morphologically different from the other study species and occurs in an alpine region of the Zagros mountains located in western part of Iran. This observation is also congruent with earlier results obtained from the few nuclear and chloroplast specific sequences that have been analysed to date [12].
Several taxonomic characters vary amongst Heralceum species, more specifically within the thick-and thin stemmed species complexes 14. However, there have been few reports of intraspecific variability in these traits. Our field observations indicate considerable variation between adult individuals within a given population. The observed morphological variability includes number of umbellate on the main umbel, which is described as the main taxonomic character that distinguishes H. pastinacifolium and H. anisactis [14]. This trait is highly variable within populations of H. anisactis and has overlapping trait variability values found in H. pastinacifolium. Interestingly, H. anisactis is listed as an endemic species in Iran with a narrower distribution range than that found for the H. pastinacifolium species. Our results suggest that H. anisactis and H. pastinacifolium are actually synonymous. Although the number of polymorphic microsatellite loci is limited in our study, the observed pattern is similar to that reported in earlier studies [4,6,12]. An additional systematic analysis by screening novel microsatellite markers, phenotypic and anatomical traits across several individuals for multiple populations within species might be required to depict the uniqueness of previously identified sister taxa.

Conclusions
This study demonstrates the utility of microsatellite markers in studying complex species with otherwise weak taxonomic signals. The present analysis suggests that the thin-stemmed species (H. anisactis and H. pastinasifolium) and thick-stemmed species (H. rechingeri and H. persicum) are more closely related than what is perhaps expected from their described diagnostic taxonomic criteria. Long sequence reads obtained by ddRADseq provide a valuable tool to develop and deploy novel microsatellite markers, and provide a time and cost-effective approach for a genus that contains both endemic and invasive species.
Supporting information S1 Table. Generic information of 54 microsatellite loci identified based on ddRADseq data in Heracleum. (XLSX)