Development of Genomic Microsatellite Markers in Carthamus tinctorius L. (Safflower) Using Next Generation Sequencing and Assessment of Their Cross-Species Transferability and Utility for Diversity Analysis

Background Safflower (Carthamus tinctorius L.), an Asteraceae member, yields high quality edible oil rich in unsaturated fatty acids and is resilient to dry conditions. The crop holds tremendous potential for improvement through concerted molecular breeding programs due to the availability of significant genetic and phenotypic diversity. Genomic resources that could facilitate such breeding programs remain largely underdeveloped in the crop. The present study was initiated to develop a large set of novel microsatellite markers for safflower using next generation sequencing. Principal Findings Low throughput genome sequencing of safflower was performed using Illumina paired end technology providing ~3.5X coverage of the genome. Analysis of sequencing data allowed identification of 23,067 regions harboring perfect microsatellite loci. The safflower genome was found to be rich in dinucleotide repeats followed by tri-, tetra-, penta- and hexa-nucleotides. Primer pairs were designed for 5,716 novel microsatellite sequences with repeat length ≥ 20 bases and optimal flanking regions. A subset of 325 microsatellite loci was tested for amplification, of which 294 loci produced robust amplification. The validated primers were used for assessment of 23 safflower accessions belonging to diverse agro-climatic zones of the world leading to identification of 93 polymorphic primers (31.6%). The numbers of observed alleles at each locus ranged from two to four and mean polymorphism information content was found to be 0.3075. The polymorphic primers were tested for cross-species transferability on nine wild relatives of cultivated safflower. All primers except one showed amplification in at least two wild species while 25 primers amplified across all the nine species. The UPGMA dendrogram clustered C. tinctorius accessions and wild species separately into two major groups. The proposed progenitor species of safflower, C. oxyacantha and C. palaestinus were genetically closer to cultivated safflower and formed a distinct cluster. The cluster analysis also distinguished diploid and tetraploid wild species of safflower. Conclusion Next generation sequencing of safflower genome generated a large set of microsatellite markers. The novel markers developed in this study will add to the existing repertoire of markers and can be used for diversity analysis, synteny studies, construction of linkage maps and marker-assisted selection.


Introduction
A member of the family Asteraceae, Safflower (Carthamus tinctorius L.) is a diploid (2n = 24), mostly self-pollinating dicot with an estimated haploid genome size of 1.4 GB [1]. The crop is grown in wide geographical zones across the world [2] with Kazakhstan and India currently dominating safflower production [3]. It is a multi-purpose crop employed for diverse uses such as dye production, edible oil extraction and for medicinal applications [4]. It has also been exploited for production of biofuel and industrial oil [5,6]. Recently, transgenic safflower has been employed as a plant factory for production of important pharmaceuticals of human interest such as insulin and apo lipoprotein [7][8][9]. Considering the desirable oil composition of safflower and its resilience to dry conditions, it can serve as an important source of edible oil especially in arid regions of the world. However, undesirable features such as low yield, spiny nature and susceptibility to several biotic stresses have reduced its cultivation in several regions including India [10].
Conventional breeding programs in several crop species have resulted in the development of cultivars with improved yield and increased resistance to several diseases. Improvements can be achieved more efficiently and faster through analysis of global genetic diversity existing in the crop for selection of elite genotypes and by molecular breeding approaches [11]. Application of molecular markers in crop breeding has proven to be a powerful method for improvement of several crop species [12]. A prerequisite for successful implementation of molecular breeding in crops is the availability of strong molecular marker-trait association [11]. A comprehensive program to increase yield is essential for safflower improvement [13]. However, safflower genetics and genomics are largely unexplored and scarcity of reliable molecular markers in safflower is a major limitation for development of effective molecular breeding programs in the crop [14,15].
A wide range of dominant markers such as random amplified polymorphic DNA (RAPD), amplified fragment length polymorphism (AFLP), inter-simple sequence repeat (ISSR) and sequence-related amplified polymorphism (SRAP) have been used for assessing the genetic diversity of safflower [16][17][18][19][20]. However, the dominant inheritance pattern of these markers does not allow detection of allelic information, which is important for crop breeding.
Conversely, co-dominant markers allow detection of allelic diversity but in safflower, the repertoire of co-dominant markers is limited. Since their discovery in early 1980's, microsatellite markers or SSRs (simple sequence repeats) have gained importance owing to their co-dominant inheritance, multi-allelic nature, wide genome coverage, high reproducibility, high polymorphic index, adaptability to automation, high throughput genotyping as well as efficient transfer to closely related species making them valuable tools for genetics and breeding [21][22][23]. In safflower, earlier studies used conventional methods of library enrichment or EST databases for development of microsatellite markers. Chapman et al. [24] generated 104 polymorphic EST-SSRs for linkage mapping in safflower. Naresh et al. [25] reported five EST-SSRs used for testing the purity of safflower hybrids. Hamdan et al. [26] isolated 64 polymorphic genomic SSRs from an enriched genomic library of safflower. However, these methods have high development cost and low throughput restricting the use of microsatellite markers [27]. Next generation sequencing (NGS) provide resources for high-throughput SSR development at a lower cost [28,29]. Mining of NGS data for development of microsatellite markers has been exploited in a variety of plant species viz., pigeon pea, chrysanthemum, chokecherry, grass pea [30][31][32][33]. In safflower, Lee et al. [34] reported thirty polymorphic microsatellite markers derived from pyro-sequencing data while Pearl et al. [35] reported first set of 244 single nucleotide polymorphism (SNP) markers. Nonetheless, till date, only 203 polymorphic SSR markers have been reported indicating an urgent need for enrichment of robust co-dominant markers in safflower.
The genus Carthamus includes 18 species of which, C. tinctorius L. is the only cultivated species [1]. The wild species of Carthamus are known to harbor several agronomically desirable traits, which were lost during the course of safflower domestication [36][37][38]. Transferability of microsatellite markers to closely related species and genera would assist in the identification of marker-trait associations, which could be used for introgression of desirable loci from the wild species to cultivated safflower thus broadening its gene pool. Such markers would also be useful for synteny studies, identification of progenitor species and the study of genome evolution in the crop.
The current study exploited the efficiency of next generation sequencing data for analysis of microsatellite fraction present in safflower genome and derivation of a large set (5,716) of novel microsatellite markers for safflower. A subset of 325 microsatellite markers was experimentally validated using twenty-three geographically diverse safflower accessions. In addition, cross species transferability of polymorphic markers was assessed. Markers generated in this study would serve as important resources for population genetics, construction of linkage maps and marker-assisted selection in the crop.

Plant material and genomic DNA extraction
An accession of C. tinctorius L. (PI No: 560175) with high oil content (44%) was used for Illumina paired-end sequencing. A geographically diverse set of 23 safflower accessions belonging to seventeen countries was used to test the developed microsatellite markers. Cross species transferability of polymorphic microsatellite markers was tested using nine wild relatives of C. tinctorius L., including the probable progenitor species (C. oxyacantha and C. palaestinus). Seed samples were obtained from USDA-ARS, WRPIS, Pullman, WA, USA and IPK Gene Bank, Germany. Detailed information on plant material used in this study is given in Table 1.
Leaf material was harvested from 10-week-old plants of each accession and total genomic DNA was isolated using CTAB method [39]. The qualitative and quantitative analysis of extracted DNA was done by electrophoresis on a 0.8% agarose gel and using a NanoDrop spectrophotometer (NanoDrop, Wilmington, DE).

Identification of microsatellites, functional annotation and development of primer pairs
The assembled sequences were mined for perfect microsatellites using an in-house developed Perl script (S1 Script). The script contains separate modules for different SSR types (di-to hexa-nucleotides) and provides the details in terms of SSR type, repeat number, start and end position of repeat in the query sequence, total length of repeat and the complete sequence.
Clustering was performed using CD-HIT on the identified sequences harboring microsatellites (clustering criteria; similarity > = 90% and 80% length coverage) to remove redundancy. Imperfect and compound SSR types were not included in the analysis. Functional annotation of the retrieved microsatellite sequences was performed using web-based automated annotation pipeline, FastAnnotator using default parameters (http://fastannotator.cgu.edu.tw/) [43]. Sequences containing microsatellites with repeat length of 20 bases (10 units for di-; 7 units for tri-, 5 units for tetra-nucleotides) and optimal flanking regions ( 30 bases on both flanks of microsatellite) were used to design primers. The web-based program, BatchPrimer3 version 1.0 (http://probes.pw.usda.gov/batchprimer3/) [44] was used for designing primer pairs with following parameters: primer length 18-28 bases; product size ranging from 100bp-500bp; optimum annealing temperature between 50°C to 65°C and GC content of 40% to 80% with an optimum value of 60%. Other parameters were used at default setting. Blast+ version 2.2.26 (ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.2.26/) was used to query the previously reported SSR markers [24][25][26]34] against the marker set for which primers were designed in the current study.BLASTN hits with an E value less than 1x10 -13 were considered significant.

Validation of microsatellites
Primers were synthesized at Integrated DNA Technology, USA. Genomic DNA of two safflower accessions (PI: 560172 and 560175) was used as template for standardization of PCR conditions. The PCR was conducted in a total reaction volume of 15μl containing 50 ng of template DNA, 1X PCR buffer, 2mM of MgCl 2 , 0.2mM of each dNTP, 0.3mM each of forward and reverse primers and 1.25 U of Taq DNA polymerase (Biotools, Spain). Amplifications were performed in a Veriti thermal cycler (Applied Biosystems, USA) with the following cycling conditions: Initial denaturation at 96°C for 5 mins followed by 28-30 cycles of 96°C for 45s, primer annealing temperature (Tm; optimized for each primer pair; ranging between 55°C to 65°C) for 30s, DNA extension at 72°C for 1 min and a final extension at 72°C for 7 mins. The generated amplicons were analyzed on 2% agarose gel for product size and amplicon quality.

Genotyping and cross species transferability
Primer pairs producing a clear unambiguous band were used for genotyping a panel of 23 safflower accessions ( Table 1). The polymorphic markers were further assessed for their cross species transferability in nine wild relatives of C. tinctorius L. (Table 1). For polymorphic markers, M13 tailing of the PCR product was adopted as described earlier [45,46]. The labeled PCR products were analyzed on 6.5% PAGE using 4300 DNA analyzer system (LICOR, USA).

Marker diversity analysis
Statistical analyses of genetic data [Average number of alleles per locus (Na), gene diversity per locus (He), observed heterozygosity (Ho) and polymorphic information content (PIC)] of microsatellite markers were evaluated using POWERMARKER version 3.25 (http://www. powermarker.net) [47]. Cluster analysis for polymorphic microsatellite loci across the tested panel was performed using DARwin version 5.0.158 (http://darwin.cirad.fr/darwin) [48] based on simple matching coefficient.

Results and Discussion
Genome sequencing of C. tinctorius L.
Illumina paired-end technology was used for sequencing the safflower genome. We obtained 48,502,680 raw reads with an average read length of 101 bases which provided~3.5X coverage of the genome. The average quality score (Q) of raw reads was >25 in all the sequencing cycles (Fig 1). The raw sequences were checked for sequence artifacts such as low quality reads and adaptor contamination using NGS QC Toolkit [40]. A total of 44,164,564 (91.06%) high quality filtered reads were obtained with 98.1% bases showing a Q value of >20.

Discovery of microsatellites
An in-house developed Perl script (S1 Script) was used for mining perfect microsatellites from the clustered genomic data. Perfect repeats were selected as these are known to have higher mutation rates than imperfect loci and are expected to therefore, yield more polymorphism [54]. Additionally, more allelic variation is observed with increasing number of repeats [55]. Thus, sequences with repeat length < 20 bases were not analysed as these may not be significantly polymorphic. Following these criteria, we identified 31,390 microsatellite sequences which were further filtered to remove redundancy using CD-HIT and a non-redundant set of 23,067 putative microsatellite loci was obtained.
Significant heterogeneity was observed in frequency, motif type and repeat length of SSRs in safflower. Di-nucleotides were the most frequent type of repeats representing 71% of total SSRs, followed by tri-(10%), tetra-(5.8%), penta-(5.13%) and hexa-nucleotide repeats (4.26%; Fig 2). Di-nucleotides are known to be the most represented SSRs in genomes of several plant species viz., pigeonpea, mungbean, sweet potato and sesame [30,54,56,57]. The safflower genome was highly enriched with AT/TA repeats accounting for 57.65% of all the dinucleotide motifs followed by AG/TC (27.5%) and AC/GT (14.8%) repeats. This is in consonance with the observations of Lee et al., [34] who isolated microsatellites based on pyro-sequencing of the safflower genome. In general, plant genomes have been reported to be rich in AT repeats [58][59][60]. We did not obtain any CG/GC repeat in the analyzed data (Fig 3 and Fig 4A). Among trinucleotide repeats, AAT was the most common motif (35.6%) followed by AAG (25%). However, Lee et al., [34] reported ACC to be the most frequent tri-nucleotide repeat (27%) in safflower. This variation could be due to differences in the quantum of data generated in the two studies. While Lee et al. [34] reported 1100 contigs with SSRs, we obtained a significantly higher number of SSR-containing contigs (23,067) which may represent a more accurate distribution of SSR frequencies. Another factor could be the inherent bias observed in individual sequencing runs [61,62] that could have led to differences between the two studies. The least frequent tri-nucleotide motif was CGC for which only one locus was detected and it represented the only GC-rich trimer repeat obtained in the current study (Figs 3 and 4B). Low frequency of GC-rich repeats was also reported in genomic sequences of other crops [52,53]. Among tetra-nucleotide repeats, the ACAT motif was the most predominant (Fig 3).
The microsatellite motifs were also assessed for their repetitive unit length. The reiteration number of a SSR motif ranged from 4 to 24 and di-nucleotides were found to have greater number of reiteration units, which gradually decreased in higher motif types (Fig 5).

Functional annotation of microsatellite sequences
In order to study the potential functional significance of 23,067 microsatellite sequences, annotation was performed using FastAnnotator [43], which reported the average length and GC content of these sequences to be 300 bp and 30%, respectively. More than 50% of sequences (~13,000) were greater than 200 bp in length and the N50 of these sequences was 383 nucleotides. Out of the total set analyzed, 2,611 sequences were found to have similarity with sequences in the NCBI non-redundant protein database and 1,003 sequences (4.3%) were found to have at least one functional annotation. Around 738 sequences were assigned gene ontology (GO) while 99 sequences contained at least one domain. Ten sequences were found to be common among all annotation categories while 155 sequences were found to be common among GO and domain categories. Only one sequence was found to share the GO and enzyme annotation (Fig 6). S1 Table provides detailed information regarding annotated SSR sequences.  [63]. The present study thus reports a novel set of microsatellites, which might be correlated with the expressed components of safflower genome.

Validation of SSR markers
Primer pairs were designed for microsatellite loci using BatchPrimer3 version 1.0 [44] which allows detection of SSR motifs and designs primers from flanking regions. Out of 23,067 microsatellite loci identified above, 5,737 loci were recognized with sufficient flanking region and fulfilled the criteria for primer design (see Methods). Homology search of 5,737 microsatellite loci against the previously reported SSRs in safflower [24][25][26]34] was performed. The BLASTN results revealed that 14 and 7 sequences had significant similarity (E value < 1x10 -13 ) with SSRs previously reported by Chapman et al. [24] and Lee et al. [34], respectively. These sequences were removed from the analysis leading to the identification of 5,716 novel microsatellites in safflower.  A subset of 325 microsatellite loci, designated as NGSaf_1 to NGSaf_325, was chosen for experimental validation and included di-(58), tri-(257) and tetra-nucleotide (10) repeats. These sequences have been submitted in the NCBI GenBank database under the accession numbers KM670560-KM670883 (S2 Table). High representation of trimeric repeats (79%) was selected to increase the probability of their presence in the coding regions [64]. It is believed that selective forces do not allow expansion of any repeat type other than trinucleotides in coding regions to avoid frame shift mutations that could alter protein functionality [60]. These repeats therefore, have a greater probability for stronger marker-gene/trait association and a high rate of transferability across species. Details of untested primer pairs are provided in S3 Table. Out of 325 tested SSR primers, 294 (90.4%) generated high quality reproducible amplicons of expected size. Thirteen primer pairs failed to provide any PCR product and 18 primer pairs produced multiple amplicons, which were difficult to evaluate and were excluded from further analysis (S2 Table). The process of SSR development is subjected to attrition at each step. A mean 50% attrition between primer design and successful amplification of SSR loci has been reported in earlier studies [52,65]. Lee et al. [34] tested 509 primer pairs in safflower, of which only 302 (59.3%) produced successful amplification. The higher rate of successful amplification in the current study (90.4%) could be due to improved generation and analysis of sequencing data.

Polymorphism analysis and cross species transferability of SSR markers
Based on their geographical origin, 23 safflower genotypes from 17 countries of the world (Table 1) were selected for testing the discriminatory potential of validated microsatellite  markers. Out of the 294 microsatellite loci which produced robust amplification, 93 (31.6%) were polymorphic among the studied genotypes ( Table 3). The average number of alleles (Na) per locus varied from 2 to 4 while the mean observed heterozygosity (Ho) and mean gene diversity (He) were 0.0494 and 0.3746, respectively. The low level of observed heterozygosity might be attributed to the highly self-pollinating nature of the crop. Sixty-eight polymorphic markers revealed 2 alleles, 23 markers yielded 3 alleles while 2 markers detected 4 alleles among the 23 accessions. In total, 213 alleles could be identified in the assessed genotypes using 93 polymorphic SSR loci. Polymorphism information content (PIC) was calculated for each marker and ranged from 0.0416 (NGSaf_43, 91, 115) to 0.5602 (NGSaf_69) with a mean PIC of 0.3075 (Table 3).
In our data, some repeat motifs were found to be more polymorphic among the studied accessions than other repeat types. The repeat motif 'AG' exhibited highest polymorphism (57.1%) among all analyzed repeats followed by 'AAT' (46.4%). Detailed information regarding motif type and percent polymorphism is given in Table 4. It has been reported that different taxa exhibit different preferences for SSR types [59]. This information would help in selection of motif types and increase the probability of finding polymorphic markers in safflower.
The cross species transferability of polymorphic SSR loci was also assessed in nine wild species of safflower (Table 1). Each primer pair, except NGSaf_307, was found to be amplifiable in two or more of the tested wild relatives (S4 Table). Twenty five SSR markers showed 100% transferability to all the wild relatives. The highest rate of cross transferability of markers was observed in C. oxyacantha (97%) followed by C. palaestinus (87%) while markers were found to be least transferable in C. tenuis (42%) (Fig 8). Based on cytogenetic studies, C oxyacantha and C. palaestinus, had been proposed as the possible progenitors of cultivated safflower [66]. The high rate of cross species amplification of SSR markers obtained in the present study supports the earlier observations on homology between these species and their possible contribution to the safflower genome [67].
Cluster analysis for assessment of phylogenetic relationships in Carthamus sp.
Cluster analysis based on simple matching coefficient was used to assess the genetic relationships between C. tinctorius (safflower) genotypes and related wild species (Fig 9). The analysis grouped the studied accessions into two major clusters (I and II). All safflower genotypes, irrespective of their geographical origin, clustered in a single group (subgroup Ia) although some indicative groupings were observed for genotypes from USA and the European gene pool. Inclusion of more accessions from these geographical zones might be useful in identifying regional gene pools in the crop. The two wild species, C. oxyacantha and C. palaestinus, grouped together in subgroup Ib of cluster I. The clustering of C. oxyacantha and C. palaestinus along with C. tinctorius genotypes in cluster I supports the hypothesis that these wild species are more closely related to cultivated safflower than the other wild relatives. All the other wild species grouped together in Cluster II. Distinct clustering of accessions with differences in chromosome number was also observed. All Carthamus accessions with chromosome number n = x = 12 grouped together in cluster I. Cluster II segregated into two subgroups. Cluster IIa contained the diploid wild species with basic chromosome number = 10 (C. glaucus subspecies anatolicus, C. boissieri Halacsy and C. tenuis) while the tetraploid relatives (C. lanatus subspecies creticus, C. lanatus, C. lanatus subspecies turkestanicus and C. lanatus subspecies lanatus; basic chromosome number = 11) grouped in Cluster IIb.
Cross species amplification of microsatellite markers improves with decreasing phylogenetic distances [68]. The family Asteraceae is reported to have a low level of genetic conservation   Table 3.  Table 3. resulting in limited transferability of microsatellite markers across different genera [69]. Crossgenera transferability of microsatellite markers from sunflower has been shown to be inadequate and of limited use in safflower [14,70]. Molecular markers developed in the current study demonstrated a high rate of interspecific amplification (ranging from 42% to 97%) within the genus Carthamus. We have also established the efficiency of these markers in elucidating the genetic relationships between members of the genus Carthamus. These markers could also be used for synteny studies between cultivated and wild species of safflower.

Conclusion
In conclusion, our study provided an insight into the microsatellite components of safflower genome. Using next generation sequencing data, a large set of 5,716 novel microsatellite primers were designed of which, 325 markers were experimentally validated. Ninety-three markers were found to be polymorphic among the studied accessions. These markers were successfully used for genetic analysis in C. tinctorius L. and also showed significant cross species transferability in related wild species. Our data supports C. oxyacantha and C. palaestinus as the possible progenitors of cultivated safflower. We were also able to distinguish between various wild species with differing basic chromosome numbers. Markers generated in this study will enhance the current repository for safflower and would be useful in crop improvement programs. The current study also supports the efficiency of next generation sequencing data in providing faster and reliable resources for marker development in non-model crops.