Genome-Wide Analysis of Simple Sequence Repeats and Efficient Development of Polymorphic SSR Markers Based on Whole Genome Re-Sequencing of Multiple Isolates of the Wheat Stripe Rust Fungus

The biotrophic parasitic fungus Puccinia striiformis f. sp. tritici (Pst) causes stripe rust, a devastating disease of wheat, endangering global food security. Because the Pst population is highly dynamic, it is difficult to develop wheat cultivars with durable and highly effective resistance. Simple sequence repeats (SSRs) are widely used as molecular markers in genetic studies to determine population structure in many organisms. However, only a small number of SSR markers have been developed for Pst. In this study, a total of 4,792 SSR loci were identified using the whole genome sequences of six isolates from different regions of the world, with a marker density of one SSR per 22.95 kb. The majority of the SSRs were di- and tri-nucleotide repeats. A database containing 1,113 SSR markers were established. Through in silico comparison, the previously reported SSR markers were found mainly in exons, whereas the SSR markers in the database were mostly in intergenic regions. Furthermore, 105 polymorphic SSR markers were confirmed in silico by their identical positions and nucleotide variations with INDELs identified among the six isolates. When 104 in silico polymorphic SSR markers were used to genotype 21 Pst isolates, 84 produced the target bands, and 82 of them were polymorphic and revealed the genetic relationships among the isolates. The results show that whole genome re-sequencing of multiple isolates provides an ideal resource for developing SSR markers, and the newly developed SSR markers are useful for genetic and population studies of the wheat stripe rust fungus.


Introduction
Wheat stripe (yellow) rust is a devastating disease causing calamitous production losses of wheat, one of the most important cereal crops worldwide [1], endangering global food security.
It is caused by the obligate biotrophic parasitic fungus Puccinia striiformis f. sp. tritici (Pst) and occurs in most wheat areas with cool and moist weather conditions during the growing season [2]. Pst requires living host (wheat/grasses and Berberis/Mahonia spp.) to complete the asexual and sexual phases of its life cycle [3]. Its dikaryotic urediniospores are the main spores that infected wheat and are able to spread via the wind up to thousands of kilometers from the initial infection sites [4][5][6]. Pst populations often possess a high genetic diversity and virulence variability [7], giving them the ability to circumvent specific resistance genes incorporated in wheat cultivars within only a few years after release [5,8,9]. Understanding the mechanisms of the pathogen diversity and virulence variability is important for control of the disease.
Simple sequence repeats (SSRs), also known as microsatellites, are tandem repeats of 2-6 base pairs of DNA flanked by sequences that are generally unique in the genome, but conserved in the organism populations [10,11]. The unique flanking sequences may provide templates that facilitate the development of specific primers for amplifying SSR alleles by polymerase chain reaction (PCR). Allelic differences identified in this manner indicate variable numbers of repeat units present at SSR loci. SSRs have been used as molecular markers in many organisms because they are abundant, highly polymorphic and repeatable. Because SSR markers are often co-dominant, they are preferred over dominant markers for genetic studies of diploid or higher ploidy organisms [12][13][14][15][16][17][18][19].
For Pst, SSR markers have been developed using microsatellites enrichment DNA library [20] and expression sequence tag (EST) libraries [21][22][23], as well as genomic sequences [24], and are the primary molecular markers used in genetic and population studies. For example, SSR markers were used to reveal the existence of the sexual and parasexual cycles [25][26][27], estimate the genetic diversity of different Pst populations [28][29][30][31] and analyze the origin, migration routes and worldwide population structures of Pst [32]. However, only a small number of SSR markers were used in these studies. For more sophisticated studies, more SSR markers are needed.
Whole-genome sequence of multiple isolates can provides data for developing genetic markers. The first reported Pst genome is that of isolate PST-130 in the USA, which was obtained through next generation sequencing [33]. Bailey et al. [24] reported 25 SSR markers derived from the sequences of this isolate. The genome of the CYR32, one of the most dominant Pst races in China, was assembled using a 'fosmid-to-fosmid' strategy to reduce the affect of genome heterozygosity [34]. The assembled genome of CYR32 (~110 Mb) was larger than the draft genome of PST-130 (~64.8 Mb), but was comparable with the genome of PST-78 (~117Mb), another sequenced isolate from the USA (http://www.broadinstitute.org, unpublished). The genome assembly of CYR32 is suitable for the genome-wide analysis of SSRs.
In this study, we identified 4,792 SSR loci from the CYR32 genome sequence and determined the distribution patterns of SSRs in this isolate and five additional isolates from different countries. To provide the Pst community with more markers for more sophisticated studies, we developed a SSR marker database containing 1,113 SSR markers based on the genomes of the six isolates. We also validated 82 polymorphic SSR markers.

Pst Isolates
The sequence reads of six Pst isolates, 104E137A from Australia, CYR23 and CYR32 from China, Hu09-2 from Hungary, PK-CDRD from Pakistan and PST-78 from the USA, were used to identify polymorphic SSR loci [34]. For SSR marker validation, 21 isolates collected from the USA, Hungary and China were used (Table 1). Urediniospores were multiplied on wheat from a single urediniospore for each isolate and used for DNA extraction.

Pst genome sequences and INDEL discovery
The previously published genome sequence of CYR32 [34] was used as the reference genome. Sequence reads used in this study were obtained from PE (paired ends) libraries with 500bp inserts using the Illumina whole genome sequencing technology. High quality reads were extracted with NGSQCToolkit_v2.3.3 [35], SolexaQA_v.2.2 [36] and FASTX_Toolkit (http:// hannonlab.cshl.edu/fastx_toolkit/) software, and were aligned against the CYR32 reference genome using BWA-0.7.9a [37] software. The GATK [38] duplicate removal, INDEL (insertions or deletions) realignment and base quality score recalibration, and performed INDEL discovery across all six isolates were conducted simultaneously according to GATK Best Practices recommendations [39,40]. The average re-sequencing depth was 18.98, and genome coverage was 95.52%. A consensus genome sequence for each isolate was generated using Samtools-0.1.19 [41] software. The joint INDEL discovery across all six isolates was performed using Samtools-0.1.19. Only concordance INDELs, which were discovered by both methods [42] and passed a quality filter (QD>2.0, FS<200.0, ReadPosRankSum>-20.0) [39,40], were used to identify candidate polymorphic SSR loci.

SSR identification and primer design
The flow chart presented in supporting information S1 Fig illustrates the main steps used to develop SSR markers. SSR motifs were identified in the six genomes using a MISA script downloaded from http://pgrc.ipkgatersleben.de/misa/. Only perfect SSRs, including di-, tri-, tetra-, penta-, and hexa-nucleotide motifs with numbers of uninterrupted repeat units greater than 7, 6, 5, 4, and 4, respectively, were included in this study. In order to identify SSR loci with unique flanking sequences, repeat motifs and 300bp flanking sequences on each side were used for BLASTN search against the genomic sequences (evalue cut-off of 1e-10), and filtered with >90% identity and >85% alignment length of the flanking sequences. SSR loci with a single hit were identified as candidate loci for marker development. Primer pairs of SSR markers were designed using Primer 3 software (http://primer3. sourceforge.net/) with the following parameters: minimum, maximum, and optimal sizes were 18, 27, and 20 nt, respectively; minimum and maximum GC content were 20% and 80%, respectively; minimum, maximum, and optimal Tm were 57, 63, and 60°C, respectively; and product size range was from 100 to 300 bp.
The specificity of designed primers was confirmed through electronic polymerase chain reaction (E-PCR) [43,44] in the genomes using following parameters: the word size was 9, the discontinuous word was 1, the maximal deviation of observed product size to expected size was 100, both the number of mismatches and gaps allowed per primer were 1. The 84 previously published Pst SSR markers [20][21][22][23][24] were downloaded and amplified by E-PCR for comparison.
In addition, the genomic locations of SSRs were annotated. The exon, intron, and intergenic region were determined based on the original annotation of the CYR32 reference genome. The promoter regions were designated at 2 kb upstream of the start site of first exon. A protein was designated "secreted" if it contained a signal peptide and "pathogenicity-related" if it showed high similarity to a protein in the PHI database [45,46]. Moreover, a custom Perl script was used to directly identify polymorphic SSRs by the identical sequence position and nucleotide variation of identified INDELs and of developed SSR markers.

Experimental validation of polymorphic SSR markers
To experimentally validate putative polymorphic SSR markers, 104 primer pairs were synthesized by Sangon Biotech (Shanghai) Co., Ltd. Twenty-one Pst isolates in Table 1 were used in this experiment. Genomic DNA was extracted from urediniospores using a Fungal gDNA Kit (Biomiga). PCR reactions were performed in a 25μl volume containing 2.0μl template DNA (50ng/μl), 2.5μl reaction buffer (10×, Mg2+ free), 2.0μl Mg2+ (25mM), 2.0μl dNTPs (2.5mM), 1μl each primer (10mM), 0.2μl Taq DNA polymerase (5U/μl, Thermo Scientific) and 14.3μl ddH 2 O. The PCR conditions were as follows: 95°C for 4min; 35 cycles of 94°C for 45s, 55-58°C (varies for each primer pair) for 45s, and 72°C for 45s; and 72°C for 10min. PCR products were transferred directly from the thermocycler to the load tray of the Qiaxcel system. Separation was performed using the default OM500 method following the manual of QIAxcel DNA High Resolution Kit. The product sizes were automatically calculated in base pairs (bp), and gel views were exported using the QiaxcelScreenGel software. The number of alleles was recorded, and the polymorphism information content (PIC) was calculated. Statistical analysis was conducted by POPGENE version 1.32 software. Principal component analysis was conducted to show relationships of the 21 isolates based on genotypes of 82 polymorphic SSR markers using the NTSYSpc program [47]. A similarity matrix based on Dice coefficient was also generated using the SIMQUAL module in the NTSYSpc. Cluster analysis was conducted with the UPGMA (Unweighted pair-group method, arithmetic average) method in the SAHN module of the NTSYSpc. The COPH and MXCOMP modules were used to choose the dendrogram with the best fit to the similarity matrix. Robustness of branches of the dendrogram was determined by bootstrap analysis with the Winboot program [48].

The abundance of SSRs in the Pst genome
A total of 4,792 unique SSR loci were identified by searching through the genomes of the six Pst isolates from five countries with the MISA script. The number of SSRs varied among the different isolates (Table 2), ranging from 4,536 in PST-78 to 4,665 in CYR32 with the average of 4,576. Of the total of unique SSRs, 4,310 were common in all six isolates, and 492 in one or more, but not all isolates. Of the repeat motifs observed, the di-nucleotide motif was the most abundant (46.87%), followed by tri-(32.32%), penta-(9.39%), hexa-(5.95%), and tetra-(5.47%) nucleotide motifs ( Table 3). The average intervals for di-, tri-, tetra-, penta-, and hexanucleotide SSRs were 51.04, 74.32, 445.34, 257.61, and 411.99 kb, respectively ( Table 2). Considering the total of 4792 loci, the average interval was estimated as 22.95 kb, indicating the high abundance of SRRs in the Pst genome.

Screening for SSR candidate loci and developing Pst SSR markers
In order to identify candidate loci for marker development, all 4792 SSR motifs, together with their 300bp flanking sequences on each side were used for a BLASTN search against the genomic sequence. 1,446 loci (30.18%) produced single hits based on our cutoff across all six Pst isolates. Primers were designed for the 1,446 candidate loci with Primer3 software, and the specificity of these primers was validated in silico using E-PCR software, which result in 1113 primer pairs "amplified" expected bands and the remaining 333 "amplified" multiple bands or produced false matches (S2 Table). The 1113 SSR markers with specific primers form a new database of Pst SSR markers. This database accounts for 23.23% of the total 4792 identified loci (Table 3). In the database, SSRs with di-, tri-, tetra-, penta-, and hexa-nucleotide motifs accounted for 50.58%, 27.40%, 7.55%, 8.89%, and 5.57% of the SSRs, respectively (Table 3).

In silico comparison with previously reported SSR markers
A total of 84 SSR markers had been reported for Pst (S3 Table) prior to the present study.   RJ3N, RJ8N, RJ13N, and Pst002, respectively (S2B Fig). These results should be helpful in selecting SSR markers by reducing the possibility of using "different" primers to amplify the same SSR loci.

Distribution of SSR markers in different genomic regions
In order to investigate the distribution of SSR markers among different genomic regions, we mapped them to exon, intron, intergenic and promoter regions based on the original annotations of the CYR32 reference genome (Table 4). A high percentage of markers in our SSR marker database were located in intergenic region (39.26%), followed by promoter region (27.94%) and exon (20.13%), while those in intron were the fewest(12.67%). The 30 previously reported SSR markers were found mostly in exon (66.67%), because most were identified from EST libraries. Therefore, the newly identified SSR markers should provide more information in non-transcribed regions of the Pst genome. The SSRs containing tri-nucleotide repeats were the most frequent (>88.84%) among the five repeat types found in exon. Of the SSRs in exon, tri-and hexa-nucleotide SSRs that would not cause a frameshift accounted for 97.32%, while only 2.68% had potential to change the gene structure. We also calculated the frequency and percent of amino acids encoded by trinucleotide repeats in the corresponding 179 genes. These tri-nucleotide repeats encoded 11 kinds of amino acid (Table 5). Polar amino acids (QSTNG) accounted for 54.31% of those encoded, followed by acidic amino acids (ED,24.87%) and basic amino acid (HK,9.14%).
A total of 972 SSR markers in the developed database mapped in the exon, intron, promoter regions and within 2kb downstream regions of 828 Pst genes. Among them, 104 SSR markers were closely linked with secreted proteins which might be involved in the plant and pathogen interactions. In addition, there were 268 markers distributed among 228 genes found to be homologs of genes with characterized functions in the PHI database. More than half of these genes were related to reduced virulence or loss of pathogenicity (Fig 2), indicating that the genes may play important roles in infection and development process of the stripe rust pathogen [45,46,49]. These SSR markers may therefore facilitate the genetic study of these genes in Pst populations. In contrast, only 10 out of 30 previously reported markers were linked to PHI homologous genes, and only 1 marker was linked to a secreted proteins. These results further indicated the usefulness of the database in studying Pst.

Identification of polymorphic SSR markers using INDEL variation
In order to identify SSR markers with polymorphism, we identified INDELs present in the six isolates using GATK and Samtools software. As shown in S5 Fig, there Table), accounting for 9.43% of the newly developed SSR markers. The tree based on these INDELs (Fig 3A) showed a similar topology to the virulence-based tree (Fig 3B) for the six isolates except PST-78 from the USA, which harbored INDEL polymorphisms similar to isolates from Pakistan and Australia but had a much different virulence pattern. Moreover, the INDEL-based tree showed some correlation with geographical origin. The tree suggested that PST-78 (USA), 104E137A (Australia), and PK-CDRD (Pakistan) had the same origin, and that CYR23 (China), CYR32 (China), and Hu09-2 (Hungary) might have another origin (Fig 3C), which is consistent with a recent worldwide population structure study of Pst [32]. The SSRs with di-, tri-, tetra-, penta-, and hexa-nucleotide motifs accounted for 66.67%, 21.90%, 3.81%, 3.81%, and 3.81% of these 105 SSRs, respectively (Table 3). Of the 105 polymorphic SSR markers, 12 markers were in exons, 21 in introns, 31 in promoter regions, and 41 in intergenic regions (Table 4). Among them, 11 were associated with secreted proteins, while 24 were associated with pathogenicity-related genes. Moreover, only marker scaffold532-150474 shared a locus with a previously reported marker (SUNIPst09-40) (S2 Fig). Therefore, there are 104 markers would amplify loci which are not targeted by the previously reported 84 SSR markers.

SSR markers validated for quality and polymorphism
After excluding the SSR marker scaffold532-150474 sharing the locus which could be amplified with a previously reported marker SUNIPst09-40, the other 104 in silico polymorphic SSR markers were experimentally validated with 21 isolates (Table 1). Eighty-four of the markers generated specific bands, while 20 primer pairs failed to produce stable or clear bands due to a lack of sequence specificity in the genomic DNA samples. Of the 84 primer pairs, 82 revealed allelic difference among the 21 isolates tested (Fig 4, details are presented in S6 Table). For these polymorphic SSR markers, the motif length ranged from 14 to 42bp with an average of 19.48bp. Repeat numbers ranged from 4 to 15 units with an average of 7.96. In total, 477 alleles were amplified, with a range of 2-12 alleles at an average of 5.82 alleles. Product sizes ranged from 116 to 307bp. Polymorphic information content (PIC) values ranged from 0.17 to 0.88 with an average of 0.71. Observed heterozygosity (Ho) ranged from 0.00 to 0.95 with an average of 0.31, and expected heterozygosity (He) ranged from 0.17-0.88 with an average of 0.71.  Six markers were dominant in these 21 isolates, and the other 76 markers were co-dominant. The relatively high PIC and relatively low observed heterozygosity could be due to a higher conservation of these SSR loci among the 21 isolates, achieved by selecting unique flanking sequences in multiple Pst genomes. Based on the molecular genotypes of the 82 polymorphic SSR markers, both principal component analysis and cluster analysis separated these 21 isolates into four groups (Fig 5): group Ga with only two isolates from the US; group Gb with isolates from the US, China and Hungary; group Gc with isolates from Xinjiang, Qinghai and Tibet in China; and group Gd with isolates from Xinjiang and Gansu in China. This result was consistent with previous report that the Pst populations of China and the US in general evolved independently [31]. These results indicated that newly developed SSR markers were informative and useful; approximately 80.77% of the primers in our SSR database were effective. It was highly efficient (97.62%) to identify polymorphic SSR markers using the INDEL data based on whole genome re-sequencing of multiple isolates.

Discussion
SSRs are abundant and dispersed throughout Pst genomes. A total of 4,792 SSR loci were identified from six Pst isolates from five countries. This number is much larger than the 1,889 loci found in another genome-derived SSR identification [24] using the PST-130 sequence that cover only about 60% of the Pst genome [33]. The much larger number was achieved through the much better coverage of the sequence data from six isolates. This abundance of SSRs in Pst is neither the highest nor the lowest reported for fungi [50]. In the Pst genome, the di-and trinucleotide SSR motifs were more abundant than tetra-, penta-, and hexa-nucleotide motifs. SSR densities for different types of motifs varied from 51.04 to 411.99 kb. Moreover, we found that the AG/CT and ATC/ATG repeats were the most abundant di-and tri-nucleotide SSRs, respectively, in Pst. The abundance of AG/CT motifs is similar to the reports for other fungi, but our finding of abundant ATC/ATG motifs in Pst is different from other fungi [17,50], suggesting that Pst shares some common features in SSR loci formation and structure with other fungi, but also has its unique features. SSRs containing tri-nucleotide repeats were the most frequent (>88.84%) among the five repeat types found in exon, which was consistent with other reports [51][52][53][54]. Such phenomenon maybe due to the selection against frameshift mutations in coding regions [55]. More studies on the exon tri-nucleotide repeat SSR loci in comparison with other SSR loci may lead to a better understanding of the Pst evolution.
Because of abundance, easy to use, and highly polymorphic, SSR markers have been widely used in population studies of Pst [25,28,29,32]. However, the number of SSR markers publicly available prior to this study was limited [20][21][22][23][24]. In this study, we developed a database containing 1,113 SSR markers, which have unique flanking sequences across 6 Pst genomes. Moreover, the 1,113 markers were mostly abundant in intergenic regions. Compared with most of previously reported EST-SSRs developed based on EST libraries [21][22][23], these SSRs cover more non-transcribed regions, which can be advantageous in studies focusing on evolutionary or migratory studies of the pathogen without much selection. The 104 SSR markers identified among the six isolates were found to be closely linked with secreted proteins, should be useful makers to tag potential avirulence loci. In addition, 268 markers were found to be linked with 228 proteins that were homologous to proteins in the PHI database [45,46,49] and 53.51% of them were annotated as proteins related to reducing virulence or losing pathogenicity [49].
These results indicate that these markers may provide information of genetic variations of these genes and can be used for studying genes involved in infection and development process of Pst. For example, these markers could be used to study the associations between these genes and pathogenic traits in Pst population [56][57][58][59]. Of the 104 SSR markers used in experimental validation, 82 primer pairs were polymorphic among 21 isolates tested, and we determined the genetic relationships of the isolates using markers. Therefore, SSR markers identified in this study should be useful in a variety of applications, such as studying of population structures, mapping avirulence genes or genes for other important traits, and tagging or monitoring particular populations, virulence, and pathotypes of the pathogen.
Pst populations possess a high genetic diversity and virulence variability [7], giving them the ability to counteract the specific resistance genes in wheat cultivars [5,8,9]. Therefore, monitoring the pathogen populations is important for control of the disese. Traditionally, the Pst populations are monitored through virulence surveys [60] or just using a small number of molecular markers [28][29][30][31]. The availability of more markers can differentiate isolates better and improve our understanding of the genetic diversity and population structure of Pst in various agro-ecosystems. The population dynamics of Pst may provide useful information for rational deployment of resistance gene in wheat cultivars.
It is highly efficient to identify potential polymorphic markers by finding identical sequence positions and nucleotide variations of SSRs and of INDELs using whole genome re-sequencing of multiple isolates. In the present study, we identified 13,143 concordance INDELs using GATK and Samtools software across six Pst isolates from different countries. Then, 105 polymorphic SSR markers were identified in silico using proof of short tandem repeat INDELs from 1,113 SSR loci. From 84 markers generated specific bands, 82 primer pairs (97.62%) were polymorphic. In comparison, experimental testing those directly from potential markers usually results in less than 20% polymorphic markers [20][21][22][23][24]. Therefore, the strategy used in the present study further makes the development of useful SSR markers through the approach of genome re-sequencing of multiple isolates more efficient.
In conclusion, whole genome re-sequencing of multiple isolates is efficient for the developing SSR markers. Using the approach, we identified 4,792 SSR loci at an average interval of 22.95 kb for the stripe rust pathogen. We developed a database containing 1,113 SSR markers and validated the polymorphisms of 82 markers using 21 isolates. The SSR markers should be useful in various studies for a better understanding of the pathogen in order to more effectively control strip rust. The approach and methods shown in this study are applicable in developing SSR markers for any other organisms.