Development of a Gene-Centered SSR Atlas as a Resource for Papaya (Carica papaya) Marker-Assisted Selection and Population Genetic Studies

Carica papaya (papaya) is an economically important tropical fruit. Molecular marker-assisted selection is an inexpensive and reliable tool that has been widely used to improve fruit quality traits and resistance against diseases. In the present study we report the development and validation of an atlas of papaya simple sequence repeat (SSR) markers. We integrated gene predictions and functional annotations to provide a gene-centered perspective for marker-assisted selection studies. Our atlas comprises 160,318 SSRs, from which 21,231 were located in genic regions (i.e. inside exons, exon-intron junctions or introns). A total of 116,453 (72.6%) of all identified repeats were successfully mapped to one of the nine papaya linkage groups. Primer pairs were designed for markers from 9,594 genes (34.5% of the papaya gene complement). Using papaya-tomato orthology assessments, we assembled a list of 300 genes (comprising 785 SSRs) potentially involved in fruit ripening. We validated our atlas by screening 73 SSR markers (including 25 fruit ripening genes), achieving 100% amplification rate and uncovering 26% polymorphism rate between the parental genotypes (Sekati and JS12). The SSR atlas presented here is the first comprehensive gene-centered collection of annotated and genome positioned papaya SSRs. These features combined with thousands of high-quality primer pairs make the atlas an important resource for the papaya research community.


Introduction
Papaya (Carica papaya Linneaus) is an economically and nutritionally important fruit tree of tropical and subtropical regions. Papaya is well known for its nutritional benefits [1], medical [2] and industrial [3] applications. Due to its commercial importance, papaya production is currently ranked as the third major global production among tropical fruits [4]. Notwithstanding the increased papaya trade, a limited number of cultivars are commercially available, hampering papaya production worldwide. Further, the low genetic diversity of selected C. papaya cultivars [5][6][7] makes the species susceptible to bacterial and viral infections [8,9]. To improve disease resistance, genetic diversity and productiveness, researchers have been using molecular markerassisted selection (MAS), a well-established procedure employed in commercial breeding programs to enhance the gain from artificial selection.
Microsatellites, also known as simple sequence repeats (SSRs), are simple tandemly repeated DNA sequences, ranging from 2-6 base pairs per repeat unit [10]. These repeated sequences are highly variable in length, mainly due to unequal recombination events or DNA polymerase slippage. Microsatellite PCR amplification has been described as a reliable, rapid, and inexpensive technique. Combined with the highly polymorphic nature and codominant segregation of SSRs, PCR amplification of microsatellites is a powerful technique for plant breeding and genetic studies, such as MAS, population genetic analysis, quantitative trait locus (QTLs) mapping, DNA fingerprinting, and genome mapping [11][12][13][14][15][16].
Once a labor-intensive and time-consuming process, identification of new microsatellite markers became increasingly feasible with the improvement of molecular biology techniques and availability of genomic information for several plant species. Over the past decade, SSRs derived from expressed sequence tags (EST-SSRs) markers emerged as a feasible alternative in marker development for several crop species [17]. EST-SSRs are transcribed from coding sequences (CDS), which tend to be conserved between species, high interspecies transferability rates can be achieved [18,19]. Moreover, CDS markers are more informative than intergenic 'anonymous' markers because they are more likely to be functional [20,21]. However, there are also a few disadvantages of using CDS markers. High conservation may result in low polymorphism rates, which are of limited use in MAS studies. In addition, primers designed for exon-intron junctions may result in PCR amplification failures.
With the increasing number of sequenced genomes, SSRs can be computationally detected and classified according to their genomic locations in 59 untranslated regions (UTRs), exons, introns and 39 UTRs. By using this strategy, exon-intron junctions can be avoided during primer design and fully exonic markers can be selected. Completely sequenced genomes also allow the selection of intronic markers which are more polymorphic than exonic markers and segregate with a particular gene that may be associated with a biochemical function or phenotype of interest [22,23].
In the present work we describe the analysis of papaya SSR markers in a genome-wide scale, integrating SSR positioning and functional annotation data. Stringent primer design criteria were used to allow better results in genetic studies. This complete catalog will be of great value for the papaya research community, especially for groups conducting MAS projects. Using this map, researchers will be able to filter and choose interesting markers according to SSR type, length, sequence, region location (exon, intron or intergenic), linkage group and gene annotation.

Genomic data and SSR annotation
Papaya genome assembly and annotation files were downloaded from Phytozome v7.0 FTP (http://www.phytozome.net/) [24]. The genome assembly 113 consists of 244.5 Mb of gapless sequences, distributed in 3,207 scaffolds and 2,693 contigs. Genomic locations of detected SSRs were integrated with gene and exon coordinates from the reference GFF file. Based on their genomic mapping, SSRs were categorized as exonic (entirely within the CDS), exon-intron (in exon-intron boundaries), intronic (within introns) and intergenic (outside of genic regions). Identifiers and annotations based on papaya-Arabidopsis thaliana homology were obtained for genic SSRs. Papaya genes were also annotated using Gene Ontology (GO) terms from the Plant Ontology Tool

Simple Sequence Repeats Identification
Exact maximal repeats were detected in the papaya genome using mreps [25]. Perfect repeats with more than 12 nucleotides, motif lengths of 2-6 bp and at least 2 units of repetition were analyzed. Parameters were set as follows: -r 0 -minsize 12minperiod 2 -maxperiod 6 -exp 2. The mreps algorithm finds exact maximal repeats, removes redundancy by selecting the best period for each repeat, merges repeats with same period and eliminates statistically insignificant expected repeats [25].
For description of di-and trinucleotide motifs, circular permutations and complementary strand nucleotides were considered as equivalents and grouped in one class after determining the individual repeat frequencies. Thus, there are four possible dinucleotide motifs and ten possible tri-nucleotide motifs. For example, motifs AG, GA, CT and TC are equivalent and grouped as AG/GA/CT/TC. Likewise, motifs ACG, CGA, GAC, CGT, TCG and GTC are also equivalents and represented as ACG/ CGA/GAC/CGT/TCG/GTC.

Primer design
SSRs were retrieved from the genome with 250 bp upstream/ downstream flanking regions and had their low complexity regions masked with DustMasker [26]. Primers were designed with the standalone version of Primer3 [27] using the following parameters: primer length between 18 and 25 nucleotides (optimal length = 20 nt), melting temperature between 57 and 63uC (optimal Tm = 60uC), PCR product size between 250 and 350 bp (optimal 300 bp), GC content of 20-60%, and PRI-MER_MAX_HAIRPIN_TH = 24. All SSR sequences were also used as a repeat library (option PRIMER_MISPRIMING_LI-BRARY) to avoid primer design within the SSRs. Primers with low complexity regions were discarded.

Analysis of genes involved in fruit ripening
As a proof of concept, our gene-centered SSR map were used to identify SSR markers for genes potentially involved in cell wall remodeling, transcriptional regulation and hormone signaling. Genes with differential expression in tomato ripening fruits (determined by RNA-Seq) [28] were used to identify homologous genes in papaya. Fifty-three cell wall and 222 transcription/ ethylene proteins were used as BLASTP queries to search the papaya predicted proteins with the following criteria: E-value # 1e-30, similarity of at least 50%, query and hit coverage of at least 75%. Tomato protein sequences and annotations release ITAG2.3 were downloaded from the Sol Genomics Network FTP [29] (ftp://ftp.solgenomics.net/).

SSR screening and polymorphism survey
The papaya genotypes Sekati and JS12 were used for screening polymorphic genic SSRs. Total genomic DNA was extracted from young leaves according to the CBAB method [30]. A total of 73 primer pairs comprising ,8 genic SSR regions per chromosome were selected. PCR amplifications were performed in 15 mL reaction, containing 10 ng DNA, 10 mM Tris-HCl, pH 8.3, 50 mM KCl, 2 mM MgCl 2 , 100 mM dNTPs, 0.2 mM of each primer, and 1 U Taq DNA polymerase. PCR cycling was performed in an Eppendorf thermal cycler, according to the following profile: 4 min of denaturation at 94uC, 35 amplification cycles (94uC at 30 s, 58uC at 1 min, 72uC at 3 min), followed by a final extension of 7 min at 72uC. In 4 of these cases, the primer annealing temperature was set to 65uC. Amplified products were separated in a 4% agarose gel Metaphor, stained by GelRedTM/ Blue Juice mixture (1:1) and visualized through the MiniBis Pro photodocumentation system (DNR Bio-Imaging Systems Ltd., Jerusalem, Israel).

Data Access and Retrieval
Information regarding SSR identifiers, genomic coordinates, motif sequence, period, size and exponent, genomic location (i.e. exon, intron, exon-intron, intergenic), linkage group, and gene annotations are fully available in two user-friendly spreadsheets (Table S1, Table S2).

SSR classification and genomic positioning
We identified 160,318 SSRs with a density of 656 SSR/Mb in the most recent version of the papaya genome (see methods for details). Previous studies of papaya SSRs reported densities of 1,340 SSR/Mb [31] and 746 SSR/Mb [32]. Such disparities in the number of identified microsatellites are usual among different reports, mainly due to differences in the algorithms, parameter settings, minimal repeat length and redundancy filtering [33][34][35]. Differently from other two previous studies [31,32], we used the mreps algorithm, which was ranked as the best algorithm for repeat detection in a systematic study [35]. Specifically, mreps does not report all the overlapping repeats, but efficiently retains only the most credible overlapping ones, giving more reproducible and reliable results.
We detected 160,318 perfectly matching, non-redundant SSRs. After integrating SSRs and gene coordinates, we found that 36%  Table 3. Distribution of Class I and Class II SSRs in different genomic regions. of the papaya genes (9,992/27,769) have at least one SSR. A total of 21,231 SSRs were identified in genic regions, while 139,087 were intergenic (Figure 1). Because UTR annotations are not available for the papaya genome, SSRs located on these regions were not classified as such. As expected, most SSRs are intergenic (86.8%), followed by intronic (9.9%), exonic (3.3%), and only 73 (0.04%) SSRs in exon-intron boundaries (Table 1). Dinucleotide motifs were abundant in intergenic (39.4%) and intronic regions (45.3%), while tri-to hexanucleotides were uniformly distributed in such regions. On the other hand, exons and exon-intron boundaries were enriched in tri-(67.1% and 37%) and hexanucleotides (24.4% and 42.5%), which is expected due to the selective pressure against frameshift mutations in coding regions.
Based on repeat length, papaya SSRs were defined as class I ($ 20 nucleotides) and class II (between 12-19 nucleotides) (Figure 2). SSR lengths ranged from: 12-82 bases in exons; 12-30 bases in exon-intron boundaries; 12-201 bases in introns and 12-155 bases in intergenic regions. Most SSRs were classified as class II (79.0%-84.1%), regardless of their genomic location (Table 3). However, 24,234 primer pairs were designed for class I SSRs (Table S3). Since these longer sequences are typically hypervariable and more likely to be polymorphic, they are the preferable choice as molecular markers for diversity studies.
All SSRs were assigned to chromosomes according to their scaffold or contig localization. A total of 116,453 SSRs (72.6%) could be mapped to one of the nine papaya linkage groups. The number of SSRs in each chromosome ranged from 10,566 (6.6%) in LG7 to 14,773 (9.2%) in LG9. The proportion of SSR types and motifs among different chromosomes was similar to the overall genomic distribution (Table S4). The proportion of SSR genomic locations in each chromosome was higher in intergenic regions, followed by introns, exons and exon-intron junctions. There was no bias for SSR types and SSR genomic locations among the chromosomes (Table S5), which is highly desirable for researchers aiming to develop a collection of polymorphic markers for genetic studies.

Design of high-quality primer pairs for papaya SSRs
Aiming to provide a comprehensive source of SSRs to be used in marker-assisted selection and population genetics studies, all 21,231 genic and 139,087 intergenic SSRs were submitted to primer design. All primer pairs were optimized for the same PCR conditions (see methods for details). In a preliminary analysis, 20,659 (97.3%) and 118,831 (85.4%) primer pairs were respectively designed for genic and intergenic SSRs using Primer3 with default parameters. Manual inspection of results revealed a significant number of primers containing repeated or low complexity sequences. Therefore, we decided to adopt more stringent criteria for primer design: 1) Primers within repetitive sequences were removed; 2) Primers in soft-masked low complexity 39 end regions were not allowed (PRIMER_LOWERCASE_-MASKING = 1); and 3) A parameter to minimize hairpin formation (PRIMER_MAX_HAIRPIN_TH = 24) was employed. By using this stringent parameterization, we obtained a much more reliable primer set, although the overall success rate of  Table 4. Number of successfully designed primer pairs for SSR type and linkage group.

SSR
Type LG1 LG2 LG3 LG4 LG5 LG6 LG7 LG8+LG10 a LG9 LG represents the Linkage Groups associated to the nine papaya chromosomes. a LG8 and LG10 are associated to the same chromosome.
b SSRs located in scaffolds/contigs were classified as unplaced. doi:10.1371/journal.pone.0112654.t004 Table 5. Number of successfully designed primer pairs for each genomic context and linkage group.

Genomic location
LG1 LG2 LG3 LG4 LG5 LG6 LG7 LG8+LG10 a LG9 LG represents the Linkage Groups associated to the nine papaya chromosomes. a LG8 and LG10 are associated to the same chromosome.
b SSRs located in scaffolds/contigs were classified as unplaced. doi:10.1371/journal.pone.0112654.t005 primer design dropped to 71%. A total of 18,925 primer pairs were successfully designed for 89.1% and 67.9% of all genic and intergenic SSR sequences, respectively (Table 4). Although such stringent settings prevented primer design for 8% genic and 17% intergenic SSRs, this methodology resulted in an extensive and more reliable list of primer pairs designed for distinct SSR types, genomic locations and chromosome linkage groups (Table S1,  Table S2). Designed primers were uniformly distributed among SSR types (Table 4) and chromosomes (Table 5). Regarding SSR type, most of the 113,446 primer pairs were designed for dinucleotide repeats (39.7%), followed by tri-(18.2%), hexa-(15.2%), penta-(13.6%) and tetranucleotide repeats (13.3%) ( Table 4). As expected, the majority of exonic SSRs with designed primers are composed of trinucleotide repeats (Table S6).

Analysis of SSRs located in genes related to fruit ripening
Next, we aimed to use our SSR atlas to find genes related to fruit ripening, a trait of high agronomical interest in papaya. To achieve this goal, tomato gene expression and tomato/papaya orthology data were integrated. Tomato is the papaya's closest climacteric fleshy fruit with available genome-wide gene expression data [28]. Sato et al. reported significant differential expression of 53 cell wall and 222 transcription factors (TFs)/ ethylene-related genes during fruit ripening [28]. Based on BLASTP searches (see methods for details), 175 cell wall and 319 transcription factor orthologous genes were found in papaya (Table S7, Table S8). Our atlas include SSRs with primer pairs for 113 cell wall-related (257 SSRs, 40 exonic and 217 intronic; 2.3 SSRs/gene) and 187 TF/ethylene-related genes (528 SSRs, 127 exonic and 400 intronic; 2.8 SSRs/gene). These two groups comprise very good candidate pulp softening and pigmentation control genes [36]. By integrating this information in our genecentered SSR map, we provide an unprecedented list of markers for studying the genetic and functional variability of fruit ripening processes.
Fruit ripening is a developmental process characterized by remarkable changes related to flavor, sugar metabolism, color, aroma, texture, softening and nutritional content [37]. These metabolic and physical alterations are driven by genetically coordinated expression profiles in several metabolic pathways, such as cell wall disassembly, sugar hydrolysis, ethylene biosynthesis and pigmentation. Using KEGG Orthology (KO) we identified ripening-related pathways for cell wall genes and TFs harboring SSRs with primer pairs. Nine entries for cell wall genes were found in KO00050 -Starch and sucrose metabolism (5) and KO00040-Pentose and glucuronate interconversions (4). During climacteric fruit ripening, respiration rate increases and a series of enzymes degrade starch and synthesize sucrose (e.g. starch phosphorylase and sucrose synthase) [38]. The carboxylic acid glucuronate is the precursor of pectin, one of the main components of plant cell wall [39]. Pectin levels decrease during fruit maturation [39] typically due to the increased expression of pectin lyases [28]. However, since pectin lyases and methyltransferases are repressed during papaya ripening, pectin solubilization is probably catalyzed by polygalacturonases [40].
Among TFs/ethylene-related genes, 24 entries were found, mostly from the category KO04075 (Plant hormone signal transduction; 8 genes), such as auxin-responsive protein, ethylene receptor and responsive transcription factor 1 and ABA-responsive element binding factor. The phytohormone auxin plays important roles in fruit growth, regulating cell division, differentiation, lateral root formation and embryogenesis [41]. Particularly, genes involved in signaling and auxin response factors were up-regulated during papaya [40] and peach [42] ripening. In addition, auxin can stimulate ethylene biosynthesis via transcription of acetylcoenzyme A synthetase genes [42]. In turn, the involvement of ethylene response factor in climacteric fruit ripening is wellestablished and this regulator determines, for example, fruit firmness reduction and defense response to pathogens after ripe [36]. By aggregating these annotations, our atlas provides a rich resource from which the scientific community can rapidly draw genes and SSRs with designed primer pairs for genetic studies.

SSR screening and polymorphism survey
A 100% PCR amplification rate was achieved for the 73 genic SSRs (16 exonic and 57 intronic), allowing the identification of 19 polymorphic alleles (26%) ( Table S9). Twenty five of such genes are orthologs of tomato genes with differential expression during fruit ripening (Table S7, Table S8). Among the polymorphic alleles we found 5 cell wall and 2 transcription factors/ethylenerelated genes. Polymorphisms were also detected in genes from the cellulose synthase, pectin lyase-like and ethylene response factor families/superfamilies. Taken together, these results not only validate the use of our atlas as an efficient tool in papaya breeding projects, but also stimulate additional genetic and biochemical studies to detail the functions these polymorphic genes in papaya, as they may be useful in the production of fruits with increased shelf life.

Conclusion
Non-coding SSRs near genes or inside introns can directly affect gene expression [21,[43][44][45]. On the other hand, SSRs within exons often result in amino acid changes that may affect protein function. For example, a study using genic SSRs derived from candidate genes involved in wood formation identified two SSR markers (one in coding and the other in non-coding region) explaining 13.5% of the lignin content variation in Chinese white poplar [46]. These results demonstrate the power of genic markers to identify genotype-to-phenotype associations and make these markers very useful in genetic improvement of desired characteristics.
In the present work we surveyed the papaya genome for the presence of perfect, non-redundant SSRs. We analyzed the distribution of SSR locations (exon, exon-intron, intron, intergenic) and established a comprehensive atlas of SSR markers with SSR type, motif sequence, SSR size, genomic location (exon, intron or intergenic), linkage group location and gene-centered information (gene annotation and GO assignments). The resource reported here is fully accessible through our supplementary material (Table S1, Table S2), allowing plant breeders and researchers to easily choose gene-centered markers to test their association with biological processes or phenotypes of agronomic interest. Moreover, we achieved a 100% PCR amplification rate during a genetic survey of 73 SSR markers, supporting the high quality of the predicted SSRs and designed primers. The atlas developed in this study will certainly serve as a toolbox to assist and improve the efficiency of marker-assisted selection in papaya breeding and population genetic studies.

Author Contributions
Conceived and designed the experiments: NMV ALG HCCR MGP TMV. Performed the experiments: NMV HCCR TMV. Analyzed the data: NMV ALG TMV. Contributed reagents/materials/analysis tools: NMV ALG HCCR MGP TMV. Contributed to the writing of the manuscript: NMV ALG TMV.