First Microsatellite Markers Developed from Cupuassu ESTs: Application in Diversity Analysis and Cross-Species Transferability to Cacao

The cupuassu tree, Theobroma grandiflorum (Willd. ex Spreng.) Schum., is a fruit-bearing species from the Amazon with great economic potential, owing to the multiple uses of its fruit pulp and seeds in the food and cosmetic industries, including the production of cupulate, an alternative to chocolate. To support the cupuassu breeding program and to select plants combining pulp/seed quality with fungal disease resistance, SSRs were obtained from ESTs generated by next generation sequencing and used in diversity analysis. From 8,330 ESTs, 1,517 contained one or more SSRs (1,899 SSRs identified). The most abundant motifs identified in the EST-SSRs were hepta- and trinucleotides, and the number of repeats ranged from 2 to 19. From the 1,517 ESTs containing SSRs, 70 were selected based on their functional annotation, focusing on pulp and seed quality as well as resistance to pathogens. The 70 selected ESTs contained 77 SSRs, of which 11 were polymorphic in cupuassu genotypes. These EST-SSRs were able to discriminate the cupuassu genotypes in relation to resistance/susceptibility to witches' broom disease, as well as to pulp quality (SST/ATT values). Finally, we showed that these markers were transferable to cacao genotypes, and that genome availability might be used as a predictive tool for polymorphism detection and for the design of primers useful for both Theobroma species. To our knowledge, this is the first report involving EST-SSRs from cupuassu and the first analysis of marker transferability from cupuassu to cacao. Moreover, these markers might contribute to developing the cupuassu genetic map and saturating the cacao genetic map.


Introduction
The cupuassu tree, Theobroma grandiflorum (Willd. ex Spreng.) Schum., belonging to the Malvaceae family, is a fruit-bearing species native to the Amazon [1], as is the cacao tree (Theobroma cacao L.), whose seeds are used as raw material for chocolate production. The cupuassu tree is considered one of the main tree crops in the Amazon region [2,3]; it is economically important in Brazil and has great potential at the international level due to the multiple uses of its fruit pulp and seeds. From the pulp, several products are manufactured, such as juices, ice creams, liquors, jams, jellies, creams and sweets [2,3]. Cupuassu seeds have a high quality fat, composed mainly of oleic and stearic acid [4,5], from which a product similar to chocolate, called cupulate, can be obtained [6][7][8]. Moreover, cupuassu has received attention because of its proteolytic activity, useful in the food industry [8], its antioxidant and cytotoxic activity, as well as its action in increasing glucose tolerance [9][10][11]. Due to its potential for the "chocolate" industry, particularly in the current period of announced cacao bean and chocolate shortage [12,13], studies of cupuassu are increasing at the molecular and breeding levels [14][15][16][17]. Moreover, the genetic proximity of cupuassu to cacao, which has been thoroughly studied during the last 10 years [18][19][20][21], allows the transfer of data and technologies, as well as comparisons that can improve breeding programs for characteristics such as pulp/seed quality and disease resistance.
Considering that the main phytopathological problem affecting the Theobroma genus in Brazil is witches' broom disease, caused by the hemibiotrophic basidiomycete Moniliophthora perniciosa [22], the cupuassu breeding program should integrate the selection of lines that present both pulp/seed quality and resistance to this fungus. Such selection could be assisted by microsatellite (SSR) markers, which are short repeat motifs showing high polymorphism due to indel-type mutations in one or more repeats [23]. SSR distribution is considered nonrandom across both coding and noncoding regions of genomic DNA, and some of these SSR structures are important for different cell functions (e.g. gene transcription, chromatin organization, DNA replication, cell cycle), indicating that some SSR groups may not be neutral [23]. In plant genetics, SSRs have been preferred due to their high variability, abundance, multiallelic nature, reproducibility, polymorphism and transferability, as well as their codominant inheritance, chromosome-specific location and wide genomic distribution [23][24][25]. In many species, SSRs have been widely used for genetic diversity studies, molecular mapping, molecular fingerprinting and conservation strategies [26].
When these SSRs are identified in expressed sequence tags (ESTs), the selection of interesting plant genotypes can be quite efficient, mainly because the markers are physically associated with coding regions and can enhance the evaluation of plant populations by enabling variation to be assayed in expressed genes with known function [27]. With the advent of low cost next generation sequencing (NGS) technologies, it is now possible to easily obtain thousands of ESTs that can be the main source for in silico SSR identification (then named EST-SSRs). Identification of EST-SSRs is also important in the study of different species from the same genus [28][29][30][31][32], in which gene function and biological processes could be conserved [24,33] and may be related to the same responses to biotic and/or abiotic stresses. Therefore, the transferability of SSRs or EST-SSRs between species may support the idea of similar function, and may contribute to comparative genomics and diversity analysis [34][35][36].
For this reason, we focused here on: i) the identification and description of SSRs from cupuassu ESTs obtained by next generation sequencing; ii) the analysis of the related EST function; iii) the validation of the SSRs on cupuassu genotypes with varied pulp quality and resistance to witches' broom disease, and a diversity study in relation to both characteristics; iv) the transferability of cupuassu SSRs to cacao genotypes. To our knowledge, this is the first work involving EST-SSRs from cupuassu and the first analysis of marker transfer from cupuassu to cacao.

Plant material
Cupuassu genotypes used for EST-SSR validation were selected with a view to subsequent applications in breeding programs for pulp quality improvement and/or witches' broom disease resistance. Sixteen cupuassu genotypes from Embrapa Amazonia Oriental were used in this study (Tables 1 and 2). Among them, fourteen were resistant to witches' broom disease and two were susceptible (Table 2; personal communication R.M. Alves). The genotypes 174 (Coari) and 1074, resistant and susceptible to witches' broom disease, respectively, were the parents of several of the progenies used in the breeding programs in Brazil (Table 1) [14]. For marker transferability analysis, three Theobroma cacao L. genotypes from Ceplac (Bahia, Brazil) were used: two resistant, SCA6 and TSH 516, and one susceptible, ICS1. The TSH 516 genotype corresponds to the SCA6 x ICS1 cross [37].

Cupuassu pulp quality analyses
For the pulp quality analyses, five fruits were harvested from each of three different plants (n = 15) for each of the sixteen cupuassu genotypes described (Table 2). For the evaluation of the pulp characteristics (°Brix, acidity, humidity and pH), 20 g of pulp from each fruit were collected and analyzed as previously described [38]. The °Brix was determined using a PR-101 refractometer (ATAGO). The total acidity, expressed as citric acid percentage, was determined by titration using 0.1 N NaOH. The pH was determined using a Horiba F-21 pH-meter. For the determination of humidity, the samples were oven dried at 105°C until weight stabilization.

Location of the EST-SSR in relation to the coding sequence of the cDNA
The open reading frame (ORF) of each of the 70 chosen ESTs was determined using the ORF Finder program (http://www.ncbi.nlm.nih.gov/gorf/gorf.html) and by comparison with the cacao genome (http://cocoagendb.cirad.fr; [19]), and the SSR was localized in relation to the ORF. The possible locations were: in the 5' untranslated region (5'UTR), in the ORF region, or in the 3' untranslated region (3'UTR). In some cases, due to the EST sequence length or quality, it was not possible to clearly determine the ORF and consequently the location of the SSR.
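The three possible locations described above can be illustrated with a minimal Python sketch. The function name, the 0-based half-open coordinate convention and the "undetermined" label for SSRs that overlap an ORF boundary are assumptions made for illustration; they are not part of the ORF Finder workflow used in this study.

```python
def classify_ssr_location(ssr_start, ssr_end, orf_start, orf_end):
    """Classify an SSR relative to the ORF of its EST (sense strand).

    Coordinates are 0-based, half-open [start, end) positions on the EST.
    Pass None for the ORF coordinates when the ORF could not be determined.
    """
    if orf_start is None or orf_end is None:
        return "undetermined"
    if ssr_end <= orf_start:
        return "5'UTR"          # SSR lies entirely upstream of the ORF
    if ssr_start >= orf_end:
        return "3'UTR"          # SSR lies entirely downstream of the ORF
    if ssr_start >= orf_start and ssr_end <= orf_end:
        return "ORF"            # SSR lies entirely inside the ORF
    return "undetermined"       # SSR overlaps an ORF boundary


# Hypothetical coordinates for an EST whose ORF spans positions 50..350
print(classify_ssr_location(10, 22, 50, 350))
print(classify_ssr_location(100, 112, 50, 350))
print(classify_ssr_location(360, 380, 50, 350))
```

Returning "undetermined" when no ORF is available mirrors the cases mentioned in the text where EST length or quality prevented a clear assignment.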

DNA extraction, PCR amplification and electrophoresis conditions
Cupuassu and cacao DNA were extracted from young leaves as previously described [45] and quantified using a Nanodrop 2000 (Thermo Scientific). The optimization phase for the 77 designed primers was performed using the cupuassu genotypes 174 and 1074 (see Plant material). For the optimization phase, PCR was performed in 13 μl containing 7.5 ng of DNA, 0.25 mmol.l⁻¹ of each dNTP, 10 mmol.l⁻¹ of Tris-HCl pH 8.3, 50 mmol.l⁻¹ of KCl, 2 mmol.l⁻¹ of MgCl₂, 0.2 μmol.l⁻¹ of each primer, and 1 U of Taq DNA polymerase (Phoneutria). Amplifications were performed in a Mastercycler PCR 5333 thermocycler (Eppendorf) under the following conditions: 96°C for 2 min, 30 cycles of 94°C for 1 min, 58°C for 1 min and 72°C for 1 min, and a final extension step at 72°C for 7 min. Amplified fragments were analyzed by electrophoresis on 4% denaturing TBE acrylamide gels. Polymorphism was evaluated by scoring the SSR bands. When comparing the genotypes, the presence or absence of a given band (similar size) indicated similarity or dissimilarity between genotypes, respectively. The 10-bp molecular marker (Invitrogen) was used as a reference to score the bands. For the confirmation of the polymorphic primers, amplifications were carried out on the 16 cupuassu genotypes (Table 1). PCR was performed as described above, except that the primers were labelled with the M13 tail and the reaction was supplemented with 0.2 μmol.l⁻¹ of M13 primer labelled with NED™ fluorescence and 10 μmol.l⁻¹ of 6-FAM. The amplification products were analyzed on the ABI3500 sequencer (Applied Biosystems) using GeneScan™ 500 LIZ™ dye (Life Technologies) as internal size standard. The allele size was defined using the GeneMarker software. The transferability of the developed EST-SSR primers was assessed by cross-species amplification on genomic DNA of three T. cacao genotypes (SCA6, ICS1, TSH516) using the same PCR and electrophoretic conditions (4% denaturing TBE acrylamide gel) as described above.

Sequencing of amplicons for marker confirmation
PCR amplifications were carried out in a 20 μl reaction volume containing 1X PCR buffer (Invitrogen), 0.375 mM of each primer (see S1 Table), 10 ng/μl of cupuassu DNA (genotypes 1074 and 174) and 0.5 U of Taq polymerase (Invitrogen). Thermocycling conditions consisted of an initial denaturation at 95°C for 5 min followed by 28 cycles of 95°C for 30 s, 58°C for 90 s and 72°C for 30 s, and a final extension step at 72°C for 10 min. All amplifications were performed in a MyCycler thermocycler (Bio-Rad Laboratories). PCR amplification products were checked by electrophoresis on a 1.8% agarose gel stained with Gel-red I (Invitrogen). PCR products were cleaned with ExoSap-IT (USB) according to the manufacturer's instructions. Sequencing was performed on the ABI3100 equipment (Applied Biosystems) at Ceplac (Bahia, Brazil). The SSR markers were confirmed by comparing the number of repeats of each allele among the different genotypes.

Genetic diversity and statistical analysis
The amplified SSR DNA bands representing different alleles were scored in the different genotypes. The genetic diversity parameters were assessed in terms of observed number of alleles (Na), observed heterozygosity (Ho), and expected heterozygosity (He) using the Genetic Data Analysis software [46]. Polymorphic information content (PIC) was obtained for each locus as previously described [47], and null alleles were examined using the Micro-checker software, v.2.2.3 [48]. Factorial Component Analysis (FCA) was performed with the GENETIX software [49]. Correlation tests between molecular data and pulp quality or resistance to witches' broom disease were performed using the SAS program [50].
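As an illustration of the diversity parameters listed above, the following Python sketch computes Na, Ho, He and PIC for one locus from diploid genotypes. It is not the procedure implemented in the Genetic Data Analysis software: the He shown is the simple 1 − Σp², without any sample-size correction, and the PIC formula is the commonly used 1 − Σp_i² − Σ_{i<j} 2p_i²p_j², which is assumed here and may differ from the exact procedure of [47]. The function name and the allele sizes in the example are hypothetical.

```python
from collections import Counter


def locus_diversity(genotypes):
    """Return (Na, Ho, He, PIC) for one locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid individual.
    """
    n_individuals = len(genotypes)
    alleles = [allele for pair in genotypes for allele in pair]
    counts = Counter(alleles)
    freqs = [count / len(alleles) for count in counts.values()]

    na = len(counts)                                          # observed number of alleles
    ho = sum(1 for a, b in genotypes if a != b) / n_individuals  # observed heterozygosity
    sum_sq = sum(p * p for p in freqs)
    he = 1.0 - sum_sq                                         # expected heterozygosity (uncorrected)
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    pic = 1.0 - sum_sq - cross                                # polymorphic information content
    return na, ho, he, pic


# Hypothetical allele sizes (bp) for four individuals at one SSR locus
na, ho, he, pic = locus_diversity([(100, 100), (100, 104), (104, 104), (100, 104)])
print(na, ho, he, pic)  # 2 0.5 0.5 0.375
```

With two equally frequent alleles, He is 0.5 and PIC is 0.375, which shows why PIC is always at most He for the same locus.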

In silico comparison of Theobroma grandiflorum loci with Theobroma cacao var. Criollo
For cupuassu/cacao loci comparison, T. grandiflorum ESTs were compared to the cacao genome var. Criollo (CocoaGenDB; http://cocoagendb.cirad.fr) using the blastn tool of CocoaGenDB configured with the following parameters: blast against gene sequences (including UTRs and introns) and an expected e-value of 1 × 10⁻¹⁰ [19]. Specific repeat motifs observed in cupuassu loci were searched for in the corresponding region of the cacao sequence (ORF or UTRs). Primers used for SSR analysis in cupuassu and for the transferability study in cacao were also blasted against the cacao genome using the specific Primer Blaster tool from CocoaGenDB, with an acceptability of up to three mismatches. Each cupuassu EST and the corresponding cacao sequences were compared and aligned using the Clustal Omega program (http://www.ebi.ac.uk/Tools/msa/clustalo/).
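The "up to three mismatches" criterion for primer matching can be sketched in Python as a simple sliding-window comparison. This is only an illustration of the tolerance rule, not the algorithm of the Primer Blaster tool: it scans the forward strand only, ignores gaps, and the function name and sequences are hypothetical.

```python
def best_primer_site(primer, target, max_mismatches=3):
    """Find the best ungapped binding site of a primer on a target sequence.

    Returns (position, mismatches) for the site with the fewest mismatches,
    or None if no site has at most max_mismatches mismatches.
    Only the forward strand of the target is scanned.
    """
    primer = primer.upper()
    target = target.upper()
    best = None
    for pos in range(len(target) - len(primer) + 1):
        window = target[pos:pos + len(primer)]
        mismatches = sum(1 for a, b in zip(primer, window) if a != b)
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (pos, mismatches)
    return best


# Hypothetical primer and target: one mismatch at the best site
print(best_primer_site("ACGTACGT", "TTTACGTACCTGGG"))  # (3, 1)
# No acceptable site within three mismatches
print(best_primer_site("AAAAAAAA", "CCCCCCCCCCCC"))    # None
```

A reverse primer would be checked the same way after taking its reverse complement, a step omitted here for brevity.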

Transferability of EST-SSRs
The transferability of the cupuassu EST-SSRs to T. cacao was analyzed by cross-species amplification. Of the 22 pre-selected EST-SSRs (polymorphic or not in cupuassu; Fig 1), 17 amplified cacao DNA, which corresponds to a transferability rate of 77% (Table 4). The amplicons were within the expected size range, and 14 of the 17 cupuassu SSRs were polymorphic in cacao (Table 4). Of the 11 EST-SSRs polymorphic in cupuassu, 8 were transferable to cacao and 6 were also polymorphic in this species (Tables 3 and 4). The 11 polymorphic loci of cupuassu were also compared to the cacao genome database (cacao var. Criollo) and several homologous sequences were found (Table 5). Eight cupuassu loci presented polymorphism when compared to cacao: six of them presented the same repeat motif, but with fewer repeats (c2723, c5718, c70, c180, c193B, c203B) for at least one homologous sequence, and two of them did not present the repeat motif (c3202/3202B, c733; Table 5). Two loci showed the same motif/repeat number in cupuassu and cacao (c339, c431B; Table 5). The in silico analysis showed that some primers were transferable, allowing the identification of a polymorphic locus (e.g. c2723; Fig 6A, Tables 4 and 5). Some primers were transferable but the locus was non-polymorphic (e.g. c339; Fig 6B, Tables 4 and 5). The two other situations corresponded to primers that were not able to amplify the cacao gene, whether the locus was polymorphic or not (e.g. c193, c733; Fig 6C, Tables 4 and 5). It is interesting to note that some loci were transferable to cacao but presented different polymorphism depending on the cacao variety analyzed: for example, the c431B locus is polymorphic in the SCA6/ICS1/TSH516 varieties (Table 4) but did not present potential polymorphism in the in silico analysis using the Criollo variety (Table 5).
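The rates quoted in this section follow directly from the counts given in the text, as the short Python sketch below shows. The dictionary keys are illustrative labels, and the percentages other than the 77% transferability rate are derived here from the reported counts rather than quoted from the tables.

```python
def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole


# Counts taken from the text; the labels are illustrative
rates = {
    "transferability (17 of 22 EST-SSRs amplified cacao DNA)": percent(17, 22),
    "polymorphic in cacao among amplified (14 of 17)": percent(14, 17),
    "cupuassu-polymorphic loci transferable to cacao (8 of 11)": percent(8, 11),
    "of those, also polymorphic in cacao (6 of 8)": percent(6, 8),
}

for label, value in rates.items():
    print(f"{label}: {value:.1f}%")
```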

Discussion
In this article we obtained and analyzed a large number of ESTs from Theobroma grandiflorum (cupuassu) with the objective of identifying new SSR markers useful for marker-assisted selection in cupuassu with respect to both quality and resistance to witches' broom disease. Both of these characteristics are important from a practical point of view for expanding the production of cupulate or pulp-derived products, as an alternative to chocolate production, which has been declared in crisis [12,13]. Moreover, the cupuassu breeding program needs new markers for genetic fine mapping and selection of genome regions specifically involved in quality and/or resistance, in order to complement previous genetic analyses of cupuassu populations [14,15,17]. Here we obtained SSR markers from NGS ESTs of cupuassu genotypes with different levels of resistance to witches' broom disease and pulp quality. It is important to highlight that we produced the first EST database from cupuassu as well as the first EST-SSRs for this species. In cacao, more than 200,000 ESTs from different plant genotypes and organs, submitted or not to different biotic and abiotic stresses [18,44,51,52,53,54], and more than 2,000 SSRs (of which 1,631 [81%] were EST-SSRs) have already been obtained (S3 Table), whereas in cupuassu only genomic SSRs were previously found (unpublished data, R.M. Alves). Furthermore, ESTs for use in molecular studies related to pulp or bean quality from the Theobroma genus are rare [18,53]. Under these conditions, our results are highly relevant due to the large number of ESTs generated (8,330) as well as the functional data associated with some of the EST-SSRs identified (Fig 3). SSRs were detected in 18% of the ESTs analyzed (Fig 1), which is a high frequency compared to data from other crops [24,55,56] obtained with similar technical approaches (e.g. NGS, MISA analysis).
Here, the highest proportions of EST-SSRs identified were hepta- and trinucleotides (29.3% and 25.4%, respectively; Fig 2A). Trinucleotides were generally considered the most abundant class of SSRs in plant ESTs [27,55,56,57], but other works also indicated dinucleotides [33,58]. Since the addition or deletion of three nucleotides within translated regions usually does not affect the ORFs, it is not uncommon to detect a high abundance of these repeat motifs in EST-SSRs [59,60], as we observed in our results (Fig 4B). Generally, however, it is accepted that the abundance of one or another SSR class may be due to the search criteria used for EST mining [26,58,61]. The search criteria also influence the frequency of the repeat numbers of the SSR motifs; here the most frequent repeat numbers were 2, 4, 3 and 10 (44.6%, 17.8%, 11.3% and 7.75%, respectively; Fig 2B). Moreover, the SSRs with the highest repeat numbers (10 to 19) were also the ones that contained exclusively mono- and dinucleotides (Fig 2C), while the SSRs with the lowest repeat numbers (2 to 6) contained larger motifs (tetra- to nonanucleotides; Fig 2C).
[Figure caption, beginning missing] ... (Table 1) based on allele frequencies using eleven polymorphic SSR markers (Table 3) and pulp quality characteristics (Table 2). The susceptible cupuassu genotypes were indicated by squares (62 and 1074); the other ones were resistant and indicated by diamonds. The cupuassu genotypes with SST/ATT parameter >7.0 were indicated in red; those with SST/ATT parameter ≤7.0 were indicated in blue. The orange circle separated the susceptible genotypes from the resistant ones (green circle).
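The dependence of SSR class abundance on the search criteria, discussed above, can be made concrete with a minimal regular-expression sketch in Python. This is not the MISA tool and does not reproduce the criteria used in this study: the function name, the example sequence and the minimum-repeat thresholds per motif length are all hypothetical, and only perfect tandem repeats are detected.

```python
import re


def find_ssrs(seq, min_repeats_by_motif_len):
    """Find perfect tandem repeats in a DNA sequence.

    min_repeats_by_motif_len: dict mapping motif length -> minimum repeat number.
    Returns a list of (start, motif, repeats) tuples, ordered by motif length.
    """
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats_by_motif_len.items()):
        # One motif copy followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AAA", "ATAT")
            if any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len)
                if motif_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // motif_len))
    return hits


# Hypothetical sequence with an (AT)5 and an (AGC)5 repeat
seq = "GGC" + "AT" * 5 + "GG" + "AGC" * 5 + "TT"
print(find_ssrs(seq, {1: 10, 2: 5, 3: 4}))  # [(3, 'AT', 5), (15, 'AGC', 5)]
# Stricter thresholds detect nothing in the same sequence
print(find_ssrs(seq, {1: 10, 2: 6, 3: 6}))  # []
```

Changing only the thresholds changes which motif classes are reported for the same sequence, which is the effect the text attributes to EST mining criteria.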
From the 1,899 EST-SSRs identified, 77 were tested for polymorphism in 16 cupuassu genotypes and 11 were polymorphic (Table 3). The PIC values (average 0.5; Table 3) observed here were close to the ones found in cacao and cupuassu studies using genomic SSRs [14,62]. Such polymorphism was associated with the genetic diversity of cupuassu according to the resistance parameter (the characteristic that best discriminated the cupuassu genotypes) and, to a lesser extent, to the SST/ATT parameter (Fig 5). The ATT data found in our study were consistent with the results obtained in other evaluations [63,64], and 13 of the 16 genotypes studied (81%) presented ATT values higher than the minimum required (1.5; Table 2) [65]. The pH of the cupuassu genotypes used here also showed values close to those observed in other studies [63,64,66,67], and all the genotypes (100%) presented values higher than the required limit for good cupuassu quality (2.6; Table 2) [65]. The SST content was also consistent with other studies [63] and higher than the required limit [65] (Table 2); it is important to note that the harvesting period could influence the pulp quality, as observed in other analyses where the SST values were lower than the expected values [64,67]. Genotypes 63 and 64 showed the highest SST/ATT and for this reason may be considered good candidates for breeding programs (Table 2). Generally, these data suggest that the cupuassu germplasm collection, as well as the cupuassu breeding program, generated material with high genetic variability related to pulp quality, and that the markers found here could be used for subsequent analysis of new crosses in cupuassu populations and potentially in other Theobroma species. Because EST-SSRs are generated from coding and expressed sequences, which are generally well conserved between species, the possibility of finding conserved primers flanking the repeat (and possibly polymorphic) motifs is high [26,41].
Here we observed in vitro and in silico marker transferability between cupuassu and different varieties of cacao (resistant and susceptible to witches' broom disease; Tables 4 and 5). Generally, the in silico analysis confirmed the in vitro results, and different transferability situations were observed (Table 5 and Fig 6). Transferability requires not only polymorphism between cupuassu and cacao sequences, but also good primer design, able to amplify the polymorphic regions (Fig 6A). Therefore, the availability of the cacao genome and the study of gene families with interesting functions can help to design primers able to amplify in, and consequently to be efficiently transferred between, different species from the same genus. It is important to note that we report the first cupuassu-to-cacao marker transferability; only a few studies of transferability between the two Theobroma species have been reported, and always from cacao to cupuassu [38,68]. The first report used previously developed cacao markers [69] (S3 Table) to define the natural mating system of Theobroma grandiflorum in its putative center of diversity [38], while the second specifically deals with marker transferability from cacao to cupuassu [68]. The polymorphism rate calculated in these studies was lower (43.8%; Alves et al., 2006) than the one obtained here from EST-SSRs (77%; Fig 1, Table 4). Generally, in the work presented here we obtained a higher transferability (77%) than reported in other tests of marker transferability between related species [31,35,70]. Successful transferability between species, as observed for coffee [71], rice [70], bananas [72], barley [73] and gerbera [74], saves time and costs in the development of new markers.

Conclusion
Here we obtained the first EST-SSRs from cupuassu. These markers were polymorphic in cupuassu and allowed diversity analysis of the studied genotypes, mainly in relation to pulp quality. Moreover, these markers were transferable to cacao genotypes. The detection of EST-SSRs was also important with regard to sequence function; the ESTs containing these SSRs will be good candidates for functional studies related to pulp and seed quality as well as to resistance to witches' broom disease. Moreover, these markers may contribute to developing the cupuassu genetic map and saturating the cacao genetic map.
Supporting Information S1