Single Nucleotide Polymorphism (SNP) markers associated with high folate content in wild potato species

Micronutrient deficiency, also known as the hidden hunger, affects over two billion people worldwide. Potato is the third most consumed food crops in the world, and is therefore a fundamental element of food security for millions of people. Increasing the amount of micronutrients in food crop could help alleviate worldwide micronutrient malnutrition. In the present study, we report on the identification of single nucleotide polymorphism (SNP) markers associated with folate, an essential micronutrient in the human diet. A high folate diploid clone Fol 1.6 from the wild potato relative Solanum boliviense (PI 597736) was crossed with a low/medium folate diploid S. tuberosum clone USW4self#3. The resulting F1 progeny was intermated to generate an F2 population, and tubers from 94 F2 individuals were harvested for folate analysis and SNP genotyping using a SolCap 12K Potato SNP array. Folate content in the progeny ranged from 304 to 2,952 ng g-1 dry weight. 6,759 high quality SNPs containing 4,174 (62%) polymorphic and 2,585 (38%) monomorphic SNPs were used to investigate marker-trait association. Association analysis was performed using two different approaches: survey SNP-trait association (SSTA) and SNP-trait association (STA). A total of 497 significant SNPs were identified, 489 by SSTA analysis and 43 by STA analysis. Markers identified by SSTA were located on all twelve chromosomes while those identified by STA were confined to chromosomes 2, 4, and 6. Eighteen of the significant SNPs were located within or in close proximity to folate metabolism-related genes. Forty two SNPs were identical between SSTA and STA analyses. These SNPs have potential to be used in marker-assisted selection for breeding high folate potato varieties.


Introduction
Micronutrient malnutrition is a global health concern that affects as many as two billion people [1,2]. Health problems associated with malnutrition are responsible for over a million a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 deaths per year [3]. Folate (a.k.a. vitamin B 9 ) is an essential micronutrient in the human diet. Folate plays an important role in overall cellular and organismal health. In the absence of sufficient folate intake, cellular processes such as nucleic acid biosynthesis, the metabolism and catabolism of amino acids, and the methylation cycle cannot take place efficiently [4]. Folate deficiencies have been linked to many serious health concerns such as congenital birth defects, anemia, increased risk of stroke, certain types of cardiovascular diseases and cancers [5][6][7][8]. Neural tube defects (NTDs) such as spina bifida and anencephaly are some of the most common congenital birth defects, with an estimated 250,000 cases of NTDs worldwide [9]. It is estimated that up to 70% of NTDs can be prevented with proper folate intake or folate supplementation [6]. Low folate levels have also been linked to impaired cognitive performance and depression [10][11][12].
The cultivated potato (Solanum tuberosum L.) is one of the most consumed food crops worldwide, with a total world production of over 382 million tons in 2014, following maize, rice, and wheat (FAOSTAT data). Potatoes are cultivated in over one hundred countries, from sea level up to 4,700 meters above sea level. It is estimated that over one billion people worldwide consume potatoes regularly [13] which makes potatoes a fundamental element of food security for millions of people.
Biofortification of potatoes for increased folate content could be an efficient strategy to reduce folate deficiencies around the world. Potato has a tremendous genetic diversity with over 4,000 varieties and~100 wild related species [14][15][16][17]. This biodiversity represents a tremendous resource to search for traits of interest such as high folate content. Systematic screening of potato germplasm for tuber folate content has shown a 10-fold range of folate concentrations with some accessions of wild species containing up to 4-fold the folate content of modern potatoes [18,19].
Selecting for high folate content in potato tubers is a tedious process because potatoes must go through a full growing season to obtain tubers for testing, and it takes approximately three days to determine folate content for every eighteen to twenty samples, making large-scale selection for high folate trait practically impossible. Screening efficiency would be greatly improved by the development and use of molecular markers associated with high folate content. Tools such as SNP genotyping, whole genome sequencing, quantitative trait loci (QTL) analysis, and marker-trait associations are useful for the observation and mapping of genetic differences between breeding clones and the progeny of crosses, as well as the contributions of these differences to the coding regions of genes for specific traits [15,[20][21][22].
Molecular markers based genetic linkage maps for diploid and tetraploid potatoes have been constructed previously [23][24][25]. Association mapping, commonly known as linkage disequilibrium (LD)-based mapping is becoming extremely popular for the identification of QTL for investigating marker-trait associations in plants. This strategy serves as an effective alternative to the linkage-based traditional approaches and offers an advantage to breeders, making the requirement of inbred crosses unnecessary [26]. Association mapping has been reported for various phenotypic/morphological traits in potatoes [27][28][29][30]. These tools can make introgression of beneficial traits into potatoes more efficient and breeders can improve breeding efficiency by selecting individuals based on a comprehensive set of genetic data associated with desired traits (e.g., disease resistance, drought tolerance, skin/flesh color, and nutritional content). Theoretically, marker assisted selection (MAS) in potatoes can substantially reduce the amount of time it takes for breeders to develop superior breeding clones for commercial use, licensing and release.
The primary purpose of this study was to identify molecular markers associated with high folate content in potato tubers. We used association mapping to find marker trait associations and identify SNPs associated with high folate content in potato for their potential use in marker assisted breeding.

Plant material
High folate values were previously reported in bulked tubers from four seedlings of the Solanum boliviense PI 597736 accession [18]. Fine screening showed that this accession segregated for folate. A clone referred to as Fol 1.6 that had high tuber folate content was crossed with the low/medium folate recombinant inbred clone USW4self#3 (referred to as USW4s#3 thereafter) to generate an F 1 progeny. USW4s#3 is a selected self from a Minnesota advanced selection 20-20-34 (it has better than usual flowering, male fertility, and it and its progeny are known to self). Twelve of the resulting F 1 seedlings were intermated to produce an F 2 population named BRR3. True potato seeds from the resulting F 2 population were soaked in GA3 at 1 g/L overnight before germination in June 2014. When plantlets reached approximately 8-cm high, they were transplanted in 8-cm square individual pots containing Sunshine Mix LA4P. All-purpose fertilizer 20-20-20 was applied at 200 mg/L once a week until senescence. Plants were watered twice a week until senescence. Vines were killed on October 31 st , 2014, and tubers were harvested on November 11 th . Greenhouse temperature was set at 21˚C day time and 15˚C night time. Supplemental light was provided for 14 hours per day from a mixture of 400 Watt high pressure sodium and 1,000 Watt metal halide lamps. While 150 seedlings were planted, only 94 individuals from the progeny produced tubers and were subsequently used for folate analysis.
Once harvested, tubers were left with skin intact, washed with cold water in a strainer, weighed, and then flash-frozen with liquid nitrogen and stored at −80˚C. Frozen samples were then lyophilized in a freeze-dryer (VirTis Benchtop 4K, SP Scientific, PA) with vacuum pressure < 100 mTorr for three days. Samples were then ground to a fine powder with a Waring blender and transferred to scintillation vials for long-term storage at −80˚C.

Folate analysis
Folates were extracted by using a tri-enzyme extraction method, as previously published [18,19]. Potato samples (100 mg) were homogenized in 15-mL Falcon tubes containing 10 mL of extraction buffer consisting of 50 mM HEPES/50 mM CHES, pH 7.85, 2% (w/v) sodium ascorbate and 10 mM β-mercaptoethanol and deoxygenated by flushing with nitrogen. Once homogenized, samples were boiled for 10 min and cooled immediately on ice. The homogenate was then treated with protease (!14 units) and incubated for 2 h at 37˚C, boiled again for 5 min and cooled immediately on ice. Samples were then treated with α-amylase (!800 units) and rat plasma conjugase in excess (0.5 mL/sample), incubated for 3 h at 37˚C, boiled again for 5 min and cooled immediately on ice. After centrifugation at 3,000 g for 10 min, the supernatant was transferred to a new tube. The residue was re-suspended and homogenized in 5 mL of extraction buffer, re-centrifuged for 10 min, and the supernatant was recovered. Supernatants were then combined and the samples' volume was adjusted to 20 mL with extraction buffer. Aliquots of each sample were transferred to 1.5 mL microcentrifuge tubes, flushed with nitrogen and stored at −80˚C until analysis. Controls containing all reagents, but potato samples, were used to determine the amount of residual folates in the reagents. There were no detectable folates in any of the reagents used.
Folate concentrations were measured by microbiological assay using Lactobacillus rhamnosus. L. rhamnosus (ATCC 7469) cultures were obtained from the American Type Culture Collection (Manassas, VA, USA). Glycerol cryoprotected cells of L. rhamnosus were prepared as described previously [18]. Assays were performed in 96-well plates (Falcon microtiter plates). Wells contained growth medium supplemented with folate standards or potato extracts, each plated in triplicate. Bacterial growth was measured at 630 nm after 18 h, 21 h and 24 h of incubation at 37˚C. The 24-h reading was usually used for analysis unless saturation was reached, in which case, the 21-h reading was used. All measurements were made with a BioTek Instrument EL 311 SX microplate auto-reader (BioTekInstrument, Winooski, VT, USA), analyzed with the KCJr EIA application software (BioTekInstrument, Winooski, VT, USA) and compiled in Microsoft Excel. Folate concentrations were calculated by reference to a standard curve using 5-formyl-THF and expressed as nanograms of folate per gram of dry sample (ngÁg −1 dry weight).
A large batch of dried potato powder from tubers of Solanum pinnatisectum (PI 275233) was previously prepared to be used as reference material [19]. Each batch of extractions contained 18 samples plus the reference material. Values obtained for samples were normalized to values obtained for the reference material. The average folate concentration of the reference material across all the extractions was 1,105 ± 76 ngÁg −1 dry weight. Folate concentrations presented are normalized averages of three technical replications from single biological replication except the Fol 1.6 and USW4s#3 which are the normalized average of twelve technical replications from four biological replications. All calculations were performed with standard function settings in Microsoft Excel.

Genomic DNA isolation
Approximately 15 mg of freeze dried tuber sample were homogenized in 600 μL CTAB extraction buffer (2% cetyltrimethyl ammonium bromide, 1.4 M NaCl, 20 mM EDTA pH 8.0, 100 mM Tris-Cl pH 8.0, 0.2% β-mercaptoethanol) in a 1.7 mL microcentrifuge tube and incubated at 65˚C for 1 h with gentle mixing every 15 min. Cold chloroform (300 μL) was then added and the solution was vortexed briefly to form an emulsion. After centrifugation at 12,000 g for 5 min, the aqueous phase was transferred to a new 1.7 mL microcentrifuge tube. An equal volume of cold isopropanol was then added, the tubes were inverted several times to mix and placed on ice for 10-15 min. The resulting mix was centrifuged at 12,000 g for 10 min, the supernatant was discarded, and the pellet was washed with 300 μL of cold 70% ethanol. Samples were centrifuged for 2 min at 3,500 g. This washing step with ethanol was repeated two more times and then the pellet was re-suspended in 100 μL deionized water. Genomic DNA extracts were treated with 1 μL RNase A for 1 h at 37˚C with gentle mixing every 15 min. RNase A was then inactivated by incubating samples in a water bath at 65˚C for 5-10 min. Samples were then centrifuged quickly to remove bubbles and placed on ice. Samples were further cleaned by treatment with phenol and chloroform. Phenol:chloroform:isoamyl alcohol 25:24:1, pH 8.0 (100 μL) was added to the extracts in a chemical fume hood. Samples were vortexed briefly to form an emulsion. After centrifugation at 13,000 g for 15 min, the aqueous phase (~90-100 μL) was transferred to a new microcentrifuge tube. Cold chloroform (100 μL) was then added and samples were vortexed for 10 seconds. After centrifugation at 13,000 g for 15 min the aqueous phase (80-85 μL) was transferred to a new microcentrifuge tube. DNA was precipitated by addition of 3 M sodium acetate pH 5.3 (1/10 of the sample volume) and 95% ethanol (2.5 volumes) at −20˚C for 2 hours or overnight. Samples were then centrifuged for 30 min at 13,000 g at 4˚C. The supernatant was discarded and the pellet was washed three times with 300 μL of cold 70% ethanol as described above. Pellets were re-suspended in 30 μL deionized water and stored at -20˚C.

SNP genotyping
At least 400 ng of genomic DNA from 96 samples (94 progeny and parents) were loaded onto a 96-well microplate and desiccated in an Eppendorf Vacufuge Plus (Eppendorf North America, Hauppauge, NY) in 30 min intervals at 25˚C until all samples were completely dried.
Samples were then sent to GeneSeek (Neogen Corporation, Lincoln, NE) for custom SNP Profiling using the Illumina platform Infinium SolCAP 12K SNP array.

SNP quality and filtering
The idat data file developed from 12K SolCAP array was imported to the Illumina GenomeStudio software (genotyping version) (Illumina, San Diego, CA) and analyzed for allele calling. For calling SNPs using the diploid model on the v2 SolCAP 12K array, auto-clustering was run in GenomeStudio using standard settings, followed by importing the three cluster calling files from the v1 SolCAP 8303 SNP array. All the SNPs were sorted based on GenTrain score and each of the SNP marker was manually checked for three clusters calling, i.e., AA, AB and BB type alleles (too high or low GenTrain score generally corresponds to monomorphic and bad markers, respectively). Final allelic data was exported from GenomeStudio and first filtered to remove "BAD" and "QUESTIONABLE" SNPs based on quality comments from the Gene-Seek's data summary file. After initial filtering, there were 10,120 SNPs showing amplification. Further filtering for SNPs with less than 10% missing values resulted in 6,759 high quality SNPs that were used for SNP-trait association study.

Population structure and marker properties
Allelic data was first used to survey any kind of internal structure, if present in the population. Factorial analysis (dissimilarity) and neighbor joining (unweighted neighbor joining) was performed using all the polymorphic makers in Darwin (http://darwin.cirad.fr/darwin) [31]. Polymorphic information content (PIC), heterozygosity (Het), diversity (Div), and minor allele frequency (MAF) for significant markers were calculated using JMP Genomics 8 (JMP, A Business Unit of SAS Cary, NC).

Marker-trait association analysis
SNP-trait association analysis was performed in JMP Genomics 8 (JMP, A Business Unit of SAS Cary, NC) using two different approaches: STA and SSTA. Both STA and SSTA analyses are the best fit for analyzing bi-allelic markers for marker-trait association. In brief, folate data was log transformed to fit a more continuous distribution. The annotated chromosomal locations provided from the SNP array was used for the analyses. Datasets were transformed into SAS format "sas7bdat" and uploaded into JMP Genomics. Both functions were run at default parameters for folate content as a continuous trait distribution, with non-delimited genotypes, Benjamini and Hochberg correction (FDR) and Fisher based p-value adjustment. SSTA treats genotypes as categorical variables and uses an ANOVA function for genotypes. STA uses SAS PROC MIXED approach to find associations between markers and the continuous traits.

Folate content
The F 2 segregating population was developed by crossing a high folate wild diploid S. boliviense parent (Fol 1.6) with a low folate diploid S. tuberosum parent (USW4s#3), and eventually intermating F 1 individuals. F 2 individuals were grown throughout the summer and fall in a greenhouse, but many did not tuberize. The lack of tuberization from one third of the interspecific population is mainly attributed to alleles from S. bolivense, which usually produces tubers under short day conditions. Percent tuberization (62%) of BRR3 population is within the range of previously reported haploid-wild species hybrid [32]. Tubers from 94 F 2 individuals were used for folate analysis and SNP genotyping.
Folate content in the F 2 BRR3 population ranged from 304 ± 16 to 2,952 ± 276 ng g −1 dry weight, representing a 10-fold difference between the lowest and highest folate concentrations. The majority of individuals tested (52% of all progeny) had folate concentrations between 500 and 1,000 ng g −1 dry weight (Fig 1). Approximately 40% of individuals showed folate concentrations below 500 or between 1,000 and 1,500 ng g −1 dry weight. The remaining~8% had folate concentrations above 1,500 ng g −1 dry weight, with three individuals between 1,500 and 2,000 ng g −1 dry weight and four above 2,000 ng g −1 dry weight (Fig 1). The parents Fol 1.6 and USW4s#3 had folate concentrations of 950 ± 130 and 639 ± 71 ng g −1 dry weight, respectively. It is clear from these results that the folate content trait shows transgression or transgressive segregation. The presence of extreme or transgressive phenotypes in segregating hybrid populations is common as previously reported [33]. Transgression can be caused by the action of complementary genes, overdominance and epistasis [33]. Note that values presented for the female parent USW4s#3 and the seedlings were from the 2014 growing season, while values presented for Fol 1.6 were from tubers harvested in 2015. Fol 1.6 was chosen as a parent in this cross because it showed tuber folate levels above 1,600 ng g −1 dry weight in two previous harvests in 2011 and 2012. The lower values of Fol 1.6 presented here can probably be attributed to the particular growing conditions. The trait reliability estimate (broad sense heritability) of folate content from a S. microdontom population [34] is 0.67 [unpublished, Goyer et al., In Prep], suggesting that the folate trait is highly heritable and that high folate plants tend to consistently have higher folate content than low folate plants.

Selection of SNP markers
A total of 6,759 high-quality SNPs (4,174 polymorphic and 2,585 monomorphic) were selected for mapping after very stringent screening of the 12K SNP array using the GenTrain score in Genome Studio. Sample BRR80 was removed from further analysis as it behaved as an off type for the majority of monomorphic markers. All the SNPs selected showed less than 10% "No call rate". SNP genotyping showed that the USW4s#3 parent is moderately heterozygous (34.8%), and that the S. boliviense parent Fol 1.6 is less heterozygous (5.3%) (S1 Table). Our initial assumption was that both parents used in this study were highly homozygous and hence we developed an F 2 segregation population by intermating twelve F 1 plants. The presence of moderate levels of heterozygosity in the parents could explain the unexpected segregation distortion among the F 2 population, which can consequently not be considered a true biparental population. Hence, the conventional QTL mapping could not prove effective in the present study. Instead, in such random mated, heterogeneous population, the most effective strategy to identify marker-trait association is association mapping. Association mapping identifies QTL by calculating marker-trait associations that could be attributed to linkage disequilibrium between markers and functional polymorphisms in diverse individuals [35]. It is often more rapid and cost-effective than traditional linkage mapping. Linkage and association mapping are complementary approaches and are more similar than often assumed [36,37].

Population structure
Population structure is one of the major constraints that might skew the results of association studies [38]. As association mapping evaluates whether specific alleles within a population are linked with specific phenotypes more frequently than expected [39], the presence of population structure within the samples could lead to spurious linkages/associations [40]. Population structure of the F 2 individuals was surveyed to identify any accidental selfing and unrelated outcrossing individuals. All the individuals were scanned using factorial analysis (based on dissimilarity with number of axes set at 5) and neighbor joining (based on unweighted pairwise). Both these analyses suggest the absence of any potential subgroups (outcrossing groups). However, four samples (BRR13, BRR27, BRR6 and Fol 1.6) clustered together to form a separate group (Fig 2, numbers  23, 43, 69, and 94, respectively, and percent heterozygosity in S1 Table). Percent heterozygosity analysis of these individuals revealed that these were the least heterozygous individuals in the population (S1 Table). The average heterozygosity in the population is 19.9% while these individuals have an average of 6.3% heterozygosity, which is almost equal to the Fol 1.6 parent (5.3%). The heterozygosity values suggest that these individuals are the product of accidental selfing.

Marker-trait association
Although larger population sizes increase the power of marker-trait associations, previous studies have used population sizes similar to the one used in this study [41,42]. Association mapping analyses were run with and without the aforementioned outliers but the results remained unaffected by the presence or absence of these outliers. Therefore, they were included in the final analyses. Two different marker trait association approaches available in JMP Genomics, SSTA and STA, were used in this study. Both these methods have been designed to handle large-scale bi-allelic genetic data with known locations on the chromosomes to test association with quantitative as well as qualitative phenotypic traits. SSTA can handle complex survey designs to depict marker-trait association by testing a single SNP at a time. It performs two types of analysis, SNP markers based ANOVA testing and REGRES-SION analysis for qualitative covariates. STA performs marker-trait association analysis by using SAS PROC MIXED model for continuous traits.
SSTA and STA analyses using genotype test identified 489 and 43 significant markers, respectively, with -log10(p) ! 1.3 (default parameter in JMP Genomics) ( Table 1, S2 and S3 Tables, S1 Fig). These results show that the mixed model approach is more stringent and hence reduces the number of false positives. The mixed model approach is more powerful as it can handle the internal population structure by accommodating the phenotypic covariance that could be due to genetic relatedness [43]. As STA considers neighboring markers for association while SSTA surveys for single marker trait association without any influence of linkage, it is expected to see more marker associations from the later approach.
SSTA identified 489 significant markers located on all twelve chromosomes, with chromosome 3 containing the highest number of markers (Table 1, Fig 3, S2 Table). The large number of SNPs identified from SSTA can be attributed to the fact that this approach uses linear regression. Furthermore, the large number of significant SNPs may also point to the overall homozygosity of Fol 1.6 parent (S1 Table) and low level of recombination events in chromosome 3 (calculated average recombination frequency on 152 markers is 20.89). STA analysis identified 43 significant markers that were located on three chromosomes only, chromosomes 2 (one marker), 4 (seven markers), and 6 (35 markers) ( Table 1, Fig 4, S3 Table). Chromosomes 4 and 6 showed recombination hotspots and the average recombination frequency was 37.21 and 40.34, respectively. The significant markers on chromosomes 2, 4 and 6 identified by STA analysis could explain an average of 15%, 24% and 16% variance, respectively (S3 Table).
Of particular interest are significant SNPs that were identified by both SSTA and STA methods. There were 50 and 35 significant markers on chromosome 6 identified by SSTA and STA analyses, respectively, out of which 35 were common in both analyses. Similarly, STA identified seven markers on chromosome 4, whereas SSTA identified 31 markers on the same chromosome, out of which seven were common in both analyses. This set of markers showed an average of 0.29 PIC, 0.43 Het, 0.36 Div, and 0.24% MAF ( Table 2). Similar values have been reported earlier in a collection of Andigenum group used for genetic diversity and association mapping with SolCap array SNP markers [30]. MAF determines the accuracy of marker trait association. It generally ranges between 0.01 and 0.50 and the power to detect significant markers increases as MAF increases [44]. MAF under 5-10% can artificially increase the association score, thereby resulting in spurious associations. Other genetic indices like    PIC, Het and Div could be considered within range for a mapping population with narrow genetic base. These common significant SNPs span a region of 10.24 Mb in chromosome 6. In chromosome 4, the common SNPs are located in two hotspots that are 60 Mb apart, one of which spans 0.7 Mb and consists of six SNPs, whereas the second spot consisted of only one SNP marker. Hence, these 42 SNPs located on chromosome 6 and 4 ( Table 2) could be considered as best candidate markers for high folate content.
One possible weakness of SNP as a marker is ascertainment bias [22]. The use of a biased set of pre-ascertained SNPs designed from S. tuberosum genome are more likely to address only the common alleles rather than the rare alleles from S. bolivense. This could result in loss of differential real folate alleles from S. bolivense. S. bolivense population shows segregation for folate trait, but the underlying alleles, even if segregating, are less likely to be marked up by the SNPs derived from S. tuberosum.

Location of SNPs and folate genes
Folate biosynthesis is a well characterized pathway in plants [45]. It involves ten enzymatic steps that are catalyzed by nine proteins (S4 Table). Three enzymes that may be involved in salvage and/or homeostasis of folate have also been characterized [46][47][48][49] (S4 Table). Searches for potato homologs of Arabidopsis and tomato genes revealed 16 folate-related genes in the potato reference genome that are located on different chromosomes (S4 Table). Comparison of genomic location of these genes and that of the 497 SNPs identified by SSTA and STA analyses showed that one significant SNP was located within the 5-FCL gene, and 17 additional SNPs were located in close proximity (<0.1 Mb) to folate metabolism genes (Table 3). We hypothesize that there are two major QTLs located in chromosome 4 and 6, and additional minor QTLs in other chromosomes that contribute to folate biosynthesis.

Conclusions
In this study, a diploid segregating population was developed and genotyped to find significant associations between SNPs and folate content. A total of 497 SNPs were identified with potential associations with folate content by using two analytical methods. A set of 42 identical SNPs located on chromosome 6 and 4 was identified by both methods. Additional significant SNPs were located within or in close proximity to folate metabolism-related genes. Considering high reliability of folate trait with presence of high frequency of minor alleles, the set of markers identified in this study could facilitate the systematic screening of high folate germplasm and the introgression of the high-folate trait into new potato varieties. Because folate analyses are tedious and materials used in this study cannot really be evaluated in the field, folate was quantified from a single tuber harvest. Better replication, validation, and further tests in a fieldready background are next steps to this preliminary work. Moreover, there may be variation between genotypes in terms of folate retention after cooking. It will be important to evaluate and select germplasm that maintains high folate content after cooking.