Development and Evaluation of a 9K SNP Array for Peach by Internationally Coordinated SNP Detection and Validation in Breeding Germplasm

Although a large number of single nucleotide polymorphism (SNP) markers covering the entire genome are needed to enable molecular breeding efforts such as genome wide association studies, fine mapping, genomic selection and marker-assisted selection in peach [Prunus persica (L.) Batsch] and related Prunus species, only a limited number of genetic markers, including simple sequence repeats (SSRs), have been available to date. To address this need, an international consortium (The International Peach SNP Consortium; IPSC) has pursued a coordinated effort to perform genome-scale SNP discovery in peach using next generation sequencing platforms to develop and characterize a high-throughput Illumina Infinium® SNP genotyping array platform. We performed whole genome re-sequencing of 56 peach breeding accessions using the Illumina and Roche/454 sequencing technologies. Polymorphism detection algorithms identified a total of 1,022,354 SNPs. Validation with the Illumina GoldenGate® assay was performed on a subset of the predicted SNPs, verifying ∼75% of genic (exonic and intronic) SNPs, whereas only about a third of intergenic SNPs were verified. Conservative filtering was applied to arrive at a set of 8,144 SNPs that were included on the IPSC peach SNP array v1, distributed over all eight peach chromosomes with an average spacing of 26.7 kb between SNPs. Use of this platform to screen a total of 709 accessions of peach in two separate evaluation panels identified a total of 6,869 (84.3%) polymorphic SNPs. The almost 7,000 SNPs verified as polymorphic through extensive empirical evaluation represent an excellent source of markers for future studies in genetic relatedness, genetic mapping, and dissecting the genetic architecture of complex agricultural traits. The IPSC peach SNP array v1 is commercially available and we expect that it will be used worldwide for genetic studies in peach and related stone fruit and nut species.


Introduction
Dissection of the genetic components underlying complex agricultural traits in plants has so far used mainly experimental bi-parental crosses and a limited number of genetic markers. The growing number of draft genome sequences of many species coupled with Next Generation Sequencing (NGS) technologies has quickly changed the paradigm of plant genome analysis. Alignment of the short NGS reads obtained from diverse and unrelated varieties to a reference genome allows the identification of large numbers of Single Nucleotide Polymorphisms (SNPs) and small Deletion/Insertion Polymorphisms (DIPs) and the contextual estimation of their minor allele frequencies (MAF).
Simultaneous genotyping of hundreds of thousands of SNPs in a single assay has become feasible due to innovative combinations of assay and array platform multiplexing [1]. Illumina's Infinium® BeadArray Technology platform is an extremely high-throughput SNP genotyping system that allows the detection of up to 2.5 million SNPs per single DNA sample [2]. Multiplex SNP genotyping enables cost-effective marker-assisted selection strategies, whole genome fingerprinting, genome-wide association studies (GWAS), map-based gene cloning and population-based analyses. The availability of such tools fosters the application of GWAS in plants and animals [3], [4], [5], [6] and provides the opportunity to apply genomic selection (GS) methods to agricultural species, including Prunus. High-density SNP genotyping arrays have been designed for several domestic animals including cattle [3], pig [4] and chicken [5]; arrays are being developed in several plant species including apple [6], maize, tomato, and cherry (http://www.illumina.com/agriculture).
In peach [Prunus persica (L.) Batsch] and related Prunus species, QTL studies have been conducted using experimental bi-parental crosses and a limited number of genetic markers [7], [8], [9], [10], [11], [12], [13]. A recent association study by Aranzana et al. [14] using a limited number (50) of SSR markers in unrelated accessions of American and European origin indicated that linkage disequilibrium (LD) in peach is quite high, up to 13-15 cM, with stratification of peach accessions into at least three subpopulations. This study suggests that a small number of markers (about 600) might be sufficient to scan the peach genome. However, the peach gene pool used in this study [14] is known to have a narrow genetic base [15] in comparison with Eastern germplasm [16] where LD is likely to have a lower level of conservation. Moreover, studies in grape [17], [18] and maize [19] suggest that SNPs estimate a much lower decay of LD than SSRs. Hence, we suggest that a higher number of SNP markers covering the entire genome would be necessary to scan the peach genome and to perform GWAS.
The availability of the peach reference genome sequence recently released by the International Peach Genome Initiative [20] facilitated genome-wide variant detection and the development of dedicated genomic tools. This, together with the ability to acquire massive sequence datasets from next generation sequencers, allowed the efficient identification of a large number of genetic markers, such as SNPs and small DIPs, enabling the development of a SNP array in this important horticultural crop. We describe here how members of The International Peach SNP Consortium (IPSC), which includes institutions from the U.S., Italy, and Spain, have worked together to identify genome-wide sequence variation and to develop a moderate-density peach high-throughput Infinium® SNP genotyping platform relevant for worldwide peach breeding germplasm, utilizing SNPs discovered using next generation sequencing platforms.

Materials and Methods
Whole genome re-sequencing of peach breeding accessions
A SNP detection panel of 56 peach breeding-relevant accessions, assembled with the goal of achieving an efficient coverage of the genetic background of cultivated peach (Table 1), was used for low-depth genome re-sequencing. The accessions were founders, intermediate ancestors, and important breeding parents used in international peach breeding programs, chosen both for the significance of their contribution to breeding germplasm according to pedigree records, and for genetic diversity based upon relatedness estimates from SSR studies [21], [14], (Verde I. unpublished data). Accessions were divided into 12 pools. For each accession in pools 1 through 11, paired-end libraries were prepared as recommended by Illumina (Illumina Inc., San Diego, CA, USA) separately at the USDA-ARS-National Clonal Germplasm Repository (NCGR) and the Istituto di Genomica Applicata (IGA, Udine, Italy) laboratories. In summary, library preparations were performed using minor modifications of the Illumina DNA-seq Sample Preparation protocol (Illumina, Inc., San Diego, CA). Briefly, 1-3 µg of genomic DNA was sheared by sonication using Diagenode's Bioruptor XL sonicator system (Sparta, NJ, USA). This was followed by standard blunt-ending and addition of an 'A' base. Then, Illumina adapters with indexes (3 bp or 6 bp, at the NCGR and IGA, respectively) were ligated to the ends of the fragments. After the ligation reaction and separation of un-ligated adapters, samples were amplified by PCR to selectively enrich for those fragments in the library with adapter molecules at both ends. The samples were quantitated and quality tested using the NanoDrop ND-1000 UV-Vis Spectrophotometer (Thermo Scientific, Wilmington, DE, USA) and Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Libraries were pooled in equimolar ratios to yield a total concentration of 10 nM.
Aliquots of pooled libraries (5 pmol) were processed with the Illumina Cluster Generation Station, following the manufacturer's recommendations. Pools were sequenced in one lane of Illumina GA II with 94 cycles per read at the Istituto di Genomica Applicata (IGA, Udine, Italy) for pools 1-5 and with 80 cycles per read at the Center for Genome Research and Biocomputing (CGRB, Oregon State University, Corvallis, OR, USA) for pools 6-11, as specified in Table 1. Libraries for accessions in pool 12 were constructed for 454 GS FLX sequencing with MID-labeled libraries. Nuclear DNA from the eight accessions of pool 12 was digested with AluI and size-selected for 400-500 bp fragments. At IGA the CASAVA 1.7.0 version of the Illumina pipeline was used and at the OSU CGRB the CASAVA 1.6.0 version of the Illumina pipeline was used. Raw sequences were retrieved and kept separate for each accession and then aligned to the Peach v1.0 reference genome [20] using CLC Genomics Workbench (CLC Bio, Aarhus, Denmark) at IGA and SOAP2 [22] at the CGRB. In this paper, "chromosome" refers to one of the eight pseudomolecules (scaffolds) of the Peach v1.0 reference genome.

SNP detection
SNP detection and filtering followed a multi-step procedure (Figure 1). SNPs from sequences generated at IGA (pools 1-5) were detected using CLC Genomics Workbench (CLC Bio, Aarhus, Denmark), using default parameters for filtering except: (1) minimum Illumina quality score (Qscore) was 25; (2) minimum coverage was 30 reads and maximum coverage was 2.5× the average coverage, corresponding to 161 reads; and (3) a minimum of five reads supported the presence of the minor allele in the accessions. SNPs in repetitive regions were also removed with internal scripts. SNPs from the 23 CGRB-sequenced accessions (pools 6-11) and from the eight accessions in pool 12 were detected using SoapSNP (http://soap.genomics.org.cn/soapsnp.html) essentially as recommended by Li [22]. In filtering for pools 6-12, SNPs were kept if: (1) the Illumina Qscore was more than 30 (except for pool 12, for which an Illumina Qscore could not be obtained); (2) the maximum number of reads for either allele across all accessions was less than the average read depth of all SNPs plus three standard deviations, 380 in this case; (3) a minimum of five reads supported the presence of the minor allele in the accessions, providing a minimum coverage of 10 reads for the SNP; and (4) the average copy number of the SNP flanking region was less than two, corresponding to non-repetitive regions of the genome. These detection and filtering efforts yielded "Stage 1 SNPs" (Figure 1). We compared the SNP calls in three of four pairs of accessions independently sequenced in different labs (Figure 1).
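The four criteria listed above for pools 6-12 can be sketched as a simple predicate. This is a minimal illustration under stated assumptions, not the consortium's pipeline: the `CandidateSNP` record and its field names are hypothetical, and the actual filtering was applied to SoapSNP output.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds as reported in the text for pools 6-12.
MIN_QSCORE = 30          # Illumina Qscore must be greater than this
MAX_ALLELE_READS = 380   # mean read depth of all SNPs + 3 SD, as reported
MIN_MINOR_READS = 5      # reads supporting the minor allele
MAX_COPY_NUMBER = 2.0    # flanking-region copy number must be below this


@dataclass
class CandidateSNP:
    """Hypothetical record; the field names are illustrative only."""
    qscore: Optional[float]   # None when no Illumina Qscore exists (pool 12)
    major_reads: int          # reads supporting the major allele
    minor_reads: int          # reads supporting the minor allele
    flank_copy_number: float  # average copy number of the flanking region


def passes_stage1(snp: CandidateSNP) -> bool:
    """Return True if the SNP satisfies all four reported criteria."""
    # (1) Qscore > 30, skipped when no Qscore is available (454 data).
    if snp.qscore is not None and snp.qscore <= MIN_QSCORE:
        return False
    # (2) Neither allele may reach the depth ceiling.
    if max(snp.major_reads, snp.minor_reads) >= MAX_ALLELE_READS:
        return False
    # (3) At least five minor-allele reads; because the major allele has
    #     at least as many reads, total coverage is then at least 10.
    if snp.minor_reads < MIN_MINOR_READS:
        return False
    # (4) Flanking region must look non-repetitive.
    return snp.flank_copy_number < MAX_COPY_NUMBER
```

For example, `passes_stage1(CandidateSNP(35, 20, 6, 1.0))` is accepted, whereas the same record with a Qscore of exactly 30 is rejected.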

SNP validation with GoldenGate® assay
A set of 96 SNPs was chosen from the Stage 1 SNPs from pools 6-11 to validate the efficiency of SNP detection and adjust subsequent filtering parameters (Figure S1). The initial selection comprised 74 Stage 1 SNPs evenly spread over the eight pseudomolecules representing the haploid chromosomes and linkage groups (LGs) of the Peach v1.0 'dhLovell' genome assembly [20]. One SNP was chosen to be located within 200 kb of each end of each LG. SNPs chosen between these ends were then evenly spaced along each LG according to their total genetic distance [23], corresponding to one SNP every 2-5 Mb. A further 14 SNPs targeted the F-M locus on LG4 [24], at an average spacing of 60 kb (ranging from 8 kb to 113 kb). The final eight validation SNPs targeted candidate genes for the Y locus on LG1 (four SNPs at 123, 270, and 341 kb intervals), the Cs locus on LG3 (two SNPs 38 kb apart), and the G locus on LG5 (two SNPs 272 kb apart) according to the genomic positioning reported by Dirlewanger [23]. While trait locus-targeted SNPs were chosen for variation in genic regions when possible, preference was given to achieving uniform target spacing in designated windows. Approximately 20% of the 96 SNPs chosen were planned to be accession-specific, i.e., their minor allele would be detected in only one re-sequenced accession of the detection panel (for which data available at the time included only accessions from pools 6-11). Sixteen of the evenly spaced SNPs and four of the trait locus-targeted SNPs met this criterion. Accession-specific SNPs were from nine peach accessions, with the almond 'Nonpareil' and the peach reference cultivar 'dhLovell' having five and six such SNPs, respectively. The 96 SNPs also deliberately included a wide range of MAFs.
To test the variables of genomic location, genic location, and MAF for their effects on genotyping efficiency, a validation panel of 160 Prunus accessions (54 peach cultivars, three interspecific hybrid cultivars, three almond cultivars, 59 breeding selections of peach and related species, and 41 seedlings of breeding populations as listed in Table S1) was screened with the 96-SNP subset, using the GoldenGate® assay. Individuals in the validation panel were founders, intermediate ancestors, important breeding parents, and seedlings of modern peach cultivars, forming a complex pedigree structure linking much of the world's cultivated

SNP final choice
Stage 1 SNPs from pools 1-5 and pools 6-12 were independently converted to Illumina Assay Design Tool (ADT) format with custom scripts and scored by Illumina. "Infinium® I" type SNPs (A/T or C/G transversions) were removed, as well as SNPs with any failure code or ADT score <0.9. Then the two datasets were merged, and duplicates were removed with custom scripts. The remaining SNPs were filtered by removing those: (1) where the last 25 bp of the 50 bp probe were duplicated; (2) supported by fewer than five accessions; (3) not in predicted coding regions; and (4) not located on one of the first eight pseudomolecules representing the eight peach chromosomes. This process yielded "Stage 2 SNPs" (Figure 1).
Pre-validated SNPs were obtained from several sources: (1) polymorphic SNPs from the GoldenGate® validation activity; (2) peach RosCOS SNPs [26]; and (3) SNPs requested for inclusion by the international peach genomics community. Pre-validated SNPs were filtered to remove Infinium® I SNP types and those with an ADT score <0.6.
For the final choice of 9,000 SNPs for the IPSC peach 9 K SNP array v1, filtered pre-validated SNPs were automatically included. For the remainder, SNPs were chosen to provide an even spacing across the genome, corresponding to one SNP selected for every 4.74 Stage 2 SNPs (40,794/8,613).
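The selection ratio quoted above follows from the array capacity: of 9,000 slots, 387 were taken by filtered pre-validated SNPs (the count reported in the Results), leaving 8,613 for the 40,794 Stage 2 SNPs. The sketch below reproduces that arithmetic and shows one simple way to thin a position-sorted list at a regular interval. The `thin_evenly` helper is hypothetical and spaces SNPs evenly by rank rather than by physical distance; it is not the selection procedure actually used.

```python
TOTAL_SLOTS = 9_000     # candidate SNPs for the array
PREVALIDATED = 387      # pre-validated SNPs included automatically
STAGE2_SNPS = 40_794    # Stage 2 SNPs available for selection

remaining_slots = TOTAL_SLOTS - PREVALIDATED   # 8,613
ratio = STAGE2_SNPS / remaining_slots          # about 4.74 Stage 2 SNPs per slot


def thin_evenly(sorted_positions, n_keep):
    """Keep n_keep entries at regular index intervals (illustrative rule)."""
    if n_keep >= len(sorted_positions):
        return list(sorted_positions)
    step = len(sorted_positions) / n_keep
    return [sorted_positions[int(i * step)] for i in range(n_keep)]


if __name__ == "__main__":
    print(f"{remaining_slots} slots, one per {ratio:.2f} Stage 2 SNPs")
```

A per-chromosome application of such a rule would keep the density of selected SNPs proportional to the density of Stage 2 SNPs, which is consistent with the uneven distribution described for the final array.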

SNP array evaluation
The IPSC peach 9 K SNP Infinium® II array v1 was evaluated using 709 accessions divided into two independent evaluation panels, one panel from the European Union (EU) and the other from the USA (US). The EU panel comprised 232 accessions, of which 229 were peach cultivars and three were wild related Prunus species or their hybrids with peach. The US panel comprised 479 accessions that included pedigree-linked cultivars, breeding lines, and seedlings (Table S2). Overall, selected material comprised cultivars (45%), advanced selections (4%) and seedlings (51%). Accessions with pure peach and almond ancestry accounted for 82% and 2%, respectively, while 16% of genotyped material had interspecific backgrounds with almond (7%), and peach and almond wild relatives, 5% and 4%, respectively, in their pedigrees. Some US panel accessions were related Prunus species or were known interspecific hybrids: 5% had peach-related (P. davidiana and P. mira) ancestry, 10% had almond (P. dulcis), and 3% had almond-related (P. argentea and P. scoparia) ancestry. Genomic DNA extraction and quantitation were conducted as described above for the SNP validation panel for the U.S. accessions. For the EU panel, genomic DNA was extracted using the DNeasy Plant Mini Kit (Qiagen) and quantitated using a Fluoroskan Ascent (Thermo Scientific, Finland) microplate reader. The IPSC array, employing exclusively Illumina Infinium® II design probes and dual color channel assays (Infinium HD Assay Ultra, Illumina), was used for genotyping, following the manufacturer's recommendations. SNP genotypes were scored with the Genotyping Module of the GenomeStudio Data Analysis software (Illumina, Inc.). A GenTrain score of >0.4 and a GenCall 10% of >0.2 were applied to remove most SNPs that did not cluster (homozygous), or had ambiguous clustering. SNPs that did not cluster for more than 50% of samples were also eliminated from further consideration.
The threshold of allowed No Calls (failed genotyping) was 'relaxed' in anticipation of the presence of null alleles for some SNPs contributed by non-peach species.
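The quality thresholds described above can be expressed as a small filter over per-SNP summary statistics. This is only a sketch: the dictionary keys are hypothetical and do not correspond to GenomeStudio export column names, and the scoring itself was performed within GenomeStudio.

```python
GENTRAIN_MIN = 0.4      # GenTrain score must exceed this
GENCALL10_MIN = 0.2     # 10% GenCall score must exceed this
MAX_UNCLUSTERED = 0.5   # SNPs failing to cluster in >50% of samples are dropped


def keep_snp(gentrain, gencall10, unclustered_fraction):
    """Apply the three reported thresholds to one SNP."""
    return (gentrain > GENTRAIN_MIN
            and gencall10 > GENCALL10_MIN
            and unclustered_fraction <= MAX_UNCLUSTERED)


def filter_snps(records):
    """Return names of SNPs passing the thresholds.

    Each record is a dict with hypothetical keys
    'name', 'gentrain', 'gencall10' and 'unclustered'.
    """
    return [r["name"] for r in records
            if keep_snp(r["gentrain"], r["gencall10"], r["unclustered"])]
```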

Results
SNP detection
A total of 25.4 Gb of DNA sequence (111.7× coverage of the peach genome) was obtained from 279.7 million reads (Illumina and 454) generated for the 56 peach accessions multiplexed among 12 pools (Table 1). Excluding one sequencing run for 'Lovell' that generated an unusually low number of reads, the total number of reads sequenced using the Illumina platform averaged 2.16× coverage per accession and ranged from ∼0.56 to 19.2 million reads in 'Tabacchiera' and 'Sahua Hong Pantao', respectively. The total number of reads per accession sequenced using the 454 platform averaged 0.22× coverage and ranged from 0.145 to 0.198 million reads in 'Sweet Cap' and 'Big Top', respectively. The number of SNPs identified after Stage 1 detection was 943,549 for pools 1-5 and 57,933 for pools 6-11 (Figure 1). When the same filtering parameters as those used for pools 6-11 were reapplied to pools 6-12, the number of SNPs increased to 78,805.
When the SNP calls were compared among the three pairs of independently sequenced accessions, the majority of the positions were not covered by at least one of the two datasets and

SNP validation
Several SNP characteristics were associated with performance differences in the GoldenGate® assay (Table 2). Exonic and intronic SNPs were the most successful, with approximately 75% of polymorphisms verified; in contrast, intergenic SNPs were the worst performers, with only a third being polymorphic and a third failing (Table 2, Figure S2). Although MAF observed from sequencing of the detection panel (n = 23) and MAF observed after GoldenGate® genotyping of the validation panel (n = 119) were not well correlated (R = 0.12), the higher the detection panel MAF of a SNP, especially >30%, the more likely its validation panel MAF was >10%. Yet SNPs with a detection panel MAF of <20% were more likely than higher MAF SNPs to convert to a validation panel MAF of <10% without a higher-than-average rate of failure or monomorphism (Table 2). Of the 74 SNPs evenly distributed over the genome, those on LG6, LG1, and LG3 had the highest conversion to polymorphism yet did not have a higher-than-average proportion of exonic and intronic SNPs (Table 2, Figure S2). LG1 and LG6 also had the highest SNP representation because they were the longest in genetic length. LG8 gave a high proportion of polymorphic SNPs with low MAF (<10% in the validation panel). LG2 had a very high failure rate yet only an average proportion of exonic and intronic SNPs (Table 2, Figure S2). Accession-specific SNPs had a close to average performance for rate of failure, monomorphism, and polymorphism, but provided a higher-than-average proportion of low MAF SNPs observed in the validation panel. No SNPs designated as accession-specific in the detection panel remained accession-specific after genotyping the validation panel, although those developed from the almond accession 'Nonpareil' were obviously indicative of almond introgression in the validation panel. SNPs targeted to the F-M locus also provided a high proportion of low-MAF SNPs, but had a high failure and monomorphism rate as did the SNPs targeted to other trait loci (Table 2). All trait locus-targeted SNPs were associated with a very high proportion of intergenic SNPs (Figure S2). Overall, SNP failure was similar across all parameters except for a high level for SNPs that were on LG2, targeted to trait loci, or intergenic. Monomorphism was highest for SNPs with a detection panel MAF of 21-30%, trait locus-targeted, or those that were intergenic (Table 2).
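Because the comparison above rests on minor allele frequency in two different panels, a short sketch of how MAF can be computed from genotype calls may be helpful. It assumes biallelic SNPs and a hypothetical input format (two-character genotype strings, with `None` or `"--"` for no-calls); it is not the software used in the study.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Return the MAF of one biallelic SNP, or None if nothing was called.

    genotypes: iterable of two-character strings such as 'AA' or 'AG';
    None or any string containing '-' is treated as a no-call.
    """
    counts = Counter()
    for call in genotypes:
        if call is None or "-" in call:
            continue
        counts.update(call)          # count each of the two alleles
    total = sum(counts.values())
    if total == 0:
        return None                  # no usable calls
    if len(counts) == 1:
        return 0.0                   # monomorphic in this panel
    return min(counts.values()) / total


def maf_class(maf, threshold=0.10):
    """Label a SNP as rare or common using the 10% cut-off from the text."""
    return "rare" if maf < threshold else "common"
```

With the calls `["AA", "AG", "GG", "AA"]`, allele A is seen five times and G three times, so the function returns 3/8 = 0.375.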
Of the 32 validation SNPs that were polymorphic in the interspecific T6E bin-mapping set, 31 had joint genotypes corresponding to their expected genomic region. The presence of null alleles was evident for 14 of these polymorphic SNPs. One intergenic SNP targeted to the F-M locus on LG4 (bin 4:63; Scaffold_4:22274908) resulted in polymorphism (MAF 23%) with a joint genotype that unexpectedly placed it on LG2 (bin 2:13).
Of the 55 SNPs polymorphic in the entire validation panel, 52 (95%) were polymorphic in at least one of the five six-seedling breeding progenies. One SNP was polymorphic in all five progenies, 10 in four progenies, 17 in three progenies, 10 in two, and 14 in just one progeny. Proportions of the 55 polymorphic SNPs that were polymorphic within individual progenies were 58%, 40%, 53%, 44%, and 42% from Arkansas

SNP final choice
Stage 2 filtering reduced the one million available SNPs by 96%. Restricting the SNPs to Infinium® II types eliminated 18% of Stage 1 SNPs from consideration, the thresholds of ADT ≥0.9 and supporting evidence from at least five accessions eliminated 50% of Stage 1 SNPs, and restriction to genic regions cut an additional 23%. Considering only exonic SNPs more than halved the remaining 86,360 to 41,800. Finally, 1,006 SNPs located on peach genome scaffolds as yet unassigned to one of the eight chromosomes of this crop were discarded, resulting in 40,794 Stage 2 SNPs (Figure 1, Table 3). The distribution of Stage 2 SNPs varied among and within the eight peach chromosomes (Figure 2, Table 3).
A total of 649 pre-validated SNPs considered for inclusion in the final array were reduced to 387 by the filtering parameters used for these SNPs. Of the 55 polymorphic SNPs from the GoldenGate® validation assay, 45 were included. Of the 453 peach RosCOS SNPs considered, 225 were included in the final array. SNPs requested for inclusion from the international community were 108 from the DRUPOMICS project (Verde I., unpublished data) and 33 from the University of Chile (Silva H., unpublished data).
The IPSC peach 9 K SNP array v1 achieved an average spacing of 26.7 kb between SNPs (Table 3). Distribution of SNPs along the Peach v1.0 pseudomolecules varied according to the number of Stage 2 SNPs observed throughout the genome (Figure 2). The largest average gap between successive SNPs (42.0 kb) occurred on chromosome 1 and the smallest on chromosome 4 (18.6 kb). A total of 252 gaps were larger than 150 kb and the largest single gap was 915.8 kb on chromosome 5 (Table 3), but the vast majority of gaps were less than 5 kb (Figure 3). For most of the regions with large gaps there were simply no SNPs available to reduce the gap size considerably. However, some of the gaps were caused by the loss of 856 SNPs that occurred during the manufacturing of the array.
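The counts reported for the filtering funnel are mutually consistent, as the following arithmetic check illustrates. All figures are taken from the text; only the variable names are introduced here.

```python
# Stage 1 SNPs (after re-filtering pools 6-12 together).
stage1_pools_1_5 = 943_549
stage1_pools_6_12 = 78_805
stage1_total = stage1_pools_1_5 + stage1_pools_6_12   # 1,022,354

# Later steps of the Stage 2 filter.
genic = 86_360        # SNPs remaining before the exonic restriction
exonic = 41_800       # SNPs remaining after it
unanchored = 1_006    # exonic SNPs on scaffolds not assigned to chromosomes
stage2 = exonic - unanchored                          # 40,794

reduction = 1 - stage2 / stage1_total    # overall fraction removed
exonic_fraction = exonic / genic         # share kept by the exonic filter

if __name__ == "__main__":
    print(f"Stage 1 total: {stage1_total:,}")
    print(f"Stage 2 total: {stage2:,}")
    print(f"Overall reduction: {reduction:.0%}")
    print(f"Exonic share of genic SNPs: {exonic_fraction:.1%}")
```

The Stage 1 total matches the 1,022,354 SNPs quoted in the abstract, and the exonic share of just under one half agrees with the statement that the exonic restriction "more than halved" the remaining set.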

SNP array evaluation
Of the 9,000 candidate SNPs, 8,144 remained on the array after Illumina technical dropout (loss during array manufacturing). Of these, 8,125 were located on the first eight pseudomolecules of the peach genome. The evaluation of IPSC peach 9 K SNP array v1, performed in Europe and the U.S., revealed no significant difference in the number of polymorphic, monomorphic and failed SNPs between peach-only samples and samples with interspecific backgrounds (data not shown). The only difference was observed for SNPs located in RosCOS, with only 2% of them being polymorphic in peach-only samples versus 6% in all samples, including those with interspecific ancestry. Moreover, independent evaluation of the IPSC peach 9 K SNP array v1 using EU and US accession panels revealed almost identical results for the numbers of polymorphic, monomorphic and failed SNPs. Genome scans performed with the IPSC peach 9 K SNP array v1 on the EU and US evaluation panels resulted in a SNP polymorphism rate of 70% for each panel and a failure rate of only 5% for each (Table S3). Of the polymorphic SNPs, 92% were observed with a MAF >0.10 for each panel, and 74% and 69% had a MAF >0.20 in the EU and US evaluation panels, respectively (Table S3). When the US evaluation panel was reduced to only cultivars and advanced selections, the proportion of SNPs with MAF >0.20 (73%) was almost identical to that observed in the EU evaluation panel. Proportions of SNPs in each MAF range were similar between the two evaluation panels regardless of the type of the material analyzed (Figure 4). In addition to 5,967 SNPs (73%) polymorphic in both evaluation panels, 425 (5%) and 477 (6%) were polymorphic only in the EU and US panels, respectively (Figure 5).
The IPSC peach 9 K SNP array v1 achieved a total of 6,869 polymorphic SNPs across the 709 accessions scanned in the two panels combined (Table 3), representing 84.3% of SNPs present on the array. These polymorphic markers provided an average spacing of 31.5 kb across the peach genome, which was consistent among the chromosomes. Fourteen of the nineteen SNPs localized on the unanchored scaffolds of the Peach v1.0 assembly were polymorphic. They provide useful coverage for the unmapped fraction of the Peach v1.0 assembly by helping to reduce gaps. The largest gap between polymorphic SNPs was on chromosome 1 (1,254 kb). Physical positions and MAFs of polymorphic SNPs were compared for the US peach and non-peach cultivars and selections (Figure 6). The correlation coefficient (r) between MAF in "peach cultivars and selections" and in "non-peach cultivars and selections" within the US panel was 0.437. Predicted SNPs in RosCOS loci were mostly monomorphic, with only 14 being polymorphic.
Table 3. Chromosome distribution and performance of SNPs on the IPSC peach 9 K SNP array v1.
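The panel-level counts can be combined to recover the overall figures, which may help readers check the percentages. The numbers below are those reported in the text; the variable names are illustrative.

```python
on_array = 8_144   # SNPs remaining on the array after manufacturing dropout
both = 5_967       # polymorphic in both the EU and US panels
eu_only = 425      # polymorphic only in the EU panel
us_only = 477      # polymorphic only in the US panel

polymorphic_total = both + eu_only + us_only          # 6,869

pct_of_array = 100 * polymorphic_total / on_array     # about 84.3%
pct_shared = 100 * both / polymorphic_total           # about 86.9%

if __name__ == "__main__":
    print(f"Polymorphic in at least one panel: {polymorphic_total:,}")
    print(f"Share of array: {pct_of_array:.1f}%")
    print(f"Share of polymorphic SNPs common to both panels: {pct_shared:.1f}%")
```

The three categories are disjoint, so their sum gives the number of SNPs polymorphic in at least one panel, and dividing the shared category by that sum gives the fraction of polymorphic markers confirmed independently in both panels.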

Discussion
SNP detection
The large difference in the number of SNPs identified between pools 1-5 and pools 6-11 is not surprising given the much higher sequencing coverage of pools 1-5 (84.8×) than 6-11 (25.1×) (Table 1) and the greater genetic diversity expected in pools 1-5. Updates in software and hardware for the Illumina Genome Analyzer during the study were major reasons for the difference in coverage between the two main pool groups. Accessions in pools 6-11 were sequenced in February 2010 with GAIIx technology, when the expected number of reads was 17-21 million reads per lane. Accessions in pools 1-5 were sequenced between May and June 2010 with an Illumina GAIIx, when a normal run at IGA was producing about 60 million reads per lane. This higher coverage in pools 1-5 and the SNP validation results based on pools 6-11 prompted some adjustments to the filtering adopted for SNP prediction in pools 1-5. Modern western peach germplasm has a narrow genetic base [15]. In addition to including cultivars of modern U.S. and European breeding programs, pools 1-5 also included landraces from around the world (China, Italy, Japan, Spain, Pakistan, and Brazil). In contrast, the set of accessions of pools 6-11 was less diverse, comprising mostly founders of importance to modern U.S. peach breeding programs.
The low overall percentage of SNPs with matching genotypes in the three independently sequenced pairs of accessions is caused by the low sequence coverage targeted here. The sequences generated in this project that were not included in the IPSC peach 9 K SNP array v1 are a valuable resource for peach and other Prunus species as they can be searched for additional sequence variation in regions underlying traits of economic importance.

SNP validation
SNP validation was a critical step for empirically determining appropriate parameters for prioritizing detected SNPs, improving the SNP detection for pools 1-5 that was conducted after SNP detection for pools 6-11, and choosing, from among approximately one million SNPs, just the 9,000 used in the final array design. The validity of the GoldenGate® genotyping results for extrapolation of parameter thresholds to all detected SNPs was supported by the correct genomic location assignment of polymorphic SNPs by bin-mapping. Our observation that conversion of NGS-detected SNPs to true SNPs was more reliable for polymorphisms located in predicted exons and introns was consistent with observations in apple [6]. Better performance of genic SNPs may be due to a lower level of undetected sequence variation in SNP-flanking regions, as opposed to reduced sequence conservation in UTRs and intergenic space, as argued by Chagné et al. [6]. If so, deeper sequencing of detection panel accessions to identify a larger proportion of the polymorphisms would enable avoidance of SNPs with polymorphisms in flanking probe sequences and hence increase the probability of successful SNP marker development. The low depth of sequencing of individual accessions in the detection panel is the probable reason for the lack of significant correlation between detection panel MAF and validation panel MAF. However, detection panel MAF was still a useful parameter for developing rare (<10%) or common (>10%) SNPs. Rare SNPs would be useful for detecting unique haplotypes in a germplasm set, although they would be uninformative for most individuals. Common SNPs would be useful for detecting the majority of genetic variation in a germplasm set and, with a high degree of saturation of any given genomic region, may detect most or all unique haplotypes and thus overcome a lack of rare lineage-specific SNPs. The accession specificity of a SNP in the detection panel was found to be a redundant criterion for developing rare lineage-specific SNPs, and in any case, the relatively small number of accessions in the detection panel and low sequencing coverage of each negated true specificity.
The reason for differential performance of SNPs among some LGs is not clear. In the development of a genome-wide array, it is not advisable to avoid certain LGs or to saturate others just to increase the overall proportion of successful SNPs. These differences were not maintained in the final array (Table 3) and hence probably represent sampling error in the relatively small set of validation SNPs.
Although only about half of the successful SNPs segregated in any of the four individual F1 populations included in the validation assay, together these four populations captured most (95%) of the polymorphisms. This observation highlights the importance of utilizing diverse target germplasm rather than a narrow subset of the target genepool, such as a single population, for SNP evaluation and assessment of array informativeness.

SNP final choice
The SNPs identified within this internationally coordinated effort were subjected to stringent filtering criteria to maximize the efficacy of the IPSC peach 9 K SNP array v1. Given that most candidate SNPs were essentially anonymous, Infinium® I SNPs were avoided because these require two bead types and thus use two SNP slots of the assay, unlike the single bead type for Infinium® II SNPs. Infinium® II SNPs have also been reported to perform better (92% vs. 85%) [3]. Furthermore, Infinium® II SNPs were the most abundant class detected in our study: 82% of the SNPs detected were of this type. SNPs with an ADT score higher than 0.9 were used to increase the probability of SNP success as observed in the validation assay. This Illumina design score reflects the predicted ability of the SNP-flanking sequences to provide a successful assay [5]. The availability of a large number of SNPs allowed us to employ a stringent threshold that was higher than that used in the recently released Illumina Beadchips for apple, pig and chicken, which used ADT thresholds of 0.7, 0.8 and 0.6, respectively [6], [4], [5].
An important aspect we took into consideration in choosing the SNPs for the array was their genomic context. The large number of SNPs discovered in this study allowed us to select 9,000 exonic markers to be included in the array. The high level of transferability of transcriptomic markers across Prunus species [27] makes these exonic SNPs useful tools even for related non-peach species such as almond. We also expected exonic SNPs to be more commonly associated with causative mutations underlying phenotypic differences than intergenic SNPs, because of their potential to alter protein sequences.
Finally, another important criterion for a whole genome genotyping assay is to have a uniform distribution of the SNPs across the genome, as this greatly facilitates finding associations between markers and phenotypes. The SNPs on our array cover most of the peach genome with markers well distributed over all chromosomes. The average gap size achieved across the genome was 26.7 kb, increasing to 31.5 kb when only polymorphic SNPs are considered. The average ratio of physical to genetic distance in peach is about 440 kb/cM, which was obtained by comparing the Peach v1.0 assembly with an updated version of the Prunus reference map [23] (IPGI unpublished results), and this gives an average of 13.3 polymorphic SNPs per cM for our array. Such resolution provides unprecedented power to dissect the peach genome for pinpointing QTLs and determining genetic relatedness. However, SNPs on the array developed here were not evenly spaced across the genome physically or genetically, and a few regions remained with gaps up to 915.8 kb (1,254 kb when considering only polymorphic markers). As the SNP spacing on the array was determined from the density of SNPs detected across the genome and was randomly but proportionally reduced by filtering and non-polymorphism, we expect that the largest gaps exist in regions of low polymorphism (low informativeness) in this crop. Even the largest gap corresponds to a genetic distance of about 2.8 cM, which is still powerful for QTL and relatedness studies. LD decay in peach was estimated at about 13-15 cM [14], so even the regions with large gaps have a density of markers at least 5-fold higher than estimated to be needed to perform optimal GWAS in peach.
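The conversion from physical gap size to approximate genetic distance used above can be sketched as follows. The calculation assumes that the reported genome-wide average of 440 kb/cM applies uniformly, which is a simplification because recombination rates vary along chromosomes; the function name is illustrative.

```python
KB_PER_CM = 440.0   # reported average physical distance per centimorgan


def kb_to_cm(gap_kb, kb_per_cm=KB_PER_CM):
    """Convert a physical gap (kb) to an approximate genetic distance (cM)."""
    return gap_kb / kb_per_cm


largest_gap_all_kb = 915.8           # largest gap, all SNPs on the array
largest_gap_polymorphic_kb = 1_254   # largest gap, polymorphic SNPs only

if __name__ == "__main__":
    for label, gap in [("all SNPs", largest_gap_all_kb),
                       ("polymorphic SNPs", largest_gap_polymorphic_kb)]:
        print(f"Largest gap ({label}): {gap} kb = {kb_to_cm(gap):.2f} cM")
```

Under this assumption the 1,254 kb gap between polymorphic SNPs corresponds to roughly 2.85 cM, in line with the value of about 2.8 cM given in the text, and the 915.8 kb gap corresponds to roughly 2.08 cM.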

SNP array evaluation
In summary, the total of 6,869 SNPs on the IPSC peach 9 K SNP array v1 verified as polymorphic through extensive empirical evaluation represent an excellent source of markers for studies in genetic relatedness, genetic mapping, and dissecting the genetic architecture of complex agricultural traits. The SNPs included in the array were successfully used for genotyping 709 accessions in two independent evaluation panels. The majority (86.9%) of these markers were polymorphic in both experiments (Figure 5), indicating that the array contains a very low number of false positive SNPs. The comparison (Figure 6) between peach and non-peach accessions showed a common pattern in MAF distribution and large common gaps. The correlation (r = 0.437) between MAF calculated within peach vs. non-peach cultivars and selections shows that the array can be efficiently used in peach-related species as well. The most notable difference in MAF for the polymorphic SNPs between the peach and non-peach selections was on chromosome 2 around 19 Mbp, where the non-peach group had higher MAF. The common gaps on chromosome 1 (∼21 Mbp), chromosome 2 (∼8.5 Mbp), chromosome 4 (∼24.5 Mbp), chromosome 5 (∼7 Mbp), and chromosome 8 (∼10 Mbp) may represent putative centromeric regions (S. Scalabrin, personal communication).

Conclusion
The SNP array described here will foster genetic studies in the stone fruits and will help bridge the gap between genomics and breeding activities because breeding germplasm was the basis of detected SNPs and SNP choices of the final array. The IPSC peach 9 K SNP array v1 is commercially available and we expect that it will be used worldwide for genetic studies in peach and related stone fruit and nut species.