Genome-Wide SNP Detection, Validation, and Development of an 8K SNP Array for Apple

As high-throughput genetic marker screening systems are essential for a range of genetics studies and plant breeding applications, the International RosBREED SNP Consortium (IRSC) has utilized the Illumina Infinium® II system to develop a medium- to high-throughput SNP screening tool for genome-wide evaluation of allelic variation in apple (Malus×domestica) breeding germplasm. For genome-wide SNP discovery, 27 apple cultivars were chosen to represent worldwide breeding germplasm and re-sequenced at low coverage with the Illumina Genome Analyzer II. Following alignment of these sequences to the whole genome sequence of ‘Golden Delicious’, SNPs were identified using SoapSNP. A total of 2,113,120 SNPs were detected, corresponding to one SNP to every 288 bp of the genome. The Illumina GoldenGate® assay was then used to validate a subset of 144 SNPs with a range of characteristics, using a set of 160 apple accessions. This validation assay enabled fine-tuning of the final subset of SNPs for the Illumina Infinium® II system. The set of stringent filtering criteria developed allowed choice of a set of SNPs that not only exhibited an even distribution across the apple genome and a range of minor allele frequencies to ensure utility across germplasm, but also were located in putative exonic regions to maximize genotyping success rate. A total of 7867 apple SNPs was established for the IRSC apple 8K SNP array v1, of which 5554 were polymorphic after evaluation in segregating families and a germplasm collection. This publicly available genomics resource will provide an unprecedented resolution of SNP haplotypes, which will enable marker-locus-trait association discovery, description of the genetic architecture of quantitative traits, investigation of genetic variation (neutral and functional), and genomic selection in apple.


Introduction
Understanding the links between phenotypic variations and their underlying DNA variation is the major challenge facing plant geneticists today. Recent advances in genomics technologies, including highly parallel sequencing and genetic marker methods for genome-wide assays of allelic variation, have now made high-resolution genetic characterization of crop germplasm feasible.
The use of genomics tools has tremendous potential to assist Rosaceae crop breeders to produce significant genetic gains for consumer and grower traits more precisely and efficiently, as well as to improve understanding of the genetic architecture of agronomic characters [1]. Genetic marker-based strategies such as QTL interval mapping, pedigree-based analysis, association mapping, and genomic selection are now powerfully enabled by genomics technologies. Although genetic markers such as Simple Sequence Repeats (SSRs) [2,3,4] and Single Nucleotide Polymorphisms (SNPs) [5,6,7] have been developed for apple, their density across the genome is not sufficient for fine dissection of functional genetic variation. We focus here on the development of a medium- to high-throughput multiplexed SNP assay tool based on the Illumina Infinium® assay for genome-wide evaluation of allelic variation in apple breeding germplasm. This product of international research community collaboration is intended to benefit apple breeders directly, and consequently growers and consumers who demand better quality pome fruit that is produced more sustainably.
SNP marker development for a genome screening tool is based on three steps: detection, validation, and final selection. The objectives and throughput required are different for each step. The detection step requires identification of a large pool of SNPs in the crop and typically involves non-targeted techniques such as resequencing, or High Resolution Melting (HRM) [8], of a few representative yet diverse individuals. This maximizes opportunities for detecting SNPs, while ensuring that the SNPs will be suitable for their final application for screening breeding germplasm. While sequencing has historically been carried out using the Sanger method, high-throughput sequencing-by-synthesis techniques have now made it possible to re-sequence entire genomes affordably [9]. Although not strictly necessary, validation is desirable to inform SNP assay development, in order to maximize the number of functional polymorphic markers in the final genetic analysis. Because of the current prohibitive cost of validating every detected SNP, validation is often based on a subset of the detected SNPs that are screened over an informative set of individuals. If a range of SNP filtering parameter values are included in the SNPs used for the validation step, suitable parameter thresholds can then be established for those SNPs to be selected for the final assay. For development of genome-wide SNP assays, adequate genome coverage by the final set chosen is critical [10]. In crops with whole genome sequences and saturated genetic maps, such as apple [11], genome coverage can be on a physical and/or a genetic basis, as desired. When such genomics resources are lacking, the assumption may be made that random distribution of a large number of SNPs will achieve adequate coverage, with the caveat that certain genomic regions will be over- or underrepresented depending on the SNP detection approach adopted.
Once SNP sets have been developed, screening can be carried out with a range of highly parallel techniques on segregating germplasm sets according to the research need or breeding application. Multiplexing and automation have enhanced the efficiency of SNP genotyping enormously, and several high-throughput platforms are now available for genotyping a variable number of samples for from one up to one million SNPs in parallel. They include, but are not limited to, TaqMan® from Applied Biosystems and array-based technologies from Illumina (GoldenGate® and Infinium®) or Affymetrix [10,12,13].
We used next-generation sequencing (NGS) to detect SNPs covering the apple genome. This effort involved re-sequencing a small set of cultivars, ancestors, and founders, chosen to represent the pedigrees of worldwide apple breeding programs by RosBREED, a consortium established to enable marker-assisted breeding for Rosaceae crops (www.rosbreed.org; [14]). We validated the SNPs detected and determined adequate filtering parameter values, using the Illumina GoldenGate® assay to screen a larger set of accessions from the international apple breeding germplasm. Based on results of the GoldenGate® assay, we further refined SNP-filtering criteria to enable the development of an 8K Infinium® II array and evaluated it, using a segregating population of apple seedlings.

Materials and Methods
The workflow and design parameters described below are summarized in Figure 1. The raw sequence data were retrieved for each apple accession (Table 1) and then aligned separately to the reference genome of 'Golden Delicious' [11] using Soap2 [15].

SNP validation with the GoldenGate® assay
A subset of 144 SNPs was chosen to validate the efficiency of SNP detection and fine-tune the filtering parameters (Figure 2). The initial selection was of 100 Stage 1 SNPs evenly spread across the apple genome, i.e., the 17 LGs representing the 17 haploid chromosomes of apple, according to the primary assembly of 'Golden Delicious' [11]. These included a SNP located in the first 200 kb from each end of each LG. SNPs between these were evenly spaced along the LG every 6 [...] [19]). These Ma locus SNPs were spaced an average of 75 Kb apart (ranging from 5 Kb to 186 Kb). It was planned that approximately 20% of the 144 SNPs would be accession-specific, i.e., their minor allele would be detected in only one re-sequenced accession of the detection panel. As only four of the evenly spread SNPs met this criterion, 24 additional SNPs were chosen as accession-specific. These latter SNPs were spread over the genome at 1-2 per LG but otherwise randomly distributed on each LG. Approximately 20% of the 144 SNPs were allocated to meet the criterion of being within exons of candidate genes for fruit quality and plant architectural traits. These included 24 of the evenly spread SNPs, four of the Ma locus SNPs, and one of the accession-specific SNPs [...] scores. Approximately 5% of SNPs were chosen from those previously validated in 'Golden Delicious' (GDsnp) using the SNPlex technique [11] and these included seven evenly spread SNPs and one Ma locus SNP, to serve as positive controls.
To test the above SNP parameters for their effects on genotyping efficiency, a validation panel of 160 apple accessions (Table S1) was screened with the 144 SNP subset, using the GoldenGate® assay. Individuals in the validation panel are founders, intermediate ancestors, and important breeding parents of modern apple cultivars, forming a complex pedigree structure linking much of the world's cultivated apple crop and incorporating the 27 accessions of the SNP detection panel. The validation panel also included bin-mapping sets of approximately eight seedlings each for: 'Golden Delicious' × 'Anna' [20], 'Malling 9' × 'Robusta 5' [21], 'Prima' × 'Fiesta' [19], 'Royal Gala' × 'Braeburn' [11], and 'Telamon' × 'Braeburn' [22]. Genomic DNA was purified from each accession using the E-Z 96 Tissue DNA Kit (Omega Bio-Tek, Inc., Norcross, USA). DNA was quantitated with the Quant-iT™ PicoGreen® Assay (Invitrogen, Carlsbad, USA), using the Victor multiplate reader (Perkin Elmer Inc., San Jose, USA). Concentrations were adjusted to a minimum of 50 ng/µl in 5 µl aliquots and were submitted to the Research Technology Support Facility at Michigan State University (http://rtsf.msu.edu/illumina-beadxpress-reader-system), where the GoldenGate® assay was performed following the manufacturer's protocol (Illumina Inc., San Diego, USA). Following amplification, the PCR products were hybridized to VeraCode microbeads via the address sequence for detection on a VeraCode BeadXpress Reader. SNP genotypes were scored with the Genotyping Module of the GenomeStudio Data Analysis software (Illumina Inc.).

SNP final choice for 8K Infinium® II array
A clustering strategy was devised that would evenly span the genetic map of apple with clusters of exonic SNPs, in order to provide a final SNP genome scan with the capability of determining SNP haplotypes at distinct loci. The design featured focal points at approximately 1 cM intervals, with 4-10 SNPs clustered at each focal point. Choice of focal points began with GDsnps [11] according to their location in the IASMA-FEM 'Golden Delicious' × 'Scarlet' reference genetic map (http://genomics.research.iasma.it/cgi-bin/cmap/viewer), followed by Rosaceae Conserved Orthologous Set (RosCOS; [23]) loci to fill genetic gaps between GDsnps, according to physical map location [24]. Chromosome ends, the first 200 Kb of each LG according to the apple draft genome pseudo-chromosomes [11], were also targeted. SNPs not located within ±50 Kb of a focal point SNP were then discarded. Within this pool of candidates, SNPs were binned according to MAF (bins of 0.1, 0.2, 0.3, 0.4, and 0.5 corresponding to MAF ranges of 0.01-0.1, 0.101-0.2, and so on). Two SNPs were chosen from each bin for each cluster, to give a total of up to ten SNPs per cluster (including the focal point SNP). To minimize detection of redundant haplotypes within clusters in subsequent genotyping, the minimum distance between SNPs was set at 2 Kb. Additional SNP clusters were developed from candidate genes for fruit quality, tree architecture, and flowering that were chosen using the available literature [25,26,27]. The best hits from a BLAST search of these candidate genes against the apple "Consensus CDS Peptides", available from the GDR website (http://www.rosaceae.org), created a list of homologous apple sequences that were then located in the apple genome and became new focal points. Stage 2 SNPs identified in these genes were subjected to Stage 3 filtering criteria, except that the "2 Kb minimum distance" rule was relaxed so that multiple SNPs could be chosen within a gene to form a cluster. Instead, up to four SNPs within each candidate gene were manually chosen to lie as far from each other as possible. Finally, SNPs with an ADT score below 0.7 were discarded.
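The window, MAF-bin, and spacing rules described above can be illustrated with a minimal Python sketch. It is not the consortium's actual pipeline: the function names, the tuple layout, and in particular the nearest-to-focal-point priority order are assumptions made for illustration, since the text does not state how candidates were ranked within a bin.

```python
from collections import defaultdict

WINDOW_BP = 50_000       # candidates must lie within +/-50 Kb of the focal point
MIN_SPACING_BP = 2_000   # minimum distance between SNPs in one cluster
PER_BIN = 2              # SNPs taken from each MAF bin
MAF_BIN_EDGES = (0.1, 0.2, 0.3, 0.4, 0.5)


def maf_bin(maf):
    """Return the upper edge of the MAF bin, or None if MAF is out of range."""
    if maf < 0.01:
        return None
    for upper in MAF_BIN_EDGES:
        if maf <= upper:
            return upper
    return None


def choose_cluster(focal, candidates):
    """Choose up to ten SNPs around one focal point SNP.

    focal and each candidate are (position_bp, maf) tuples on the same LG.
    ASSUMPTION: candidates nearest the focal point are considered first.
    """
    focal_pos, focal_maf = focal
    chosen = [focal]
    per_bin = defaultdict(int)
    per_bin[maf_bin(focal_maf)] += 1  # the focal SNP counts toward its bin

    for pos, maf in sorted(candidates, key=lambda s: abs(s[0] - focal_pos)):
        b = maf_bin(maf)
        if b is None or abs(pos - focal_pos) > WINDOW_BP:
            continue  # unusable MAF or outside the window
        if per_bin[b] >= PER_BIN:
            continue  # this MAF bin is already full
        if any(abs(pos - p) < MIN_SPACING_BP for p, _ in chosen):
            continue  # too close to an already chosen SNP
        chosen.append((pos, maf))
        per_bin[b] += 1
    return sorted(chosen)


if __name__ == "__main__":
    focal = (100_000, 0.25)
    candidates = [(100_500, 0.30), (103_000, 0.05), (105_000, 0.07),
                  (108_000, 0.09), (90_000, 0.28), (160_000, 0.45)]
    print(choose_cluster(focal, candidates))
```

In this toy input, the SNP at 100,500 bp is rejected for lying closer than 2 Kb to the focal SNP, the one at 108,000 bp because its MAF bin is already full, and the one at 160,000 bp because it falls outside the window.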

SNP array evaluation and cluster file development
A set of populations from various crosses, as well as accessions from the apple germplasm, were used to evaluate the Apple 8K Infinium® II array. This set included a 'Royal Gala' × 'Granny Smith' F1 population of 186 seedlings [28], seven controlled F1 crosses that are used as a training population for genomic selection at Plant & Food Research and comprise 1313 individuals [29], and a set of 117 accessions from the Plant & Food Research germplasm collection (S. Kumar, unpublished). Genomic DNA (gDNA) was extracted using the NucleoSpin® Plant II kit (Macherey-Nagel GmbH & Co KG, Düren, Germany), and quantitated using the Quant-iT™ PicoGreen® Assay (Invitrogen). 'Royal Gala', 'Granny Smith', and 'Golden Delicious' were used as controls. Two hundred nanograms of gDNA were used as template for the reaction, following the manufacturer's instructions. SNP genotypes were scored with the Genotyping Module of the GenomeStudio Data Analysis software (Illumina Inc., San Diego, CA). Individuals with low SNP call quality (p50GC < 0.54), as well as seedlings putatively resulting from an unintended pollination, were removed from the analysis. SNPs with a GenTrain score > 0.6 were retained and those with scores ranging between 0.3 and 0.6 were visually checked for accuracy of the SNP calling. Clusters were manually edited when the parent-offspring segregation was not correct, or when the number of missing genotypes was greater than 20.
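The sample and SNP quality thresholds in this paragraph amount to a simple triage, sketched below in Python. The function names and return labels are hypothetical, and the handling of SNPs with a GenTrain score below 0.3 (treated here as discarded) is an assumption, because the text only describes the two upper classes.

```python
P50GC_MIN = 0.54        # samples below this call quality are removed
GENTRAIN_RETAIN = 0.6   # SNPs scoring above this are retained directly
GENTRAIN_REVIEW = 0.3   # SNPs from here up to 0.6 are checked by eye


def keep_sample(p50gc):
    """Return True if a sample passes the p50GC call-quality threshold."""
    return p50gc >= P50GC_MIN


def triage_snp(gentrain_score):
    """Sort a SNP into 'retain', 'visual_check', or 'discard'.

    ASSUMPTION: scores below 0.3 are discarded; the text does not say so explicitly.
    """
    if gentrain_score > GENTRAIN_RETAIN:
        return "retain"
    if gentrain_score >= GENTRAIN_REVIEW:
        return "visual_check"
    return "discard"


if __name__ == "__main__":
    for score in (0.82, 0.45, 0.12):
        print(score, triage_snp(score))
```

The visual check and the manual cluster editing described in the text remain human steps; the sketch only decides which SNPs are routed to them.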
Two trios with both parents and one seedling were used to test the usefulness of SNP clusters for identifying haplotypes: 'Royal Gala' × 'Braeburn' → 'Scifresh' and '(Royal) Gala' × 'Splendour' → 'Sciros'. SNPs were coded using A and B alleles and haplotypes were inferred using FlexQTL™ (www.flexqtl.nl) for each cluster of SNPs.

Results
NGS re-sequencing of apple breeding accessions and SNP detection
A total of 67 Gb of DNA sequence from 898 million 75-base reads were generated for the 27 Malus accessions (Table 1). The total number of reads per accession ranged from six to 60 million, for 'Crimson Crisp' and 'McIntosh', respectively. The difference was due to an increase in the cluster density and resulting sequencing yield of the Illumina GA instrument over the period that sequencing was performed. A total of 10,915,756 SNPs was identified using SoapSNP, of which 2,113,120 (19.5%) passed the filtering criteria for Stage 1 detection (Table 2, Figure 1). The average SNP frequency was one per 288 bp. Of these SNPs, 611,599 (28.9%) were predicted to be located in exonic regions and passed the Stage 2 filter (Table 2, Figure 1). When the SNP calls based on re-sequencing data for the two independently sequenced 'Cox's Orange Pippin' samples were compared, an average of 21.3% of detected SNPs gave different genotypes between the two samples.

SNP validation
In the GoldenGate® validation assay, 148 apple accessions in the test panel gave good quality scores (call rate > 0.8 and 10%GC Score > 0.5) and 12 accessions failed because of poor DNA quality. Seventy-three (50.7%) of SNPs were polymorphic, 46 (31.9%) had failed reactions, and 25 (17.3%) were monomorphic (MAF < 0.05 or A/B frequency < 0.1) (Table 3). All eight GDsnps were polymorphic. Evenly spread SNPs and SNPs at the Ma locus had similar success rates to the overall set (54% and 55% polymorphic SNPs, respectively). SNPs located within candidate genes were generally more successful than the overall set (63% polymorphic). Of the 28 SNPs chosen on the basis of their accession specificity, 14 (50%) had a MAF < 0.05, of which 11 were not monomorphic. The Illumina ADT score calculated for each SNP did not significantly influence the success of the SNPs (Table 3).
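As an illustration of the MAF criterion used above, the following Python sketch computes the minor allele frequency from A/B genotype calls and labels a SNP accordingly. It covers only the MAF < 0.05 rule; the second criterion (A/B frequency < 0.1) is omitted because its exact definition is not given in the text. The function names and the convention of using None for a missing call are assumptions.

```python
MAF_MIN = 0.05  # SNPs below this minor allele frequency count as monomorphic


def minor_allele_frequency(calls):
    """Compute MAF from genotype calls 'AA', 'AB', 'BB'; other values are ignored.

    Returns None when no sample has a usable call.
    """
    called = [c for c in calls if c in ("AA", "AB", "BB")]
    if not called:
        return None
    b_alleles = sum(c.count("B") for c in called)
    freq_b = b_alleles / (2 * len(called))  # diploid: two alleles per sample
    return min(freq_b, 1 - freq_b)


def classify_snp(calls):
    """Label a SNP 'polymorphic', 'monomorphic', or 'no_calls' (MAF rule only)."""
    maf = minor_allele_frequency(calls)
    if maf is None:
        return "no_calls"
    return "monomorphic" if maf < MAF_MIN else "polymorphic"


if __name__ == "__main__":
    panel = ["AA"] * 90 + ["AB"] * 8 + ["BB"] * 2
    print(minor_allele_frequency(panel), classify_snp(panel))
```

A SNP whose minor allele is carried by a single accession in a panel of this size would fall well below the 0.05 threshold, which is consistent with half of the accession-specific SNPs having MAF < 0.05.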

SNP final choice
The third stage of filtering, which involved designating focal points at approximate 1 cM intervals over the apple genome and choosing SNPs in 100 Kb windows around them (Figure 3), was initially based on 712 suitable GDsnp markers. These markers left 107 gaps in the genome greater than 3 cM and these gaps were filled using two SSRs and 128 RosCOS as additional focal points, giving a total of 842 focal points. By this stage, 6074 SNPs had been chosen in addition to the initial 712 GDsnps. Next, a further 5528 SNPs were identified in candidate genes and narrowed down to 1652 SNPs after removing putative redundancies. Filtering of the 8438 SNPs chosen by this stage, using the ADT score, reduced the pool to 7793 SNPs: 693 GDsnps, 6028 SNPs around focal points, and 1072 SNPs within candidate genes (Figure 1). Finally, the 74 validated SNPs from the GoldenGate® assay were included, to provide a grand total of 7867 SNPs (Table S2) in 1355 clusters for construction of the final apple Infinium® II SNP array, officially named the International RosBREED SNP Consortium (IRSC) apple 8K SNP array v1.
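The counts reported in this paragraph can be cross-checked with a few lines of Python arithmetic. All numbers are taken from the text; only the variable names are introduced here.

```python
# Focal points: GDsnps plus the SSRs and RosCOS used to fill gaps
focal_points = 712 + 2 + 128                      # 842

# Pool before ADT-score filtering
gdsnps_initial = 712
focal_point_snps = 6074
candidate_gene_snps = 1652
pool_before_adt = gdsnps_initial + focal_point_snps + candidate_gene_snps  # 8438

# Pool after ADT-score filtering, by source
after_adt = {"GDsnps": 693, "focal point SNPs": 6028, "candidate gene SNPs": 1072}
pool_after_adt = sum(after_adt.values())          # 7793

# Final array content after adding the GoldenGate-validated SNPs
array_total = pool_after_adt + 74                 # 7867
snps_per_cluster = array_total / 1355             # about 5.8

print(focal_points, pool_before_adt, pool_after_adt, array_total,
      round(snps_per_cluster, 1))
```

The ADT filter therefore removed 645 of the 8438 candidate SNPs, or roughly 7.6% of the pool.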

The IRSC apple 8K SNP array v1
The 7867 SNPs were uniformly represented on the 17 apple LGs (Table 4). The number of SNP clusters ranged from 64 to 113 on LGs 6 and 2, respectively, with an average of 5.8 SNPs per cluster. The average physical distance between SNP clusters ranged from one cluster every 316.2 kb to 538.3 kb on LG 16 and 3, respectively. The overall cluster density on the 'Golden Delicious' × 'Scarlet' reference genetic map was one cluster every centiMorgan, with only five gaps between clusters larger than 10 cM.

IRSC apple 8K SNP array v1 evaluation
Evaluation of the IRSC apple Infinium® II 8K array using 1619 individuals, including individual accessions and segregating populations, yielded 7692 successful beadtypes (97.7%) of which 5554 (72.2%) were polymorphic (Table 4). The remaining 2138 SNPs (27.8%) exhibited poor quality genotype clustering or were monomorphic. Numbers of polymorphic SNPs per LG ranged [...] (Figure 4). The average distance between markers was 0.88 cM and 0.91 cM for 'Royal Gala' and 'Granny Smith', respectively, and the largest gap between informative clusters was 24.3 cM for 'Royal Gala' and 25 cM for 'Granny Smith'. However, the marker density could be increased in these large gaps by adding markers heterozygous for both parents. The small region of LG1 presented in Figure 3 spanned 25 polymorphic markers in two trios with both parents and one progeny (Figure 5): 'Royal Gala' × 'Braeburn' → 'Scifresh' and '(Royal) Gala' × 'Splendour' → 'Sciros'. The haplotypes inherited from each parent could be inferred for the eight SNP clusters, with three of the clusters having four haplotypes. One putative recombination event was detected in the '(Royal) Gala' × 'Splendour' → 'Sciros' trio between the second last and last two clusters.

Table 3. Results from a GoldenGate® assay of 144 single nucleotide polymorphisms (SNPs) screened over 160 apple accessions (Table S1).

Discussion
A SNP-based strategy for apple genome scanning
An international community of apple genomics, genetics, and breeding researchers recently united with the goal of developing high-resolution genome scanning capability for germplasm of cultivated apple. The resulting medium- to high-throughput multiplexed SNP assay presented and evaluated here is the culmination of a collective technical effort and shared resources that would have been beyond the means of any single entity. This publicly available genomics resource will provide unprecedented resolution for discovery of marker-locus-trait associations, for description of the genetic architecture of quantitative traits, and for investigation of genetic variation (neutral and functional), and will enable genomic selection in apple. To ensure maximum utility, the apple SNP array involved several key design features.
To ensure that genome scans using the SNP array relate directly to the draft apple genome sequence of [11], priority was given to including on the array previously validated and genetically mapped SNPs from 'Golden Delicious' that had been used to anchor the genetic and physical maps of apple. GDsnps also provided internal controls for validation of SNPs with the GoldenGate® assay. The series of previously validated GDsnps [7,11] as a whole performed better than the new series of SNPs in both the GoldenGate® assay and the apple Infinium® 8K array evaluations (100% and 90.7% success rates, respectively). This result underscores that prior validation is useful when developing a high-throughput SNP array, to minimize the proportion of attempted SNPs that are not informative. Future SNP arrays for apple will be firmly based on the SNPs validated here with the first apple Infinium® array.
Genetic and physical distribution of SNPs across the apple genome was also a strong consideration in array design. The whole genome sequence of apple [11] was a critical resource that greatly increased the likelihood that the final array would cover the majority of the apple genome. Organizing each SNP according to genetic and physical location enabled efficient genome saturation, helped to avoid redundancy, and spanned gaps that would have occurred if a collection of random SNPs had been employed. Attention was given to ends of chromosomes, but not beyond the edges of a genetically defined linkage group. Our genetic positioning of focal points relied on the assembly of the apple genome sequence into linkage groups by [11] that involved anchoring of metacontigs to a genetic map comprising 1643 SNP loci. Errors in this assembly will reduce the effectiveness of the IRSC apple 8K SNP array v1 to span the apple genome genetically. As the number of high quality (MAF > 0.05, p50GC > 0.4, call rate > 0.95) polymorphic markers provided by the array developed here is greater than that used for the genome assembly (3371 vs. 1643), the new set of SNPs from the array will be useful for improvement of the apple genome assembly.
A haplotype-targeting strategy was adopted to maximize information gained using the SNP array and to decrease computation time for mapping software. Clusters of multiple SNPs were spaced at 1 cM intervals (equivalent to an average of 446.2 kb, [11]) rather than evenly spreading independent loci (approximately every 0.17 cM or 75 Kb). As bi-allelic SNPs are often less informative than co-dominant multi-allelic markers such as SSRs, we clustered four to 10 SNP markers within a small, essentially non-recombining, physical distance, i.e., ±50 Kb around focal points. Individual SNPs within clusters were spaced at a minimum of 2 Kb to reduce redundancy of haplotype capture. This strategy will enable combination of the information from individual SNPs into haplotypes that will provide fully informative and easy to handle multi-allelic markers and capture all unique haplotypes within genotyped germplasm. A preliminary analysis using two parent-child trios indicated that our clustering strategy is capable of facilitating the identification of haplotypes within and among SNP clusters. Nevertheless, the 'Golden Delicious' genome assembly contains some erroneous gene position assignments, as reported recently for a set of candidate genes for flowering [30] and RosCOS markers [31]. Therefore, it is advisable to perform independent linkage analysis of polymorphic SNPs to ensure that they locate in the expected cluster, to avoid haplotype construction using SNPs that are not truly physically linked. Use of a consensus map of fully informative markers is more powerful for QTL analysis than partially informative markers, as all alleles of each marker locus can be simultaneously evaluated for their phenotypic effects.
A highly saturated genome of fully informative markers is also highly amenable to integrated QTL analysis over multiple pedigree-connected families of variable size with the FlexQTL™ software (www.flexqtl.nl), which supports the Pedigree-Based Analysis approach [32]. We chose GDsnps as focal points for clusters because of their established genetic resolution of 1 cM. RosCOS loci were used because their genetic locations were known and they provide an efficient orthologous marker system for the assessment of comparative genome synteny across Rosaceae family genera [24]. Finally, candidate genes were included on the array because of their putative enhanced capacity for association of detected genetic variation with horticultural trait variation. Our principal source of SNPs was from whole-genome resequencing of apple accessions for which pedigree connections were considered, in order to represent founder contributions to breeding germplasm efficiently and thereby increase the probability of detecting informative haplotypes segregating in breeding programs. Previously available SNP markers for apple were derived from the 'Golden Delicious' genome sequence [11] and while 'Golden Delicious' is a common founder in pedigrees of cultivated apple, it nevertheless accounts for only a fraction of potentially unique sources of genetic variation among modern apple cultivars [7]. Most accessions in the SNP detection germplasm set were included because they represent the most common founders in the ancestry of cultivated apple. The recent-generation cultivars 'CrimsonCrisp' and Co-op 15 have missing links in their ancestry and were included in the sequencing panel to help to capture informative haplotypes in current breeding programs.
To obtain sufficient sequencing depth for each accession, we chose the NGS technique provided by Illumina GA II instrumentation. Because of its high information output and relative affordability, such high-throughput sequencing by synthesis has revolutionized agricultural and horticultural genomics, enabling sequencing of complete genomes of plant species including several crops of the Rosaceae family: apple, peach, and strawberry. Further to these physical mapping successes, sequencing by synthesis has revealed detailed information on sequence polymorphism within species of interest [9,33,34,35,36].
The sequencing of 'Cox's Orange Pippin' at two locations enabled us to compare SNP detection in independent samples. We believe that the high proportion of SNPs with non-matching genotypes between the 'Cox's Orange Pippin' samples was due to low sequence coverage providing insufficient reads at a locus to support detection of both alleles at heterozygous SNP loci. Two strategies can be used for SNP detection using NGS. Our strategy was to sequence a wide range of cultivated apple accessions at relatively low coverage to capture as many breeding-relevant SNPs as possible. An alternative approach would have been to resequence fewer accessions at higher coverage to enable prediction of SNP genotypes more correctly for each accession. However, this latter strategy would have reduced opportunities to identify rare haplotypes carried by some breeding lineages. The sequences of 27 apple accessions generated in this study are an important resource for apple geneticists, providing a comprehensive inventory of point mutations potentially underlying genetic variation for important horticultural traits beyond their use here in development of the 8K SNP array. The strategy of re-sequencing higher numbers of accessions at low coverage proved successful, as shown by evaluation of both the GoldenGate® and Infinium® arrays. While the in silico genotypes of the 27 accessions based on whole-genome sequencing are not always correct, most of the predicted SNPs are real and were converted into SNP markers in the final Infinium® II array, to provide more than 5500 polymorphic markers in a diverse breeding germplasm set.
To enhance the capacity to identify and characterize genetic control regions for horticultural traits, a common intended use of the array, hundreds of candidate genes were built into the array design. Many of these candidate genes have the advantage of being well studied by researchers [25,37,38,39,40,41,42], thus providing loci of known or readily interpretable genetic variation of immediate interest to genome scan users and a relatively fast path to practical application. Similarly, the choice of exonic SNPs for the array biases the genome scan to the preferred target of the gene space of apple genomes. Although the bias toward exonic SNPs was initially only a guide based on performance in the GoldenGate® array, the sheer number of SNPs available from the detection step allowed the final design to feature only those SNPs putatively located in exons. However, it is unlikely that the SNP array will span causative mutations for all characters, as such mutations are often located in non-coding regulatory regions, such as the mutation demonstrated to control red pigmentation of the apple fruit flesh and foliage, which is located in the promoter region of MYB10 [39].

SNP filtering and validation
A set of 7867 apple SNPs corresponding to 1355 clusters was employed to construct the IRSC apple 8K SNP array v1. These SNPs were chosen from more than 2 million SNPs detected across the apple genome, utilizing 67 Gb of NGS data from 27 apple accessions. The number of SNPs detected was more than 250 times as many as needed for array construction, allowing very fastidious filtering. Firstly, some SNP types were filtered out because of the choice of the technology for the final array, with C/G and A/T transversions being rejected as they require two Infinium® II probes for detection. Further decisions on the filtering criteria used to develop the final SNP array were made on the basis of empirically determined thresholds obtained from a GoldenGate® validation assay of a representative subset of 144 SNPs, in order to optimize the success rate for screening apple breeding germplasm. The success rate in the validation assay varied according to the type of SNPs and their parameters.
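The allele-type filter mentioned above is simple to express in code. The Python sketch below drops A/T and C/G SNPs from a candidate list; the tuple layout (identifier, allele 1, allele 2) and the function names are assumptions for illustration.

```python
# Allele pairs rejected in the text because they require two probes
TWO_PROBE_PAIRS = {frozenset("AT"), frozenset("CG")}


def needs_two_probes(allele1, allele2):
    """Return True for A/T and C/G SNPs, regardless of allele order or case."""
    return frozenset((allele1.upper(), allele2.upper())) in TWO_PROBE_PAIRS


def single_probe_candidates(snps):
    """Keep only SNPs whose allele pair is not A/T or C/G.

    snps: iterable of (snp_id, allele1, allele2) tuples (hypothetical layout).
    """
    return [s for s in snps if not needs_two_probes(s[1], s[2])]


if __name__ == "__main__":
    demo = [("snp1", "A", "G"), ("snp2", "A", "T"),
            ("snp3", "G", "C"), ("snp4", "C", "T")]
    print(single_probe_candidates(demo))
```

Using an unordered `frozenset` for each allele pair means that T/A and G/C are caught as well as A/T and C/G.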
The main reason for choosing exonic SNPs for construction of the Infinium® II array was the observation that randomly distributed SNPs tended to perform more poorly in the GoldenGate® assay than SNPs located within coding regions. This imbalance in performance might be explained by a lower number of nucleotide polymorphisms in exonic regions, increasing stringency of hybridization between the array features and target genomic DNA. The presence of undetected SNPs in the sequences flanking genes has been observed to reduce both the SNP success rate [43,44] and the rate of SNP conversion [12]. This finding represents a major concern for those developing high-throughput SNP screening tools in highly heterozygous genomes, such as the forest tree species genomes [12,43]. Based on recent reports, apple is among the most genetically polymorphic agricultural species analyzed to date. For example, [5] estimated a frequency of 1 SNP per 149 bp based on public apple ESTs, while [7] identified an average of 1 SNP per 455 bp in a single cultivar and 1 SNP per 52 bp across germplasm. This can be compared with figures ranging from 1 SNP every 61 bp in maize [45], 1 SNP/117 bp in grapevine (based on the 'Pinot noir' genome sequence; [46]) or 1 SNP/64-104 bp (based on multi-locus cultivar analysis [46,47]), to 1 SNP/5,700 bp in japonica rice cultivars [48]. To address this very specific issue, SNPs were discarded from consideration during the filtering step, whenever another SNP was detected within 50 bases. Nevertheless, because of the low sequencing coverage for each of the 27 accessions, it is probable that many flanking SNPs have not been detected. MAF influenced the success of SNPs in the GoldenGate® assay, where SNPs with low MAF based on the NGS detection were less likely to be polymorphic than SNPs with higher MAF.
Our strategy of choosing SNPs exhibiting a range of MAFs within clusters evenly spaced across the apple genome will help to capture available haplotypes with high-MAF SNPs (i.e., MAF > 0.4), improving detection of heterozygosity for any assayed germplasm individual, while lower-MAF SNPs (MAF < 0.2) tend to come from specific founders and improve identification of rare haplotypes.
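The flanking-SNP rule described above (discard a SNP whenever another SNP was detected within 50 bases) can be sketched in Python as follows. This is an illustration rather than the authors' implementation: it assumes all positions lie on one linkage group and interprets "within 50 bases" as a distance of 50 bp or less.

```python
def drop_crowded_snps(positions, min_gap=50):
    """Return SNP positions that have no other detected SNP within min_gap bases.

    positions: SNP coordinates (bp) on a single linkage group.
    ASSUMPTION: a neighbour exactly min_gap bases away also disqualifies a SNP.
    """
    ordered = sorted(positions)
    kept = []
    for i, pos in enumerate(ordered):
        # After sorting, only the immediate neighbours need to be checked
        left_ok = i == 0 or pos - ordered[i - 1] > min_gap
        right_ok = i == len(ordered) - 1 or ordered[i + 1] - pos > min_gap
        if left_ok and right_ok:
            kept.append(pos)
    return kept


if __name__ == "__main__":
    print(drop_crowded_snps([100, 130, 300, 351, 500, 550]))
```

Both members of a crowded pair are dropped, since each would have a polymorphism in the other's flanking sequence. As the text notes, such a filter can only act on SNPs that were actually detected, so flanking variants missed at low coverage pass through unnoticed.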

IRSC apple 8K SNP array v1 evaluation
Based on information derived from a consensus apple genetic map of 1343 cM (M. Troggio, unpublished), the IRSC apple 8K SNP array v1 corresponds to an average density of one focal point every 1 cM. Overall, the IRSC apple 8K SNP array v1 was successful, as 5554 markers were polymorphic in the examined germplasm set, with 4368 SNPs having a MAF > 0.05, and 3371 having a high call rate, reliability score, and MAF. While this result indicates that at least 3371 markers are likely to provide a genotype when the array is used for screening any germplasm set, any given genetic mapping experiment can expect to obtain approximately 4000 polymorphic bi-allelic markers or approximately 1000 fully informative haplotypes. The number of SNPs in the array that can successfully provide a genotype might be improved by enhancing DNA quality of the template used for the experiment. In our array evaluation, the set of accessions tested was from a breeding program where the number of individuals was maximized at the expense of the DNA quality.
We have demonstrated that the array is effective for use in genetic mapping, providing 994 focal points in a 'Royal Gala' × 'Granny Smith' segregating population, a sufficient number for the construction of a highly saturated genetic map. This resolution will also be sufficient to implement genomic selection in apple [29]. However, the resolution will not be high enough for association mapping studies using unrelated germplasm, because a recent study indicates that linkage disequilibrium among apple cultivars decays faster than this, based on calculations in an apple germplasm collection [7].

Conclusion
The International RosBREED SNP Consortium apple 8K SNP array v1 has been developed for public use by apple geneticists worldwide. The design and evaluation of the array has indicated that it will be effective for a wide range of germplasm and applications such as high-resolution genetic mapping, QTL detection and characterization, marker-assisted introgression, and genomic selection.

Genomic resources
All SNPs detected, the SNPs chosen for the IRSC apple Infinium® II 8K array, and the GenomeStudio cluster file developed are deposited in the Genome Database for Rosaceae (www.rosaceae.org). SNPs are available in dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/) under accessions ss475875741 to ss475892397.

Supporting Information
Table S1 The 160 apple accessions used for the GoldenGate® SNP validation assay. (DOCX)

Table S2
List of 7867 apple SNPs on the Apple Infinium® II array v1. The NCBI dbSNP accession, location on the 'Golden Delicious' genome assembly [11], and source of SNP are indicated. (XLS)