Identification of Genetic Loci Associated with Quality Traits in Almond via Association Mapping

To design an appropriate association study, we need to understand population structure and the structure of linkage disequilibrium within and among populations as well as in different regions of the genome in an organism. In this study, we have used a total of 98 almond accessions, from five continents located and maintained at the Centro de Investigación y Tecnología Agroalimentaria de Aragón (CITA; Spain), and 40 microsatellite markers. Population structure analysis performed in ‘Structure’ grouped the accessions into two principal groups; the Mediterranean (Western-Europe) and the non-Mediterranean, with K = 3, being the best fit for our data. There was a strong subpopulation structure with linkage disequilibrium decaying with increasing genetic distance resulting in lower levels of linkage disequilibrium between more distant markers. A significant impact of population structure on linkage disequilibrium in the almond cultivar groups was observed. The mean r2 value for all intra-chromosomal loci pairs was 0.040, whereas, the r2 for the inter-chromosomal loci pairs was 0.036. For analysis of association between the markers and phenotypic traits, five models comprising both general linear models and mixed linear models were selected to test the marker trait associations. The mixed linear model (MLM) approach using co-ancestry values from population structure and kinship estimates (K model) as covariates identified a maximum of 16 significant associations for chemical traits and 12 for physical traits. This study reports for the first time the use of association mapping for determining marker-locus trait associations in a world-wide almond germplasm collection. It is likely that association mapping will have the most immediate and largest impact on the tier of crops such as almond with the greatest economic value.

To design an appropriate association study, we need to understand population structure and the structure of linkage disequilibrium within and among populations as well as in different regions of the genome in an organism. In this study, we have used a total of 98 almond accessions, from five continents located and maintained at the Centro de Investigación y Tecnología Agroalimentaria de Aragón (CITA; Spain), and 40 microsatellite markers. Population structure analysis performed in 'Structure' grouped the accessions into two principal groups; the Mediterranean (Western-Europe) and the non-Mediterranean, with K = 3, being the best fit for our data. There was a strong subpopulation structure with linkage disequilibrium decaying with increasing genetic distance resulting in lower levels of linkage disequilibrium between more distant markers. A significant impact of population structure on linkage disequilibrium in the almond cultivar groups was observed. The mean r 2 value for all intrachromosomal loci pairs was 0.040, whereas, the r 2 for the inter-chromosomal loci pairs was 0.036. For analysis of association between the markers and phenotypic traits, five models comprising both general linear models and mixed linear models were selected to test the marker trait associations. The mixed linear model (MLM) approach using co-ancestry values from population structure and kinship estimates (K model) as covariates identified a maximum of 16 significant associations for chemical traits and 12 for physical traits. This study reports for the first time the use of association mapping for determining marker-locus trait associations in a world-wide almond germplasm collection. It is likely that association mapping will have the most immediate and largest impact on the tier of crops such as almond with the greatest economic value.

Introduction
Almond (Prunus amygdalus Batsch) is the most important nut crop worldwide [1]. The kernel is the edible part of the nut and is considered an important food with a high nutritional value. It may be consumed raw or cooked, blanched or un-blanched, combined and/or mixed with other nuts. Kernel quality is a priority objective in any breeding program. The physical traits and chemical components of the kernel contribute to almond quality and both must be considered in breeding programs searching for high-quality almonds [2].
The nutritional value of the almond kernel stems mainly from its lipid content, with a higher level of monounsaturated and polyunsaturated fatty acids (PUFA) than saturated fatty acids. This fat profile is highly recommended in human nutrition. Among the fatty acids, oleic and linoleic acids represent over 90% of the total lipid content [3]. The predominance of unsaturated fatty acids in almond is very important from nutritional and health viewpoints, since these fatty acids do not contribute to the synthesis of cholesterol in the human body [4]. Therefore, almond, along with other nuts, plays an important role in the human diet [2]. Almond oil contains tocopherols, an indispensable nutrient for humans and animals due to their antioxidant and radical scavenging abilities. Moreover, tocopherols are also important for the oxidative stability of vegetable oils. Hence, high tocopherol contents or an altered tocopherol composition in almond kernel oil may be a target in almond breeding for increased quality.
Although quality is often related to the chemical composition of any fruit or nut, some physical parameters must also be taken into account when evaluating quality. The physical traits of the almond nut do not affect the organoleptic characteristics of the kernel, but have a special importance in the industry because of the different steps involved in almond processing [2]. For example, shell hardness plays an important role during harvest and the industrial processes, since the shell is sometimes not well sealed through the suture line in soft-shell cultivars, leaving an entry point for dust, insects and micotoxin-producing fungi, such as Aspergillus producing aflatoxins [5]. The size and the shape of the almond nut must also be taken into account for designing and adjusting appropriate technologies for harvest, dehulling, transporting, classifying, processing and storing the crop. Besides, the size and shape of the kernel may define its utilization in specific commodities, such as chocolate bars, sugared almonds and sliced kernels.
It has been suggested, that most quality traits in almond are quantitatively inherited and therefore under polygenic control [6]. The heritability and genetic control of quality traits in almond been established in our germplasm [1,7]. For example, Font i Forcada et al. [7] reported a heritability (h 2 ) of 0.60 for γ-tocopherol content. Heritability estimates depend on the variation between genetic and the environmental effects [8]. For instance, a trait with a high heritability estimate indicates additive gene action [9] and little environmental influence and high genetic advance. In addition, several QTLs affecting both physical and chemical traits of almond nut have been identified [3,10]. These results suggest a significant and positive effect of the pollen on the transmission of most chemical components of the almond kernel and also provide an understanding of genetic control to facilitate development of new cultivars for the benefit of producers, shippers and consumers in the almond industry.
Almond is the most diverse of all species in the genus Prunus [11]. The resulting high variability in almond [12] may not only be due to its self-incompatibility system and the use of open-pollinated seedlings in traditional almond culture, but also because almond is the most southern Prunus species, and thus subject to more diverse growing conditions than other stone fruits, resulting in an adaptation to more diverse microclimates. The CITA almond germplasm collection based on accessions from all over the world is a fair representation of the genetic diversity available in almond [13]. Taking this diversity into account, this collection is being used as the almond reference collection for the Spanish Plant Genetic Resources Network, the Spanish and the European Community Plant Variety Offices, and the 'Group de Recherches et d'Études Médirerranéen pour l'Amandier' (GREMPA). This collection has been widely studied since its original establishment by Dr. A.J. Felipe in the 1960s. A large amount of published and unpublished data from the collection is available, such as complete descriptors of morphological traits, including tree, shoot, leaf, nut and kernel traits, bloom time, chilling and heat requirements [14], S f allele diversity [15], genetic diversity [12], and kernel composition [15]. This collection was also used to start the CITA almond breeding program, where not only very important agronomical traits, such as late bloom and self-incompatibility, were assessed, but also all aspects of almond quality [1].
Association mapping (AM), also known as linkage disequilibrium (LD) mapping may complement classical linkage mapping for elucidating the genetic basis of complex traits [16]. This approach relies on the strength of association between genetic markers and phenotype. Thus, it detects and locates genes relative to an existing map of genetic markers [17]. While linkage analysis searches for associations within populations developed from bi-parental crosses, AM utilizes unstructured or loosely structured populations (e.g., a collection of varieties or landraces), case-control designs and transmission disequilibrium tests [18]. Consequently, this method detects relationships between phenotypic variation and gene polymorphism in existing germplasm and in unrelated individuals. The effectiveness of this approach is greater when information on population structure and LD is available [19]. Since many important crops have complex population structures that arose from a long domestication and breeding history [20], understanding these structures is important to avoid identifying spurious associations in association mapping. In addition, AM complements and enhances previous QTL (Quantitative trait loci) information by incorporating the effects of recombination occurring in many past generations into a single analysis [21]. Association mapping has been successfully used to identify genes involved in several traits in different plant species (maize, sunflower, lettuce, potato, and wheat), but only a few studies have been carried out in fruit tree crops, such as peach [19,22], sweet cherry [23] and apple [24]. Consequently, AM in tree species requires both genotyping and phenotyping of large populations with unique architectures.
The purpose of our study was to phenotype the worldwide collection of almonds available at CITA with the goal of carrying out an association study to link SSR genotypes to the chemical and physical nut and kernel traits. We believe this will be one avenue to provide molecular information for exploring QTLs of important chemical and physical traits in almond. To our knowledge, this is the first report of an association study in almond.

Phenotypic diversity
The variation observed in the chemical and physical traits was high, thus confirming the diversity in the germplasm and the representativeness of the gene pool (Table 1). Most traits displayed a normal distribution. The protein content ranged between 10 and 30% (± 4.58) of the kernel dry weight (DM) while the oil content ranged between 51 and 67% (x 3.18) of the kernel DM. The range for the main fatty acids in comparison to the total oil content was 63-80% (±3.66) for oleic acid and 11-27% (x3.20) for linoleic acid. The ranges tocopherol homologues were 331-585 mg kg -1 (±5.38) for α-tocopherol, 0.11-3.14 mg kg -1 (± 0.06) oil for δ-tocopherol, and 5-57 mg kg -1 (±2.69) for γ-tocopherol (Table 1). Physical traits also showed a wide range in variability (Table 1), including nut weight; 1.6 to 9.9 g (±1.43), kernel weight; 0.6 to 3.5 g (± 0.46), nut thickness; 8.9 to 19.9 mm (±1.62) and kernel thickness; 6 to 12 mm (±0.85). These results highlight the importance of the genetic background of each accession in the phenotypic profile of the nuts and kernels.

Allelic variation at SSR loci
All 40 loci analyzed showed polymorphism, producing well-defined and reproducible bands in all accessions. A total of 557 alleles were amplified from all loci scored in all 98 accessions, with an average of 13.9 alleles per locus, and a range from 2 alleles in EPDCU5100 and CPSCT018 to 23 alleles in BPPCT053 ( Table 2). All primers produced a maximum of two bands per genotype in accordance with the diploid nature of almond. Genotypes showing a single band were considered homozygous at that locus. The observed heterozygosity ranged between 0.24 (BPPCT030) and 0.94 (CPPCT040), with an average of 0.66 across all 40 SSRs. Expected and observed heterozygosity values were compared with the fixation index (F) which was on the average, 0.11, ranging between -0.14 (EPDCU5100) and 0.52 (BPPCT010). The high F values observed corresponding to high homozygosity (particularly in individuals with only one band) suggests the presence of null alleles [25]. The F value was positive in 33 and negative in 7 SSR loci.

Cluster analysis and grouping of accessions
Cluster analysis based on the Neighbor-joining method detected three major clusters and subclusters (Fig 1). In some clusters, the groupings concurred with the geographical origin of the accessions. The first and largest cluster (blue) contained only the Spanish accessions (57). This group comprises cultivars from the north-eastern part of Spain including Guara, Bertina, Muel, Desmayo Largueta or Castilla as well as cultivars from the south and south-eastern parts such as Marcona, Ramillete, Coop. Mañán, Atocha, Garrigues, Del Cid or Malagueña. The new releases from different Spanish breeding programs such as Mardía, Soleta, Masbovera, Tarragonès, Marta and Belona were also grouped in this cluster. In addition, this cluster contains a

Population structure and linkage disequilibrium
Population structure in the accessions was assessed using the STRUCTURE software package. Bar plots were obtained with different values of K, from two to ten (Fig 2). We also performed twenty independent runs per K value with both the MCMC replications and analysis of the rate of change in the log probability of the data (ΔK) which displayed a maximum value at K = 3 [26]. The level of partitioning corresponds to a very strong differentiation into two major groups (Fig 3). The first group contained accessions from Mediterranean countries (Western-Europe), mostly from Spain, but also from Italy, France and Portugal, etc., while the second group contained accessions from non-Mediterranean countries (including America, Australia and Eastern-Europe). The proportion of genotypes assigned to each population was not symmetric, and many accessions were assigned to one population or the other, indicating that population structure exists [27].
Linkage disequilibrium (LD) was estimated after removing low frequency alleles (MAF 0.05) from all loci in the 98 accessions (Table 3 and Table 3. LD based on r 2 , averaged for map distance classes and germplasm groups based on population structure analysis in the STRUCTURE. 0.091 intra-chromosomal and 0.082 for inter-chromosomal loci pairs) and in non-Mediterranean accessions (r 2 values of 0.073 for intra-chromosomal and 0.062 for inter-chromosomal marker loci pairs). The overall level of LD detected was low, which could be mostly likely due to poor marker coverage.

Associations between traits and SSR markers
Association analysis was performed for nut and kernel traits in the 98 almond accessions. Marker-trait associations were obtained for chemical traits (Table 4 and S1 Table) and physical traits (Table 5 and S1 Table). We tested five models in TASSEL to determine associations and to also account for the influence of population structure by comparing their ability to reduce the inflation of false positive associations. The P-values were plotted in a cumulative fashion for each model and the distribution examined. According to Stich et al. [28] the distribution of P-values ideally should follow a uniform distribution with less deviation from the expected Pvalues. The association analysis using the GLM approach (being the naïve model), Q-model and Pmodel, detected a large number of associations between the markers and phenotypes which after Bonferroni correction for multiple testing of accessions, reduced down the number of Table 4. Statistical significance of the p values and associations observed between markers and chemical traits in 98 almond cultivars.

SSR
LG  Table 5. Statistical significance of the p values and associations observed between markers and nut and kernel physical traits of almond.

SSR
LG associations. The Q-model (GLM with Q-matrix as correction for population structure) showed 50 associations for chemical traits and 32 for physical traits. It appears that these models may not have accounted for the heterogeneity of the genetic background, which may have resulted in false positive associations. The K-model (MLM with K-matrix as correction for population structure) and QK-model (MLM with Q-matrix and K-matrix as correction for population structure) showed good fit for the P-values (P < 0.001), while the other models were characterized by the excess of small Pvalues (abundance of spurious associations) (Fig 5). These two later models showed high uniform distribution of P-values.
Taking into account the performance of the different models, only results from the K-model will be presented and discussed here since this appeared to have controlled population structure and kinship relationships better. Sixteen significant associations were detected between four SSRs and nine chemical traits ( Table 4). The BPPCT011 locus was associated with α-T, δ-T, γ -T and stearic acid, the CPDCT025 with δ-T, γ -T, oleic, linoleic, stearic and palmitic acids and protein content, UDP96-003 with γ -T, oleic and linoleic acids, and oil content, and finally, EPDCT3392 with protein content. For physical traits, 12 marker-trait associations were identifiedsuch as BPPCT11 with nut T/L and kernel lengh, EPDCT0532 with nut weight, thickness and T/L, and kernel size, CPSCT006 with nut length and T/L, and kernel length, and CPPCT021 with nut lenght and thickness, and kernel length. The association analysis performed using the mixed linear model (MLM) showed results consistent with those identified by the GLM approach. The percentage of phenotypic variation explained by these markers ranged between 92.9 and 66.4%, with CPDCT025 having the maximum value and EPDCT3392 the minimum value. The P-values showing the level of significance of the associations between SSR markers and phenotypic traits are shown in Fig 5. The percentage of phenotypic variation explained by each marker ranged between 99.5 and 24.6%, with UDP96-003 having the maximum value and PMS2 the minimum value. Please note that some associations were observed in the same regions where QTLs had previously been identified (Tables 4 and 5).

Genetic diversity
Over the last decade, SSR markers have proven to be a marker of choice in many breeding programs due to their abundance, co-dominance and feasibility. These markers are still being used today for linkage mapping and genetic diversity analysis in many crop plants, such as almond [12,31]. In the present work, we used a set of 40 genomic SSR markers spanning the almond genome to evaluate population structure and to identify the genetic mechanisms underlying inheritance of chemical and physical almond seed traits. Our plant material consisted of 98 almond accessions from different countries maintained at the CITA almond collection.
A total of 557 alleles were obtained from 40 SSR loci, with a mean of 13.9 alleles. This average allele number, when compared with other almond studies, is lower than 18.7 obtained by Fernandez i Marti et al. [12], but very similar to the 14.6 reported by Elhamzaoui et al. [31], and much higher than 4.7 obtained by Martínez-Gómez [32]. The reason for this discrepancy in the mean number of alleles (18.66 vs 13.9) could be because wild genotypes were used in Fernandez i Marti et al. [12], which resulted in more alleles some of which are novel, whereas mainly domesticated germplasm were used in this study. Also, the F values were positive in 33 loci but negative in 7 loci in this study, indicating a high level of heterozygosity as would be expected in a self-incompatible species such as almond. The average heterozygosity of 0.66 observed in this study was very similar to 0.62 obtained by Fernandez i Marti et al. [12], but slightly higher than 0.59 obtained by Elhamzaoui et al. [31] and 0.59 obtained by Martínez-Gómez [32]. These results suggest that our plant material showed enough variability and diversity suitable for association mapping.

Population structure and LD
The complex breeding history of many important crops and the limited gene flow in most wild plants have created complex stratification within the germplasm [33]. In many fruit species, domestication occurred relatively late, so the bottlenecks may have been relatively recent and of short duration [34]. Hence, despite the diversity observed in almond, genetic bottlenecks may have occurred during almond dissemination.
The presence of population stratification and unequal distribution of alleles could result in non-functional, spurious associations [35]. In order to understand the distribution of genetic diversity in the almond cultivars, we used a model-based clustering approach as implemented in 'STRUCTURE" to infer population structure. The results showed two separate pools of cultivars, the Mediterranean and non-Mediterranean, which are in line with those obtained by Fernandez i Marti et al. [12], where only 17 SSRs were used. In addition, the cultivars from Europe grouped separately from the Asian, American and Australian cultivars in Fernandez i Marti et al. [12]. We also performed principal component analysis (PCA) in TASSEL and obtained similar results (Fig 6). The first ten principal components explained a cumulative variation of 64.26%.
Linkage disequilibrium (LD) was estimated after eliminating low frequency alleles (MAF 0.05) from all loci in the 98 accessions (Table 3). A LD up to 20 cM was observed which dissipated at farther distances. The range of LD in the region from 0 to 10 cM was 0.061 and 0.058 in the Mediterranean vs non-Mediterranean groups, respectively, increasing to 0.087 and 0.079 in the region from 10 to 20 cM, and finally decreasing to 0.045 and 0.039 between 20 and 30 cM. A high level of LD up to 20 cM was also observed in the Mediterranean and non-Mediterranean accessions. The r 2 values for all intra-chromosomal and inter-chromosomal loci pairs were similar (0.040 vs 0.036). Unlinked marker pairs showed a similar percentage of significant LD in the Mediterranean (r 2 values of 0.091 for intra-chromosomal and 0.082 for inter-chromosomal loci pairs) and in the non-Mediterranean groups (r 2 values of 0.073 for intra-chromosomal and 0.062 for inter-chromosomal marker loci pairs). The overall level of LD detected was low, which could probably be attributed to poor marker coverage.
For cultivated species, it is generally observed that domestication and breeding have a strong impact on the level of genetic diversity in populations. In addition, domestication and breeding are expected to influence the level of LD within a population. Not surprisingly, we detected a rapid decline of LD in our almond cultivars. Nevertheless, the decline of LD in almond seems to be relatively comparable to previous studies based on sequence data in self-incompatible crop species. For example, LD decays to negligible levels within 150-750 bp in wild tomatoes [36] and within 0.2-1.5 kbp in maize [37]. In addition, the extent of LD we observed in the self-incompatible almond species is comparable to the extent of LD recently published in other almond studies (r 2 = 0.04 for intra-chromosomal regions and r 2 = 0.03 for inter-chromosomal regions) [12], or in the self-incompatible sweet cherry (r 2 = 0.028) [38], and in self-compatible peach (r 2 = 0.041 for intra-chromosomal regions and r 2 = 0.028 for inter-chromosomal regions) [22]. On the whole, direct comparisons of the extent of LD among studies should be treated with caution due to use of different LD measurement parameters and/or molecular marker systems in different studies.
Although LD in general decays more rapidly in an outcrossing species than in a selfing species, since recombination may be less effective in a selfing species [39], the level of LD decay observed in our study is comparable to the decay found in self-compatible Prunus species such as peach [19,22]. Population structure influences the magnitude and pattern of LD in almond, as in other species, such as Arabidopsis thaliana [40], maize [37], and barley [41]. The population structure observed among the different almond groups in this study could be responsible for the level of LD detected. A significant effect of domestication on the LD extent has been demonstrated in several crop plants due to bottlenecks and genetic drift [42]. Use of small sample sizes is expected to bias LD estimations [43]. The small sample size coupled with the low numbers of markers in our study could also explain why we observed similar extents of LD to the self-compatible peach [22].
In the presence of LD, it will be possible to identify genetic regions (if LD extends to a distance of several centi Morgans) or genes (if LD decays quickly, in a few thousand base pairs) associated with a particular trait of interest by genome-wide associations or by individual SNPs (Single Nucleotide Polymorphisms) or SNP haplotypes within a candidate gene [44], respectively.

Association Analysis
Association mapping is increasingly being utilized to detect marker-QTL linkage associations using plant materials routinely developed in breeding programs. Compared to conventional QTL mapping approaches, association mapping could be a more practical approach for cultivar development, considering that markers linked to major QTLs can immediately be utilized in marker assisted selection (MAS), once new QTLs are identified. In general, association mapping may more suited to organisms with little or no pedigree information, populations with rich allelic diversity, moderate to high nucleotide diversity, and for traits with little or no selection history and also controlled by many loci with small effects, in addition to having older alleles with low frequencies [45,46,47]. These situations appear applicable to vegetatively propagated fruit trees, particularly almond, with a high level of heterosis and a short breeding history [1].
Limited information exists in Prunus species on QTLs linked to nut and kernel traits. In sweet cherry, [48] identified three QTLs linked to fruit size on LG2 and LG6 using SSR markers, while in peach, [49] identified one QTL linked to fruit weight on LG1 using RFLP, AFLP, RAPD, and SSR markers. [30], in their first complete genetic framework map for physical traits of almond identified for the first time, the association of physical parameters of almond nut and kernel, with 14 putative QTLs. More studies are needed in this area to facilitate QTL colocalization and/or synteny analysis among the Prunus species, with a view to undergoing candidate gene identification and fine mapping.
In our study, microsatellite markers were located in regions where QTLs had previously been identified using conventional QTL mapping approaches [3,30]. This suggests that association mapping using plant materials routinely developed by breeders can effectively detect major QTLs. The SSR loci associated with chemical and physical traits in our association analysis included five DNA markers (BPPCT011, UDP96-003, EPDCT3392, CPSCT006 and CPPCT021) that were previously identified for linkage mapping and QTL identification [3,30].
Regarding chemical traits, a total of 16 associations were found between four DNA markers and nine chemical traits phenotyped. BPPCT011 located on LG1 was associated with α-tocopherol, δtocopherol, γ-tocopherol and stearic acid. The map positions for these markertrait associations are in agreement with a previous study [3], which reported a QTL mapped close to BPPCT011, and also associated with three tocopherol compounds and stearic acid. Additional associations were found between γ-tocopherol and UDP96-003 (LG4). This association is also in agreement with the QTLs identified for the same traits by Font i Forcada et al. [3]. Similarly, protein content showed association with the locus EPDCT3392 located on LG 7, which was in agreement with the QTL found at the same position as Font i Forcada et al. [3]. In addition, this study identified other loci associated with δtocopherol, γ-tocopherol, oleic, linoleic, stearic, palmitic fatty acids and protein content on LG3 (CPDCT025); and oleic, linoleic and oil content on LG4 (UDP96-003) that were not reported by Font i Forcada et al. [3]. The discrepancies in marker-locus-trait associations between the two studies could be attributed to number of marker loci used. In addition, this study used a large outbred population which invariably has more meiotic recombination events and a higher number of alleles segregating unlike [3], who used a bi-parental mapping population where only two alleles are expected to segregate at any locus.
Font i Forcada et al. [3] proposed candidate genes including Acyl-CoA (controlling the synthesis of long-chain saturated fatty acids), Enoyl-CoA hydratase (controlling the two main fatty acids such as oleic and linoleic) or ACP (controlling stearic acid) for the fine mapping of almond quality components. In the present work, the association found between αtocopherol, δtocopherol, γ-tocopherol and stearic acid and the marker BPPCT011 on LG 1 appears to map within the interval of the Acyl-CoA gene (ppa002255m). This gene is very close to CPPCT042, located also on LG1 [3]. Additionally, three more candidate genes located on LG 3, LG 4 and LG 7 have been identified to control protein ligase and fatty acid content in plant cells (ppa002006m, ppb019158 and ppa024762), and appear to be within the interval flanked by the SSRs CPDCT025, UDP96-003 and EPDCT3392, which are associated in our study with tocopherols, fatty acids, oil and protein content. Since this is the first association mapping in almond, we suggest further studies including QTL validation and fine mapping to identify the causative polymorphisms involved in fatty acid variation.
On the other hand, a total of 12 associations related to physical nut and kernel traits were identified in this study. One of the associations was between the marker CPSCT006 and nut length and T/L ratio, and kernel length, which is corroborated by the QTLs identified on LG 5 (CPSCT006) by Fernández i Martí et al. [30]. The other association was between the marker BPPCT011 (LG 1) and the nut T/L ratio and kernel length. This association maps close to the QTL (CPPCT042) for the same trait found on LG1 by Fernández i Martí et al. [30]. Locus CPPCT021 (LG 6), associated with nut length and thickness and kernel length in this study, was also found in Fernández i Martí et al. [30]. This association also agrees with the QTL found in the same position by Fernández i Martí et al. [30]. Other important associations observed in this study were between the marker EPDCT0532 (LG 3) and nut width, thickness and T/L and kernel size, and this is in agreement Compared to peach and apple, almond has a poorly developed genomic infrastructure. In order to accelerate association studies in this crop, a large-scale SNP discovery may be required. Recent advances in genomic tools, including genome sequencing and high-density SNP genotyping [50], have enabled the development of new approaches for mapping complex traits. The identification of causal genes underlying specific traits is a major goal in plant breeding, subsequently offering opportunities to develop genomic selection tools. Next generation sequencing (NGS) is playing an increasingly significant role in facilitating SNP discovery in less-characterized fruit tree species. Moreover, due to the abundance of SNPs within a genome, and the availability of high-throughput sequencing methods, SNPs are increasingly becoming one of the commonly used markers for genotyping horticultural species.
Apple [51] and peach [52] genomes, two of the most important fruit tree species, have been sequenced in the last few years, thus facilitating the rapid design of high-throughput SNP genotyping assays such as the 1536 SNP GoldenGate for apple [24] and the 9000 SNP Infinium chip for apple and peach [53,54] and the 6000 SNP Infinium chip for cherry [55]. These chips are supposed to provide excellent opportunities for candidate gene-based association mapping. However, in almond, the application of NGS technologies and bioinformatic tools to generate high frequency single nucleotide polymorphisms still remains unexplored. Hence, the use of high-density genetic linkage maps based on SNP markers in genome mapping and phenotypic selection is still very limited in almond germplasm and breeding populations. Thus expressed sequence tags (EST) and microsatellite (SSR) are the more common sources of molecular markers used for association mapping [24].
It is worthy to note that an international consortium has recently been created in 2014 to sequence the whole genome of the almond cultivar 'Texas' for intended release at the end of 2015 (Almond International Consortium). This advancement in genome sequencing will offer important new possibilities for SNP discovery and genome wide association studies (GWAS) in this nut crop. The cost of growing and maintaining tree crops until they reach maturity is very high, thus any effort to carry out early selection will be highly desirable to reduce orchard costs at the seedling stage.
In the few studies on association mapping in other Prunus species, such as peach [19,22] and sweet cherry [23], the authors used mostly SSR markers. Marker numbers ranged from 15 in Ganopoulos et al. [23] to 40 in Font i Forcada et al. [22] and then 53 in Cao et al. [19]. Based on our results we believe that the 40 SSR markers we used may appear adequate for association mapping in almond. Of a total of 28 (16 for chemical and 12 for physical traits) significant associations, more than 10 have been confirmed previously through linkage analysis and QTL mapping. These associations should be followed up using MAS strategies to improve efficiency of selection in breeding programs. The remaining marker-locus-trait associations identified using new markers should be validated in each specific breeding program before deployment for MAS. This study reports for the first time an association study in almond for nut and kernel traits. We suggest that association mapping is a very powerful and complementary approach to conventional mapping approaches for identifying marker-locus-trait associations.

Conclusions
Our results provide new details about population structure and linkage disequilibrium in a self-pollinating crop like almond. The low level of LD observed in this study is consistent with the broad nature of our germplasm and the breeding history of almond. These results provide an insight into the genetic architecture of important fruit quality traits for almond (tocopherol content, fatty acids, oil and protein content). A total of 28 significant associations between SSR markers and chemical and physical nut and kernel traits were detected in this study. Among them, six chemical and eight physical associations were the same as in previous studies based on conventional QTL mapping. These results will allow for improved efficiency in breeding for quality in almond, and in particular, the simultaneous selection for physical and chemical traits. This study also identified many new marker-locus trait associations, which may be useful for predictions of genetic progress in a breeding program. These associations would provide an efficient platform for classical and MAS breeding in almond.

Plant material and phenotypic evaluation
A germplasm collection of 98 almond (Prunus amygdalus Batsch) cultivars encompassing a wide range of geographic origins was used (S2 Table). This set included 56 native local Spanish cultivars and 42 cultivars mostly from the USA, France, Greece, Italy, and Portugal but also from Algeria, Argentina, Australia, Bulgaria, Tunisia and Ukraine. All these genotypes are located and maintained at the public Almond Germplasm Bank of the Centro de Investigación y Tecnología Agroalimentaria de Aragón (CITA) located at Zaragoza (Spain). The trees were grown under continental Mediterranean conditions within the Ebro Valley, NE Spain [13].
Phenotypic evaluation of physical nut and kernel traits were performed over a period of three years based on approximately 50 mature fruits randomly collected from each tree, as described in Font i Forcada et al. [7]. The fruit was considered mature when the mesocarp was fully dry and split along the fruit suture while the peduncle was near to complete abscission [56]. After discarding the mesocarp, the nuts were left at room temperature for 2-3 weeks. Following nut measurements, shells were cracked to expose the kernel, the width (W), thickness (T), length (L), and weight of which were measured with a digital calliper with a precision of 0.01 mm. These variables allowed the estimation of the T/L ratio and the size dimensions (L x W x T) for the kernels.
For chemical traits, the oil was extracted from 3 g of ground almond kernels using a Soxtec Avanti 2055 fat extractor (Foss Tecator, Höganäs, Sweden). Later, butylated hydroxytoluene methanolic solution was added as an antioxidant agent. The oil extraction was duplicated using 30 fruits from each genotype. These analyses were carried out during three years. The oil sample was used first to prepare the methyl esters of the corresponding fatty acids (FAMEs; oleic, linoleic, palmitic, stearic and palmitoleic). The FAMEs were separated using a gas chromatograph HP 6890 and detected using a flame ionization detector, equipped with a capillary column (HP-Innowax 30 m x 0.25 mm i.d.) and 0.25 μm film thickness (Agilent Technologies, Waldbronn, Germany). Tocopherol content (α, δ and γ) was determined by high performance liquid chromatography (HPLC) according to a modification of the method described by López-Ortiz et al. [57]. Finally, the protein content was determined from the total N content obtained by the Dumas method [58] and applying a conversion factor as shown: % Protein = Kc Ã % total Nitrogen (Kc = 6.25). A sample of 0.2 g of almond flour was weighed and introduced into the analyzer LECO FP-528 Protein/Nitrogen Analyzer (LECO Corporation, Saint Joseph, MI, USA).

Genotyping
Genomic DNA was isolated using the PowerPlant DNA isolation Kit (MO BIO Laboratories, CA, USA). The DNA was quantified and diluted to 10 ng uL -1 for PCR amplifications. For association mapping and genetic diversity analysis, 40 sets of fluorescently labeled SSR primers (PET, NED, VIC and 6-FAM) were subsequently used to genotype the 98 accessions as they showed high polymorphism at a single locus. These primers previously developed in peach, plum, sweet cherry and almond (S1 Table) spans the eight Prunus linkage groups. Polymerase chain reactions were performed in a 20 uL volume containing 1× PCR buffer, 1.5 mM MgCl 2 , 0.2 mM dNTPs, 0.2 μM of each primer, one unit of Taq DNA Polymerase (Invitrogen, Madrid, Spain) and 20 ng of genomic DNA. The thermocycling programme included denaturation for 1 min at 94°C, followed by 35 cycles of 15 s at 94°C, 15 s for the annealing temperatures indicated in S1 Table for the different primers used, and 1 min at 72°C, and a final extension of 2 min at 72°C. The PCR reactions were carried out in a 96-well block thermal cycler (Applied Biosystems, Madrid, Spain). Each reaction was repeated and analyzed twice to ensure reproducibility and the size standard used in the sequencer was Gene Scan 500 Liz (Applied Biosystems). Sequencing was done using an AB3130xl DNA analyzer and fragment analysis was done using the GeneMapper software v4.1 (Applied Biosystems). S1 Table shows the details of the SSR primer pairs.

Genetic diversity analysis
The data obtained with the 40 SSRs allowed the calculation of several genetic parameters including the number of alleles per locus (A), the number of effective alleles [A e = 1/(1-H e )], the observed heterozygosity (H o = number of heterozygous individuals/number of individuals scored), the expected heterozygosity (H e = 1-∑ρ i 2 , where ρ i is the frequency of the i th allele), and the Wright's fixation index (F = 1-H o /H e ) for comparing both heterozygosities. All parameters were estimated using the PopGene 1.31 software. Phylogenies were estimated using the Neighbor-joining method [59]. Neighbor joining analyses were conducted with the PAUP Ã v.4.0b10 phylogenetics package [60] to generate an un-rooted phylogenetic tree (Fig 1).
To determine population structure and/or identify the number of subgroups in the almond germplasm, different methodologies and software packages were employed and compared. For quantitative assessment of the number of groups in the panel, a Bayesian clustering analysis was performed using a model-based approach implemented in the software package STRUC-TURE v 2.3 [27]. It was run under the assumption of admixture with independent allele frequencies. Analyses were performed for the number of subgroups ΔK (where K specifies the number of subpopulations or clusters), based on the rate of change in the log probability of the data [26] ranging from one to ten. We performed twenty independent runs per K value starting with 10,000 burn-in period and 100,000 MCMC replications. A burn-in of 20,000 and 250,000 Markov Chain Monte Carlo replications seemed to be the best fit for the data at K = 3. A model-based clustering algorithm was applied to identify subgroups with distinctive allele frequencies. In the second approach, principal component analysis (PCA) based on the dissimilarity matrix was performed using TASSEL.
Low frequency alleles (considering MAF 0.05) were removed before estimation of linkage disequilibrium (LD) between pairs of multi-allelic loci for the same or different linkage group (LG), using the r 2 coefficient (TASSEL 3.0; [46]). The statistical r 2 gives an indication of both recombination and mutation [15]. The significance level of LD between loci was examined using a permutation test implemented in TASSEL for multiallelic loci, using the "rapid permutation" option.

Association analysis
The hypothesis for associating each molecular marker with the trait of interest was tested using the software TASSEL. Different statistical models were used to calculate P-values, along with accounting for population structure to avoid spurious associations. Five models comprising both general linear models (GLM) and mixed linear models (MLM) were selected to test for marker-trait-associations. Results were compared to determine the best model. The first ten significant PCs explained 64.26% of the cumulative variance of all markers. The following models were tested: i) Naïve model: GLM without any correction for population structure; ii) Q-model: GLM with Q-matrix as correction for population structure; iii) P-model: GLM with PCs as correction for population structure; iv) QK-model: MLM with Q-matrix and K-matrix as correction for population structure and kinship relationships; v) K-model: MLM with Kmatrix as correction for kinship relationships structure [27,28]. The mean value of the markers at P < 0.005 was used for determining the significance of marker-trait associations. We focused the association mapping on all LGs on the Prunus reference map. Significant markers were identified using the Bonferroni procedure at the p<0.00125 experimental-wide threshold. Alleles with a frequency (MAF) lower than 5% were removed [61] before data analysis. The critical P-values for assessing the significance of marker-trait-associations were calculated based on a false discovery rate (FDR), which was found to be highly stringent. Considering the stringency of the model used for accounting for population structure and cryptic variation due to kinship relationships, most of the false positives were inherently controlled. Among the five models, the best model was selected based on the smallest mean square difference (MSD) between the observed and expected P-values, since the random marker P-values follow a uniform distribution [62]. To detect significant markers, the phenotypic variation (R 2 ) was calculated using a simple regression equation implemented in GLM procedure in TASSEL. Association study was determined using 40 SSRs [63][64][65][66][67][68][69][70][71][72].
Supporting Information S1

Author Contributions
Conceived and designed the experiments: CFF AFM. Performed the experiments: CFF ME AFM. Analyzed the data: CFF NO SRCW AFM. Contributed reagents/materials/analysis tools: CFF ME RSC AFM. Wrote the paper: CFF NO RSC AFM.