Construction of Core Collections Suitable for Association Mapping to Optimize Use of Mediterranean Olive (Olea europaea L.) Genetic Resources

Phenotypic characterisation of germplasm collections is a decisive step towards association mapping analyses, but it is particularly expensive and tedious for woody perennial plant species. Characterisation could be more efficient if focused on a reasonably sized subset of accessions, or so-called core collection (CC), reflecting the geographic origin and variability of the germplasm. The questions that arise concern the sample size to use and the genetic parameters that should be optimized in a core collection to make it suitable for association mapping. Here we investigated these questions in olive (Olea europaea L.), a perennial fruit species. By testing different sampling methods and sizes in a worldwide olive germplasm bank (OWGB Marrakech, Morocco) containing 502 unique genotypes characterized by nuclear and plastid loci, we developed a two-step sampling method. The Shannon-Weaver diversity index was found to be the best criterion to be maximized in the first step using the Core Hunter program. A primary core collection of 50 entries (CC50) was defined that captured more than 80% of the diversity. The latter was subsequently used as a kernel with the Mstrat program to capture the remaining diversity. Two hundred core collections of 94 entries (CC94) were thus built for flexibility in the choice of varieties to be studied. Most entries of both core collections (CC50 and CC94) were found to be unrelated, as shown by low kinship coefficients, whereas a genetic structure spanning the eastern and western/central Mediterranean regions was noted. Linkage disequilibrium was observed in CC94, which was mainly explained by a genetic structure effect as noted for OWGB Marrakech. Since they reflect the geographic origin and diversity of olive germplasm and are of reasonable size, both core collections will be of major interest to develop long-term association studies and thus enhance genomic selection in olive species.


Introduction
Recent advances in genomic tools, including genome sequencing [1] and high-density single nucleotide polymorphism (SNP) genotyping [2], and in statistical methods have enabled the development of new approaches for mapping complex traits. The identification of causal genes underlying specific traits is a major goal in plant breeding, subsequently offering opportunities to develop genomic selection tools [3][4]. Association mapping (also known as linkage disequilibrium (LD)-based association mapping) [5] has been proposed to associate single DNA sequence changes with traits of interest using collections of unrelated individuals, as an alternative or complement to quantitative trait locus (QTL)-mapping (also known as family-based linkage mapping) [6]. Association mapping has been widely documented and successfully used to identify the genetic basis of many complex diseases in humans [7], and is now emerging in plants [8][9]. It has the advantage of being rapid and cost effective as many alleles may be assessed simultaneously, resulting in higher resolution mapping by exploiting most of the recombination events that have occurred over time, while avoiding the need to expensively and tediously develop crossing populations, particularly for perennial and forest tree species [10]. The number of markers needed to map specific associations depends on the extent and distribution of LD within the species and among linkage groups [5]. Many studies have thus proposed an estimate of LD in different plant species as a preliminary step for association analysis [11][12][13][14]. Association mapping results obtained in a number of annual species, e.g. Arabidopsis thaliana [15][16], Oryza sativa [17][18], Triticum aestivum [19] and Zea mays [20][21], indicate that the approach is promising to identify markers correlated with desirable traits such as flowering time [15][16][20], seed morphology [19,22] and disease resistance [15][23][24].
However, for woody and perennial species, studies have been performed on a limited number of species, such as Pinus taeda L. [25], Eucalyptus spp. [26] and Prunus persica [27].
Beyond the importance of ex situ conservation of genetic resources to avoid genetic erosion and provide plant breeders with easy access to study ranges of variation in phenotypic traits, germplasm collections could serve as a reservoir of outstanding genes to enhance agronomic traits so as to meet the needs of diverse agricultural systems. However, field evaluation and use of large germplasm collections for association mapping purposes are mostly constrained by problems of accession redundancy, economic cost and time, especially for clonally propagated perennial species where clones have to be maintained and evaluated for several years at different sites. Genetic resource assessments could thus be more rational if focused on a subset of accessions, or so-called core collection (CC; also known as core subset), which includes as much of the variability present in the whole collection as possible with minimal size [28]. Determining the best sample size to use and the genetic criteria to be optimized for association mapping in one core collection is an open issue requiring further investigation, especially for perennial species. Over the last decade, several core subsets have been proposed for both annual species, e.g. Arabidopsis thaliana [29], Oryza sativa [30], Triticum aestivum [31] and Zea mays [32], and perennial species, e.g. Annona cherimola [33], Malus domestica [34], Prunus armeniaca [35] and Vitis vinifera [36], using different ecogeographical, agro-morphological, biochemical or molecular data.
Despite the many approaches used to design core collections that optimize the genetic distance between accessions and/or the allelic diversity [37][38][39][40][41][42][43][44], most core collections have been constructed based on the so-called maximizing method (M-method) [37] through the MSTRAT program [40] by optimizing the number of alleles/trait classes for germplasm conservation purposes, whereas core sizes depend on the number of accessions and the diversity available in the base collections. Sample sizes of 5-20% of the whole collection, encompassing at least 70% of observed alleles, were considered optimal in many studies [45][46].
Olive, which is one of the most important fruit crops in the Mediterranean area [47], is cultivated in more than 24 countries, and more than 1200 olive varieties have been reported [48][49] and conserved in many germplasm collections around the world [50], including two worldwide olive germplasm banks (OWGB) in Cordoba (Spain) [51] and Marrakech (Morocco) [52]. The available diversity has been evaluated using morphological descriptors and diverse molecular markers (AFLP, SSR, SNPs, DArT) [53][54][55][56][57][58]. However, only a few cross-breeding programs make use of olive germplasm for QTL mapping [59] as many constraints currently hinder the development of bi-parental populations, i.e. a long juvenile period [60], low fruit set [61], low seed germination [62] and lack of knowledge about trait heritability [63][64][65]. LD-based association mapping is thus considered to be a suitable approach to determine the genetic basis of traits in olive varieties according to the available diversity. Moreover, the development of a core collection is essential to effectively optimize the use of such diversity. Two core collections encompassing the total allelic diversity of OWGB Cordoba have been reported to date [51,66]. However, only a single core collection was proposed in each study, which hinders effective and flexible use of the broad range of olive diversity, and western Mediterranean accessions, particularly those originating from Spain (more than 40% of entries in the CC), are over-represented in both core collections.
In addition, despite using two different sampling algorithms via the MSTRAT [40] and CORE HUNTER [43] programs, these core collections were developed based only on capturing total alleles (or allelic coverage; Cv) as the main criterion, which is questionable for sampling as it excludes selection of highly genetically distant entries. Moreover, neither core collection was investigated with regard to the genetic structure and relatedness between selected entries for association mapping.
Here a two-step method using nuclear microsatellite loci, cpDNA haplotypes and agro-morphological traits is proposed, combining the assets of the MSTRAT and CORE HUNTER programs, with the aim of building flexible olive core collections from OWGB Marrakech suitable for association studies. We specifically aimed to (1) compare various sampling methods and sizes to select the best ones based on diverse criteria, and (2) propose many core collections of optimal size for field evaluation that reflect the geographic origin and diversity of olive. The convenience of the developed core collections for association mapping is examined with regard to genetic structure, relatedness and linkage disequilibrium.

Dataset
A total of 561 accessions from 14 countries, maintained in the ex situ OWGB Marrakech collection, were used in this study (Table S1). A set of 17 SSR loci was used for accession genotyping (Text S1). Plastid DNA (or cpDNA) was characterized using 37 polymorphic loci and two cleaved amplified polymorphism sites (CAPS-XapI and CAPS-EcoRI), as described by Besnard et al. [67] (Text S1).
The phenotypic data was from olive databases and national catalogues based on passport data and variety name as identification key [68][69][70][71][72]. Data on 72 agro-morphological traits classified into 213 trait classes according to standards described by the International Olive Oil Council (IOOC) was compiled for 425 varieties (Table S2).

Construction of Core Subsets
To compare the performance of current state-of-the-art methods to construct core subsets, as a benchmark, we estimated the minimum size necessary to capture all the observed alleles using the MSTRAT program (Figure 1). The size assessment indicated that 80 entries were necessary to capture the total allelic diversity (16% of OWGB Marrakech). Then, at this sample size, four different sampling methods were first tested:
1. The maximizing method (M-method) implemented in the MSTRAT program. By using an iterative maximization procedure, MSTRAT examines all possible core subsets and singles out those that maximize the number of alleles (and/or trait classes) in the dataset for a given sample size. The program allows the user to specify a compulsory set of accessions, called a "kernel", that will always be included in the core subset. In this case, maximization is focused on complementing alleles not included in the kernel. The Shannon-Weaver diversity index [73] was used as a second criterion to classify core subsets capturing the same number of alleles.
2. The advanced stochastic local search method (ASLS method) implemented in the CORE HUNTER program. The program is able to select core subsets using diverse allocation strategies by optimizing one genetic parameter or many parameters simultaneously, whereby the best solution among all replicas is reported. For instance, when optimizing only the genetic distance, i.e. the "DCE strategy", the proposed core subset typically consists of genetically distant accessions, whereas the "Cv strategy" emphasizes the selection of genotypes with the most diverse alleles. Three allocation strategies were used: (i) optimizing each of the following measures independently (average Cavalli-Sforza and Edwards genetic distance, "DCE strategy" [74]; allelic coverage or number of alleles, "Cv strategy"; Shannon-Weaver diversity index, "Sh strategy"; or Nei diversity index, "He strategy" [75]); (ii) optimizing all measures simultaneously with equal weight assigned to each one ("multi-strategy"); and (iii) optimizing both DCE and Cv simultaneously ("DCE Cv strategy"). A previous analysis revealed that when a weight of 60% was assigned to DCE and 40% to Cv, all observed alleles were captured in the sampled subset (Figure S1).
3. The maximum length sub-tree method (MLST method) implemented in the DARWIN v.5.0.137 program [41]. Starting from a diversity tree, the procedure is performed step by step. At each step, the unit for each pair with the minimal length of the external edge in the tree is removed. The procedure searches for the most unstructured tree, i.e. a star-like tree, by successive pruning of redundant units. The genetic distance between genotypes was calculated using the simple matching coefficient [76] and the tree was drawn based on the Neighbor-joining method [77].
4. The random method (R-method) using the POWERMARKER v.2.25 program [78]. Samples were selected arbitrarily without replacement of genotypes.
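The genetic measures that these strategies optimize can be illustrated with a minimal Python sketch. It is not the CORE HUNTER or MSTRAT implementation: the data layout (each accession as a dict mapping a locus name to a tuple of alleles), the function names and the toy genotypes are assumptions made for illustration, and the distance uses one common normalization of the Cavalli-Sforza and Edwards chord distance that keeps values between 0 and 1.

```python
import math
from collections import Counter
from itertools import combinations


def allele_freqs(subset, locus):
    """Relative allele frequencies at one locus within a list of accessions."""
    counts = Counter(a for acc in subset for a in acc[locus])
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def coverage(subset, whole, loci):
    """Cv: share of the alleles seen in the whole collection that the subset holds."""
    captured = sum(len(allele_freqs(subset, l)) for l in loci)
    observed = sum(len(allele_freqs(whole, l)) for l in loci)
    return captured / observed


def nei_he(subset, loci):
    """He: expected heterozygosity, 1 - sum(p^2), averaged over loci."""
    per_locus = [1 - sum(p * p for p in allele_freqs(subset, l).values())
                 for l in loci]
    return sum(per_locus) / len(per_locus)


def cse_distance(a, b, loci):
    """Chord-type distance between two accessions (assumed normalization)."""
    total = 0.0
    for l in loci:
        pa, pb = allele_freqs([a], l), allele_freqs([b], l)
        total += sum((math.sqrt(pa.get(x, 0.0)) - math.sqrt(pb.get(x, 0.0))) ** 2
                     for x in set(pa) | set(pb))
    return math.sqrt(total / (2 * len(loci)))


def mean_distance(subset, loci):
    """Average pairwise distance within a subset (needs at least two entries)."""
    pairs = list(combinations(subset, 2))
    return sum(cse_distance(a, b, loci) for a, b in pairs) / len(pairs)


# Toy diploid genotypes at two hypothetical SSR loci (allele sizes in bp).
loci = ["L1", "L2"]
whole = [
    {"L1": (100, 100), "L2": (200, 204)},
    {"L1": (100, 102), "L2": (200, 200)},
    {"L1": (104, 106), "L2": (208, 210)},
]
core = [whole[0], whole[2]]
print(coverage(core, whole, loci), nei_he(core, loci), mean_distance(core, loci))
```

In this toy case the two retained accessions share no allele, so their distance is maximal, yet one allele of the whole collection is lost; this is the kind of trade-off between distance and coverage that the allocation strategies weigh.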
Moreover, four other sizes were tested by the optimal methods selected at 16% sample size, i.e. 4% (20 entries), 8% (40), 24% (120) and 32% (160). To simplify the notation, we assigned a code to each sampled subset, as shown in Table 1 and in Table S3. For instance, CC1-80 is the subset sampled at 16% sample size (80 entries) using the "Cv strategy" with the ASLS method. Twenty replicates and 100 iterations were generated independently for each sample size and method without prior knowledge of the origin of the respective varieties. Once the optimal sampling method and size were selected, two procedures were performed in the second sampling step: (i) sampling with both nuclear markers and agro-morphological traits and (ii) using only nuclear markers (Figure 1). These procedures were compared in order to test the effect of using phenotypic traits when sampling entries. In addition, 14 reference varieties were considered significant when constructing the core subsets. These varieties were considered to be the most prominent and most cultivated in the olive-growing

Comparison of Sampling Methods and Sample Sizes
Figure 1. Current study flow chart to construct core collections from OWGB Marrakech. There were two main steps. As a benchmark, a sample size was determined using the MSTRAT program to compare different sampling methods and sizes; 80 entries were necessary to capture all alleles. A primary core collection (CC50) was constructed using the CORE HUNTER program at 8% sample size (step 1). Then CC50 was used as a kernel to select the minimum size required to capture the total diversity using the MSTRAT program (step 2). At this step, two procedures were performed, i.e. sampling with nuclear markers and trait classes (A; 94 entries were necessary) or using only nuclear markers (B; 92). For both procedures, a set of 72 genotypes was used in all independent runs while a combination of 22 complement genotypes could be selected from a panel of 106 genotypes to capture all of the allelic and phenotypic diversity (CC94) or 20 genotypes from a panel of 91 genotypes to capture the total allelic diversity (CC92). doi:10.1371/journal.pone.0061265.g001

To test the ability of each sampling method and size in capturing the diversity and representativeness in the sampled subset as compared to OWGB Marrakech, different criteria were considered: (i) the recovery of maximum alleles, trait classes and cpDNA haplotypes observed in the whole collection; (ii) a high and significant Shannon-Weaver diversity index estimated by the t-test (p≤0.05); (iii) no significant differences in the Nei diversity index and in allelic richness computed by the Mann-Whitney test (p≤0.05) with the PAST program [79]; and (iv) the presence of the 14 reference varieties defined above.
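Criteria (i) and (iv) amount to simple set comparisons, which the following Python sketch illustrates. The function names, allele labels and variety names are hypothetical, and the statistical tests of criteria (ii) and (iii) are not reproduced here.

```python
def recovery(core_items, whole_items):
    """Fraction of the distinct items of the whole collection found in the core."""
    whole = set(whole_items)
    return len(whole & set(core_items)) / len(whole)


def missing_references(core_names, reference_names):
    """Reference varieties that the core subset does not contain."""
    return sorted(set(reference_names) - set(core_names))


# Toy data: allele labels and variety names are invented for illustration.
whole_alleles = {"A1", "A2", "A3", "A4", "A5"}
core_alleles = {"A1", "A2", "A4", "A5"}
allele_recovery = recovery(core_alleles, whole_alleles)  # 4 of 5 alleles

absent = missing_references(
    core_names=["var_a", "var_c"],
    reference_names=["var_a", "var_b", "var_c"],
)
print(f"alleles recovered: {allele_recovery:.0%}; missing references: {absent}")
```

The same `recovery` helper can be applied unchanged to trait classes or cpDNA haplotypes, since each is just a set of observed categories.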

Assessment of Core Collections for Association Mapping Purposes
As the sub-structure within subsets and the relatedness between genotypes (known also as the kinship coefficient) are the major components to take into consideration in association mapping analyses [80][81][82], an assessment of both factors in proposed core collections was performed. Two approaches were used to assess the genetic structure: (i) principal coordinate analysis (PCoA) implemented in the DARWIN v.5.0.137 program using a simple matching coefficient to describe the spatial distribution of genotypes; and (ii) model-based Bayesian clustering implemented in STRUCTURE v.2.2 [83] according to the parameters described in Haouane et al. [52]. The reliability of the number of K clusters was checked using the ad-hoc ΔK measure [84] with the R program whereas the similarity index between 10 replicates for the same K clusters (H′) was calculated via CLUMPP [85].
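The ad-hoc measure of Evanno et al. [84] divides the absolute second-order change in the mean log-likelihood at K by the standard deviation of the log-likelihood across replicates at that K. The sketch below shows this computation in Python rather than R; the function name, the input layout and the log-likelihood values are invented for illustration, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def delta_k(lnp):
    """Compute the ad-hoc statistic for every K with both neighbours available.

    lnp maps K to the list of ln P(D) values from replicate runs at that K.
    A zero standard deviation at some K would raise ZeroDivisionError.
    """
    avg = {k: mean(v) for k, v in lnp.items()}
    result = {}
    for k in sorted(lnp):
        if k - 1 in avg and k + 1 in avg:
            second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
            result[k] = second_diff / stdev(lnp[k])
    return result


# Invented log-likelihoods for three replicate runs at K = 1..4.
toy_lnp = {
    1: [-1000, -1002, -998],
    2: [-900, -901, -899],
    3: [-890, -895, -885],
    4: [-888, -880, -896],
}
dk = delta_k(toy_lnp)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

The statistic is undefined for the smallest and largest K tested, which is why the range of K explored has to extend beyond the values of interest.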
The relative kinship coefficient between genotypes was computed via SPAGEDI [86] through the coefficient of Loiselle et al. [87]. Negative values between two individuals, indicating that there was less relationship than that expected between two random individuals, were replaced by 0, as proposed by Yu et al. [80]. The TASSEL 2.0 program [88] was used to estimate the LD (r² coefficient) among 17 nuclear loci after deletion of low frequency alleles (less than 0.05). A p-value for each LD score was computed through 1000 permutations to determine the significance. For the whole collection, only genotypes distinguished by more than three dissimilar alleles were considered when computing the kinship coefficient and LD in order to avoid considering variants of the same genotype.
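The two numerical steps described here, truncating negative kinship values at zero and attaching a permutation p-value to an LD score, can be sketched in Python as follows. This is an illustration under stated assumptions, not the SPAGEDI or TASSEL procedure: r² is computed as the squared Pearson correlation between two numeric allele-indicator vectors, the treatment of multi-allelic SSR loci is not reproduced, and all names and data are hypothetical.

```python
import random


def clamp_kinship(values):
    """Replace negative kinship estimates by 0, leaving other values unchanged."""
    return [max(0.0, v) for v in values]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric vectors.

    Both vectors must vary; a constant vector would cause a division by zero.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def permutation_p(x, y, n_perm=1000, seed=0):
    """Share of shuffled data sets whose r2 reaches the observed value."""
    rng = random.Random(seed)
    observed = r_squared(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if r_squared(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Invented presence/absence of one allele at each of two loci in 8 genotypes.
allele_a = [0, 0, 0, 0, 1, 1, 1, 1]
allele_b = [0, 0, 0, 0, 1, 1, 1, 1]
print(clamp_kinship([-0.1, 0.0, 0.2]))
print(r_squared(allele_a, allele_b), permutation_p(allele_a, allele_b))
```

Adding 1 to both the numerator and the denominator of the p-value keeps it strictly positive, a common convention for permutation tests.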

Characterization of Worldwide Olive Germplasm Bank of Marrakech
Using 17 nuclear SSR loci, all 561 accessions of OWGB Marrakech were classified into 502 distinct SSR profiles (Table S1) whereas 457 genotypes were distinguished by more than 3 dissimilar alleles. A total of 279 alleles were revealed with a mean of 16.4 alleles per locus (Text S2). The set of plastid markers revealed the presence of 12 haplotypes in OWGB Marrakech, with one highly frequent one (E1.1, 83.2%; Text S2).

Comparison of Sampling Methods
This comparison was carried out using the 502 SSR profiles with a 16% sample size determined previously by MSTRAT. All core sets sampled by different methods outperformed CC9-80 (core chosen randomly) in which the DCE, He, and Sh values were quite similar to those of OWGB Marrakech whereas the allelic richness values were significantly different from those of the whole collection (p<0.05; Table 1; Figure 2). When optimizing each of the four genetic parameters independently with the ASLS method, the sampled core subsets had the highest scores of all the core subsets with respect to the parameter being optimized, whereas other parameters not considered during optimization were highly affected (Table 1). For instance, with the "DCE strategy", the selected core subset showed the highest DCE (CC2-80; 0.833±0.07), while a low number of alleles was captured compared to the "Cv strategy" (only 234 among 279 alleles). For the MLST method, the CC8-80 core subset revealed higher DCE and similar Sh values as compared to CC6-80 and CC7-80, whereas fewer captured alleles were noted (only 236 alleles). Finally, four sampling strategies using the ASLS method showed better DCE and Sh scores than all other core subsets, including CC7-80, generated by the maximizing method (Table 1; Figure 2).
All methods allowed capture of at least 93.4% of the trait classes (CC9-80; Table 1) and all cpDNA haplotypes observed in OWGB Marrakech were captured in CC1-80 and CC7-80, whereas only 11 haplotypes (except E2.3 observed once for the "Lechin de Sevilla" variety from Spain) were captured when optimizing genetic parameters other than Cv using the ASLS method (Table 1).

Table 1. Genetic parameters of core subsets selected by different sampling methods at 16% sample size: advanced stochastic local search (ASLS), maximizing (M), maximum length sub-tree (MLST) and random (R). Four sampling strategies using the ASLS method were found to be the most suitable for comparing different sampling sizes (in bold). Cv: allelic coverage or number of alleles, DCE: average genetic distance of Cavalli-Sforza and Edwards, SD: standard deviation, He: Nei diversity index, Sh: Shannon-Weaver diversity index. (1) Each parameter was optimized independently by performing 20 runs with 100% weight given to the respective parameters ("Cv strategy", "DCE", "Sh", and "He"). (2) Twenty independent runs were performed with equal weight given to each of the four parameters simultaneously ("multi strategy"). (3) Subset sampled when a weight of 60% was assigned to DCE and 40% to Cv ("DCE Cv strategy"). *Statistically significant difference (p<0.05) using the Mann-Whitney test between core subsets and OWGB Marrakech. doi:10.1371/journal.pone.0061265.t001
According to the results, four allocation sampling strategies using the ASLS method were selected, i.e. "DCE", "He", "Sh", and "multi-strategy" (Figure 2). Core subsets generated using the four strategies highlighted a trade-off in the genetic parameters considered in the study, including genetic distance (Table 1). These strategies were tested with different sample sizes (4, 8, 24, and 32%).

Comparison of Sampling Size
As shown in Figure 3, the sample size was inversely correlated with DCE and Sh, except for the 4% sample size, because of allelic redundancy within the core subset when the core size is increased. Increasing the sample size did not improve the capture of total alleles and trait classes, except for the "multi-strategy" where all alleles had been captured at 24% sample size (Table S3).
It would be unfeasible to design a core collection to fulfil all genetic measures at once because of the trade-off between genetic parameters. We thus propose a two-step method whereby one representative core subset of reasonable size is first selected, with a trade-off between DCE, Sh, He, and Cv genetic measures, and secondly a core subset is compiled with genotypes carrying missing alleles and trait classes. Hence, the CC2-40 core subset constructed using the "Sh strategy" with the CORE HUNTER program at 8% sample size was chosen as a starting point for the following steps since it nearly fulfilled all the required genetic parameters while being of suitable size (Table S3). However, eight among the 14 reference varieties defined above and two among the 12 haplotypes of OWGB Marrakech (E2.2 observed for "Trillo", "Crastu", and "Gremigno di Fuglia" varieties from Italy, and E2.3 for "Lechin de Sevilla" from Spain) were not captured in the CC2-40 core subset. When we examined alleles not captured in CC2-40 (54 among 279 alleles), it was found that 26 among the 54 alleles occurred once. In addition, all entries were conserved in successive core subsets sampled by the "Sh strategy" while increasing the sample size, indicating the consistency of the sampling strategy and the robustness of the genetic parameter for selecting entries.
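The second step, completing a fixed kernel with entries that carry missing alleles, can be sketched as a greedy set-cover procedure in Python. This greedy rule is a simplification swapped in for illustration: MSTRAT relies on an iterative maximization procedure with replicates, and the first step (selection of the kernel by the Shannon-Weaver index) is not reproduced. Accession names and allele labels are hypothetical.

```python
def complete_from_kernel(kernel, candidates, alleles):
    """Add candidates to the kernel until every allele in `alleles` is covered.

    kernel, candidates: lists of accession names.
    alleles: dict mapping each accession name to the set of alleles it carries.
    At each step the candidate bringing the most uncaptured alleles is added;
    ties are resolved in favour of the candidate listed first.
    """
    target = set().union(*alleles.values())
    core = list(kernel)
    covered = set().union(*(alleles[name] for name in core))
    pool = [name for name in candidates if name not in core]
    while covered != target and pool:
        best = max(pool, key=lambda name: len(alleles[name] - covered))
        if not alleles[best] - covered:
            break  # no remaining candidate adds a new allele
        core.append(best)
        pool.remove(best)
        covered |= alleles[best]
    return core


# Toy collection: four accessions and five alleles in total.
toy_alleles = {
    "acc_A": {"a1", "a2"},
    "acc_B": {"a2", "a3"},
    "acc_C": {"a4"},
    "acc_D": {"a3", "a4", "a5"},
}
final_core = complete_from_kernel(["acc_A"], list(toy_alleles), toy_alleles)
print(final_core)
```

Because several candidates often carry the same missing alleles, different tie-breaking orders lead to different but equally complete core subsets, which is consistent with building many alternative core collections around one kernel.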

Development of Final Core Collections
A primary core collection of 50 entries (CC50) was defined (Figure 1, step 1). This core collection includes the 40 entries of CC2-40, the "Lechin de Sevilla" and "Trillo" varieties, which carry the two missing cpDNA haplotypes, and the 8 missing reference varieties among the 14 defined above (Table S4; Figure S2, level 1). The 50 entries enabled capture of 229 alleles, 12 haplotypes, and 207 trait classes (Table 2) and reflected the geographic distribution of olive since varieties from 11 countries among 14 were represented (Table 3).
Using the primary core collection (CC50) as a kernel (Figure 1, step 2), we estimated the minimum number of entries needed to capture all alleles and trait classes using the MSTRAT program. The analysis indicated that 94 entries were necessary, and 200 core collections of this size were built in independent runs (Table S4). For each core collection of 94 entries (CC94), 72 genotypes were found to be common in all of the 200 independent runs, i.e. the 50 genotypes used as a kernel and 22 genotypes carrying alleles observed once, while a combination of 22 complementary genotypes was selected among a panel of 106 genotypes shared between the 200 runs (Figure 1; Figure S2, level 2). Arbitrarily selecting one core collection (CC1 in Table S4) revealed that all countries were represented, except for Slovenia which has 9 accessions in OWGB Marrakech (Table 3). Genotypes from this country were found in 73 of the 200 core collections (Table S4).
The effect of using phenotypic traits when sampling genotypes was tested by constructing core collections based only on nuclear data and CC50 as a kernel (Figure 1, step 2-B). The redundancy function of the MSTRAT program revealed that 92 entries (CC92) were necessary to capture all 279 alleles. As for CC94, 72 genotypes were common between all 200 constructed core collections of 92 entries (result not shown), whereas a panel of 91 genotypes could be used to select a combination of 20 complementary genotypes to capture the total allelic diversity. One core collection of 92 entries among 200 was arbitrarily chosen and compared to the above described CC94. The results indicated that 99% of the trait classes (211 among 213) were captured in this core collection and similar values were obtained regarding DCE, Sh and He for both core collections (Table 2). In addition, 85 genotypes were shared between CC92 and CC94. Hence, phenotypic data may have a limited effect since similar results were obtained regardless of the sampling method used, i.e. using trait classes or not.

Genetic Structure and Representativeness of the Core Collections
Using model-based Bayesian clustering, the STRUCTURE program allowed classification of the 502 genotypes into three gene pools according to their regional origins (western, central, and eastern Mediterranean; Figure 4; Table S1), while the second most likely genetic structure was found at K = 5 (ΔK = 155.12 and H′ = 0.992; Figure S3). Similar results were obtained when the analysis was conducted on genotypes distinguished by more than three dissimilar alleles (457 genotypes; results not shown). In both core collections (CC50 and CC94), the selected genotypes revealed a high level of admixture between gene pools. In fact, among the 50 and 94 genotypes, 23 (46%) and 71 (75.5%) were assigned to more than one gene pool with membership probabilities of less than 0.80, respectively. In addition, principal coordinate analysis (PCoA; Figure 5) revealed that both core collections encompassed the entire range of genotypes in the three gene pools, whereas 32 (64%) and 65 (69.1%) entries were classified into the central Mediterranean gene pool for the CC50 and CC94 core collections, respectively. Low ΔK and H′ scores at K = 3 were noted for both core collections compared to OWGB Marrakech, therefore highlighting the absence of stability in obtaining runs at K = 3. Although high ΔK and H′ scores at K = 5 were obtained for both core collections (Figure S3), no consistency in genetic structure was noted when plotting the Q scores (Figure S4), while the model at K = 3 indicated two subgroups for both CC50 and CC94; the first one contained entries originating from the western and central Mediterranean whereas the second included eastern Mediterranean varieties (Figure 4).
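The 0.80 membership threshold used to separate assigned from admixed genotypes can be expressed as a short rule, sketched here in Python. The function name, the input layout and the membership coefficients are invented for illustration; only the threshold comes from the text.

```python
def assign_pools(q_scores, threshold=0.80):
    """Map each genotype to the index of its gene pool, or None if admixed.

    q_scores: dict of genotype name -> list of membership coefficients,
    one per cluster. A genotype whose highest coefficient is below the
    threshold is treated as admixed.
    """
    assignment = {}
    for name, q in q_scores.items():
        best = max(range(len(q)), key=lambda i: q[i])
        assignment[name] = best if q[best] >= threshold else None
    return assignment


# Invented membership coefficients for K = 3 (each row sums to 1).
toy_q = {
    "g1": [0.92, 0.05, 0.03],
    "g2": [0.40, 0.35, 0.25],
    "g3": [0.10, 0.85, 0.05],
    "g4": [0.55, 0.05, 0.40],
}
pools = assign_pools(toy_q)
admixed_share = sum(p is None for p in pools.values()) / len(pools)
print(pools, admixed_share)
```

Applied to real Q scores, the share of `None` values corresponds to the proportion of admixed genotypes reported for each core collection.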
When considering only 457 genotypes distinguished by more than three dissimilar alleles, the LD scores (r²) were significant for 59.5% of the pairwise comparisons (81 among 136 pairwise comparisons), while only 26.5% of the pairwise comparisons displayed a significant LD in CC94 (Figure 6). The relative kinship computed for both core collections showed a high pairwise frequency at 0-0.05 (87.6% for CC50 and 84.9% for CC94), whereas it decreased progressively between 0.05 and 0.45 (7.8% and 10.4% to 0.08% and 0.04% for CC50 and CC94, respectively; Figure 7).
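The number of comparisons behind these percentages follows from counting unordered pairs, as the short Python sketch below shows: 17 loci give the 136 locus pairs mentioned above, and the kinship frequencies are taken over all pairs of entries in each core collection. The helper `share_below` and the toy kinship values are hypothetical, and whether the boundary value 0.05 belongs to the first class is an assumption.

```python
import math

locus_pairs = math.comb(17, 2)       # pairwise comparisons among 17 loci
entry_pairs_cc50 = math.comb(50, 2)  # pairs of entries in CC50
entry_pairs_cc94 = math.comb(94, 2)  # pairs of entries in CC94


def share_below(values, cutoff=0.05):
    """Fraction of pairwise kinship values that are strictly below the cutoff."""
    return sum(v < cutoff for v in values) / len(values)


# Invented kinship values (already truncated at 0) for illustration only.
toy_kinship = [0.0, 0.0, 0.01, 0.03, 0.0, 0.07, 0.12, 0.0, 0.02, 0.30]
print(locus_pairs, entry_pairs_cc50, entry_pairs_cc94, share_below(toy_kinship))
```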

Discussion
The aim of the study was to construct flexible core collections for cultivated olive, of a manageable working size for conducting association mapping studies, by sampling the minimum number of entries that maximize the representativeness of allelic and phenotypic diversity. Such working core collections facilitate experimental trials to assess germplasm under contrasting environmental conditions. We analyzed our results with regard to: (1) the representativeness of the Marrakech OWGB, (2) the tools and criteria used for defining the core collections, and (3) the efficiency of the developed core collections for genetic association mapping.

OWGB Marrakech is Representative of Mediterranean Olive Diversity
Despite the presence of similar proportions of alleles with frequencies <1% and those observed only once in both OWGB collections (Text S2; 53.4% and 19.5% in OWGB Cordoba, respectively) [66], a higher allelic richness was noted in OWGB Marrakech than in OWGB Cordoba (16.41 and 11.38 alleles/locus [51], respectively). OWGB Marrakech was found to be more diversified than OWGB Cordoba as shown by the presence of more accessions from different countries, particularly those from the eastern Mediterranean [52]. OWGB Marrakech has more Egyptian (19 genotypes), Syrian (47), and Lebanese (9) genotypes than OWGB Cordoba, while more than 55% of all accessions in OWGB Cordoba are from Spain [51,66]. The entire diversity observed in OWGB Marrakech is explained mainly by the scientific context when setting up the collection. The germplasm bank was set up with previously characterized genetic resources, including agro-morphological descriptors and/or molecular markers from each Mediterranean country, in order to optimize the available olive germplasm [52]. The olive germplasm available in OWGB Marrakech better reflects the genetic structure of cultivated olive in the Mediterranean basin, since three gene pools were distinguished, i.e. western, eastern and central Mediterranean, as also reported by Sarri et al. [57] and Baldoni et al. [58] using different sets of SSR markers, while only two were revealed in OWGB Cordoba by Belaj et al. [51], i.e. western and eastern/central Mediterranean. Therefore, we consider that OWGB Marrakech is particularly suitable for association mapping studies and also for establishing representative core collections since it encompasses a high range of olive germplasm from the Mediterranean Basin, including the eastern gene pool.
Nevertheless, a simultaneous analysis of both germplasm banks, as one single dataset, with the same set of molecular markers to construct a real core collection representing Mediterranean olive germplasm will certainly provide complementary information and thus be an asset for olive genetic research.

Effectiveness of Processed Data in Constructing Core Collections
Accessions with similar phenotypes may not necessarily have a close genetic relationship [38] because of the polygenic properties of most traits and the effect of the environment on the expression of the trait being analyzed. Hence, applying molecular marker information reflecting the DNA polymorphism pattern is a powerful tool in core collection development. The cost, time, and effort required for phenotypic characterization, especially in a woody perennial species collection, are much greater than required for an assessment using molecular tools. As most of the 17 loci used here are well scattered throughout linkage groups [89][90], we assume that the applied set of SSRs may be effective to obtain an overview of olive diversity as observed in other studies [29,36]. Further studies using other sets of molecular markers (e.g. SNP) could confirm our assumption. Furthermore, although maternal lineage polymorphism is lower within olive varieties than in oleasters [67], chloroplast sequence information remains valuable when establishing core collections. This information optimises sampling to clarify the evolutionary history of olive varieties and therefore their involvement in agronomic traits of interest alone or in association with nuclear genes.
In addition, the compiled phenotypic data was used with caution in the present study since not all varieties were completely characterized with the 72 agro-morphological traits and phenotypic data was gathered from different olive databases according to the variety names [68][69][70][71][72]. As we could not exclude the presence of distinct genotypes with the same name due to mislabeling and synonymy cases [55], such data is best regarded as a first screening of the phenotypic variability of olive varieties in OWGB Marrakech. Its use could provide additional and qualitative information to choose entries covering the range of variability of phenotypic traits. Whatever their level of representativeness of phenotypic variability in Mediterranean olive, these traits may have a limited effect on the sampling of entries since we obtained similar results using phenotypic trait classes or not. Further field assessments are clearly required to obtain more reliable and comprehensive data on the phenotypic diversity of selected entries.

Core Collections are Highly Representative of the Overall Olive Genetic Variability
The broad diversity in the Marrakech OWGB could be represented in two core collections of 50 (10%) and 94 (18.7%) entries capturing 82% and 100% of the total allelic diversity, respectively. A decrease in DCE, He, and Sh scores was noted when the core collection size was increased from 50 to 94 entries (Table 2). This could mainly be explained by the redundancy of the information provided by each additional genotype, since the entries added to the initial 50 genotypes contributed fewer than two alleles each, i.e. the 44 added entries provided only 50 additional alleles (mean of 1.13 alleles/entry). A size of 94 entries, capturing the total diversity, is suitable for field assessments with many replicates for association mapping, since many association studies on annual and perennial species have used a similar number of accessions drawn from highly diverse original collections: 95 accessions for Triticum aestivum [19]; 96 for Arabidopsis thaliana [91] and Lolium perenne [92]; and 104 for Prunus persica [27].
Figure 4 [...] CC50, and CC94. H′ represents the similarity coefficient between runs, whereas ΔK represents the ad-hoc measure of Evanno et al. [84]. According to geographic and genetic criteria, three gene pools were revealed within Marrakech OWGB (western, central, and eastern Mediterranean groups) while the genetic structure was reduced to two sub-divisions in both core collections (eastern and western/central). doi:10.1371/journal.pone.0061265.g004
Taking into account the trade-off between genetic parameters, we consider that the two-step method is suitable for overcoming these constraints and that it could be applied to other annual and perennial species. The Shannon-Weaver diversity index was shown to be an adequate first criterion to be optimized to select core subsets with optimal allelic coverage and genetic distance. Basically, the index accounts for the allelic richness (number of distinct alleles) and the evenness (distribution of different alleles) within a given sample [43]. The Shannon-Weaver diversity index can therefore be used to sample individuals that capture the most allelic variation while discarding those that carry only the most-represented alleles, so that alleles tend to be equally represented in the sample. To our knowledge, this is the first attempt to use the Shannon-Weaver diversity index as a first criterion to set up core collections, whereas it has frequently been used in other studies to validate the relevance of constructed core subsets [29,30,79,93]. This genetic parameter could be used as a first criterion to enhance field experimentation since it reduces artefacts resulting from the dominance of some categories (alleles and/or trait classes) over others.
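The way the Shannon-Weaver index rewards both richness and evenness can be illustrated with a minimal sketch. It assumes the textbook form H = -Σ p_i ln p_i computed from allele counts at a single locus; the function name `shannon_index` and the two toy samples are hypothetical and are not data from this study, and the way CORE HUNTER aggregates the index across loci is not reproduced here.

```python
import math


def shannon_index(allele_counts):
    """Shannon-Weaver index H = -sum(p_i * ln p_i) for one locus.

    allele_counts: number of copies observed for each distinct allele.
    """
    total = sum(allele_counts)
    if total <= 0:
        raise ValueError("at least one allele copy is required")
    h = 0.0
    for count in allele_counts:
        if count < 0:
            raise ValueError("allele counts cannot be negative")
        if count > 0:  # alleles that are absent contribute nothing
            p = count / total
            h -= p * math.log(p)
    return h


# Two made-up samples with the same richness (4 alleles, 20 copies each).
even_sample = [5, 5, 5, 5]      # all alleles equally represented
skewed_sample = [17, 1, 1, 1]   # one allele dominates

h_even = shannon_index(even_sample)
h_skewed = shannon_index(skewed_sample)
print(f"even: {h_even:.3f}  skewed: {h_skewed:.3f}")
```

With equal richness, the evenly distributed sample scores higher, which is why maximizing the index tends to discard entries that only add copies of already common alleles.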
Both core collections (CC50 and CC94) are of reasonable size, as previous studies proposed core sizes of 5-20% capturing at least 70% of the genetic diversity [46]. CC94 is similar in size to core collections previously obtained in Olea europaea [51,66] and Pyrus communis [94]. However, compared with other perennial and highly heterozygous species, this sample size is larger than those obtained in Annona cherimola (14.3%, 40 entries) [33], Malus sieversii (10.5%, 84) [34] and Vitis vinifera (4%, 92) [36]. This may be explained by the high diversity and low redundancy in Marrakech OWGB, as compared to the high redundancy and presence of many accessions of clonal origin in the Vitis collection [95].
In contrast to previously developed olive core collections, the proposed two-step method may be used to develop many core collections with one common set of 72 varieties and 22 different varieties. In fact, CC94 is a flexible core collection in which 200 specific combinations of 22 varieties are available that can be chosen on the basis of many criteria, such as geographic origin, economic importance, traits of interest, and/or previous use in breeding programs. This approach enables experimental flexibility and rational choice of varieties to be studied, with the possibility of adding supplementary genotypes to the initial core collection of 94 entries, if necessary.
Although they used different sampling algorithms, Belaj et al. [51] and Diez et al. [66] both proposed core collections by maximizing only the number of alleles as the main criterion. Here we were able to construct core collections by taking many criteria into account at once, including the sampling of genetically distant varieties. Moreover, a substantial over-representation of western accessions was noted in both previous olive core collections, since 46% of the entries originated from the western Mediterranean gene pool, mainly from Spain, versus 30% and 24% from the eastern and central gene pools, respectively. By contrast, both core collections proposed in the current study accurately reflected the geographic distribution of cultivated olive and showed a high admixture level, since 48% and 52% of the 50 and 94 entries, respectively, originated from the central Mediterranean zone. Our proposal is supported by the fact that the central Mediterranean zone is a hybrid area between the eastern and western zones, as shown by the admixed inferred ancestry of most of the genotypes sampled in this area [52,96]. Strikingly, when comparing the varietal composition of the CC94 core collection with those previously published for olive, we found that only 11 and 12 varieties were shared with those reported by Diez et al. [66] and Belaj et al. [51], respectively. This finding could mainly be explained by the different sampling approaches used to construct the core collections and by differences between the original OWGB collections in genetic diversity and varietal composition, since only 153 varieties are common to both OWGB collections [52].

Core Collections are Promising for Association Mapping
Unidentified population sub-divisions that have arisen through the evolutionary history of a species (bottleneck effects, domestication processes), local adaptation and/or selection are a major constraint for association mapping because of the many false positives they generate [23,80,81,82]. Hence, information on genetic structure, the extent of LD and the relatedness between genotypes is crucial for association mapping. Ideally, samples should have minimal population structure and familial relatedness to achieve the best statistical power [80]. Here we considered two sub-divisions within the proposed core collections, which reflect the genetic structure of OWGB Marrakech, itself classified into three gene pools. In addition, there was evidence of spurious LD between unlinked SSR loci in nearly all of the pairwise tests in the whole collection (Figure 6). This could mainly be explained by the genetic sub-division within OWGB Marrakech, as noted by the model-based Bayesian clustering, whereas LD measurements changed markedly in the CC94 core collection. As reported by Breseghello and Sorrells [19] and Pessoa-Filho et al. [44], the significant reduction in spurious disequilibrium is mainly due to sampling effects when diversity is maximized, while the spurious LD that remained in the CC94 core collection was possibly caused by the low level of genetic structure in the 94 sampled entries. The assessment of relative kinship showed that most genotypes in OWGB Marrakech were largely unrelated (80.6% of pairwise kinship estimates fell between 0 and 0.05). Similar genotype relatedness patterns were noted in both core collections (87.6% and 84.9% for CC50 and CC94, respectively). Our findings were similar to those obtained in Brassica napus [97], B. rapa [98], and Zea mays [12], for which relative kinship estimates indicated a low level of relatedness between genotypes, with only a few pairs of genotypes being more related than any pair taken at random in the selected sub-sample.
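The kinship summary reported above (the share of pairwise comparisons falling in the 0-0.05 class) can be sketched as follows. This is only an illustration under stated assumptions: the function name `fraction_of_pairs_in_range` and the 4 x 4 matrix are hypothetical, the values are made up rather than taken from OWGB Marrakech, and the kinship estimator used in the study is not reproduced.

```python
from itertools import combinations


def fraction_of_pairs_in_range(kinship, low=0.0, high=0.05):
    """Share of distinct genotype pairs whose kinship lies in [low, high].

    kinship: symmetric square matrix (list of lists); the diagonal is ignored.
    """
    n = len(kinship)
    pairs = list(combinations(range(n), 2))
    if not pairs:
        raise ValueError("at least two genotypes are required")
    hits = sum(1 for i, j in pairs if low <= kinship[i][j] <= high)
    return hits / len(pairs)


# Made-up kinship matrix for four hypothetical genotypes.
toy_kinship = [
    [1.00, 0.00, 0.02, 0.30],
    [0.00, 1.00, 0.04, 0.00],
    [0.02, 0.04, 1.00, 0.10],
    [0.30, 0.00, 0.10, 1.00],
]

share = fraction_of_pairs_in_range(toy_kinship)
print(f"{share:.1%} of pairs fall in the 0-0.05 kinship class")
```

Only the upper triangle is scanned, so each pair of genotypes is counted once and self-comparisons do not inflate the result.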
Basically, since a set of unrelated individuals displays variation in many phenotypic traits, many trait/marker associations can be studied in the same panel of individuals [80]. The proposed core collections are relevant for genetic association studies because of their genetic structure and relatedness patterns [15,97]. These could be included as covariance parameters in models to control false-positive marker-trait associations in association mapping analyses [23,80,81,82].
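One common way of including structure and relatedness as covariance parameters is the unified mixed model often referred to as Q+K in the association mapping literature. The formulation below is given only as an illustration; the present study does not specify which model would be fitted, and the symbols are the conventional ones rather than notation taken from this paper.

```latex
% y: vector of phenotypic observations
% X\beta: fixed effects other than marker or structure effects
% S\alpha: fixed effect of the tested marker
% Qv: fixed effects of population structure (Q = inferred ancestry matrix)
% Zu: random polygenic effects, with K the relative kinship matrix
\begin{equation}
  y = X\beta + S\alpha + Qv + Zu + e,
  \qquad
  \operatorname{Var}(u) = 2K\sigma_g^{2},
  \qquad
  \operatorname{Var}(e) = I\sigma_e^{2}
\end{equation}
```

In this form, the Q matrix would correspond to the inferred ancestry coefficients and K to the pairwise relative kinship estimates discussed above, with the residual covariance assumed homogeneous.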

Conclusion
Our two-step method was shown to be well adapted for constructing core collections of a size suitable for transfer within the scientific community. Such core collections are suitable for association mapping as they accommodate many genetic criteria and provide potential users with more flexibility for choosing varieties. It has been demonstrated that both proposed core collections clearly reflected the geographic and genetic diversity of olive, so they will be of major interest for breeding researchers to help them conduct comparative trials.
This work represents a preliminary step towards developing association mapping studies by sampling core collections and assessing the structure and relatedness within samples. Note that the proposed core collections should be periodically updated by including additional olive germplasm in the base collection and adding novel molecular markers such as SNPs. At the current state, the developed core collections will be useful for conducting field assessments and suitable for developing a long-term strategy for genome-wide association studies in olive.

Supporting Information
Figure S1 Maximizing average Cavalli-Sforza & Edwards genetic distance (DCE) and allelic coverage (Cv). Values of DCE and Cv were maximized simultaneously with respect to a weight assigned to each measure. The CORE HUNTER program was run independently for 10 different weight values assigned to DCE and Cv measures: (1) when a weight of 100% was assigned to Cv, (2) when a weight of 40% was assigned to Cv and 60% to DCE, and (3) when a weight of 100% was assigned to DCE. (TIF)
Figure S2 Three different levels proposed for core collections. Level 1 (L1) represents the primary core collection (CC50), which includes the 40 entries selected using the "Sh strategy" implemented in CORE HUNTER program at 8%, two varieties carrying the two missing cpDNA haplotypes, and 8 nonselected reference varieties among the 14. Level 2 includes accessions carrying alleles observed once (22 genotypes). Level 3 represents final core collections (CC94) constructed by adding a complement of 22 genotypes to the previous 72 among a panel of 106 genotypes to capture the total allelic and phenotypic diversity. (TIF)
Figure S3 Plot of ad-hoc ΔK measurements and coefficients of similarity (H′) for K between 2 and 7. Arrows indicate the best genetic structure model for both core collections and OWGB Marrakech. According to both parameters, i.e. ΔK and H′, the best genetic structure model was not stable, while it is defined at K = 3 in Marrakech OWGB, indicating the absence of an obvious genetic structure in the core collections (see Figure S3). (TIF)
Figure S4 Inferred structure for K = 5 clusters within OWGB Marrakech, CC50, and CC94 core collections. H′ represents the similarity coefficient between runs, and ΔK represents the ad-hoc measure of Evanno et al. [84]. No consistency was observed in genetic structures based on more than three clusters. (TIF)
Table S1 List of 502 genotypes used in the present study classified according to distinct genotypes (SSR profiles), origin, maternal lineage and inferred ancestry (Q matrix) at K = 3 clusters. (XLS)
[...] (CC94) generated with MSTRAT using the core collection of 50 entries as a kernel (CC50). (x) Corresponds to the presence of the accession in the core collection concerned. The CC level column indicates the level of the core collection as shown in Figure S2. No differences between the 200 cores were observed for the Nei diversity index. (XLS)
Text S1 Protocols of nuclear and chloroplast loci analyses.