Genetic diversity and divergence at the Arbutus unedo L. (Ericaceae) westernmost distribution limit

Mediterranean forests are fragile ecosystems vulnerable to recent global warming and reduction of precipitation, and a long-term negative effect is expected on vegetation with increasing drought and in areas burnt by fires. We investigated the spatial distribution of genetic variation of Arbutus unedo in the western Iberia Peninsula, using plastid markers with conservation and provenance regions design purposes. This species is currently undergoing an intense domestication process in the region, and, like other species, is increasingly under the threat from climate change, habitat fragmentation and wildfires. We sampled 451 trees from 15 natural populations from different ecological conditions spanning the whole species’ distribution range in the region. We applied Bayesian analysis and identified four clusters (north, centre, south, and a single-population cluster). Hierarchical AMOVA showed higher differentiation among clusters than among populations within clusters. The relatively low within-clusters differentiation can be explained by a common postglacial history of nearby populations. The genetic structure found, supported by the few available palaeobotanical records, cannot exclude the hypothesis of two independent A. unedo refugia in western Iberia Peninsula during the Last Glacial Maximum. Based on the results we recommend a conservation strategy by selecting populations for conservation based on their allelic richness and diversity and careful seed transfer consistent with current species’ genetic structure.


Introduction
The contemporary genetic structure of a species offers important information about its responses to past geological and climatic events, which had often played a crucial role in shaping the current distribution range. The spatial distribution of genetic variation is strongly a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 genetic diversity at the regional scale, we sampled 15 natural populations throughout the westernmost part of the IP, and genotyped 451 individuals using chloroplast microsatellite markers (cpSSRs). The chloroplast genome is maternally inherited in the strawberry tree (our unpublished data), reflecting seed flow only. Moreover, this genome has a smaller effective population size, which makes it more susceptible to genetic drift and population or species differentiation [25]. Our specific aims were to: i) investigate the spatial distribution of genetic variation of A. unedo in Portugal, and to ii) provide recommendations for species' conservation and help in provenance region design.

Study species
Arbutus unedo is an evergreen species, a shrub or a small tree up to 8-10 m. The fruit is a round berry (ca. 20 mm wide) and bright red when ripe in autumn. The berries with a fleshy pulp are mainly eaten by seed-disperser birds, which represent the main A. unedo seed dispersal vector [26,27]. The inflorescence is a panicle which appears in autumn, the flowers are insect pollinated [28 and our unpublished data], and crop fruit production starts when the plants are about six to seven years old (F. Gomes, personal communication). The A. unedo plants can easily resprout and survive if damaged (e.g., by fire or grazing) [29]. This species is a marginal woodland plant, unable to endure shading and to successfully compete for light [26]. Arbutus unedo prefers siliceous or decarbonated substrata and can grow on alkaline and relatively acidic soils, but grows satisfactorily under a very wide range of soil conditions, with pH varying from 4 to 7, excluding waterlogged soils [26].

Sample collection
The leaves for DNA extraction were collected in ca. 30  to avoid collecting related individuals and the coordinates for each individual were obtained with a GPS receiver. Our sampling took into account the environmental diversity in different soil and climate types [30,31] in Portugal (Table 1).

Ethics statement
Leaf sampling of A. unedo adult trees was non-destructive. No specific permission was required to sample the species in Portugal. Field studies did not involve endangered or protected species.

DNA extraction, amplification and sequencing
We extracted total DNA from frozen leaves with a Dneasy1 Plant Mini kit (QIAGEN, Hilden, Germany), following the manufacturer's instructions. Six chloroplast microsatellite primers ( Table 2) were designed based on the chloroplast genome [32]. The genome was scanned for cpSSRs loci using the software SciroKo 3.4 [33]. Primers were designed in the flanking regions of the detected SSR repeats using Primer3 v. 0.4.0 [34]. The primer design parameters used were: i) primer length (we selected 18-23 bp, with 20 bp being optimal), ii) PCR product sizes (100-450 bp, by using 60˚C as an optimum melting temperature of the PCR primers), and iii) GC > 40%. Each forward primer was fluorescently labelled with 6-FAM dye (AB PRISM Primer Applied Biosystems). All the individuals were genotyped with the four polymorphic markers: AU1, AU2, AU4, and AU7. Two multiplex PCR were performed using the Type-it1 Microsatellite PCR Kit (Qiagen). Forward and reverse primers for each locus were mixed at equal proportion and used in the multiplex PCR reactions. The first multiplex mix (with AU2 and AU7) was prepared in a 6 μL final reaction volume containing 1 μL of DNA (~10 ng), 3 μL of 2× Type-it mix, 1.4 μL dH2O and 0.6 μL of primer pre-mix 1 (0.1 μM each primer). The second mix (with AU1 and AU4) contained the same amount of total volume, DNA and master mix and 2 μL of primer pre-mix 2 (0.1 μM each AU1 primer and 0.2 μM each AU4 primer). The two multiplexes were amplified with the same PCR profile: an initial activation step of 5 min at 95˚C, followed by 30 cycles of 30 s denaturation at 95˚C, 90 s annealing at 57˚C and 30 s extension at 72˚C, concluding with a final extension for 30 min at 60˚C. The amplified products were mixed with Hi-Di™ Formamide (Applied Biosystems) and GeneScan™ 500 LIZ™ dye size standard (Applied Biosystems) according to standard protocols and separated on an AB 3500 DNA genetic analyzer (Applied Biosystems). The molecular sizes of the fragments were binned using the GeneMar-ker1 v2.4.0 software.

Data analysis
Genetic diversity. The four cpSSR fragments analysed were combined in order to derive the chloroplast haplotype of each individual. An haplotype network, a minimum spanning tree based on the differences in number of repeats between haplotypes, was created using the mst function and visualized through the plot.mst function in the pegas package in R [35,36]. The number of haplotypes (N h ), the unbiased haplotypic diversity based on haplotype frequencies (H e ) [37] and the haplotypic richness A R (number of different haplotypes found when a specific sample size is sampled from that population-in our study, the population size was fixed at 30), were computed for each population using the program CONTRIB v. 1.4 [38]. The effective number of haplotypes (A e ) and the number of private haplotypes (Ph) were calculated in GenAlEx v. 6.501 [39]. CONTRIB also provided a measure of the contribution of each population to the total diversity (CT%) and the total haplotypic richness (CTR%). In both cases, the contributions were split in a first component, due to population diversity, and a second component, due to the differentiation from the remaining populations [38]. Additionally, the average Goldstein genetic distance among individuals within each population was computed according to the stepwise mutation model (SMM) estimator D 2 sh, [40], as modified for fully linked markers [41]. D 2 sh was estimated on the original data and on a transformed dataset (ng = no gaps), where the gap between alleles were eliminated to mitigate the possible impact of indels, by in-house R scripts. The relationship between the estimates of diversity (N h and H e ) and the genetic distances among individuals within populations D 2 sh and D 2 sh ng was tested with Pearson's product-moment correlation in R. Genetic and phylogeographic structure. Bayesian analysis using a spatial clustering model implemented in BAPS 5.4 [42] was used to infer population genetic structure. The use of spatial information increases the power to correctly detect the underlying population structure [43]. The analysis was repeated 10 times for each K (from one to 15) to obtain the optimal number of clusters. The best partition of populations into K clusters was identified as the one with the highest log marginal likelihood.
The extent of the genetic structure was further explored using a non-hierarchical and hierarchical analysis of molecular variance (AMOVA) implemented in the Arlequin 3.5 software [44]. We used the infinite allele mutation model (IAM) considering the distances between haplotypes as the number of different alleles, and the stepwise mutation model (SMM) considering those distances as the number of repeat units for each locus. To perform the SMM analysis, we coded cpSSR data in a binary way, as in Heuertz et al. [45]. Overall, this test estimated the variance components among and within populations. The hierarchical AMOVA (based on BAPS clustering) estimated the partition of the genetic variation among clusters, among populations within clusters and within populations. The significance values were computed by 1,000 permutations.
We checked for the presence of a phylogeographic structure by comparing differentiation of unordered alleles (G ST ) and ordered alleles (N ST ) with the PermutCpSSR 1.2.1 software [46]. N ST is defined as a measure of genetic differentiation among populations when the distance between haplotypes is taken into account, whereas G ST is a measure of genetic differentiation that makes use only of haplotypic frequencies. The significance of the two differentiation statistics was tested with 10,000 random permutations of individuals among populations. To test if the observed N ST value is larger than the G ST , we counted how many permutated N ST values were larger than the observed N ST . Statistical significance may indicate the presence of a phylogeographic structure [46], i.e., closely related haplotypes are more often found in the same population than would be expected by chance. Finally, isolation by distance (IBD) was tested comparing genetic and log-transformed geographic distances between population pairs, using F ST /(1-F ST ) as genetic distance. The statistical significance of the Mantel test correlation coefficient was obtained from 1,000 permutations using Arlequin 3.5.

Genetic diversity
We obtained 15 haplotypes after genotyping 451 individuals with four polymorphic cpSSRs (S1 Table). With only one exception (i.e. population SM), the most frequent haplotype is H10 (Figs 1A and 2 The average haplotypic diversity within populations was 0.40 (Table 3), with populations EC and ON displaying the highest haplotypic diversity (H e ) and effective number of haplotypes (A e ). The population EC showed also the highest values for N h (6). Other populations from the central region (G, AV and PS), had H e values over 0.50 and A e > 2. The populations  Table 1  PG and M were fixed for the most common haplotype and also the population SM showed a very low value of diversity (0.13).
We found no correlation between D 2 sh and genetic diversity estimates ( Table 3). The populations with high values of D 2 sh , such as the B population, did not display high values of genetic diversity (N h and H e ). However, if we consider the genetic distance with no gaps (D 2 sh ng ), a significant correlation was found between the average genetic distance and the estimates of genetic diversity (N h and H e ) ( Table 3). This discrepancy was probably due to the presence of an insertion determining a 9 bp gap between alleles 285 and 294 at locus AU2 (see S1 Table).
The population SM showed the highest contribution to the differentiation of populations, while populations G and ON showed the highest contribution to the total diversity. The populations that gave the highest contribution to the population diversity were AV, ON and EC. The populations PG and M contributed negatively to both components of the total diversity (diversity and divergence components), showing only one haplotype (CT%, Fig 3A). The population EC had the highest contribution to the total allelic richness, followed by populations SM and BV (CTR%, Fig 3B).

Population differentiation and Bayesian cluster analysis
The Bayesian clustering suggests the presence of four major groups of populations: a northern cluster, a central cluster with populations from the coast and inland, a southern cluster, and, finally, one cluster with the outlier population SM (Fig 1B). The hierarchical AMOVA showed a stronger differentiation among BAPS clusters than among populations within clusters (Table 4). Contrary to other clusters, in the southern cluster the haplotype H11 is present (Figs  1 and 3). In the central cluster, the haplotype H6 is shared among all populations. The northern cluster had the lowest haplotypic diversity, with only 4 haplotypes and the PG population was fixed for H10 (Table 3).
Finally, the Mantel tests showed an absence of correlation between genetic and geographic distances with both mutation models and thus a lack of isolation by distance pattern. Considering all the populations, the G ST = 0.29±0.11 was among population differentiation based on unordered alleles and the N ST = 0.28±0.11 was based on ordered alleles. The observed N ST was not significantly different from the observed G ST , showing absence of a phylogeographic structure.

Spatial distribution of genetic diversity
There is evidence about two possible refugial areas for A. unedo in Portugal during the LGM, although it is worth noting that continuous pollen records in the region are missing [47] and that A. unedo is a low pollen producer often underrepresented in the pollen diagrams [48]. Nevertheless, Mateus & Queiroz [49] found palaeobotanical evidence of scrublands dominated by Quercus coccifera L., including A. unedo and other thermophiles shrubs, during the Middle Holocene in the southern region. Santiso et al. [5] also postulated a glacial refugium for the species in North Africa-Atlantic Iberia during the LGM. In addition, fossil charcoal records from central Portugal (Estremadura region) support the presence of Mediterranean taxa, including A. unedo, during the last glacial period [50]. According to these authors, survival of the thermophilous species in central Portugal during the LGM was possible due to the Atlantic Ocean proximity, enough protection from northerly winds and good solar exposure, and average temperatures and precipitations comparable to today. Therefore, it seems likely that most of these species expanded from central Portugal during climate amelioration and that a rapid reforestation occurred after the LGM [51]. In the application of the "refugia within refugia" model to the Iberian Peninsula, the paradigm of Iberia as a single refugium during Pleistocene glacial maxima was challenged, supporting instead the presence of several Iberian refugia for a range of flora and fauna [4]. Thus, combining already available knowledge [12] with our results, which are based on a more intensive genetic sampling of A. unedo in this area, the hypothesis of the presence of different refugia cannot be excluded. Two well-differentiated genetic clusters were found, one in central and another in southern Portugal. In both clusters, there are populations displaying high genetic diversity (e.g. EC and BV in the southern cluster; G, AV, ON, PS and SF in the central cluster) and some private haplotypes (e.g. high-frequency haplotype H6 is present only in the central and northern cluster). Such findings contrast with the hypothesis of a common postglacial origin of A. unedo populations in Portugal.
There are several points about the recent evolutionary history of A. unedo at its westernmost distribution limit that need to be clarified, and for which the use of genetic markers characterized by higher mutation rates (e.g. nuclear microsatellites) would be helpful. We observed an absence of phylogeographic structure and IBD patterns in our study that could be explained by not having considered the genetic structure found in the analysis. Nevertheless, when tests were repeated only on the central cluster, the only one with sufficient data to obtain meaningful results, the lack of phylogeographic structure and IBD was confirmed (results not shown). Overall, this points towards low levels of gene flow among populations probably due to limitations in the seed dispersal by birds. Whether limited gene flow may increase the effect of genetic drift in erasing IBD patterns, depends on overall gene exchanges among populations and, therefore, on both pollen and seed gene flow [52]. This could be disentangled by assessing both the genetic connectivity through pollen at different spatial scales and by the seed-to-pollen gene flow ratio in A. unedo to understand the impact of genetic drift on the spatial distribution of genetic diversity found in the present study. Table 4. Analysis of molecular variance (AMOVA) of the A. unedo populations. Among all populations (a,c) and among the groups defined by the BAPS analysis (b,d) (hierarchical AMOVA). Distances between haplotypes were measured as number of different alleles (IAM) (a,b) and as number of repeat units for each microsatellite for the i-th locus (SMM) (c,d). SS = sum of squared deviations, d.f. = degrees of freedom, and P = probability of obtaining a more extreme value by chance alone. Genetic structure in strawberry tree Another important aspect to be addressed is the effect of contemporary factors on the distribution of genetic variation, with suitable markers. The occurrence of wildfires in the region potentially drives to abrupt changes in population size. Wildfires help in keeping the canopy open, a key factor for A. unedo survival and reproduction that prefers open-canopy environments. However if fires are too frequent or intense, the impact on population genetic diversity can be negative, as it is probably the case for the PG population (suffering on average wildfires with 5.7 years interval, since 1975) [53]. We should take into consideration that between 1975-2005, the total area burnt in Portugal was 3,353,000 ha, equivalent to 38% of the country, with a mean annual percent of burn area of 1.2% [22], yet with large inter-annual variability (see S1 Fig). Indeed, increase in fire frequency is seen as a major threat, since short inter-fire periods may exhaust soil-stored and fire-stimulated seed banks, reducing both the effective population size and genetic diversity [54] and eventually increasing the probability of extinction of forest tree populations.

Design of conservation and provenance regions
In the studied region A. unedo is, currently, undergoing an intense domestication process, with the commercialisation of improved genetic material (clones) to produce new plantations and the introduction of genetic material through seed from unknown origin [11,17]. Domestication together with growing climate change will affect the species' natural resources. Moreover, the use of vegetative propagation could lead to genetic erosion and inbreeding problems [55], this last effect might also affect the production of fruits, which is the interesting economic product.
Populations are not similar in their capacity to adapt to varying environmental conditions and genetic information can ensure a better use of the available resources by maximizing the evolutionary contribution of the set of populations to conserve [38]. Diversity alone, regardless of the way it is measured, is not a sufficient criterion to identify populations that deserve higher priority for conservation, and when different populations are chosen for conservation, redundancy should be avoided. For conservation purposes the uniqueness of a population, in terms of its allelic composition, may be at least as important as its diversity level, as advised by Petit et al. [38]. To select A. unedo candidate populations for conservation, besides the genetic diversity parameters, we used the contribution of each population to the total diversity partitioned into two components. The first was related to the level of diversity of the population and the second to its 'uniqueness', i.e., its divergence from the other populations [38].
We found that the SM population, representing a single-population genetic cluster, was the most differentiated, with a high contribution to divergence, but not to diversity levels. We suggest including this population in a conservation programme for A. unedo in Portugal since rare haplotypes will be conserved from a putative remnant population. The southern cluster EC population should be also considered as a potential gene reserve, due to its high genetic diversity and haplotypic richness, and for harbouring a private haplotype as SM. This population has the highest contribution to the total haplotypic richness in both the differentiation and diversity components. Similarly, the ON and AV populations, from the central cluster, should also be considered in conservation programs. The first population show the second highest values of genetic diversity and number of haplotypes, besides the average genetic distance among individuals, and the later has the highest haplotypic richness and number of private haplotypes in this cluster. Conservation should include the i) possible impact of very frequent wildfires in the genetic diversity, and the ii) impact of the species domestication at the region scale. In this region, maritime pine was a paradigmatic example of domestication impact in the species genetic structure: the use of seeds from unknown origin has completely blurred it and affected the species' forest resources [56], and this should be avoided in the strawberry tree case.
Considering the species' genetic improvement, the breeding population used in improvement program, including the clones used for deployment, should mirror the diversity and the haplotypic richness of the species in the region, and care should be taken to monitor the breeding generations ahead. This could be done by comparing the haplotypic and genetic diversity of the breeding population with the current study populations' haplotypic and genetic diversity and by using additional markers, such as nuclear microsatellites and single nucleotide polymorphisms [57,58].
The information retrieved in the current study will also be helpful to design provenance regions aiming at collecting reproductive material for plantations. Those provenance regions should be built based on: i) genetic clustering results, and ii) ecological attributes, including wildfire information [59]. In order to further capture adaptive information, there would be a need for more genetic information, principally about adaptive traits and the genes underlying these traits important both for species preservation and genetic improvement. Direct information about adaptive variation in controlled environments could be done by installing genetic tests, including populations' representative of the clusters detected in the current study. Provenance tests provide information about geographic patters of adaptive genetic variation, which can further help in defining the gene conservation populations. Additionally, they can direct the delineation of breeding zones and seed transfer guidelines, assuring that well adapted genotypes are deployed in artificial regeneration programs [59,60].