Reduced SNP Panels for Genetic Identification and Introgression Analysis in the Dark Honey Bee (Apis mellifera mellifera)

Beekeeping activities, especially queen trading, have shaped the distribution of honey bee (Apis mellifera) subspecies in Europe and have resulted in extensive introductions of two eastern European C-lineage subspecies (A. m. ligustica and A. m. carnica) into the native range of the M-lineage A. m. mellifera subspecies in Western Europe. As a consequence, replacement and gene flow between native and commercial populations have occurred at varying levels across western European populations. Genetic identification and introgression analysis using molecular markers are important tools for the management and conservation of honey bee subspecies. Previous studies have monitored introgression using microsatellite and PCR-RFLP markers and, most recently, high-density assays using single nucleotide polymorphism (SNP) markers. While the latter are almost prohibitively expensive, the information gained to date can be exploited to create a reduced panel containing the most ancestry-informative markers (AIMs) for those purposes with very little loss of information. The objective of this study was to design reduced panels of AIMs to verify the origin of A. m. mellifera individuals and to provide accurate estimates of the level of C-lineage introgression into their genome. The discriminant power of the SNPs was evaluated using a variety of metrics and approaches, including Weir & Cockerham's FST, an FST-based outlier test, Delta, informativeness (In), and PCA. This study shows that reduced AIMs panels assign individuals to the correct origin and calculate the admixture level with a high degree of accuracy. These panels provide an essential tool in Europe for genetic stock identification and estimation of admixture levels, which can assist management strategies and the monitoring of honey bee conservation programs.


Introduction
The role of introgression and admixture in conservation is a dilemma: while natural admixture may be an important evolutionary force in speciation and maintenance of genetic diversity [1][2], admixture induced by human activities may contribute, either directly or indirectly, to the extinction of many taxa [3]. Introductions of species and subspecies, together with habitat modifications, have increased rates of admixture with native flora and fauna, and the resulting introgression can lead to extinction and to the irretrievable loss of combinations of genotypes throughout the entire genome [4].
The honey bee, Apis mellifera L., represents a valuable model to study human-mediated change. Beekeeping has been practiced in Europe for many centuries [5], which has led to loss of native genetic diversity through three major mechanisms: (i) replacement of native populations by human-selected more docile and productive colonies, (ii) spread of honey bee pests and parasites, such as the mite Varroa destructor and the microsporidian Nosema ceranae, that have contributed to worldwide population declines [6][7], and (iii) recurrent introductions of commercial colonies (reviewed by De la Rúa et al. [8]).
The genetic diversity harbored in native honey bee subspecies is amongst the most important legacies that we can leave to future generations of beekeepers and farmers [9][10]. Native honey bee subspecies are important reservoirs of local adaptations; their extinction means the loss of unique combinations of traits shaped by natural selection over extended periods of time. These combinations can be important for more sustainable beekeeping, as shown by a recent pan-European experiment [11].
In Europe, honey bees show considerable differences in morphological, behavioural and biological characters across their range as a result of historical patterns of isolation and adaptation to environmental conditions [8]. Those differences are materialized in 10 extant European subspecies, among the 30 subspecies currently recognized worldwide [12][13][14][15][16], thereby representing a substantial component of the total honey bee diversity. These 10 European subspecies have been grouped by morphological and molecular tools [12,17,18,19,20,21] into two evolutionary lineages: the M-lineage, in Western Europe, and the C-lineage, in Eastern Europe.
Subspecies-specific genetic footprints can still be identified in Europe [22][23][24][25][26][27][28], in spite of centuries of beekeeping [5], although introgression and admixture events have also been detected in eastern [28][29][30] and western [9,26,31,32] European populations. The M-lineage A. m. mellifera (dark honey bee) has been recognized as the most threatened, with most of the threat due to introgression from the C-lineage [9,31,32]. In addition to the documented intentional replacement of A. m. mellifera by A. m. carnica in Germany [33][34], the increasing trade of commercial breeds (mainly C-lineage A. m. carnica, A. m. ligustica and the hybrid Buckfast) is threatening the genetic integrity of the native A. m. mellifera, as many beekeepers prefer using commercial as opposed to native honey bees.
Increasing awareness that native honey bee diversity represents a valuable asset for sustainable beekeeping is fuelling local breeding and conservation efforts across Europe. One of the earliest, and until recently the single conservation program enacted by law, is that implemented by the Danish Beekeepers Association and the Laesø Beekeepers Association on behalf of the Danish Government in 1993 and the European Union in 1998 [35] to create a reserve and protect the A. m. mellifera. Following approval by the Scottish government of an order to protect the A. m. mellifera on the islands of Colonsay and Oronsay [The Bee Keeping (Colonsay and Oronsay) Order 2013], a second European reserve was recently created in the United Kingdom. Other A. m. mellifera conservation efforts, although not enacted by law, are underway in France, Holland, Norway, Switzerland, Ireland, and Belgium, among others (see the website "http://www.sicamm.org" run by the International Association for the Protection of the European Dark bee). The success or failure of all these efforts will be tightly linked to efforts that monitor the integrity of these protected populations.
Assessing introgression is an important activity in honey bee breeding programs, especially when conservation of native subspecies is a major concern. This activity requires molecular tools that are reliable, inexpensive and preferably automated. Previous studies have monitored introgression between the endemic A. m. mellifera and introduced C-lineage subspecies using microsatellite and PCR-RFLP markers [31,32,36]. With the publication of the honey bee genome [37], the development of single-nucleotide polymorphism (SNP) markers [20,38], and next generation sequencing becoming fast and affordable, particularly for a genome as small as that of the honey bee (236 Mb), increasingly powerful tools are available to measure genomic ancestry and admixture levels in both native and introduced honey bee populations [21,39,40]. However, the genomic approach is not always cost-effective, and low-quality and/or degraded DNA can be a handicap to genomic re-sequencing. Alternatively, ancestry can be estimated using a subset of highly informative SNPs ranging in number from a few dozen to several hundred. The selected SNPs, commonly known as Ancestry-Informative Markers (AIMs), are those that exhibit large allele frequency differences between populations. AIMs can be used for inferring the geographic origin of individuals [41][42][43], detecting illegal trade and translocation of animals [44], authenticating food [45], and estimating overall admixture proportions efficiently and inexpensively [43,46], among other applications. Using a panel of AIMs distributed throughout the genome, it is possible to estimate the relative ancestral proportions in admixed individuals and to infer the time since the admixture process [47][48].
The ability of an AIMs panel to measure ancestry is generally evaluated empirically, by examining its performance on a given set of samples for which ancestry is known [49]. In this paper, we employed five analytical methods to select different combinations of SNPs to form five nested panels of 48-, 96-, 144-, 192- and 384-AIMs optimized to estimate admixture proportions of C-lineage (A. m. ligustica and A. m. carnica) into the M-lineage A. m. mellifera. This was done in two successive stages. In the first stage, we evaluated the performance of the five selection methods [Weir & Cockerham's FST, an FST-based outlier test, Delta, informativeness (In), and PCA] on a training dataset, in an effort to select AIMs and to rank them by decreasing level of informativeness. In the second stage, we tested the power of the five reduced panels and validated their performance on holdout and simulated sets, by comparing the admixture estimates produced by the panels with those produced by an initial dataset of 1183 SNPs.
[9]. Colonies of protected populations have been identified by morphological (B. Dahle, pers. comm.) and molecular tools (mtDNA tRNAleu-cox2 and microsatellites [31,32,50,51]) as the best representatives of A. m. mellifera and have therefore been integrated into conservation programs. To prevent C-lineage introgression and assure pure breeding, these colonies have been maintained on islands or in isolated mating stations. Despite careful management to protect the threatened A. m. mellifera from C-lineage introgression, a recent SNP survey detected variable, although generally low, levels of introgression in these protected populations (see Pinto et al. [9] for details). A reference collection of 36 samples representing C-lineage diversity was obtained from the natural range of A. m. carnica in Serbia (N = 8) and Croatia (N = 11) and from the natural range of A. m. ligustica in Italy (N = 17).
The owners of all the sampled apiaries gave permission to collect honey bee individuals from the hives. In each location, samples were taken from the inner part of hives, placed into absolute ethanol and stored at −20°C until molecular analysis.

Samples, DNA Extraction and SNP Genotyping
Using a phenol/chloroform isoamyl alcohol (25:24:1) protocol [52], total DNA was extracted from the thorax of the 113 individuals, each representing a single colony. A total of 1536 SNP loci were genotyped for those individuals using Illumina's BeadArray Technology and the Illumina GoldenGate Assay with a custom Oligo Pool Assay (Illumina, San Diego, CA, USA) following the manufacturer's protocols. The Oligo Pool consisted of the 1536 SNPs, which included the 768 most informative SNPs of Whitfield et al. [20] and 768 newly developed SNPs employed by Chávez-Galarza et al. [38]. The 1536 SNP array was used previously to study diversity and introgression levels in populations of A. m. mellifera sampled across Western Europe [9] and to detect signatures of selection in the Iberian honey bee genome [38]. Genotype calling was performed using Illumina's GenomeStudio Data Analysis software. Of the initial 1536 SNPs, 353 did not meet the quality criteria for analysis and were therefore excluded from the dataset. The SNP filtering was as follows: 124 exhibited poorly separated intensity clusters or low signal intensity when visualized in the GenomeStudio software; 167 were monomorphic (defined by a cut-off criterion of >0.98 for the most common allele, as in Chávez-Galarza et al. [38]) across all populations; 54 did not map to the honey bee genome assembly Amel_4.0; and 8 hit two different genomic positions (the first with 100% identity and the second with 96-98%) in the honey bee genome assembly Amel_4.0 during the mapping process using the 100 bp flanking sequence. Allele frequencies were calculated for each of the remaining 1183 biallelic SNPs (S1 Table) in each population using the program Plink [53].

Selection of AIMs
Five different methods were employed on the initial 1183 SNP dataset for estimating marker information content. The first method, which has been one of the most popular for selecting informative loci, was the pairwise FST of Weir & Cockerham [54] as calculated at each locus using Genepop software [55]. The second method was the FST-based outlier test developed by Foll & Gaggiotti [56], which employs a Bayesian likelihood approach to detect loci deviating from neutral expectations (outliers). This outlier test was implemented in Bayescan 2.01 [56] using 20 pilot runs of 5 000 iterations (sample size of 5 000 and thinning interval of 10) and an additional burn-in of 50 000 iterations. The third method was based on the estimate of allele-frequency differential (Delta), which is one of the most straightforward ways to evaluate the information content of a SNP. For a bi-allelic marker, like a SNP, the Delta value is estimated as |pAi - pAj|, where pAi and pAj are the frequencies of allele A in the ith and jth populations, respectively. When more than two populations were analyzed, the Delta value for each SNP locus was estimated as the mean across all pairwise comparisons. The fourth method was the informativeness for assignment (In, natural logarithm of the number of populations) proposed by Rosenberg et al. [41]. In provides the amount of information gained about population assignment from observation of a single randomly chosen allele at a locus. This method assumes a uniform prior across K potential source populations for the origin of the allele. For a given set of populations, the minimum value of In (0) occurs when all alleles have equal frequencies in all populations whereas the maximum value (1) occurs when alleles are not shared among populations. In was calculated using the software Infocalc available at http://www.stanford.edu/group/rosenberglab/infocalc.html.
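As an illustration of the two frequency-based metrics described above, the following is a minimal Python sketch, not the Genepop or Infocalc implementation. The function names `delta` and `informativeness` and the `normalize` flag are hypothetical. The In formula follows the definition of Rosenberg et al. [41] for a biallelic locus with a uniform prior over K populations; with natural logarithms its upper bound is ln K, so the optional division by ln K is an assumption made here to express the value on a 0-1 scale.

```python
import math
from itertools import combinations


def delta(freqs):
    """Mean absolute difference in the frequency of allele A over all
    pairs of populations (reduces to |pAi - pAj| for two populations)."""
    pairs = list(combinations(freqs, 2))
    return sum(abs(p - q) for p, q in pairs) / len(pairs)


def informativeness(freqs, normalize=False):
    """Informativeness for assignment (In) of a biallelic SNP.

    freqs: frequency of allele A in each of the K populations.
    normalize: if True, divide by ln(K) so the value lies in [0, 1]
               (an assumption of this sketch, not stated in the source).
    """
    k = len(freqs)
    total = 0.0
    # Loop over the two alleles: A and its complement.
    for allele_freqs in (freqs, [1.0 - p for p in freqs]):
        mean_p = sum(allele_freqs) / k
        if mean_p > 0:
            total -= mean_p * math.log(mean_p)
        for p in allele_freqs:
            if p > 0:  # 0 * ln(0) is taken as 0
                total += (p / k) * math.log(p)
    return total / math.log(k) if normalize else total
```

For two populations fixed for alternative alleles, `delta([1.0, 0.0])` is 1 and `informativeness([1.0, 0.0])` equals ln 2; both metrics are 0 when allele frequencies are identical across populations.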
Finally, the fifth selection method was principal component analysis (PCA), which was performed using the PAST software [57]. The first eight principal components were used to calculate the information content of each SNP following the approach of Paschou et al. [58]. The loadings for each SNP were squared and summed over the eight most significant principal components to produce an estimate of informativeness.
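The PCA-based score described above can be sketched as follows. This is only an illustration that assumes the per-SNP loadings have already been exported from a PCA program (for example as a SNP-by-component table); the function name `pca_informativeness` and the input layout are hypothetical.

```python
def pca_informativeness(loadings, n_components=8):
    """Sum of squared loadings over the first n_components principal
    components, one score per SNP.

    loadings: list of rows, one per SNP; each row holds that SNP's
              loadings on the principal components, ordered from the
              most to the least significant component.
    """
    return [sum(v * v for v in row[:n_components]) for row in loadings]


# Toy loadings for three SNPs on two components (made-up values).
toy = [[0.6, 0.8], [0.1, 0.0], [0.3, 0.4]]
scores = pca_informativeness(toy, n_components=2)
```

SNPs with larger scores contribute more to the retained components and would therefore rank higher under this criterion.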
SNPs were ranked and panels of SNPs tested using reference populations and the Anderson's Simple Training and Holdout method to reduce the potential for upward bias, which is introduced when loci are ranked and assessed using the same individuals [59]. To that end, a total of 34 pure (sensu Soland-Reckeweg et al. [32]) individuals of A. m. mellifera, previously identified in Pinto et al. [9], and all reference individuals (

Ranking of SNPs
The five selection methods were implemented on the four training datasets producing a total of 20 information content values for each of the 1183 SNPs. These values were ranked and analyzed individually and then were averaged in two steps to obtain a single global value per SNP.
In the first step, the information content values were averaged across the four training datasets for each of the five selection methods. In the second step, the information content values produced by each selection method were converted to a 0-1 scale and then averaged to obtain a global score for each of the 1183 SNPs. After standardizing the values produced by the five selection methods, the global ranking was obtained for the 1183 SNPs using the global score. Given that linked loci yield redundant information, and therefore have similar resolving power, markers were excluded if they were within a predefined genetic distance (<1 cM) of higher-ranking selected SNPs. The genetic distance between the remaining SNPs ranged from 1.01 to 24.25 cM with a mean of 4.64 cM. Prior to obtaining the global score for each SNP, pairwise associations between information content values produced by the five methods and between the four training datasets were calculated using Spearman's rank correlation coefficient, in order to compare the five selection methods and examine the effect of clusters of populations.
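The two-step scoring and the linkage filter can be sketched in Python as below. This is a hedged illustration: the text does not say how the 0-1 conversion was done, so min-max scaling is assumed here, and the function names and the `(name, chromosome, position_cM, score)` tuple layout are hypothetical.

```python
def min_max(values):
    """Rescale a list of values to the 0-1 range (assumed scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def global_scores(method_values):
    """Average the rescaled scores of all methods for each SNP.

    method_values: dict mapping method name -> list of per-SNP values,
                   already averaged across the training datasets.
    """
    scaled = [min_max(v) for v in method_values.values()]
    n_snps = len(scaled[0])
    return [sum(s[i] for s in scaled) / len(scaled) for i in range(n_snps)]


def prune_linked(snps, min_cm=1.0):
    """Walk SNPs from highest to lowest score and drop any SNP lying
    closer than min_cm to an already kept SNP on the same chromosome.

    snps: list of (name, chromosome, position_cM, score) tuples.
    """
    kept = []
    for snp in sorted(snps, key=lambda s: s[3], reverse=True):
        _, chrom, pos, _ = snp
        if all(c != chrom or abs(pos - p) >= min_cm for _, c, p, _ in kept):
            kept.append(snp)
    return kept
```

A nested panel of a given size would then simply be the first N entries of the pruned, score-ordered list.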

Panel Testing
Five panels of 48-, 96-, 144-, 192- and 384-SNPs (sets defined by multiplex sizes of commercial assays) were designed from the top-ranked SNPs. These nested panels were tested against a holdout set and a simulated set to obtain the admixture proportions estimated by each SNP panel. The holdout set (113 individuals) consisted of 34 pure individuals plus 43 reserved individuals of A. m. mellifera and the reference A. m. ligustica (17 individuals) and A. m. carnica (19 individuals), as described above. The simulated set (1000 individuals) was generated with the program ONCOR [62] using the function "simulate a single mixture". Ten populations, each with 100 simulated genotypes, were simulated using different levels of introgression (0, 1, 5, 10, 20, 30, 40, 50, 75, and 90%).
Two approaches were used to validate the five reduced AIMs panels. First, a PCA was performed with the SNPs in each AIMs panel on the holdout set using the software PAST to generate two-dimensional PCA plots and to visualize the stability of population assignment produced by the panels. Second, ancestry and admixture were analyzed. Admixture proportions were estimated with the SNPs in each AIMs panel for the holdout and simulated sets using a model-based maximum likelihood estimation of individual ancestries implemented in the software Admixture v1.23 [63]. Coancestry spanning 1-6 populations (K = 1-6, using the default termination criterion that stops the runs when the log-likelihood increases by less than ε = 0.0001 between iterations) was explored for each AIMs panel, and the optimal K was identified as the number of populations producing the lowest cross-validation (CV) error during the clustering analysis.
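Selecting the optimal K from the cross-validation errors can be illustrated with a short Python sketch. It assumes that the CV errors have been collected as log lines of the form `CV error (K=2): 0.42`, which is how the ADMIXTURE program reports them when run with cross-validation enabled; the function name `best_k` and the example values are hypothetical and are not results from this study.

```python
import re

# Pattern for lines such as "CV error (K=2): 0.42".
CV_LINE = re.compile(r"CV error \(K=(\d+)\):\s*([0-9.]+)")


def best_k(log_text):
    """Return (K, cv_error) for the K with the lowest CV error."""
    errors = {int(k): float(e) for k, e in CV_LINE.findall(log_text)}
    if not errors:
        raise ValueError("no CV error lines found")
    k = min(errors, key=errors.get)
    return k, errors[k]


# Illustrative log lines (made-up values).
example_log = """\
CV error (K=1): 0.61
CV error (K=2): 0.42
CV error (K=3): 0.45
"""
chosen = best_k(example_log)
```

The same selection would be repeated for each AIMs panel and for the full SNP dataset.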
The performance of each reduced panel was examined using different approaches. First, the pairwise differences between admixture proportions inferred from the initial 1183 SNP dataset and the five panels were tested using a Mann-Whitney test. Second, the precision of each panel was tested against the initial 1183 SNP dataset by calculating linear regression coefficients (r²) and the standard deviations of the differences between admixture proportions. Finally, the accuracy of the reduced panels was estimated via the percentage of absolute error of admixture estimates obtained with the five panels in relation to the initial 1183 SNP dataset.
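The precision and accuracy summaries can be sketched in Python as follows. This is a minimal illustration under stated assumptions: admixture proportions are on a 0-1 scale, the "percentage of absolute error" is taken to be the mean absolute difference multiplied by 100, and the function name `compare_panels` is hypothetical. The Mann-Whitney test is not reproduced here.

```python
import statistics


def compare_panels(reference, panel):
    """Compare per-individual admixture proportions from a reduced panel
    with those from the reference (full) SNP dataset.

    Returns r2 (squared Pearson correlation), sd_diff (standard deviation
    of the differences) and mean_abs_error_pct (mean absolute error x 100).
    """
    diffs = [p - r for r, p in zip(reference, panel)]
    mean_r = statistics.fmean(reference)
    mean_p = statistics.fmean(panel)
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, panel))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in panel)
    if var_r == 0 or var_p == 0:
        raise ValueError("r2 is undefined when one series is constant")
    return {
        "r2": cov ** 2 / (var_r * var_p),
        "sd_diff": statistics.stdev(diffs),
        "mean_abs_error_pct": 100 * statistics.fmean(abs(d) for d in diffs),
    }


# Made-up proportions for four individuals.
full = [0.00, 0.10, 0.50, 0.90]
reduced = [0.01, 0.10, 0.48, 0.90]
summary = compare_panels(full, reduced)
```

Running the comparison once per panel gives one row of summary statistics for each of the five panel sizes.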

Identification and Ranking of AIMs
The majority of the 1183 SNPs assessed in this study using five selection methods (pairwise Weir & Cockerham's FST, FST-based outlier test, Delta, In and PCA) contain high levels of information content (Fig 1, S2 Table), facilitating the design of reduced panels for genetic identification and introgression analysis in the dark honey bee, A. m. mellifera.
The level of similarity (Spearman's rank correlation, rs) between the different estimates of genetic information content produced by the five selection methods across the four training datasets is shown in Table 1. The highest correlation values were observed for Weir & Cockerham's FST, Delta and In (0.7648 ≤ rs ≤ 0.9985, P<0.001) whereas a moderate correlation was detected between the FST-based outlier test and Weir & Cockerham's FST, Delta and In (0.2864 ≤ rs ≤ 0.6592, P<0.001). The lowest correlations were observed between PCA and the other four methods (-0.2228 ≤ rs ≤ 0.1025, 0.000 ≤ P ≤ 0.9412). Regarding the four training datasets (Fig 2), high correlation values were observed across selection methods (0.7557 ≤ rs ≤ 0.9727, P<0.001).
Using an information content cutoff value ≥0.25, which indicates very great genetic differentiation [64], a total of 627 AIMs were identified by the methods of Weir & Cockerham's FST, Delta, In, and FST-based outlier test. Of these, the top-ranked 384 AIMs were selected using the five methods and the four training datasets. The extent of overlap of the 384 AIMs across the five selection methods and the four training datasets is shown in Fig 2. Overlap between any two methods and across datasets ranged between 382 (Weir & Cockerham's FST and Delta for dataset I; Fig 2A) and 134 (Delta and PCA for dataset III; Fig 2C). The number of AIMs that were simultaneously selected by the five methods was lower, ranging from 82 (dataset I; Fig 2A) to 97 (dataset IV; Fig 2D). A substantially higher amount of overlap (273 AIMs; Fig 2E), supported by high correlation values (rs ≥ 0.7557, P<0.001; Fig 2F), was observed across the four training datasets, suggesting that the different population groupings have a small effect on the AIMs ranking. The global ranking of the 384 AIMs was used to design reduced panels of 192-, 144-, 96- and 48-AIMs that included the SNPs with the highest respective global scores. The performance of these reduced panels was subsequently assessed using the holdout and simulated sets.

Validation of the AIMs Panels
The performance of the five AIMs panels (48-, 96-, 144-, 192-, 384-AIMs) was first validated by using PCA to produce a visual summary of the observed genetic variation carried by the holdout set (Fig 3). The overall diversity pattern is characterized by the presence of two distinct clusters, which coincide with the M and C evolutionary lineages. This pattern was captured by every single AIMs panel, although a greater dispersion was observed for the smaller panels. Additionally, the panels with fewer than 192 AIMs were unable to distinguish the two C-lineage subspecies, A. m. ligustica and A. m. carnica, which were clearly identified by the initial 1183 SNPs and, to a lesser degree, by the 384-AIMs panel.
Ancestry and admixture analyses based on admixture estimates confirm the overall pattern captured by the PCA (S3 Table and S1 Fig). At the optimal K = 2 (inferred by the initial 1183 SNP dataset and the five AIMs panels), the two clusters corresponded to the C and M lineages. However, C-lineage individuals formed a more homogeneous cluster than M-lineage individuals. While membership proportions in the C-lineage cluster were greater than 95% for the five AIMs panels, the M-lineage cluster comprised 13 (384-AIMs and 1183 SNPs), 14 (48- and 192-AIMs) and 15 (96- and 144-AIMs) individuals with membership proportions lower than 85%, a pattern that was already evident in the PCA plots.
In addition to the admixture analyses using the holdout set, the AIMs panels were further validated using a simulated set of 10 different levels of C-lineage introgression (0, 1, 5, 10, 20, 30, 40, 50, 75, and 90%). As for the analyses with the holdout set, the simulated set produced two clusters corresponding to the M and C lineages with no significant differences in admixture proportions between the different AIMs panels and the initial 1183 SNP dataset (Mann-Whitney test, P ≥ 0.2313; S5 Table).

Assignment Precision and Accuracy
The power of the reduced AIMs panels in identifying A. m. mellifera and estimating admixture proportions was evaluated on the holdout set. Estimates of C-lineage introgression into A. m. mellifera inferred from the five panels were highly concordant with those inferred from the initial 1183 SNP dataset, as indicated by the high correlation values (r ≥ 0.997; Fig 4). Despite the high correlations obtained for each comparison, the error rate in admixture estimates, which is very low for all the panels (0.0012-0.0042 with the simulated set and 0.4-1.3 with the holdout set), does increase as the size of the panel decreases (S2 Fig). Nevertheless, the reduced AIMs panels provide good precision in estimating admixture proportions.
As another assessment of the performance of the panels, the accuracy was calculated via absolute error. The success of assignment of the 113 individual genotypes of the holdout set to genetic origin and level of admixture inferred from the different AIMs panels is shown in Fig 5. The average percentage of correct assignment was high, at 98.2, 98.8, 99.0, 99.2 and 99.4% for the 48-, 96-, 144-, 192- and 384-AIMs panels, respectively. The chosen AIMs panels accurately distinguish M/C admixture; these results therefore suggest that a small number of AIMs is sufficient to identify A. m. mellifera and estimate introgression from C-lineage colonies with great accuracy.

Discussion
The recognition that native honey bee genetic diversity is fundamental for sustainable beekeeping and for facing the challenges of a rapidly changing world (e.g. climate change, novel diseases and parasites) is stimulating implementation of conservation programs across Europe in an attempt to recover and protect A. m. mellifera, which is the European honey bee subspecies with the widest natural range [12], and at the same time the most threatened by introgression [9,31]. The need for a reliable, high-throughput, and cost-effective tool for identifying candidate A. m. mellifera colonies targeted for conservation, a crucial step when managing conservatoires, motivated the design of reduced AIMs panels containing the most informative SNPs to verify ancestry and introgression from C-lineage subspecies. In this study we developed, validated and tested the first reduced AIMs panels for honey bees. Our results provide strong confidence in a panel of 384 AIMs and show that even smaller subsets of 192-, 144-, 96- and 48-AIMs are able to identify ancestry and estimate introgression with great accuracy. These reduced panels promise to be a useful tool for routine identification of A. m. mellifera colonies maintained in the breeding populations of conservation programs.
The AIMs included in the five reduced panels were simultaneously selected by pairwise Weir & Cockerham's FST, FST-based outlier test, Delta, In and PCA, in order to balance out the limitations of each individual method [41,58,65]. These selection methods have proved to be powerful, although with varying performances, in identifying population informative markers in a wide range of organisms [43,58,60,61,65]. A great extent of overlap of top-ranked AIMs was obtained for the five selection methods, especially for pairwise Weir & Cockerham's FST, Delta, and In, suggesting that they capture the same information. Nonetheless, the smaller panels (48-, 96-, 144-, 192-AIMs) did not necessarily include all AIMs simultaneously detected by the five methods, as the global ranking depended on the average score. High pairwise correlation values were obtained for Weir & Cockerham's FST, Delta and In but not for PCA, as found by Wilkinson et al. [65]. PCA has been recommended for ranking markers because it has the advantage of generating an overall estimate for a single SNP locus, whereas the other methods require an average to be estimated from pairwise calculations when the number of populations is greater than two [58].
The five reduced panels tested with the holdout and simulated sets performed virtually as well as the initial 1183 SNP dataset, as revealed by the strong correlations obtained between admixture estimates and the low associated error rates. The assignment power was high across the five panels, with average values of correct assignment varying between 98.2 and 99.4%, although the accuracy decreased slightly with decreasing panel size. Nonetheless, even the 48-AIMs panel exhibited high accuracy levels, which is not surprising as it includes the AIMs with the greatest resolution power. Studies on other organisms have also found good performances with panels of similar sizes [43,45,60,65], detecting sharp drops in accuracy when fewer than 25 SNPs were used [45,60].
Evaluation of different combinations of the focal A. m. mellifera and the two most common sources of foreign genes, A. m. ligustica and A. m. carnica, revealed a negligible effect of population groupings on the AIMs ranking. These results suggest that the designed panels are suited for identifying and assessing introgression of A. m. ligustica, A. m. carnica or both into A. m. mellifera. While these panels will possibly perform well in the presence of other C-lineage subspecies, more complex combinations that include sources from different evolutionary lineages will require further testing and, most likely, new panels developed from broader baseline datasets. Additionally, it should be noted that these reduced panels are not suitable for standard population genetic analyses, including determining allelic diversity or measuring isolation by distance, genetic drift or bottleneck effects. The bias introduced through selection for markers that segregate among target populations would seriously compromise these calculations [66][67].
Ancestry identification of honey bee subspecies is undergoing steady development (reviewed by Meixner et al. [68]) from classical morphometry, analysis of allozymes, mitochondrial DNA, nuclear microsatellites, and now SNP tools. Because researchers must balance the cost of genotyping many samples versus many loci, herein we developed five nested reduced panels that include AIMs with the highest resolution power for discriminating subspecies of the divergent M and C evolutionary lineages. While the 384-AIMs panel is also capable of discriminating the C-lineage A. m. ligustica and A. m. carnica, for estimating C-lineage introgression into A. m. mellifera we recommend using the 96-AIMs panel because it is accurate and because high-throughput 96-plex genotyping assays can be outsourced at an affordable cost ($8 900 for 480 samples), representing a saving of 92.4% when compared with the 1536-plex assay ($116 800 for 480 samples).
In conclusion, the proposed AIMs panels can be actively used as a tool in conservation management of A. m. mellifera populations that suffer from hybridization and introgression with the most commonly introduced and beekeepers' preferred A. m. ligustica and A. m. carnica subspecies. This can be an important advance because the current European regulation on organic beekeeping states that "preference shall be given to the use of European breeds of Apis mellifera and their local ecotypes" and several conservation programs have been undertaken in Europe (reviewed by De la Rúa et al. [8]). The use of these panels will apply well to monitoring, management and conservation programs of A. m. mellifera in Western Europe, which usually require high-sample throughput, and will be a resource for the honey bee community to obtain accurate genetic information at reduced costs.

Table. Information content values of the initial 1183 SNP dataset estimated by the five selection methods (Weir & Cockerham's FST, Delta, informativeness (In), PCA and the FST-based outlier test) and for the four training datasets (I to IV). The SNPs are ordered from high to low information content. The top 48, 96, 144, 192 and 384 SNPs were included in the five reduced panels. SNPs marked with an asterisk (*) were excluded from the reduced panels because they were within a genetic distance <1 cM of other informative SNPs. (DOCX)
S3 Table. Admixture proportion estimates inferred from the five AIMs panels (48-, 96-, 144-, 192-, 384-AIMs) and the initial 1183 SNP dataset for the holdout set. The holdout set consisted of 34 pure (training set) and 43 reserved individuals of A. m. mellifera and all reference individuals of A. m. ligustica (17) and A. m. carnica (19). Samples marked with an asterisk (*) are of A. m. mellifera from protected populations (pure breeding for conservation purposes; see Pinto et al. 2014 [9] for details).
Table. P-values of Mann-Whitney pairwise several-sample-test. Values obtained from comparing admixture proportions inferred from the five AIMs panels and the 1183 initial SNP dataset using the simulated set. The simulated set was generated with the program ONCOR (Kalinowski et al. 2007) using the function "simulate a single mixture". Ten populations, each with 100 genotypes, were simulated using different levels of C-lineage introgression (0, 1, 5, 10, 20, 30, 40, 50, 75, and 90%). (DOCX)